## Text Preprocessing with Python

We preprocess Twitter data so that the cleaned text can later be used
for a classification task. The dataset contains negative (label = 1)
and neutral (label = 0) tweets. The same steps are applied to both train_df and test_df.

In [293]:
import pandas as pd
import numpy as np
import re

In [294]:
train_df = pd.read_csv('/Users/admin/Desktop/GeekBrains/NLP/1-Введение в предобработку языка/hw1_actual/train_tweets.csv')
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [295]:
test_df = pd.read_csv('/Users/admin/Desktop/GeekBrains/NLP/1-Введение в предобработку языка/hw1_actual/test_tweets.csv')
test_df.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


### 1. Remove @user from all tweets using the pattern "@[\w]*".
To do this, we write a function:

● to find all occurrences of the pattern in the text, use
re.findall(pattern, input_txt)

● to replace @user with a space, use re.sub()

In [296]:
result = re.findall(r'@[\w]*', train_df['tweet'][0])
result

['@user']
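
The difference between a class of word characters and a class of specific letters is easy to check on a made-up string. The sketch below is only an illustration: the sample text and the variable names are assumptions, not part of the dataset.

```python
import re

# Made-up text with several kinds of handles (not taken from the dataset).
sample = "@user thanks @alice_01 and @sure_thing!"

mention_any = re.compile(r"@[\w]*")    # '@' followed by any word characters
mention_set = re.compile(r"@[user]*")  # '@' followed only by the letters u, s, e, r

found_any = mention_any.findall(sample)
found_set = mention_set.findall(sample)
cleaned = mention_any.sub(" ", sample)

print(found_any)  # ['@user', '@alice_01', '@sure_thing']
print(found_set)  # ['@user', '@', '@sure']
print(cleaned.split())
```

In this dataset every handle is already anonymized as @user, so both patterns give the same result here; the `\w` version is the one that matches the task statement.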

In [297]:
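regex = re.compile(r'@[\w]*')
def del_user(text, regex=regex):
    # Replace every @mention with a space and lowercase the tweet (step 2).
    if not isinstance(text, str):
        return ''  # keep the column string-typed for the later steps
    return regex.sub(' ', text.lower())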

In [298]:
print(train_df.tweet[0])

 @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run


In [299]:
print(del_user(train_df.tweet[0]))

   when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run


In [300]:
train_df['tweet'] = train_df['tweet'].apply(del_user)
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so se...
1,2,0,thanks for #lyft credit i can't use cause ...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [301]:
test_df['tweet'] = test_df['tweet'].apply(del_user)
test_df.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,#white #supremacists want everyone to see t...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


### 2. Convert the tweets to lowercase with .lower().

In [302]:
# already done above, inside del_user

### 3. Replace contractions with apostrophes (e.g. ain't, can't) with their full forms using apostrophe_dict.
To do this, write a function that checks every word in the text (for word in text.split()):
if the word is a key of apostrophe_dict (the contracted form), replace it with the
corresponding value (the full form).

In [303]:
apostrophe_dict = {"ain't": "am not", "can't":"can not", "isn't": "is not", "don't": "do not", 
                 "doesn't": "does not", "won't": "will not",
                  "aren't": "are not", "wasn't": "was not", "didn't": "did not", "i'm":"i am",  # tweets are already lowercased
                  "hasn't": "has not"}

In [304]:
def replace_apostrophe(text):
    return " ".join(apostrophe_dict.get(x, x) for x in text.split())

In [305]:
train_df['tweet'][1].split()

['thanks',
 'for',
 '#lyft',
 'credit',
 'i',
 "can't",
 'use',
 'cause',
 'they',
 "don't",
 'offer',
 'wheelchair',
 'vans',
 'in',
 'pdx.',
 '#disapointed',
 '#getthanked']

In [306]:
# check
replace_apostrophe(train_df['tweet'][1])

'thanks for #lyft credit i can not use cause they do not offer wheelchair vans in pdx. #disapointed #getthanked'
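
One limitation of the split-based lookup is that a contraction followed by punctuation (for example `can't,`) is a different token and is not expanded. The sketch below compares it with a regex-based variant on a made-up sentence; the function names and the two-entry dictionary are assumptions chosen for illustration.

```python
import re

# Small illustrative subset of apostrophe_dict.
contractions = {"can't": "can not", "don't": "do not"}

def expand_by_split(text):
    # Same idea as replace_apostrophe: exact match on whitespace-separated tokens.
    return " ".join(contractions.get(w, w) for w in text.split())

# One alternation of all keys, anchored on word boundaries.
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in contractions) + r")\b")

def expand_by_regex(text):
    # Replace each matched contraction with its full form.
    return pattern.sub(lambda m: contractions[m.group(0)], text)

sample = "i can't, they don't."
print(expand_by_split(sample))  # i can't, they don't.
print(expand_by_regex(sample))  # i can not, they do not.
```

Because punctuation is removed later in the pipeline anyway, the split-based version may be good enough; the regex variant matters only if contractions next to punctuation should be expanded first.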

In [307]:
train_df['tweet'] = train_df['tweet'].apply(replace_apostrophe)
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so selfi...
1,2,0,thanks for #lyft credit i can not use cause th...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ur...
4,5,0,factsguide: society now #motivation


In [308]:
test_df['tweet'] = test_df['tweet'].apply(replace_apostrophe)
test_df.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,#white #supremacists want everyone to see the ...
2,31965,safe ways to heal your #acne!! #altwaystoheal ...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew e..."


### 4. Replace abbreviations with their full forms using short_word_dict.
We reuse the approach from the previous step.

In [309]:
short_word_dict = {"some1":"someone", "any1":"anyone", "be4":"before", "2day": "today", "4u":"for you", "gr8":"great",
                   "w8":"wait", "2u":"to you", "u2":"you too", "asap":"as soon as possible", "pls": "please",
                  "plz": "please", "thx":"thanks", "lol":"laughing out loud", "rofl":"rolling on the floor", "idk":"i do not know",
                  "diky":"do I know you"}

In [310]:
def replace_short_word(text):
    return " ".join(short_word_dict.get(x, x) for x in text.split())
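
Since the lookup logic is the same for both dictionaries, it could be written once with the dictionary passed in as a parameter. This is a minimal sketch; `replace_from_dict` and the two-entry dictionary are hypothetical names introduced here.

```python
def replace_from_dict(text, mapping):
    """Replace each whitespace-separated token that is a key of `mapping`."""
    return " ".join(mapping.get(token, token) for token in text.split())

# Illustrative subset of short_word_dict.
short_words = {"gr8": "great", "thx": "thanks"}

result = replace_from_dict("thx that was gr8", short_words)
print(result)  # thanks that was great
```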

In [311]:
train_df['tweet'] = train_df['tweet'].apply(replace_short_word)
train_df.head(3)

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so selfi...
1,2,0,thanks for #lyft credit i can not use cause th...
2,3,0,bihday your majesty


In [312]:
test_df['tweet'] = test_df['tweet'].apply(replace_short_word)
test_df.head(3)

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,#white #supremacists want everyone to see the ...
2,31965,safe ways to heal your #acne!! #altwaystoheal ...


### 5. Remove emoticons (e.g. ":)" = "happy") using emoticon_dict.
The dictionary keys are regular-expression fragments, so they are joined into a single pattern and every match is deleted with re.sub().

In [313]:
# reference : https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py

emoticon_dict = {
    u":‑\)":"Happy face or smiley",
    u":\)":"Happy face or smiley",
    u":-\]":"Happy face or smiley",
    u":\]":"Happy face or smiley",
    u":-3":"Happy face smiley",
    u":3":"Happy face smiley",
    u":->":"Happy face smiley",
    u":>":"Happy face smiley",
    u"8-\)":"Happy face smiley",
    u":o\)":"Happy face smiley",
    u":-\}":"Happy face smiley",
    u":\}":"Happy face smiley",
    u":-\)":"Happy face smiley",
    u":c\)":"Happy face smiley",
    u":\^\)":"Happy face smiley",
    u"=\]":"Happy face smiley",
    u"=\)":"Happy face smiley",
    u":‑D":"Laughing, big grin or laugh with glasses",
    u":D":"Laughing, big grin or laugh with glasses",
    u"8‑D":"Laughing, big grin or laugh with glasses",
    u"8D":"Laughing, big grin or laugh with glasses",
    u"X‑D":"Laughing, big grin or laugh with glasses",
    u"XD":"Laughing, big grin or laugh with glasses",
    u"=D":"Laughing, big grin or laugh with glasses",
    u"=3":"Laughing, big grin or laugh with glasses",
    u"B\^D":"Laughing, big grin or laugh with glasses",
    u":-\)\)":"Very happy",
    u":‑\(":"Frown, sad, andry or pouting",
    u":-\(":"Frown, sad, andry or pouting",
    u":\(":"Frown, sad, andry or pouting",
    u":‑c":"Frown, sad, andry or pouting",
    u":c":"Frown, sad, andry or pouting",
    u":‑<":"Frown, sad, andry or pouting",
    u":<":"Frown, sad, andry or pouting",
    u":‑\[":"Frown, sad, andry or pouting",
    u":\[":"Frown, sad, andry or pouting",
    u":-\|\|":"Frown, sad, andry or pouting",
    u">:\[":"Frown, sad, andry or pouting",
    u":\{":"Frown, sad, andry or pouting",
    u":@":"Frown, sad, andry or pouting",
    u">:\(":"Frown, sad, andry or pouting",
    u":'‑\(":"Crying",
    u":'\(":"Crying",
    u":'‑\)":"Tears of happiness",
    u":'\)":"Tears of happiness",
    u"D‑':":"Horror",
    u"D:<":"Disgust",
    u"D:":"Sadness",
    u"D8":"Great dismay",
    u"D;":"Great dismay",
    u"D=":"Great dismay",
    u"DX":"Great dismay",
    u":‑O":"Surprise",
    u":O":"Surprise",
    u":‑o":"Surprise",
    u":o":"Surprise",
    u":-0":"Shock",
    u"8‑0":"Yawn",
    u">:O":"Yawn",
    u":-\*":"Kiss",
    u":\*":"Kiss",
    u":X":"Kiss",
    u";‑\)":"Wink or smirk",
    u";\)":"Wink or smirk",
    u"\*-\)":"Wink or smirk",
    u"\*\)":"Wink or smirk",
    u";‑\]":"Wink or smirk",
    u";\]":"Wink or smirk",
    u";\^\)":"Wink or smirk",
    u":‑,":"Wink or smirk",
    u";D":"Wink or smirk",
    u":‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"X‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"XP":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"d:":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"=p":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u">:P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":-[.]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":S":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":‑\|":"Straight face",
    u":\|":"Straight face",
    u":\$":"Embarrassed or blushing",
    u":‑x":"Sealed lips or wearing braces or tongue-tied",
    u":x":"Sealed lips or wearing braces or tongue-tied",
    u":‑#":"Sealed lips or wearing braces or tongue-tied",
    u":#":"Sealed lips or wearing braces or tongue-tied",
    u":‑&":"Sealed lips or wearing braces or tongue-tied",
    u":&":"Sealed lips or wearing braces or tongue-tied",
    u"O:‑\)":"Angel, saint or innocent",
    u"O:\)":"Angel, saint or innocent",
    u"0:‑3":"Angel, saint or innocent",
    u"0:3":"Angel, saint or innocent",
    u"0:‑\)":"Angel, saint or innocent",
    u"0:\)":"Angel, saint or innocent",
    u":‑b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"0;\^\)":"Angel, saint or innocent",
    u">:‑\)":"Evil or devilish",
    u">:\)":"Evil or devilish",
    u"\}:‑\)":"Evil or devilish",
    u"\}:\)":"Evil or devilish",
    u"3:‑\)":"Evil or devilish",
    u"3:\)":"Evil or devilish",
    u">;\)":"Evil or devilish",
    u"\|;‑\)":"Cool",
    u"\|‑O":"Bored",
    u":‑J":"Tongue-in-cheek",
    u"#‑\)":"Party all night",
    u"%‑\)":"Drunk or confused",
    u"%\)":"Drunk or confused",
    u":-###..":"Being sick",
    u":###..":"Being sick",
    u"<:‑\|":"Dump",
    u"\(>_<\)":"Troubled",
    u"\(>_<\)>":"Troubled",
    u"\(';'\)":"Baby",
    u"\(\^\^>``":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(\^_\^;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(~_~;\) \(・\.・;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-\)zzz":"Sleeping",
    u"\(\^_-\)":"Wink",
    u"\(\(\+_\+\)\)":"Confused",
    u"\(\+o\+\)":"Confused",
    u"\(o\|o\)":"Ultraman",
    u"\^_\^":"Joyful",
    u"\(\^_\^\)/":"Joyful",
    u"\(\^O\^\)／":"Joyful",
    u"\(\^o\^\)／":"Joyful",
    u"\(__\)":"Kowtow as a sign of respect, or dogeza for apology",
    u"_\(\._\.\)_":"Kowtow as a sign of respect, or dogeza for apology",
    u"<\(_ _\)>":"Kowtow as a sign of respect, or dogeza for apology",
    u"<m\(__\)m>":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(__\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(_ _\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"\('_'\)":"Sad or Crying",
    u"\(/_;\)":"Sad or Crying",
    u"\(T_T\) \(;_;\)":"Sad or Crying",
    u"\(;_;":"Sad of Crying",
    u"\(;_:\)":"Sad or Crying",
    u"\(;O;\)":"Sad or Crying",
    u"\(:_;\)":"Sad or Crying",
    u"\(ToT\)":"Sad or Crying",
    u";_;":"Sad or Crying",
    u";-;":"Sad or Crying",
    u";n;":"Sad or Crying",
    u";;":"Sad or Crying",
    u"Q\.Q":"Sad or Crying",
    u"T\.T":"Sad or Crying",
    u"QQ":"Sad or Crying",
    u"Q_Q":"Sad or Crying",
    u"\(-\.-\)":"Shame",
    u"\(-_-\)":"Shame",
    u"\(一一\)":"Shame",
    u"\(；一_一\)":"Shame",
    u"\(=_=\)":"Tired",
    u"\(=\^\·\^=\)":"cat",
    u"\(=\^\·\·\^=\)":"cat",
    u"=_\^=	":"cat",
    u"\(\.\.\)":"Looking down",
    u"\(\._\.\)":"Looking down",
    u"\^m\^":"Giggling with hand covering mouth",
    u"\(\・\・\?":"Confusion",
    u"\(\?_\?\)":"Confusion",
    u">\^_\^<":"Normal Laugh",
    u"<\^!\^>":"Normal Laugh",
    u"\^/\^":"Normal Laugh",
    u"\（\*\^_\^\*）" :"Normal Laugh",
    u"\(\^<\^\) \(\^\.\^\)":"Normal Laugh",
    u"\(^\^\)":"Normal Laugh",
    u"\(\^\.\^\)":"Normal Laugh",
    u"\(\^_\^\.\)":"Normal Laugh",
    u"\(\^_\^\)":"Normal Laugh",
    u"\(\^\^\)":"Normal Laugh",
    u"\(\^J\^\)":"Normal Laugh",
    u"\(\*\^\.\^\*\)":"Normal Laugh",
    u"\(\^—\^\）":"Normal Laugh",
    u"\(#\^\.\^#\)":"Normal Laugh",
    u"\（\^—\^\）":"Waving",
    u"\(;_;\)/~~~":"Waving",
    u"\(\^\.\^\)/~~~":"Waving",
    u"\(-_-\)/~~~ \($\·\·\)/~~~":"Waving",
    u"\(T_T\)/~~~":"Waving",
    u"\(ToT\)/~~~":"Waving",
    u"\(\*\^0\^\*\)":"Excited",
    u"\(\*_\*\)":"Amazed",
    u"\(\*_\*;":"Amazed",
    u"\(\+_\+\) \(@_@\)":"Amazed",
    u"\(\*\^\^\)v":"Laughing,Cheerful",
    u"\(\^_\^\)v":"Laughing,Cheerful",
    u"\(\(d[-_-]b\)\)":"Headphones,Listening to music",
    u'\(-"-\)':"Worried",
    u"\(ーー;\)":"Worried",
    u"\(\^0_0\^\)":"Eyeglasses",
    u"\(\＾ｖ\＾\)":"Happy",
    u"\(\＾ｕ\＾\)":"Happy",
    u"\(\^\)o\(\^\)":"Happy",
    u"\(\^O\^\)":"Happy",
    u"\(\^o\^\)":"Happy",
    u"\)\^o\^\(":"Happy",
    u":O o_O":"Surprised",
    u"o_0":"Surprised",
    u"o\.O":"Surpised",
    u"\(o\.o\)":"Surprised",
    u"oO":"Surprised",
    u"\(\*￣m￣\)":"Dissatisfied",
    u"\(‘A`\)":"Snubbed or Deflated"
}


In [314]:
# build the pattern once; longer keys go first so that ":-))" is tried before ":-)"
emoticon_pattern = re.compile(u'(' + u'|'.join(sorted(emoticon_dict, key=len, reverse=True)) + u')')

# defining a function which can remove emoticons from our dataframe
def remove_emoticons(text):
    return emoticon_pattern.sub(r'', text)

# let's check if it is working or not
print(remove_emoticons("Hello :-)"))

Hello 
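
If the emoticons are stored as plain strings rather than regex fragments, `re.escape` can build the pattern safely, and the same pattern can either blank the emoticon out or substitute its description. The sketch below assumes a three-entry dictionary and hypothetical function names; it is not the emot library's API.

```python
import re

# Plain-string emoticons (illustrative subset, not regex fragments).
emoticons = {":-)": "happy", ":-))": "very happy", ":(": "sad"}

# Escape every key and try longer emoticons first.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(emoticons, key=len, reverse=True))
)

def replace_emoticons(text, repl=" "):
    # Replace each emoticon with a fixed string (a space by default).
    return pattern.sub(repl, text)

def describe_emoticons(text):
    # Replace each emoticon with its textual description.
    return pattern.sub(lambda m: emoticons[m.group(0)], text)

print(describe_emoticons("so good :-)) but :("))  # so good very happy but sad
print(replace_emoticons("hi :-) there").split())  # ['hi', 'there']
```

Keeping the description instead of deleting the emoticon preserves a sentiment signal, which may help the downstream classifier.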


In [315]:
train_df['tweet'] = train_df['tweet'].apply(remove_emoticons)
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so selfi...
1,2,0,thanks for #lyft credit i can not use cause th...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ur...
4,5,0,factsguide: society now #motivation


In [316]:
test_df['tweet'] = test_df['tweet'].apply(remove_emoticons)
test_df.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,#white #supremacists want everyone to see the ...
2,31965,safe ways to heal your #acne!! #altwaystoheal ...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew e..."


In [317]:
# function for removing emoji

def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; this wide range also spans CJK text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

# we will test this function on 2 sample sentences before applying this on our dataframe 
sent1 = "game is on 🔥🔥"
sent2 = "Hilarious😂"

print(remove_emoji(sent1))
print(remove_emoji(sent2))

game is on 
Hilarious
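
The range `\U000024C2-\U0001F251` is much wider than the emoji blocks around it, so it also deletes ordinary text in scripts such as Chinese or Japanese. The sketch below checks this on a made-up string; the names `wide` and `narrow` are assumptions for illustration.

```python
import re

# The single wide range from the pattern above.
wide = re.compile("[\U000024C2-\U0001F251]+")
# Only two of the dedicated emoji blocks.
narrow = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F]+")

sample = "日本語 🔥"

wide_result = wide.sub("", sample)      # removes the CJK characters, keeps the emoji
narrow_result = narrow.sub("", sample)  # removes the emoji, keeps the CJK characters

print(repr(wide_result))
print(repr(narrow_result))
```

For this dataset the effect is small because every non-Latin character is meant to be dropped in the later steps, but the distinction matters if the function is reused on multilingual text.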


In [318]:
train_df['tweet'] = train_df['tweet'].apply(remove_emoji)
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so selfi...
1,2,0,thanks for #lyft credit i can not use cause th...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ur...
4,5,0,factsguide: society now #motivation


In [319]:
test_df['tweet'] = test_df['tweet'].apply(remove_emoji)
test_df.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,#white #supremacists want everyone to see the ...
2,31965,safe ways to heal your #acne!! #altwaystoheal ...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew e..."


### 6. Replace punctuation with spaces using re.sub() and the pattern r'[^\w\s]'.

In [320]:
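regex = re.compile(r'[^\w\s]')
def del_func(text, regex=regex):
    # Replace every character matched by `regex` with a space.
    # The default pattern is bound once, when the function is defined.
    if not isinstance(text, str):
        return ''  # keep the column string-typed for the later steps
    return regex.sub(' ', text)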

In [321]:
print(train_df.tweet[0])

when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run


In [322]:
print(del_func(train_df.tweet[0]))

when a father is dysfunctional and is so selfish he drags his kids into his dysfunction   run


In [323]:
train_df['tweet'] = train_df['tweet'].apply(del_func)
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so selfi...
1,2,0,thanks for lyft credit i can not use cause th...
2,3,0,bihday your majesty
3,4,0,model i love u take with u all the time in ur...
4,5,0,factsguide society now motivation


In [324]:
test_df['tweet'] = test_df['tweet'].apply(del_func)
test_df.head()

Unnamed: 0,id,tweet
0,31963,studiolife aislife requires passion dedic...
1,31964,white supremacists want everyone to see the ...
2,31965,safe ways to heal your acne altwaystoheal ...
3,31966,is the hp and the cursed child book up for res...
4,31967,3rd bihday to my amazing hilarious nephew e...


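### 7. Replace special characters with spaces using re.sub() and the pattern r'[^a-zA-Z0-9]'.

Note that del_func bound the step 6 pattern as its default argument when it was defined, so reassigning regex below does not change what apply(del_func) does. The new pattern takes effect only if it is passed explicitly, as in del_func(text, regex). This is why non-ASCII characters such as "ð" in "urð" still appear in later outputs.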

In [325]:
regex = re.compile('[^a-zA-Z0-9]')

In [326]:
train_df['tweet'] = train_df['tweet'].apply(del_func)
train_df.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so selfi...
1,2,0,thanks for lyft credit i can not use cause th...
2,3,0,bihday your majesty
3,4,0,model i love u take with u all the time in ur...
4,5,0,factsguide society now motivation


In [327]:
test_df['tweet'] = test_df['tweet'].apply(del_func)
test_df.head()

Unnamed: 0,id,tweet
0,31963,studiolife aislife requires passion dedic...
1,31964,white supremacists want everyone to see the ...
2,31965,safe ways to heal your acne altwaystoheal ...
3,31966,is the hp and the cursed child book up for res...
4,31967,3rd bihday to my amazing hilarious nephew e...
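
The default-argument behaviour behind steps 6 to 8 can be reproduced in a few lines. The sketch below uses made-up input and hypothetical function names (`del_func_default`, `replace_with_space`); it shows that rebinding the global name has no effect on the default, while passing the pattern explicitly does what the step intends.

```python
import re

regex = re.compile(r"[^\w\s]")

def del_func_default(text, regex=regex):
    # The default value is evaluated once, when the function is defined.
    return regex.sub(" ", text)

# Rebinding the global name does not reach the default argument.
regex = re.compile(r"[^a-zA-Z]")
unchanged = del_func_default("3rd urð")

def replace_with_space(text, pattern):
    # The pattern is always supplied by the caller.
    return pattern.sub(" ", text)

explicit = replace_with_space("3rd urð", regex)

print(repr(unchanged))  # '3rd urð'
print(repr(explicit))   # ' rd ur '
```

With pandas, the pattern can be forwarded as a keyword, for example `train_df['tweet'].apply(replace_with_space, pattern=regex)`, since `Series.apply` passes extra keyword arguments on to the function.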


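### 8. Replace digits with spaces using re.sub() and the pattern r'[^a-zA-Z]'.

Reassigning regex does not affect the default argument that del_func captured in step 6, so the pattern has to be passed explicitly (del_func(text, regex)) for digits to be removed. With the cells as written, tokens such as "3rd" survive in the later outputs.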

In [328]:
regex = re.compile('[^a-zA-Z]')

In [329]:
train_df['tweet'] = train_df['tweet'].apply(del_func)
train_df.head(3)

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so selfi...
1,2,0,thanks for lyft credit i can not use cause th...
2,3,0,bihday your majesty


In [330]:
test_df['tweet'] = test_df['tweet'].apply(del_func)
test_df.head(3)

Unnamed: 0,id,tweet
0,31963,studiolife aislife requires passion dedic...
1,31964,white supremacists want everyone to see the ...
2,31965,safe ways to heal your acne altwaystoheal ...


### 9. Remove one-character words from the text,
using ' '.join([w for w in x.split() if len(w)>1]).

In [331]:
def del_short_word(text):
    return " ".join([w for w in text.split() if len(w)>1])

In [332]:
train_df['tweet'] = train_df['tweet'].apply(del_short_word)
train_df.head(3)

Unnamed: 0,id,label,tweet
0,1,0,when father is dysfunctional and is so selfish...
1,2,0,thanks for lyft credit can not use cause they ...
2,3,0,bihday your majesty


In [333]:
test_df['tweet'] = test_df['tweet'].apply(del_short_word)
test_df.head(3)

Unnamed: 0,id,tweet
0,31963,studiolife aislife requires passion dedication...
1,31964,white supremacists want everyone to see the ne...
2,31965,safe ways to heal your acne altwaystoheal heal...
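
Steps 1 to 9 can also be chained into a single function, which makes the order of operations explicit and lets the whole cleaning be applied in one pass. This is a simplified sketch under assumptions: `clean_tweet`, the constant names, and the three-entry dictionary are hypothetical, and the emoticon and emoji steps are left out.

```python
import re

MENTION = re.compile(r"@[\w]*")
NON_ALPHA = re.compile(r"[^a-zA-Z]")
# Illustrative subset of apostrophe_dict and short_word_dict.
EXPANSIONS = {"can't": "can not", "thx": "thanks", "w8": "wait"}

def clean_tweet(text):
    text = MENTION.sub(" ", text.lower())                        # steps 1-2
    text = " ".join(EXPANSIONS.get(w, w) for w in text.split())  # steps 3-4
    text = NON_ALPHA.sub(" ", text)                              # steps 6-8
    return " ".join(w for w in text.split() if len(w) > 1)       # step 9

result = clean_tweet("@user Thx I can't w8 4 the #party!!")
print(result)  # thanks can not wait the party
```

The dictionary lookups have to run before punctuation and digits are stripped, otherwise keys such as `can't` or `w8` would no longer match.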


### 10. Split the tweets into tokens with nltk.tokenize.word_tokenize,
storing the result in a new column 'tweet_token'.

In [334]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, wordpunct_tokenize

[nltk_data] Downloading package punkt to /Users/admin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [335]:
print(train_df.tweet[0])

when father is dysfunctional and is so selfish he drags his kids into his dysfunction run


In [336]:
word_tokenize(train_df.tweet[0])

['when',
 'father',
 'is',
 'dysfunctional',
 'and',
 'is',
 'so',
 'selfish',
 'he',
 'drags',
 'his',
 'kids',
 'into',
 'his',
 'dysfunction',
 'run']

In [337]:
def tokenize_text(text):
    return word_tokenize(text)

In [338]:
train_df['tweet_token'] = train_df['tweet'].apply(tokenize_text)
train_df.head()

Unnamed: 0,id,label,tweet,tweet_token
0,1,0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,..."
1,2,0,thanks for lyft credit can not use cause they ...,"[thanks, for, lyft, credit, can, not, use, cau..."
2,3,0,bihday your majesty,"[bihday, your, majesty]"
3,4,0,model love take with all the time in urð,"[model, love, take, with, all, the, time, in, ..."
4,5,0,factsguide society now motivation,"[factsguide, society, now, motivation]"


In [339]:
test_df['tweet_token'] = test_df['tweet'].apply(tokenize_text)
test_df.head()

Unnamed: 0,id,tweet,tweet_token
0,31963,studiolife aislife requires passion dedication...,"[studiolife, aislife, requires, passion, dedic..."
1,31964,white supremacists want everyone to see the ne...,"[white, supremacists, want, everyone, to, see,..."
2,31965,safe ways to heal your acne altwaystoheal heal...,"[safe, ways, to, heal, your, acne, altwaystohe..."
3,31966,is the hp and the cursed child book up for res...,"[is, the, hp, and, the, cursed, child, book, u..."
4,31967,3rd bihday to my amazing hilarious nephew eli ...,"[3rd, bihday, to, my, amazing, hilarious, neph..."


### 11. Remove stop words from the tokens using nltk.corpus.stopwords.
The result goes into a new column 'tweet_token_filtered'.

In [340]:
from nltk.corpus import stopwords

In [341]:
mystopwords = set(stopwords.words("english"))

def remove_stopwords(tokens):
    # Keep only the tokens that are not English stop words.
    return [w for w in tokens if w not in mystopwords]

In [361]:
# mystopwords

In [343]:
train_df['tweet_token_filtered'] = train_df['tweet_token'].apply(remove_stopwords)
train_df.head()

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered
0,1,0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ..."
1,2,0,thanks for lyft credit can not use cause they ...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee..."
2,3,0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]"
3,4,0,model love take with all the time in urð,"[model, love, take, with, all, the, time, in, ...","[model, love, take, time, urð]"
4,5,0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"


In [344]:
test_df['tweet_token_filtered'] = test_df['tweet_token'].apply(remove_stopwords)
test_df.head()

Unnamed: 0,id,tweet,tweet_token,tweet_token_filtered
0,31963,studiolife aislife requires passion dedication...,"[studiolife, aislife, requires, passion, dedic...","[studiolife, aislife, requires, passion, dedic..."
1,31964,white supremacists want everyone to see the ne...,"[white, supremacists, want, everyone, to, see,...","[white, supremacists, want, everyone, see, new..."
2,31965,safe ways to heal your acne altwaystoheal heal...,"[safe, ways, to, heal, your, acne, altwaystohe...","[safe, ways, heal, acne, altwaystoheal, health..."
3,31966,is the hp and the cursed child book up for res...,"[is, the, hp, and, the, cursed, child, book, u...","[hp, cursed, child, book, reservations, alread..."
4,31967,3rd bihday to my amazing hilarious nephew eli ...,"[3rd, bihday, to, my, amazing, hilarious, neph...","[3rd, bihday, amazing, hilarious, nephew, eli,..."
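
The filtered output above shows that "can" and "not" are both removed, so the negation introduced in step 3 is lost. If negation should be kept for the classifier, the negation words can be excluded from the stop-word set. The sketch below uses a tiny hand-written set as a stand-in for `stopwords.words("english")`; all names in it are assumptions.

```python
# Tiny stand-in for the NLTK English stop-word list.
stop_words = {"i", "can", "not", "for", "they", "do", "in"}
negations = {"not", "no", "nor"}

# Stop words with the negation words taken out.
keep_negation = stop_words - negations

def filter_tokens(tokens, stop_set):
    # Keep only the tokens that are not in the given stop-word set.
    return [t for t in tokens if t not in stop_set]

tokens = ["thanks", "for", "lyft", "credit", "can", "not", "use"]

plain = filter_tokens(tokens, stop_words)
with_negation = filter_tokens(tokens, keep_negation)

print(plain)          # ['thanks', 'lyft', 'credit', 'use']
print(with_negation)  # ['thanks', 'lyft', 'credit', 'not', 'use']
```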


### 12. Apply stemming to the tokens with nltk.stem.PorterStemmer.
The result goes into a new column 'tweet_stemmed'.

In [345]:
from nltk.stem import PorterStemmer

In [346]:
stemmer = PorterStemmer()

In [347]:
def stem_words(tokens):
    new_tokens = [stemmer.stem(word) for word in tokens]
    new_text = " ".join(new_tokens)
    return new_text

In [348]:
train_df['tweet_stemmed'] = train_df['tweet_token_filtered'].apply(stem_words)
train_df.head()

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered,tweet_stemmed
0,1,0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ...",father dysfunct selfish drag kid dysfunct run
1,2,0,thanks for lyft credit can not use cause they ...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee...",thank lyft credit use caus offer wheelchair va...
2,3,0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]",bihday majesti
3,4,0,model love take with all the time in urð,"[model, love, take, with, all, the, time, in, ...","[model, love, take, time, urð]",model love take time urð
4,5,0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]",factsguid societi motiv


In [349]:
test_df['tweet_stemmed'] = test_df['tweet_token_filtered'].apply(stem_words)
test_df.head()

Unnamed: 0,id,tweet,tweet_token,tweet_token_filtered,tweet_stemmed
0,31963,studiolife aislife requires passion dedication...,"[studiolife, aislife, requires, passion, dedic...","[studiolife, aislife, requires, passion, dedic...",studiolif aislif requir passion dedic willpow ...
1,31964,white supremacists want everyone to see the ne...,"[white, supremacists, want, everyone, to, see,...","[white, supremacists, want, everyone, see, new...",white supremacist want everyon see new birdsâ ...
2,31965,safe ways to heal your acne altwaystoheal heal...,"[safe, ways, to, heal, your, acne, altwaystohe...","[safe, ways, heal, acne, altwaystoheal, health...",safe way heal acn altwaystoh healthi heal
3,31966,is the hp and the cursed child book up for res...,"[is, the, hp, and, the, cursed, child, book, u...","[hp, cursed, child, book, reservations, alread...",hp curs child book reserv alreadi ye harrypott...
4,31967,3rd bihday to my amazing hilarious nephew eli ...,"[3rd, bihday, to, my, amazing, hilarious, neph...","[3rd, bihday, amazing, hilarious, nephew, eli,...",3rd bihday amaz hilari nephew eli ahmir uncl d...


### 13. Apply lemmatization to the tokens with nltk.stem.wordnet.WordNetLemmatizer.
The result goes into a new column 'tweet_lemmatized'.

In [350]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/admin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [351]:
lemmatizer = WordNetLemmatizer()

In [352]:
def lemmatizer_words(tokens):
    new_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    new_text = " ".join(new_tokens)
    return new_text 

In [353]:
train_df['tweet_lemmatized'] = train_df['tweet_token_filtered'].apply(lemmatizer_words)
train_df.head()

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,1,0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ...",father dysfunct selfish drag kid dysfunct run,father dysfunctional selfish drag kid dysfunct...
1,2,0,thanks for lyft credit can not use cause they ...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee...",thank lyft credit use caus offer wheelchair va...,thanks lyft credit use cause offer wheelchair ...
2,3,0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]",bihday majesti,bihday majesty
3,4,0,model love take with all the time in urð,"[model, love, take, with, all, the, time, in, ...","[model, love, take, time, urð]",model love take time urð,model love take time urð
4,5,0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]",factsguid societi motiv,factsguide society motivation


In [354]:
test_df['tweet_lemmatized'] = test_df['tweet_token_filtered'].apply(lemmatizer_words)
test_df.head()

Unnamed: 0,id,tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,31963,studiolife aislife requires passion dedication...,"[studiolife, aislife, requires, passion, dedic...","[studiolife, aislife, requires, passion, dedic...",studiolif aislif requir passion dedic willpow ...,studiolife aislife requires passion dedication...
1,31964,white supremacists want everyone to see the ne...,"[white, supremacists, want, everyone, to, see,...","[white, supremacists, want, everyone, see, new...",white supremacist want everyon see new birdsâ ...,white supremacist want everyone see new birdsâ...
2,31965,safe ways to heal your acne altwaystoheal heal...,"[safe, ways, to, heal, your, acne, altwaystohe...","[safe, ways, heal, acne, altwaystoheal, health...",safe way heal acn altwaystoh healthi heal,safe way heal acne altwaystoheal healthy healing
3,31966,is the hp and the cursed child book up for res...,"[is, the, hp, and, the, cursed, child, book, u...","[hp, cursed, child, book, reservations, alread...",hp curs child book reserv alreadi ye harrypott...,hp cursed child book reservation already yes h...
4,31967,3rd bihday to my amazing hilarious nephew eli ...,"[3rd, bihday, to, my, amazing, hilarious, neph...","[3rd, bihday, amazing, hilarious, nephew, eli,...",3rd bihday amaz hilari nephew eli ahmir uncl d...,3rd bihday amazing hilarious nephew eli ahmir ...


### 14. Save the preprocessing result to a pickle file.

In [355]:
import pickle

# save train_df
with open('train_tweets.pickle', 'wb') as f:
    pickle.dump(train_df, f)

In [356]:
# check
with open('train_tweets.pickle', 'rb') as f:
    train_df = pickle.load(f)

In [357]:
train_df.head(3)

Unnamed: 0,id,label,tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,1,0,when father is dysfunctional and is so selfish...,"[when, father, is, dysfunctional, and, is, so,...","[father, dysfunctional, selfish, drags, kids, ...",father dysfunct selfish drag kid dysfunct run,father dysfunctional selfish drag kid dysfunct...
1,2,0,thanks for lyft credit can not use cause they ...,"[thanks, for, lyft, credit, can, not, use, cau...","[thanks, lyft, credit, use, cause, offer, whee...",thank lyft credit use caus offer wheelchair va...,thanks lyft credit use cause offer wheelchair ...
2,3,0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]",bihday majesti,bihday majesty
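
The save-and-reload check can be made explicit by comparing the restored object with the original. The sketch below does this with a small list of dictionaries standing in for the DataFrame and a temporary directory instead of the working directory; both are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the preprocessed data (not the real DataFrame).
records = [{"id": 3, "tokens": ["bihday", "majesty"]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pickle")
    with open(path, "wb") as f:
        pickle.dump(records, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == records)  # True
```

For DataFrames, pandas also provides `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same mechanism.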


In [358]:
# save test_df
with open('test_tweets.pickle', 'wb') as f:
    pickle.dump(test_df, f)

In [359]:
# check
with open('test_tweets.pickle', 'rb') as f:
    test_df = pickle.load(f)

In [360]:
test_df.head(3)

Unnamed: 0,id,tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,31963,studiolife aislife requires passion dedication...,"[studiolife, aislife, requires, passion, dedic...","[studiolife, aislife, requires, passion, dedic...",studiolif aislif requir passion dedic willpow ...,studiolife aislife requires passion dedication...
1,31964,white supremacists want everyone to see the ne...,"[white, supremacists, want, everyone, to, see,...","[white, supremacists, want, everyone, see, new...",white supremacist want everyon see new birdsâ ...,white supremacist want everyone see new birdsâ...
2,31965,safe ways to heal your acne altwaystoheal heal...,"[safe, ways, to, heal, your, acne, altwaystohe...","[safe, ways, heal, acne, altwaystoheal, health...",safe way heal acn altwaystoh healthi heal,safe way heal acne altwaystoheal healthy healing
