In [41]:
import numpy as np
import re
import pandas as pd

## Building the list of authors

We take the data from the `Sentiment140 dataset with 1.6 million tweets` (https://www.kaggle.com/kazanova/sentiment140). We do not need all of the tweets; we will only use a group of several users.

In [15]:
data_tweets = pd.read_csv('training.1600000.processed.noemoticon.csv', 
                          names = ['0', 'id', 'date', 'q', 'user', 'text'],
                          encoding='latin-1')

In [16]:
data_tweets.head()

Unnamed: 0,0,id,date,q,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [17]:
data_tweets.shape

(1600000, 6)

We only need the usernames and the tweets, so we drop the remaining columns.

In [19]:
data_tweets = data_tweets.drop(columns = ['0', 'id', 'date', 'q'])

In [20]:
data_tweets.head()

Unnamed: 0,user,text
0,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,scotthamilton,is upset that he can't update his Facebook by ...
2,mattycus,@Kenichan I dived many times for the ball. Man...
3,ElleCTF,my whole body feels itchy and like its on fire
4,Karoli,"@nationwideclass no, it's not behaving at all...."


For simplicity we consider only the first `100 000` rows, to save memory and running time. The code below works for a table of any size.

In [23]:
data_tweets = data_tweets[:100000]

In [24]:
data_tweets.shape

(100000, 2)
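If memory is the main concern, the same reduction can be done at load time instead of after reading the whole file. A minimal sketch, assuming a small in-memory CSV (`csv_text`, made up for illustration) in place of the real file; `usecols` and `nrows` are standard `pd.read_csv` parameters:

```python
import io
import pandas as pd

# Toy stand-in for the Sentiment140 file (hypothetical rows, same column layout).
csv_text = (
    '"0","1","Mon Apr 06","NO_QUERY","user_a","first tweet"\n'
    '"0","2","Mon Apr 06","NO_QUERY","user_b","second tweet"\n'
    '"4","3","Mon Apr 06","NO_QUERY","user_a","third tweet"\n'
    '"4","4","Mon Apr 06","NO_QUERY","user_c","fourth tweet"\n'
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['0', 'id', 'date', 'q', 'user', 'text'],
    usecols=['user', 'text'],  # parse only the columns we keep
    nrows=3,                   # stop after the first three rows
)
print(sample.shape)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the file name and `encoding='latin-1'` added, as in the cell that loads the data.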

In [25]:
data_tweets.user.unique().shape

(77892,)

The number of unique usernames (77 892) is smaller than the number of rows (100 000), so usernames repeat: some users have more than one record in our slice of the table.
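To see which users repeat and how often, the tweets can be counted per user. A small sketch on a made-up frame (`demo` and its usernames are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature of the user column.
demo = pd.DataFrame({'user': ['a_b', 'c_d', 'a_b', 'e_f', 'a_b', 'c_d']})

# Number of tweets per user, most active users first.
tweets_per_user = demo.user.value_counts()

# Users that have more than one tweet.
repeated = tweets_per_user[tweets_per_user > 1]
print(repeated)
```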

In [44]:
# Keep users whose name splits on '_' into more than one non-empty part (snake_case names).
data_snake_case = data_tweets[data_tweets.user.apply(lambda x: len([part for part in x.split('_') if part]) > 1)]

In [45]:
data_snake_case.head()

Unnamed: 0,user,text
5,joy_wolf,@Kwesidei not the whole crew
19,gi_gi_bee,@FakerPattyPattz Oh dear. Were you drinking ou...
36,crosland_12,@cocomix04 ill tell ya the story later not a ...
39,Anthony_Nguyen,Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then...
52,crzy_cdn_bulas,our duck and chicken are taking wayyy too long...


In [36]:
data_tweets[data_tweets.user.apply(lambda x: len(x.split(' ')) > 1)].shape

(0, 2)

In [38]:
data_tweets[data_tweets.user.apply(lambda x: len(x.split('-')) > 1)].shape

(0, 2)
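The separator checks above can also be written with vectorized string methods. A sketch on a made-up series of usernames (`users` is illustrative); note that `str.contains('_')` is slightly looser than the `snake_case` filter, since it also matches names such as `_name_` that have only one non-empty part:

```python
import pandas as pd

# Hypothetical usernames for illustration.
users = pd.Series(['joy_wolf', 'MarinaMartin', 'scotthamilton', 'gi_gi_bee'])

# Count the names containing each candidate separator (literal match, no regex).
separator_counts = {
    sep: int(users.str.contains(sep, regex=False).sum())
    for sep in [' ', '-', '_']
}
print(separator_counts)
```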

Usernames are written either in `camelCase`, where words are separated by capital letters, or in `snake_case`, where words are separated by underscores; the two checks above show that no username contains a space or a hyphen. The `snake_case` filter did not account for `camelCase`, so for that we use regular expressions.

In [43]:
data_tweets[data_tweets.user.apply(lambda name: len(re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()) > 1)]

Unnamed: 0,user,text
0,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
3,ElleCTF,my whole body feels itchy and like its on fire
7,coZZ,@LOLTrish hey long time no see! Yes.. Rains a...
8,2Hood4Hollywood,@Tatiana_K nope they didn't have it
12,TLeC,@caregiving I couldn't bear to watch it. And ...
...,...,...
99970,MarinaMartin,@TimNeill It won't let me download from a Mac ...
99971,NatElev,@Dannymcfly Stop complaining At least you don...
99977,JCDichant,"@Turone aarggh, comprends pas"
99993,Jesika_Rose,revising buisiness studys
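The regular expressions used in the filter can be tried on individual names. A minimal sketch, assuming a helper `split_username` (a hypothetical name, not part of the notebook) that applies the same two substitutions and additionally treats underscores as separators:

```python
import re

def split_username(name):
    """Split a username into words on underscores and camelCase boundaries."""
    spaced = name.replace('_', ' ')
    # Put a space before every run of capital letters ...
    spaced = re.sub(r'([A-Z]+)', r' \1', spaced)
    # ... and before every capitalized word, so 'XMLParser' becomes 'XML Parser'.
    spaced = re.sub(r'([A-Z][a-z]+)', r' \1', spaced)
    return spaced.split()

for name in ['MarinaMartin', 'ElleCTF', 'joy_wolf', 'Karoli']:
    print(name, split_username(name))
```

A username counts as multi-word when the returned list has more than one element.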


## Shortest tweets for authors with usernames of two or more words

In [49]:
usernames = data_snake_case.user.unique()

In [65]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning.
data_snake_case = data_snake_case.copy()
data_snake_case['ind'] = np.arange(len(data_snake_case))

In [72]:
# Use the positional counter as the index, so that index labels coincide with row positions.
data_snake_case = data_snake_case.set_index('ind')

In [74]:
shortest = []
for username in usernames:
    # idxmin returns an index label, so the row is looked up with .loc rather than .iloc.
    id_val = data_snake_case[data_snake_case.user == username].text.str.len().idxmin()
    shortest.append(data_snake_case.loc[id_val])

In [78]:
shortest = pd.DataFrame(shortest)

In [80]:
shortest.head()

Unnamed: 0,user,text
0,joy_wolf,@Kwesidei not the whole crew
1529,gi_gi_bee,@Author82
2,crosland_12,@cocomix04 ill tell ya the story later not a ...
3,Anthony_Nguyen,Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then...
4,crzy_cdn_bulas,our duck and chicken are taking wayyy too long...


This gives a table with the shortest tweet of each author.
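The per-user loop can also be expressed as a single grouped operation. A sketch on made-up data (`tweets` and its rows are illustrative), assuming the same `user` and `text` columns; `sort=False` keeps the users in order of first appearance, as the loop does:

```python
import pandas as pd

# Hypothetical tweets: two users, two tweets each.
tweets = pd.DataFrame({
    'user': ['joy_wolf', 'gi_gi_bee', 'joy_wolf', 'gi_gi_bee'],
    'text': ['a long tweet here', 'hello there', 'short', 'hi'],
})

# Length of every tweet, then the index label of the shortest one per user.
lengths = tweets.text.str.len()
shortest_idx = lengths.groupby(tweets.user, sort=False).idxmin()

# Select those rows by label.
shortest_demo = tweets.loc[shortest_idx]
print(shortest_demo)
```

This avoids filtering the whole table once per username, which matters when there are thousands of users.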

In [85]:
shortest.text

0                           @Kwesidei not the whole crew 
1529                                           @Author82 
2       @cocomix04 ill tell ya the story later  not a ...
3       Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then...
4       our duck and chicken are taking wayyy too long...
                              ...                        
9426    I think im dyin. But i don't wanna go to the d...
9427                                     @_leirion Why  ?
9428    It looks like it is going to rain. I don't lik...
9429                                      At work at 7am 
9430                           revising buisiness studys 
Name: text, Length: 7210, dtype: object

The table contains no `emoji` data.

## Sentiment

How do we decide which words are positive and which are negative? In general this is a machine learning task, but here we simply use small hand-picked lists of positive and negative words.

In [99]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\79217\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [95]:
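from nltk.stem import WordNetLemmatizer
# WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once if it is missing.
lemmatizer = WordNetLemmatizer()

Tokenize and lemmatize all tweets.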

In [118]:
new_texts = shortest.text.apply(lambda x: [lemmatizer.lemmatize(w) for w in nltk.word_tokenize(x.lower())])

In [119]:
new_texts

0                    [@, kwesidei, not, the, whole, crew]
1529                                        [@, author82]
2       [@, cocomix04, ill, tell, ya, the, story, late...
3       [bed, ., class, 8-12., work, 12-3., gym, 3-5, ...
4       [our, duck, and, chicken, are, taking, wayyy, ...
                              ...                        
9426    [i, think, im, dyin, ., but, i, do, n't, wan, ...
9427                                [@, _leirion, why, ?]
9428    [it, look, like, it, is, going, to, rain, ., i...
9429                                  [at, work, at, 7am]
9430                         [revising, buisiness, study]
Name: text, Length: 7210, dtype: object

Define the good and the bad words.

In [120]:
bad_words = ['bad', 'poor', 'ill', 'low', 'inferior', 'wretched']

In [121]:
good_words = ['good', 'boon', 'blessing', 'weal']

Compute a sentiment score for each tweet: the number of good words minus the number of bad words.

In [123]:
sentiment = new_texts.apply(lambda x: np.sum([x.count(word) for word in good_words]) -  np.sum([x.count(word) for word in bad_words]))

In [125]:
sentiment.value_counts()

 0    6683
-1     292
 1     222
-2       7
 2       5
-3       1
Name: text, dtype: int64
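The scoring rule is easy to check on a single tokenized tweet. A minimal sketch using the same word lists; `sentiment_score` is a hypothetical helper name and the token lists are made up:

```python
good_words = ['good', 'boon', 'blessing', 'weal']
bad_words = ['bad', 'poor', 'ill', 'low', 'inferior', 'wretched']

def sentiment_score(tokens):
    """Number of good words minus number of bad words in a list of tokens."""
    good = sum(tokens.count(word) for word in good_words)
    bad = sum(tokens.count(word) for word in bad_words)
    return good - bad

print(sentiment_score(['good', 'morning', 'not', 'good']))
print(sentiment_score(['poor', 'baby', 'bad', 'throat']))
print(sentiment_score(['at', 'work', 'at', '7am']))
```

Because the rule only counts words, a phrase such as "not good" still adds to the positive side.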

Most tweets (6683 of 7210) score 0, i.e. they are neutral. Extending the word lists would give a more accurate estimate of sentiment, but it is much easier to work with labeled data, especially since this is a long-solved problem in machine learning.
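For reference, Sentiment140 already ships with a polarity label in its first column (named `'0'` when the file was loaded and dropped afterwards), where 0 marks a negative tweet and 4 a positive one. A hedged sketch on made-up rows showing how the word-count score could be compared with such a label; `labeled`, `scored` and `agree` are illustrative names and the numbers are not real results:

```python
import pandas as pd

# Hypothetical rows: dataset polarity (0 or 4) next to a word-count score.
labeled = pd.DataFrame({
    'polarity': [0, 4, 0, 4],
    'score': [-1, 1, 1, 0],
})

# Map the dataset labels to -1 / +1.
labeled['label_sign'] = labeled.polarity.map({0: -1, 4: 1})

# Compare only tweets where the word lists produced a non-zero score.
scored = labeled[labeled.score != 0]
score_sign = scored.score.apply(lambda s: 1 if s > 0 else -1)
agree = (score_sign == scored.label_sign).mean()
print(agree)
```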

In [130]:
shortest['sentiment'] = sentiment

In [131]:
shortest.head()

Unnamed: 0,user,text,sentiment
0,joy_wolf,@Kwesidei not the whole crew,0
1529,gi_gi_bee,@Author82,0
2,crosland_12,@cocomix04 ill tell ya the story later not a ...,-1
3,Anthony_Nguyen,Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then...,0
4,crzy_cdn_bulas,our duck and chicken are taking wayyy too long...,0


## The worst and the best tweets

In [132]:
shortest[shortest.sentiment == 2].text

1493    i am not rocking shop at texas holdem   Here's...
3452    Ranny, wet, and dull. NOT good weather today  ...
6167    @energyUK I'm good - London is good, although ...
8292            good morning... still sun burnt  not good
7539    @angelbabybop thats good my knees hurt  do you...
Name: text, dtype: object

In [133]:
shortest[shortest.sentiment <= -2].text

754     @missezrenee Poor baby, I have a bad throat al...
1249    Back accchhhe super bad!  not the film Superba...
1625    @artoni Swindle could probably sell me that fa...
2259    Andrea Is At Home Bored!! Has Got College Tomm...
5763    I can see the sun!! Too bad Heels and Hills ha...
3786    How sad is it that i'm resorting to being an a...
4250    Owie  my poor wrist hurts &amp; I'm moving roo...
6633    feels bad for poor jacob kitty who we had to l...
Name: text, dtype: object

As the tables show, most of the tweets we find carry a negative meaning, including some with a positive score: the word counts ignore negation such as `NOT good`.

## Most frequent words

In [139]:
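# Drop a stray 'sentiment' entry if an earlier run added one to the Series; otherwise this is a no-op.
new_texts = new_texts.drop('sentiment', errors='ignore')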

In [141]:
shortest['text'] = new_texts

In [142]:
shortest.head()

Unnamed: 0,user,text,sentiment
0,joy_wolf,"[@, kwesidei, not, the, whole, crew]",0
1529,gi_gi_bee,"[@, author82]",0
2,crosland_12,"[@, cocomix04, ill, tell, ya, the, story, late...",-1
3,Anthony_Nguyen,"[bed, ., class, 8-12., work, 12-3., gym, 3-5, ...",0
4,crzy_cdn_bulas,"[our, duck, and, chicken, are, taking, wayyy, ...",0


In [146]:
list_good = shortest[shortest.sentiment >= 1].text.sum()

In [147]:
for word in good_words:
    print(word, ': ', list_good.count(word))

good :  231
boon :  0
blessing :  1
weal :  0


In [149]:
list_bad = shortest[shortest.sentiment <= -1].text.sum()

In [150]:
for word in bad_words:
    print(word, ': ', list_bad.count(word))

bad :  199
poor :  60
ill :  34
low :  16
inferior :  1
wretched :  0
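The same counts can be obtained in one pass with `collections.Counter`, which avoids concatenating all token lists into one long list first. A sketch on made-up token lists (`token_lists` is illustrative):

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized tweets.
token_lists = [['good', 'morning'], ['not', 'good'], ['a', 'blessing']]

# Count every token across all tweets at once.
counts = Counter(chain.from_iterable(token_lists))

for word in ['good', 'boon', 'blessing', 'weal']:
    # A Counter returns 0 for words that never occur.
    print(word, ': ', counts[word])
```

`counts.most_common(10)` would list the ten most frequent tokens overall, not only those from the word lists.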


We have found the most frequent words from each list.