# SemEval-2017 Task 4: Sentiment Analysis in Twitter
## Subtask C: Topic-Based Message Polarity Classification
### Given a message and a topic, classify the sentiment conveyed in the tweet towards the topic on a five-point scale (-2, -1, 0, 1, 2).

https://competitions.codalab.org/competitions/15937

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

import preprocessor as p
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.RESERVED, p.OPT.EMOJI,
              p.OPT.SMILEY, p.OPT.NUMBER, p.OPT.HASHTAG)

import warnings
warnings.simplefilter('ignore')

## Loading and initial data analysis

* Load the training and test data and take a look at them.

In [2]:
with open("train.txt") as f:
    train_data = f.readlines()
with open("test.txt") as f:
    test_data = f.readlines()

In [3]:
train_data = pd.DataFrame([i.split('\t') for i in train_data])
test_data = pd.DataFrame([i.split('\t') for i in test_data])

train_data.columns = ['id','topic','polarity','text']
test_data.columns = ['id','topic','polarity','text']

train_data = train_data[['id','topic','text','polarity']]
test_data = test_data[['id','topic','text','polarity']]
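
Splitting on tabs as above leaves the trailing newline attached to the last field; it is visible in the raw tweet printed later in the notebook. A minimal sketch of the same parsing that strips it, using only the standard library. The sample lines and the `parse_rows` helper are invented for illustration and are not part of the dataset or the notebook:

```python
import io

COLUMNS = ['id', 'topic', 'polarity', 'text']

def parse_rows(handle):
    """Split tab-separated lines into dicts, dropping the trailing newline."""
    rows = []
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        rows.append(dict(zip(COLUMNS, fields)))
    return rows

# Hypothetical two-line sample in the same id/topic/polarity/text layout
sample = io.StringIO(
    "1\t@microsoft\t-1\tfirst example tweet\n"
    "2\tamy schumer\t0\tsecond example tweet\n"
)
rows = parse_rows(sample)
print(rows[0]['text'])
```

A list of such dicts can be passed directly to `pd.DataFrame`, which takes the column names from the keys.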

In [4]:
train_data.head()

|   | id | topic | text | polarity |
|---|---|---|---|---|
| 0 | 628949369883000832 | @microsoft | dear @Microsoft the newOoffice for Mac is grea... | -1 |
| 1 | 628976607420645377 | @microsoft | @Microsoft how about you make a system that do... | -2 |
| 2 | 629023169169518592 | @microsoft | I may be ignorant on this issue but... should ... | -1 |
| 3 | 629179223232479232 | @microsoft | Thanks to @microsoft, I just may be switching ... | -1 |
| 4 | 629186282179153920 | @microsoft | If I make a game as a #windows10 Universal App... | 0 |


In [5]:
test_data.head()

|   | id | topic | text | polarity |
|---|---|---|---|---|
| 0 | 681563394940473347 | amy schumer | @MargaretsBelly Amy Schumer is the stereotypic... | -1 |
| 1 | 675847244747177984 | amy schumer | @dani_pitter I mean I get the hype around JLaw... | -1 |
| 2 | 672827854279843840 | amy schumer | Amy Schumer at the #GQmenoftheyear2015 party i... | -1 |
| 3 | 671502639671042048 | amy schumer | "Amy Schumer may have brought us Trainwreck, b... | -1 |
| 4 | 677359143108214784 | amy schumer | I just think that sports are stupid &amp;anyon... | -1 |


- As we can see, the data contain the fields `id`, `topic`, `text`, and `polarity`.
- For classification we do not need `id`, since the tweet text has already been downloaded, and `topic` is not used in this approach.
- In addition, to make the target variable easier to treat as categories, we shift the labels from the range -2..2 to 0..4.

In [6]:
train_data = train_data.drop(['id','topic'], axis = 1)
test_data = test_data.drop(['id','topic'], axis = 1)

train_data['polarity'] = [int(x) + 2 for x in train_data['polarity']]
test_data['polarity'] = [int(x) + 2 for x in test_data['polarity']]
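
The shift by 2 is easy to invert when predictions need to be reported on the original scale. A small sketch; the helper names `to_index` and `to_polarity` are illustrative and not part of the notebook:

```python
OFFSET = 2  # maps the original labels -2..2 onto class indices 0..4

def to_index(polarity):
    """Convert an original label (possibly a string such as '-1') to 0..4."""
    return int(polarity) + OFFSET

def to_polarity(index):
    """Convert a class index 0..4 back to the original -2..2 scale."""
    return index - OFFSET

labels = ['-2', '-1', '0', '1', '2']
indices = [to_index(x) for x in labels]
restored = [to_polarity(i) for i in indices]
print(indices)
print(restored)
```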

In [7]:
train_data.head()

|   | text | polarity |
|---|---|---|
| 0 | dear @Microsoft the newOoffice for Mac is grea... | 1 |
| 1 | @Microsoft how about you make a system that do... | 0 |
| 2 | I may be ignorant on this issue but... should ... | 1 |
| 3 | Thanks to @microsoft, I just may be switching ... | 1 |
| 4 | If I make a game as a #windows10 Universal App... | 2 |


In [8]:
test_data.head()

|   | text | polarity |
|---|---|---|
| 0 | @MargaretsBelly Amy Schumer is the stereotypic... | 1 |
| 1 | @dani_pitter I mean I get the hype around JLaw... | 1 |
| 2 | Amy Schumer at the #GQmenoftheyear2015 party i... | 1 |
| 3 | "Amy Schumer may have brought us Trainwreck, b... | 1 |
| 4 | I just think that sports are stupid &amp;anyon... | 1 |


- Now that the data are in a uniform format, the text needs preprocessing: convert it to lowercase and strip everything superfluous, namely special characters, numbers, links, and reserved words.

## Text preprocessing

* Take a tweet from the training set and look at it.

In [9]:
train_data['text'][7]

'@Microsoft 2nd computer with same error!!! #Windows10fail Guess we will shelve this until SP1! http://t.co/QCcHlKuy8Q\n'

* The text needs cleaning. For this we use a tweet preprocessing library: https://github.com/s/preprocessor  
It can remove links, hashtags, mentions, and so on.
Its usage is shown below:

In [10]:
p.clean(train_data['text'][7].lower())

'nd computer with same error!!! guess we will shelve this until sp1!'
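
To give a rough idea of what this kind of cleaning involves, here is a standard-library approximation that removes URLs, mentions, and hashtags with regular expressions. It is not the `preprocessor` library's implementation; the patterns and the `rough_clean` name are simplifying assumptions. In particular, it leaves numbers untouched, so its result differs from the output above.

```python
import re

# Simplified patterns, chosen for illustration only
URL = re.compile(r'https?://\S+')
MENTION = re.compile(r'@\w+')
HASHTAG = re.compile(r'#\w+')

def rough_clean(text):
    """Drop URLs, @mentions and #hashtags, then collapse whitespace."""
    for pattern in (URL, MENTION, HASHTAG):
        text = pattern.sub('', text)
    return ' '.join(text.split())

tweet = ('@Microsoft 2nd computer with same error!!! #Windows10fail '
         'Guess we will shelve this until SP1! http://t.co/QCcHlKuy8Q\n')
cleaned = rough_clean(tweet.lower())
print(cleaned)
```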

* Next, we apply a tokenizer, which strips the remaining unwanted characters, and join the resulting tokens back into a sentence.

In [11]:
' '.join(CountVectorizer().build_tokenizer()(p.clean(train_data['text'][7].lower())))

'nd computer with same error guess we will shelve this until sp1'
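
The default `CountVectorizer` tokenizer is a regular-expression match with the pattern `(?u)\b\w\w+\b`, that is, runs of two or more word characters, which is why punctuation and single-character tokens disappear. A sketch reproducing the step above with `re` alone (the variable names are illustrative):

```python
import re

# Default token pattern of scikit-learn's CountVectorizer
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

cleaned = 'nd computer with same error!!! guess we will shelve this until sp1!'
tokens = TOKEN_PATTERN.findall(cleaned)
joined = ' '.join(tokens)
print(joined)

# Single-character tokens are dropped, so contractions lose their tail
contraction = TOKEN_PATTERN.findall("doesn't eat")
print(contraction)
```

This also explains fragments such as `doesn` in the cleaned data further down.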

* Now do the same for the whole dataset.

In [12]:
# Lowercase, strip Twitter entities with preprocessor, then keep only word tokens
tokenize = CountVectorizer().build_tokenizer()  # build once instead of once per row
train_data['text'] = train_data['text'].apply(lambda t: ' '.join(tokenize(p.clean(t.lower()))))
test_data['text'] = test_data['text'].apply(lambda t: ' '.join(tokenize(p.clean(t.lower()))))

In [13]:
train_data.head()

|   | text | polarity |
|---|---|---|
| 0 | dear the newooffice for mac is great and all b... | 1 |
| 1 | how about you make system that doesn eat my fr... | 0 |
| 2 | may be ignorant on this issue but should we ce... | 1 |
| 3 | thanks to just may be switching over to | 1 |
| 4 | if make game as universal app will owners be a... | 2 |


In [14]:
test_data.head()

|   | text | polarity |
|---|---|---|
| 0 | amy schumer is the stereotypical st world laci... | 1 |
| 1 | mean get the hype around jlaw may not like her... | 1 |
| 2 | amy schumer at the party in dress we pretty mu... | 1 |
| 3 | amy schumer may have brought us trainwreck but... | 1 |
| 4 | just think that sports are stupid amp anyone w... | 1 |


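## TF-IDF representation of tweets

* Next, we represent the tweets as TF-IDF vectors using `TfidfVectorizer`.
* Setting `min_df = 5` discards words that occur in fewer than 5 documents. Note that the cell below constructs `TfidfVectorizer()` with default parameters, so this threshold is not applied there.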

In [15]:
TF_IDF = TfidfVectorizer()
TF_IDF.fit(list(train_data['text']) + list(test_data['text']))
X_train = TF_IDF.transform(list(train_data['text']))
X_test = TF_IDF.transform(list(test_data['text']))

y_train = list(train_data['polarity'])
y_test = list(test_data['polarity'])
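
For reference, `min_df` thresholds on document frequency, the number of documents a term occurs in, rather than on its total count. A pure-Python sketch of that filtering on a toy corpus; the corpus and the `filter_vocabulary` helper are invented for illustration:

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): count each term once per document
    return df

def filter_vocabulary(docs, min_df):
    """Keep terms whose document frequency is at least min_df."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)

# Invented toy corpus: 'good' occurs three times but in only two documents
docs = ['good good phone', 'good battery', 'bad battery']
vocab = filter_vocabulary(docs, min_df=2)
print(vocab)
```

In scikit-learn the same threshold would be passed as `TfidfVectorizer(min_df=5)`. Fitting the vectorizer on the training texts only would also keep test-set vocabulary and IDF statistics out of the features; the cell above fits on both sets.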

### *Logistic regression*

In [35]:
model = LogisticRegression()
model.fit(X_train, y_train)

prediction = model.predict(X_test)

In [36]:
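# NB: here and below the arguments are (prediction, y_test), so precision/recall are swapped and support counts predicted labels
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))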

             precision    recall  f1-score   support

         -2       0.00      0.00      0.00         0
         -1       0.02      0.36      0.03       105
          0       0.19      0.62      0.29      3106
          1       0.93      0.42      0.58     17409
          2       0.01      0.42      0.03        12

avg / total       0.81      0.45      0.53     20632
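
A note on reading the report above: `metrics.classification_report` expects its arguments in the order `(y_true, y_pred)`, whereas the call passes `(prediction, y_test)`. With the arguments swapped, the precision and recall columns trade places and `support` counts predicted rather than true labels. A toy illustration in plain Python (the data and the `precision_recall` helper are made up):

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class from two aligned label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = [3, 3, 3, 2, 2, 1]
y_pred = [3, 3, 3, 3, 3, 3]   # a model that always predicts class 3

correct_order = precision_recall(y_true, y_pred, label=3)
swapped_order = precision_recall(y_pred, y_true, label=3)
print(correct_order)   # (precision, recall) with (y_true, y_pred)
print(swapped_order)   # the two values trade places when the lists are swapped
```

Read this way, a support of 0 for a class in the reports means the model never predicted that class.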



### *Decision trees*

In [37]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [38]:
prediction = model.predict(X_test)

In [39]:
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))

             precision    recall  f1-score   support

         -2       0.00      0.00      0.00        46
         -1       0.06      0.15      0.08       809
          0       0.23      0.52      0.32      4469
          1       0.71      0.39      0.50     14358
          2       0.12      0.05      0.07       950

avg / total       0.55      0.39      0.42     20632



### *Gradient boosting over decision trees*

In [26]:
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

prediction = model.predict(X_test.toarray())

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)

In [32]:
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))

             precision    recall  f1-score   support

         -2       0.02      0.03      0.02       103
         -1       0.01      0.34      0.02        58
          0       0.02      0.60      0.05       416
          1       0.98      0.38      0.55     19974
          2       0.05      0.25      0.09        81

avg / total       0.95      0.39      0.53     20632



### *k-nearest neighbors*

In [41]:
model = KNeighborsClassifier()
model.fit(X_train, y_train)

prediction = model.predict(X_test)

In [42]:
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))

             precision    recall  f1-score   support

         -2       0.02      0.04      0.03        67
         -1       0.16      0.22      0.18      1636
          0       0.29      0.56      0.38      5208
          1       0.71      0.42      0.53     13257
          2       0.14      0.12      0.13       464

avg / total       0.55      0.43      0.45     20632



### *Multilayer perceptrons*

In [43]:
model = MLPClassifier()
model.fit(X_train, y_train)

prediction = model.predict(X_test)

In [44]:
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))

             precision    recall  f1-score   support

         -2       0.03      0.08      0.04        49
         -1       0.21      0.22      0.21      2118
          0       0.37      0.55      0.44      6727
          1       0.60      0.43      0.50     10774
          2       0.23      0.09      0.13       964

avg / total       0.46      0.43      0.43     20632



### *Naive Bayes classifier*

In [51]:
model = MultinomialNB()
model.fit(X_train, y_train)

prediction = model.predict(X_test)

In [52]:
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))

             precision    recall  f1-score   support

         -2       0.00      0.00      0.00         0
         -1       0.00      0.00      0.00         0
          0       0.02      0.63      0.04       357
          1       0.99      0.38      0.55     20275
          2       0.00      0.00      0.00         0

avg / total       0.98      0.39      0.54     20632



### *Random forest*

In [53]:
model = RandomForestClassifier()
model.fit(X_train, y_train)

prediction = model.predict(X_test)

In [54]:
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))

             precision    recall  f1-score   support

         -2       0.00      0.00      0.00         3
         -1       0.03      0.19      0.05       312
          0       0.18      0.55      0.27      3302
          1       0.86      0.40      0.55     16963
          2       0.02      0.13      0.03        52

avg / total       0.74      0.42      0.49     20632



### *Support vector machines*

In [55]:
model = SVC(kernel = 'linear')
model.fit(X_train, y_train)

prediction = model.predict(X_test)

* *linear*

In [56]:
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))

             precision    recall  f1-score   support

         -2       0.00      0.00      0.00         0
         -1       0.08      0.37      0.13       450
          0       0.29      0.61      0.39      4736
          1       0.87      0.44      0.59     15436
          2       0.01      0.40      0.02        10

avg / total       0.72      0.48      0.53     20632



In [57]:
model = SVC(kernel = 'rbf')
model.fit(X_train, y_train)

prediction = model.predict(X_test)

* *rbf*

In [58]:
print(metrics.classification_report(prediction,y_test,target_names=['-2','-1','0','1','2']))

             precision    recall  f1-score   support

         -2       0.00      0.00      0.00         0
         -1       0.00      0.00      0.00         0
          0       0.00      0.00      0.00         0
          1       1.00      0.38      0.55     20632
          2       0.00      0.00      0.00         0

avg / total       1.00      0.38      0.55     20632
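
Since the five labels are ordered, a measure that accounts for how far a prediction lands from the true label can complement the reports above. The task description lists macro-averaged mean absolute error for this subtask (worth checking against the competition page). The sketch below is a plain-Python illustration of that measure, not the official scorer, and the toy labels are invented:

```python
from collections import defaultdict

def macro_mae(y_true, y_pred):
    """Mean absolute error computed per true class, then averaged over classes."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(abs(p - t))
    per_class = [sum(e) / len(e) for e in errors.values()]
    return sum(per_class) / len(per_class)

def plain_mae(y_true, y_pred):
    """Ordinary mean absolute error over all items."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data on the shifted 0..4 scale: the model always predicts the middle class
y_true = [0, 0, 2, 4]
y_pred = [2, 2, 2, 2]
print(macro_mae(y_true, y_pred))
print(plain_mae(y_true, y_pred))
```

Averaging per class keeps a model that always predicts the majority label from looking good merely because that label is frequent.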

