# Homework: Sentiment analysis

Build a model that predicts the sentiment of a tweet for the given dataset.  
On the given split (df_train, df_test), your model must beat the baselines listed below.  

The more baselines you beat, the higher your grade.  
Quality metric: F1 with macro averaging (f1_macro); higher is better.

baseline 0: 0.3319      random  
baseline 1: 0.6941      text norm + word embedding + logistic regression  
baseline 2: 0.6990      tf-idf over words + logistic regression  
baseline 3: 0.7418      tf-idf over symbols + logistic regression  
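
For orientation, a baseline of the third kind (character-level tf-idf followed by logistic regression) might look roughly like the sketch below. The exact settings behind the 0.7418 score are not given here, so the n-gram range, the `max_iter` value, and the toy `texts`/`labels` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

SEED = 1337

# Toy data standing in for df_train.text / y_train (hypothetical examples).
texts = [
    "worst flight ever, lost my bag",
    "delayed again, terrible service",
    "what time does boarding start?",
    "is there wifi on this route?",
    "great crew, thank you!",
    "loved the smooth landing, thanks",
]
labels = [0, 0, 1, 1, 2, 2]  # 0 = negative, 1 = neutral, 2 = positive

# Character n-grams (assumed range 1..3) feed a linear classifier.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000, random_state=SEED),
)
baseline.fit(texts, labels)

pred = baseline.predict(["thank you, great flight"])
print(pred)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the fit/transform steps together, which makes it harder to fit the vectorizer on test data by accident.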

So far we have covered only linear models, which is why the examples use nothing else. Preferably, use linear models for this homework. The main goals of the assignment are therefore feature engineering, hyperparameter tuning, and model selection.

! Your results must be reproducible. If your model is stochastic, you must explicitly set every seed and random_state in the model parameters.  
! Use df_test only to measure the quality of the final trained model.

Bonus, think about:
1. Why did we select F1 with macro averaging as the classification quality measure instead of other metrics? Look in the docs.  
2. Why do word embeddings perform so poorly with linear models?  
3. What other ways are there to get text2vec from word2vec? Look in the docs.  
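
As a starting point for the third question, the sketch below shows two common ways to pool word vectors into one document vector: a plain mean and a weighted mean (for example with idf weights). The `word_vectors` dictionary, its 3-dimensional values, and the weights are made up for illustration; a real setup would take them from a trained embedding model and a fitted tf-idf vectorizer.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a real model would supply these.
word_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "flight": np.array([0.1, 0.8, 0.1]),
    "late": np.array([-0.7, 0.2, 0.3]),
}

def mean_pool(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def weighted_pool(tokens, vectors, weights, dim=3):
    """Weighted average of known tokens; unknown weights default to 1.0."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    w = np.array([weights.get(t, 1.0) for t in known])
    mat = np.stack([vectors[t] for t in known])
    return (w[:, None] * mat).sum(axis=0) / w.sum()

tokens = "great flight but late".split()  # "but" has no vector and is skipped
doc_mean = mean_pool(tokens, word_vectors)
doc_weighted = weighted_pool(
    tokens, word_vectors, {"great": 2.0, "late": 2.0, "flight": 0.5}
)
print(doc_mean, doc_weighted)
```

Both variants discard word order, which is one thing to keep in mind when thinking about the second question.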

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

SEED = 1337

df = pd.read_csv('Tweets.csv')

## Look at the data

In [2]:
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline,retweet_count,text
0,570306133677760513,neutral,Virgin America,0,@VirginAmerica What @dhepburn said.
1,570301130888122368,positive,Virgin America,0,@VirginAmerica plus you've added commercials t...
2,570301083672813571,neutral,Virgin America,0,@VirginAmerica I didn't today... Must mean I n...
3,570301031407624196,negative,Virgin America,0,@VirginAmerica it's really aggressive to blast...
4,570300817074462722,negative,Virgin America,0,@VirginAmerica and it's a really big bad thing...


In [3]:
# main reason we have chosen f1 is that class distribution is imbalanced
df.airline_sentiment.value_counts(normalize=True)

negative    0.626913
neutral     0.211680
positive    0.161407
Name: airline_sentiment, dtype: float64
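
To see why the imbalance matters for the choice of metric, here is a small sketch with synthetic labels that roughly follow the class shares printed above. The 1000-sample size and the "always predict negative" model are assumptions made for illustration. Such a model reaches an accuracy of about 0.63, while its macro-averaged F1 is only about 0.26, because the two minority classes each contribute an F1 of zero.

```python
import numpy as np
from sklearn import metrics

# Synthetic labels with roughly the class shares printed above.
y_true = np.array([0] * 627 + [1] * 212 + [2] * 161)
# A degenerate "model" that always predicts the majority class (negative).
y_pred = np.zeros_like(y_true)

acc = metrics.accuracy_score(y_true, y_pred)
f1_micro = metrics.f1_score(y_true, y_pred, average="micro")
# zero_division=0 scores the never-predicted classes as 0 without a warning.
f1_macro = metrics.f1_score(y_true, y_pred, average="macro", zero_division=0)
print(acc, f1_micro, f1_macro)
```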

In [4]:
# the labels are ordered: negative < neutral < positive
# let's encode them accordingly as integers (map + astype keeps an integer dtype)
df['airline_sentiment'] = df.airline_sentiment.map(
    {'negative': 0, 'neutral': 1, 'positive': 2}
).astype(int)

In [7]:
# encode airline as a categorical variable
airline_le = LabelEncoder()
df['airline'] = airline_le.fit_transform(df.airline)
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline,retweet_count,text
0,570306133677760513,neutral,5,0,@VirginAmerica What @dhepburn said.
1,570301130888122368,positive,5,0,@VirginAmerica plus you've added commercials t...
2,570301083672813571,neutral,5,0,@VirginAmerica I didn't today... Must mean I n...
3,570301031407624196,negative,5,0,@VirginAmerica it's really aggressive to blast...
4,570300817074462722,negative,5,0,@VirginAmerica and it's a really big bad thing...


In [8]:
y = df.airline_sentiment.values
df_train, df_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=SEED, shuffle=True
)

print('train', df_train.shape[0])
print('test', df_test.shape[0])

train 10980
test 3660


In [9]:
# increased the n-gram length in the features and used an SVM

from sklearn.svm import LinearSVC

vec = TfidfVectorizer(analyzer='char', lowercase=True, min_df=5, ngram_range=(1,4), norm='l2')

X_train = vec.fit_transform(df_train.text)
X_test = vec.transform(df_test.text)


params = {
    'tol': np.logspace(-4, 0, base=10, num=5),
    'C': np.logspace(0, 4, base=10, num=5),
    'class_weight': ['balanced', None]
}

model = GridSearchCV(LinearSVC(multi_class='ovr', random_state=SEED),
                     params, verbose=1, scoring='f1_macro', cv=5, error_score=0)
model.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed: 39.5min finished


GridSearchCV(cv=5, error_score=0,
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=1337, tol=0.0001,
     verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'tol': array([  1.00000e-04,   1.00000e-03,   1.00000e-02,   1.00000e-01,
         1.00000e+00]), 'C': array([  1.00000e+00,   1.00000e+01,   1.00000e+02,   1.00000e+03,
         1.00000e+04]), 'class_weight': ['balanced', None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_macro', verbose=1)

In [10]:
print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

train 0.967034112326
test 0.758243352293


In [11]:
model.best_params_

{'C': 1.0, 'class_weight': 'balanced', 'tol': 0.10000000000000001}
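
In the search above, `vec` is fit on all of `df_train` before cross-validation, so the vocabulary and idf weights used in each fold have already seen that fold's validation part. One possible alternative, sketched below on toy data, is to put the vectorizer and the classifier into a single `Pipeline` so that `GridSearchCV` refits both on the training part of every fold. The toy `texts`/`labels`, the reduced grid, and `cv=2` are assumptions chosen to keep the example small; they are not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

SEED = 1337

# Toy data standing in for df_train.text / y_train (hypothetical examples).
texts = [
    "worst flight ever", "lost my bag again",
    "terrible delay today", "rude staff, awful",
    "when does boarding start", "is wifi available",
    "which gate for my flight", "can i change seats",
    "great crew, thanks", "loved the service",
    "smooth landing, thank you", "best airline so far",
]
labels = [0] * 4 + [1] * 4 + [2] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 4))),
    ("svm", LinearSVC(random_state=SEED)),
])

# Step-prefixed names route each parameter to the right pipeline stage.
param_grid = {
    "svm__C": [0.1, 1.0, 10.0],
    "svm__class_weight": ["balanced", None],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

With this layout, vectorizer settings such as `tfidf__ngram_range` or `tfidf__min_df` can also be added to the grid and tuned together with the classifier.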