# Respalizer (c). Data analysis.
## Made by Ilya Zakharkin (github.com/izaharkin).

## A bit of natural language processing (sentiment analysis).

### Data: responses of people about different banks, with two features: sentiment (text, in Russian) and mark (integer, 1 <= mark <= 5).

### Task: predict which mark a person will give, based on the text of their response.

#### Useful materials:
- scikit-learn example: http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py
- TJ texts classification: https://habrahabr.ru/post/327072/
- MIPT NLP course: https://github.com/canorbal/NLP_MIPT/blob/master/

### My solution

Import different utilities and tools:

In [1]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

import seaborn as sns

First, let's have a look at the data: how much of it we have and what format it is in:

In [2]:
data = pd.read_csv('./data/responses_dataset.csv')

In [3]:
len(data)

28916

In [4]:
data.head(5)

Unnamed: 0,mark,description
0,5,Я имею кредитную карту. Пользуюсь ею длительно...
1,5,"Всем привет! Я в этом банке , как только рухну..."
2,1,Добрый вечер.Был вашим вкладчиком на протяжени...
3,1,Очень разочарована банком ВТБ24. За смс уведом...
4,1,"Отвратительный банк, и обслуживание. 24.11.201..."


Let's rename the 'description' column to 'sentiment':

In [5]:
# Rename by column name instead of mutating data.columns.values in place.
data = data.rename(columns={'description': 'sentiment'})

In [6]:
data.describe()

Unnamed: 0,mark
count,28916.0
mean,2.081685
std,1.591598
min,1.0
25%,1.0
50%,1.0
75%,3.0
max,5.0


The dataset has about 29k responses. The count for 'mark' equals the number of rows, so no marks are missing; the sentiments are in Russian and the marks range from 1 to 5.

In [7]:
data.mark.value_counts()

1    17651
5     5749
2     3391
3     1484
4      641
Name: mark, dtype: int64

Such a strong imbalance (one class holds most of the data, the other classes very little) can hurt a classifier, so we keep it in mind and may oversample the rare classes later.
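To put the imbalance in numbers, here is a small sketch that uses the counts printed above. It computes the accuracy of a trivial model that always predicts the most frequent mark, and shows one simple way to oversample (random duplication of minority-class rows). The helper name `oversample_indices`, the toy labels and the fixed seed are illustrative assumptions, not part of the original notebook.

```python
import random
from collections import Counter

# Class counts copied from data.mark.value_counts() above.
counts = {1: 17651, 5: 5749, 2: 3391, 3: 1484, 4: 641}
total = sum(counts.values())

# Accuracy of a trivial model that always predicts the most frequent mark.
majority_mark = max(counts, key=counts.get)
baseline_accuracy = counts[majority_mark] / total
print('always predict {0}: accuracy={1:0.4f}'.format(majority_mark, baseline_accuracy))


def oversample_indices(labels, seed=0):
    """Return row indices in which every class appears as often as the largest one.

    Hypothetical helper: rows of the smaller classes are duplicated at random.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(rows) for rows in by_class.values())
    result = []
    for rows in by_class.values():
        result.extend(rows)  # keep every original row once
        result.extend(rng.choice(rows) for _ in range(target - len(rows)))
    rng.shuffle(result)
    return result


# Toy labels, only to show the effect of the helper.
toy_labels = [1, 1, 1, 1, 1, 1, 5, 5, 2]
balanced = oversample_indices(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in balanced)
print(balanced_counts)
```

The baseline comes out at roughly 0.61, so any model has to beat that number to be useful. If oversampling is used together with cross-validation, it should be applied to the training part of each fold only, otherwise duplicated rows leak into the validation part.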

### Now we can try different approaches: classification with 5 classes, or regression with target 'mark'.
### Let's try classification first:

Before classification or regression we must extract features from the text. I'll use **TfidfVectorizer** (it usually works better than a plain *bag-of-words*):

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
X = data['sentiment']

vectorizer = TfidfVectorizer(sublinear_tf=True)

%time X = vectorizer.fit_transform(X)
print(X.shape)

CPU times: user 5.2 s, sys: 40 ms, total: 5.24 s
Wall time: 5.36 s
(28916, 134520)
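To make it clearer what `TfidfVectorizer(sublinear_tf=True)` computes, here is a sketch that recomputes the weights by hand on a toy corpus and compares them with scikit-learn. It assumes the default settings (`smooth_idf=True`, `norm='l2'`): term frequency becomes `1 + ln(tf)`, the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit length. The English toy documents and the `*_toy` names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only, the real data is in Russian).
docs = ['good bank good service', 'bad bank', 'bad bad service']

vec_toy = TfidfVectorizer(sublinear_tf=True)
X_toy = vec_toy.fit_transform(docs).toarray()

# Vocabulary ordered by column index.
vocab = sorted(vec_toy.vocabulary_, key=vec_toy.vocabulary_.get)

# Raw term counts per document.
term_counts = np.array([[doc.split().count(word) for word in vocab] for doc in docs],
                       dtype=float)

# Sublinear term frequency: 1 + ln(tf) for terms that occur, 0 otherwise.
tf = np.where(term_counts > 0, 1.0 + np.log(np.maximum(term_counts, 1.0)), 0.0)

# Smoothed inverse document frequency.
doc_freq = (term_counts > 0).sum(axis=0)
idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1.0

# Weight and L2-normalise every row.
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(vocab)
print(np.allclose(X_toy, manual))
```

With `sublinear_tf=True` a word repeated many times in one response gains weight only logarithmically, which seems reasonable for long, emotional reviews.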


Now we have data to train our models on. Let's try a range of classifiers:

In [10]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead.
from sklearn.model_selection import cross_val_score

### 1). Linear models:

In [11]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

In [12]:
y = data['mark']

### LogisticRegression

In [39]:
%%time
best_acc = -1
best_C = -1
for cur_C in [0.01, 0.1, 0.5, 1, 5, 10, 100, 200, 500, 1000, 10000, 15000, 20000, 100000]:
    cls = LogisticRegression(C=cur_C)
    cv_scores = cross_val_score(cls, X, y, scoring="accuracy", cv=5)
    cur_avg_acc = np.mean(cv_scores)
    if cur_avg_acc > best_acc:
        best_acc = cur_avg_acc
        best_C = cur_C
    print('C={0}'.format(cur_C), '\t', 
          'accuracy={0:0.4f}'.format(cur_avg_acc), 
          '+-{0:0.4f}'.format(np.std(cv_scores)))

C=0.01 	 accuracy=0.6105 +-0.0002
C=0.1 	 accuracy=0.7511 +-0.0035
C=0.5 	 accuracy=0.7799 +-0.0028
C=1 	 accuracy=0.7844 +-0.0025
C=5 	 accuracy=0.7832 +-0.0009
C=10 	 accuracy=0.7791 +-0.0012
C=100 	 accuracy=0.7643 +-0.0014
C=200 	 accuracy=0.7612 +-0.0009
C=500 	 accuracy=0.7572 +-0.0013
C=1000 	 accuracy=0.7556 +-0.0015
C=10000 	 accuracy=0.7504 +-0.0007
C=15000 	 accuracy=0.7504 +-0.0010
C=20000 	 accuracy=0.7501 +-0.0007
C=100000 	 accuracy=0.7485 +-0.0021
CPU times: user 10min 5s, sys: 547 ms, total: 10min 6s
Wall time: 10min 6s


In [42]:
print('Best params are: C={0} with accuracy={1:0.4f}'.format(best_C, best_acc))

Best params are: C=1 with accuracy=0.7844
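The manual loop above can also be written with `GridSearchCV`. The sketch below is one possible variant, not what was actually run here: it puts the vectorizer and the classifier into a `Pipeline`, so the TF-IDF statistics are refit on the training part of every fold. The toy English texts, the step names `'tfidf'` and `'clf'`, the reduced grid and `max_iter=1000` are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for data['sentiment'] and data['mark'] (illustrative only).
texts = [
    'great bank fast service', 'very good support', 'good card great rates',
    'fast and friendly staff', 'great mobile app', 'good bank good people',
    'terrible service long queue', 'bad support rude staff', 'awful bank lost money',
    'rude manager bad experience', 'terrible app awful fees', 'bad card long wait',
]
marks = [5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {'clf__C': [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipeline, param_grid, scoring='accuracy', cv=3)
search.fit(texts, marks)

print('Best params are: {0} with accuracy={1:0.4f}'.format(search.best_params_,
                                                            search.best_score_))
```

On the real data the raw `data['sentiment']` column would go in instead of `texts`. The per-setting scores are available in `search.cv_results_`.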


### SGDClassifier (stochastic gradient descent)

In [17]:
import itertools

In [23]:
%%time
best_acc = -1
best_alpha = -1
best_t = -1
for cur_alpha, cur_t in itertools.product([0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 2, 4, 5, 10],
                                          [0.1, 0.5, 1, 1.5, 2, 3, 5, 10, 100, 500, 1000]):
    cls = SGDClassifier(alpha=cur_alpha, power_t=cur_t)
    cv_scores = cross_val_score(cls, X, y, scoring="accuracy", cv=5)
    cur_avg_acc = np.mean(cv_scores)
    if cur_avg_acc > best_acc:
        best_acc = cur_avg_acc
        best_alpha = cur_alpha
        best_t = cur_t
    print('alpha={0}, power_t={1}'.format(cur_alpha, cur_t), '\t', 
          'accuracy={0:0.4f}'.format(cur_avg_acc), 
          '+-{0:0.4f}'.format(np.std(cv_scores)))

alpha=0.0001, power_t=0.1 	 accuracy=0.7874 +-0.0024
alpha=0.0001, power_t=0.5 	 accuracy=0.7871 +-0.0028
alpha=0.0001, power_t=1 	 accuracy=0.7874 +-0.0023
alpha=0.0001, power_t=1.5 	 accuracy=0.7875 +-0.0025
alpha=0.0001, power_t=2 	 accuracy=0.7872 +-0.0031
alpha=0.0001, power_t=3 	 accuracy=0.7871 +-0.0032
alpha=0.0001, power_t=5 	 accuracy=0.7873 +-0.0027
alpha=0.0001, power_t=10 	 accuracy=0.7871 +-0.0030
alpha=0.0001, power_t=100 	 accuracy=0.7873 +-0.0025
alpha=0.0001, power_t=500 	 accuracy=0.7875 +-0.0021
alpha=0.0001, power_t=1000 	 accuracy=0.7870 +-0.0024
alpha=0.0005, power_t=0.1 	 accuracy=0.7706 +-0.0033
alpha=0.0005, power_t=0.5 	 accuracy=0.7703 +-0.0031
alpha=0.0005, power_t=1 	 accuracy=0.7707 +-0.0034
alpha=0.0005, power_t=1.5 	 accuracy=0.7708 +-0.0030
alpha=0.0005, power_t=2 	 accuracy=0.7704 +-0.0028
alpha=0.0005, power_t=3 	 accuracy=0.7711 +-0.0036
alpha=0.0005, power_t=5 	 accuracy=0.7709 +-0.0032
alpha=0.0005, power_t=10 	 accuracy=0.7703 +-0.0031
alpha=0.00

In [31]:
print('Best params are: alpha={0}, power_t={1} with accuracy={2:0.4f}'.format(best_alpha, best_t, best_acc))

Best params are: alpha=0.0001, power_t=1.5 with accuracy=0.7875
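The accuracy hardly changes along the `power_t` axis. A likely reason: according to the scikit-learn documentation `power_t` is the exponent of the `'invscaling'` learning-rate schedule, while `SGDClassifier` uses `learning_rate='optimal'` by default. The small differences between rows would then come from random shuffling, since no `random_state` is fixed. The sketch below checks this on synthetic data; the dataset, `eta0=0.01` and the helper `fit_coef` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small synthetic problem, only for this check.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)


def fit_coef(power_t, learning_rate):
    """Fit an SGDClassifier with a fixed seed and return its coefficients."""
    clf = SGDClassifier(alpha=0.0001, power_t=power_t, learning_rate=learning_rate,
                        eta0=0.01, random_state=0)
    return clf.fit(X_toy, y_toy).coef_


# Default schedule: power_t should not influence the fitted model.
same_optimal = np.allclose(fit_coef(0.5, 'optimal'), fit_coef(2.0, 'optimal'))

# Inverse-scaling schedule: power_t changes the step sizes, so the model differs.
same_invscaling = np.allclose(fit_coef(0.5, 'invscaling'), fit_coef(2.0, 'invscaling'))

print(same_optimal, same_invscaling)
```

If that holds for the real data too, the grid over `power_t` can be dropped (or `learning_rate='invscaling'` set explicitly), which makes the search about eleven times cheaper.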


### PassiveAggressiveClassifier

In [32]:
%%time
best_acc = -1
best_C = -1
for cur_C in [0.01, 0.1, 0.5, 1, 5, 10, 100, 200, 500, 1000, 10000, 15000, 20000, 100000]:
    cls = PassiveAggressiveClassifier(C=cur_C)
    cv_scores = cross_val_score(cls, X, y, scoring="accuracy", cv=5)
    cur_avg_acc = np.mean(cv_scores)
    if cur_avg_acc > best_acc:
        best_acc = cur_avg_acc
        best_C = cur_C
    print('C={0}'.format(cur_C), '\t', 
          'accuracy={0:0.4f}'.format(cur_avg_acc), 
          '+-{0:0.4f}'.format(np.std(cv_scores)))

C=0.01 	 accuracy=0.7700 +-0.0031
C=0.1 	 accuracy=0.7895 +-0.0022
C=0.5 	 accuracy=0.7674 +-0.0045
C=1 	 accuracy=0.7556 +-0.0023
C=5 	 accuracy=0.7489 +-0.0046
C=10 	 accuracy=0.7454 +-0.0048
C=100 	 accuracy=0.7472 +-0.0056
C=200 	 accuracy=0.7441 +-0.0057
C=500 	 accuracy=0.7500 +-0.0024
C=1000 	 accuracy=0.7472 +-0.0033
C=10000 	 accuracy=0.7444 +-0.0053
C=15000 	 accuracy=0.7421 +-0.0043
C=20000 	 accuracy=0.7454 +-0.0069
C=100000 	 accuracy=0.7451 +-0.0066
CPU times: user 31 s, sys: 0 ns, total: 31 s
Wall time: 31 s


In [35]:
print('Best params are: C={0} with accuracy={1:0.4f}'.format(best_C, best_acc))

Best params are: C=0.1 with accuracy=0.7895


### RidgeClassifier

In [37]:
%%time
best_acc = -1
best_alpha = -1
for cur_alpha in [0.01,0.05, 0.1, 0.5, 1, 1.5, 3, 5, 10, 100, 250, 500, 1000]:
    cls = RidgeClassifier(alpha=cur_alpha)
    cv_scores = cross_val_score(cls, X, y, scoring="accuracy", cv=5)
    cur_avg_acc = np.mean(cv_scores)
    if cur_avg_acc > best_acc:
        best_acc = cur_avg_acc
        best_alpha = cur_alpha
    print('alpha={0}'.format(cur_alpha), '\t', 
          'accuracy={0:0.4f}'.format(cur_avg_acc), 
          '+-{0:0.4f}'.format(np.std(cv_scores)))

alpha=0.01 	 accuracy=0.7129 +-0.0064
alpha=0.05 	 accuracy=0.7332 +-0.0061
alpha=0.1 	 accuracy=0.7473 +-0.0040
alpha=0.5 	 accuracy=0.7783 +-0.0016
alpha=1 	 accuracy=0.7840 +-0.0017
alpha=1.5 	 accuracy=0.7864 +-0.0029
alpha=3 	 accuracy=0.7859 +-0.0026
alpha=5 	 accuracy=0.7848 +-0.0025
alpha=10 	 accuracy=0.7808 +-0.0027
alpha=100 	 accuracy=0.7147 +-0.0034
alpha=250 	 accuracy=0.6306 +-0.0011
alpha=500 	 accuracy=0.6104 +-0.0001
alpha=1000 	 accuracy=0.6104 +-0.0001
CPU times: user 6min 22s, sys: 983 ms, total: 6min 23s
Wall time: 6min 23s


In [40]:
print('Best params are: alpha={0} with accuracy={1:0.4f}'.format(best_alpha, best_acc))

Best params are: alpha=1.5 with accuracy=0.7864


### The winner among the linear models: PassiveAggressiveClassifier with accuracy=0.7895
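How firm is this ranking? The sketch below takes the best row of each table above (mean accuracy and standard deviation over the 5 folds) and compares the gap to the best model with the fold spread. The rule "gap not larger than the bigger of the two standard deviations" is a rough heuristic chosen for this example, not a significance test.

```python
# Best cross-validated accuracy and fold standard deviation per model,
# copied from the best row of each table above.
results = {
    'PassiveAggressiveClassifier': (0.7895, 0.0022),
    'SGDClassifier': (0.7875, 0.0025),
    'RidgeClassifier': (0.7864, 0.0029),
    'LogisticRegression': (0.7844, 0.0025),
}

best_name = max(results, key=lambda name: results[name][0])
best_acc, best_std = results[best_name]

within_noise = {}
for name, (acc, std) in results.items():
    gap = best_acc - acc
    # Heuristic (assumption): treat a model as tied with the best one when
    # the gap does not exceed the larger of the two fold standard deviations.
    within_noise[name] = gap <= max(std, best_std)
    print('{0:30s} accuracy={1:0.4f} gap={2:0.4f} tied={3}'.format(
        name, acc, gap, within_noise[name]))
```

By this rule SGDClassifier is tied with the winner, and all four linear models are within about half a percentage point of each other, so the choice between them can also be made on training time.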

### 2). Bayesian Classifiers:

In [58]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

In [None]:
%%time
best_acc = -1
best_alpha = -1
for cur_alpha in np.arange(0.01, 500, 0.05):
    cls = BernoulliNB(alpha=cur_alpha)
    cv_scores = cross_val_score(cls, X, y, scoring="accuracy", cv=5)
    cur_avg_acc = np.mean(cv_scores)
    if cur_avg_acc > best_acc:
        best_acc = cur_avg_acc
        best_alpha = cur_alpha

In [None]:
print('Best params are: alpha={0} with accuracy={1:0.4f}'.format(best_alpha, best_acc))
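A note on the size of this search: the grid `np.arange(0.01, 500, 0.05)` is very dense, and every value costs a 5-fold cross-validation. The sketch below counts the model fits and builds a much smaller log-spaced grid over the same range as one possible alternative. The choice of 15 points is an arbitrary assumption.

```python
import numpy as np

cv_folds = 5

# Grid used in the cell above.
linear_grid = np.arange(0.01, 500, 0.05)

# Hypothetical alternative: 15 values spread evenly on a log scale from 0.01 to 500.
log_grid = np.logspace(-2, np.log10(500), num=15)

print('linear grid: {0} values, {1} model fits'.format(len(linear_grid),
                                                       len(linear_grid) * cv_folds))
print('log grid:    {0} values, {1} model fits'.format(len(log_grid),
                                                       len(log_grid) * cv_folds))
print(np.round(log_grid, 3))
```

A coarse log-spaced pass followed by a finer grid around the best value usually finds the same region with far fewer fits. Also note that `BernoulliNB` binarizes its input (`binarize=0.0` by default), so it only sees whether a word is present, not its TF-IDF weight.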

### 3). Metric Classifiers:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid

In [None]:
%%time
best_acc = -1
best_k = -1
for cur_k in range(1, 50):  # n_neighbors must be at least 1
    cls = KNeighborsClassifier(n_neighbors=cur_k)
    cv_scores = cross_val_score(cls, X, y, scoring="accuracy", cv=5)
    cur_avg_acc = np.mean(cv_scores)
    if cur_avg_acc > best_acc:
        best_acc = cur_avg_acc
        best_k = cur_k

### 4). Ensembles:

In [53]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomTreesEmbedding

In [47]:
%%time
clf = AdaBoostClassifier()
# Score clf itself; `cls` still holds the last model from the linear-model loops.
cv_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)
print(cv_scores.mean())


In [48]:
%%time
clf = BaggingClassifier()
cv_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)
print(cv_scores.mean())


In [49]:
%%time
clf = ExtraTreesClassifier()
cv_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)
print(cv_scores.mean())


In [50]:
%%time
clf = RandomForestClassifier()
cv_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)
print(cv_scores.mean())


In [51]:
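%%time
from sklearn.pipeline import make_pipeline
# RandomTreesEmbedding is an unsupervised transformer without predict(), so it
# cannot be scored on its own; as an assumption it feeds a LogisticRegression here.
clf = make_pipeline(RandomTreesEmbedding(), LogisticRegression())
cv_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)
print(cv_scores.mean())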


Let's turn to the main candidates: **gradient boosting classifiers**

### Gradient Boosting Classifiers

In [13]:
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb

ImportError: No module named 'lightgbm'

### TODO: 
- tune the parameters of the ensembles
- reduce the dimensionality of the feature space, e.g. with L1-based feature selection (http://scikit-learn.org/stable/modules/feature_selection.html)
- run a grid search for the gradient boosting classifiers
- try LDA as a feature transformation
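As a starting point for the dimensionality item, here is a sketch of two standard scikit-learn options on synthetic data: L1-based feature selection with `SelectFromModel`, and a projection with `TruncatedSVD` (which also accepts sparse TF-IDF matrices). The synthetic dataset, `C=0.1` and `n_components=10` are assumptions made for the example and would need tuning on the real features.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF matrix: 50 features, only 5 informative.
X_toy, y_toy = make_classification(n_samples=200, n_features=50, n_informative=5,
                                   random_state=0)

# The L1 penalty drives many coefficients to exactly zero;
# SelectFromModel keeps the features with non-zero weights.
l1_model = LinearSVC(C=0.1, penalty='l1', dual=False, max_iter=5000)
selector = SelectFromModel(l1_model)
X_selected = selector.fit_transform(X_toy, y_toy)

# Alternative: project onto a few latent components.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_toy)

print(X_toy.shape, X_selected.shape, X_reduced.shape)
```

A smaller `C` gives a stronger penalty and fewer selected features. Either step can be placed in a `Pipeline` before the classifier so that it is refit inside every cross-validation fold.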