Classification 2:
* Dataset: http://ai.stanford.edu/~amaas/data/sentiment/
* Target: pos/neg folders
* Metric: AUC-ROC
* Libraries: scikit‑learn + NLTK
* Text preprocessing – 3
    - Removing stop words
    - Stemming / Lemmatization
    - Bag of words / TF-IDF
    - N-grams
* Words importance - 2
* Hyperparameters tuning – 1
* Compare performance of models: SGDClassifier; SVM; Naive Bayes - 2

Некий план:
* придумать как в gs вычислять три метрики сразу
* придумать как в gs вместо кросс-валидации использовать тестовую выборку?
* заценить оригинальную статью
* text preprocessing
* feature engineering
* words importance?
* pipeline with sgd/svm/bayes + gridsearch

Прочитаем README:
* по 25к файлов на train/test, по 12.5к файлов на pos/neg
* pos – рейтинг >= 7, neg – рейтинг <= 4
* в название файлов включен рейтинг
* есть отдельный файл со ссылками на imdb, т.е. известно, какому фильму принадлежат рейтинги
* но толку от этого предположительно нуль, потому что фильмы в train/test не повторяются
* есть авторский bag of words в libsvm формате

Заменим тысячи файлов двумя train.json + test.json, куда включим собственно отзывы, таргет, рейтинг, и идентификатор фильма

In [1]:
import os
import json
from tqdm import tnrange, tqdm_notebook

In [2]:
def sample2json(sample):
    if sample not in ['train', 'test']:
        raise ValueError
    
    data = []
    for target in tqdm_notebook(['pos', 'neg'], desc='target', leave=False):
        with open('data/imdb/%s/urls_%s.txt' % (sample, target)) as file:
            urls = file.readlines()

        for index, filename in tqdm_notebook(enumerate(os.listdir('data/imdb/%s/%s' % (sample, target))), desc='file', leave=False):
            file_id, rating = filename.split('_')
            rating = rating[:rating.index('.')]
            
            with open('data/imdb/%s/%s/%s' % (sample, target, filename)) as file:
                data.append({
                    'file_id': file_id,
                    'rating': int(rating),
                    'review': file.read(),
                    'film_id': urls[index][28:35],
                    'target': target
                })
    
        with open('data/imdb/%s.json' % sample, 'w') as file:
            json.dump(data, file, indent=4)

In [3]:
for sample in tqdm_notebook(['train', 'test'], desc='sample'):
    if not os.path.exists('data/imdb/%s.json' % sample):
        sample2json(sample)

HBox(children=(IntProgress(value=0, description='sample', max=2), HTML(value='')))




Чудно, займемся делом. Импортируем библиотеки

In [4]:
import numpy as np

In [5]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, precision_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from pipelinehelper import PipelineHelper

In [6]:
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

А также чудесные jsonы

In [7]:
train = json.load(open('data/imdb/train.json'))
test = json.load(open('data/imdb/test.json'))
X_train = [item['review'] for item in train]
y_train = [int(item['target'] == 'pos') for item in train]
X_test = [item['review'] for item in test]
y_test = [int(item['target'] == 'pos') for item in test]

Построим [baseline](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html). Однако для этого используем в некоем смысле универсальный Pipeline, используя на последнем шаге три разных классификатора при помощи [pipelinehelper](https://github.com/bmurauer/pipelinehelper). ~~также см [класс](http://www.davidsbatista.net/blog/2018/02/23/model_optimization/)~~

In [8]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', PipelineHelper([
        ('bayes', MultinomialNB()),
        ('sgd', SGDClassifier(max_iter=1000, tol=1e-3)),
        ('svc', LinearSVC())
    ]))
])
    
param_grid = {
    'clf__selected_model': text_clf.named_steps['clf'].generate({})
}

In [9]:
grid = GridSearchCV(text_clf, param_grid, scoring='roc_auc', cv=3, verbose=10, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed:   16.5s remaining:   57.9s
[Parallel(n_jobs=-1)]: Done   3 out of   9 | elapsed:   17.2s remaining:   34.5s
[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:   17.4s remaining:   21.8s
[Parallel(n_jobs=-1)]: Done   5 out of   9 | elapsed:   17.4s remaining:   13.9s
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:   21.4s remaining:   10.7s
[Parallel(n_jobs=-1)]: Done   7 out of   9 | elapsed:   21.6s remaining:    6.2s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:   24.9s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:   24.9s finished


{'clf__selected_model': ('sgd', {})}
0.9577251329900329


Сделаем кастомный токенайзер на основе work_tokenize + WordNetLemmatizer, LancasterStemmer, SnowballStemmer, PorterStemmer. В отдельный файл пришлось выделить из-за ошибок в GridSearch, мол не могу запиклить

In [10]:
from local_utils import LemmaTokenizer, StemTokenizer

In [11]:
tfidf_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', PipelineHelper([
        ('bayes', MultinomialNB()),
        ('sgd', SGDClassifier(max_iter=1000, tol=1e-3)),
        ('svc', LinearSVC(dual=True))
    ]))
])
    
tfidf_param_grid = {
    'clf__selected_model': text_clf.named_steps['clf'].generate({}),
    'vect__tokenizer': [None,
                        LemmaTokenizer(),
                        StemTokenizer('lancaster')], # еще 'snowball', 'porter'
    'vect__ngram_range': [(1, 1), (1, 2)], # еще (1, 3), но я умру пока это досчитается
    'tfidf__smooth_idf': [True, False],
    'tfidf__sublinear_tf': [False, True]
}

Уберем TfidfTransformer и получим Bag of Words 

In [16]:
bow_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', PipelineHelper([
        ('bayes', MultinomialNB()),
        ('sgd', SGDClassifier(max_iter=1000, tol=1e-3)),
        ('svc', LinearSVC(dual=True))
    ]))
])
    
bow_param_grid = {
    'clf__selected_model': text_clf.named_steps['clf'].generate({}),
    'vect__tokenizer': [None,
                        LemmaTokenizer(),
                        StemTokenizer('lancaster')], # еще 'snowball', 'porter'
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)]
}

Попробуем эти махины на небольшой подвыборке

In [17]:
temp = list(zip(X_train, y_train))
X_train_subsample, y_train_subsample = zip(*np.concatenate([
    np.random.permutation(temp[:12500])[:250], # pos
    np.random.permutation(temp[12500:])[:250]  # neg
]))

In [19]:
bow_grid = GridSearchCV(bow_clf, param_grid, scoring='roc_auc', cv=3, verbose=10, n_jobs=-1, return_train_score=True)
bow_grid.fit(X_train_subsample, y_train_subsample)

print(bow_grid.best_params_)
print(bow_grid.best_score_)

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed:    0.3s remaining:    1.2s
[Parallel(n_jobs=-1)]: Done   3 out of   9 | elapsed:    0.4s remaining:    0.8s
[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:    0.4s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done   5 out of   9 | elapsed:    0.4s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    0.4s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   7 out of   9 | elapsed:    0.5s remaining:    0.1s


{'clf__selected_model': ('bayes', {})}
0.8625200803212851


[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.6s finished


In [12]:
tfidf_grid = GridSearchCV(tfidf_text_clf, param_grid, scoring='roc_auc', cv=3, verbose=10, n_jobs=-1, return_train_score=True)
tfidf_grid.fit(X_train_subsample, y_train_subsample)

print(tfidf_grid.best_params_)
print(tfidf_grid.best_score_)

Fitting 3 folds for each of 180 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   11.5s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   20.3s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   30.5s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   38.5s
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:   49.0s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  3

{'clf__selected_model': ('bayes', {}), 'tfidf__smooth_idf': True, 'tfidf__sublinear_tf': True, 'vect__ngram_range': (1, 1), 'vect__tokenizer': <local_utils.StemTokenizer object at 0x7f67430a3390>}
0.905877223178428


Функция строит ROC, считает AUC, precision и recall

In [None]:
def evaluate(y_true, y_pred_proba):
#     y_pred = [int(item[0] <= item[1]) for item in y_pred_proba]
#     print(classification_report(y_true, y_pred))

#     precision = precision_score(y_true, y_pred)
#     recall = recall_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_pred_proba)
    
    fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
    curve_roc = np.array([fpr, tpr])
    plt.plot(fpr, tpr, label='ROC curve: AUC=%0.3f' % auc, color='darkorange', lw=1)
#     plt.title('precision: %0.3f, recall: %0.3f' % (precision, recall))
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.ylim([0.0, 1.02])
    plt.xlim([0.0, 1.0])
    plt.grid(True)
    plt.legend(loc="lower right")
#     return auc 

In [None]:
proba = grid.best_estimator_.predict_proba(X_test)
proba

In [None]:
evaluate(y_test, proba)