**Verified to run correctly on Python 3.7:**
+ pandas 0.23.0
+ numpy 1.14.5
+ sklearn 0.19.1
+ nltk 3.2.4

# Sentiment analysis of reviews

First, let's load the movie review corpus from NLTK:

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

print(negids[:5])

['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt']


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\stager\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [3]:
movie_reviews

<CategorizedPlaintextCorpusReader in 'C:\\Users\\stager\\AppData\\Roaming\\nltk_data\\corpora\\movie_reviews'>

Build the lists of texts and class labels that will serve as the training set:

In [4]:
negfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in negids]
posfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in posids]

texts = negfeats + posfeats
labels = [0] * len(negfeats) + [1] * len(posfeats)

In [5]:
print(texts[0])

plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what ' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn ' t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it ' s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea

Import the modules we need:

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

### Evaluating the quality of different classifiers

In [7]:
def text_classifier(vectorizer, transformer, classifier):
    return Pipeline(
            [("vectorizer", vectorizer),
            ("transformer", transformer),
            ("classifier", classifier)]
        )
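
To make the first two pipeline steps less of a black box, here is a minimal sketch of what `CountVectorizer` followed by `TfidfTransformer` computes with default settings (`smooth_idf=True`, `norm='l2'`). The three toy documents and the variable names are made up for illustration; the manual result is compared against scikit-learn's own output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents, not taken from the movie_reviews corpus.
docs = ["good film good plot", "bad film", "good acting bad plot"]

counts_sparse = CountVectorizer().fit_transform(docs)  # raw term counts
counts = counts_sparse.toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed idf
weighted = counts * idf                           # tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

reference = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(manual, reference))
```

`TfidfVectorizer`, used later in the notebook, combines both steps in one object.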

In [8]:
for clf in [LogisticRegression, LinearSVC, SGDClassifier]:
    print(clf)
    print(cross_val_score(text_classifier(CountVectorizer(), TfidfTransformer(), clf(max_iter=1000)), texts, labels).mean())
    print("\n")

<class 'sklearn.linear_model._logistic.LogisticRegression'>
0.8205


<class 'sklearn.svm._classes.LinearSVC'>
0.8545


<class 'sklearn.linear_model._stochastic_gradient.SGDClassifier'>
0.849
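
Each number above is the mean accuracy over the cross-validation folds. The sketch below spells out the settings that `cross_val_score` otherwise picks by default, assuming a recent scikit-learn where classifiers are evaluated with 5 stratified folds. The toy corpus and the `toy_` names are hypothetical stand-ins for the reviews; shuffling and `random_state=0` are added only for reproducibility.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 10 positive and 10 negative "reviews".
toy_texts = (["good great film %d" % i for i in range(10)]
             + ["bad awful film %d" % i for i in range(10)])
toy_labels = [1] * 10 + [0] * 10

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])

# Explicit folds: stratified, so every fold keeps the class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=cv, scoring="accuracy")

print(scores)         # one accuracy value per fold
print(scores.mean())  # the aggregate reported in the notebook
```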




### Preparing a classifier trained on all the data

In [9]:
clf_pipeline = Pipeline(
            [("vectorizer", TfidfVectorizer()),
            ("classifier", LinearSVC())]
        )


clf_pipeline.fit(texts, labels)

print(clf_pipeline)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
   

In [10]:
print(clf_pipeline.predict(["Amazing film! I will advice it to all my friends. Genious",
                           "Awful film! The man who advised me to watch it is really crazy idiot."]))

[1 0]


In [11]:
print(clf_pipeline.predict(["The usual action movie. There is nothing else in the movie."]))

[0]


In [12]:
print(clf_pipeline.predict(["John Woo knows how to shoot a beautiful movie with monstrous shootings and explosions."]))

[1]
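
`predict` returns only the class label. As a rough sketch, assuming a toy corpus in place of the movie reviews (the texts and the `toy_` names are made up), `decision_function` shows the signed score behind each label, which gives a sense of how confident the linear model is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data, not the movie_reviews corpus.
toy_texts = (["good great film %d" % i for i in range(10)]
             + ["bad awful film %d" % i for i in range(10)])
toy_labels = [1] * 10 + [0] * 10

toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])
toy_pipeline.fit(toy_texts, toy_labels)

samples = ["great film", "awful film"]
preds = toy_pipeline.predict(samples)
margins = toy_pipeline.decision_function(samples)  # > 0 means class 1

for text, label, margin in zip(samples, preds, margins):
    print(text, label, round(float(margin), 3))
```

A score close to zero means the review sits near the decision boundary, so its label is less reliable.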


## Dimensionality reduction and tree ensembles

In [13]:
from sklearn.decomposition import NMF, TruncatedSVD

In [14]:
%%time
v = CountVectorizer()
mx = v.fit_transform(texts)
mf = TruncatedSVD(10)
u = mf.fit_transform(mx)

Wall time: 4.14 s


In [15]:
for transform in [TruncatedSVD, NMF]:
    print(transform)
    print(cross_val_score(text_classifier(CountVectorizer(), 
                                          transform(n_components=10), 
                                          LinearSVC()), 
                          texts, labels).mean())
    print("\n")

<class 'sklearn.decomposition._truncated_svd.TruncatedSVD'>
0.5105000000000001


<class 'sklearn.decomposition._nmf.NMF'>
0.655
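
One way to see why so few components can hurt quality is to check how much of the variance in the count matrix the projection keeps. This is a minimal sketch on a made-up six-document corpus (the texts and `n_components=2` are illustrative assumptions, not the notebook's settings).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the movie_reviews data.
toy_texts = [
    "good film with a great plot",
    "great acting and a good script",
    "bad film with an awful plot",
    "awful acting and a bad script",
    "boring plot but great acting",
    "good script ruined by bad acting",
]

counts = CountVectorizer().fit_transform(toy_texts)
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts)

# Share of the total variance captured by the kept components.
retained = svd.explained_variance_ratio_.sum()

print(counts.shape, "->", reduced.shape)
print("share of variance retained:", retained)
```

Running the same check on the real count matrix with 10 components would show how much information the classifier no longer sees.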







With n_components=1000 (and TfidfVectorizer in place of CountVectorizer):

In [16]:
%%time
print(cross_val_score(text_classifier(TfidfVectorizer(), 
                                      TruncatedSVD(n_components=1000), 
                                      LinearSVC()),
                      texts, 
                      labels
                     ).mean())

0.851
Wall time: 2min 51s


## Tree ensembles on transformed features

In [17]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
#!/usr/bin/env python -W ignore::DeprecationWarning

In [18]:
%%time
print(cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TruncatedSVD(100)),
            ("classifier", RandomForestClassifier(100))
        ]),
    texts,
    labels
    ))

[0.7225 0.7125 0.7325 0.7575 0.7075]
Wall time: 30.2 s


More components and more trees:

In [19]:
%%time
print(cross_val_score(text_classifier(CountVectorizer(), 
                                      TruncatedSVD(n_components=1000), 
                                      RandomForestClassifier(1000)),
                      texts, 
                      labels
                     ).mean())

0.7270000000000001
Wall time: 5min 55s


Tf*Idf instead of raw word counts:

In [20]:
%%time
print(cross_val_score(text_classifier(TfidfVectorizer(), 
                                      TruncatedSVD(n_components=1000), 
                                      RandomForestClassifier(1000)),
                      texts, 
                      labels
                     ).mean())

0.634
Wall time: 5min 43s


## Combining Tf*Idf and SVD

In [21]:
from sklearn.pipeline import FeatureUnion

estimators = [('tfidf', TfidfTransformer()), ('svd', TruncatedSVD(1))]
combined = FeatureUnion(estimators)

In [22]:
%%time
print(cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", combined),
            ("classifier", LinearSVC())
        ]),
    texts,
    labels
    ))

[0.6375 0.745  0.7625 0.6425 0.7875]
Wall time: 20.2 s
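
`FeatureUnion` applies each transformer to the same input and concatenates the results column-wise, so the classifier above sees every Tf*Idf feature plus one extra SVD column. The sketch below checks the resulting shape on a made-up corpus (the texts are illustrative assumptions; `random_state=0` is added only for reproducibility).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpus, not the movie_reviews data.
toy_texts = [
    "good film with a great plot",
    "great acting and a good script",
    "bad film with an awful plot",
    "awful acting and a bad script",
]

counts = CountVectorizer().fit_transform(toy_texts)

union = FeatureUnion([
    ("tfidf", TfidfTransformer()),                         # keeps all vocabulary columns
    ("svd", TruncatedSVD(n_components=1, random_state=0)), # adds a single dense column
])
features = union.fit_transform(counts)

print(counts.shape, "->", features.shape)
```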
