## Implementation with Scikit-Learn

Using the dataset in the repository:

1. Write a *text classification pipeline* to classify movie reviews as positive or negative;
2. Find a good set of parameters using `GridSearchCV`;
3. Evaluate the classifier on the part of the dataset previously held out for testing;
4. Repeat steps 1, 2, and 3 using a different classification algorithm;
5. Write a short text comparing the results obtained for each algorithm.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np

In [None]:
movie_reviews_data_folder = r"./data"
dataset = load_files(movie_reviews_data_folder, shuffle=False)
print("n_samples: %d" % len(dataset.data))

In [None]:
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=None)
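
With `random_state=None` the split changes on every run, so the scores reported further down will vary between executions. The sketch below shows one way to make the split repeatable and class-balanced. The toy documents, the seed value `0`, and the use of `stratify` are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 8 placeholder documents, two balanced classes.
toy_docs = ["doc %d" % i for i in range(8)]
toy_labels = [0, 1] * 4

# A fixed random_state makes the shuffle repeatable; stratify keeps the
# class proportions the same in the train and test parts.
split_a = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)
split_b = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)

toy_train, toy_test, toy_y_train, toy_y_test = split_a
print("test docs:", toy_test)
print("same split on both calls:", split_a[1] == split_b[1])
```

In the notebook itself this would amount to passing a fixed `random_state` (and optionally `stratify=dataset.target`) to the existing `train_test_split` call.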

#### 1. Write a *text classification pipeline* to classify movie reviews as positive or negative:

In [None]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
    ('clf', LinearSVC(C=1000)),
])

In [None]:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)]}
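
The grid keys follow the `<step name>__<parameter>` convention, so `vect__ngram_range` targets the `ngram_range` argument of the `'vect'` step. As a rough sketch of how a grid expands into candidates, the example below adds a hypothetical `clf__C` entry (an assumption, not something the notebook searches over) and lists the combinations with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# 'vect__ngram_range' comes from the notebook; 'clf__C' is an assumed
# extra entry added only to illustrate how combinations multiply.
example_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}

# ParameterGrid enumerates every combination a grid search would try.
candidates = list(ParameterGrid(example_parameters))
for candidate in candidates:
    print(candidate)
print("n_candidates:", len(candidates))
```

Each candidate is fitted once per cross-validation fold, so the cost of the search grows with the product of the list lengths.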

#### 2. Find a good set of parameters using `GridSearchCV`:

In [None]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
grid_search.fit(docs_train, y_train)

In [None]:
n_candidates = len(grid_search.cv_results_['params'])
for i in range(n_candidates):
    print(i, 'params - %s; mean - %0.2f; std - %0.2f'
             % (grid_search.cv_results_['params'][i],
                grid_search.cv_results_['mean_test_score'][i],
                grid_search.cv_results_['std_test_score'][i]))
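
Besides walking through `cv_results_`, a fitted search exposes the winning candidate directly. The sketch below runs on a tiny made-up corpus (hypothetical, only so the example runs without the repository data) and reads `best_params_` and `best_score_`. With the default `refit=True`, calling `predict` on the search object uses the best estimator refitted on the full training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus: 1 = positive review, 0 = negative review.
toy_docs = [
    "a wonderful, moving film", "great acting and a great story",
    "loved every minute of it", "brilliant and funny",
    "a dull, boring mess", "terrible acting and a weak plot",
    "hated every minute of it", "awful and tedious",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LinearSVC()),
])
toy_search = GridSearchCV(
    toy_pipeline, {'vect__ngram_range': [(1, 1), (1, 2)]}, cv=2)
toy_search.fit(toy_docs, toy_labels)

# Best parameter combination and its mean cross-validated accuracy.
print(toy_search.best_params_)
print("best CV score: %0.2f" % toy_search.best_score_)
```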

In [None]:
y_predicted = grid_search.predict(docs_test)
print(metrics.classification_report(y_test, y_predicted, target_names=dataset.target_names))
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)

#### 3. Evaluate the classifier on the part of the dataset previously held out for testing:

In [None]:
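# Baseline: fit the untuned pipeline (default ngram_range=(1, 1))
# and report its accuracy on the held-out test documents.
pipeline.fit(docs_train, y_train)
predict = pipeline.predict(docs_test)
np.mean(predict == y_test)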

#### 4. Repeat steps 1, 2, and 3 using a different classification algorithm:

In [None]:
pipeline2 = Pipeline([
    ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

In [None]:
# Note: this search uses cv=2, while the first one used the default number of folds.
grid_search = GridSearchCV(pipeline2, parameters, n_jobs=-1, cv=2)
grid_search.fit(docs_train, y_train)

In [None]:
n_candidates = len(grid_search.cv_results_['params'])
for i in range(n_candidates):
    print(i, 'params - %s; mean - %0.2f; std - %0.2f'
             % (grid_search.cv_results_['params'][i],
                grid_search.cv_results_['mean_test_score'][i],
                grid_search.cv_results_['std_test_score'][i]))

In [None]:
y_predicted = grid_search.predict(docs_test)
print(metrics.classification_report(y_test, y_predicted, target_names=dataset.target_names))
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)

In [None]:
pipeline2.fit(docs_train, y_train)
predict = pipeline2.predict(docs_test)
np.mean(predict == y_test)
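
For step 5, it may help to put both test accuracies side by side before writing the comparison. The helper `compare_classifiers` below is hypothetical (it is not part of the notebook or of scikit-learn), and the toy corpus is made up so the sketch runs on its own. The numbers it prints say nothing about the real dataset.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(models, docs_train, y_train, docs_test, y_test):
    """Fit each model on the training split and return {name: test accuracy}."""
    scores = {}
    for name, model in models.items():
        model.fit(docs_train, y_train)
        predicted = model.predict(docs_test)
        scores[name] = metrics.accuracy_score(y_test, predicted)
    return scores


# Tiny made-up corpus: 1 = positive review, 0 = negative review.
toy_train = [
    "a wonderful, moving film", "great acting and a great story",
    "brilliant and funny", "a dull, boring mess",
    "terrible acting and a weak plot", "awful and tedious",
]
toy_y_train = [1, 1, 1, 0, 0, 0]
toy_test = ["great and funny film", "boring and awful plot"]
toy_y_test = [1, 0]

toy_models = {
    'LinearSVC': Pipeline([
        ('vect', TfidfVectorizer()),
        ('clf', LinearSVC()),
    ]),
    'SGDClassifier': Pipeline([
        ('vect', TfidfVectorizer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              random_state=42, max_iter=5, tol=None)),
    ]),
}

toy_scores = compare_classifiers(
    toy_models, toy_train, toy_y_train, toy_test, toy_y_test)
for name, score in toy_scores.items():
    print("%-15s accuracy: %0.2f" % (name, score))
```

In the notebook, the same helper could be called with `{'LinearSVC': pipeline, 'SGDClassifier': pipeline2}` and the existing `docs_train`, `y_train`, `docs_test`, `y_test` variables.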