## Imports

In [1]:
import itertools
import json
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import numpy as np
import pandas as pd
import re
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from spellchecker import SpellChecker
from joblib import dump, load

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Khaled\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Khaled\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
import time
import datetime

In [3]:
# Beep when the notebook stops running code (winsound is available on Windows only)
import winsound
def beep(reps=1, duration=500, freq=440, sleep=1):
    for _ in range(reps - 1):
        winsound.Beep(freq, duration)
        time.sleep(sleep)
    winsound.Beep(freq, duration)

## Data

Load the cleaned reviews:

In [4]:
%%time
data = "dataset/AmazonCellReviewsPreprocessed.csv"
df = pd.read_csv(data)
df = df[df.reviewText.notna()]
df.head()

Wall time: 6.75 s


In [5]:
df.shape

(1127630, 4)

Check the proportion of positive (rating $>3$) and critical (rating $\leq 3$) reviews:

In [6]:
df["positive"] = df.overall > 3

In [7]:
df.groupby("positive").size()

positive
False    236880
True     890750
dtype: int64

In [8]:
df.groupby("positive").size()/(df.shape[0])

positive
False    0.210069
True     0.789931
dtype: float64

The class to predict is highly unbalanced: about 79% of the reviews are positive. We can sample the same number of reviews from each class in order to obtain a balanced dataset:

In [9]:
sample_size = 200000 # needs to be less than the number of observations in the minority class
sample_df = df.groupby('positive').apply(lambda x: x.sample(sample_size))

In [10]:
sample_df = sample_df.reset_index(level=0, drop=True) # remove outer level of multiindex

In [11]:
sample_df.groupby("positive").size()

positive
False    200000
True     200000
dtype: int64
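
The sample above is drawn without a seed, so each run selects different reviews. A minimal sketch, on a toy frame, of a reproducible variant using `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The toy ratings and the seed value are assumptions made for illustration only:

```python
import pandas as pd

# Toy ratings: 6 positive (>3) and 3 critical (<=3) reviews
toy = pd.DataFrame({"overall": [5, 4, 5, 5, 4, 5, 1, 2, 3]})
toy["positive"] = toy.overall > 3

# Use the size of the minority class so that both classes can be sampled
n_per_class = int(toy.positive.value_counts().min())

# A fixed random_state makes the balanced sample identical across runs
balanced = toy.groupby("positive").sample(n=n_per_class, random_state=42)
print(balanced.positive.value_counts())
```

With a fixed seed, later steps (tokenization, vectorization, training) always see the same reviews, which makes timings and scores comparable between runs.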

#### Choice: Unbalanced or Balanced Classes

Definition of unbalanced `X` and `y` (the class to predict). With this choice, the classifier should be better at predicting sentiment on reviews from the Amazon dataset, which share the same class proportions.

In [12]:
# X = df.reviewTextPreprocessed.values
# y = df.positive.values

Definition of balanced `X` and `y` (the class to predict). With this choice, the classifier should be better at predicting sentiment on tweets, which may not be unbalanced in the same way as this dataset.

In [13]:
X = sample_df.reviewTextPreprocessed.values
y = sample_df.positive.values

## Order of operations from now on:

First of all, we define a list of stopwords.

The next step is the preprocessing needed to obtain a suitable representation of the reviews, which consists of:

- Tokenization
- Spelling correction
- Stop words removal
- Stemming

After these operations, the reviews are going to be passed to a vectorizer in order to obtain the final representation for the classifiers.

Stemming can be done with two different libraries: NLTK and PyStemmer. PyStemmer is faster, but requires the Visual C++ Build Tools to be installed. Run whichever of the two variants you prefer.
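
To make the order of the steps concrete, the sketch below walks one toy review through the four stages. `TOY_FIXES`, `TOY_STOPWORDS`, and `toy_stem` are hypothetical stand-ins for the spell checker, the stop-word list, and the stemmers used later in this notebook; they are not the real implementations.

```python
import re

TOY_FIXES = {"batery": "battery", "teh": "the"}  # stand-in for the spell checker
TOY_STOPWORDS = {"the", "is", "a"}               # stand-in for the stop-word list


def toy_stem(word):
    # Crude suffix stripping, for illustration only (not Porter or Lancaster)
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


review = "teh batery is not lasting"

tokens = re.findall(r"[a-z]+", review)                  # 1. tokenization
corrected = [TOY_FIXES.get(t, t) for t in tokens]       # 2. spelling correction
kept = [t for t in corrected if t not in TOY_STOPWORDS] # 3. stop-word removal
stemmed = [toy_stem(t) for t in kept]                   # 4. stemming

print(stemmed)  # ['battery', 'not', 'last']
```

Correcting the spelling before removing stop words means that a misspelled stop word such as "teh" is still caught by the stop-word filter.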

#### Rough execution times:

(Execution times may differ from the ones listed here, since the notebook was run again afterwards.)

#### NLTK

- Tokenization, 3min 15s
- Spell check and correction, 14min 11s (`preprocessor` parameter)
- Stop words removal, 1min 6s (`stop_words` parameter)
- Stemming, 7min 10s (Porter), 5min 56s (Lancaster)
- Vectorization, 58.8 s (Porter), 56.6 s (Lancaster)

#### PyStemmer

A single class that performs:

- Tokenization
- Spell check and correction (parameter `preprocessor`)
- Stop words removal (parameter `stop_words`)
- Stemming (with `pystemmer`)
- Vectorization

8min 49s

### Definition of the list of stop words

In [14]:
from nltk.corpus import stopwords
stopws = stopwords.words("english")

The list of stop words needs to be preprocessed in the same way as the reviews. We define the dictionaries needed for the preprocessing, as in the previous notebook:

In [15]:
emoticon_repl = {
    # positive emoticons
    r":-?d+": " good ",  r":[- ]?\)+": " good ", r";-?\)+": " good ",
    r"\(+-?:": " good ", r"=\)+" : " good ", r"<3" : " good ",
    # negative emoticons
    r"[\s\r\t\n]+:/+": " bad ", r":\\+": " bad ", r"[\s\r\t\n]+\)-?:": " bad ",
    r":-?\(+": " bad ", r"[\s\t\r\n]+d+-?:": " bad "
}

contracted_repl = {
    # special cases
    r"won\'t" : "will not", r"won\'" : "will not", r"can\'t": "can not", r"shan\'t": "shall not",
    r"shan\'": "shall not", r"ain\'t": "is not", r"ain\'": "is not",
    # general cases
    r"n\'t": " not", r"\'t": " not", r"n\'": " not", r"\'s": " is", r"\'ve": " have", 
    r"\'re": " are", 
    r"\'ll": " will", r"\'d": " would",
}

with open('dataset/slang_subset_manual.json', 'r') as fid:
    slang_repl = json.load(fid)

Same preprocessing function as in the previous notebook:

In [16]:
def preprocess(sent, translate_slang = True):
    
    sent = sent.lower()
    sent = re.sub(r'^<div id="video.*>&nbsp;', '', sent) # Video-review part
    sent = re.sub('https?://[A-Za-z0-9./]+', '', sent) # URLs
    
    for k in emoticon_repl:
        sent = re.sub(k, emoticon_repl[k], sent)

    if translate_slang:
        for k in slang_repl:
            sent = re.sub(r"\b"+re.escape(k)+r"\b", slang_repl[k], sent)
        
    for k in contracted_repl:
        sent = re.sub(k, contracted_repl[k], sent)
    
    sent = re.sub('[/]+', ' ', sent) # word1/word2 to word1 word2
    sent = re.sub('[^A-Za-z0-9-_ ]+', '', sent)
    sent = re.sub(r'\b\d+\b', '', sent) # standalone numbers (raw string, so that \b is a word boundary)
    
    return sent
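
Because the replacement dictionaries are applied in insertion order, the special cases have to come before the general ones. A minimal sketch with a two-entry subset of `contracted_repl` (the helper name `expand` and the two toy dictionaries are hypothetical):

```python
import re

# Same two rules, in two different orders
special_first = {r"won\'t": "will not", r"n\'t": " not"}
general_first = {r"n\'t": " not", r"won\'t": "will not"}


def expand(sent, repl):
    # Apply every pattern in dictionary (insertion) order
    for pattern, replacement in repl.items():
        sent = re.sub(pattern, replacement, sent)
    return sent


print(expand("it won't charge", special_first))  # it will not charge
print(expand("it won't charge", general_first))  # it wo not charge
```

With the general rule first, "won't" is split into "wo not" before the special case has a chance to match.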

In [17]:
prep_stopws = [preprocess(el) for el in stopws]

Stop words containing "not" carry sentiment and are important for our task. They are the last 36 entries of the list:

In [18]:
np.array(prep_stopws[-36:])

array(['ain', 'aren', 'are not', 'couldn', 'could not', 'didn', 'did not',
       'doesn', 'does not', 'hadn', 'had not', 'hasn', 'has not', 'haven',
       'have not', 'isn', 'is not', 'ma', 'mightn', 'might not', 'mustn',
       'must not', 'needn', 'need not', 'shan', 'shall not', 'shouldn',
       'should not', 'wasn', 'was not', 'weren', 'were not', 'won',
       'will not', 'wouldn', 'would not'], dtype='<U10')

In [19]:
prep_stopws = prep_stopws[:-36]

Other words to remove from the stop words:

In [20]:
for word in ["not", "very", "don", "do not"]:
    prep_stopws.remove(word)

In [21]:
prep_stopws.extend(["youse", "would"]) # needed for consistency with spell checker
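
A short illustration of why "not" and "very" are kept out of the stop-word list. The two small sets below are toy examples, not the actual NLTK list:

```python
default_stop = {"the", "is", "not", "very"}   # toy list that still contains "not" and "very"
custom_stop = default_stop - {"not", "very"}  # toy list after removing them


def remove_stopwords(tokens, stop):
    return [t for t in tokens if t not in stop]


pos = "the screen is very good".split()
neg = "the screen is not good".split()

# With the default list, both reviews collapse to the same tokens
print(remove_stopwords(pos, default_stop), remove_stopwords(neg, default_stop))
# With the custom list, the negation survives
print(remove_stopwords(pos, custom_stop), remove_stopwords(neg, custom_stop))
```

If "not" were removed, a positive and a critical review could end up with identical representations, and no classifier could tell them apart.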

## Tokenization / Spelling Correction / Stop Words Removal / Stemming

### NLTK

In [22]:
from nltk.tokenize import word_tokenize
def tokenize_reviews(reviews):
    tokenized_reviews = [word_tokenize(review) for review in reviews]
    return tokenized_reviews

In [23]:
%%time
X_tokenized = tokenize_reviews(X)

Wall time: 1min 24s


Spelling correction:

In [24]:
def fix_spelling_mistakes(reviews, dist=1):
    spell = SpellChecker(distance=dist)
    reviews_with_right_spell = []
    for review in reviews:
        corrected_review = [spell.correction(word) for word in review]
        reviews_with_right_spell.append(corrected_review)
    return reviews_with_right_spell
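
Reviews repeat the same words many times, so one possible speed-up is to memoize the corrections and look up each distinct token only once. The sketch below uses a hypothetical `toy_correction` in place of `spell.correction`, and counts how often the underlying corrector is actually called:

```python
from functools import lru_cache

TOY_FIXES = {"grat": "great", "fone": "phone"}  # stand-in for the spell checker
call_count = {"n": 0}


def toy_correction(word):
    # Pretend this is the expensive call
    call_count["n"] += 1
    return TOY_FIXES.get(word, word)


# Cache results so that each distinct word is corrected only once
cached_correction = lru_cache(maxsize=None)(toy_correction)

reviews = [["grat", "fone"], ["grat", "case"], ["fone", "case"]]
corrected = [[cached_correction(w) for w in review] for review in reviews]

print(corrected)        # [['great', 'phone'], ['great', 'case'], ['phone', 'case']]
print(call_count["n"])  # 3 calls for 6 tokens
```

Whether this helps in practice depends on how much time the real corrector spends on repeated words, so it is worth timing on a small sample first.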

In [25]:
%%time
X_spellchecked = fix_spelling_mistakes(X_tokenized)

Wall time: 8min 8s


Stop words removal:

In [26]:
%%time
X_noStopWords = []
for review in X_spellchecked:
        cleaned_review = [word for word in review if word not in prep_stopws]
        X_noStopWords.append(cleaned_review)

Wall time: 22.8 s


In [27]:
from nltk.stem import PorterStemmer, LancasterStemmer

In [28]:
def stem_reviews(reviews, stemmer_name="Porter"):
    if stemmer_name == "Porter":
        stemmer = PorterStemmer()
    elif stemmer_name == "Lancaster":
        stemmer = LancasterStemmer()
    else:
        raise ValueError("Unknown stemmer: {}".format(stemmer_name))
    stemmed_reviews = []
    for review in reviews:
        stemmed_reviews.append([stemmer.stem(word) for word in review])
    return stemmed_reviews

In [None]:
%%time
X_Porter = stem_reviews(X_noStopWords, stemmer_name = "Porter")

In [None]:
%%time
X_Lancaster = stem_reviews(X_noStopWords, stemmer_name = "Lancaster")

### PyStemmer (needs Visual C++ installed)

Definition of the class `StemmedTfidfVectorizer`:

- `sklearn`'s `TfidfVectorizer` takes care of tokenization, stop-word removal, vectorization
- `pystemmer` takes care of stemming.

In [None]:
import Stemmer
english_stemmer = Stemmer.Stemmer('en')

In [None]:
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()  # analyzer of the parent TfidfVectorizer
        return lambda doc: english_stemmer.stemWords(analyzer(doc))

## Train-test split

In [None]:
from sklearn.model_selection import train_test_split

#### NLTK

Here we split the reviews that are already tokenized and stemmed into training and test sets, to be passed to `TfidfVectorizer` for representation.

In [None]:
X_train_Porter, X_test_Porter, y_train, y_test = train_test_split(X_Porter, y,
                                                                  test_size=0.33, random_state=42)
X_train_Lancaster, X_test_Lancaster, y_train, y_test = train_test_split(X_Lancaster, y,
                                                                        test_size=0.33, random_state=42)

#### Pystemmer

Here we split the preprocessed reviews into training and test sets, to be passed to `StemmedTfidfVectorizer` for tokenization, stemming, and representation.

In [None]:
X_train_pystemmer, X_test_pystemmer, y_train, y_test = train_test_split(X, y,
                                                                  test_size=0.33, random_state=42)
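
Each call above overwrites `y_train` and `y_test`. This is harmless as long as every call uses the same number of samples, the same `test_size`, and the same `random_state`, because the shuffled indices are then identical. A small check on toy data (all names below are illustrative):

```python
from sklearn.model_selection import train_test_split

# Two different feature sets describing the same 10 samples
X_a = ["porter_{}".format(i) for i in range(10)]
X_b = ["lancaster_{}".format(i) for i in range(10)]
y_toy = [i % 2 for i in range(10)]

X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_a, y_toy, test_size=0.33, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_b, y_toy, test_size=0.33, random_state=42)

# Same seed and same sizes: the labels line up across both splits
print(y_train_a == y_train_b, y_test_a == y_test_b)
```

If the seed or the input length differed between calls, the labels would no longer match the rows of the earlier splits.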

## Text Representation

#### NLTK

In [None]:
def rebuild_reviews(reviews):
    rebuilt_reviews = []
    for review in reviews:
        rebuilt_reviews.append(" ".join(review))
    return rebuilt_reviews

In [None]:
X_train_Porter = rebuild_reviews(X_train_Porter)
X_test_Porter = rebuild_reviews(X_test_Porter)
X_train_Lancaster = rebuild_reviews(X_train_Lancaster)
X_test_Lancaster = rebuild_reviews(X_test_Lancaster)

In [None]:
%%time
tfidf_vect_Porter = TfidfVectorizer(min_df= 5, max_features = 50000, ngram_range=(1,2))
X_train_tfidf_Porter = tfidf_vect_Porter.fit_transform(X_train_Porter)

In [None]:
%%time
X_test_tfidf_Porter = tfidf_vect_Porter.transform(X_test_Porter)

In [None]:
%%time
tfidf_vect_Lancaster = TfidfVectorizer(min_df= 5, max_features = 50000, ngram_range=(1,2))
X_train_tfidf_Lancaster = tfidf_vect_Lancaster.fit_transform(X_train_Lancaster)

In [None]:
%%time
X_test_tfidf_Lancaster = tfidf_vect_Lancaster.transform(X_test_Lancaster)

In [None]:
dump(tfidf_vect_Porter, 'joblib_data/tfidf_vect_Porter.joblib')
dump(tfidf_vect_Lancaster, 'joblib_data/tfidf_vect_Lancaster.joblib')

#### Pystemmer

In [None]:
%%time
spell = SpellChecker(distance=1)
tfidf_vect_pystemmer = StemmedTfidfVectorizer(min_df= 5, max_features = 50000, ngram_range=(1,2),
                                              preprocessor = spell.correction,
                                              stop_words = prep_stopws)
X_train_tfidf_pystemmer = tfidf_vect_pystemmer.fit_transform(X_train_pystemmer)
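
In scikit-learn, the `preprocessor` callable receives each whole document as a single string (it replaces the lowercasing and accent-stripping stage), so a word-level function passed there sees the entire review rather than individual tokens. The sketch below first records what the hook receives, then shows one possible way to correct per token by overriding `build_tokenizer`. `TOY_FIXES`, `spy_preprocessor`, and `CorrectingTfidfVectorizer` are hypothetical names, and the dictionary lookup stands in for `spell.correction`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Grat phone", "Bad batery"]

# 1. Record what the preprocessor hook is called with
seen = []


def spy_preprocessor(doc):
    seen.append(doc)
    return doc.lower()


TfidfVectorizer(preprocessor=spy_preprocessor).fit(docs)
print(seen)  # whole documents, not single words

# 2. Per-token correction, applied right after tokenization
TOY_FIXES = {"grat": "great", "batery": "battery"}  # stand-in for a spell checker


class CorrectingTfidfVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: [TOY_FIXES.get(tok, tok) for tok in tokenize(doc)]


vect_fixed = CorrectingTfidfVectorizer().fit(docs)
print(sorted(vect_fixed.vocabulary_))
```

Since stop-word filtering and n-gram generation run after tokenization, the corrected tokens flow through the rest of the analyzer unchanged.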

In [None]:
%%time
X_test_tfidf_pystemmer = tfidf_vect_pystemmer.transform(X_test_pystemmer)

In [None]:
dump(tfidf_vect_pystemmer, 'joblib_data/tfidf_vect_pystemmer.joblib') 

In [None]:
beep()

#### No stemming

In [None]:
X_train_nostemmer, X_test_nostemmer, y_train, y_test = train_test_split(X, y,
                                                                  test_size=0.33, random_state=42)

In [None]:
%%time
spell = SpellChecker(distance=1)
tfidf_vect_nostemmer = TfidfVectorizer(min_df= 5, max_features = 50000, ngram_range=(1,2),
                                              preprocessor = spell.correction,
                                              stop_words = prep_stopws)
X_train_tfidf_nostemmer = tfidf_vect_nostemmer.fit_transform(X_train_nostemmer)

In [None]:
%%time
X_test_tfidf_nostemmer = tfidf_vect_nostemmer.transform(X_test_nostemmer)

In [None]:
dump(tfidf_vect_nostemmer, 'joblib_data/tfidf_vect_nostemmer.joblib') 

In [None]:
beep()

# Classification

## NB Classifier

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_curve, auc, confusion_matrix, f1_score, fbeta_score, precision_score, recall_score

Accessory functions:

In [None]:
def print_top_features(vectorizer, clf, n = 10):
    fnames = vectorizer.get_feature_names()
    top_pos = np.argsort(clf.coef_[0])[-n:]
    top_pos = top_pos[::-1]
    print("Most discriminative features:\n",
          ", ".join(fnames[j] for j in top_pos))

In [None]:
def score_NB(clf, X_train, X_test, y_train, y_test):
    train_score = clf.score(X_train, y_train) # Train Accuracy
    test_score = clf.score(X_test, y_test)    # Test Accuracy
    
    predictions = clf.predict(X_test)

    prec = precision_score(y_test, predictions) # Precision
    rec = recall_score(y_test, predictions) # Recall
    f1 = f1_score(y_test, predictions) # F1
    f2 = fbeta_score(y_test, predictions, beta=2) # F2
    cm = confusion_matrix(y_test, predictions)
    
    proba = clf.predict_proba(X_test)

    precision, recall, pr_thresholds = precision_recall_curve(y_test, proba[:,1])
    
    auc_score = auc(recall, precision)
    
    scores_strings = ["Train Accuracy", "Test Accuracy", "Test Precision",
                      "Test Recall", "F1", "F2", "P/R AUC"]
    
    scores = [train_score, test_score, prec, rec, f1, f2, auc_score]
    
    print(("{:20s} {:.5f}\n"*7)[:-1].format(*itertools.chain(*zip(scores_strings, scores))))
    
    print(classification_report(y_test,predictions))
    
    plt.plot(recall, precision, label='Precision-Recall curve')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall Curve: AUC=%0.2f' % auc_score)
    plt.legend(loc="lower left")
    plt.show()
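
For reference, the F-scores reported by `score_NB` combine precision $P$ and recall $R$ as $F_\beta = (1+\beta^2) P R / (\beta^2 P + R)$, so F2 weights recall more heavily than F1. A small sketch with made-up confusion counts (the function name and the numbers are illustrative, not results from this notebook):

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp)
    # Recall: share of actual positives that are found
    recall = tp / (tp + fn)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Made-up counts: 80 true positives, 20 false positives, 40 false negatives
p, r, f1 = precision_recall_fbeta(tp=80, fp=20, fn=40, beta=1)
_, _, f2 = precision_recall_fbeta(tp=80, fp=20, fn=40, beta=2)

print("P={:.3f} R={:.3f} F1={:.3f} F2={:.3f}".format(p, r, f1, f2))
```

Here recall is lower than precision, so F2 comes out lower than F1.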

### No Stemmer

In [None]:
%%time
clf = MultinomialNB()
clf.fit(X_train_tfidf_nostemmer, y_train)

In [None]:
print_top_features(tfidf_vect_nostemmer, clf, 50)

In [None]:
%%time
score_NB(clf, X_train_tfidf_nostemmer, X_test_tfidf_nostemmer, y_train, y_test)

In [None]:
beep()

In [None]:
dump(clf, 'joblib_data/clf_nb_nostemmer.joblib') 

### NLTK

#### Porter

In [None]:
%%time
clf = MultinomialNB()
clf.fit(X_train_tfidf_Porter, y_train)

In [None]:
print_top_features(tfidf_vect_Porter, clf, 50)

In [None]:
%%time
score_NB(clf, X_train_tfidf_Porter, X_test_tfidf_Porter, y_train, y_test)

In [None]:
dump(clf, 'joblib_data/clf_nb_porter.joblib') 

#### Lancaster

In [None]:
%%time
clf = MultinomialNB()
clf.fit(X_train_tfidf_Lancaster, y_train)

In [None]:
print_top_features(tfidf_vect_Lancaster, clf, 50)

In [None]:
%%time
score_NB(clf, X_train_tfidf_Lancaster, X_test_tfidf_Lancaster, y_train, y_test)

In [None]:
dump(clf, 'joblib_data/clf_nb_lancaster.joblib') 

### PyStemmer

In [None]:
%%time
clf = MultinomialNB()
clf.fit(X_train_tfidf_pystemmer, y_train)

In [None]:
print_top_features(tfidf_vect_pystemmer, clf, 50)

In [None]:
%%time
score_NB(clf, X_train_tfidf_pystemmer, X_test_tfidf_pystemmer, y_train, y_test)

In [None]:
dump(clf, 'joblib_data/clf_nb_pystemmer.joblib') 

## Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

### No stemmer

In [None]:
%%time
clf = RandomForestClassifier(n_estimators=40, random_state=42, n_jobs=-1)
clf.fit(X_train_tfidf_nostemmer, y_train)

In [None]:
train_score = clf.score(X_train_tfidf_nostemmer, y_train) # Train Accuracy
test_score = clf.score(X_test_tfidf_nostemmer, y_test)    # Test Accuracy
print("Train accuracy: {}, test accuracy: {}".format(train_score, test_score))

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_nostemmer)

In [None]:
print(classification_report(y_test,predictions))
dump(clf, "clf_random_forest_nostemmer.joblib")

### Porter

In [None]:
%%time
clf = RandomForestClassifier(n_estimators=40, random_state=42, n_jobs=-1)
clf.fit(X_train_tfidf_Porter, y_train)

In [None]:
train_score = clf.score(X_train_tfidf_Porter, y_train) # Train Accuracy
test_score = clf.score(X_test_tfidf_Porter, y_test)    # Test Accuracy
print("Train accuracy: {}, test accuracy: {}".format(train_score, test_score))

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_Porter)

In [None]:
print(classification_report(y_test,predictions))
dump(clf, "clf_random_forest_porter.joblib")

Results are more encouraging! The drawback is that the random forest is much slower than Multinomial NB.

### Lancaster

In [None]:
%%time
clf = RandomForestClassifier(n_estimators=40, random_state=42, n_jobs=-1)
clf.fit(X_train_tfidf_Lancaster, y_train)

In [None]:
train_score = clf.score(X_train_tfidf_Lancaster, y_train) # Train Accuracy
test_score = clf.score(X_test_tfidf_Lancaster, y_test)    # Test Accuracy
print("Train accuracy: {}, test accuracy: {}".format(train_score, test_score))

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_Lancaster)

In [None]:
print(classification_report(y_test,predictions))
dump(clf, "clf_random_forest_lancaster.joblib")

### Pystemmer

In [None]:
%%time
clf = RandomForestClassifier(n_estimators=40, random_state=42, n_jobs=-1)
clf.fit(X_train_tfidf_pystemmer, y_train)

In [None]:
train_score = clf.score(X_train_tfidf_pystemmer, y_train) # Train Accuracy
test_score = clf.score(X_test_tfidf_pystemmer, y_test)    # Test Accuracy
print("Train accuracy: {}, test accuracy: {}".format(train_score, test_score))

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_pystemmer)

In [None]:
print(classification_report(y_test,predictions))
dump(clf, "clf_random_forest_pystemmer.joblib")

## TruncatedSVD
The `X_train` matrix has around 20k features, so dimensionality reduction may help speed up the training phase. The goal of such methods is to preserve "expressive power" while reducing the dimensionality of the dataset.
Because the TF-IDF matrix is sparse, one of the best methods for dimensionality reduction is `TruncatedSVD`, which works directly on sparse input.
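
A minimal sketch on a random sparse matrix (the sizes and the name `tsvd_toy` are illustrative, not the notebook's data). It shows that `TruncatedSVD` takes sparse input as-is and returns a dense array, and that the sum of `explained_variance_ratio_` gives a rough idea of how much variance the kept components retain:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a TF-IDF matrix: 200 documents, 1000 features
X_sparse = sparse_random(200, 1000, density=0.01, format="csr", random_state=42)

tsvd_toy = TruncatedSVD(n_components=20, random_state=42)
X_reduced = tsvd_toy.fit_transform(X_sparse)  # no densifying or centering needed

print(X_reduced.shape)                            # (200, 20)
print(tsvd_toy.explained_variance_ratio_.sum())   # share of variance retained
```

On real TF-IDF data, checking this sum for a few values of `n_components` is one way to judge whether 500 components keep enough information.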

### No stemmer

In [None]:
%%time
from sklearn.decomposition import TruncatedSVD
tsvd = TruncatedSVD(n_components=500, random_state=42)
X_train_tfidf_nostemmer_svd = tsvd.fit_transform(X_train_tfidf_nostemmer)
X_test_tfidf_nostemmer_svd = tsvd.transform(X_test_tfidf_nostemmer)

### Porter

In [None]:
%%time
tsvd = TruncatedSVD(n_components=500, random_state=42)
X_train_tfidf_Porter_svd = tsvd.fit_transform(X_train_tfidf_Porter)
X_test_tfidf_Porter_svd = tsvd.transform(X_test_tfidf_Porter)

### Lancaster

In [None]:
%%time
tsvd = TruncatedSVD(n_components=500, random_state=42)
X_train_tfidf_Lancaster_svd = tsvd.fit_transform(X_train_tfidf_Lancaster)
X_test_tfidf_Lancaster_svd = tsvd.transform(X_test_tfidf_Lancaster)

### Pystemmer

In [None]:
%%time
tsvd = TruncatedSVD(n_components=500, random_state=42)
X_train_tfidf_pystemmer_svd = tsvd.fit_transform(X_train_tfidf_pystemmer)
X_test_tfidf_pystemmer_svd = tsvd.transform(X_test_tfidf_pystemmer)

#### Store SVD-transformed dataset

In [None]:
dump(X_train_tfidf_nostemmer_svd, 'joblib_data/X_train_tfidf_nostemmer_svd.joblib')
dump(X_test_tfidf_nostemmer_svd, 'joblib_data/X_test_tfidf_nostemmer_svd.joblib')

In [None]:
dump(X_train_tfidf_Porter_svd, 'joblib_data/X_train_tfidf_Porter_svd.joblib')
dump(X_test_tfidf_Porter_svd, 'joblib_data/X_test_tfidf_Porter_svd.joblib')

In [None]:
dump(X_train_tfidf_Lancaster_svd, 'joblib_data/X_train_tfidf_Lancaster_svd.joblib')
dump(X_test_tfidf_Lancaster_svd, 'joblib_data/X_test_tfidf_Lancaster_svd.joblib')

In [None]:
dump(X_train_tfidf_pystemmer_svd, 'joblib_data/X_train_tfidf_pystemmer_svd.joblib')
dump(X_test_tfidf_pystemmer_svd, 'joblib_data/X_test_tfidf_pystemmer_svd.joblib')

## Random Forest with TruncatedSVD Dataset

### No stemmer

In [None]:
%%time
clf = RandomForestClassifier(n_estimators=40, random_state=42, n_jobs=-1)
clf.fit(X_train_tfidf_nostemmer_svd, y_train)

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_nostemmer_svd)

In [None]:
print(classification_report(y_test,predictions))
dump(clf, "clf_random_forest_nostemmer_svd.joblib")

### Porter

In [None]:
%%time
clf = RandomForestClassifier(n_estimators=40, random_state=42, n_jobs=-1)
clf.fit(X_train_tfidf_Porter_svd, y_train)

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_Porter_svd)

In [None]:
print(classification_report(y_test,predictions))
dump(clf, "clf_random_forest_porter_svd.joblib")

### Lancaster

In [None]:
%%time
clf = RandomForestClassifier(n_estimators=40, random_state=42, n_jobs=-1)
clf.fit(X_train_tfidf_Lancaster_svd, y_train)

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_Lancaster_svd)

In [None]:
print(classification_report(y_test,predictions))
dump(clf, "clf_random_forest_lancaster_svd.joblib")

### Pystemmer

In [None]:
%%time
clf = RandomForestClassifier(n_estimators=40, random_state=42, n_jobs=-1)
clf.fit(X_train_tfidf_pystemmer_svd, y_train)

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_pystemmer_svd)

In [None]:
print(classification_report(y_test,predictions))
dump(clf, "clf_random_forest_pystemmer_svd.joblib")

## SVM
### LinearSVC

In [None]:
from sklearn import svm

In [None]:
clf = svm.LinearSVC(random_state=42)

In [None]:
%%time
clf.fit(X_train_tfidf_Porter_svd, y_train)

In [None]:
predictions = clf.predict(X_test_tfidf_Porter_svd)
print(classification_report(y_test,predictions))

### SVC

In [None]:
clf = svm.SVC(random_state=42, max_iter=500)

In [None]:
%%time
clf.fit(X_train_tfidf_Porter_svd, y_train)

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_Porter_svd)
print(classification_report(y_test,predictions))

## Adaboost
### 10 estimators

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
%%time
clf = AdaBoostClassifier(n_estimators=10, random_state=0)
clf.fit(X_train_tfidf_Porter_svd, y_train)

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_Porter_svd)
print(classification_report(y_test,predictions))

### 15 estimators

In [None]:
%%time
clf = AdaBoostClassifier(n_estimators=15, random_state=0)
clf.fit(X_train_tfidf_Porter_svd, y_train)

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_Porter_svd)
print(classification_report(y_test,predictions))

Increasing the number of estimators did not improve performance: let's see what happens when we reduce it.

### 5 estimators

In [None]:
%%time
clf = AdaBoostClassifier(n_estimators=5, random_state=0)
clf.fit(X_train_tfidf_Porter_svd, y_train)

In [None]:
%%time
predictions = clf.predict(X_test_tfidf_Porter_svd)
print(classification_report(y_test,predictions))

Performance is slightly worse.

# TODO

- Add *short* examples after some steps.
- Decide what to do with slang. Probably very necessary for preprocessing tweets. If we want to use it for the Amazon dataset, we might reduce the size of the dict by checking which terms are actually present in the reviews, and only keep the ones that are present in many reviews.
- Tweets part