## ML competition
### Try 2

_Marilyn, Shiva, Olivier_

The goal of this notebook is to try different, less restrictive sanitization procedures.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Setup chunk

import time

# Custom utils
from utils import *

# Data wrangling
import pandas as pd
import numpy as np

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Text sanitization
import re
import nltk
from nltk.stem.snowball import SnowballStemmer

try:
    # Avoid error if you don't have the resource
    stopwords = nltk.corpus.stopwords.words("english")
except LookupError:
    nltk.download("stopwords")
    stopwords = nltk.corpus.stopwords.words("english")
    
stemmer = SnowballStemmer(language="english")

# Lang detection
#import langid
#from langid.langid import LanguageIdentifier, model
#identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

# Misc
from tqdm import tqdm
tqdm.pandas()

# Define the seed for reproducibility
SEED = 31415

In [3]:
# Scikit time
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

from sklearn.feature_extraction.text import (
    CountVectorizer, 
    TfidfTransformer, 
    TfidfVectorizer
)
from sklearn.pipeline import Pipeline, make_pipeline


from sklearn.model_selection import (
    train_test_split, 
    GridSearchCV, 
    KFold, 
    cross_val_score
)

from sklearn.metrics import (
    classification_report, 
    accuracy_score, 
    confusion_matrix
)

In [4]:
df = pd.read_csv("data/MLUnige2021_train.csv")

## 2. Strategy

1. Preprocess the text
    1. remove punctuation marks
    2. remove stopwords (en)
    3. stem or lemmatize the words
    
    
2. Take a sample of the whole dataset (200k?) for our preliminary tests, since we can't run cross-validation on the whole dataset.

3. Begin to fit the models. 
    1. Pipeline with `TfidfTransformer` (or `TfidfVectorizer`)
    2. BernoulliNB
    3. LogisticRegression
    4. RidgeClassifier
    5. SGDClassifier
    6. SVC
    7. RandomForestClassifier
    8. DecisionTreeClassifier
    9. KNeighborsClassifier
    10. LDA? QDA?
4. Select a few of the best models, CV with bigger dataset
5. ...

# 1. Preprocessing

In [13]:
# Some special strings to test 
txt1 = "Hello @Shiva and @Marilyn! https://hello.ch 12334 #corona ..."
print("Sanitized:", sanitize(txt1))

Sanitized: hello shiva marilyn corona 
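`sanitize` comes from our `utils` module, which is not shown in this notebook. As a rough idea of what such a function could look like, here is a minimal sketch that follows the strategy steps (strip URLs and non-letters, drop stopwords, optionally stem). The name `sanitize_sketch`, the regular expressions, and the tiny stopword set are assumptions made for illustration; the real `sanitize` may well differ.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook loads NLTK's English list.
STOPWORDS = {"and", "the", "a", "to", "with", "you", "at"}

URL_RE = re.compile(r"https?://\S+")      # matches http(s) links
NON_LETTER_RE = re.compile(r"[^a-z\s]")   # anything that is not a lowercase letter or whitespace


def sanitize_sketch(text, stem=lambda word: word):
    """Hypothetical stand-in for utils.sanitize: lowercase, clean, filter, stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs first, before punctuation is stripped
    text = NON_LETTER_RE.sub(" ", text)   # drop digits, punctuation, '@' and '#'
    tokens = [stem(word) for word in text.split() if word not in STOPWORDS]
    return " ".join(tokens)


txt1 = "Hello @Shiva and @Marilyn! https://hello.ch 12334 #corona ..."
print("Sanitized:", sanitize_sketch(txt1))
```

To add stemming, a stemmer method such as `SnowballStemmer("english").stem` can be passed as the `stem` argument.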


In [6]:
try:
    df_san = pd.read_pickle("./data/sanitized2.pkl")
except FileNotFoundError:
    print("No pickle file found, sanitizing existing df")
    
    # Sanitize whole dataset
    df_san = df.copy()
    df_san["sanitized"] = df["text"].progress_apply(sanitize)

    # Export it to pickle so we don't have to redo it
    df_san.to_pickle("./data/sanitized2.pkl")

  0%|                                                                          | 569/1280000 [00:00<03:45, 5675.35it/s]

No pickle file found, sanitizing existing df


100%|██████████████████████████████████████████████████████████████████████| 1280000/1280000 [02:38<00:00, 8056.71it/s]


In [7]:
print(df_san.shape)
df_san.head()

(1280000, 8)


| | Id | emotion | tweet_id | date | lyx_query | user | text | sanitized |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2063391019 | Sun Jun 07 02:28:13 PDT 2009 | NO_QUERY | BerryGurus | @BreeMe more time to play with you BlackBerry ... | breem time play blackberri |
| 1 | 1 | 0 | 2000525676 | Mon Jun 01 22:18:53 PDT 2009 | NO_QUERY | peterlanoie | Failed attempt at booting to a flash drive. Th... | fail attempt boot flash drive fail attempt swi... |
| 2 | 2 | 0 | 2218180611 | Wed Jun 17 22:01:38 PDT 2009 | NO_QUERY | will_tooker | @msproductions Well ain't that the truth. Wher... | msproduct well truth damn auto lock disabl go... |
| 3 | 3 | 1 | 2190269101 | Tue Jun 16 02:14:47 PDT 2009 | NO_QUERY | sammutimer | @Meaghery cheers Craig - that was really sweet... | meagheri cheer craig realli sweet repli pump |
| 4 | 4 | 0 | 2069249490 | Sun Jun 07 15:31:58 PDT 2009 | NO_QUERY | ohaijustin | I was reading the tweets that got send to me w... | read tweet got send lie phone face drop amp hi... |


In [8]:
print("Before sanitizing", df['text'].apply(lambda x: len(x.split(' '))).sum())
print("After sanitizing", df_san['sanitized'].apply(lambda x: len(x.split(' '))).sum())

Before sanitizing 18398298
After sanitizing 10730388


As a quick check: before sanitizing, we had 18'398'298 words. Sanitizing and stemming our tweets reduced this to 10'730'388, a drop of about 42%.
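A small sanity check on these counts, as a sketch: the cell above splits on a single space with `split(' ')`, which also counts the empty strings left by repeated or trailing spaces (the sanitized test string earlier ends with a space). The numbers are copied from the output above; the sample string is only an illustration.

```python
# Word counts copied from the cell output above
before, after = 18_398_298, 10_730_388

# Relative reduction in the number of words
reduction = 1 - after / before
print(f"Reduction: {reduction:.1%}")

# split(" ") keeps empty strings caused by repeated or trailing spaces,
# while split() with no argument drops them.
sample = "hello shiva marilyn corona "
count_single_space = len(sample.split(" "))
count_whitespace = len(sample.split())
print(count_single_space, count_whitespace)
```

Using `split()` without an argument would therefore give slightly lower, more accurate counts.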

Before fitting our models, we also take a 10% subsample so that they train faster.

In [9]:
# Preprocessed by me
df_sub_san = df_san.sample(frac=0.1, random_state=SEED)

# To check reproducibility
print("First Id", df_sub_san["Id"].iloc[0])
print("Last Id", df_sub_san["Id"].iloc[-1])
print("Length", df_sub_san.shape[0])

First Id 423388
Last Id 63949
Length 128000


In [10]:
#No preprocess, will use default by scikit
df_sub_nosan = df.sample(frac=0.1, random_state=SEED)

# To check reproducibility
print("First Id", df_sub_nosan["Id"].iloc[0])
print("Last Id", df_sub_nosan["Id"].iloc[-1])
print("Length", df_sub_nosan.shape[0])

First Id 423388
Last Id 63949
Length 128000
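Both subsamples report the same first and last `Id`. This is expected: with the same `random_state` and the same number of rows, `DataFrame.sample` selects the same row positions. A minimal sketch on toy frames (the frames `raw` and `san` are made-up stand-ins for `df` and `df_san`):

```python
import pandas as pd

SEED = 31415

# Toy stand-ins for df and df_san: same rows, one extra column in the second
raw = pd.DataFrame({"Id": range(100), "text": [f"tweet {i}" for i in range(100)]})
san = raw.assign(sanitized=raw["text"].str.upper())

# Same fraction and same seed on frames of equal length
sub_raw = raw.sample(frac=0.1, random_state=SEED)
sub_san = san.sample(frac=0.1, random_state=SEED)

# The two subsamples contain the same rows, in the same order
same_rows = sub_raw.index.equals(sub_san.index)
print("Same rows selected:", same_rows)
```

This means the sanitized and raw subsamples can be compared tweet by tweet.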


# 2. Fitting

## 1. with manual preprocessing

In [21]:
# Train/test split on the manually sanitized text (currently disabled)
#X_train, X_test, y_train, y_test = train_test_split(df_sub_san["sanitized"], df_sub_san["emotion"], 
#                                                    test_size=0.2, shuffle=True, random_state=SEED)

# Train/test split on the raw text, leaving preprocessing to scikit-learn's vectorizer
X_train, X_test, y_train, y_test = train_test_split(df_sub_nosan["text"], df_sub_nosan["emotion"], 
                                                    test_size=0.2, shuffle=True, random_state=SEED)

#only 4 folds because I have 4 cores, just to test
folds = KFold(n_splits=4, shuffle=True, random_state=SEED)

In [14]:
# Sanity check
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)

X_train:  (102400,)
X_test:  (25600,)
y_train:  (102400,)
y_test:  (25600,)


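The first model we will try is `BernoulliNB`, which models binary (present/absent) features. With its default `binarize=0.0`, it thresholds the tf-idf weights, so each word effectively counts only as present or absent.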

In [15]:
berNB = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("clf", BernoulliNB()),
    ]
)

start = time.time()

CV_ber = cross_val_score(
    berNB, X_train, y_train, scoring="accuracy", cv=folds, n_jobs=-1
)

berNB.fit(X_train, y_train)
y_pred = berNB.predict(X_test)
score = accuracy_score(y_test, y_pred)
print(f"Time {time.time() - start}")
print(f"Mean CV accuracy: {np.mean(CV_ber)}")
print(f"Test accuracy: {score}")

"""
Whole dataset:
Time 21.568854093551636
Mean CV accuracy: 0.7639306640625
Test accuracy: 0.76375390625

10% sample:
Time 2.146036386489868
Mean CV accuracy: 0.7483203125
Test accuracy: 0.7523828125
"""

Time 4.277349948883057
Mean CV accuracy: 0.745966796875
Test accuracy: 0.7516015625


'\nWhole dataset:\nTime 21.568854093551636\nMean CV accuracy: 0.7639306640625\nTest accuracy: 0.76375390625\n\n10% sample:\nTime 2.146036386489868\nMean CV accuracy: 0.7483203125\nTest accuracy: 0.7523828125\n'
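One point worth keeping in mind about the `BernoulliNB` pipeline: since the classifier binarizes its input at 0 by default, feeding it tf-idf weights should be equivalent to feeding it binary word occurrences. The sketch below checks this on a toy corpus; the texts and labels are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus and labels, made up for illustration
texts = ["good good day", "bad day", "good mood", "bad bad mood", "good", "bad"]
labels = [1, 0, 1, 0, 1, 0]

# Same tokenization, different weighting: tf-idf weights vs. 0/1 occurrences
X_tfidf = TfidfVectorizer().fit_transform(texts)
X_binary = CountVectorizer(binary=True).fit_transform(texts)

# BernoulliNB (binarize=0.0 by default) maps every positive value to 1
proba_tfidf = BernoulliNB().fit(X_tfidf, labels).predict_proba(X_tfidf)
proba_binary = BernoulliNB().fit(X_binary, labels).predict_proba(X_binary)

same_probabilities = np.allclose(proba_tfidf, proba_binary)
print("Same probabilities:", same_probabilities)
```

In other words, tuning the tf-idf options cannot change what `BernoulliNB` sees, unless they change which words end up in the vocabulary (for example `stop_words`).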

In [40]:

models = [LogisticRegression(warm_start=True)]

params_tfid = {
    "tfidfvectorizer__strip_accents": ["unicode"],
    "tfidfvectorizer__lowercase": [True],
    "tfidfvectorizer__stop_words": ["english"],
    "tfidfvectorizer__norm": ["l2"],
    "tfidfvectorizer__analyzer": ["word"],
    "tfidfvectorizer__use_idf": [True],
    "tfidfvectorizer__smooth_idf": [True, False],
    "tfidfvectorizer__sublinear_tf": [True, False]
}

params = {
    "bernoullinb": {
        "bernoullinb__alpha": [7.5],
        "bernoullinb__fit_prior": [False, True],
    },
    "ridgeclassifier": {
        "ridgeclassifier__alpha": np.linspace(1e-5, 10, 5),
        "ridgeclassifier__class_weight": ["balanced", None],
        "ridgeclassifier__normalize": [False, True],
        
    },
    "logisticregression": {
        #"logisticregression__penalty": ["l1", "l2", "elasticnet"],
        #"logisticregression__penalty": ["l1"],
        #"logisticregression__dual": [False, True], try with liblinear
        "logisticregression__C": [1.2],
        #"logisticregression__C": [1e-1],
        #"logisticregression__random_state": [SEED],
        #"logisticregression__solver": ["newton-cg", "lbfgs", "saga"],
        #"logisticregression__solver": ["saga"],
        #"logisticregression__l1_ratio": np.linspace(0.1, 0.9, 5),
    },
    "sgdclassifier": {
        "sgdclassifier__random_state": [SEED],
        "sgdclassifier__loss": ["modified_huber"],
        "sgdclassifier__alpha": 10**np.linspace(-3, -0.001, 4),
    },
    "baggingclassifier": {
        "baggingclassifier__random_state": [SEED],
        "baggingclassifier__n_estimators": [30],
        "baggingclassifier__max_samples": [0.05],
        "baggingclassifier__max_features": [0.5],
        
    },
    "randomforestclassifier": {
        "randomforestclassifier__random_state": [SEED],
    },
    "svc": {
        "svc__random_state": [SEED],
    },
    "linearsvc": {
        "linearsvc__random_state": [SEED],
        "linearsvc__loss": ["squared_hinge"],
        "linearsvc__penalty": ["l2"],
        "linearsvc__max_iter": [1000],
        "linearsvc__dual": [False],
        "linearsvc__C": np.linspace(0.01, 0.2, 10),
        "linearsvc__class_weight": ["balanced"],
    }
}

# If we also want to gridsearch the different Tfidf params
for k, v in params_tfid.items():
    #params["bernoullinb"][k] = v
    params["logisticregression"][k] = v
    #Easier if we comment above
    pass

pipes = []

# Also check what we can do with the TfidfVectorizer parameters
for model in models:
    pipe = make_pipeline(TfidfVectorizer(), model)
    pipes.append(pipe)
    
    # Will use that once we have the best params
    #pipe.set_params(**params[pipe.steps[1][0]])

# Initialize empty dictionary
reports = {}
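The grid keys above rely on how `make_pipeline` names its steps: each step takes the lowercased class name, and parameters are addressed as `<step>__<parameter>`. A small sketch to check the names and count the candidates; the grid here is a trimmed copy of the one above, keeping only the entries with more than one value plus `C`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Trimmed grid: keys follow the "<step name>__<parameter>" convention
grid = {
    "logisticregression__C": [1.2],
    "tfidfvectorizer__smooth_idf": [True, False],
    "tfidfvectorizer__sublinear_tf": [True, False],
}

# Every key must be a valid parameter of the pipeline
valid_keys = all(key in pipe.get_params() for key in grid)

# Number of parameter combinations GridSearchCV would evaluate
n_candidates = len(ParameterGrid(grid))
print(n_candidates, "candidates,", n_candidates * 4, "fits with 4 folds")
```

Checking the keys against `pipe.get_params()` before launching a long grid search catches typos in parameter names early.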

In [41]:
np.linspace(1e-3, 2, 4)

array([1.00000000e-03, 6.67333333e-01, 1.33366667e+00, 2.00000000e+00])

In [42]:
# Fit each different pipeline

for pipe in pipes:
    print(pipe.steps[1][0])
    start = time.time()
    
    gridsearch = GridSearchCV(pipe, params[pipe.steps[1][0]], scoring="accuracy", cv=folds, n_jobs=-1, verbose=3)
    gridsearch.fit(X_train, y_train)
    y_pred = gridsearch.predict(X_test)
    
    score = accuracy_score(y_test, y_pred)
    resdf = pd.DataFrame(gridsearch.cv_results_)
    
    reports[pipe.steps[1][0]] = classification_report(y_test, y_pred)
    
    print(f"Time {time.time() - start}s")
    #print(resdf[resdf["rank_test_score"] == 1])
    print(f"Test accuracy: {score}")
    

logisticregression
Fitting 4 folds for each of 4 candidates, totalling 16 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Time 23.103020668029785s
Test accuracy: 0.760703125
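The warning in the output comes from the `lbfgs` solver reaching its iteration limit before converging. As the message suggests, one option is to raise `max_iter`. A minimal sketch on toy data; the value 1000 and the texts are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration
texts = ["good day", "bad day", "good mood", "bad mood", "so good", "so bad"]
labels = [1, 0, 1, 0, 1, 0]

# max_iter=1000 is an illustrative value (the scikit-learn default is 100)
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(texts, labels)

# n_iter_ reports how many iterations the solver actually used
n_iter = pipe.named_steps["logisticregression"].n_iter_
print("Iterations used:", n_iter)
```

In the grid search, the same setting could be added as `"logisticregression__max_iter": [1000]`, at the cost of longer fits.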


In [36]:
resdf.sort_values(by=["rank_test_score"]) 

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_tfidfvectorizer__analyzer,param_tfidfvectorizer__lowercase,param_tfidfvectorizer__norm,param_tfidfvectorizer__stop_words,param_tfidfvectorizer__strip_accents,param_tfidfvectorizer__use_idf,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
2,3.414015,0.320567,0.337937,0.014908,1.333333,word,True,l2,english,unicode,True,"{'logisticregression__C': 1.3333333333333333, ...",0.758125,0.755625,0.759609,0.761367,0.758682,0.002105,1
0,3.248127,0.479268,0.382911,0.034352,1.0,word,True,l2,english,unicode,True,"{'logisticregression__C': 1.0, 'tfidfvectorize...",0.758867,0.755195,0.759297,0.761289,0.758662,0.0022,2
4,3.240951,0.390944,0.335812,0.019201,1.666667,word,True,l2,english,unicode,True,"{'logisticregression__C': 1.6666666666666665, ...",0.75707,0.755313,0.759648,0.760781,0.758203,0.002143,3
6,3.458458,0.253091,0.353544,0.020942,2.0,word,True,l2,english,unicode,True,"{'logisticregression__C': 2.0, 'tfidfvectorize...",0.757422,0.755,0.75918,0.761094,0.758174,0.002246,4
5,3.569561,0.123064,0.340913,0.018221,1.666667,word,True,l2,english,unicode,False,"{'logisticregression__C': 1.6666666666666665, ...",0.75875,0.754258,0.759102,0.759687,0.757949,0.002157,5
3,3.582193,0.124175,0.343528,0.012095,1.333333,word,True,l2,english,unicode,False,"{'logisticregression__C': 1.3333333333333333, ...",0.758203,0.753828,0.758867,0.759492,0.757598,0.002224,6
7,3.328222,0.142667,0.308464,0.031369,2.0,word,True,l2,english,unicode,False,"{'logisticregression__C': 2.0, 'tfidfvectorize...",0.758008,0.752969,0.759375,0.759766,0.757529,0.002713,7
1,3.90855,0.179291,0.370534,0.029413,1.0,word,True,l2,english,unicode,False,"{'logisticregression__C': 1.0, 'tfidfvectorize...",0.757578,0.753828,0.758008,0.759258,0.757168,0.002025,8
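Before reading too much into this ranking, it helps to compare the gap between candidates with the variation across folds. The sketch below uses the `mean_test_score` and `std_test_score` values copied from the table above.

```python
# (mean_test_score, std_test_score) pairs copied from the table above
results = [
    (0.758682, 0.002105),
    (0.758662, 0.002200),
    (0.758203, 0.002143),
    (0.758174, 0.002246),
    (0.757949, 0.002157),
    (0.757598, 0.002224),
    (0.757529, 0.002713),
    (0.757168, 0.002025),
]

means = [mean for mean, _ in results]
stds = [std for _, std in results]

# Gap between the best and the worst candidate
spread = max(means) - min(means)
print(f"Best minus worst mean accuracy: {spread:.6f}")
print(f"Smallest fold-to-fold std:      {min(stds):.6f}")
print("Spread below fold noise:", spread < min(stds))
```

The gap between the best and worst candidates is smaller than the standard deviation across folds, so the differences between these settings are probably not meaningful.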


### Summary

With less sanitization (either manual or scikit-learn's default preprocessing), test accuracy is:

Linear models: <br>
BernoulliNB: 0.75 on the 10% sample, 0.76 on the whole dataset <br>
LogisticRegression: 0.76 <br>


No matter the classifier, we don't seem to get above 76% accuracy. We should look into the data to see whether something stands out.
-> delete very short tweets?
-> lemmatization?
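For the first idea, a minimal sketch of how very short tweets could be filtered out with pandas. The toy frame stands in for `df_san`, and the threshold `MIN_TOKENS` is a hypothetical value that would need tuning.

```python
import pandas as pd

# Toy frame standing in for df_san (made-up rows)
df_toy = pd.DataFrame(
    {"sanitized": ["good", "realli sweet repli", "", "fail attempt boot flash drive"]}
)

MIN_TOKENS = 2  # hypothetical threshold, to be tuned

# Number of whitespace-separated tokens per tweet (empty strings give 0)
n_tokens = df_toy["sanitized"].str.split().str.len()

# Keep only tweets with at least MIN_TOKENS tokens
df_long = df_toy[n_tokens >= MIN_TOKENS]
print(len(df_toy), "->", len(df_long), "tweets kept")
```

Whether dropping these tweets helps would have to be checked with the same cross-validation setup, and test tweets cannot be dropped in the competition, only training ones.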

Maybe we removed too much during preprocessing; we could try removing only stopwords (+ URLs?).

According to the scikit-learn documentation:
"Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable." It is worth playing with the `binary` parameter of `CountVectorizer`!