In [1]:
import pandas as pd
import joblib

from sklearn import metrics
from sklearn.model_selection import cross_validate

from sklearn.linear_model import LogisticRegression

#### **Evaluation of Feature Extraction Techniques**

**Loading the tokenized tweets**

In [2]:
tweets_train_tokenized = pd.read_csv('csvs/tweets_train_tokens.csv', index_col=False)
tweets_train_tokenized_message = pd.Series(tweets_train_tokenized.message)
# Convert the pandas Series to Unicode strings, as required by the vectorizers
tweets = tweets_train_tokenized_message.astype('U').values
tweets

array(['arirang simply kpop kim hyung jun cross ha yeong playback',
       'read politico article donald trump running mate tom brady list likely choice',
       'type bazura project google image image photo dad glenn moustache whatthe',
       ..., 'bring dunkin iced coffee tomorrow hero',
       'currently holiday portugal come home tomorrow poland tuesday holocaust memorial trip',
       'ladykiller saturday aternoon'], dtype=object)

**Loading the tweets targets**

In [3]:
tweets_train_y = pd.read_csv('csvs/tweets_train_y.csv', index_col=False)
tweets_train_y = pd.Series(tweets_train_y['0'])
tweets_train_y = tweets_train_y.values
tweets_train_y

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

**Evaluation function**

We define a function that computes our performance metrics: **Macro Recall**, **Accuracy**, and **Macro F1** *(averaged over the positive and negative classes only)*.

**Macro Recall** will serve as our primary metric for evaluation.

We will evaluate our model on these three metrics through **5-fold cross-validation**. We will also measure the **recall of the negative class**, since it is the minority class and the one we should pay the most attention to.

In [4]:
labels_codes = {'negative': 0, 'neutral': 1, 'positive': 2}

f1_pos_neg = metrics.make_scorer(
    metrics.f1_score, average="macro", labels=[labels_codes["negative"], labels_codes["positive"]]
)
# Recall of the negative class only.
# 'micro' computes the metric globally by counting total true positives, false negatives and false positives.
# We cannot use pos_label=0 (as discussed for specificity) because this is a multi-class problem, not a binary one.
# labels=[negative] restricts the score to that label, so 'micro' here reduces to that class's recall: TP / (TP + FN).
recall_neg = metrics.make_scorer(metrics.recall_score, average="micro", labels=[labels_codes["negative"]])


def evaluate_model(model, features, labels, cv=5, fit_params=None):
    scores = cross_validate(
        model,
        features,
        labels,
        cv=cv,
        fit_params=fit_params,
        scoring={
            "recall_macro": "recall_macro",
            "f1_pos_neg": f1_pos_neg,
            "accuracy": "accuracy",
            "recall_neg": recall_neg,
        },
        n_jobs=-1,  # run the cross-validation folds in parallel on all available CPU cores
    )

    results = pd.DataFrame(scores).drop(["fit_time", "score_time"], axis=1)
    results.columns = pd.MultiIndex.from_tuples([c.split("_", maxsplit=1) for c in results.columns])
    summary = results.describe()
    results = pd.concat([results, summary.loc[["mean", "std"]]])

    def custom_style(row):
        color = "white"
        if row.name == "mean":
            color = "orange"
        return ["background-color: %s" % color] * len(row.values)

    results = results[sorted(results.columns, key=lambda x: x[0], reverse=True)]
    results = results.style.apply(custom_style, axis=1)

    return results
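
To make the custom scorers above more concrete, here is a minimal sketch on made-up toy labels (the arrays `y_true` and `y_pred` are hypothetical and not part of the tweet data). It suggests that `average="micro"` with `labels=[0]` gives the same value as the plain recall of the negative class, and that the restricted macro F1 averages only the negative and positive classes.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Hypothetical toy labels using the same coding as above:
# 0 = negative, 1 = neutral, 2 = positive.
y_true = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 1, 2, 2, 2, 0, 1])

# Recall of each class separately, in the order negative, neutral, positive.
per_class_recall = recall_score(y_true, y_pred, average=None, labels=[0, 1, 2])

# Macro recall is the unweighted mean of the per-class recalls.
macro_recall = recall_score(y_true, y_pred, average="macro")

# Restricting to one label makes the micro average equal that class's recall.
neg_recall = recall_score(y_true, y_pred, average="micro", labels=[0])

# Macro F1 computed over the negative and positive classes only.
f1_pos_neg_value = f1_score(y_true, y_pred, average="macro", labels=[0, 2])

print("per-class recall:", per_class_recall)
print("macro recall:", macro_recall)
print("negative recall:", neg_recall)
print("F1 (pos/neg):", f1_pos_neg_value)
```

Because macro recall weights every class equally, a model that ignores the small negative class is penalized even when its accuracy looks acceptable.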

**Loading the vectors**

**a. TF-IDF tweets**

In [5]:
tfidf_tweets = joblib.load('./vectors/tfidf_tweets.sav')
tfidf_tweets

<49675x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 426709 stored elements in Compressed Sparse Row format>
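
The saved matrix has 5000 columns, which would be consistent with a vocabulary capped at 5000 terms. The sketch below shows how a matrix of this kind could be produced with `TfidfVectorizer`; the `max_features=5000` setting is an assumption inferred from the shape above, not the confirmed configuration used to create `tfidf_tweets.sav`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tokenized tweets copied from the output shown earlier in this notebook.
sample_tweets = [
    "arirang simply kpop kim hyung jun cross ha yeong playback",
    "bring dunkin iced coffee tomorrow hero",
    "ladykiller saturday aternoon",
]

# Assumption: the vocabulary is limited to the 5000 most frequent terms.
vectorizer = TfidfVectorizer(max_features=5000)
sample_matrix = vectorizer.fit_transform(sample_tweets)

# Rows are tweets, columns are vocabulary terms; the result is a sparse matrix.
print(sample_matrix.shape)
print(type(sample_matrix))
```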

**b. Keras embedded tweets**

In [6]:
keras_model_tweets = joblib.load('./vectors/keras_model_tweets.sav')
keras_model_tweets

array([[ 0.02011312, -0.00384263, -0.03569106, ...,  0.02549319,
         0.03361343, -0.00767218],
       [ 0.02011312, -0.00384263, -0.03569106, ...,  0.0429675 ,
         0.03223759, -0.04315708],
       [ 0.02011312, -0.00384263, -0.03569106, ..., -0.0283402 ,
        -0.02560207, -0.009534  ],
       ...,
       [ 0.02011312, -0.00384263, -0.03569106, ..., -0.01660419,
        -0.03300374,  0.00409142],
       [ 0.02011312, -0.00384263, -0.03569106, ...,  0.03942401,
        -0.04004568, -0.00603427],
       [ 0.02011312, -0.00384263, -0.03569106, ...,  0.0034272 ,
        -0.02500931,  0.04393745]], dtype=float32)

**c. Word2Vec tweets**

In [7]:
word2vec_tweets = joblib.load('./vectors/word2vec_tweets.sav')
word2vec_tweets

array([[ 0.43820262, -0.4754556 ,  0.11711419, ...,  0.8534837 ,
        -1.470024  , -0.72018313],
       [ 0.8019137 , -1.0742214 ,  0.6273    , ...,  0.96964335,
        -2.5701778 , -1.698925  ],
       [ 0.7749731 , -0.4721202 ,  0.4301789 , ...,  1.0060902 ,
        -1.4998134 , -0.62398565],
       ...,
       [ 1.2551143 , -0.6130551 ,  0.83693   , ...,  1.9660778 ,
        -1.5066313 , -1.1546358 ],
       [ 1.0965337 , -0.71529126,  0.89206046, ...,  1.8730817 ,
        -1.32547   , -0.6296154 ],
       [ 0.16459951, -0.8970738 ,  0.19940762, ...,  2.136376  ,
        -0.63419354,  0.01595367]], dtype=float32)

**d. fastText tweets**

In [8]:
fasttext_tweets = joblib.load('./vectors/fasttext_tweets.sav')
fasttext_tweets

array([[ 1.7083393 ,  0.16257861, -0.3256186 , ...,  1.6048752 ,
        -0.6958239 , -0.861176  ],
       [ 1.4671383 ,  0.27155647, -2.1007617 , ...,  2.8497617 ,
        -0.8527399 , -0.83536667],
       [ 1.7470719 ,  0.10171478, -0.4181965 , ...,  1.5183479 ,
        -0.21019751, -0.65540004],
       ...,
       [ 1.5684685 , -0.18233915, -0.61222434, ...,  1.6030623 ,
        -1.0378721 , -1.6156934 ],
       [ 1.7817134 ,  0.7445152 , -0.77702844, ...,  2.0351388 ,
        -0.6231385 , -1.0140816 ],
       [ 1.2703185 ,  1.3933463 , -0.4189272 , ...,  2.0525815 ,
        -0.8890051 , -1.0568657 ]], dtype=float32)

**e. Doc2Vec tweets**

In [9]:
doc2vec_tweets = joblib.load('./vectors/doc2vec_tweets.sav')
doc2vec_tweets

array([[-0.09706153,  0.00836273, -0.10036916, ..., -0.05951937,
        -0.12229338, -0.02943969],
       [ 0.0768654 ,  0.11051423, -0.14390431, ...,  0.15714288,
        -0.1072313 , -0.10160755],
       [-0.03059013,  0.0757471 ,  0.00896006, ..., -0.16501954,
         0.01974987, -0.09750597],
       ...,
       [ 0.03058317, -0.04542535,  0.02273304, ...,  0.00468193,
        -0.12863246, -0.12644576],
       [ 0.08006796, -0.18050203,  0.09373045, ...,  0.01869539,
        -0.16327702, -0.09766775],
       [ 0.03441153,  0.00589671, -0.02517486, ...,  0.08582754,
        -0.00996797, -0.01485144]], dtype=float32)

**Baseline model**

In [10]:
# multi_class='multinomial' to specify that we want to use Softmax/cross-entropy loss (and not 3 separate binary classifiers)
# solver='lbfgs' to pick a solver that supports Softmax
# class_weight='balanced' to give more importance to the less represented classes during training (i.e. negative messages)
log_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", class_weight="balanced", max_iter=5000)
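
As a rough illustration of what `class_weight="balanced"` does, the sketch below computes the balanced weights for a small made-up label array (`toy_labels` is hypothetical). scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * count_of_that_class)`, so rarer classes receive larger weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: one negative (0), four neutral (1), three positive (2).
toy_labels = np.array([0, 1, 1, 1, 1, 2, 2, 2])
classes = np.unique(toy_labels)

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_labels)

# The same weights from the documented formula.
manual = len(toy_labels) / (len(classes) * np.bincount(toy_labels))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.4f}")
```

On the real data, the negative class would get the largest weight because it is the least represented.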

**Evaluation Proper**

a. TF-IDF tweets

In [11]:
evaluate_model(log_reg, tfidf_tweets, tweets_train_y)

| fold | test recall_macro | test f1_pos_neg | test accuracy | test recall_neg |
|------|-------------------|-----------------|---------------|-----------------|
| 0    | 0.626513 | 0.596663 | 0.622446 | 0.635249 |
| 1    | 0.635832 | 0.606322 | 0.626472 | 0.663871 |
| 2    | 0.632532 | 0.600897 | 0.6307   | 0.634839 |
| 3    | 0.632406 | 0.601818 | 0.632209 | 0.629677 |
| 4    | 0.632781 | 0.602636 | 0.630398 | 0.635249 |
| mean | 0.632013 | 0.601667 | 0.628445 | 0.639777 |
| std  | 0.003385 | 0.003474 | 0.003967 | 0.013674 |


b. Keras embedded tweets

In [12]:
evaluate_model(log_reg, keras_model_tweets, tweets_train_y)

| fold | test recall_macro | test f1_pos_neg | test accuracy | test recall_neg |
|------|-------------------|-----------------|---------------|-----------------|
| 0    | 0.368321 | 0.330364 | 0.349371 | 0.429955 |
| 1    | 0.359047 | 0.319618 | 0.340614 | 0.42129  |
| 2    | 0.363185 | 0.323889 | 0.345647 | 0.42129  |
| 3    | 0.360702 | 0.322471 | 0.343734 | 0.416774 |
| 4    | 0.358764 | 0.318492 | 0.340513 | 0.420917 |
| mean | 0.362004 | 0.322967 | 0.343976 | 0.422045 |
| std  | 0.003945 | 0.004665 | 0.003716 | 0.004816 |


c. Word2Vec tweets

In [13]:
evaluate_model(log_reg, word2vec_tweets, tweets_train_y)

| fold | test recall_macro | test f1_pos_neg | test accuracy | test recall_neg |
|------|-------------------|-----------------|---------------|-----------------|
| 0    | 0.492158 | 0.458887 | 0.466331 | 0.571982 |
| 1    | 0.500807 | 0.465479 | 0.468747 | 0.603226 |
| 2    | 0.499216 | 0.46313  | 0.469653 | 0.594839 |
| 3    | 0.488966 | 0.45279  | 0.466532 | 0.55871  |
| 4    | 0.499651 | 0.465054 | 0.472572 | 0.585539 |
| mean | 0.49616  | 0.461068 | 0.468767 | 0.582859 |
| std  | 0.005265 | 0.005312 | 0.002559 | 0.017797 |


d. fastText tweets

In [14]:
evaluate_model(log_reg, fasttext_tweets, tweets_train_y)

| fold | test recall_macro | test f1_pos_neg | test accuracy | test recall_neg |
|------|-------------------|-----------------|---------------|-----------------|
| 0    | 0.485396 | 0.446723 | 0.465224 | 0.548741 |
| 1    | 0.491378 | 0.452748 | 0.470659 | 0.556129 |
| 2    | 0.486852 | 0.447345 | 0.464419 | 0.55871  |
| 3    | 0.480004 | 0.440098 | 0.46623  | 0.521935 |
| 4    | 0.490823 | 0.450023 | 0.469653 | 0.55907  |
| mean | 0.486891 | 0.447387 | 0.467237 | 0.548917 |
| std  | 0.004617 | 0.004724 | 0.002764 | 0.015645 |


e. Doc2Vec tweets

In [15]:
evaluate_model(log_reg, doc2vec_tweets, tweets_train_y)

| fold | test recall_macro | test f1_pos_neg | test accuracy | test recall_neg |
|------|-------------------|-----------------|---------------|-----------------|
| 0    | 0.437089 | 0.411732 | 0.412079 | 0.510006 |
| 1    | 0.458424 | 0.427026 | 0.427781 | 0.554194 |
| 2    | 0.455513 | 0.425268 | 0.427781 | 0.54129  |
| 3    | 0.431766 | 0.405602 | 0.410971 | 0.490323 |
| 4    | 0.446502 | 0.417783 | 0.424157 | 0.511298 |
| mean | 0.445859 | 0.417482 | 0.420554 | 0.521422 |
| std  | 0.011478 | 0.009028 | 0.008383 | 0.02583  |


Since **tfidf_tweets** scored the highest in Macro Recall **(63.20%)**, as well as in all of our secondary metrics, including recall for the negative class **(63.98%)**, we will choose **TfidfVectorizer** as our feature extraction technique for vectorizing the tweets. We will pass these vectorized texts to a neural network to see whether it yields a better-performing model.
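
For a side-by-side view, the sketch below collects the mean cross-validation scores reported in the tables above and ranks the five feature sets by macro recall. The dictionary `mean_scores` is only a convenience structure introduced here; the numbers are copied from the `mean` rows.

```python
# Mean 5-fold scores copied from the "mean" rows of the evaluation tables above.
mean_scores = {
    "TF-IDF": {"recall_macro": 0.632013, "recall_neg": 0.639777},
    "Keras embedding": {"recall_macro": 0.362004, "recall_neg": 0.422045},
    "Word2Vec": {"recall_macro": 0.496160, "recall_neg": 0.582859},
    "fastText": {"recall_macro": 0.486891, "recall_neg": 0.548917},
    "Doc2Vec": {"recall_macro": 0.445859, "recall_neg": 0.521422},
}

# Rank the feature sets by the primary metric, highest first.
ranking = sorted(
    mean_scores,
    key=lambda name: mean_scores[name]["recall_macro"],
    reverse=True,
)

for name in ranking:
    scores = mean_scores[name]
    print(f"{name:16s} macro recall {scores['recall_macro']:.2%}   "
          f"negative recall {scores['recall_neg']:.2%}")
```

For reference, uniform random guessing over three classes has an expected macro recall of about 33%, so the Keras embedding features perform only slightly above that level with this baseline.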

#### **End. Thank you!**