# Classifier using averaged word vectors

## Reading data

In [35]:
import pandas as pd
train = pd.read_csv('train_cleaned.csv')
test = pd.read_csv('test_cleaned.csv')

In [36]:
wordvec_train = pd.read_pickle('train_wordvec_glove.twitter.27B.200d.pickle')
wordvec_test = pd.read_pickle('test_wordvec_glove.twitter.27B.200d.pickle')

In [37]:
train = train.merge(wordvec_train, on=['id'])
test = test.merge(wordvec_test, on=['id'])
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text,glove_cleaned_text,wordvec,keyword_wordvec,wordvec_concat,wordvec_tfidf
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...,our deeds are the reason of this <hashtag> ear...,"[0.17742333, 0.0825951, 0.13578425, 0.11371653...","[0.17742333, 0.0825951, 0.13578425, 0.11371653...","[[-0.05753900110721588, -0.018967999145388603,...","[0.41576690107051817, 1.0901237577199936, 0.07..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada,forest fire near la ronge sask. canada,"[-0.14923862, 0.020581482, -0.027608752, -0.11...","[-0.14923862, 0.020581482, -0.027608752, -0.11...","[[-0.4864000082015991, 0.07588999718427658, 0....","[-1.1782405631882804, 0.4126361182757786, 0.19..."
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...,all residents asked to 'shelter in place' are ...,"[-0.03266476, 0.051652886, -0.13747665, 0.0763...","[-0.03266476, 0.051652886, -0.13747665, 0.0763...","[[0.26405999064445496, 0.1720300018787384, 0.0...","[-0.1711904447187077, 0.5443525652993809, -1.2..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in ...,<number> people receive <hashtag> wildfires ev...,"[0.15625294, 0.017215075, -0.020121612, -0.078...","[0.15625294, 0.017215075, -0.020121612, -0.078...","[[0.41253000497817993, -0.3959699869155884, 0....","[0.48644813439912266, 1.4941825361715422, -1.2..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...,just got sent this photo from ruby <hashtag> a...,"[0.13052966, 0.008564685, -0.009015768, -0.093...","[0.13052966, 0.008564685, -0.009015768, -0.093...","[[-0.19265000522136688, 0.4586299955844879, -0...","[0.5162669516661588, 0.5474840612972484, -0.59..."
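
The `wordvec` column above is loaded ready-made from a pickle, so the averaging step itself is not shown in this notebook. A minimal sketch of one common approach, taking the mean of the embeddings of the tokens in a tweet, might look like the following. The toy `embeddings` dictionary, the 4-dimensional vectors, and the `average_wordvec` helper are hypothetical stand-ins for the GloVe lookup (presumably 200-dimensional, going by the file name).

```python
import numpy as np

# Hypothetical toy embedding table; the real one would be loaded from GloVe.
embeddings = {
    "forest": np.array([1.0, 0.0, 2.0, 0.0]),
    "fire": np.array([3.0, 2.0, 0.0, 0.0]),
}

def average_wordvec(text, embeddings, dim=4):
    """Mean of the embeddings of all known tokens; zeros if none are known."""
    vectors = [embeddings[tok] for tok in text.lower().split() if tok in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Only "forest" and "fire" are in the toy table, so the rest is ignored.
vec = average_wordvec("Forest fire near La Ronge", embeddings)
print(vec)
```

Tokens missing from the embedding table are skipped here; returning a zero vector when nothing matches is one possible convention, not necessarily the one used to build the pickles.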


# Train a model

In [38]:
import numpy
def get_X(df, col):
    # Collect the per-row vectors stored in column `col` into a list
    return df[col].tolist()

In [39]:
y = train['target']
X = get_X(train, 'wordvec')

In [40]:
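def prepare_submission(model, X, y, X_test, name):
    # Fit on the full training set, predict the test set, and write <name>.csv.
    # Note: the ids are read from the global `test` DataFrame.
    fit = model.fit(X, y)
    pred = model.predict(X_test)
    submission = pd.DataFrame({"id": test['id'], "target": pred})
    submission.to_csv(name + '.csv', index=False)
    return fit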

## Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Outer folds, used by cross_validate to score the tuned model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
lg = LogisticRegression(max_iter=1000)
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
# Inner 5-fold grid search over C, run inside each outer fold
clf = GridSearchCV(lg, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [42]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:    7.6s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:    7.1s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:    7.0s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:    6.7s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:    6.9s finished


{'fit_time': array([7.78051186, 7.32899547, 7.16335011, 6.84515262, 7.13647914]),
 'score_time': array([0.00642347, 0.00513864, 0.00473952, 0.00477815, 0.00473571]),
 'test_score': array([0.75061728, 0.72277228, 0.75307629, 0.74166667, 0.75493421]),
 'train_score': array([0.7552795 , 0.75857792, 0.7552795 , 0.76297968, 0.74968867])}

In [43]:
scores['test_score'].mean()

0.7446133460827962
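
The cells above pass a `GridSearchCV` object to `cross_validate`, so each of the five outer folds runs its own inner 5-fold grid search. That is why "Fitting 5 folds for each of 9 candidates" is printed five times. The sketch below reproduces the same nested pattern on synthetic data so that it runs on its own; the `make_classification` data, the shorter `C` grid, and the `*_demo` names are assumptions for illustration, not the tweet features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic stand-in for the averaged word vectors (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: choose C using only the training part of each outer fold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 4]},
    scoring="f1",
    cv=inner_cv,
)

# Outer loop: score the tuned model on data the search never saw.
nested = cross_validate(
    search, X_demo, y_demo, cv=outer_cv, scoring="f1", return_estimator=True
)

print(nested["test_score"].mean())
# The selected C may differ from one outer fold to the next.
print([est.best_params_["C"] for est in nested["estimator"]])
```

Because the hyperparameters are chosen separately inside every outer fold, the outer `test_score` estimates the performance of the whole tuning procedure rather than of one fixed value of `C`.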

## Neural net

In [44]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42, max_iter=2000)
params = {'hidden_layer_sizes': [(1,), (2,), (4,), (8,)],
          'learning_rate_init': [0.00005, 0.0001, 0.0002, 0.001]}
clf = GridSearchCV(mlp, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [45]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  36 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed:  2.3min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  36 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed:  2.3min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  36 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed:  2.4min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  36 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed:  2.3min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  36 tasks      | elapsed:   57.6s
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed:  2.2min finished


{'fit_time': array([145.00500083, 142.96130419, 146.76166844, 158.1616106 ,
        133.59070396]),
 'score_time': array([0.00425243, 0.01213694, 0.00424147, 0.00535369, 0.00420356]),
 'test_score': array([0.75900721, 0.73580247, 0.75562701, 0.73735726, 0.75873274]),
 'train_score': array([0.76525631, 0.76303414, 0.76533711, 0.77055756, 0.74923108])}

In [46]:
scores['test_score'].mean()

0.7493053363077024

In [47]:
X_test = get_X(test, 'wordvec')
fit = prepare_submission(clf, X, y, X_test, 'avg_wordvec_mlp')

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  36 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed:  2.5min finished


In [48]:
fit.best_params_

{'hidden_layer_sizes': (4,), 'learning_rate_init': 0.0001}

## SVM

In [49]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
svm = SVC(kernel="rbf")
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
clf = GridSearchCV(svm, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [50]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.4min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.4min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.4min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.3min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.4min finished


{'fit_time': array([85.51620579, 84.83172464, 85.25083637, 84.61296296, 84.95820141]),
 'score_time': array([0.71441388, 0.71049023, 0.78812814, 0.78455281, 0.72308707]),
 'test_score': array([0.75653595, 0.7458194 , 0.75294118, 0.74744027, 0.77025898]),
 'train_score': array([0.78905926, 0.79093932, 0.76708861, 0.76693937, 0.7849485 ])}

In [51]:
scores['test_score'].mean()

0.7545991551998315

In [52]:
X_test = get_X(test, 'wordvec')
fit = prepare_submission(clf, X, y, X_test, 'avg_wordvec')

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.1min finished


In [53]:
fit.best_params_

{'C': 1.5}

In [54]:
fit.cv_results_

{'mean_fit_time': array([20.85033793, 20.48873744, 16.8838593 , 16.09859061, 15.4869875 ,
        15.13233838, 14.25862041, 14.34904389, 11.32200117]),
 'std_fit_time': array([1.72599068, 0.74164895, 2.37157441, 1.36841514, 1.16602543,
        1.63879369, 1.51806699, 1.00803528, 2.12964137]),
 'mean_score_time': array([4.11234188, 4.0706604 , 3.34466949, 2.89260302, 2.57578387,
        2.89069328, 2.34909544, 2.21782842, 1.07016292]),
 'std_score_time': array([0.62465848, 0.07952023, 0.65589121, 0.51842836, 0.52976189,
        0.14779166, 0.65574025, 0.43018303, 0.35031141]),
 'param_C': masked_array(data=[0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.05},
  {'C': 0.1},
  {'C': 0.2},
  {'C': 0.5},
  {'C': 0.75},
  {'C': 1},
  {'C': 1.5},
  {'C': 2},
  {'C': 4}],
 'split0_test_score': array([0.64403492, 0.67741935, 0.
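
The raw `cv_results_` dictionary is hard to read. One option is to load it into a DataFrame and keep only the columns of interest. The sketch below uses a small grid search on synthetic data so that it is self-contained; with the notebook's objects, `fit.cv_results_` would take the place of `search.cv_results_`. The `*_demo` data and the shorter `C` grid are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the tweet features (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=0)

search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 4]}, scoring="f1", cv=5)
search.fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked value of C comes first.
results = pd.DataFrame(search.cv_results_)
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```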

## Train with keywords in text

In [21]:
y = train['target']
X = get_X(train, 'keyword_wordvec')

In [26]:
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   36.2s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   33.7s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   34.7s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   34.6s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   35.1s finished


{'fit_time': array([38.08809638, 35.52230072, 36.57976961, 36.54533672, 36.94821024]),
 'score_time': array([0.38468838, 0.37544727, 0.38675475, 0.36254406, 0.38588691]),
 'test_score': array([0.75603664, 0.74565037, 0.7589658 , 0.76348548, 0.77222693]),
 'train_score': array([0.77671119, 0.78043704, 0.77281309, 0.80423061, 0.77133825])}

In [27]:
scores['test_score'].mean()

0.7592730434622285

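This does not seem to help: the mean F1 of 0.759 is within fold-to-fold variation of the 0.755 obtained with the SVM on the plain averaged vectors.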

## Train TF-IDF version

In [28]:
y = train['target']
X = get_X(train, 'wordvec_tfidf')
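
The `wordvec_tfidf` column is also loaded ready-made, so the weighting is not shown here. A minimal sketch of one way such vectors could be computed, weighting each token's embedding by its IDF before averaging, is given below. The toy `texts`, the random 2-dimensional `embeddings`, and the `tfidf_weighted_wordvec` helper are hypothetical; the actual column may have been built differently (for example as a weighted sum without normalization).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["forest fire near town", "fire alarm test", "nice sunny day"]

# Fit only to obtain a vocabulary and IDF weights for the toy corpus.
tfidf = TfidfVectorizer().fit(texts)
idf = {term: tfidf.idf_[idx] for term, idx in tfidf.vocabulary_.items()}
tokenize = tfidf.build_analyzer()

# Hypothetical random 2-d embeddings in place of the GloVe vectors.
rng = np.random.default_rng(0)
embeddings = {term: rng.normal(size=2) for term in tfidf.vocabulary_}

def tfidf_weighted_wordvec(text, dim=2):
    """IDF-weighted mean of token embeddings.

    Each occurrence of a token contributes once, so term frequency
    enters through repetition.
    """
    tokens = [t for t in tokenize(text) if t in embeddings]
    if not tokens:
        return np.zeros(dim)
    weights = np.array([idf[t] for t in tokens])
    vectors = np.stack([embeddings[t] for t in tokens])
    return weights @ vectors / weights.sum()

print(tfidf_weighted_wordvec("forest fire"))
```

In this toy corpus "forest" occurs in fewer documents than "fire", so it receives the larger IDF weight and pulls the result towards its own embedding.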

In [29]:
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   35.5s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   34.3s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   35.9s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   35.5s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   35.7s finished


{'fit_time': array([37.41883588, 36.20017576, 37.91277766, 37.41486454, 37.63504076]),
 'score_time': array([0.37383294, 0.37622619, 0.37766647, 0.37400675, 0.38509059]),
 'test_score': array([0.75493421, 0.73710904, 0.75      , 0.75021313, 0.7628692 ]),
 'train_score': array([0.79299562, 0.77724325, 0.81613977, 0.79059117, 0.76744186])}

In [30]:
scores['test_score'].mean()

0.7510251164739314
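
To put the three feature variants side by side, the outer-fold F1 scores printed above can be summarised by their mean and standard deviation. The numbers below are copied from the SVM run on `wordvec` and from the `keyword_wordvec` and `wordvec_tfidf` runs; the `fold_scores` and `summary` names are introduced here only for illustration. The keyword and TF-IDF cells reuse whatever `clf` was defined when they were executed, so the comparison should be read as indicative rather than exact.

```python
import numpy as np

# Outer-fold F1 scores copied from the cross_validate outputs above.
fold_scores = {
    "wordvec": [0.75653595, 0.7458194, 0.75294118, 0.74744027, 0.77025898],
    "keyword_wordvec": [0.75603664, 0.74565037, 0.7589658, 0.76348548, 0.77222693],
    "wordvec_tfidf": [0.75493421, 0.73710904, 0.75, 0.75021313, 0.7628692],
}

# Mean and standard deviation across the five outer folds.
summary = {name: (np.mean(s), np.std(s)) for name, s in fold_scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name:16s} mean F1 = {mean:.4f}  std = {std:.4f}")
```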