# Classifier using averaged word vectors

## Reading data

In [1]:
import pandas as pd
train = pd.read_csv('train_cleaned.csv')
test = pd.read_csv('test_cleaned.csv')

In [2]:
wordvec_train = pd.read_pickle('train_wordvec.pickle')
wordvec_test = pd.read_pickle('test_wordvec.pickle')

In [3]:
train = train.merge(wordvec_train, on=['id'])
test = test.merge(wordvec_test, on=['id'])
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text,wordvec,keyword_wordvec
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...,"[-0.26623327, 0.05843069, -0.1404636, -0.05265...","[-0.26623327, 0.05843069, -0.1404636, -0.05265..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada,"[-0.025449565, 0.031005142, -0.15566371, -0.23...","[-0.025449565, 0.031005142, -0.15566371, -0.23..."
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...,"[0.0059339865, 0.016337818, -0.105279535, -0.0...","[0.0059339865, 0.016337818, -0.105279535, -0.0..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in ...,"[-0.18147185, 0.20731743, 0.014147284, -0.2182...","[-0.18147185, 0.20731743, 0.014147284, -0.2182..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...,"[-0.06394094, -0.01423019, 0.0063574947, 0.071...","[-0.06394094, -0.01423019, 0.0063574947, 0.071..."


# Train a model

In [5]:
def get_X(df, col):
    # Collect the per-row vectors stored in `col` into a list (one vector per sample).
    return df[col].tolist()
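
`get_X` returns a Python list of per-row vectors, which scikit-learn has to convert to a 2-D array on every fit. A minimal sketch of doing that conversion once with `np.vstack`, assuming each row holds a 1-D NumPy array of the same length. The toy frame, its 4-dimensional vectors, and the names `toy` and `X_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame: one fixed-length vector per row (illustrative values).
toy = pd.DataFrame({
    "id": [1, 4, 5],
    "wordvec": [np.zeros(4), np.ones(4), np.full(4, 0.5)],
})

# Stack the per-row vectors into a single (n_samples, n_features) array.
X_demo = np.vstack(toy["wordvec"].tolist())
print(X_demo.shape)
```

Checking the shape this way also catches rows whose vector is missing or has a different length, because `np.vstack` raises an error on them.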

In [6]:
y = train['target']
X = get_X(train, 'wordvec')

In [9]:
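def prepare_submission(model, X, y, X_test, name):
    # Fit on the full training set, predict the test set, and write <name>.csv.
    # Note: the ids come from the global `test` DataFrame, not from an argument.
    fit = model.fit(X,y)
    pred = model.predict(X_test)
    submission = pd.DataFrame({"id":test['id'], "target":pred})
    submission.to_csv(name+'.csv', index=False)
    return fit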

## Logistic Regression

In [19]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
lg = LogisticRegression(max_iter = 1000)
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
clf = GridSearchCV(lg, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [20]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   12.8s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   12.1s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   13.4s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   13.4s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   13.8s finished


{'fit_time': array([12.99856138, 12.27927947, 13.53545332, 13.56040645, 13.93444705]),
 'score_time': array([0.00516677, 0.00426793, 0.00454569, 0.00490212, 0.00448918]),
 'test_score': array([0.74739374, 0.75060533, 0.74285714, 0.74110835, 0.74024896]),
 'train_score': array([0.76376989, 0.76579322, 0.75365302, 0.75399099, 0.75061728])}

In [23]:
scores['test_score'].mean()

0.7444427062777617
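
The two cells above nest a grid search inside an outer cross-validation. Each of the five outer folds runs its own 45-fit search, which is why the "Fitting 5 folds for each of 9 candidates" message appears five times. The reported F1 therefore estimates the whole tune-then-fit procedure rather than one fixed `C`. A small sketch of the same pattern on synthetic data, so it runs without the tweet files; the dataset, the reduced grid, and the names `X_demo`, `y_demo`, `search` and `nested` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic stand-in for the word-vector features (not the tweet data).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)   # tunes C
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # scores the tuned model

search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 4]},
                      scoring="f1", cv=inner_cv)

# return_estimator=True keeps the fitted search from each outer fold,
# so the C chosen per fold can be inspected afterwards.
nested = cross_validate(search, X_demo, y_demo, cv=outer_cv,
                        scoring="f1", return_estimator=True)

print(nested["test_score"].mean())
print([est.best_params_["C"] for est in nested["estimator"]])
```

If the chosen `C` varies a lot between outer folds, that suggests the score is not very sensitive to it within the grid.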

## Neural net

In [75]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42, max_iter=2000)
params = {'hidden_layer_sizes': [(1,), (2,), (3,),(4,)], 
          'learning_rate_init':[0.00005, 0.0001, 0.0002, 0.001]}
clf = GridSearchCV(mlp, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [76]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  8.6min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 13.2min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  9.7min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 16.2min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed: 10.9min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 17.8min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed: 10.9min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 17.8min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 17.5min finished


{'fit_time': array([ 837.6000452 , 1009.73397112, 1110.70215297, 1113.97717261,
        1078.55074215]),
 'score_time': array([0.00651336, 0.00806093, 0.0085547 , 0.0092454 , 0.00493741]),
 'test_score': array([0.75057915, 0.74437299, 0.77102804, 0.75625   , 0.76850394]),
 'train_score': array([0.79772638, 0.76150121, 0.80132838, 0.80054592, 0.794077  ])}

In [77]:
scores['test_score'].mean()

0.75814682306478

In [78]:
X_test = get_X(test, 'wordvec')
fit = prepare_submission(clf, X, y, X_test, 'avg_wordvec_mlp')

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  8.2min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 13.5min finished


In [79]:
fit.best_params_

{'hidden_layer_sizes': (2,), 'learning_rate_init': 5e-05}

## SVM

In [4]:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svm = SVC(kernel="rbf")
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
clf = GridSearchCV(svm, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [7]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.2min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.7min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.1min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.8min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.9min finished


{'fit_time': array([139.44666839, 170.67102933, 201.17410183, 243.69189501,
        246.78926659]),
 'score_time': array([2.11829853, 2.0863986 , 3.07872605, 3.41884947, 3.04222083]),
 'test_score': array([0.75671277, 0.75603664, 0.7769296 , 0.75407027, 0.77380952]),
 'train_score': array([0.8010865 , 0.80639058, 0.803938  , 0.76779265, 0.81434865])}

In [8]:
scores['test_score'].mean()

0.7635117603110111

In [10]:
X_test = get_X(test, 'wordvec')
fit = prepare_submission(clf, X, y, X_test, 'avg_wordvec')

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  6.0min finished


In [11]:
fit.best_params_

{'C': 1}

In [12]:
fit.cv_results_

{'mean_fit_time': array([22.59748025, 22.11816201, 19.43488131, 18.39031053, 18.71541686,
        17.75575438, 17.6215004 , 17.82478037, 17.04787793]),
 'std_fit_time': array([1.10509681, 1.99860621, 2.01556424, 2.11070107, 1.62980224,
        2.12102132, 1.94484426, 1.77811599, 1.63469602]),
 'mean_score_time': array([5.22519345, 5.20064754, 4.45583544, 3.79770932, 4.19909329,
        3.91350923, 3.81123333, 3.63650408, 3.51941271]),
 'std_score_time': array([0.87183819, 0.49655271, 0.61352098, 0.53385554, 0.13889082,
        0.50341953, 0.49758005, 0.49704975, 0.52774049]),
 'param_C': masked_array(data=[0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.05},
  {'C': 0.1},
  {'C': 0.2},
  {'C': 0.5},
  {'C': 0.75},
  {'C': 1},
  {'C': 1.5},
  {'C': 2},
  {'C': 4}],
 'split0_test_score': array([0.71715328, 0.72760181, 0.

## Train with keywords in text

In [13]:
y = train['target']
X = get_X(train, 'keyword_wordvec')

In [14]:
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.9min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  4.0min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  4.1min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.0min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.6min finished


{'fit_time': array([250.63300896, 256.58355188, 255.80485058, 193.7976675 ,
        229.65453267]),
 'score_time': array([3.1427393 , 3.44542146, 1.46802878, 2.34649563, 3.17833591]),
 'test_score': array([0.76585366, 0.75124378, 0.76436303, 0.7589658 , 0.77337826]),
 'train_score': array([0.79775748, 0.77831074, 0.76984127, 0.77293674, 0.78860944])}

In [15]:
scores['test_score'].mean()

0.7627609079617144

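Adding the keywords does not seem to help: the mean F1 is 0.7628, marginally below the 0.7635 obtained by the SVM on the text vectors alone.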
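
To put the gap in context, the per-fold F1 scores printed above for the two SVM runs can be compared fold by fold. Both runs use the same `cv` splitter with a fixed seed and the same `y`, so the folds should match. A rough sketch using the printed numbers; the variable names are illustrative, and five folds are too few for a formal significance test.

```python
import numpy as np

# Per-fold test F1 copied from the outputs above.
svm_text = np.array([0.75671277, 0.75603664, 0.7769296, 0.75407027, 0.77380952])
svm_keyword = np.array([0.76585366, 0.75124378, 0.76436303, 0.7589658, 0.77337826])

# Paired difference per fold (keyword run minus text-only run).
diff = svm_keyword - svm_text
print("per-fold difference:", np.round(diff, 4))
print("mean difference:", diff.mean())
print("std of differences:", diff.std())
```

The mean difference is much smaller in magnitude than the spread across folds, so on this evidence the two feature sets perform about the same.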