# Classifier using averaged word vectors

## Reading data

In [31]:
import pandas as pd
train = pd.read_csv('train_cleaned.csv')
test = pd.read_csv('test_cleaned.csv')

In [32]:
wordvec_train = pd.read_pickle('train_wordvec.pickle')
wordvec_test = pd.read_pickle('test_wordvec.pickle')

In [33]:
train = train.merge(wordvec_train, on=['id'])
test = test.merge(wordvec_test, on=['id'])
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text,wordvec,keyword_wordvec,wordvec_tfidf
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...,"[-0.26623327, 0.05843069, -0.1404636, -0.05265...","[-0.26623327, 0.05843069, -0.1404636, -0.05265...","[-2.0410312242232838, 0.1577752003302941, -0.8..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada,"[-0.025449565, 0.031005142, -0.15566371, -0.23...","[-0.025449565, 0.031005142, -0.15566371, -0.23...","[-0.27185601989428204, 0.2042857458194097, -1...."
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...,"[0.0059339865, 0.016337818, -0.105279535, -0.0...","[0.0059339865, 0.016337818, -0.105279535, -0.0...","[0.07528745450756767, 0.11175614595413208, -0...."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in ...,"[-0.18147185, 0.20731743, 0.014147284, -0.2182...","[-0.18147185, 0.20731743, 0.014147284, -0.2182...","[-1.3403782035623277, 1.2000715562275477, 0.11..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...,"[-0.06394094, -0.01423019, 0.0063574947, 0.071...","[-0.06394094, -0.01423019, 0.0063574947, 0.071...","[-0.7245167245467504, -0.364056259393692, 0.52..."
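
The merges above assume that each `id` appears exactly once in both frames. As a minimal sketch on made-up toy frames (the data and the names `tweets`, `vectors` are illustrative, not from the dataset), pandas can check that assumption through the `validate` argument of `merge`:

```python
import pandas as pd

# Toy stand-ins for the cleaned tweets and the precomputed vectors (illustrative data only).
tweets = pd.DataFrame({'id': [1, 4], 'text': ['forest fire', 'nice day']})
vectors = pd.DataFrame({'id': [1, 4], 'wordvec': [[0.1, 0.2], [0.3, 0.4]]})

# validate='one_to_one' raises pandas.errors.MergeError if an id is duplicated on either side.
merged = tweets.merge(vectors, on='id', validate='one_to_one')

# An inner merge silently drops ids missing from either frame, so compare row counts too.
rows_kept = len(merged) == len(tweets)
```

If `rows_kept` were false on the real data, some tweets would have no vector and would be missing from the model input.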


# Train a model

In [13]:
import numpy
def get_X(df, col):
    # Collect the per-row vectors stored in column `col` into a plain list.
    return df[col].tolist()
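
`get_X` returns a list of per-tweet vectors, which scikit-learn converts to a 2-D array internally. As a hedged sketch, assuming every cell holds a vector of the same length (the toy frame below is made up), the same matrix can be built explicitly so its shape can be inspected before training:

```python
import numpy as np
import pandas as pd

# Toy frame with one fixed-length vector per row (illustrative values only).
toy = pd.DataFrame({'wordvec': [np.array([0.1, 0.2, 0.3]),
                                np.array([0.4, 0.5, 0.6])]})

# Stack the row vectors into an (n_samples, n_features) matrix.
X_matrix = np.vstack(toy['wordvec'].tolist())
print(X_matrix.shape)
```

`np.vstack` fails if the vectors have different lengths, which makes it a cheap sanity check on the pickled embeddings.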

In [14]:
y = train['target']
X = get_X(train, 'wordvec')

In [15]:
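def prepare_submission(model, X, y, X_test, name):
    # Fit on the full training set, predict the test set, and write <name>.csv.
    # Note: the ids come from the global `test` DataFrame, so X_test must be built from it.
    fit = model.fit(X, y)
    pred = fit.predict(X_test)
    submission = pd.DataFrame({"id": test['id'], "target": pred})
    submission.to_csv(name + '.csv', index=False)
    return fit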
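
`prepare_submission` reads the ids from the global `test` frame. A possible variant, sketched here with a hypothetical helper `build_submission` that is not part of the notebook, takes the ids as an argument and checks that they line up with the predictions:

```python
import pandas as pd

def build_submission(ids, predictions):
    """Hypothetical helper: pair each test id with its predicted label."""
    if len(ids) != len(predictions):
        raise ValueError('ids and predictions must have the same length')
    return pd.DataFrame({'id': list(ids), 'target': list(predictions)})

# Made-up ids and labels, only to show the resulting layout.
demo = build_submission([0, 2, 3], [1, 0, 1])
print(demo)
```

The returned frame can then be written with `to_csv(..., index=False)` as in the cell above.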

## Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Outer folds, used by cross_validate to estimate the F1 score on unseen data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
lg = LogisticRegression(max_iter=1000)
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
# Inner 5-fold grid search over C, rerun on each outer training split.
clf = GridSearchCV(lg, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)
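
Every model in this notebook is selected and evaluated with `scoring="f1"`. As a reminder of what that number means, here is a small pure-Python sketch of the binary F1 score (the helper name `f1_binary` and the labels are made up for illustration; scikit-learn's own implementation handles more cases):

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 2 true positives, 1 false positive, 1 false negative.
example = f1_binary([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(example)
```

Because F1 ignores true negatives, it rewards finding the positive class (`target == 1`) rather than overall accuracy.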

In [17]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   13.4s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   12.4s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   11.8s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   12.2s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   12.4s finished


{'fit_time': array([13.57049251, 12.52569532, 11.93298197, 12.30078721, 12.53057289]),
 'score_time': array([0.00529885, 0.00395751, 0.00391126, 0.00404596, 0.00413728]),
 'test_score': array([0.75060926, 0.74318745, 0.74896437, 0.73701843, 0.7437604 ]),
 'train_score': array([0.76548307, 0.75876289, 0.7590636 , 0.74533389, 0.74801836])}

In [18]:
scores['test_score'].mean()

0.7447079816861266
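
Because `clf` is itself a `GridSearchCV`, the `cross_validate` call above is a nested cross-validation: each outer training split runs its own inner grid search, and the outer test split scores the refitted best model. The sketch below reproduces that structure on synthetic data (the `make_classification` data, the smaller `C` grid, and all `*_demo` names are assumptions made to keep the example fast) and uses `return_estimator=True` to see which `C` each outer fold picked:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic stand-in for the word-vector features (not the tweet data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: choose C by 3-fold grid search on each outer training split.
inner_demo = GridSearchCV(LogisticRegression(max_iter=1000),
                          {'C': [0.1, 1, 10]}, scoring='f1', cv=3)

# Outer loop: score the tuned model on folds the grid search never saw.
outer_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_demo = cross_validate(inner_demo, X_demo, y_demo, cv=outer_demo,
                             scoring='f1', return_estimator=True)

# One fitted GridSearchCV per outer fold; the chosen C can differ between folds.
chosen_C = [est.best_params_['C'] for est in scores_demo['estimator']]
print(chosen_C, scores_demo['test_score'].mean())
```

The mean of `test_score` therefore estimates the performance of the whole tuning procedure, not of one fixed hyperparameter setting.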

## Neural net

In [19]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42, max_iter=2000)
params = {'hidden_layer_sizes': [(1,), (2,), (3,),(4,)], 
          'learning_rate_init':[0.00005, 0.0001, 0.0002, 0.001]}
clf = GridSearchCV(mlp, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [20]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 13.0min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  9.5min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 15.8min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 11.8min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  7.7min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 12.1min finished


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 12.2min finished


{'fit_time': array([822.91087174, 967.33702207, 749.20637155, 762.30264521,
        760.28467965]),
 'score_time': array([0.00631166, 0.00498843, 0.00607467, 0.00612807, 0.00613856]),
 'test_score': array([0.75990676, 0.74122449, 0.76862124, 0.76243094, 0.77127244]),
 'train_score': array([0.7948011 , 0.76094965, 0.79135264, 0.79063361, 0.80323025])}

In [21]:
scores['test_score'].mean()

0.7606911736931822

In [22]:
X_test = get_X(test, 'wordvec')
fit = prepare_submission(clf, X, y, X_test, 'avg_wordvec_mlp')

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-2)]: Done  80 out of  80 | elapsed: 12.9min finished


In [23]:
fit.best_params_

{'hidden_layer_sizes': (2,), 'learning_rate_init': 5e-05}

## SVM

In [24]:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svm = SVC(kernel="rbf")
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
clf = GridSearchCV(svm, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [25]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.8min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.3min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.9min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.9min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  4.3min finished


{'fit_time': array([176.71431994, 213.27070498, 249.02646804, 247.91531539,
        272.18712544]),
 'score_time': array([2.10446692, 3.02242684, 3.04964161, 3.38860488, 3.06241655]),
 'test_score': array([0.75671277, 0.75603664, 0.7769296 , 0.75407027, 0.77380952]),
 'train_score': array([0.8010865 , 0.80639058, 0.803938  , 0.76779265, 0.81434865])}

In [26]:
scores['test_score'].mean()

0.7635117603110111

In [27]:
X_test = get_X(test, 'wordvec')
fit = prepare_submission(clf, X, y, X_test, 'avg_wordvec')

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


KeyboardInterrupt: 

In [None]:
fit.best_params_

In [None]:
fit.cv_results_

## Train with keywords in text

In [None]:
y = train['target']
X = get_X(train, 'keyword_wordvec')

In [None]:
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

In [None]:
scores['test_score'].mean()

This seems to work worse than the plain averaged word vectors.

# Train TF-IDF version

In [34]:
y = train['target']
X = get_X(train, 'wordvec_tfidf')

In [35]:
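# clf is whichever grid search was defined last (apparently the SVM one, given the 9 candidates in the log).
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores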

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.4min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.8min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.0min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.9min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  4.0min finished


{'fit_time': array([153.62093925, 177.85225725, 195.45866704, 252.21644998,
        254.44740701]),
 'score_time': array([2.17354941, 2.14944482, 3.19811702, 3.40387273, 3.68649626]),
 'test_score': array([0.75367047, 0.75798319, 0.76534903, 0.74721508, 0.74588031]),
 'train_score': array([0.85323383, 0.81629225, 0.83492723, 0.76601196, 0.74989257])}

In [36]:
scores['test_score'].mean()

0.7540196185590924
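
To compare the runs side by side, the per-fold test scores printed above can be summarised with the standard library. The values are copied from the outputs in this notebook; the dictionary labels are my own, and the last label assumes `clf` was still the SVM grid search when the TF-IDF cell ran, which the 9-candidate log and the fit times suggest:

```python
from statistics import mean, stdev

# Per-fold test F1 scores copied from the cross_validate outputs above.
test_scores = {
    'logistic regression (wordvec)': [0.75060926, 0.74318745, 0.74896437, 0.73701843, 0.7437604],
    'MLP (wordvec)': [0.75990676, 0.74122449, 0.76862124, 0.76243094, 0.77127244],
    'SVM (wordvec)': [0.75671277, 0.75603664, 0.7769296, 0.75407027, 0.77380952],
    'SVM (wordvec_tfidf)': [0.75367047, 0.75798319, 0.76534903, 0.74721508, 0.74588031],
}

# Mean and sample standard deviation across the five outer folds.
summary = {name: (mean(s), stdev(s)) for name, s in test_scores.items()}
best = max(summary, key=lambda name: summary[name][0])

for name, (m, s) in summary.items():
    print(f'{name}: mean F1 = {m:.4f}, std = {s:.4f}')
print('highest mean:', best)
```

The SVM on the plain averaged vectors has the highest mean, but its fold-to-fold spread is larger than its lead over the MLP, so with only five folds that ranking should be read cautiously.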