# Classifier using average wordvecs

## Reading data

In [22]:
import pandas as pd
train = pd.read_csv('train_cleaned.csv')
test = pd.read_csv('test_cleaned.csv')

In [23]:
wordvec_train = pd.read_pickle('train_wordvec.pickle')
wordvec_test = pd.read_pickle('test_wordvec.pickle')
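
The pickles loaded above hold one precomputed vector per tweet; how they were produced is not shown in this notebook. As a rough illustration only, averaging per-token vectors could look like the sketch below. The embedding table `toy_embeddings` and the helper `average_wordvec` are hypothetical names, and real embeddings would come from a pretrained model with many more dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table, for illustration only.
toy_embeddings = {
    "forest": np.array([1.0, 0.0, 2.0]),
    "fire": np.array([3.0, 2.0, 0.0]),
}

def average_wordvec(text, embeddings, dim=3):
    """Average the vectors of all known tokens; return zeros if none are known."""
    tokens = text.lower().split()
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Only "forest" and "fire" are in the table, so the result is their mean.
vec = average_wordvec("Forest fire near La Ronge", toy_embeddings)
print(vec)
```

Tokens missing from the table are skipped here; whether the original pipeline did the same is an assumption.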

In [24]:
train = train.merge(wordvec_train, on=['id'])
test = test.merge(wordvec_test, on=['id'])
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text,wordvec,keyword_wordvec
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...,"[-0.26623327, 0.05843069, -0.1404636, -0.05265...","[-0.26623327, 0.05843069, -0.1404636, -0.05265..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada,"[-0.025449565, 0.031005142, -0.15566371, -0.23...","[-0.025449565, 0.031005142, -0.15566371, -0.23..."
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...,"[0.0059339865, 0.016337818, -0.105279535, -0.0...","[0.0059339865, 0.016337818, -0.105279535, -0.0..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in ...,"[-0.18147185, 0.20731743, 0.014147284, -0.2182...","[-0.18147185, 0.20731743, 0.014147284, -0.2182..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...,"[-0.06394094, -0.01423019, 0.0063574947, 0.071...","[-0.06394094, -0.01423019, 0.0063574947, 0.071..."
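
`merge` defaults to an inner join, so any `id` missing from the word-vector frame would be dropped without warning. A minimal sketch of a safer merge, using small made-up frames (`tweets` and `vectors` are stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for the cleaned tweets and the precomputed vectors.
tweets = pd.DataFrame({"id": [1, 4, 5], "target": [1, 1, 0]})
vectors = pd.DataFrame({"id": [1, 4, 5], "wordvec": [[0.1], [0.2], [0.3]]})

# validate="one_to_one" raises MergeError if an id is duplicated on either side.
merged = tweets.merge(vectors, on="id", how="inner", validate="one_to_one")

# An inner merge silently drops unmatched ids, so compare row counts.
assert len(merged) == len(tweets)
print(merged.shape)
```

The same two checks could be applied to the `train` and `test` merges in this notebook.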


## Train a model

In [25]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svm = SVC(kernel="rbf")
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
clf = GridSearchCV(svm, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [30]:
import numpy as np

def get_X(df, col):
    # Stack the per-row vectors stored in `col` into one 2-D feature matrix.
    # Assumes every vector in the column has the same length.
    return np.stack(df[col].to_numpy())

In [31]:
y = train['target']
X = get_X(train, 'wordvec')

In [32]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=3, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.4min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.5min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.9min finished


{'fit_time': array([ 91.87541389,  99.34250069, 119.07736921]),
 'score_time': array([2.37854099, 2.86782217, 3.06027579]),
 'test_score': array([0.73401397, 0.7159035 , 0.78023033]),
 'train_score': array([0.79595449, 0.81373044, 0.80443548])}

In [36]:
scores['test_score'].mean()

0.7433825976979325
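
Because `clf` is itself a `GridSearchCV`, passing it to `cross_validate` gives nested cross-validation: each of the 3 outer folds reruns the full inner 5-fold search, which is why the log above shows three rounds of 45 fits. The sketch below reproduces that structure on synthetic data (`X_demo`, `y_demo` and the shortened `C` grid are placeholders, not the notebook's data) and also shows which `C` each outer fold selected.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the word-vector features.
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

# Inner loop: 5-fold grid search over C.
inner = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 4]}, scoring="f1", cv=5)

# Outer loop: 3 folds. Each outer fit reruns the inner search on that
# fold's training portion only, so the outer test score is not tuned on.
outer = cross_validate(inner, X_demo, y_demo, cv=3, scoring="f1",
                       return_estimator=True)

# The best C may differ from one outer fold to the next.
chosen_C = [est.best_params_["C"] for est in outer["estimator"]]
print(outer["test_score"], chosen_C)
```

The outer test scores estimate how the whole tune-then-fit procedure generalizes, rather than how one fixed `C` does.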

In [37]:
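def prepare_submission(model, X, y, X_test, name):
    # Refit on the full training set; for a GridSearchCV this reruns the search.
    fit = model.fit(X, y)
    pred = model.predict(X_test)
    # Ids come from the global `test` frame, which must align row-wise with X_test.
    submission = pd.DataFrame({"id": test['id'], "target": pred})
    submission.to_csv(name + '.csv', index=False)
    return fit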

In [38]:
X_test = get_X(test, 'wordvec')
fit = prepare_submission(clf, X, y, X_test, 'avg_wordvec')

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  5.4min finished


In [39]:
fit.best_params_

{'C': 1}

In [40]:
fit.cv_results_

{'mean_fit_time': array([16.57404799, 15.49016814, 17.34904728, 18.14531527, 18.34043355,
        17.38084989, 17.34763112, 17.236094  , 17.37177205]),
 'std_fit_time': array([1.70725989, 2.04373951, 3.1923638 , 2.06990052, 1.85025057,
        2.23644114, 1.49116081, 1.61724634, 1.98172452]),
 'mean_score_time': array([3.41077385, 3.93212471, 3.9233232 , 3.84473891, 3.86545863,
        3.72865171, 3.91775556, 3.70918512, 3.28557196]),
 'std_score_time': array([0.37195983, 0.54553463, 0.48637574, 0.35719485, 0.49229548,
        0.54934523, 0.45458584, 0.54571095, 0.39215836]),
 'param_C': masked_array(data=[0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.05},
  {'C': 0.1},
  {'C': 0.2},
  {'C': 0.5},
  {'C': 0.75},
  {'C': 1},
  {'C': 1.5},
  {'C': 2},
  {'C': 4}],
 'split0_test_score': array([0.71715328, 0.72760181, 0.
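
The raw `cv_results_` dictionary is hard to scan. One option is to load it into a DataFrame and keep only the score columns. The sketch below does this on synthetic data with a shortened grid (`X_demo`, `y_demo` and `search` are placeholders); the column names are the standard ones scikit-learn emits for a single unnamed scorer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the word-vector features.
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 4]}, scoring="f1", cv=5)
search.fit(X_demo, y_demo)

# One row per candidate; keep the parameter and its aggregated scores.
results = pd.DataFrame(search.cv_results_)
summary = results[["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
summary = summary.sort_values("rank_test_score")
print(summary)
```

In this notebook the same call would be `pd.DataFrame(fit.cv_results_)`.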

## Train with keywords in text

In [41]:
y = train['target']
X = get_X(train, 'keyword_wordvec')

In [42]:
scores = cross_validate(clf, X, y, cv=3, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.8min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.4min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.9min finished


{'fit_time': array([113.82914662, 153.21824074, 184.55609488]),
 'score_time': array([3.05706263, 4.68825293, 4.5761528 ]),
 'test_score': array([0.7222523 , 0.70741483, 0.75709476]),
 'train_score': array([0.80109534, 0.78106808, 0.7917712 ])}

In [43]:
scores['test_score'].mean()

0.7289206292609237

Adding the keywords to the text seems to work slightly worse: the mean test F1 drops from about 0.743 to about 0.729.
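
To put the two runs side by side, the outer-fold test scores printed earlier can be compared fold by fold. With an integer `cv` and a classifier, scikit-learn uses unshuffled stratified folds, so both runs should share the same splits. The sketch copies the printed scores by hand; with only three folds, the gap is an indication, not a significance test.

```python
import numpy as np

# Outer-fold test F1 scores copied from the two cross_validate outputs.
text_only = np.array([0.73401397, 0.7159035, 0.78023033])
with_keywords = np.array([0.7222523, 0.70741483, 0.75709476])

# Positive values mean the text-only vectors scored higher on that fold.
diff = text_only - with_keywords
print("mean F1, text only:    ", text_only.mean())
print("mean F1, with keywords:", with_keywords.mean())
print("per-fold difference:   ", diff)
```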