# Models using just metadata

## Loading data

In [50]:
import pandas as pd
train_tree = pd.read_csv('train_metafeatures_tree.csv')
test_tree = pd.read_csv('test_metafeatures_tree.csv')
train_normalized = pd.read_csv('train_metafeatures_normalized.csv')
test_normalized = pd.read_csv('test_metafeatures_normalized.csv')

In [51]:
for df in [train_tree, train_normalized]:
    df.drop(columns='id', inplace=True)

In [52]:
train_tree.head()

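| Unnamed: 0 | num_char | num_words | num_hash | num_mention | num_url | has_location | geocoded | longitude_t | latitude_t | target |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | 13 | 1 | 0 | 0 | False | False | 1000.0 | 1000.0 | 1 |
| 1 | 38 | 7 | 0 | 0 | 0 | False | False | 1000.0 | 1000.0 | 1 |
| 2 | 133 | 22 | 0 | 0 | 0 | False | False | 1000.0 | 1000.0 | 1 |
| 3 | 56 | 7 | 1 | 0 | 0 | False | False | 1000.0 | 1000.0 | 1 |
| 4 | 85 | 16 | 2 | 0 | 0 | False | False | 1000.0 | 1000.0 | 1 |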


# Train a random forest

In [92]:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [93]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=-2, random_state=42)
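
The forest and the grid searches in this notebook all pass `n_jobs=-2`. joblib documents negative values as counting back from the number of CPUs: `-1` uses all of them, and for values below `-1` it uses `n_cpus + 1 + n_jobs` workers. The sketch below re-implements that documented rule for illustration only; `effective_workers` is a hypothetical helper, not a joblib function. With 4 CPUs it gives 3, which would match the `3 concurrent workers` reported in the logs later in this notebook, assuming the machine exposed 4 cores.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Hypothetical helper mirroring joblib's documented n_jobs rule."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on (never below 1).
        return max(n_cpus + 1 + n_jobs, 1)
    # Positive values are taken as given.
    return n_jobs


print(effective_workers(-2, n_cpus=4))
print(effective_workers(-1, n_cpus=4))
```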

In [94]:
y = train_tree['target']
X = train_tree.drop(columns=['target'])

In [95]:
from sklearn.model_selection import cross_validate
scores = cross_validate(rf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

{'fit_time': array([1.12008572, 0.36800647, 0.36623549, 0.36667299, 0.35955834]),
 'score_time': array([0.0508945 , 0.03678083, 0.0380652 , 0.03521657, 0.03545141]),
 'test_score': array([0.54736842, 0.5562701 , 0.56279809, 0.56384743, 0.56648308]),
 'train_score': array([0.96391153, 0.96195652, 0.96306376, 0.95970411, 0.96394833])}

In [96]:
scores['test_score'].mean()

0.5593534246860525
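
The fold scores above show a large gap between train and test F1, the usual sign of a forest that fits the training folds far better than it generalises. A small sketch that quantifies the gap from the printed numbers (the values are copied from the output above; `rf_scores` is an illustrative name):

```python
from statistics import mean

# Fold scores copied from the cross_validate output for the random forest.
rf_scores = {
    "test_score": [0.54736842, 0.5562701, 0.56279809, 0.56384743, 0.56648308],
    "train_score": [0.96391153, 0.96195652, 0.96306376, 0.95970411, 0.96394833],
}

# Per-fold difference between train and test F1.
gaps = [tr - te for tr, te in zip(rf_scores["train_score"], rf_scores["test_score"])]

print(f"mean train F1: {mean(rf_scores['train_score']):.3f}")
print(f"mean test F1:  {mean(rf_scores['test_score']):.3f}")
print(f"mean gap:      {mean(gaps):.3f}")
```

The mean gap comes out at about 0.40. Constraining the trees, for example through `max_depth` or `min_samples_leaf` on `RandomForestClassifier`, is one common way to narrow such a gap; whether it also lifts the test score here is untested.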

# Train a boosted tree

In [97]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV
params = {'max_depth': [1,2,3,4], 'n_estimators':[5,10,20,40,80]}
lg = LGBMClassifier(n_jobs=-2, random_state=42)
clf = GridSearchCV(lg, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [98]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done 100 out of 100 | elapsed:    1.9s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-2)]: Done 100 out of 100 | elapsed:    1.4s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-2)]: Done 100 out of 100 | elapsed:    1.5s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-2)]: Done 100 out of 100 | elapsed:    1.5s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-2)]: Done 100 out of 100 | elapsed:    1.5s finished


{'fit_time': array([1.91247272, 1.50768566, 1.4883604 , 1.54557109, 1.49960446]),
 'score_time': array([0.00682545, 0.00970888, 0.00679851, 0.00685954, 0.00674033]),
 'test_score': array([0.58208955, 0.57860616, 0.59475219, 0.57274119, 0.56855151]),
 'train_score': array([0.57967138, 0.64244898, 0.58733421, 0.58622631, 0.58333333])}

In [99]:
scores['test_score'].mean()

0.5793481205209567
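
Passing a `GridSearchCV` object to `cross_validate` gives nested cross-validation: each outer fold runs its own inner 5-fold search, which is why the log prints `Fitting 5 folds for each of 20 candidates` five times. The returned scores do not show which parameters each search selected. A minimal sketch of how to retrieve them with `return_estimator=True` follows. It uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in, so that it runs without LightGBM or the CSV files; both substitutions are assumptions for illustration, not this notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic stand-in for the metadata features (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner search: a reduced grid over the same two parameters, with a stand-in model.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    {"max_depth": [1, 2], "n_estimators": [5, 10]},
    scoring="f1",
    cv=5,
)

# return_estimator=True keeps the fitted GridSearchCV from every outer fold.
nested = cross_validate(
    search, X_demo, y_demo, cv=outer_cv, scoring="f1", return_estimator=True
)

for fold, fitted in enumerate(nested["estimator"]):
    print(fold, fitted.best_params_, round(fitted.best_score_, 3))
```

If the selected parameters differ a lot between folds, the search is not settling on a stable configuration.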

# Train an SVM

In [102]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svm = SVC(kernel="rbf", random_state=42)
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
clf = GridSearchCV(svm, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [103]:
y = train_normalized['target']
X = train_normalized.drop(columns='target')

In [104]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   18.0s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   17.3s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   19.8s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   20.1s finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:   20.0s finished


{'fit_time': array([19.25720501, 18.51489115, 21.28129125, 21.57276535, 21.60879803]),
 'score_time': array([0.16554523, 0.16215897, 0.2002027 , 0.20133495, 0.1992209 ]),
 'test_score': array([0.59780908, 0.58418168, 0.59242424, 0.58563536, 0.58751903]),
 'train_score': array([0.5972571 , 0.59480724, 0.59891389, 0.59439614, 0.60299981])}

In [106]:
scores['test_score'].mean()

0.5895138759800866
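
To put the three models side by side, the sketch below recomputes the mean and sample standard deviation of the test F1 from the fold scores printed in this notebook. The dictionary keys are labels chosen here for illustration.

```python
from statistics import mean, stdev

# Test-fold F1 scores copied from the three cross_validate outputs above.
test_f1 = {
    "random forest": [0.54736842, 0.5562701, 0.56279809, 0.56384743, 0.56648308],
    "LightGBM": [0.58208955, 0.57860616, 0.59475219, 0.57274119, 0.56855151],
    "SVM (RBF)": [0.59780908, 0.58418168, 0.59242424, 0.58563536, 0.58751903],
}

# Rank the models by mean test F1, best first.
ranking = sorted(test_f1, key=lambda name: mean(test_f1[name]), reverse=True)

for name in ranking:
    folds = test_f1[name]
    print(f"{name:<14} {mean(folds):.4f} +/- {stdev(folds):.4f}")
```

The SVM ranks first, but the models are only about 0.01 to 0.03 F1 apart, so the per-fold spread is worth checking before treating the ordering as settled.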