In this notebook I compare two ways of handling missing values: a model trained on data where NA values are filled with central tendencies, and a model trained on data where NA values are filled by helper models. I'll also look at how SMOTE affects performance.

In [1]:
import pandas as pd
import preprocess as p
from sklearn.svm import SVC

In [2]:
data = pd.read_csv(r'data/train.csv')
proc = p.Preprocess(data, mode='train')
# fills NA based on data_exploration_analysis.ipnb
x, y = proc.get_data_central_tendency()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['LoanAmount'][df['LoanAmount'].isnull()] = df[df['LoanAmount'].isnull()].apply(fill_median, axis=1)
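
The warning above comes from chained indexing (`df['LoanAmount'][mask] = ...`), which may write to a temporary copy instead of the original frame. A minimal sketch of the usual fix with a single `.loc` assignment follows. The function `fill_loan_amount` and the demo frame are hypothetical, and a plain column median stands in for the notebook's `fill_median` helper, whose logic is not shown here.

```python
import numpy as np
import pandas as pd


def fill_loan_amount(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing LoanAmount values replaced by the column median.

    Hypothetical stand-in for the row-wise fill_median helper in preprocess.py.
    """
    out = df.copy()  # work on an explicit copy, never on a slice
    missing = out["LoanAmount"].isnull()
    # One .loc call selects rows and column together, so pandas writes in place.
    out.loc[missing, "LoanAmount"] = out["LoanAmount"].median()
    return out


# Hypothetical data, only to exercise the function.
frame = pd.DataFrame({"LoanAmount": [100.0, np.nan, 140.0, 120.0]})
filled = fill_loan_amount(frame)
```

Selecting rows and column in the same `.loc` call is what removes the ambiguity between a view and a copy.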


In [3]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(x)
x = scaler.transform(x)

In [4]:
from sklearn.model_selection import train_test_split
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.1, random_state=5)
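
Note that the scaler above is fitted on all rows before the split, so the test rows contribute to the mean and variance used for scaling. A minimal sketch of a variant that fits the scaler on the training rows only is shown here. It uses synthetic data as a stand-in for the preprocessed features, and every name with the `_demo` suffix is an assumption, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features and labels (assumption).
rng = np.random.default_rng(5)
x_demo = rng.normal(size=(200, 4))
y_demo = (x_demo[:, 0] + 0.5 * x_demo[:, 1] > 0).astype(int)

# Split first, so the held-out rows never reach the scaler's fit step.
x_tr_demo, x_te_demo, y_tr_demo, y_te_demo = train_test_split(
    x_demo, y_demo, test_size=0.1, random_state=5
)

# The pipeline fits StandardScaler on the training rows only and reuses
# those statistics when it transforms the test rows at predict time.
clf_demo = make_pipeline(
    StandardScaler(),
    SVC(C=1, kernel="linear", class_weight="balanced"),
)
clf_demo.fit(x_tr_demo, y_tr_demo)
pred_demo = clf_demo.predict(x_te_demo)
```

With a dataset of this size the effect on the scores is probably small, but the pipeline form also makes it harder to forget the scaling step when predicting on new data.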

In [5]:
# hyperparameters are chosen by model_selection.ipnb
model = SVC(C=1, kernel='linear', class_weight='balanced')
model.fit(x_tr, y_tr)
pred = model.predict(x_te)
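
For reference, `class_weight='balanced'` makes scikit-learn weight each class by `n_samples / (n_classes * count_of_that_class)`, and `SVC` multiplies `C` by that weight for samples of the class. A small sketch of the formula, using hypothetical labels rather than the loan data:

```python
import numpy as np


def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * count_per_class)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Hypothetical labels: 3 samples of class 0 and 7 samples of class 1.
demo_labels = np.array([0] * 3 + [1] * 7)
demo_weights = balanced_class_weights(demo_labels)
```

The rarer class receives the larger weight, so misclassifying it costs more during training. This is the same imbalance that oversampling tries to address by adding rows instead.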

In [6]:
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
print('F1/acc: ',f1_score(y_te, pred),'/',accuracy_score(y_te, pred))

F1/acc:  0.88659793814433 / 0.8225806451612904
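
To see how the two numbers relate, here is a short sketch that computes F1 and accuracy from confusion-matrix counts. The counts are an assumption: the notebook never prints the confusion matrix, and these values are simply one combination that reproduces the printed scores if the test split holds 62 rows. The split between false positives and false negatives is arbitrary, because both scores depend only on their sum.

```python
def f1_and_accuracy(tp, fp, fn, tn):
    """F1 for the positive class and overall accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy


# Assumed counts for a 62-row test split (not taken from the notebook).
f1_demo, acc_demo = f1_and_accuracy(tp=43, fp=6, fn=5, tn=8)
```

F1 ignores true negatives, which is why it can sit well above accuracy when the positive class dominates the test set.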


Now we try oversampling the training set with SMOTE.

In [7]:
from imblearn.over_sampling import SVMSMOTE
sm = SVMSMOTE(k_neighbors=20, n_jobs=1, m_neighbors=20, svm_estimator=SVC(kernel='sigmoid'), random_state=42)
x_transformed, y_transformed = sm.fit_resample(x_tr, y_tr)

In [8]:
model = SVC(C=1, kernel='linear')
model.fit(x_transformed, y_transformed)
pred = model.predict(x_te)
print('F1/acc: ',f1_score(y_te, pred),'/',accuracy_score(y_te, pred))

F1/acc:  0.88659793814433 / 0.8225806451612904


In this case SMOTE brings no measurable benefit: F1 and accuracy are identical to the run without oversampling. I chose SVMSMOTE because model_selection.ipnb showed that SVM handles this data well.
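
As a reminder of what the oversampler does, here is a minimal NumPy sketch of the basic SMOTE idea: each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is plain SMOTE, not SVMSMOTE, which additionally uses an SVM to decide which samples to generate from. The function `smote_sketch` and the demo points are hypothetical.

```python
import numpy as np


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base row and one of its k neighbours for every synthetic row.
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])


# Hypothetical minority-class rows with two features.
demo_minority = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
)
synthetic = smote_sketch(demo_minority, n_new=10, k=3)
```

Because every synthetic row is a convex combination of two existing rows, oversampling adds no information from outside the minority region, which may be one reason it changes little for a linear SVM here.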

OK, let's try the ML approach.

In [9]:
proc = p.Preprocess(data, mode='train')
# fills NA using models trained in train_helper_models.py
x, y = proc.get_data_ml()

scaler = StandardScaler().fit(x)
x = scaler.transform(x)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.1, random_state=5)

In [10]:
model = SVC(C=1, kernel='linear', class_weight='balanced')
model.fit(x_tr, y_tr)
pred = model.predict(x_te)
print('F1/acc: ',f1_score(y_te, pred),'/',accuracy_score(y_te, pred))

F1/acc:  0.8631578947368421 / 0.7903225806451613


In [11]:
sm = SVMSMOTE(k_neighbors=5, n_jobs=1, m_neighbors=10, svm_estimator=SVC(kernel='rbf'), random_state=42)
x_transformed, y_transformed = sm.fit_resample(x_tr, y_tr)
model = SVC(C=1, kernel='linear')
model.fit(x_transformed, y_transformed)
pred = model.predict(x_te)
print('F1/acc: ',f1_score(y_te, pred),'/',accuracy_score(y_te, pred))

F1/acc:  0.860215053763441 / 0.7903225806451613


### Result
From this experiment it seems that the more complicated approach of predicting missing values is not worth the time: it scored lower than central-tendency filling (F1 0.863 vs. 0.887). Optimizing the helper models would probably take a lot of time, and there is a good chance that filling with central tendencies is the more robust solution. SMOTE did not improve performance in either setup, so it is redundant too.
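
Each score in this notebook comes from a single 10% test split, so a difference of a few hundredths may come down to a handful of rows. One way to check how stable the comparison is would be repeated stratified cross-validation. The sketch below runs it on synthetic data as a stand-in for the preprocessed features, and all names ending in `_cv` are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features and labels (assumption).
rng = np.random.default_rng(5)
x_cv = rng.normal(size=(300, 4))
y_cv = (x_cv[:, 0] + rng.normal(scale=0.8, size=300) > -0.5).astype(int)

# Scaling lives inside the pipeline, so it is refitted on each training fold.
clf_cv = make_pipeline(
    StandardScaler(),
    SVC(C=1, kernel="linear", class_weight="balanced"),
)

# 5 folds repeated 3 times gives 15 F1 scores instead of a single one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=5)
scores_cv = cross_val_score(clf_cv, x_cv, y_cv, scoring="f1", cv=cv)
print(f"F1 mean/std: {scores_cv.mean():.3f} / {scores_cv.std():.3f}")
```

Running the same procedure on both filling strategies would show whether the gap between them is larger than the fold-to-fold spread.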