### DS Test for Lucid 1: making a simple classification model

#### Summary:
The data is high-dimensional and imbalanced, with no duplicates or missing values. We assume it is valuable to identify the true minority (positive) cases in y.
We first tried a baseline that always predicts the mode. For a second, still simple model, we oversample the minority class in y, scale X and apply PCA in preprocessing, then fit a logistic regression. Evaluated on the 30% held-out test data, the baseline identifies none of the minority cases in y, while the logistic model reaches 60% recall on them. That model is then pickled as the final model.

#### If I have more time...
1. More EDA on the positive y cases, and better feature engineering.
2. Try more model options, possibly more sophisticated ones. When I want to experiment with more models, I usually go to H2O.ai and try their AutoML or Driverless AI instead of hand-building 100 different models.

#### 1 Data preparation
Since this model is supposed to be simple, and the features and objective are unknown, I keep data preparation to a minimum.

In [193]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [194]:
# Loading the data set
dta = pd.read_csv("C:/Users/Tianl/Documents/GitHub/Pet-projects/lucid_skills_test/data/lucid_dataset_train.csv")
dta.head()

Unnamed: 0,label,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f493,f494,f495,f496,f497,f498,f499,f500,f501,f502
0,0,0,0,0,0,0.0,0.0,17755.0,7,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0.0,0.0,0.0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,1,4516870,0,3,0.25,0.035714,0.0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,1,52743,0,4,0.333333,0.047619,0.0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0.0,0.0,0.0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [195]:
#Checking y distribution
dta.label.describe()

count    6948.000000
mean        0.036126
std         0.186616
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: label, dtype: float64

Y has only about 3.6% positive cases (the mean of the 0/1 label), so this is a highly imbalanced dataset.
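
For a 0/1 label the mean equals the share of positives, which is how the figure above is read off `describe()`. A minimal sketch of reading the class balance directly with `value_counts`, using a made-up series (`label_demo` is a hypothetical stand-in, not the real `dta.label`):

```python
import pandas as pd

# Hypothetical stand-in for dta.label: 1000 rows, 36 of them positive.
label_demo = pd.Series([1] * 36 + [0] * 964, name="label")

counts = label_demo.value_counts()                 # absolute count per class
shares = label_demo.value_counts(normalize=True)   # fraction per class

print(counts.to_dict())
print(shares.round(3).to_dict())
```

On the real data the same two calls on `dta.label` would show the counts behind the mean reported above.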

In [196]:
#checking duplicates and NAs
dta.dropna().drop_duplicates().shape

(6948, 503)

In [197]:
#dropping duplicates and NAs
dta=dta.dropna().drop_duplicates()

There are many features in the data, but no missing values or duplicates.

In [198]:
#split train/test data with 7:3.
from sklearn.model_selection import train_test_split
y = dta.label
X = dta.drop(['label'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
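
With so few positives, a purely random split can leave the test set with noticeably more or fewer positive cases than the training set. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. A small sketch on made-up data (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 5 features, 4% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 40 + [0] * 960)

# stratify=y_demo keeps the positive share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print("train positive share:", round(y_tr.mean(), 3))
print("test positive share: ", round(y_te.mean(), 3))
```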

#### Baseline: predict the mode for all cases + CV error
This model does nothing but predict the mode of the training labels for every case. It is a naive choice prediction-wise, but it serves as a very basic comparison.

In [199]:
#Formulate the model 
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy="most_frequent")

In [200]:
#Because we don't have many data points, we use 10-fold CV to estimate the training error.
from sklearn.model_selection import cross_val_score
#Cross-validate on the training split (X_test_scaled is not defined until a later cell)
scores = cross_val_score(dummy_clf, X_train, y_train, cv=10, scoring="recall_macro")
print("Recall: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Recall: 0.50 (+/- 0.00)
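
The 0.50 is what macro recall gives for any constant majority-class predictor: recall is 1.0 on the majority class and 0.0 on the minority class, and the macro average weights the two equally. A tiny illustration with made-up labels:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 96 negatives, 4 positives.
y_true = [0] * 96 + [1] * 4
# A mode predictor always outputs the majority class.
y_pred = [0] * 100

per_class = recall_score(y_true, y_pred, average=None)   # recall for class 0, class 1
macro = recall_score(y_true, y_pred, average="macro")    # unweighted mean of the two

print(per_class, macro)
```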


In [201]:
#Fit and inspect the model's test performance. Because the model has only one parameter, I didn't estimate the error rate on the training set.
from sklearn.metrics import classification_report
dummy_clf.fit(X_train,y_train)
y_pred = dummy_clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      2012
           1       0.00      0.00      0.00        73

   micro avg       0.96      0.96      0.96      2085
   macro avg       0.48      0.50      0.49      2085
weighted avg       0.93      0.96      0.95      2085



The mode prediction did quite badly on the test data. In fact it did not produce a single true positive, but it is not our final model anyway.

#### A simple model: Oversampling + Scaling + PCA + Logit
This model applies some preprocessing tricks and tunes a logit model, which is fast, easy and commonly adopted. 

First, because y is highly imbalanced, we use a simple method to oversample the minority (positive) class here.

In [202]:
#Oversample train data
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
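
To make the idea concrete: random oversampling duplicates minority rows (sampling with replacement) until both classes have the same count. The sketch below illustrates that idea with `sklearn.utils.resample` on made-up data. It is not the imblearn implementation, and `X_demo`, `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical data: 100 rows, 5 of them positive.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 5 + [0] * 95)

X_min = X_demo[y_demo == 1]
n_extra = int((y_demo == 0).sum() - len(X_min))   # copies needed to balance

# Draw extra minority rows with replacement and append them to the original data.
X_extra = resample(X_min, replace=True, n_samples=n_extra, random_state=0)
X_bal = np.vstack([X_demo, X_extra])
y_bal = np.concatenate([y_demo, np.ones(n_extra, dtype=int)])

print("before:", np.bincount(y_demo), "after:", np.bincount(y_bal))
```

No new information is created this way; the minority rows simply carry more weight in the fit.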

As the data has around 500 features but only about 7,000 records, we scale the features first and then reduce the dimensionality with a simple PCA.

In [203]:
#Scale train and test data
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X_resampled)
X_train_resampled_scaled = scaler.transform(X_resampled)
X_test_scaled = scaler.transform(X_test)
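
One way to sanity-check how many principal components are worth keeping is to look at the cumulative explained variance after scaling. This is a sketch on made-up data with a known low-dimensional structure (`X_demo` is hypothetical, and the 95% cut-off is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 20 features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(200, 20))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(X_demo_scaled)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print("components for 95% variance:", n_95)
print("cumulative variance (first 5):", cum_var[:5].round(3))
```

Running the same check on `X_train_resampled_scaled` would show whether a grid of component counts covers the range where the curve flattens.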

Because the sample size is very limited, we use cross-validation to train and validate the model. We tune some key hyperparameters with a simple grid search.

In [204]:
#Use a pipeline to assemble PCA and logit
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pca = PCA()
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

I look at recall in particular here: with imbalanced data we care about how many of the real positive cases we identify (TP) and how many we fail to identify (FN).

In [205]:
#Tune PCA and logit parameters
from sklearn.model_selection import GridSearchCV
param_grid = {
'pca__n_components': [5, 15, 30, 45, 64],
'logistic__C': np.logspace(-4, 4, 4),
}
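#Score by macro recall; without `scoring`, GridSearchCV falls back to the estimator's default score (accuracy)
search = GridSearchCV(pipe, param_grid, n_jobs=-1, scoring='recall_macro', refit=True, cv=5)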
search.fit(X_train_resampled_scaled, y_resampled)
#Inspect the best parameters
print("Best parameter (CV recall=%0.3f):" % search.best_score_)
search.best_params_

Best parameter (CV recall=0.813):


{'logistic__C': 21.54434690031882, 'pca__n_components': 64}
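
One caveat on the CV score above: the oversampling was done before cross-validation, so copies of the same minority row can land in both the training and validation folds, which can make the CV score optimistic. A minimal sketch of resampling inside each fold instead, on made-up data. `oversample` is a hypothetical helper written for this illustration (it assumes class 1 is the minority), and `X_demo`, `y_demo` are not the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical imbalanced data: 600 rows, 10% positives shifted along feature 0.
y_demo = np.array([1] * 60 + [0] * 540)
X_demo = rng.normal(size=(600, 5))
X_demo[y_demo == 1, 0] += 2.0


def oversample(X, y, rng):
    """Append randomly duplicated class-1 rows until both classes match."""
    minority = np.flatnonzero(y == 1)
    n_extra = int((y == 0).sum() - minority.size)
    extra = rng.choice(minority, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]


fold_recalls = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X_demo, y_demo):
    # Resample the training fold only; the validation fold stays untouched.
    X_tr, y_tr = oversample(X_demo[train_idx], y_demo[train_idx], rng)
    # The scaler is also fit inside the fold, on training rows only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    y_val_pred = model.predict(X_demo[val_idx])
    fold_recalls.append(recall_score(y_demo[val_idx], y_val_pred, average="macro"))

print("Macro recall per fold:", np.round(fold_recalls, 2))
```

The imbalanced-learn package also provides its own `imblearn.pipeline.Pipeline`, which accepts a sampler as a step and can be passed to `GridSearchCV` so that resampling happens only on the training folds.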

In [206]:
#Show test result
from sklearn.metrics import classification_report
y_pred = search.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.83      0.90      2012
           1       0.11      0.60      0.19        73

   micro avg       0.82      0.82      0.82      2085
   macro avg       0.55      0.72      0.55      2085
weighted avg       0.95      0.82      0.88      2085



Focusing on the recall for the positive class (label 1), we correctly identify 60% of the real positive cases. This is better than the baseline, which identifies none of them. The trade-off is low precision (0.11): most cases flagged as positive are false alarms.
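
To see what these rates mean in absolute terms, the approximate confusion-matrix counts can be backed out from the report. The report rounds to two decimals, so the counts below are estimates, not the exact values from the run.

```python
# Supports and per-class recalls as printed in the report above (rounded values).
support_pos, support_neg = 73, 2012
recall_pos, recall_neg = 0.60, 0.83

tp = round(recall_pos * support_pos)   # positives correctly flagged, about 44
fn = support_pos - tp                  # positives missed, about 29
tn = round(recall_neg * support_neg)   # negatives correctly passed, about 1670
fp = support_neg - tn                  # false alarms, about 342

precision_pos = tp / (tp + fp)
accuracy = (tp + tn) / (support_pos + support_neg)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"precision(1)={precision_pos:.2f} accuracy={accuracy:.2f}")
```

The recovered precision and accuracy agree with the 0.11 and 0.82 in the report, which suggests the estimated counts are close.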

If this were a production-level model, I would retrain it on all available data so that all information is used, but I'll just pickle it for now.

In [207]:
#Pickle the model file
import joblib
best_model=search.best_estimator_
model_file='C:/Users/Tianl/Documents/GitHub/Lucid-test/logit.pkl'
joblib.dump(best_model, model_file)

['C:/Users/Tianl/Documents/GitHub/Lucid-test/logit.pkl']
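
Note that the pickled `best_estimator_` holds only the PCA and logistic steps. The `StandardScaler` was fit separately, so scoring new data later also requires the same fitted scaler (saved alongside, or moved into the pipeline). A minimal sketch of the dump/load round trip on a made-up model, using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data and model; the scaler lives inside the pipeline here.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")   # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("predictions match after reload:", same)
```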