# Module 3: Machine Learning

## Sprint 3: Introduction to Natural Language Processing and Computer Vision

## Kaggle competition - don't overfit!

## Background

---

Participating in Kaggle competitions is an efficient way to learn some aspects of Machine Learning. You can read solutions that others have made public, discuss solution ideas in the forums, and test your own ideas by submitting them for evaluation.

The evaluation metric varies from competition to competition, but the idea remains the same: build a model that is as accurate as possible on the test set. In industry, there are other factors to consider when building machine learning models, such as inference time, solution complexity and maintainability. Even though Kaggle competitions teach only a subset of the required skills, they are a fun way to learn by doing, so let's participate in one again!

## The competition

Even though we spent quite some time on natural language processing and computer vision during the sprint, the most accurate models on these types of data usually involve deep learning, which you will learn about in the upcoming course! In this project, the main goal is to understand overfitting as deeply as possible. Overfitting means fitting the training data very well at the expense of generalization: the model scores highly on the samples it was trained on but performs poorly on new ones.
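
As a rough illustration of that gap, the sketch below trains a weakly regularized logistic regression on synthetic data shaped like the competition's training set (250 rows, 300 features). The data, the labels and the names `X`, `y`, `train_accuracy` and `cv_accuracy` are assumptions made up for this example. The labels are random, so there is nothing real to learn, and any high training score is pure memorization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the competition data: 250 rows, 300 noise features,
# and coin-flip labels that are unrelated to the features.
X = rng.normal(size=(250, 300))
y = rng.integers(0, 2, size=250)

# A large C means very weak regularization, so the model is free to memorize.
model = LogisticRegression(C=1e4, max_iter=5000)
model.fit(X, y)
train_accuracy = model.score(X, y)

# Cross-validation scores the same model on rows it was not trained on.
cv_accuracy = cross_val_score(model, X, y, cv=5).mean()

print(f"train accuracy: {train_accuracy:.2f}")
print(f"cross-validated accuracy: {cv_accuracy:.2f}")
```

With more features than rows, the training score should come out close to perfect while the cross-validated score stays near chance. That difference is what this project asks you to keep small.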

To learn about the concept of overfitting, we will participate in the following Kaggle competition:

- https://www.kaggle.com/c/dont-overfit-ii

IMPORTANT: download the data from here - https://www.kaggle.com/sahiltinky/org-dataset-dont-overfitii, as the evaluation is done on an older dataset version than the one available at the competition data section.

For help, you can look at notebooks by other competitors. However, try to write the code yourself first: as a professional you will always be able to consult external resources, but right now the main goal is to learn by attempting it on your own.

Some notebooks that are worth exploring:

- https://www.kaggle.com/artgor/how-to-not-overfit
- https://www.kaggle.com/rafjaa/dealing-with-very-small-datasets

---

## Concepts to explore

- https://towardsdatascience.com/how-to-improve-your-kaggle-competition-leaderboard-ranking-bcd16643eddf
- https://opendatascience.com/10-tips-to-get-started-with-kaggle/

## Requirements

- Data exploration
- Feature engineering
- At least several different models built and compared to each other on the validation set and on the public and private leaderboards

## Evaluation criteria

- Private leaderboard score (target is better than 0.8)
- Simplicity of the model


## Sample correction questions

During a correction, you may get asked questions that test your understanding of covered topics.

- Is it possible to use standard machine learning algorithms, such as logistic regression and random forests, when working with text? If yes, what has to be done and how?
- You train a machine learning model and get a low validation accuracy. What other metrics could you check to better understand the problem? What are some of the ways to improve the validation accuracy?
- How do the validation accuracy, the confusion matrix and the important features complement each other when evaluating the model's performance?
- How can you make sure that a model deployed to production performs well?

# Competition code

In [96]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt

from sklearn import set_config
set_config(display="diagram")

from datetime import datetime

RANDOM_STATE = 3

In [31]:
# Load the data
train_df = pd.read_csv("~/Google Drive/Data/335_dont_overfit/train.csv", index_col="id")
test_df = pd.read_csv("~/Google Drive/Data/335_dont_overfit/test.csv", index_col="id")

train_df.shape, test_df.shape

((250, 301), (19750, 300))

## EDA

In [32]:
train_df.head()

id,target,0,1,2,3,4,5,6,7,8,...,290,291,292,293,294,295,296,297,298,299
0,1.0,-0.098,2.165,0.681,-0.614,1.309,-0.455,-0.236,0.276,-2.246,...,0.867,1.347,0.504,-0.649,0.672,-2.097,1.051,-0.414,1.038,-1.065
1,0.0,1.081,-0.973,-0.383,0.326,-0.428,0.317,1.172,0.352,0.004,...,-0.165,-1.695,-1.257,1.359,-0.808,-1.624,-0.458,-1.099,-0.936,0.973
2,1.0,-0.523,-0.089,-0.348,0.148,-0.022,0.404,-0.023,-0.172,0.137,...,0.013,0.263,-1.222,0.726,1.444,-1.165,-1.544,0.004,0.8,-1.211
3,1.0,0.067,-0.021,0.392,-1.637,-0.446,-0.725,-1.035,0.834,0.503,...,-0.404,0.64,-0.595,-0.966,0.9,0.467,-0.562,-0.254,-0.533,0.238
4,1.0,2.347,-0.831,0.511,-0.021,1.225,1.594,0.585,1.509,-0.012,...,0.898,0.134,2.415,-0.996,-1.006,1.378,1.246,1.478,0.428,0.253


In [13]:
train_df.describe()

statistic,target,0,1,2,3,4,5,6,7,8,...,290,291,292,293,294,295,296,297,298,299
count,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,...,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0,250.0
mean,0.64,0.023292,-0.026872,0.167404,0.001904,0.001588,-0.007304,0.032052,0.078412,-0.03692,...,0.044652,0.126344,0.018436,-0.012092,-0.06572,-0.106112,0.046472,0.006452,0.009372,-0.128952
std,0.480963,0.998354,1.009314,1.021709,1.011751,1.035411,0.9557,1.006657,0.939731,0.963688,...,1.011416,0.972567,0.954229,0.96063,1.057414,1.038389,0.967661,0.998984,1.008099,0.971219
min,0.0,-2.319,-2.931,-2.477,-2.359,-2.566,-2.845,-2.976,-3.444,-2.768,...,-2.804,-2.443,-2.757,-2.466,-3.287,-3.072,-2.634,-2.776,-3.211,-3.5
25%,0.0,-0.64475,-0.73975,-0.42525,-0.6865,-0.659,-0.64375,-0.675,-0.55075,-0.6895,...,-0.617,-0.5105,-0.53575,-0.657,-0.8185,-0.821,-0.6055,-0.75125,-0.55,-0.75425
50%,1.0,-0.0155,0.057,0.184,-0.0165,-0.023,0.0375,0.0605,0.1835,-0.0125,...,0.0675,0.091,0.0575,-0.021,-0.009,-0.0795,0.0095,0.0055,-0.009,-0.1325
75%,1.0,0.677,0.62075,0.805,0.72,0.735,0.6605,0.78325,0.76625,0.635,...,0.79725,0.80425,0.6315,0.65025,0.7395,0.493,0.683,0.79425,0.65425,0.50325
max,1.0,2.567,2.419,3.392,2.771,2.901,2.793,2.546,2.846,2.512,...,2.865,2.801,2.736,2.596,2.226,3.131,3.236,2.626,3.53,2.771


In [14]:
test_df.describe()

statistic,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
count,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,...,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0,19750.0
mean,-0.014043,0.000972,0.005145,-0.003525,0.003394,0.002738,0.004213,-0.010618,-0.003211,-0.002738,...,0.002577,-0.01013,-0.003961,0.012793,0.009063,0.007512,-0.004283,-0.001203,0.013076,7e-05
std,1.003779,0.993955,1.000809,1.008545,1.002826,1.002917,0.994315,0.997972,0.996938,1.000688,...,0.996314,0.996511,0.999788,1.01452,0.994,0.999559,0.99627,1.003705,0.996285,1.000596
min,-4.07,-3.664,-4.258,-4.14,-4.411,-3.586,-3.953,-3.906,-4.203,-4.024,...,-3.688,-3.877,-3.599,-3.65,-3.865,-3.814,-3.835,-3.908,-3.581,-4.135
25%,-0.68875,-0.667,-0.668,-0.686,-0.671,-0.679,-0.673,-0.68,-0.667,-0.677,...,-0.66,-0.675,-0.68475,-0.672,-0.65675,-0.664,-0.665,-0.68,-0.663,-0.675
50%,-0.006,0.001,0.017,-0.006,0.007,0.005,0.014,-0.014,-0.003,-0.007,...,-0.006,-0.015,-0.004,0.007,0.001,0.001,-0.001,-0.01,0.016,0.007
75%,0.664,0.676,0.681,0.682,0.676,0.68475,0.67,0.66075,0.671,0.673,...,0.667,0.654,0.68,0.694,0.682,0.685,0.669,0.673,0.686,0.676
max,3.767,3.864,3.866,3.871,3.955,3.819,3.954,3.669,3.948,3.812,...,3.619,3.829,3.717,5.092,5.125,3.681,3.716,3.932,3.764,4.07


In [19]:
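def get_missing_values(df: pd.DataFrame) -> pd.DataFrame:
  """Return missing-value counts and percentages for columns that have any."""
  missing_values = pd.DataFrame()
  # Count nulls per column, largest first
  missing_values["counts"] = df.isnull().sum().sort_values(ascending=False)
  missing_values["percent"] = missing_values["counts"] / df.shape[0] * 100
  # Keep only the columns that actually contain missing values
  return missing_values[missing_values["counts"] > 0]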

if len(get_missing_values(train_df)) == 0:
  print("No missing values in train dataset")

if len(get_missing_values(test_df)) == 0:
  print("No missing values in test dataset")

No missing values in train dataset
No missing values in test dataset


## Logistic regression on raw data

In [71]:
x_train = train_df.drop(["target"], axis="columns")
y_train = train_df["target"]

In [88]:
# Scale data
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = pd.DataFrame(scaler.transform(x_train), columns=x_train.columns)
x_test_scaled = pd.DataFrame(scaler.transform(test_df), columns=test_df.columns)
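
The scaler above is fitted on all training rows before cross-validation, so every validation fold has already contributed to the scaling statistics. One common way to avoid this is to put the scaler and the model into a single pipeline, so the scaler is refitted on the training part of each fold. The sketch below uses synthetic data; `X_demo`, `y_demo` and `pipeline_scores` are hypothetical names, and the hyperparameters are illustrative rather than tuned.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical stand-in for the training data: only column 0 carries signal.
X_demo = rng.normal(size=(250, 300))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=250) > 0).astype(int)

# The scaler is part of the estimator, so cross_val_score refits it per fold.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)

pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(pipeline_scores, pipeline_scores.mean())
```

Because the competition features already look roughly standard normal, the difference is probably small here, but the pipeline form keeps the validation estimate honest by construction.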

In [72]:
# Subset of feature columns used by the models below
features = ['16', '33', '43', '45', '52', '63', '65', '73', '90', '91', '117', '133', '134', '149', '189', '199', '217', '237', '258', '295']
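
The notebook does not record how the feature list in the cell above was chosen. One possible way to derive such a list, offered as an assumption rather than the method actually used here, is to fit an L1-penalized logistic regression and keep the columns with non-zero coefficients via `SelectFromModel`. The data, the `C` value and the names `demo_x`, `demo_y` and `selected_features` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Hypothetical stand-in for x_train: 250 rows, columns named "0" .. "299".
demo_x = pd.DataFrame(
    rng.normal(size=(250, 300)),
    columns=[str(i) for i in range(300)],
)
# Only columns "33" and "65" drive the synthetic target.
demo_y = (demo_x["33"] - demo_x["65"] > 0).astype(int)

# The L1 penalty pushes most coefficients to exactly zero;
# SelectFromModel keeps the columns whose coefficients survive.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
)
selector.fit(demo_x, demo_y)

selected_features = list(demo_x.columns[selector.get_support()])
print(selected_features)
```

Selecting features on the full training set and then cross-validating on the same rows makes the validation score optimistic, so in practice the selector would ideally sit inside a pipeline.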

In [90]:
lr = LogisticRegression(
    random_state=RANDOM_STATE,
    penalty="l1",
    solver="saga",
    class_weight="balanced",
    C=0.2
)

scores = cross_val_score(lr, x_train_scaled[features], y_train, cv=5)
scores, scores.mean()

(array([0.82, 0.78, 0.86, 0.82, 0.76]), 0.808)

In [78]:
lr_model = lr.fit(x_train[features], y_train)
lr_pred = lr.predict(test_df[features])
lr_pred.shape

(19750,)

In [79]:
lr_pred

array([1., 0., 1., ..., 1., 1., 0.])
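
The predictions above are hard 0/1 labels, and the cross-validation reports accuracy. Assuming the leaderboard is scored with ROC AUC (worth confirming on the competition's evaluation page), it may help to validate with the same metric and to submit class probabilities rather than labels. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `auc_scores` and `probabilities` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

# Hypothetical stand-in data: column 0 carries a noisy signal.
X_demo = rng.normal(size=(250, 20))
y_demo = (X_demo[:, 0] + rng.normal(size=250) > 0).astype(int)

clf = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.2, class_weight="balanced"
)

# Validate with ROC AUC instead of the default accuracy.
auc_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="roc_auc")
print(auc_scores.mean())

# Column 1 of predict_proba holds the predicted probability of class 1.
clf.fit(X_demo, y_demo)
probabilities = clf.predict_proba(X_demo)[:, 1]
```

AUC depends only on how the rows are ranked, so probabilities preserve information that is lost when predictions are rounded to 0 or 1.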

In [103]:
lr = LogisticRegression(
    random_state=RANDOM_STATE,
    penalty="l1",
    solver="liblinear",
    class_weight="balanced",
    C=0.1,
    max_iter=10000
)

lr_model = lr.fit(x_train, y_train)
lr_pred = lr.predict(test_df)
lr_pred.shape

(19750,)

In [59]:
rfc = RandomForestClassifier(
    random_state=RANDOM_STATE,
    n_estimators=100
)

scores = cross_val_score(rfc, x_train[features], y_train, cv=5)
scores, scores.mean()

(array([0.78, 0.86, 0.76, 0.82, 0.72]), 0.788)
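
The fold scores above range from 0.72 to 0.86, which is not surprising when each validation fold holds only 50 rows. One way to get a steadier estimate is to repeat the stratified split several times and look at the mean and spread. The sketch below uses synthetic data; `X_demo`, `y_demo`, `repeated_scores` and the repeat count are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(3)

# Hypothetical stand-in data: column 0 carries a noisy signal.
X_demo = rng.normal(size=(250, 20))
y_demo = (X_demo[:, 0] + rng.normal(size=250) > 0).astype(int)

# 5 folds repeated 4 times with different shuffles gives 20 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=3)
forest = RandomForestClassifier(n_estimators=50, random_state=3)

repeated_scores = cross_val_score(forest, X_demo, y_demo, cv=cv)
print(f"{repeated_scores.mean():.3f} +/- {repeated_scores.std():.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two models is larger than the noise of the split.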

In [101]:
nnet = MLPClassifier(
    random_state=RANDOM_STATE,
    hidden_layer_sizes=(300,2)
)

scores = cross_val_score(nnet, x_train_scaled[features], y_train, cv=5)
scores, scores.mean()



(array([0.76, 0.8 , 0.86, 0.84, 0.74]), 0.8)

In [102]:
%%time
# Search for better parameters 
from itertools import product
# make a map for hidden layer sizes
# sizes = [16, 32, 64, 128]
# shapes = (
#     list(product(sizes, repeat=1))
#     + list(product(sizes, repeat=2))
#     + list(product(sizes, repeat=3))
# )

distributions = dict(
    alpha=sp_randFloat(loc=0, scale=0.1),
    # hidden_layer_sizes=shapes,
    activation=['identity','logistic','tanh','relu'],
    max_iter=[50, 100, 200, 300, 400, 500],
    early_stopping=[True, False]
)

nnet_rs = RandomizedSearchCV(
    nnet,
    distributions,
    n_iter=20,
    cv=5,
    verbose=1,
    n_jobs=-1,
    random_state=RANDOM_STATE
)

nnet_rs.fit(x_train, y_train)

print(f"Best estimator: {nnet_rs.best_estimator_}")
print(f"Best score: {nnet_rs.best_score_}")
print(f"Best params: {nnet_rs.best_params_}")

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best estimator: MLPClassifier(activation='logistic', alpha=0.05627700944910962,
              hidden_layer_sizes=(300, 2), max_iter=400, random_state=3)
Best score: 0.7080000000000001
Best params: {'activation': 'logistic', 'alpha': 0.05627700944910962, 'early_stopping': False, 'max_iter': 400}
CPU times: user 10.8 s, sys: 1.75 s, total: 12.5 s
Wall time: 37.5 s




In [93]:
nnet_model = nnet.fit(x_train_scaled[features], y_train)
nnet_pred = nnet_model.predict(x_test_scaled[features])



In [91]:
gbc = GradientBoostingClassifier(
    random_state=RANDOM_STATE
)

scores = cross_val_score(gbc, x_train_scaled[features], y_train, cv=5)
scores, scores.mean()

(array([0.7 , 0.64, 0.8 , 0.8 , 0.74]), 0.736)

In [82]:
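def generateSubmissionCsv(predictions, title):
  """Write predictions into the sample submission layout and save a timestamped CSV."""
  submission = pd.read_csv("~/Google Drive/Data/335_dont_overfit/sample_submission.csv")
  # Overwrite the second column (the target) with the model's predictions
  submission.iloc[:,1] = predictions

  # Timestamp the file name so earlier submissions are not overwritten
  now = datetime.now()
  timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")

  submission.to_csv(f"{title}_submission_{timestamp}.csv", index=False)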

In [104]:
generateSubmissionCsv(lr_pred, "lr_")

In [94]:
generateSubmissionCsv(nnet_pred, "nn_first")