In another post, we looked at data from [Tabular Playground Series - January](https://www.kaggle.com/c/tabular-playground-series-jan-2021) and ran some simple linear regressions on it. In this post, I want to explore other approaches.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.model_selection import GridSearchCV, cross_val_score, \
                                    train_test_split
from sklearn.metrics import mean_squared_error as mse

In [2]:
# loading data
df = pd.read_csv('data/train.csv',index_col=0)
y = pd.DataFrame(df['target'])
train = df.drop(['target'],axis=1)

test = pd.read_csv('data/test.csv',index_col=0)

train.head()

id,cont1,cont2,cont3,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,cont14
1,0.67039,0.8113,0.643968,0.291791,0.284117,0.855953,0.8907,0.285542,0.558245,0.779418,0.921832,0.866772,0.878733,0.305411
3,0.388053,0.621104,0.686102,0.501149,0.64379,0.449805,0.510824,0.580748,0.418335,0.432632,0.439872,0.434971,0.369957,0.369484
4,0.83495,0.227436,0.301584,0.293408,0.606839,0.829175,0.506143,0.558771,0.587603,0.823312,0.567007,0.677708,0.882938,0.303047
5,0.820708,0.160155,0.546887,0.726104,0.282444,0.785108,0.752758,0.823267,0.574466,0.580843,0.769594,0.818143,0.914281,0.279528
8,0.935278,0.421235,0.303801,0.880214,0.66561,0.830131,0.487113,0.604157,0.874658,0.863427,0.983575,0.900464,0.935918,0.435772


### Decision Tree
First, we look at a popular method: the decision tree. Recall that a decision tree is built by choosing, at each split, the feature and threshold that optimize some quality criterion, e.g., Gini impurity or misclassification error for classification problems, and mean squared error (equivalently, variance) for regression problems.
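To make the regression criterion concrete, here is a minimal sketch of how a single split could be chosen on one feature by minimizing the weighted mean squared error of the two child nodes. It is an illustration only, not scikit-learn's actual implementation; the helper names `node_mse` and `best_split` and the toy data are made up for this example.

```python
def node_mse(values):
    """MSE of predicting the mean of `values`, i.e. their variance."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_mse) for the best split 'x <= threshold'."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    # start from "no split": the impurity of the parent node
    best = (None, node_mse(list(ys)))
    for i in range(1, n):
        left_x, right_x = pairs[i - 1][0], pairs[i][0]
        if left_x == right_x:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # child impurities weighted by the share of samples in each child
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / n
        if score < best[1]:
            best = ((left_x + right_x) / 2, score)
    return best


# toy data: the target jumps between x = 0.25 and x = 0.75
xs = [0.0, 0.125, 0.25, 0.75, 0.875, 1.0]
ys = [7.0, 7.0, 7.0, 9.0, 9.0, 9.0]
threshold, score = best_split(xs, ys)
print(threshold, score)
```

A full tree repeats this search over every feature (or over a random subset of `max_features` of them) and then recurses into each child until a limit such as `max_depth` is reached.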

In [6]:
from sklearn.tree import DecisionTreeRegressor

# hyperparameter tuning, with cross-validation using GridSearchCV
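# the default criterion is mean squared error
# (named 'mse' in older scikit-learn, 'squared_error' in newer releases)
tree = DecisionTreeRegressor(random_state=42)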

tree_params = {'max_depth':range(1,11),
               'max_features':range(4,15)
              }
tree_grid = GridSearchCV(tree,tree_params, cv=5,n_jobs=-1)

tree_grid.fit(train,y)

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(random_state=42), n_jobs=-1,
             param_grid={'max_depth': range(1, 11),
                         'max_features': range(4, 15)})
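To get a feel for how much work this search does, the sketch below counts the candidates in the grid above and the number of cross-validation fits they imply. The helper `count_fits` is a hypothetical name introduced here, not part of scikit-learn.

```python
from itertools import product


def count_fits(param_grid, cv):
    """Return (number of parameter combinations, number of CV fits)."""
    n_candidates = 1
    for values in param_grid.values():
        n_candidates *= len(values)
    return n_candidates, n_candidates * cv


tree_params = {'max_depth': range(1, 11), 'max_features': range(4, 15)}
n_candidates, n_fits = count_fits(tree_params, cv=5)
print(n_candidates, n_fits)  # 110 550

# the candidates themselves, enumerated as a Cartesian product
candidates = [dict(zip(tree_params, combo))
              for combo in product(*tree_params.values())]
print(candidates[0])  # {'max_depth': 1, 'max_features': 4}
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set, so the total is one fit more than the count above.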

In [7]:
# best parameters
tree_grid.best_params_

{'max_depth': 8, 'max_features': 6}

In [8]:
# making prediction with tree using above parameters
tree = DecisionTreeRegressor(random_state=42,max_depth=8,max_features=6)
tree.fit(train,y)
tree_preds = pd.DataFrame(tree.predict(test),index=test.index)
tree_preds.columns=['target']
tree_preds.to_csv('predictions/decisiontree.csv')
tree_preds

id,target
0,8.093925
2,7.751377
6,7.794199
7,8.029836
10,8.093925
...,...
499984,8.181278
499985,8.029836
499987,7.986406
499988,8.054487


Submitting this gives a score of 0.72077, which is a decent improvement upon our scores with linear regressions. The next natural step is to use an ensemble of trees.

### Random Forest

Random forest is an ensemble method that aggregates predictions from a collection of trees. In the `RandomForestRegressor` class, the number of trees is controlled by the parameter `n_estimators`. With `bootstrap=True` (the default), each tree is fit on a bootstrap sample of the training data, which makes the estimator a form of *bagging* (see [Bootstrap aggregating](https://en.wikipedia.org/wiki/Bootstrap_aggregating)).
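The idea of bagging can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the base learner is a hypothetical one-dimensional nearest-neighbour predictor rather than a decision tree, and all names (`bootstrap_sample`, `fit_1nn`, `fit_bagging`) and the toy data are invented for this example. A real random forest additionally considers only a random subset of features at each split.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_1nn(rows):
    """High-variance base learner: predict the y of the nearest training x."""
    def predict(x):
        return min(rows, key=lambda row: abs(row[0] - x))[1]
    return predict


def fit_bagging(rows, n_estimators, seed=42):
    """Fit one base learner per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = [fit_1nn(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]

    def predict(x):
        preds = [model(x) for model in models]
        return sum(preds) / len(preds)
    return predict


# toy (x, y) pairs
rows = [(0.1, 7.8), (0.3, 8.1), (0.5, 7.9), (0.7, 8.3), (0.9, 8.0)]
bagged = fit_bagging(rows, n_estimators=50)
print(bagged(0.4))
```

Averaging over many resampled fits smooths out the jumps of any single base learner, which is why increasing `n_estimators` tends to help up to a point.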

In [3]:
from sklearn.ensemble import RandomForestRegressor

In [None]:


# hyperparameter tuning, with cross-validation using GridSearchCV
rf = RandomForestRegressor(n_estimators=100,random_state=42)

rf_params = {'max_depth':range(4,11),
             'min_samples_split': range(2,6),
             'max_features':range(4,11)
            }

rf_grid = GridSearchCV(rf,rf_params, cv=5,n_jobs=-1)

rf_grid.fit(train,y)

In [7]:
# best parameters
rf_grid.best_params_

{'max_depth': 10, 'max_features': 8, 'min_samples_split': 4}

In [4]:
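# making prediction with random forest using above parameters
# (note: n_estimators=10 here, whereas the grid search above used 100)
rf = RandomForestRegressor(n_estimators=10,random_state=42,
                           max_depth=10,max_features=8,min_samples_split=4)
rf.fit(train,y.values.ravel())  # 1d target avoids a DataConversionWarning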
rf_preds = pd.DataFrame(rf.predict(test),index=test.index)
rf_preds.columns=['target']
rf_preds.to_csv('predictions/randomforest.csv')
rf_preds


id,target
0,8.093230
2,7.822159
6,7.867669
7,8.074233
10,8.158477
...,...
499984,8.153587
499985,8.001601
499987,8.048147
499988,7.988638


Submitting this gives a score of 0.71243, which is an improvement upon the decision tree score. In fact, one can get a better score with a higher `n_estimators`: for example, with `n_estimators=50` we get a score of 0.71151. So there is some room for future improvement via tuning. For now, though, let us try another approach.

### Gradient boosting model
For this, we use the [xgboost](https://xgboost.readthedocs.io/en/latest/) implementation. We will also make use of early stopping, which requires splitting the training data into a training part and a validation part.
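As a rough illustration of what early stopping does, the sketch below replays a made-up sequence of validation losses and stops once the loss has not improved for `patience` consecutive rounds. The function `best_round` and the numbers are hypothetical; this mimics the general idea, not XGBoost's internal code.

```python
def best_round(val_losses, patience):
    """Simulate early stopping; return (index of best round, index where we stopped)."""
    best_index = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_index]:
            best_index = i  # new best validation loss
        elif i - best_index >= patience:
            return best_index, i  # no improvement for `patience` rounds
    return best_index, len(val_losses) - 1  # ran through every round


# invented validation losses, one per boosting round
losses = [0.74, 0.72, 0.71, 0.705, 0.706, 0.707, 0.706, 0.708, 0.709, 0.70]
print(best_round(losses, patience=5))  # (3, 8)
```

Note that the last value in the toy sequence is never reached: training stops at round 8 because round 3 was the last improvement. A small `patience` saves time but can stop before a late improvement.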

In [28]:
# splitting data using train_test_split
X_train, X_val, y_train, y_val = train_test_split(train,y,train_size=0.9)

In [29]:
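# fitting model
# (early_stopping_rounds is accepted by fit() in the XGBoost version used here;
#  newer releases expect it in the XGBRegressor constructor instead)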
from xgboost import XGBRegressor
gbm = XGBRegressor(n_estimators=1000,learning_rate=0.01)
gbm.fit(X_train,y_train,early_stopping_rounds=5,
        eval_set=[(X_val,y_val)],verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.01, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=1000, n_jobs=6, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [30]:
# making and saving prediction
gbm_preds = pd.DataFrame(gbm.predict(test),index=test.index,
                        columns=['target'])
gbm_preds.to_csv('predictions/gbm.csv')
gbm_preds

id,target
0,7.967897
2,7.842190
6,7.916304
7,8.160949
10,8.249640
...,...
499984,8.205752
499985,8.216376
499987,8.080898
499988,8.046902


Submitting this gives a score of 0.70397, which is a nice improvement upon our previous scores. This is not a big surprise, since gradient boosting is one of the most powerful algorithms for tabular data. Much of the extra power of XGBoost comes from hyperparameter tuning, so we can focus more on that later on. For now, let us try a feedforward neural network.

### Trying out a neural net

In [59]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import Input
import tensorflow as tf

nnet = Sequential([
    Input(shape=(14,)),
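    Dense(16, activation='relu'),
    # Dropout(0.3),
    Dense(16, activation='relu'),  # second hidden layer, as in the summary below
    # a linear output is the more common choice for regression
    Dense(1, activation = 'relu')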
])

nnet.summary()

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_29 (Dense)             (None, 16)                240       
_________________________________________________________________
dense_30 (Dense)             (None, 16)                272       
_________________________________________________________________
dense_31 (Dense)             (None, 1)                 17        
Total params: 529
Trainable params: 529
Non-trainable params: 0
_________________________________________________________________


In [60]:
# compiling model
opt = tf.keras.optimizers.Adam(learning_rate=0.001)


nnet.compile(optimizer=opt,loss = "mse", 
              metrics=[tf.metrics.MeanSquaredError()])

In [61]:
# fitting the model
history = nnet.fit(train,y,epochs=5,batch_size=16)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [62]:
# making and saving prediction
nnet_preds = pd.DataFrame(nnet.predict(test), index=test.index,
                       columns = ['target'])
nnet_preds.to_csv('predictions/simplenet.csv')
nnet_preds

id,target
0,8.026910
2,7.615761
6,8.012880
7,7.956695
10,8.045378
...,...
499984,8.004407
499985,7.949613
499987,7.706973
499988,7.931010


Submitting this gives a score of about 0.725, which is not better than a simple linear regression. 

### What's next?
One thing to try is tuning the XGBoost hyperparameters. We can also try feature transformation and feature engineering before applying models. Some good general advice is given [here](https://www.kaggle.com/c/tabular-playground-series-jan-2021/discussion/213090).