# Modeling
>  Time to bring everything together and build some models! In this last chapter, you will build a base model before tuning some hyperparameters and improving your results with ensembles. You will then get some final tips and tricks to help you compete more efficiently.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the exercises from chapter 4 of the course "Winning a Kaggle Competition in Python" at DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Baseline model

### Replicate validation score

<div class=""><p>You've seen both validation and Public Leaderboard scores in the video. However, the code examples are available only for the test data. To get the validation scores you have to repeat the same process on the holdout set.</p>
<p>Throughout this chapter, you will work with New York City Taxi competition data. The problem is to predict the fare amount for a taxi ride in New York City. The competition metric is the root mean squared error.</p>
<p>The first goal is to evaluate the Baseline model on the validation data. You will replicate the simplest Baseline based on the mean of <code>"fare_amount"</code>. Recall that as a validation strategy we used a 30% holdout split with <code>validation_train</code> as train and <code>validation_test</code> as holdout DataFrames. Both of them are available in your workspace.</p></div>

In [2]:
from sklearn.model_selection import train_test_split

train = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/taxi_train.csv')
test = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/taxi_test.csv')
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime']).dt.tz_localize(None)
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime']).dt.tz_localize(None)
validation_train, validation_test = train_test_split(train, test_size=0.3, random_state = 123)

In [3]:
import hashlib
X = train
hashlib.sha256(X.to_json().encode()).hexdigest()

'c837903792d6345a07ded5be113debff9ab18e167fae6679f0dcfef44ed4c4e8'

In [None]:
for x in range(0, 1000):
  validation_train, validation_test = train_test_split(train, test_size=0.3, random_state=x)
  print(x, hashlib.sha256(validation_train.to_json().encode()).hexdigest())

Instructions
<ul>
<li>Calculate the mean of <code>"fare_amount"</code> over the whole <code>validation_train</code> DataFrame.</li>
<li>Assign this naive prediction value to all the holdout predictions. Store them in the <code>"pred"</code> column.</li>
</ul>

In [5]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Calculate the mean fare_amount on the validation_train data
naive_prediction = np.mean(validation_train['fare_amount'])

# Assign naive prediction to all the holdout observations
validation_test = validation_test.copy()
validation_test['pred'] = naive_prediction

# Measure the local RMSE
rmse = sqrt(mean_squared_error(validation_test['fare_amount'], validation_test['pred']))
print('Validation RMSE for Baseline I model: {:.3f}'.format(rmse))

Validation RMSE for Baseline I model: 9.986


**It's exactly the same number you've seen in the slides, well done! So, to avoid overfitting, you should fully replicate your models on the validation data.**
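
To make the metric concrete, here is a minimal sketch of the same mean baseline in plain Python. The fares are made-up numbers, not competition data, and the `rmse` helper is a hypothetical stand-in for the `sqrt(mean_squared_error(...))` call used above.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical fares, not taken from the competition data
train_fares = [5.0, 7.0, 9.0, 11.0]
holdout_fares = [6.0, 10.0]

# The naive baseline predicts the train mean for every holdout ride
naive_prediction = sum(train_fares) / len(train_fares)
holdout_pred = [naive_prediction] * len(holdout_fares)

baseline_rmse = rmse(holdout_fares, holdout_pred)
print('Baseline RMSE: {:.3f}'.format(baseline_rmse))
```

The mean is computed on the train part only and scored on the holdout part, which mirrors the `validation_train` / `validation_test` split.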

### Baseline based on the date

<div class=""><p>We've already built 3 different baseline models. To get more practice, let's build a couple more. The first model is based on the grouping variables. It's clear that the ride fare could depend on the part of the day. For example, prices could be higher during the rush hours.</p>
<p>Your goal is to build a baseline model that will assign the average "fare_amount" for the corresponding hour. For now, you will create the model for the whole <code>train</code> data and make predictions for the <code>test</code> dataset.</p>
<p>The <code>train</code> and <code>test</code> DataFrames are available in your workspace. Moreover, the "pickup_datetime" column in both DataFrames is already converted to a <code>datetime</code> object for you.</p></div>

Instructions
<ul>
<li>Get the hour from the "pickup_datetime" column for the <code>train</code> and <code>test</code> DataFrames.</li>
<li>Calculate the mean "fare_amount" for each hour on the train data.</li>
<li>Make <code>test</code> predictions using <code>pandas</code>' <code>map()</code> method and the grouping obtained.</li>
<li>Write predictions to the file.</li>
</ul>

In [6]:
# Get pickup hour from the pickup_datetime column
train['hour'] = train['pickup_datetime'].dt.hour
test['hour'] = test['pickup_datetime'].dt.hour

# Calculate average fare_amount grouped by pickup hour 
hour_groups = train.groupby('hour')['fare_amount'].mean()

# Make predictions on the test set
test['fare_amount'] = test.hour.map(hour_groups)

# Write predictions
test[['id','fare_amount']].to_csv('hour_mean_sub.csv', index=False)

In [7]:
!head hour_mean_sub.csv

id,fare_amount
0,11.199879638916757
1,11.199879638916757
2,11.241585365853654
3,10.964889086069206
4,10.964889086069206
5,10.964889086069206
6,11.094688755020092
7,11.094688755020092
8,11.094688755020092


**Such a baseline achieves 1409th place on the Public Leaderboard, which is slightly better than grouping by the number of passengers. Also, remember to replicate all the results for the validation set, as was done in the previous exercise.**
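
Replicating this baseline on a holdout split could look roughly like the sketch below. It uses plain Python and made-up `(hour, fare)` pairs so that it runs on its own. The fallback to the global mean is an assumption added here: with pandas, mapping an hour that is absent from the grouping would leave a missing value in the predictions.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (hour, fare) pairs, not taken from the competition data
validation_train = [(8, 12.0), (8, 14.0), (14, 9.0), (14, 11.0), (23, 20.0)]
validation_test = [(8, 13.0), (14, 8.0), (3, 15.0)]  # hour 3 is unseen in train

# Collect the fares observed for each pickup hour on the train part
fares_by_hour = defaultdict(list)
for hour, fare in validation_train:
    fares_by_hour[hour].append(fare)
hour_means = {hour: sum(f) / len(f) for hour, f in fares_by_hour.items()}

# Fall back to the global mean for hours that never appear in train
global_mean = sum(fare for _, fare in validation_train) / len(validation_train)
preds = [hour_means.get(hour, global_mean) for hour, _ in validation_test]

# Score the predictions on the holdout part
squared_errors = [(fare - pred) ** 2
                  for (_, fare), pred in zip(validation_test, preds)]
hour_rmse = sqrt(sum(squared_errors) / len(squared_errors))
print('Validation RMSE for the hour baseline: {:.3f}'.format(hour_rmse))
```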

### Baseline based on the gradient boosting

<div class=""><p>Let's build a final baseline based on the Random Forest. You've seen a huge score improvement moving from the grouping baseline to the Gradient Boosting in the video. Now, you will use <code>sklearn</code>'s Random Forest to further improve this score.</p>
<p>The goal of this exercise is to take numeric features and train a Random Forest model without any tuning. After that, you could make test predictions and validate the result on the Public Leaderboard. Note that you've already got an <code>"hour"</code> feature which could also be used as an input to the model.</p></div>

Instructions
<ul>
<li>Add the <code>"hour"</code> feature to the list of numeric features.</li>
<li>Fit the <code>RandomForestRegressor</code> on the train data with numeric features and <code>"fare_amount"</code> as a target.</li>
<li>Use the trained Random Forest model to make predictions on the test data.</li>
</ul>

In [8]:
from sklearn.ensemble import RandomForestRegressor

# Select only numeric features
features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
            'dropoff_latitude', 'passenger_count', 'hour']

# Train a Random Forest model
rf = RandomForestRegressor()
rf.fit(train[features], train.fare_amount)

# Make predictions on the test data
test['fare_amount'] = rf.predict(test[features])

# Write predictions
test[['id','fare_amount']].to_csv('rf_sub.csv', index=False)

In [9]:
!head rf_sub.csv

id,fare_amount
0,8.528000000000002
1,8.481000000000003
2,5.441999999999998
3,9.580999999999996
4,13.517999999999997
5,7.904000000000002
6,5.852999999999998
7,53.9464
8,11.937000000000001


**This final baseline achieves the 1051st place on the Public Leaderboard which is slightly better than the Gradient Boosting from the video. So, now you know how to build fast and simple baseline models to validate your initial pipeline.**

## Hyperparameter tuning

### Grid search

<div class=""><p>Recall that we've created a baseline Gradient Boosting model in the previous lesson. Your goal now is to find the best <code>max_depth</code> hyperparameter value for this Gradient Boosting model. This hyperparameter limits the maximum depth of each individual tree, and therefore how many nodes it can contain. You will be using K-fold cross-validation to measure the local performance of the model for each hyperparameter value.</p>
<p>You're given a function <code>get_cv_score()</code>, which takes the train dataset and dictionary of the model parameters as arguments and returns the overall validation RMSE score over 3-fold cross-validation.</p></div>

In [10]:
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor

In [11]:
def get_cv_score(train, params):
    # Create KFold object
    kf = KFold(n_splits=3, shuffle=True, random_state=123)

    rmse_scores = []
    
    # Loop through each split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    
        # Train a Gradient Boosting model
        gb = GradientBoostingRegressor(random_state=123, **params).fit(cv_train[features], cv_train.fare_amount)
    
        # Make predictions on the test data
        pred = gb.predict(cv_test[features])
    
        fold_score = np.sqrt(mean_squared_error(cv_test['fare_amount'], pred))
        rmse_scores.append(fold_score)
    
    return np.round(np.mean(rmse_scores) + np.std(rmse_scores), 5)

Instructions
<ul>
<li>Specify the grid for possible <code>max_depth</code> values with 3, 6, 9, 12 and 15.</li>
<li>Pass each hyperparameter candidate in the grid to the model <code>params</code> dictionary.</li>
</ul>
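
Note that `get_cv_score()` as defined above returns the mean of the fold RMSEs plus their standard deviation, so a model with unstable folds is penalized. A minimal sketch of that aggregation, using made-up fold scores:

```python
from statistics import mean, pstdev

# Hypothetical per-fold RMSE values, for illustration only
fold_scores = [5.2, 5.6, 5.8]

# np.std defaults to the population standard deviation, which pstdev mirrors
average = mean(fold_scores)
spread = pstdev(fold_scores)

# Same aggregation as in get_cv_score(): mean plus standard deviation
overall_score = round(average + spread, 5)
print(overall_score)
```

Because the spread is added, the numbers reported by this helper are somewhat higher than a plain average of the fold RMSEs.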

In [12]:
# Possible max depth values
max_depth_grid = [3, 6, 9, 12, 15]
results = {}

# For each value in the grid
for max_depth_candidate in max_depth_grid:
    # Specify parameters for the model
    params = {'max_depth': max_depth_candidate}

    # Calculate validation score for a particular hyperparameter
    validation_score = get_cv_score(train, params)

    # Save the results for each max depth value
    results[max_depth_candidate] = validation_score   
print(results)

{3: 5.67086, 6: 5.36931, 9: 5.35548, 12: 5.51582, 15: 5.71856}


**We have a validation score for each value in the grid. In this run the lowest score is at max_depth 9 (5.35548), with 6 very close behind (5.36931), so the optimal max depth value is located somewhere between 6 and 12. The next step could be to use a finer grid around that region, for example the integers from 6 to 12, and repeat the same process. Moving from coarser to finer grids allows us to narrow down the best values. Keep going to try optimizing 2 hyperparameters simultaneously!**
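
One way to automate the coarse-to-fine step is sketched below. The `refine_grid` helper is hypothetical (it is not part of the course code) and assumes integer-valued hyperparameters. It takes the printed results and proposes every integer between the neighbours of the best grid value.

```python
# Coarse-grid results copied from the printed output above
results = {3: 5.67086, 6: 5.36931, 9: 5.35548, 12: 5.51582, 15: 5.71856}

def refine_grid(results):
    """Return integer candidates between the neighbours of the best grid value."""
    grid = sorted(results)
    best = min(results, key=results.get)  # lower RMSE is better
    position = grid.index(best)
    lower = grid[max(position - 1, 0)]
    upper = grid[min(position + 1, len(grid) - 1)]
    return list(range(lower, upper + 1))

finer_grid = refine_grid(results)
print(finer_grid)
```

Each value of the finer grid would then be passed to `get_cv_score()` in the same loop as before.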

### 2D grid search

<div class=""><p>The drawback of tuning each hyperparameter independently is that it ignores potential dependencies between hyperparameters. A better approach is to try all the possible hyperparameter combinations. However, in that case the grid search space expands rapidly. For example, 2 parameters with 10 possible values each yield 100 experiment runs.</p>
<p>Your goal is to find the best hyperparameter couple of <code>max_depth</code> and <code>subsample</code> for the Gradient Boosting model. <code>subsample</code> is a fraction of observations to be used for fitting the individual trees.</p>
<p>You're given a function <code>get_cv_score()</code>, which takes the train dataset and dictionary of the model parameters as arguments and returns the overall validation RMSE score over 3-fold cross-validation.</p></div>

Instructions
<ul>
<li>Specify the grids for possible <code>max_depth</code> and <code>subsample</code> values. For <code>max_depth</code>: 3, 5 and 7. For <code>subsample</code>: 0.8, 0.9 and 1.0.</li>
<li>Apply the <code>product()</code> function from the <code>itertools</code> package to the hyperparameter grids. It returns all possible combinations for these two grids.</li>
<li>Pass each hyperparameters candidate couple to the model <code>params</code> dictionary.</li>
</ul>

In [13]:
import itertools

# Hyperparameter grids
max_depth_grid = [3, 5, 7]
subsample_grid = [0.8, 0.9, 1.0]
results = {}

# For each couple in the grid
for max_depth_candidate, subsample_candidate in itertools.product(max_depth_grid, subsample_grid):
    params = {'max_depth': max_depth_candidate,
              'subsample': subsample_candidate}
    validation_score = get_cv_score(train, params)
    # Save the results for each couple
    results[(max_depth_candidate, subsample_candidate)] = validation_score   
print(results)

{(3, 0.8): 5.65813, (3, 0.9): 5.65228, (3, 1.0): 5.67086, (5, 0.8): 5.34925, (5, 0.9): 5.44507, (5, 1.0): 5.3132, (7, 0.8): 5.39502, (7, 0.9): 5.40612, (7, 1.0): 5.3591}


**You can see that tuning multiple hyperparameters together can achieve better results. In the previous exercise, tuning only the max_depth parameter gave the best score of 5.35548 (max_depth equal to 9). With max_depth equal to 5 and subsample equal to 1.0, the best score is now 5.3132. Note that 1.0 is the default subsample value, so the gain here comes from trying a max_depth of 5, which the first grid skipped. However, do not spend too much time on the hyperparameter tuning at the beginning of the competition!**
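
The size of the search space and the best pair can be checked directly, as in this small sketch. The scores are copied from the printed output above; nothing is retrained here.

```python
import itertools

max_depth_grid = [3, 5, 7]
subsample_grid = [0.8, 0.9, 1.0]

# Every pair is evaluated once, so the run count is the product of the grid sizes
candidates = list(itertools.product(max_depth_grid, subsample_grid))
print(len(candidates))

# Scores copied from the 2D grid search output above
results = {(3, 0.8): 5.65813, (3, 0.9): 5.65228, (3, 1.0): 5.67086,
           (5, 0.8): 5.34925, (5, 0.9): 5.44507, (5, 1.0): 5.3132,
           (7, 0.8): 5.39502, (7, 0.9): 5.40612, (7, 1.0): 5.3591}

# Lower RMSE is better, so pick the key with the smallest score
best_params = min(results, key=results.get)
print(best_params, results[best_params])
```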

## Model ensembling

### Model blending

<div class=""><p>You will start creating model ensembles with a <strong>blending</strong> technique.</p>
<p>Your goal is to train 2 different models on the New York City Taxi competition data. Make predictions on the test data and then blend them using a simple arithmetic mean.</p>
<p>The <code>train</code> and <code>test</code> DataFrames are already available in your workspace. <code>features</code> is a list of columns to be used for training and it is also available in your workspace. The target variable name is "fare_amount".</p></div>

In [26]:
train = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/taxi_chapter_4_cv.csv')
test = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/23-winning-a-kaggle-competition-in-python/datasets/taxi_chapter_4_cv_test.csv')
features = ['pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count',
 'distance_km',
 'hour']

Instructions
<ul>
<li>Train a Gradient Boosting model on the train data using <code>features</code> list, and the "fare_amount" column as a target variable.</li>
<li>Train a Random Forest model in the same manner.</li>
<li>Make predictions on the test data using both Gradient Boosting and Random Forest models.</li>
<li>Find the average of both models' predictions.</li>
</ul>

In [27]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Train a Gradient Boosting model
gb = GradientBoostingRegressor().fit(train[features], train.fare_amount)

# Train a Random Forest model
rf = RandomForestRegressor().fit(train[features], train.fare_amount)

# Make predictions on the test data
test['gb_pred'] = gb.predict(test[features])
test['rf_pred'] = rf.predict(test[features])

# Find mean of model predictions
test['blend'] = (test['gb_pred'] + test['rf_pred']) / 2
print(test[['gb_pred', 'rf_pred', 'blend']].head(3))

    gb_pred  rf_pred     blend
0  9.661374    9.301  9.481187
1  9.304288    8.030  8.667144
2  5.795140    4.770  5.282570


**Blending allows you to get additional score improvements almost for free, just by averaging the predictions of multiple models.**
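
Why averaging helps can be illustrated with a synthetic experiment. The sketch below assumes two hypothetical models whose errors are independent and equally large, which is an idealized setting; real models trained on the same data tend to have correlated errors, so the gain is usually smaller.

```python
import random
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

random.seed(123)

# Synthetic targets and two hypothetical models with independent errors
y_true = [random.uniform(5, 50) for _ in range(5000)]
model_a = [y + random.gauss(0, 1) for y in y_true]
model_b = [y + random.gauss(0, 1) for y in y_true]

# Simple arithmetic mean of the two prediction vectors
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

rmse_a = rmse(y_true, model_a)
rmse_b = rmse(y_true, model_b)
rmse_blend = rmse(y_true, blend)
print('A: {:.3f}, B: {:.3f}, blend: {:.3f}'.format(rmse_a, rmse_b, rmse_blend))
```

In this setting the independent errors partly cancel, so the blend scores better than either model alone.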

### Model stacking I

<div class=""><p>Now it's time for <strong>stacking</strong>. To implement the stacking approach, you will follow the 6 steps we've discussed in the previous video:</p>
<ol>
<li>Split train data into two parts</li>
<li>Train multiple models on Part 1</li>
<li>Make predictions on Part 2</li>
<li>Make predictions on the test data</li>
<li>Train a new model on Part 2 using predictions as features</li>
<li>Make predictions on the test data using the 2nd level model</li>
</ol>
<p><code>train</code> and <code>test</code> DataFrames are already available in your workspace. <code>features</code> is a list of columns to be used for training on the Part 1 data and it is also available in your workspace. Target variable name is "fare_amount".</p></div>

Instructions 1/2
<ul>
<li>Split the <code>train</code> DataFrame into two equal parts: <code>part_1</code> and <code>part_2</code>. Use the <code>train_test_split()</code> function with <code>test_size</code> equal to 0.5.</li>
<li>Train Gradient Boosting and Random Forest models on the <code>part_1</code> data.</li>
</ul>

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Split train data into two parts
part_1, part_2 = train_test_split(train, test_size=.5, random_state=123)

# Train a Gradient Boosting model on Part 1
gb = GradientBoostingRegressor().fit(part_1[features], part_1.fare_amount)

# Train a Random Forest model on Part 1
rf = RandomForestRegressor().fit(part_1[features], part_1.fare_amount)

Instructions 2/2
<ul>
<li>Make Gradient Boosting and Random Forest predictions on the <code>part_2</code> data.</li>
<li>Make Gradient Boosting and Random Forest predictions on the <code>test</code> data.</li>
</ul>

In [33]:
# Make predictions on the Part 2 data
part_2 = part_2.copy()
part_2['gb_pred'] = gb.predict(part_2[features])
part_2['rf_pred'] = rf.predict(part_2[features])

# Make predictions on the test data
test = test.copy()
test['gb_pred'] = gb.predict(test[features])
test['rf_pred'] = rf.predict(test[features])

**You've covered 4 out of 6 steps to create a stacking ensemble. The only steps left are to train a new model on the Part 2 data using predictions as features and to apply it to the test data.**

### Model stacking II

<div class=""><p>OK, what you've done so far in the stacking implementation:</p>
<ol>
<li>Split train data into two parts</li>
<li>Train multiple models on Part 1</li>
<li>Make predictions on Part 2</li>
<li>Make predictions on the test data</li>
</ol>
<p>Now, your goal is to create a second level model using predictions from steps 3 and 4 as features. So, this model is trained on Part 2 data and then you can make stacking predictions on the test data.</p>
<p><code>part_2</code> and <code>test</code> DataFrames are already available in your workspace. Gradient Boosting and Random Forest predictions are stored in these DataFrames under the names "gb_pred" and "rf_pred", respectively.</p></div>

Instructions
<ul>
<li>Train a Linear Regression model on the Part 2 data using the Gradient Boosting and Random Forest models' predictions as features.</li>
<li>Make predictions on the test data using the Gradient Boosting and Random Forest models' predictions as features.</li>
</ul>

In [36]:
from sklearn.linear_model import LinearRegression

# Create linear regression model without the intercept
lr = LinearRegression(fit_intercept=False)

# Train 2nd level model on the Part 2 data
lr.fit(part_2[['gb_pred', 'rf_pred']], part_2.fare_amount)

# Make stacking predictions on the test data
test['stacking'] = lr.predict(test[['gb_pred', 'rf_pred']])

# Look at the model coefficients
print(lr.coef_)

[0.35812797 0.64768928]




```
[0.72504358 0.27647395]
```


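**Usually, the 2nd level model is a simple model such as Linear or Logistic Regression. Also, note that the Linear Regression was fit without an intercept so that it only combines the pure model predictions. The coefficients can differ between runs, because the base models are trained without a fixed random_state: the first output above puts more trust in the Random Forest (about 0.65 versus 0.36 for the Gradient Boosting), while the second set of coefficients puts more trust in the Gradient Boosting (about 0.7 versus 0.3 for the Random Forest).**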
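
What the 2nd level model computes can be written out by hand. The sketch below solves the no-intercept least-squares problem for two prediction columns with the normal equations. The `fit_two_weights` helper and the numbers are hypothetical; for two features it should agree with `LinearRegression(fit_intercept=False)` up to numerical precision.

```python
def fit_two_weights(first_pred, second_pred, target):
    """Least-squares weights for target ~ w1 * first_pred + w2 * second_pred."""
    s_aa = sum(a * a for a in first_pred)
    s_bb = sum(b * b for b in second_pred)
    s_ab = sum(a * b for a, b in zip(first_pred, second_pred))
    s_ay = sum(a * y for a, y in zip(first_pred, target))
    s_by = sum(b * y for b, y in zip(second_pred, target))

    # Solve the 2x2 normal equations with Cramer's rule
    det = s_aa * s_bb - s_ab * s_ab
    w1 = (s_ay * s_bb - s_by * s_ab) / det
    w2 = (s_by * s_aa - s_ay * s_ab) / det
    return w1, w2

# Hypothetical base-model predictions on the Part 2 data
gb_pred = [9.0, 8.0, 5.0, 12.0]
rf_pred = [10.0, 7.0, 6.0, 11.0]

# Synthetic target built as an exact 0.7 / 0.3 mix, so the weights are known
fare_amount = [0.7 * g + 0.3 * r for g, r in zip(gb_pred, rf_pred)]

w_gb, w_rf = fit_two_weights(gb_pred, rf_pred, fare_amount)
print(round(w_gb, 3), round(w_rf, 3))
```

Without an intercept, the stacked prediction is purely a weighted combination of the base predictions, which is why the coefficients can be read as how much the 2nd level model trusts each base model.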

## Final tips

### Testing Kaggle forum ideas

<div class=""><p>Unfortunately, not all the Forum posts and Kernels are necessarily useful for your model. So instead of blindly incorporating ideas into your pipeline, you should test them first.</p>
<p>You're given a function <code>get_cv_score()</code>, which takes a train dataset as an argument and returns the overall validation root mean squared error over 3-fold cross-validation. The <code>train</code> DataFrame is already available in your workspace.</p>
<p>You should try different suggestions from the Kaggle Forum and check whether they improve your validation score.</p></div>

In [37]:
def get_cv_score(train):
    features = ['pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude',
            'passenger_count', 'distance_km', 'hour', 'weird_feature']
    
    features = [x for x in features if x in train.columns]
  
    # Create KFold object
    kf = KFold(n_splits=3, shuffle=True, random_state=123)

    rmse_scores = []
    
    # Loop through each split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    
        # Train a Gradient Boosting model
        gb = GradientBoostingRegressor(random_state=123).fit(cv_train[features], cv_train.fare_amount)
    
        # Make predictions on the test data
        pred = gb.predict(cv_test[features])
    
        fold_score = np.sqrt(mean_squared_error(cv_test['fare_amount'], pred))
        rmse_scores.append(fold_score)
    
    return np.round(np.mean(rmse_scores) + np.std(rmse_scores), 5)

Instructions 1/2
<ul>
<li><strong>Suggestion 1: the <code>passenger_count</code> feature is useless</strong>. Let's see! Drop this feature and compare the scores.</li>
</ul>

In [38]:
# Drop passenger_count column
new_train_1 = train.drop('passenger_count', axis=1)

# Compare validation scores
initial_score = get_cv_score(train)
new_score = get_cv_score(new_train_1)

print('Initial score is {} and the new score is {}'.format(initial_score, new_score))

Initial score is 6.49932 and the new score is 6.42315


Instructions 2/2
<ul>
<li>This first suggestion worked. <strong>Suggestion 2: Sum of <code>pickup_latitude</code> and <code>distance_km</code> is a good feature</strong>. Let's try it!</li>
</ul>

In [39]:
# Create copy of the initial train DataFrame
new_train_2 = train.copy()

# Find sum of pickup latitude and ride distance
new_train_2['weird_feature'] = new_train_2['pickup_latitude'] + new_train_2['distance_km']

# Compare validation scores
initial_score = get_cv_score(train)
new_score = get_cv_score(new_train_2)

print('Initial score is {} and the new score is {}'.format(initial_score, new_score))

Initial score is 6.49932 and the new score is 6.50495


**Be aware that not all the ideas shared publicly will work for you! In this particular case, dropping the "passenger_count" feature helped, while adding the sum of pickup latitude and ride distance did not. The last action you perform in any Kaggle competition is selecting final submissions.**
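
The accept-or-reject decision can be made explicit with a small helper. `keep_idea` and its `min_gain` threshold are hypothetical additions, not part of the course code; the scores are the ones printed above.

```python
def keep_idea(initial_score, new_score, min_gain=0.0):
    """Keep an idea only if it lowers the validation RMSE by more than min_gain."""
    return initial_score - new_score > min_gain

# Validation scores copied from the two experiments above
initial_score = 6.49932
ideas = {'drop passenger_count': 6.42315, 'add weird_feature': 6.50495}

decisions = {name: keep_idea(initial_score, score) for name, score in ideas.items()}
for name, keep in decisions.items():
    print('{}: {}'.format(name, 'keep' if keep else 'reject'))
```

Setting `min_gain` above zero would guard against keeping changes whose improvement is within the noise of the cross-validation estimate.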

### Select final submissions

<div class=""><p>The last action in every competition is selecting final submissions. Your goal is to select 2 final submissions based on the local validation and Public Leaderboard scores. Suppose that the competition metric is RMSE (the lower the metric, the better). Follow the selection strategy we've discussed in the slides:</p>
<ol>
<li>Local validation: 1.25; Leaderboard: 1.35.</li>
<li>Local validation: 1.32; Leaderboard: 1.39.</li>
<li>Local validation: 1.10; Leaderboard: 1.29.</li>
<li>Local validation: 1.17; Leaderboard: 1.25.</li>
<li>Local validation: 1.21; Leaderboard: 1.32.</li>
</ol></div>

<pre>
Possible Answers
1 and 2.
2 and 3.
<b>3 and 4.</b>
4 and 5.
1 and 5.
</pre>

**Submission 3 is the best on local validation and submission 4 is the best on Public Leaderboard. So, it's the best choice for the final submissions!**
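
The selection rule described here (one submission that is best locally, one that is best on the Public Leaderboard) can be expressed in a few lines. This is a sketch of that rule only, using the scores listed in the exercise.

```python
# (local validation RMSE, Public Leaderboard RMSE) for each submission
submissions = {
    1: (1.25, 1.35),
    2: (1.32, 1.39),
    3: (1.10, 1.29),
    4: (1.17, 1.25),
    5: (1.21, 1.32),
}

# Lower RMSE is better on both scoreboards
best_local = min(submissions, key=lambda s: submissions[s][0])
best_public = min(submissions, key=lambda s: submissions[s][1])

# A set avoids listing the same submission twice if it wins on both
final_submissions = sorted({best_local, best_public})
print(final_submissions)
```

If one submission wins on both scores, the set holds a single entry and the second slot has to be chosen separately.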