# Train a model with _Bike Rental Data_ using XGBoost

This section uses XGBoost installed locally on the notebook instance.

## Training on the _log1p(count)_ Target

**Kernel used:** Conda with TensorFlow Python 3.6.5 for Amazon Elastic Inference *(conda_amazonei_tensorflow_p36)*

### First, update *conda* and *pip* to the latest versions

In [None]:
!conda update -y -n base conda
!pip install --upgrade pip

### Ensure required packages are installed

In [None]:
!conda list nb_conda
!conda list numpy
!conda list pandas
!conda list pip
!conda list python
!conda list matplotlib

### Major Library Versions Used

| Library | Version |
|---------|:--------|
| nb_conda | 2.2.1 |
| matplotlib | 3.0.3 |
| numpy | 1.17.4 |
| pandas | 0.24.2 |
| pip | 20.2 |
| python | 3.6.5 |
| xgboost | 0.90 |
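
The versions in the table can also be checked from Python instead of `conda list`. The following is a minimal sketch; the helper name `installed_versions` is made up for illustration, and it assumes each library exposes a `__version__` attribute.

```python
import importlib

def installed_versions(names):
    """Return {module name: version string}, or None where the import fails."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # The module is missing or broken in this kernel
            versions[name] = None
            continue
        # Not every module defines __version__, so fall back to a placeholder
        versions[name] = getattr(module, '__version__', 'unknown')
    return versions

print(installed_versions(['numpy', 'pandas', 'matplotlib', 'xgboost']))
```

A value of `None` for `xgboost` would indicate that the install step below has not been run yet in the current kernel.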

## Install xgboost

In [None]:
!conda install -y -c conda-forge xgboost

In [None]:
## Import Libraries
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
# XGBoost
import xgboost as xgb

## Load Data Files from Data Preparation Phase

In [None]:
column_list_file = 'bikeTrain_column_listv2.txt'
train_file = 'bikeTrainingv2.csv'
validation_file = 'bikeValidationv2.csv'
test_file = 'bikeTestv2.csv'

In [None]:
columns = '' # set up the columns variable as an empty string
with open(column_list_file, 'r') as f:
    # The text file holds one comma-separated line of column names;
    # strip() removes any trailing newline before splitting
    columns = f.read().strip().split(',')
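
If the column list file ends with a newline, splitting on commas alone leaves that newline attached to the last column name. The sketch below illustrates the effect with an in-memory stand-in for the file; the column names are hypothetical.

```python
import io

# Hypothetical file contents: one comma-separated line ending with a newline
fake_file = io.StringIO('count,season,holiday\n')
raw = fake_file.read()

naive = raw.split(',')           # the last name keeps the trailing newline
clean = raw.strip().split(',')   # surrounding whitespace is removed first

print(naive)
print(clean)
```

Checking the last element of `columns` in the data-check cell is a quick way to spot this.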

In [None]:
columns # data check

### Specify Column Names as the File Does Not Have Header

In [None]:
df_train = pd.read_csv(train_file, names=columns)
df_validation = pd.read_csv(validation_file,names=columns)
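
To see why `names=` matters for a headerless file, here is a small sketch using in-memory CSV text. The rows and column names are invented for illustration.

```python
import io
import pandas as pd

csv_text = '2.8,1,0\n3.7,1,0\n'          # hypothetical rows, no header line
names = ['count', 'season', 'holiday']   # hypothetical column names

# With names=, both lines are treated as data
with_names = pd.read_csv(io.StringIO(csv_text), names=names)

# Without names=, the first data line is consumed as the header
without_names = pd.read_csv(io.StringIO(csv_text))

print(with_names.shape)
print(without_names.shape)
```

Without `names=`, one row of data is silently lost and the column labels become data values.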

In [None]:
df_train.head()

In [None]:
df_validation.head() # data check

### Separating Features and Targets for Training and Validation
This prepares the data for use in XGBoost's regressor.

*Note: Remember that Python indices start at 0.*

In [None]:
x_train = df_train.iloc[:,1:] # Features: second column [1] through the end
y_train = df_train.iloc[:,0].ravel() # Target: first column [0]

x_validation = df_validation.iloc[:,1:] # Features: second column [1] through the end
y_validation = df_validation.iloc[:,0].ravel() # Target: first column [0]
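
The positional split above can be tried on a toy frame. This is only a sketch with invented column names and values; it assumes, as in this notebook, that the target is the first column.

```python
import pandas as pd

toy = pd.DataFrame(
    {'count': [2.8, 3.7, 1.1], 'season': [1, 1, 2], 'hour': [0, 1, 2]},
    columns=['count', 'season', 'hour'],  # target first, then features
)

features = toy.iloc[:, 1:]       # every column except the first
target = toy.iloc[:, 0].values   # first column as a 1-D NumPy array

print(features.columns.tolist())
print(target.shape)
```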

## Set Up XGBoost Regressor

The cells below create the regressor, set its hyperparameters, and then fit the model to the training data.

The Distributed (Deep) Machine Learning Community's XGBoost training parameter reference is available [here](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst).

In this project, I set the objective to ```reg:squarederror``` because ```reg:linear``` is deprecated.

I also apply my own tuning to the _XGBoost_ regressor.

### Create Regressor

In [None]:
# Create regressor
# XGBoost Training Parameters Reference:
# https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst
# max_depth is limited to 5 instead of 6.
# n_estimators sets the number of decision trees in XGBoost

regressor = xgb.XGBRegressor(max_depth=5, n_estimators=200, objective='reg:squarederror')
# Raising n_estimators from 150 to 200 and on to 250 yields little additional reduction in RMSE

In [None]:
regressor # Display the hyperparameters. I do this out of habit to confirm my settings
# before fitting. Note: they are also displayed at the end of the training process.

In [None]:
regressor.fit(x_train, y_train, eval_set = [(x_train, y_train), (x_validation, y_validation)])

In [None]:
df_train['count'].describe()

In [None]:
eval_result = regressor.evals_result()

In [None]:
training_rounds = range(len(eval_result['validation_0']['rmse']))

In [None]:
print(training_rounds) # check
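
Because the target in this notebook is _log1p(count)_, the `rmse` values stored in `eval_result` are measured on the log scale. An RMSE computed on log1p-transformed values is the same quantity as the RMSLE of the raw counts. The sketch below checks this with invented numbers.

```python
import numpy as np

# Hypothetical raw rental counts and predictions
actual_counts = np.array([10.0, 50.0, 200.0])
predicted_counts = np.array([12.0, 40.0, 230.0])

# RMSE computed on the log1p scale (what the model is trained and evaluated on)
y_true_log = np.log1p(actual_counts)
y_pred_log = np.log1p(predicted_counts)
rmse_on_log_scale = np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))

# RMSLE computed directly from the raw counts
rmsle_on_counts = np.sqrt(
    np.mean((np.log1p(predicted_counts) - np.log1p(actual_counts)) ** 2)
)

print(np.isclose(rmse_on_log_scale, rmsle_on_counts))
```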

## Plot the Training vs. Validation Errors

In [None]:
plt.scatter(x=training_rounds, y=eval_result['validation_0']['rmse'], label='Training Error')
plt.scatter(x=training_rounds, y=eval_result['validation_1']['rmse'], label='Validation Error')
plt.grid(True)
plt.xlabel('Rounds')
plt.ylabel('RMSE')
plt.title('Training Vs. Validation Error')
plt.legend()
plt.show()

### XGBoost Feature Importance
By default, the graph is a horizontal bar chart labeled with counts (the number of times each feature is used in a split).

In [None]:
xgb.plot_importance(regressor)
plt.show()

### Verify Quality using Validation Dataset

In [None]:
df = pd.read_csv(validation_file,names=columns)
# compare actual vs. predicted performance with dataset not seen by the model

In [None]:
df.head() # data check

In [None]:
df.shape # display a tuple with the DataFrame's dimensions: (rows, columns)

In [None]:
x_test = df.iloc[:,1:]
print(x_test[:5])

In [None]:
result = regressor.predict(x_test)

In [None]:
result[:5]

In [None]:
df['count_predicted'] = result

In [None]:
df.head() # new column at the end

### Negative Values Can Appear in Predictions
The _pandas_ `DataFrame.describe` function, which generates descriptive statistics, shows them in its `min` row.

### Finding All Negative Values for Zeroing
Sometimes regressors predict values that do not fit the problem's context; here, a rental count cannot be negative.

In [None]:
df['count_predicted'].hist()
plt.title('Predicted Count Histogram')
plt.show()
# Check whether any values fall below 0

In [None]:
df[df['count_predicted'] < 0]

Note: There are no values below 0.

### Clip Negative Predictions to Zero

In [None]:
def adjust_count(x):
    if x < 0:
        return 0
    else:
        return x

In [None]:
df['count_predicted'] = df['count_predicted'].map(adjust_count)

In [None]:
df[df['count_predicted'] < 0] # double-check
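
The element-wise `adjust_count` mapping above can also be written as a vectorized clip. The following sketch compares the two on invented predictions; `Series.clip(lower=0)` is offered as an alternative, not as what this notebook ran.

```python
import pandas as pd

def adjust_count(x):
    # Same rule as the notebook's function: negatives become 0
    return 0 if x < 0 else x

preds = pd.Series([-0.3, 0.0, 1.7, 4.2])   # hypothetical predictions

mapped = preds.map(adjust_count)   # element-wise Python function
clipped = preds.clip(lower=0)      # vectorized equivalent

print((mapped == clipped).all())
```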

In [None]:
df['count'] = df['count'].map(np.expm1)
df['count_predicted'] = df['count_predicted'].map(np.expm1)
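
The two `map(np.expm1)` calls above undo the `log1p` transform applied during data preparation. A short sketch with invented counts shows the round trip, and that a prediction clipped to 0 on the log scale maps back to a count of 0.

```python
import numpy as np

counts = np.array([0.0, 1.0, 16.0, 977.0])   # hypothetical rental counts

log_counts = np.log1p(counts)      # log(1 + x), defined for x = 0
restored = np.expm1(log_counts)    # exp(x) - 1, the inverse transform

print(np.allclose(restored, counts))
print(np.expm1(0.0))
```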

## Plot Actual vs. Predicted

In [None]:
plt.plot(df['count'], label='Actual')
plt.plot(df['count_predicted'], label='Predicted')
plt.xlabel('Sample')
plt.ylabel('Rental Count')
plt.xlim([100,150])
plt.title('Validation Dataset: Predicted vs. Actual')
plt.legend()
plt.show()

In [None]:
# Over-prediction and under-prediction should be balanced
# Validation data residuals
residuals = (df['count'] - df['count_predicted'])

plt.hist(residuals)
plt.grid(True)
plt.xlabel('Actual - Predicted')
plt.ylabel('Count')
plt.title('Residuals Distribution')
plt.axvline(color='r')
plt.show()

In [None]:
value_counts = (residuals > 0).value_counts(sort=False)
print(' Under Estimation: {0:0.2f}'.format(value_counts[True]/len(residuals)))
print(' Over  Estimation: {0:0.2f}'.format(value_counts[False]/len(residuals)))
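
`value_counts` has no entry for an outcome that never occurs, so the lookups above depend on both `True` and `False` being present. Taking the mean of a boolean mask gives the same shares without that dependency. The sketch below uses invented residuals in which every actual value exceeds its prediction.

```python
import pandas as pd

# Hypothetical residuals (actual - predicted), all positive
residuals = pd.Series([1.0, 2.5, 0.4])

under_share = (residuals > 0).mean()    # actual above predicted
over_share = (residuals < 0).mean()     # actual below predicted
exact_share = (residuals == 0).mean()   # predicted equals actual

print(' Under Estimation: {0:0.2f}'.format(under_share))
print(' Over  Estimation: {0:0.2f}'.format(over_share))
```

Separating `exact_share` also avoids counting exact matches as over-estimation.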

### Print Metrics of the Model

In [None]:
# Current Model's RMSE
print("Model's RMSE: {0:0.2f}".format(mean_squared_error(df['count'],df['count_predicted'])**.5))

In [None]:
# RMSLE - Root Mean Squared Log Error
# The RMSLE metric is used by Kaggle

# RMSE cost function: the magnitude of the difference matters

# RMSLE cost function: "Only Percentage difference matters"

# Reference: Katerina Malahova, Khor SoonHin
# https://www.slideshare.net/KhorSoonHin/rmsle-cost-function
def compute_rmsle(y_true, y_pred):
    # Accept lists, Series, or arrays
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    return np.average((np.log1p(y_pred) - np.log1p(y_true))**2)**.5

In [None]:
print('RMSLE: {0:.2f}'.format(compute_rmsle(df['count'], df['count_predicted'])))
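
The difference between the two metrics can be seen with a pair of invented cases that share the same absolute error of 10 but differ greatly in relative error. The helper names `rmse` and `rmsle` below are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def rmsle(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Both cases are off by 10 rentals
small_true, small_pred = [10.0], [20.0]       # prediction is double the actual
large_true, large_pred = [1000.0], [1010.0]   # prediction is 1% above the actual

print(rmse(small_true, small_pred), rmse(large_true, large_pred))
print(rmsle(small_true, small_pred), rmsle(large_true, large_pred))
```

RMSE treats the two cases identically, while RMSLE penalizes the first far more heavily.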

## Prepare Data for Kaggle Submission

In [None]:
df_test = pd.read_csv(test_file,parse_dates=['datetime'])

In [None]:
df_test.head()

In [None]:
x_test = df_test.iloc[:,1:] # Exclude datetime for prediction
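
Since the features are selected by position, it may be worth confirming that the test columns line up with the training feature columns before predicting. This is a sketch with invented column names; in the notebook, the expected names would come from `columns[1:]`.

```python
import pandas as pd

train_columns = ['count', 'season', 'hour']   # hypothetical: target first, then features

df_test_demo = pd.DataFrame(
    {
        'datetime': pd.to_datetime(['2011-01-20 01:00:00', '2011-01-20 02:00:00']),
        'season': [1, 1],
        'hour': [1, 2],
    },
    columns=['datetime', 'season', 'hour'],
)

x_test_demo = df_test_demo.iloc[:, 1:]   # drop the leading datetime column
expected_features = train_columns[1:]    # drop the leading target column

columns_match = list(x_test_demo.columns) == expected_features
print(columns_match)
```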

In [None]:
x_test.head()

In [None]:
result = regressor.predict(x_test)

In [None]:
result[:5]

In [None]:
np.expm1(result)

In [None]:
df_test['count'] = np.expm1(result)

In [None]:
df_test.head()

In [None]:
df_test[df_test['count']<0]

In [None]:
df_test[['datetime','count']].to_csv('predicted_countv2.csv',index=False)
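
Before uploading, the written file can be read back to confirm its shape. The sketch below does this with an in-memory buffer and invented rows; it assumes only that the submission holds the `datetime` and `count` columns, as written above.

```python
import io
import pandas as pd

submission = pd.DataFrame(
    {
        'datetime': pd.to_datetime(['2011-01-20 01:00:00', '2011-01-20 02:00:00']),
        'count': [12.4, 3.0],   # hypothetical predictions
    },
    columns=['datetime', 'count'],
)

# Write without the index, then read the text back
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer, parse_dates=['datetime'])

print(round_trip.columns.tolist())
print(len(round_trip))
```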