# Simple Regression Dataset - Linear Regression vs. XGBoost (Local Mode)

This notebook trains an XGBoost model locally, on the notebook instance itself.
XGBoost, or [Extreme Gradient Boosting](https://www.datacamp.com/community/tutorials/xgboost-in-python), is a family of boosting algorithms built on the gradient boosting framework.

This section uses XGBoost installed locally on the instance.

* Training may take several minutes, even with a small amount of data
* When an algorithm is supported in *Python*, the data can be kept local to the instance
* In this section: compare XGBoost to Linear Regression on the same dataset

**Kernel used:** Conda with TensorFlow Python 3.6.5 for Amazon Elastic Inference *(conda_amazonei_tensorflow_p36)*

### Major Library Versions Used

| Library | Version |
|---------|:--------|
| conda | 4.8.2 |
| matplotlib | 3.0.3 |
| numpy | 1.17.4 |
| pandas | 0.24.2 |
| pip | 20.2 |
| python | 3.6.5 |
| xgboost | 0.90 |

### First update *conda* and *pip* to the latest versions

In [None]:
!conda install conda -y
!pip install --upgrade pip

### Ensure required packages are installed

In [None]:
!conda list conda
!conda list numpy
!conda list pandas
!conda list pip
!conda list python
!conda list matplotlib
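
The same check can be done from Python instead of the shell. The sketch below is a hypothetical helper (`installed_versions` is not part of the original notebook) that tries to import each package and reports its `__version__`, treating any import failure as "not installed".

```python
import importlib


def installed_versions(package_names):
    """Map each package name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Any import failure is treated as "not installed" for this check.
            versions[name] = None
            continue
        # Some modules do not expose __version__; report that explicitly.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


report = installed_versions(["numpy", "pandas", "matplotlib", "xgboost"])
for name, version in report.items():
    print(f"{name:<12} {version if version else 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with the cell in the next section.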

### Missing Required Libraries

In [None]:
# If missing required libraries: uncomment
# ! conda install <package from previous step>

## Install XGBoost into the Notebook

Here, I use *pip* to install.

Ensure that a kernel is running before installing.
***Note:*** *This may take several minutes for the initial installation.*

In [None]:
!pip install xgboost==0.90

## Libraries

In [None]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error # to calculate the regression loss
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.linear_model import LinearRegression

In [None]:
# XGBoost

import xgboost as xgb

### Data Read

In [None]:
df = pd.read_csv('linear_data.csv')

In [None]:
df.head()

### Plot the dataset

In [None]:
plt.plot(df.x, df.y, label='Target')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('Simple Regression Dataset')
plt.show()

### Load Training and Validation Datasets

In [None]:
train_file = 'linearTrain.csv'
validation_file = 'linearValidation.csv'

# Specify the column names, since the files do not have headers
df_train = pd.read_csv(train_file, names=['y','x'])
df_validation = pd.read_csv(validation_file, names=['y','x'])

In [None]:
df_train.head() # data check

In [None]:
df_validation.head()

### Plot the datasets

In [None]:
plt.scatter(df_train.x, df_train.y, label='Training', marker='.')
plt.scatter(df_validation.x, df_validation.y, label='Validation', marker='.')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.title('Simple Regression Dataset')
plt.legend()
plt.show()

### Separating Features and Targets for Training and Validation
This is in preparation for use in XGBoost's regressor
*Note: Remember that Python indices start at 0*

In [None]:
X_train = df_train.iloc[:,1:] # Features pull from 2nd column to the end
y_train = df_train.iloc[:,0].ravel() # Target: 1st Column (0th) Recall: ravel to flatten array

X_validation = df_validation.iloc[:,1:]
y_validation = df_validation.iloc[:,0].ravel()
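
To make the column slicing concrete, here is a minimal sketch on an in-memory stand-in for a headerless file. The three rows are invented for illustration and are not taken from `linearTrain.csv`; `to_numpy()` is used here in place of `ravel()` and yields the same flat array.

```python
import io

import pandas as pd

# Hypothetical headerless CSV: target first, feature second (same layout as above).
csv_text = "13.0,1.0\n18.0,2.0\n23.0,3.0\n"
demo = pd.read_csv(io.StringIO(csv_text), names=["y", "x"])

X_demo = demo.iloc[:, 1:]            # 2-D DataFrame of features, shape (3, 1)
y_demo = demo.iloc[:, 0].to_numpy()  # 1-D array of targets, shape (3,)

print(X_demo.shape, y_demo.shape)
```

The features stay two-dimensional (one row per sample, one column per feature) while the target is a flat array, which is the shape the regressors below expect.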

## Build the XGBoost Regressor model

The cells below create the regressor, set its hyperparameters, and then fit the model to the training data.

Find Distributed (Deep) Machine Learning Community's XGBoost Training Parameter Reference [here](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst)

In this project, I set the objective to ```reg:squarederror```, since ```reg:linear``` is deprecated.


In [None]:
# Create regressor
regressor = xgb.XGBRegressor(objective='reg:squarederror')

In [None]:
regressor # display hyperparameters. Note: This will display at the end of the training model process, as well.

### Training the model

In [None]:
# Provide the Training and Validation Datasets
# XGBoost will report the training and validation errors
# While training, the errors should trend downwards
regressor.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_validation, y_validation)])

From the model's result, one can see that the training *rmse* trends downward as the model improves.
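
To see why the training error falls round after round, the following is a minimal sketch of gradient boosting for squared error using one-split trees (stumps) in NumPy. It is a toy illustration, not XGBoost's actual implementation. The synthetic data, noise level, learning rate, and the helper names `fit_stump` and `boost` are all assumptions made for this example.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single split on 1-D x that best fits the residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]


def boost(x, y, rounds=30, learning_rate=0.3):
    """Fit stumps to the current residuals and record the training RMSE per round."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    rmse_history = [np.sqrt(np.mean((y - prediction) ** 2))]
    for _ in range(rounds):
        threshold, left_value, right_value = fit_stump(x, y - prediction)
        # Each new stump nudges the prediction toward the remaining error.
        prediction += learning_rate * np.where(x <= threshold, left_value, right_value)
        rmse_history.append(np.sqrt(np.mean((y - prediction) ** 2)))
    return rmse_history


rng = np.random.default_rng(0)
x = np.arange(1, 150, dtype=float)
y = 5 * x + 8 + rng.normal(0, 10, size=x.size)  # assumed noise level

rmse_history = boost(x, y)
print("first round RMSE:", round(rmse_history[0], 2))
print("last round RMSE: ", round(rmse_history[-1], 2))
```

Each stump is fitted to what the previous rounds got wrong, so the training RMSE cannot increase from one round to the next in this setup. The validation error carries no such guarantee, which is why both are tracked.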

### Plotting the Errors
First, pull the Training RMSE and Evaluation RMSE values

In [None]:
eval_result = regressor.evals_result()

In [None]:
eval_result # display results in a new format

In [None]:
training_rounds = range(len(eval_result['validation_0']['rmse'])) # x-axis data

In [None]:
### Graph
plt.scatter(x=training_rounds,y=eval_result['validation_0']['rmse'],label='Training Error')
plt.scatter(x=training_rounds,y=eval_result['validation_1']['rmse'],label='Validation Error')
plt.grid(True)
plt.xlabel('Iterations')
plt.ylabel('RMSE')
plt.title('XGBoost Training vs. Validation Error')
plt.legend()
plt.show()

### XGBoost Feature Importance
The *plot_importance* function shows which features were useful to the model.

In [None]:
xgb.plot_importance(regressor) # To find which features were useful in the model
plt.show()

In this case, *x* was the only feature. This becomes more interesting with more complex models.

## Validation Dataset: Compare Actual and Predicted
This section focuses on evaluating the performance of the model.

In [None]:
result=regressor.predict(X_validation) # predicted results for plotting

In [None]:
plt.scatter(df_validation.x, df_validation.y, label='actual', marker='.')
plt.scatter(df_validation.x, result, label='predicted', marker='.')
plt.grid(True)
plt.legend()
plt.title('XGBoost: Validation Dataset')
plt.show()

### XGBoost Metrics
Calculate the *mean squared error* and *root mean squared error*.
Reminder: RMSE is the square root of the average squared residual (prediction error). When the residuals average to zero, it equals their standard deviation, so it measures how tightly the predictions concentrate around the actual values.
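
As a small worked example of these definitions, the sketch below computes MSE and RMSE by hand with NumPy and checks that MSE splits into the variance of the residuals plus the squared mean residual. The four actual/predicted values are invented for illustration.

```python
import numpy as np

# Hypothetical values, chosen only to make the arithmetic easy to follow.
actual = np.array([13.0, 18.0, 23.0, 28.0])
predicted = np.array([12.0, 19.5, 22.0, 30.0])

residuals = actual - predicted    # [ 1.0, -1.5,  1.0, -2.0]
mse = np.mean(residuals ** 2)     # (1 + 2.25 + 1 + 4) / 4 = 2.0625
rmse = np.sqrt(mse)

bias = residuals.mean()           # average residual
spread = residuals.std()          # population standard deviation of the residuals

print("MSE: ", mse)
print("RMSE:", rmse)
print("spread^2 + bias^2:", spread ** 2 + bias ** 2)
```

When the average residual is close to zero, the RMSE and the standard deviation of the residuals are nearly the same number.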

In [None]:
# Display the Root Mean Square Error (RMSE) Metrics
print('XGBoost Algorithm Metrics')
mse = mean_squared_error(df_validation.y,result)
print(' Mean Squared Error: {0: .2f}'.format(mse))
print(' Root Mean Squared Error: {0: .2f}'.format(mse**.5))

### XGBoost Residual Histogram

In [None]:
# Validation Data Residuals
residuals = df_validation.y - result
plt.hist(residuals)
plt.grid(True)
plt.xlabel('Actual - Predicted')
plt.ylabel('Count')
plt.title('XGBoost Residual')
plt.axvline(color='r') # overall center of data deviations
plt.show()

### XGBoost Plot Predicted vs. Actual Targets

In [None]:
# Plot the dataset
plt.plot(df.x, df.y, label='Target')
plt.plot(df.x, regressor.predict(df[['x']]), label='Predicted')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('XGBoost')
plt.show()

The predicted curve is nearly identical to the target, so XGBoost fits this dataset well.

## Linear Regression Algorithm
This section will set up *sklearn's* linear regression for comparison to XGBoost.

In [None]:
linear_regressor = LinearRegression() # from sklearn

In [None]:
linear_regressor.fit(X_train,y_train)

### Compare Weights Assigned by Linear Regression
The data was generated from the function _5*x + 8 + noise_.
The cells below show the fitted weights:
* coefficient(s) -> coef_
* intercept -> intercept_

In [None]:
linear_regressor.coef_ # Do not forget underscore '_' at the end. 
# This will estimate the coefficient for the linear regression plot

Notice that the array value is very close to the actual coefficient *5*.

In [None]:
linear_regressor.intercept_

Notice that the value is very close to the actual intercept *8*.
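
The recovery of the slope and intercept can be reproduced without `sklearn`. The sketch below solves the same ordinary least squares problem with NumPy on synthetic data drawn from _5*x + 8 + noise_. The input range, noise level, and random seed are assumptions, since the original data generation is not shown here.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(1, 150, dtype=float)              # assumed input range
y = 5 * x + 8 + rng.normal(0, 10, size=x.size)  # assumed noise level

# Design matrix with a column of ones so the intercept is estimated too.
design = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)

print("slope:    ", round(slope, 3))
print("intercept:", round(intercept, 3))
```

The slope is pinned down much more tightly than the intercept, so with noisy data the intercept estimate usually sits further from *8* than the slope does from *5*.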

In [None]:
linear_result = linear_regressor.predict(df_validation[['x']])

In [None]:
plt.scatter(df_validation.x,df_validation.y,label='actual',marker='.')
plt.scatter(df_validation.x,linear_result,label='predicted',marker='.')
plt.grid(True)
plt.title('LinearRegression - Validation Dataset')
plt.legend()
plt.show()

The plot shows the fitted line along which the predictions fall.

### Linear Regression Metrics
Calculate the *mean squared error* and *root mean squared error*.

In [None]:
# Display the Root Mean Square Error (RMSE) Metrics
print('Linear Regression Algorithm Metrics')
mse = mean_squared_error(df_validation.y,linear_result)
print(' Mean Squared Error: {0: .2f}'.format(mse))
print(' Root Mean Squared Error: {0: .2f}'.format(mse**.5))

### Linear Regression Residual Histogram

In [None]:
# Validation Data Residuals
residuals = df_validation.y - linear_result
plt.hist(residuals)
plt.grid(True)
plt.xlabel('Actual - Predicted')
plt.ylabel('Count')
plt.title('Linear Regression Residual')
plt.axvline(color='r') # overall center of data deviations
plt.show()

Linear Regression in this case performed better than XGBoost.

### Linear Plot Predicted vs. Actual Targets

In [None]:
# Plot the dataset
plt.plot(df.x, df.y, label='Target')
plt.plot(df.x, linear_regressor.predict(df[['x']]), label='Predicted')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('Linear Regression')
plt.show()

## Input Features - Outside of Range Used for Training
* XGBoost predictions have an upper and lower bound (this applies to tree-based algorithms in general)
* Linear Regression extrapolates

In [None]:
# Revisit the function
def straight_line(x):
    return 5*x+8

### X is outside the training samples' range

In [None]:
X = np.array([-100,-5,0.5,1,1.9,5,29,49,160,1000,5000])
y = straight_line(X)

df_IF = pd.DataFrame({'x':X,'y':y})
df_IF['xgboost']=regressor.predict(df_IF[['x']])
df_IF['linear']=linear_regressor.predict(df_IF[['x']])

In [None]:
df_IF # display values

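* XGBoost predictions are capped at an upper and lower bound and do not reach the extent of *y*. This is a property of tree-based models rather than of the ```reg:squarederror``` objective: each leaf predicts a constant, so every input beyond the outermost split falls into the same leaf. Here the predictions flatten out below roughly *X=1* and above roughly *X=149*, the edges of the training range.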
* Linear followed the *y* values nearly identically
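
The bounded behaviour can be reproduced with a much simpler piecewise-constant model. The sketch below is a stand-in for a tree, not XGBoost itself: it splits the training range into bins and predicts the mean target of each bin. The helper names `fit_binned_means` and `predict_binned`, the bin count, and the noiseless training data are assumptions for illustration.

```python
import numpy as np


def fit_binned_means(x, y, n_bins=10):
    """Learn interior split points from x and one constant 'leaf' value per bin."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))[1:-1]
    bin_index = np.searchsorted(edges, x)
    leaf_values = np.array([y[bin_index == b].mean() for b in range(n_bins)])
    return edges, leaf_values


def predict_binned(model, x_new):
    """Look up the leaf each input falls into; out-of-range inputs hit the end bins."""
    edges, leaf_values = model
    return leaf_values[np.searchsorted(edges, np.asarray(x_new, dtype=float))]


x_fit = np.arange(1, 150, dtype=float)  # assumed training range
y_fit = 5 * x_fit + 8                   # noiseless for clarity
model = fit_binned_means(x_fit, y_fit)

x_probe = np.array([-100.0, 1.0, 149.0, 5000.0])
predictions = predict_binned(model, x_probe)
for value, prediction in zip(x_probe, predictions):
    print(f"x={value:>7.1f}  piecewise={prediction:8.1f}  true={5 * value + 8:9.1f}")
```

Inputs of -100 and 1 land in the same first bin, and 149 and 5000 land in the same last bin, so the prediction cannot move past the values seen during training.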

### Visualize the Outside of Range

In [None]:
# XGBoost Predictions: upper and lower bounds
# Linear Regression: extrapolation
plt.scatter(df_IF.x, df_IF.y, label='Actual',color='red')
plt.plot(df_IF.x,df_IF.linear,label='LinearRegression')
plt.plot(df_IF.x,df_IF.xgboost,label='XGBoost')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Input Outside Range')
plt.show()

Here, the flat horizontal segments show where the XGBoost predictions are bounded.

Red dots are the 'ground truth.'


### X is Inside the training samples' range

In [None]:
X = np.array([1,3,5,7,89,110,125,149]) # Values changed
y = straight_line(X)

df_IF = pd.DataFrame({'x':X,'y':y})
df_IF['xgboost']=regressor.predict(df_IF[['x']])
df_IF['linear']=linear_regressor.predict(df_IF[['x']])

In [None]:
df_IF # display values

* XGBoost and Linear both followed the *y* values closely.

### Visualize the Inside of Range

In [None]:
# Inputs are inside the training range this time
# Both XGBoost and Linear Regression should track the actual values
plt.scatter(df_IF.x, df_IF.y, label='Actual',color='red')
plt.plot(df_IF.x,df_IF.linear,label='LinearRegression')
plt.plot(df_IF.x,df_IF.xgboost,label='XGBoost')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Input Inside Range')
plt.show()

Here, there is no flat horizontal segment for XGBoost, because all inputs are inside its training range.

Red dots are the 'ground truth.'

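XGBoost's bounding is a consequence of how decision trees work: each branch only compares the input against thresholds learned from the training data, and each leaf returns a constant value. Inputs beyond the smallest or largest threshold all end up in the same leaf, so the prediction stops changing.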

## Summary
1. Updated core installation tools
1. Checked for required libraries
1. Installed `xgboost` for local mode
1. Built training and validation datasets
1. Built `xgboost` Regressor
1. Built *Linear Regression* with `sklearn`
1. Explored performance 'Out of Range' and 'In Range' Inputs with *XGBoost* and *Linear Regression*