# Simple Regression Dataset - Linear Regression vs. XGBoost

This notebook trains an XGBoost model locally on the notebook instance.
XGBoost, or [Extreme Gradient Boosting](https://www.datacamp.com/community/tutorials/xgboost-in-python), is a family of boosting algorithms that uses the gradient boosting framework at its core.

* Later in the project's progression, *SageMaker's* XGBoost algorithm is used
* This will take several minutes to train (even with a small amount of data)
* When an algorithm is supported in *Python*, the model can be trained locally on the instance
* In this section: compare XGBoost to Linear Regression on the dataset

**Kernel used:** Conda with TensorFlow Python 3.6.7

## Install XGBoost into the Notebook

Here, I am using *conda* to install packages. For those familiar with it, this is the same installer found in *Anaconda Navigator*.

Ensure that a kernel is running before installing.
***Note:*** *This may take several minutes for the initial installation.*

### First, update *conda* to the latest version

In [None]:
!conda install conda -y

In [None]:
!pip install --upgrade pip

### Ensure required packages are installed

In [None]:
!conda list numpy
!conda list pandas
!conda list python
!conda list xgboost
!conda list matplotlib
!conda list sagemaker
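
The same check can be done from Python instead of the shell. The sketch below is only an illustration: the helper name `installed_versions` is hypothetical, and it assumes each package exposes a `__version__` attribute, which is a common convention but not guaranteed.

```python
import importlib


def installed_versions(package_names):
    """Return {name: version string, or None if the package cannot be imported}."""
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except Exception:  # ImportError, or a broken native library behind the package
            versions[name] = None
        else:
            # Assumption: the package follows the __version__ convention
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Packages used in this notebook
packages = ["numpy", "pandas", "xgboost", "matplotlib", "sagemaker"]
for name, version in installed_versions(packages).items():
    print(f"{name:12s} {version if version else 'not installed'}")
```

Unlike `conda list`, this reports what the running kernel can actually import, which helps when *conda* and *pip* have installed into different environments.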

In [None]:
!conda install matplotlib pandas -c conda-forge -y

### Update SageMaker
SageMaker 1.50.9.post0 as of 2/6/2020

In [None]:
!pip install sagemaker==1.50.9.post0

### Next: Ensure XGBoost is installed
**Caution**: **AWS SageMaker** uses its own module called *sagemaker.xgboost* (requiring no additional installation beyond the *sagemaker* package on an AWS instance)

XGBoost will install its required *numpy* package. (At this time, XGBoost 0.90 requires numpy 1.16.4 and will downgrade numpy if needed.)

In [None]:
!pip3 install xgboost==0.90

## Libraries

In [None]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error # to calculate the regression loss
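
As a reminder of what the two imported metrics measure, here is a plain-Python sketch. The function names and the sample values are made up for illustration; in the notebook itself the *sklearn* functions are the ones to use.

```python
def mse_sketch(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_sketch(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mse = mse_sketch(y_true, y_pred)  # (1 + 0 + 4) / 3
mae = mae_sketch(y_true, y_pred)  # (1 + 0 + 2) / 3
print(mse, mae)
```

Squaring penalizes the large miss (2.0) more heavily than the small one, which is why the MSE here is larger than the MAE.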

In [None]:
# XGBoost
import xgboost as xgb # the open-source xgboost package installed above, under its conventional alias
from sklearn.linear_model import LinearRegression
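
For a dataset with a single input feature, the least-squares line that `LinearRegression` fits can also be written in closed form. The sketch below is an illustration only, with a hypothetical helper name and made-up data; it is not how *sklearn* is implemented internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

A linear model extrapolates along this line outside the training range, while a tree-based model such as XGBoost predicts piecewise-constant values, which is worth keeping in mind when comparing the two.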

### Data Read

In [None]:
df = pd.read_csv('linear_data.csv')

In [None]:
df.head()

### Plot the dataset

In [None]:
plt.plot(df.x, df.y, label='Target')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.legend()
plt.title('Simple Regression Dataset')
plt.show()

### Load Training and Validation Datasets

In [None]:
train_file = 'linearTrain.csv'
validation_file = 'linearValidation.csv'

# Specify the column names, since the files do not have headers
df_train = pd.read_csv(train_file, names=['y','x'])
df_validation = pd.read_csv(validation_file, names=['y','x'])

In [None]:
df_train.head() # data check

In [None]:
df_validation.head()

### Plot the datasets

In [None]:
plt.scatter(df_train.x, df_train.y, label='Training', marker='.')
plt.scatter(df_validation.x, df_validation.y, label='Validation', marker='.')
plt.grid(True)
plt.xlabel('Input Feature')
plt.ylabel('Target')
plt.title('Simple Regression Dataset')
plt.legend()
plt.show()

### Separating Features and Targets for Training and Validation
This is in preparation for use in XGBoost's regressor.

*Note: Remember that Python indices start at 0.*

In [None]:
x_train = df_train.iloc[:,1:] # Features: pull from 2nd column to the end (stays 2-D)
y_train = df_train.iloc[:,0].ravel() # Target: 1st column (index 0). Recall: ravel flattens the array

x_validation = df_validation.iloc[:,1:]
y_validation = df_validation.iloc[:,0].ravel()
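
The slicing style matters because scikit-learn-style estimators expect a two-dimensional feature matrix and a one-dimensional target. A small sketch of the difference, using a made-up in-memory stand-in for a headerless file such as *linearTrain.csv* (the values are not from the real dataset):

```python
import io

import pandas as pd

# Made-up stand-in for a headerless CSV: target first, then the feature
csv_text = "10.0,1.0\n20.0,2.0\n30.0,3.0\n"
df = pd.read_csv(io.StringIO(csv_text), names=['y', 'x'])

features_1d = df.iloc[:, 1]            # single index -> Series, shape (3,)
features_2d = df.iloc[:, 1:]           # slice -> DataFrame, shape (3, 1)
target = df.iloc[:, 0].values.ravel()  # NumPy array, shape (3,)

print(features_1d.shape, features_2d.shape, target.shape)
```

The slice form also keeps working unchanged if more feature columns are added later.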

## Create an XGBoost Regressor for this instance

Find Distributed (Deep) Machine Learning Community's XGBoost Training Parameter Reference [here](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst)

Find Amazon's SageMaker XGBoost documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html)

In this project, I updated the following parameter (also noted later):
* objective='reg:squarederror'


In [None]:
# Going with defaults, except for objective. reg:linear is deprecated.
regressor = xgb.XGBRegressor(objective='reg:squarederror')

In [None]:
# Default Options display
regressor

### Training the model

In [None]:
# Provide the training and validation datasets
# XGBoost will report the training and validation errors
# While training, the errors should trend downwards
regressor.fit(x_train,y_train, eval_set = [(x_train, y_train), (x_validation, y_validation)])
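
To see why the reported errors should trend downwards, here is a toy gradient-boosting loop in plain Python. It is a simplified sketch, not XGBoost's implementation: it uses one-split stumps, no regularization, made-up data, and hypothetical names (`fit_stump`, `boost`). Each round fits a stump to the current residuals and adds a scaled copy of it to the prediction.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def rmse(ys, preds):
    return (sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)) ** 0.5


def boost(train_x, train_y, val_x, val_y, rounds=10, learning_rate=0.3):
    """Return a list of (train_rmse, validation_rmse), one entry per round."""
    base = sum(train_y) / len(train_y)  # start from the mean target
    train_pred = [base] * len(train_x)
    val_pred = [base] * len(val_x)
    history = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(train_y, train_pred)]
        stump = fit_stump(train_x, residuals)
        train_pred = [p + learning_rate * predict_stump(stump, x)
                      for p, x in zip(train_pred, train_x)]
        val_pred = [p + learning_rate * predict_stump(stump, x)
                    for p, x in zip(val_pred, val_x)]
        history.append((rmse(train_y, train_pred), rmse(val_y, val_pred)))
    return history


# Made-up linear data: y = 2x + 1
train_x = [float(i) for i in range(10)]
train_y = [2.0 * x + 1.0 for x in train_x]
val_x = [0.5, 2.5, 4.5, 6.5, 8.5]
val_y = [2.0 * x + 1.0 for x in val_x]

history = boost(train_x, train_y, val_x, val_y)
for round_number, (train_rmse, val_rmse) in enumerate(history):
    print(f"round {round_number}: train={train_rmse:.4f}  validation={val_rmse:.4f}")
```

In this sketch the training error cannot increase from one round to the next. The validation error usually follows it down, and a validation error that starts rising while the training error keeps falling is the usual sign of overfitting.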

In [None]:
!conda list numpy
!conda list pandas
!conda list python
!conda list xgboost
!conda list matplotlib

### Major Library Versions

| Library | Version |
|---------|:--------|
| matplotlib | 3.1.3 |
| numpy | 1.17.5 |
| pandas | 0.22.0 |
| python | 3.6.7 |
| sagemaker | 1.50.9.post0 |
| xgboost | 0.90 |

In [None]:
!conda list x*