# PIPELINES
### Also included: boilerplate code for XGBoost and its application
Pipelines are a critical skill for deploying (and even testing) complex models with preprocessing.
A pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Benefits of using pipelines:
* Cleaner code
* Fewer bugs
* Easier to deploy
* More options for model validation

Steps to construct a pipeline:
1. Define preprocessing steps
2. Define the model
3. Create and evaluate the pipeline

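#### THE CODE IN THIS SECTION IS BOILERPLATE ONLY; THE VARIABLES IT USES (numerical_cols, categorical_cols, X_train, X_valid, ...) ARE DEFINED IN THE APPLICATION SECTION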

### 1. Define preprocessing steps
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:
* imputes missing values in numerical data, and
* imputes missing values and applies one-hot encoding (OHE) to categorical data.


In [55]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # strategy can be 'constant'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
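The `handle_unknown='ignore'` setting matters when the validation or test data contains a category that never appeared in training. The sketch below uses a small hypothetical `color` column (not part of the housing data) to show what the encoder does in that case.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the category "green" never appears in training
train = pd.DataFrame({"color": ["red", "blue", "red"]})
valid = pd.DataFrame({"color": ["blue", "green"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned categories are sorted, so the output columns are [blue, red]
learned_categories = list(encoder.categories_[0])
encoded_valid = encoder.transform(valid).toarray()

print(learned_categories)
print(encoded_valid)  # the unseen "green" row is encoded as all zeros
```

Without `handle_unknown='ignore'`, the encoder's default behaviour is to raise an error when it meets an unseen category at transform time.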

### 2. Define the model
As an example, we create a random forest model.

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 100 , random_state = 0)

### 3. Create and evaluate the pipeline
Use the Pipeline class to bundle the preprocessing and model.

In [None]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
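Because the boilerplate above depends on variables defined later, here is a minimal self-contained sketch of the same three steps. The data is a hypothetical eight-row toy set (the names `toy_X`, `toy_y`, `area` and `zone` are assumptions made for illustration), so the resulting MAE is meaningless; the point is only that the whole flow runs end to end.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with gaps in a numerical and a categorical column
toy_X = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0, 110.0, np.nan],
    "zone": pd.Series(["A", "B", "A", np.nan, "B", "A", "B", "A"], dtype="object"),
})
toy_y = pd.Series([100.0, 160.0, 130.0, 240.0, 125.0, 190.0, 215.0, 150.0])

toy_X_train, toy_X_valid = toy_X.iloc[:6], toy_X.iloc[6:]
toy_y_train, toy_y_valid = toy_y.iloc[:6], toy_y.iloc[6:]

# Step 1: preprocessing for each column type
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["area"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["zone"]),
])

# Step 2: the model (kept small so the example runs quickly)
toy_model = RandomForestRegressor(n_estimators=10, random_state=0)

# Step 3: bundle, fit on raw training data, predict on raw validation data
toy_pipeline = Pipeline(steps=[("preprocessor", toy_preprocessor),
                               ("model", toy_model)])
toy_pipeline.fit(toy_X_train, toy_y_train)
toy_preds = toy_pipeline.predict(toy_X_valid)
toy_mae = mean_absolute_error(toy_y_valid, toy_preds)
print("Toy MAE:", toy_mae)
```

Note that the raw data, missing values included, goes straight into `fit` and `predict`; the pipeline applies the imputation and encoding internally.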

### CONCLUSION
Pipelines are valuable for cleaning up ML code and avoiding errors, especially in large projects.

# APPLICATION (Example)

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split

#Save path
train_path = r"home-data-for-ml-course\train.csv"
test_path = r"home-data-for-ml-course\test.csv"

#Read the data
train_full = pd.read_csv(train_path , index_col = "Id")
test_full  = pd.read_csv(test_path  , index_col="Id")

### AXIS = 0 to drop rows and AXIS=1 to drop columns 

# Remove rows with a missing target (the y value, SalePrice)
train_full.dropna(axis=0, subset=["SalePrice"], inplace=True)  ## Passing inplace=True modifies train_full directly
                                                                ## and doesn't return anything. With inplace=False
                                                                ## the original is unchanged and a modified copy is returned.

# Also separate the target from the predictors
y=train_full.SalePrice
train_full.drop(['SalePrice'], axis=1,inplace=True)

# check the train_full
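The effect of `axis` and `inplace` is easiest to see on a tiny frame. The following sketch uses a hypothetical three-row DataFrame (`demo` is an assumed name, not part of the notebook) and only checks the shapes before and after each call.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing target value
demo = pd.DataFrame({"Id": [1, 2, 3],
                     "LotArea": [8450, 9600, 11250],
                     "SalePrice": [200000.0, np.nan, 150000.0]}).set_index("Id")

# Default (inplace=False): demo is untouched, a cleaned copy is returned
cleaned = demo.dropna(axis=0, subset=["SalePrice"])
rows_before, rows_in_copy = len(demo), len(cleaned)

# inplace=True: demo itself loses the row with the missing target
demo.dropna(axis=0, subset=["SalePrice"], inplace=True)
rows_after = len(demo)

# axis=1 drops a column instead of rows
predictors = demo.drop(["SalePrice"], axis=1)

print(rows_before, rows_in_copy, rows_after, list(predictors.columns))
```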

#### Test_train_split

In [24]:
# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(train_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)


# "Cardinality" means the number of unique values in a column
# Select categorical(object) columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
# (.copy() creates independent DataFrames rather than views of the originals)
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = test_full[my_cols].copy()
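To make the cardinality filter concrete, the sketch below applies the same two list comprehensions to a hypothetical four-row frame. The column names and the threshold of 3 are assumptions chosen so the effect is visible on such a small sample; the notebook itself uses a threshold of 10.

```python
import pandas as pd

# Hypothetical frame: two text columns with different cardinality, two numeric columns
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype="object"),  # 2 unique values
    "Address": pd.Series(["a st", "b st", "c st", "d st"], dtype="object"),  # 4 unique values
    "LotArea": [8450, 9600, 11250, 9550],        # int64
    "GarageArea": [548.0, 460.0, 608.0, 642.0],  # float64
})

max_cardinality = 3  # assumed threshold for this toy example

toy_categorical_cols = [cname for cname in toy.columns if
                        toy[cname].nunique() < max_cardinality and
                        toy[cname].dtype == "object"]

toy_numerical_cols = [cname for cname in toy.columns if
                      toy[cname].dtype in ['int64', 'float64']]

print(toy_categorical_cols)  # "Address" is excluded: one-hot encoding it would add a column per row
print(toy_numerical_cols)
```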

In [25]:
X_test

Id,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Condition1,Condition2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
1461,RH,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Feedr,Norm,...,730.0,140,0,0,0,120,0,0,6,2010
1462,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Norm,Norm,...,312.0,393,36,0,0,0,0,12500,6,2010
1463,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,...,482.0,212,34,0,0,0,0,0,3,2010
1464,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Norm,Norm,...,470.0,360,36,0,0,0,0,0,6,2010
1465,RL,Pave,,IR1,HLS,AllPub,Inside,Gtl,Norm,Norm,...,506.0,0,82,0,0,144,0,0,1,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2915,RM,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,...,0.0,0,0,0,0,0,0,0,6,2006
2916,RM,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,...,286.0,0,24,0,0,0,0,0,4,2006
2917,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,...,576.0,474,0,0,0,0,0,0,9,2006
2918,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Norm,Norm,...,0.0,80,32,0,0,0,0,700,7,2006


### Preprocess the data and train the model

In [26]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

print('MAE:', mean_absolute_error(y_valid, preds))


MAE: 17861.780102739725


### Results on the testing data using the trained model
The trained model is completely unaware of the testing data.

In [27]:
# Generating prediction on the testing data
full_prediction = my_pipeline.predict(test_full)

Now we save the predictions to a CSV file so that they can be submitted to competitions.

In [28]:
output = pd.DataFrame({ 'Id':test_full.index,
                       'SalePrice' : full_prediction
                      })
output.to_csv('submission.csv', index = False)

# CROSS VALIDATION
Cross validation gives a more reliable measure of model performance than a single validation set.

For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset. In this case, we say that we have broken the data into 5 "folds".

<img src="https://i.imgur.com/9k60cVA.png"></img>
<br>
In the first experiment we use the first 20% of the data as the validation set and the rest as training data. In the second we use the second 20% as validation and the rest as training. We repeat this for the remaining folds, so that every row is used for validation exactly once.
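The fold rotation described above can be sketched with scikit-learn's `KFold` on ten hypothetical rows (`X_demo` is an assumed name). Without shuffling, each fold holds out one consecutive 20% block.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical rows, so each of the 5 folds holds out 2 rows (20%)
X_demo = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks
folds = []
for fold_number, (train_idx, valid_idx) in enumerate(kfold.split(X_demo), start=1):
    folds.append((train_idx.tolist(), valid_idx.tolist()))
    print(f"Fold {fold_number}: validation rows {valid_idx.tolist()}, "
          f"training rows {train_idx.tolist()}")
```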

<b>When to use cross validation</b>
* Use it when the dataset is small.
* For larger datasets, cross validation can take too long to run, and a single validation set is usually sufficient.
* Even so, cross validation is useful for checking whether your model performs consistently across all of the data.
In [29]:
from sklearn.model_selection import cross_val_score
X = train_full
# Multiply by -1 to get positive values: sklearn's 'neg_mean_absolute_error' scoring returns the negated MAE
# cv is the number of folds.
cvScores = -1 * cross_val_score(my_pipeline , X, y, cv = 5, scoring = 'neg_mean_absolute_error' )
print("Mae scores = :\n", cvScores)


Mae scores = :
 [17739.03188356 17360.37171233 17864.42116438 16309.21712329
 19153.59958904]


In [30]:
print("Best score = ", cvScores.min())
print("Worst score = ", cvScores.max())
print("Average score = ",cvScores.mean())

Best score =  16309.217123287672
Worst score =  19153.599589041096
Average score =  17685.32829452055
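The sign flip can be checked on synthetic data. This sketch swaps in a plain `LinearRegression` and data from `make_regression` (both assumptions made for speed, not what the notebook uses) to show that the raw scores come back negative, following scikit-learn's convention that a higher score is always better.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

raw_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=5, scoring="neg_mean_absolute_error")
mae_scores = -1 * raw_scores  # flip the sign to recover ordinary MAE values

print("Raw scores (negated MAE):", raw_scores)
print("MAE per fold:", mae_scores)
```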


#### A function that uses cross validation to score a given parameter value

In [31]:
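def get_score(n_estimators):
    """Return the average MAE over 3 CV folds for a forest with n_estimators trees."""
    my_pipeline = Pipeline(steps=[
        # X contains categorical columns, so reuse the full preprocessor
        # (a bare SimpleImputer with its default 'mean' strategy fails on text data)
        ('preprocessor', preprocessor),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()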

In [32]:
# results = {}
# for i in range(1,9):
#     results[50*i] = get_score(50*i)
    
# print(results)

### Find the best parameter value

In [35]:
# import matplotlib.pyplot as plt
# %matplotlib inline

# plt.plot(list(results.keys()), list(results.values()))
# plt.show()
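The search loop above is commented out in the notebook, so here is a small runnable sketch of the same idea on synthetic data. The helper `demo_score`, the data from `make_regression` and the three candidate values are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, used only so the example runs on its own
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

def demo_score(n_estimators):
    """Average MAE over 3 cross-validation folds (hypothetical helper)."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    scores = -1 * cross_val_score(model, X_demo, y_demo, cv=3,
                                  scoring="neg_mean_absolute_error")
    return scores.mean()

# Score each candidate, then pick the one with the lowest average MAE
results = {n: demo_score(n) for n in (10, 25, 50)}
best_n_estimators = min(results, key=results.get)

print(results)
print("Best n_estimators:", best_n_estimators)
```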

# XGBoost / Extreme Gradient Boosting
Gradient boosting is a method that goes through cycles to iteratively add models into an ensemble.

It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. (Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.)

Then, we start the cycle:

* First, we use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add the predictions from all models in the ensemble.
* These predictions are used to calculate a loss function (like mean squared error, for instance).
* Then, we use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model parameters so that adding this new model to the ensemble will reduce the loss. (Side note: The "gradient" in "gradient boosting" refers to the fact that we'll use gradient descent on the loss function to determine the parameters in this new model.)
* Finally, we add the new model to the ensemble, and ...
* ... repeat!

<img src="https://i.imgur.com/MvCGENh.png"></img>
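The cycle can be sketched by hand for squared-error loss, where the quantity each new tree is fitted to is simply the current residual. This is a deliberately simplified illustration using scikit-learn's `DecisionTreeRegressor` on synthetic data; it is not how XGBoost is implemented internally, and all names in it are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional data, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

# Initialize the ensemble with a naive model: always predict the mean
ensemble_prediction = np.full_like(y_demo, y_demo.mean())
trees = []
losses = [np.mean((y_demo - ensemble_prediction) ** 2)]

for _ in range(20):
    # For squared error, the negative gradient of the loss with respect to
    # the current predictions is (up to a constant factor) the residual
    residuals = y_demo - ensemble_prediction
    # Fit a new small model to what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    trees.append(tree)
    # Add the new model to the ensemble: predictions are summed
    ensemble_prediction += tree.predict(X_demo)
    losses.append(np.mean((y_demo - ensemble_prediction) ** 2))

print("Training MSE before boosting:", losses[0])
print("Training MSE after 20 rounds:", losses[-1])
```

The training loss can only go down in this setup, which is exactly why a separate validation set is needed to decide when to stop adding trees.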

We will import the scikit-learn API for XGBoost ```xgboost.XGBRegressor```. This allows us to build and fit a model just as we would in scikit-learn. As you'll see in the output, the ```XGBRegressor``` class has many tunable parameters -- you'll learn about those soon!

In [50]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

xgbModel1 = XGBRegressor()

xgb_pipeline1 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', xgbModel1)
                     ])


xgb_pipeline1.fit(X_train,y_train)


predictions1 = xgb_pipeline1.predict(X_valid)
mae1 = mean_absolute_error(y_valid, predictions1)
print("MAE on XGB model 1= ",mae1)

MAE on XGB model 1=  17709.19373394692


## Parameter tuning for xgb
```n_estimators``` : n_estimators specifies how many times to go through the modeling cycle.

* Too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data.
* Too high a value causes overfitting, which causes accurate predictions on training data, but inaccurate predictions on test data (which is what we care about).

Typical values range from 100-1000, though this depends a lot on the ```learning_rate``` parameter discussed below.

In [51]:
xgbModel2 = XGBRegressor(n_estimators = 500)

xgb_pipeline2 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', xgbModel2)
                     ])

xgb_pipeline2.fit(X_train,y_train)


predictions2 = xgb_pipeline2.predict(X_valid)
mae2 = mean_absolute_error(predictions2,y_valid)
print("MAE on XGB model 2= ",mae2)

MAE on XGB model 2=  17640.496816138697


<b>early_stopping_rounds</b>
```early_stopping_rounds``` offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for ```n_estimators```. It's smart to set a high value for ```n_estimators``` and then use ```early_stopping_rounds``` to find the optimal time to stop iterating.

Since random chance sometimes causes a single round where validation scores don't improve, you need to specify a number for how many rounds of straight deterioration to allow before stopping. Setting ```early_stopping_rounds=5``` is a reasonable choice. In this case, we stop after 5 straight rounds of deteriorating validation scores.

When using ```early_stopping_rounds```, you also need to set aside some data for calculating the validation scores - this is done by setting the ```eval_set``` parameter.
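The stopping rule itself can be illustrated without XGBoost. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in and applies the patience rule after the fact, by walking through the validation MAE of each boosting round. That differs from real early stopping, which halts training as it goes, and the data and variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data with a held-out validation set
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

max_rounds = 300  # the hard stop, playing the role of n_estimators
patience = 5      # playing the role of early_stopping_rounds

booster = GradientBoostingRegressor(n_estimators=max_rounds, random_state=0)
booster.fit(X_tr, y_tr)

best_mae, best_round, rounds_without_improvement = float("inf"), 0, 0
# staged_predict yields the validation predictions after 1, 2, 3, ... trees
for round_number, preds in enumerate(booster.staged_predict(X_va), start=1):
    mae = mean_absolute_error(y_va, preds)
    if mae < best_mae:
        best_mae, best_round, rounds_without_improvement = mae, round_number, 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement >= patience:
            break  # no improvement for `patience` rounds in a row

print("Stopped at round", round_number, "with best round", best_round)
print("Best validation MAE:", best_mae)
```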

In [52]:
# Another pipeline, this time attempting to add early stopping

xgbModel3 = XGBRegressor(n_estimators = 500, early_stopping_rounds = 5,
                  eval_set=[(X_valid, y_valid)], verbose = False)


xgb_pipeline3 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', xgbModel3)
                     ])

xgb_pipeline3.fit(X_train,y_train)


predictions3 = xgb_pipeline3.predict(X_valid)
mae3 = mean_absolute_error(predictions3, y_valid)
print("MAE on XGB model 3= ",mae3)

Parameters: { early_stopping_rounds, eval_set, verbose } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


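MAE on XGB model 3=  17640.496816138697

Note that this MAE is identical to that of model 2. As the warning above indicates, ```early_stopping_rounds```, ```eval_set``` and ```verbose``` were passed to the ```XGBRegressor``` constructor, where this version of XGBoost does not appear to use them, so no early stopping took place. For early stopping to take effect, ```eval_set``` has to be supplied when calling ```fit``` (through a pipeline, as ```model__eval_set```), and the validation data in it must already be preprocessed, because the pipeline only transforms the data passed as ```X```.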


#### learning_rate
Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (known as the <b>learning rate</b>) before adding them in.

This means each tree we add to the ensemble helps us less. So, we can set a higher value for ```n_estimators``` without overfitting. If we use early stopping, the appropriate number of trees will be determined automatically.

In general, <b>a small learning rate and large number of estimators </b> will yield more accurate XGBoost models, though it will also take the model longer to train since it does more iterations through the cycle. Recent versions of XGBoost default to learning_rate=0.3 (older releases of the scikit-learn wrapper used 0.1), so it is worth setting the value explicitly.
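The shrinkage effect can be seen by comparing the training error round by round for two learning rates. This sketch again uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, on synthetic data; the helper name `training_mse_per_round` is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

def training_mse_per_round(learning_rate, n_rounds=50):
    """Training MSE after each boosting round (hypothetical helper)."""
    model = GradientBoostingRegressor(n_estimators=n_rounds,
                                      learning_rate=learning_rate,
                                      random_state=0)
    model.fit(X_demo, y_demo)
    return [np.mean((y_demo - preds) ** 2) for preds in model.staged_predict(X_demo)]

fast = training_mse_per_round(learning_rate=1.0)  # each tree is added at full weight
slow = training_mse_per_round(learning_rate=0.1)  # each tree is scaled down to 10%

print("MSE after the first tree, learning_rate=1.0:", fast[0])
print("MSE after the first tree, learning_rate=0.1:", slow[0])
```

With the smaller learning rate the first tree removes less of the error, which is why more rounds are needed to reach the same fit.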

In [54]:
xgbModel4 = XGBRegressor(n_estimators = 1000, learning_rate = 0.1,early_stopping_rounds = 5,
                  eval_set=[(X_valid, y_valid)], verbose = False)


xgb_pipeline4 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', xgbModel4)
                     ])

xgb_pipeline4.fit(X_train,y_train)


predictions4 = xgb_pipeline4.predict(X_valid)
mae4 = mean_absolute_error(predictions4, y_valid)
print("MAE on XGB model 4= ",mae4)

Parameters: { early_stopping_rounds, eval_set, verbose } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


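MAE on XGB model 4=  16575.19825556507

As with model 3, the warning suggests that the early stopping arguments were not used here, so the lower MAE most likely comes from training all 1000 trees with ```learning_rate = 0.1``` rather than from early stopping.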


### n_jobs

On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It's common to set the parameter ```n_jobs``` equal to the number of cores on your machine. On smaller datasets, this won't help.

The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction. But, it's useful in large datasets where you would otherwise spend a long time waiting during the fit command.

Here's the modified example:
```python
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
             
```
<b>This will use 4 cores on the machine</b>

### CONCLUSION
XGBoost is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). With careful parameter tuning, you can train highly accurate models.

#### An alternative data preprocessing approach (without pipelines):
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice              
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
```
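The two `align` calls are what keep the one-hot columns consistent across the datasets. The sketch below shows the effect on a hypothetical pair of two-row frames (the `color` and `size` columns are assumptions): columns that exist only in the validation data are dropped, and training columns missing from it are added as NaN unless a `fill_value` is given.

```python
import pandas as pd

# Hypothetical data: "blue" appears only in training, "green" only in validation
train = pd.DataFrame({"color": ["red", "blue"], "size": [1, 2]})
valid = pd.DataFrame({"color": ["red", "green"], "size": [3, 4]})

train_encoded = pd.get_dummies(train)
valid_encoded = pd.get_dummies(valid)

# join='left' keeps exactly the columns of the left-hand (training) frame
train_encoded, valid_encoded = train_encoded.align(valid_encoded, join="left", axis=1)
print(list(valid_encoded.columns))
print(valid_encoded["color_blue"].isna().all())  # the added column holds NaN

# Optionally fill the added columns with 0 instead of NaN
_, valid_filled = train_encoded.align(pd.get_dummies(valid), join="left",
                                      axis=1, fill_value=0)
print(valid_filled)
```

XGBoost accepts NaN values in its input, which is presumably why the original code does not fill them; other models may need the `fill_value=0` variant.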