# XGBoost

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
It has been widely used in applied machine learning and in many winning [Kaggle](https://www.kaggle.com/) competition entries for structured (tabular) data.

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

### What is Boosting?
* Not a specific machine learning algorithm
* A concept (a "meta-algorithm") that can be applied to a set of machine learning models
* An ensemble meta-algorithm used to convert many weak learners into a strong learner

**Weak learner** - a learner that performs only slightly better than random guessing, e.g. a binary classifier with accuracy just above 50%

### How does boosting work?

* Iteratively learn a set of weak models on subsets of the data
* Weight each weak prediction according to that weak learner's performance
* Combine the weighted predictions to obtain a single weighted prediction

... that is much better than the individual predictions themselves!
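
The three steps above can be sketched with AdaBoost, one classic boosting algorithm. This is only an illustration in plain Python, not XGBoost's own procedure (XGBoost uses gradient boosting). It reweights the samples each round rather than drawing subsets, and the toy data and helper names (`fit_stump`, `adaboost`, `predict`) are assumptions made for this example.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (threshold, polarity, weighted_error) of the best 1-D decision stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[2]:
                best = (threshold, polarity, err)
    return best

def stump_predict(stump, x):
    threshold, polarity, _ = stump
    return polarity if x <= threshold else -polarity

def adaboost(xs, ys, rounds):
    """Train `rounds` weak learners; labels must be +1 or -1."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(xs, ys, weights)
        err = min(max(stump[2], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)     # better learner -> larger weight
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the classes
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]

ensemble = adaboost(xs, ys, rounds=3)
correct = sum(predict(ensemble, x) == y for x, y in zip(xs, ys))
print(f"{correct}/{len(xs)} training points classified correctly")
```

Each stump alone misclassifies a third of this toy set, but the weighted vote of three stumps separates it.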


**To install XGBoost**<br>
```console
foo@bar:~$ pip install xgboost
```

The datasets used can be downloaded from:
- [iris flower dataset](https://github.com/Pratham1807/Machine-Learning/blob/master/datasets/iris.csv)
- [boston housing dataset](https://github.com/Pratham1807/Machine-Learning/blob/master/datasets/boston_housing.csv)
- [ames housing dataset(processed)](https://github.com/Pratham1807/Machine-Learning/blob/master/datasets/ames_housing_trimmed_processed.csv)

## Objective Functions and Base Learners

### Objective Functions

* Quantifies how far off a prediction is from the actual result
* Measures the difference between estimated and true values for some collection of data
* Goal: Find the model that yields the minimum value of the loss function
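
As a small illustration of what such a function computes, the sketch below implements two common losses in plain Python: mean squared error for regression and log loss for binary classification. The function names and sample values are made up for this example and are not part of the XGBoost API.

```python
import math

def squared_error(y_true, y_pred):
    """Mean squared error: a common loss for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy: a common loss for binary classification."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Predictions closer to the true values give a smaller loss
print(squared_error([3.0, 5.0], [2.5, 5.5]))
print(log_loss([1, 0], [0.9, 0.1]))
print(log_loss([1, 0], [0.6, 0.4]))
```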

### Common Loss Functions and XGBoost

**Loss function names in xgboost:**

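- reg:squarederror - use for regression problems (named reg:linear in older XGBoost releases; that name is deprecated)
- reg:logistic - logistic regression; the output is a probability
- binary:logistic - use for binary classification problems; the output is the predicted probability of the positive class, which you threshold to get a decision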
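
The difference between a probability and a decision can be sketched in a few lines. A logistic objective turns a raw score (margin) into a probability with the sigmoid function, and a decision is that probability compared to a threshold. The margins below are hypothetical values and the 0.5 threshold is an assumption chosen for this example.

```python
import math

def sigmoid(margin):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

def to_decision(probability, threshold=0.5):
    """Turn a probability into a hard 0/1 decision."""
    return 1 if probability >= threshold else 0

raw_margins = [-2.0, 0.0, 3.0]  # hypothetical raw model scores
probabilities = [sigmoid(m) for m in raw_margins]
decisions = [to_decision(p) for p in probabilities]

for margin, p, d in zip(raw_margins, probabilities, decisions):
    print(f"margin={margin:+.1f}  probability={p:.3f}  decision={d}")
```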

**Base Learners and Why We Need Them?**

* XGBoost builds a meta-model composed of many individual models whose outputs are combined into a final prediction
* The individual models are called base learners
* We want base learners that, when combined, produce a non-linear final prediction
* Each base learner should be good at distinguishing or predicting a different part of the dataset

**Two kinds of base learners: tree and linear**
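
To see how simple tree base learners combine into a non-linear prediction, here is a minimal sketch of gradient boosting for squared error with one-split trees (stumps), written in plain Python. It is a simplification under stated assumptions: XGBoost itself adds regularization and other refinements, and the names `fit_stump` and `boosted_predictions`, the toy data, and the learning rate of 0.5 are all chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single split on a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boosted_predictions(xs, ys, rounds, learning_rate=0.5):
    """Fit `rounds` stumps, each on the residuals left by the previous ones."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean of the target
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]  # a non-linear target

# Training error shrinks as more stumps are added
for rounds in (0, 1, 5, 20):
    print(rounds, round(mse(ys, boosted_predictions(xs, ys, rounds)), 4))
```

Each stump is a crude step function, yet their sum approximates the curved target increasingly well, which a sum of linear models could not do.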

### XGBClassifier

```python
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
```

```python
# Load the dataset
data = pd.read_csv("iris.csv")
data.head(10)
```

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Setosa |


```python
data.tail()
```

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica |


```python
# Encode the class names as integers.
# Using .loc (instead of chained indexing) avoids pandas' SettingWithCopyWarning.
data.loc[data.variety == "Setosa", "variety"] = 0
data.loc[data.variety == "Versicolor", "variety"] = 1
data.loc[data.variety == "Virginica", "variety"] = 2
```


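```python
data.head()
```

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |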


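```python
data.tail()
```

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |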


```python
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal.length    150 non-null float64
sepal.width     150 non-null float64
petal.length    150 non-null float64
petal.width     150 non-null float64
variety         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
```


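```python
# Separate the features from the target
X = data.drop("variety", axis=1)
# The encoded column still has dtype object, so cast it to integers
y = data.variety.astype(int)
```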

```python
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
# Iris has three classes, so a multiclass objective is used instead of binary:logistic
xg_cl = xgb.XGBClassifier(objective="multi:softprob", n_estimators=10, random_state=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: the fraction of test labels predicted correctly
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy:  ", accuracy)
```

```
accuracy:   1.0
```


### XGBRegressor

**Trees as Base Learners in the scikit-learn API**

```python
from sklearn.metrics import mean_squared_error

boston_data = pd.read_csv("boston_housing.csv")

# Features are all columns except the last; the target is the last column
X, y = boston_data.iloc[:, :-1], boston_data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# reg:squarederror replaces the deprecated reg:linear; trees are the default base learner
xg_reg = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=10, random_state=123)

xg_reg.fit(X_train, y_train)

preds = xg_reg.predict(X_test)

# Root mean squared error on the test set
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % rmse)
```

```
RMSE: 173308.244360
```



**Linear Base Learner in XGBoost API**

```python
boston_data = pd.read_csv("boston_housing.csv")
X, y = boston_data.iloc[:, :-1], boston_data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# The native XGBoost API works on DMatrix objects
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)

# booster="gblinear" selects the linear base learner;
# reg:squarederror replaces the deprecated reg:linear
params = {"booster": "gblinear", "objective": "reg:squarederror"}

xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=10)

preds = xg_reg.predict(DM_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % rmse)
```

```
RMSE: 98285.815883
```


**When NOT to use XGBoost?**

* Image recognition
* Computer vision
* Natural language processing and understanding problems
* When the number of training samples is significantly smaller than the number of features

### Tuning the Model

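#### Why tune?
- Well-chosen hyperparameters usually reduce the error: in the example below, the cross-validated RMSE drops from about 33,289 (untuned) to about 31,111 (tuned)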

**UNTUNED MODEL**

```python
housing_data = pd.read_csv("ames_housing_trimmed_processed.csv")

# Features are all columns except the last; the target is the last column
X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]

housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Only the objective is set; every other parameter keeps its default
untuned_params = {"objective": "reg:squarederror"}

untuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=untuned_params, nfold=4,
                                 num_boost_round=200, metrics="rmse", as_pandas=True, seed=123)

print(type(untuned_cv_results_rmse))

# Mean test RMSE after the final boosting round
print("Untuned rmse: %f" % untuned_cv_results_rmse["test-rmse-mean"].iloc[-1])
```

```
<class 'pandas.core.frame.DataFrame'>
Untuned rmse: 33288.914551
```


**TUNED MODEL**

```python
housing_data = pd.read_csv("ames_housing_trimmed_processed.csv")

X, y = housing_data.iloc[:, :-1], housing_data.iloc[:, -1]

housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Same objective as before, plus three tuned hyperparameters
tuned_params = {"objective": "reg:squarederror", "colsample_bytree": 0.3,
                "learning_rate": 0.1, "max_depth": 5}

tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=tuned_params, nfold=4,
                               num_boost_round=200, metrics="rmse", as_pandas=True, seed=123)

# Mean test RMSE after the final boosting round
print("Tuned rmse: %f" % tuned_cv_results_rmse["test-rmse-mean"].iloc[-1])
```

```
Tuned rmse: 31111.040039
```
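
Both runs above score the model with 4-fold cross-validation (`nfold=4`). The idea can be sketched in plain Python: the data is split into folds, and each fold is held out once for testing while the remaining folds are used for training. This is a simplified illustration, not how `xgb.cv` is implemented; the helper name `kfold_indices` is hypothetical, and the sketch does no shuffling or stratification.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous (train, test) index pairs."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

folds = kfold_indices(n_samples=10, n_folds=4)
for train, test in folds:
    print("train:", train, "test:", test)
```

The reported test RMSE is then the average of the per-fold test scores, so every row is used for evaluation exactly once.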


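#### Common tunable parameters
- **learning_rate (eta):** shrinkage applied to each new tree's contribution; lower values typically need more boosting rounds
- **gamma:** minimum loss reduction required to make a further split on a leaf node
- **lambda:** L2 regularization on leaf weights
- **alpha:** L1 regularization on leaf weights
- **max_depth:** maximum depth of each tree
- **subsample:** fraction of training samples used per tree
- **colsample_bytree:** fraction of features used per tree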
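
One simple way to explore these parameters is a grid search: list a few candidate values for each and try every combination. The sketch below only builds the combinations in plain Python. The candidate values are arbitrary examples, not recommendations, and `expand_grid` is a hypothetical helper. Each resulting dictionary could then be passed as `params` to `xgb.cv`, as in the cells above, and the combination with the lowest test RMSE kept.

```python
from itertools import product

# Candidate values per parameter (illustrative only)
param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.3, 0.7],
}

def expand_grid(grid):
    """Yield one parameter dictionary per combination of candidate values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

# Add the fixed objective to every candidate
candidates = [dict(params, objective="reg:squarederror")
              for params in expand_grid(param_grid)]

print(len(candidates))  # 3 * 2 * 2 * 2 = 24 combinations
print(candidates[0])
```

The number of combinations grows multiplicatively with each added parameter, so grids are usually kept small.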