# Extreme Gradient Boosting with XGBoost (DataCamp)
## Supervised Learning
- Classification
    - Outcome can be binary or multi-class
    - Area under the ROC Curve (AUC) is the most common metric for binary classification
    - AUC: the probability that a randomly chosen positive data point is ranked higher than a randomly chosen negative data point
    - Higher AUC = the model separates the two classes better
    - With multi-class problems, look at the accuracy score and use confusion matrices
- Regression
    - Continuous value prediction
    - Metrics: Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE)
    - RMSE: take the differences between predicted and actual values, square them, compute the mean, then take the square root. Treats negative and positive differences equally, but punishes larger differences more than smaller ones
    - MAE: averages the absolute differences between predicted and actual values over all samples. Isn't as affected by big differences, but not as popular as RMSE
- Data
    - Features numeric or categorical
    - Features scaled (e.g. Z scores): essential for algorithms like SVM
    - Categorical features should be encoded (one-hot)
- Problems other than classification and regression
    - Ranking
    - Recommendation
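
The metrics above can be made concrete with a small sketch in plain Python. The function names and toy numbers are illustrative assumptions, not part of the course material; in practice scikit-learn's `roc_auc_score`, `mean_squared_error` and `mean_absolute_error` would normally be used instead.

```python
import math

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def rmse(y_true, y_pred):
    """Root of the mean of the squared differences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy classification data (made up for illustration)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", auc_by_ranking(labels, scores))  # 3 of 4 pairs ranked correctly: 0.75

# Toy regression data with one large miss (13 predicted, 9 actual)
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 7.0, 13.0]
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))  # 1.25
```

On the regression data the single large miss pulls RMSE well above MAE, which is the "punishes larger differences more" behaviour described in the list.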

## XGBoost

- Speed and performance are the reasons it has become popular
- Outperforms other single-algorithm methods: state-of-the-art results
- Has cross-validation baked in (don't have to use scikit-learn)
- Why use XGBoost
    - Large number of training samples, e.g. > 1000 with < 100 features
    - Mixture of categorical and numeric, or just numeric features
- Why not use XGBoost
    - Images, computer vision, NLP, etc do better with neural networks
    - Very few training samples e.g. < 100 samples or samples are significantly fewer than number of features

### Example: ride-sharing data
Data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called churn_data.

Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5 month mark. This is a typical setup for a churn prediction problem. To do this, you'll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.

```
avg_dist                       float64
avg_rating_by_driver           float64
avg_rating_of_driver           float64
avg_inc_price                  float64
inc_pct                        float64
weekday_pct                    float64
fancy_car_user                    bool
city_Carthag                     int64
city_Harko                       int64
phone_iPhone                     int64
first_month_cat_more_1_trip      int64
first_month_cat_no_trips         int64
month_5_still_here               int64
```

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# churn_data is assumed to be pre-loaded as a DataFrame (columns listed above)
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]

# Create the training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy (fraction of correct predictions): accuracy
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))
```

### Decision trees
- Decision trees are the base learners of XGBoost
- Base learner: individual learning algorithm in an ensemble algorithm
- XGBoost is an ensemble method that uses the outputs of many simple models to make its predictions
- Decision trees are composed of a series of binary decisions that split the data and yield a prediction
- Predictions happen at the 'leaves' of the tree
- Decision trees are constructed iteratively (one binary decision at a time) until a stopping criterion is met (e.g. the depth of the tree reaches a particular value)
- Low bias: Good at learning relationships
- High variance: Tend to overfit data you train them on and generalise poorly
- XGBoost uses a type of decision tree called CART (Classification and Regression Tree)
- CART: each leaf contains a real-valued score regardless of whether it is used for classification or regression
- CART: the real-valued score can be converted to categories for classification problems if necessary
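
To illustrate the last two points, here is a minimal sketch of a one-split, CART-style tree (a "stump") in plain Python. It assumes a single numeric feature and a squared-error split criterion. The function names and toy data are hypothetical, and this is not how XGBoost builds its trees internally.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, ys):
    """Find the threshold on a 1-D feature that minimises the squared error.

    Returns (threshold, left_score, right_score); each leaf stores a real-valued score.
    """
    best = None
    distinct = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct feature values
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    _, threshold, left_score, right_score = best
    return threshold, left_score, right_score

def predict_score(stump, x):
    threshold, left_score, right_score = stump
    return left_score if x <= threshold else right_score

# Toy data: 0/1 targets that are mostly 0 for small x and mostly 1 for large x
xs = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
ys = [0, 1, 0, 0, 1, 1, 0, 1]

stump = fit_stump(xs, ys)
print(stump)  # (7.0, 0.25, 0.75): the leaves hold scores, not class labels

# Converting the real-valued score to a category for classification
score = predict_score(stump, 12.0)
label = int(score >= 0.5)
print(score, label)  # 0.75 1
```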

### Make a decision tree
Your task in this exercise is to make a simple decision tree using scikit-learn's DecisionTreeClassifier on the breast cancer dataset that comes pre-loaded with scikit-learn.

This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant, or benign).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the features and target (pre-loaded as X and y in the original exercise)
X, y = load_breast_cancer(return_X_y=True)

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the classifier: dt_clf_4
# max_depth limits the number of successive splits before reaching a leaf node
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4 == y_test)) / y_test.shape[0]
print("accuracy:", accuracy)
```

### Boosting
- Ensemble meta-algorithm that converts many weak learners (models that perform only slightly better than chance, e.g. just above 50% accuracy on a balanced binary problem) into a strong learner
- Iteratively learning a set of weak models on subsets of the data
- Weighting each weak prediction according to each weak learner's performance
- Combine all weak learners' predictions multiplied by their weights to get a single weighted prediction
- Combined base learners create a final predictor that is non-linear
- Want base learners that are slightly better than guessing on certain subsets of training examples and uniformly bad at the remainder so that when they're combined the uniformly bad predictions cancel out and those that do slightly better than chance combine into a good prediction
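
The weighting scheme described above can be sketched with AdaBoost-style boosting of decision stumps in plain Python. Note the assumption: this is classic AdaBoost on a made-up 1-D dataset, chosen because it follows the "weight each weak learner by its performance" description closely. It is not the gradient boosting procedure that XGBoost implements, and all names are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: `polarity` (+1 or -1) is the label given to points below the threshold."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best weak learner."""
    best = None
    distinct = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    for threshold in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform sample weights
    ensemble = []                    # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # better learners get larger weights
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data with labels in {-1, +1}; no single threshold separates the classes
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single_error, _, _ = best_stump(xs, ys, [0.1] * 10)
ensemble = adaboost(xs, ys, rounds=3)
errors = sum(ensemble_predict(ensemble, x) != y for x, y in zip(xs, ys))
print("best single stump error:", single_error)
print("ensemble training errors:", errors)
```

On this toy data the best single stump still misclassifies three of the ten points, while the weighted combination of three stumps classifies all ten training points correctly.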

You'll now practice using XGBoost's learning API through its built-in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets, called a DMatrix.

In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use the `xgb.cv()` function, you first have to explicitly convert your data into a DMatrix. So, that's what you will do here before running cross-validation on churn_data.

```python
import xgboost as xgb

# X and y are the churn_data features and target from the ride-sharing example
# Create the DMatrix: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective": "reg:logistic", "max_depth": 3}

# Perform cross-validation: cv_results
# num_boost_round is the number of trees (boosting rounds) to build
# nfold is the number of cross-validation folds
# as_pandas=True returns the results as a pandas DataFrame
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5,
                    metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy (1 - error) after the final boosting round
print(1 - cv_results["test-error-mean"].iloc[-1])

# Repeat the cross-validation, measuring AUC instead of error
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5,
                    metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC after the final boosting round
print(cv_results["test-auc-mean"].iloc[-1])
```
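
As a rough picture of what `nfold` and a column such as `test-error-mean` refer to, the sketch below splits a toy label list into folds by hand and averages the per-fold error of a trivial majority-class predictor. It is only an illustration under stated assumptions: the helper names are hypothetical, there is no shuffling, and this is not XGBoost's own implementation.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def majority_class(labels):
    """Trivial 'model': always predict the most frequent training label."""
    return max(set(labels), key=labels.count)

# Toy labels (made up for illustration)
labels = [1, 0, 1, 1, 0, 1, 1, 1, 0]

fold_errors = []
for train_idx, test_idx in kfold_indices(len(labels), n_folds=3):
    guess = majority_class([labels[i] for i in train_idx])
    wrong = sum(labels[i] != guess for i in test_idx)
    fold_errors.append(wrong / len(test_idx))

# The mean over folds plays the role of a "test-error-mean" value
mean_error = sum(fold_errors) / len(fold_errors)
print(fold_errors, mean_error)
```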

### Objective (loss) functions
- Quantifies how far prediction is from actual result
- Measures the distance between estimates and true values for a collection of data
- Find the model that minimises the loss function
- Loss functions in XGBoost
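    - `reg:linear` for regression problems (squared-error loss; newer XGBoost releases name this objective `reg:squarederror`)
    - `reg:logistic`: logistic regression; the course suggests it for binary classification problems when you just want the decision, not the probability
    - `binary:logistic`: logistic regression for binary classification when you want the predicted probability rather than the decision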
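
As an illustration of what these objectives measure, the following plain-Python sketch computes a squared-error loss and a logistic (log) loss, and shows how a raw score becomes a probability and then a decision. The function names and numbers are assumptions made for this example; they are not XGBoost's internal code.

```python
import math

def squared_error(y_true, y_pred):
    """Mean squared error: the quantity a squared-error regression objective minimises."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def sigmoid(raw_score):
    """Logistic transformation: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def log_loss(y_true, probs):
    """Mean negative log-likelihood for binary labels in {0, 1}."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, probs)) / len(y_true)

# Regression: predictions further from the truth give a larger loss
print(squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

# Binary classification: raw scores -> probabilities -> decisions
raw_scores = [-2.0, 0.0, 3.0]
labels = [0, 1, 1]
probs = [sigmoid(s) for s in raw_scores]
decisions = [int(p >= 0.5) for p in probs]
print(probs)
print(decisions)  # [0, 1, 1]
print(log_loss(labels, probs))
```

Thresholding the probability at 0.5 is what turns "the probability" into "the decision"; the loss itself is computed on the probabilities.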
    
## Regression with XGBoost

### Decision trees as base learners
It's now time to build an XGBoost model to predict house prices - not in Boston, Massachusetts, as you saw in the video, but in Ames, Iowa! This dataset of housing prices has been pre-loaded into a DataFrame called df. If you explore it in the Shell, you'll see that there are a variety of features about the house and its location in the city.

In this exercise, your goal is to use trees as base learners. By default, XGBoost uses trees as base learners, so you don't have to specify that you want to use trees here with booster="gbtree".

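```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X and y are assumed to hold the features and target taken from the pre-loaded df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBRegressor: xg_reg
# 'reg:squarederror' is the current name of the objective older releases call 'reg:linear'
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, seed=123)

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the target values of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
```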

### Linear base learners
Now that you've used trees as base models in XGBoost, let's use the other kind of base model that can be used with XGBoost - a linear learner. This model, although not as commonly used in XGBoost, allows you to create a regularized linear regression using XGBoost's powerful learning API. However, because it's uncommon, you have to use XGBoost's own non-scikit-learn compatible functions to build the model, such as xgb.train().

In order to do this you must create the parameter dictionary that describes the kind of booster you want to use (similarly to how you created the dictionary in Chapter 1 when you used xgb.cv()). The key-value pair that defines the booster type (base model) you need is "booster":"gblinear".

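Once you've trained the model with xgb.train(), you get back a Booster object rather than a scikit-learn estimator: there is no .fit() step, and you call its .predict() method on a DMatrix, much as you've used .predict() in the past.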