## Classification tree

A classification tree predicts a class label y from a set of features X by learning a sequence of if-else questions about individual features. Its complexity can be limited with the max_depth parameter, which caps how deep the tree may grow.

The class to use is DecisionTreeClassifier.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split dataset into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

# Fit dt to the training set
dt.fit(X_train,y_train) 

# Predict test set labels
y_pred = dt.predict(X_test)

# Evaluate test-set accuracy
accuracy_score(y_test, y_pred)
```

The criterion parameter sets the impurity measure used to choose splits, for example 'gini' or 'entropy'. 'gini' is the default if nothing is specified.
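To see how the criterion choice plays out, here is a minimal sketch that fits one tree per criterion and compares test-set accuracy. The data comes from make_classification, which is an assumption made only so the snippet runs on its own; on real data the two criteria often give similar results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; any labelled X, y works)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Fit one tree per impurity criterion and record its test-set accuracy
scores = {}
for criterion in ['gini', 'entropy']:
    dt = DecisionTreeClassifier(max_depth=2, criterion=criterion, random_state=1)
    dt.fit(X_train, y_train)
    scores[criterion] = accuracy_score(y_test, dt.predict(X_test))

for criterion, acc in scores.items():
    print('{:s} : {:.3f}'.format(criterion, acc))
```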

## Regression tree in scikit-learn

A regression tree predicts a continuous value y for given features X.

The class to use is DecisionTreeRegressor. The min_samples_leaf parameter sets a minimum size for the leaves: an integer is read as a number of samples, and a float (such as 0.1) as a fraction of the training samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# Split data into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

# Instantiate a DecisionTreeRegressor 'dt'
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)

# Fit 'dt' to the training-set
dt.fit(X_train, y_train)

# Predict test-set labels
y_pred = dt.predict(X_test)

# Compute test-set MSE
mse_dt = MSE(y_test, y_pred)
# Compute test-set RMSE
rmse_dt = np.sqrt(mse_dt)
```
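The two ways of specifying min_samples_leaf can be checked directly. The sketch below is only an illustration: the noisy sine data and the helper smallest_leaf are made up for this example and are not part of scikit-learn. It reads the fitted tree's node sizes to confirm that no leaf falls below the requested minimum.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): a noisy sine curve, 200 samples
rng = np.random.RandomState(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Float: each leaf must hold at least 10% of the 200 training samples
dt_frac = DecisionTreeRegressor(min_samples_leaf=0.1, random_state=3).fit(X, y)
# Integer: each leaf must hold at least 20 samples
dt_int = DecisionTreeRegressor(min_samples_leaf=20, random_state=3).fit(X, y)

def smallest_leaf(tree):
    """Hypothetical helper: number of training samples in the smallest leaf."""
    t = tree.tree_
    is_leaf = t.children_left == -1   # leaves have no left child
    return t.n_node_samples[is_leaf].min()

print(smallest_leaf(dt_frac), smallest_leaf(dt_int))
```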

## K-fold cross validation (10-fold)

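K-fold cross-validation splits the training set into k parts (folds); here k = 10. The model is trained on 9 of the folds and evaluated on the remaining fold, and this is repeated 10 times so that each fold serves as the validation fold exactly once. The mean of the 10 resulting errors is the CV error (CVE).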

If the CVE is bigger than the error on the training set, then there is high variance and the model is overfit. It needs to be made less complex.

If the CVE is about the same as the error on the training set, and it is higher than the desired error, then the model has high bias and is underfit. It needs to be made more complex.

```python
from sklearn.model_selection import cross_val_score

# Compute the array of MSEs obtained by 10-fold CV
# Set n_jobs to -1 in order to exploit all CPU cores in computation
MSE_CV = -cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)
```
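The comparison between CV error and training error described above can be sketched end to end as follows. The synthetic data and the choice of an unconstrained tree (picked because it tends to overfit) are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

# An unconstrained tree, chosen here because it tends to overfit
dt = DecisionTreeRegressor(random_state=3)

# 10-fold CV error: scores are negated MSEs, so flip the sign before averaging
MSE_CV = -cross_val_score(dt, X_train, y_train, cv=10,
                          scoring='neg_mean_squared_error')
rmse_cv = np.sqrt(MSE_CV.mean())

# Training error: fit on the full training set and predict that same set
dt.fit(X_train, y_train)
rmse_train = np.sqrt(MSE(y_train, dt.predict(X_train)))

print('CV RMSE: {:.2f}'.format(rmse_cv))
print('Train RMSE: {:.2f}'.format(rmse_train))
```

A CV error well above the training error points to high variance; limiting max_depth or raising min_samples_leaf would be the next thing to try.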

## Ensemble learning

Ensemble learning trains several different models on the same training set, for example LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier, and combines their predictions through voting: the class with the most votes wins. **Majority voting applies only to classification problems.**

```python
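from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Set seed for reproducibility
SEED = 1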
# Instantiate lr
lr = LogisticRegression(random_state=SEED)
# Instantiate knn
knn = KNeighborsClassifier(n_neighbors=27)
# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)
```

Place these in a list as tuples:
```python
# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]
```

To check the test-set accuracy of each individual model, you can use the following:
```python
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:     
    # Fit clf to the training set
    clf.fit(X_train, y_train)       
    # Predict y_pred
    y_pred = clf.predict(X_test)    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)    
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))
```

Finally, we instantiate a VotingClassifier, pass it the list of models created earlier, and fit it. Its predictions are often more accurate than those of the individual models.
```python
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))
```
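To make the voting step concrete, here is a small sketch of majority voting done by hand with NumPy. The prediction matrix and the majority_vote helper are hypothetical and only illustrate the idea; they are not how VotingClassifier is implemented. In this sketch, ties go to the smallest label.

```python
import numpy as np

# Hypothetical class predictions from three classifiers for five samples
predictions = np.array([
    [0, 1, 1, 0, 1],   # e.g. logistic regression
    [0, 1, 0, 0, 1],   # e.g. k-nearest neighbours
    [1, 1, 1, 0, 0],   # e.g. classification tree
])

def majority_vote(preds):
    """Return, per column (sample), the label predicted most often."""
    voted = []
    for sample_preds in preds.T:
        labels, counts = np.unique(sample_preds, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

y_vote = majority_vote(predictions)
print(y_vote)
```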

## Bagging or bootstrap aggregation

Bagging is an ensemble method that trains the same algorithm many times on different subsets of the training data. Each subset is a bootstrap sample, drawn with replacement, so every individual model sees only part of the training data.

The BaggingClassifier constructor takes a parameter n_estimators, which specifies the number of models to train, each on its own bootstrap sample.

```python
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier and accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)

# Instantiate bc; dt is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
bc = BaggingClassifier(dt, n_estimators=50, random_state=1)

# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate acc_test
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test)) 
```

Bagging works for both classification and regression.

For classification it aggregates predictions by majority voting. The class is BaggingClassifier in scikit-learn.

For regression it aggregates predictions through averaging. The class is BaggingRegressor in scikit-learn.

## Out Of Bag (OOB)

Because each bootstrap sample is drawn with replacement, on average only about 63% of the training instances appear in it. The remaining 37% or so are out-of-bag (OOB) for that model and can be used as a validation set, without the need for cross-validation. Each model is evaluated on its own OOB instances, and the results are combined into the final OOB score.

To compute the OOB score, set oob_score to True when constructing the BaggingClassifier.
```python
# Instantiate bc (dt is passed positionally; its keyword name differs between scikit-learn versions)
bc = BaggingClassifier(dt, n_estimators=50, oob_score=True, random_state=1)
```
After fitting, read the OOB score from the oob_score_ attribute of the BaggingClassifier and compare it with the accuracy_score on the test set:
```python
acc_oob = bc.oob_score_
```
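The 63% / 37% split follows from sampling with replacement: the chance that a given instance is picked at least once in n draws is 1 - (1 - 1/n)^n, which approaches 1 - 1/e for large n. A quick simulation, assuming a hypothetical training set of 10,000 instances, shows the same numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10_000   # hypothetical training-set size

# Draw one bootstrap sample: n_samples indices, with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct training instances that made it into the sample
in_bag = np.unique(bootstrap_idx).size / n_samples
oob = 1.0 - in_bag

# Theoretical in-bag fraction: 1 - (1 - 1/n)^n, which tends to 1 - 1/e
expected = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples

print('in-bag: {:.3f}, OOB: {:.3f}, expected in-bag: {:.3f}'.format(in_bag, oob, expected))
```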

## Random forests

Random forests work for both classification and regression.

For classification it aggregates predictions by majority voting. The class is RandomForestClassifier in scikit-learn.

For regression it aggregates predictions through averaging. The class is RandomForestRegressor in scikit-learn.

n_estimators is the number of trees in the forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE

# Instantiate rf
rf = RandomForestRegressor(n_estimators=25, random_state=2)
            
# Fit rf to the training set    
rf.fit(X_train, y_train)

# Predict the test set labels
y_pred = rf.predict(X_test)

# Evaluate the test set RMSE
rmse_test = np.sqrt(MSE(y_test, y_pred))

# Print rmse_test
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
```

To visualise the importance of the individual features, use the feature_importances_ attribute of the fitted RandomForestRegressor.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Create a pd.Series of feature importances (X_train must be a DataFrame)
importances = pd.Series(data=rf.feature_importances_, index=X_train.columns)

# Sort importances
importances_sorted = importances.sort_values()

# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightgreen')
plt.title('Feature Importances')
plt.show()
```

## AdaBoost

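AdaBoost (adaptive boosting) trains its predictors sequentially; each predictor pays more attention to the instances its predecessor got wrong. It works for both classification and regression.

For classification it aggregates predictions by weighted majority voting. The class is AdaBoostClassifier in scikit-learn.

For regression it aggregates predictions by a weighted combination of the individual predictions (a weighted average in the general description; scikit-learn's implementation uses the weighted median). The class is AdaBoostRegressor in scikit-learn.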

```python
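from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

# Instantiate a classification-tree 'dt'
dt = DecisionTreeClassifier(max_depth=1, random_state=SEED)

# Instantiate an AdaBoost classifier 'adb_clf'; dt is passed positionally
# because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
adb_clf = AdaBoostClassifier(dt, n_estimators=100)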
adb_clf.fit(X_train, y_train)

# Predict the test set probabilities of positive class
y_pred_proba = adb_clf.predict_proba(X_test)[:,1]
    
# Evaluate test-set roc_auc_score
adb_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
```
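As a rough illustration of what boosting adds, the sketch below compares a single depth-1 tree (a stump) with an AdaBoost ensemble of 100 such stumps. The synthetic data is an assumption, and how large the gap is will depend on the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# A single stump as the baseline
stump = DecisionTreeClassifier(max_depth=1, random_state=1)
stump.fit(X_train, y_train)
stump_auc = roc_auc_score(y_test, stump.predict_proba(X_test)[:, 1])

# 100 boosted stumps; the base estimator is passed positionally
adb_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, random_state=1),
                             n_estimators=100, random_state=1)
adb_clf.fit(X_train, y_train)
adb_auc = roc_auc_score(y_test, adb_clf.predict_proba(X_test)[:, 1])

print('Stump ROC AUC: {:.3f}'.format(stump_auc))
print('AdaBoost ROC AUC: {:.3f}'.format(adb_auc))
```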

## Gradient boosting
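Gradient boosting works for both regression and classification.

For regression, the class is GradientBoostingRegressor in scikit-learn.

For classification, the class is GradientBoostingClassifier in scikit-learn.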

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as MSE

gb = GradientBoostingRegressor(max_depth=4, n_estimators=200, random_state=2)
gb.fit(X_train, y_train)

# Predict test set labels
y_pred = gb.predict(X_test)

# Compute RMSE
rmse_test = np.sqrt(MSE(y_test, y_pred))

# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))
``` 

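For stochastic gradient boosting, each tree is trained on a random subset of the training rows (subsample=0.9) and considers only a random subset of the features at each split (max_features=0.75):
```python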
# Instantiate sgbr
sgbr = GradientBoostingRegressor(max_depth=4, subsample=0.9, max_features=0.75, n_estimators=200, random_state=2)
```
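A minimal side-by-side run of the plain and the stochastic variant might look like this. The synthetic regression data is an assumption; which variant scores better depends on the data, so no particular outcome is implied.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

models = {
    # Plain gradient boosting: every tree sees all rows and all features
    'gb': GradientBoostingRegressor(max_depth=4, n_estimators=200, random_state=2),
    # Stochastic variant: 90% of the rows per tree, 75% of the features per split
    'sgbr': GradientBoostingRegressor(max_depth=4, subsample=0.9, max_features=0.75,
                                      n_estimators=200, random_state=2),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse[name] = np.sqrt(MSE(y_test, model.predict(X_test)))
    print('Test set RMSE of {:s}: {:.3f}'.format(name, rmse[name]))
```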

## Hyperparameter tuning for a Tree

model.get_params() returns all the hyperparameters of a model.
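As a quick illustration, get_params() returns a dictionary keyed by parameter name, and set_params() changes values on an existing estimator. The specific values below are arbitrary.

```python
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=3, random_state=1)

# Dictionary of every hyperparameter and its current value
params = dt.get_params()
print(params['max_depth'])          # value set in the constructor
print(params['min_samples_leaf'])   # untouched parameters keep their defaults

# set_params updates hyperparameters in place and returns the estimator
dt.set_params(max_depth=4)
print(dt.get_params()['max_depth'])
```

The keys of this dictionary are the names that can be used in the parameter grid for a grid search.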

```python
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define params_dt
params_dt = {"max_depth":[2,3,4], "min_samples_leaf":[0.12, 0.14, 0.16, 0.18]}

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring='roc_auc', cv=5, n_jobs=-1)

grid_dt.fit(X_train, y_train)

# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

```
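After the search it is often useful to look at which combination won and how it scored in cross-validation. The following self-contained sketch reuses the same grid on synthetic data (an assumption) and reads the best_params_, best_score_, and cv_results_ attributes of the fitted GridSearchCV.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

params_dt = {'max_depth': [2, 3, 4], 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]}
grid_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=1),
                       param_grid=params_dt, scoring='roc_auc', cv=5)
grid_dt.fit(X_train, y_train)

# Winning hyperparameter combination and its mean cross-validated ROC AUC
print(grid_dt.best_params_)
print('Best CV ROC AUC: {:.3f}'.format(grid_dt.best_score_))

# 3 depths x 4 leaf sizes = 12 candidate combinations were evaluated
print(len(grid_dt.cv_results_['params']))

# With refit=True (the default), score() uses the refit best model and the roc_auc scorer
print('Test ROC AUC: {:.3f}'.format(grid_dt.score(X_test, y_test)))
```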


## Hyperparameter tuning for a RandomForest
```python
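# Import GridSearchCV, NumPy and the MSE metric
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error as MSE

# Define params_rf ('auto' is no longer accepted by recent scikit-learn
# releases; 1.0, i.e. all features, is its equivalent for regressors)
params_rf = {'n_estimators': [100, 350, 500],
             'min_samples_leaf': [2, 10, 30],
             'max_features': ['log2', 1.0, 'sqrt']}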

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

grid_rf.fit(X_train, y_train)

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = np.sqrt(MSE(y_test, y_pred))

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 
``` 