## Boosting

Methods:
* Sequential Fitting
* Aggregation
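
To make the two ideas above concrete, here is a minimal hand-rolled sketch of boosting for regression. It is an illustration under stated assumptions, not a reference implementation: the synthetic sine data, the `learning_rate` of 0.5, the depth-1 trees, and the 20 rounds are all arbitrary choices made for the example. Each shallow tree is fitted sequentially to the residual left by the models before it, and the ensemble aggregates them as a weighted sum.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only, not the census data)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5                      # assumed value for the sketch
prediction = np.full_like(y, y.mean())   # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)

stumps = []
for _ in range(20):
    # Sequential fitting: each stump targets what the ensemble still gets wrong
    residual = y - prediction
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    # Aggregation: add the new stump's (shrunken) prediction to the running sum
    prediction += learning_rate * stump.predict(X)
    stumps.append(stump)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before: {initial_mse:.4f}, after: {final_mse:.4f}")
```

Unlike bagging, the models here cannot be trained in parallel, because each one depends on the errors of the ones before it.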

[Dataset](https://archive.ics.uci.edu/dataset/20/census+income)

## Bagging - Bootstrap AGGregatING

Bagging uses weak learners as base models: models that are `complex` and tend to suffer from `high variance`.

Bagging combines bootstrapping and aggregation, and it can be used for both `classification` and `regression` problems.

Bagging is a parallel learning technique: each base model is trained independently of the others, on its own bootstrapped sample of the training data. In some variants, such as random forests, each base model also uses only a `subset` of the original features. This makes the base models diverse, which often leads to a very strong ensemble model when they are aggregated.

The base models are typically `decision trees` that are relatively large and overfit the bootstrapped subset of data given to each of them.

Steps:
1. Train each base model on its own bootstrapped sample of the data.
2. Once every base model is trained, combine them with a simple aggregation technique: a majority vote for classification problems and averaging for regression problems.
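
The bootstrap-then-aggregate procedure described above can be sketched by hand. This is only an illustration: the synthetic data from `make_classification`, the 25 trees, and all variable names are assumptions made for the example. An odd number of trees is used so that a binary majority vote can never tie.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only, not the census data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so the 0/1 vote cannot tie
models = []
for _ in range(n_models):
    # Bootstrap: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    model = DecisionTreeClassifier(random_state=0)  # deep, unpruned tree
    model.fit(X_train[idx], y_train[idx])
    models.append(model)

# Aggregate: every tree votes and the majority class wins
votes = np.stack([m.predict(X_test) for m in models])  # (n_models, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

bagged_accuracy = (bagged_pred == y_test).mean()
print(f"Bagged accuracy: {bagged_accuracy:.3f}")
```

For a regression problem, the vote would be replaced by `votes.mean(axis=0)` over the trees' numeric predictions.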

A common implementation of a bagging algorithm that uses decision trees as its base model is the `Random Forest`.
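
As a rough comparison, the sketch below fits scikit-learn's `BaggingClassifier` (plain bagging of decision trees) next to `RandomForestClassifier` on synthetic data. The data, the 50 estimators, and `max_features="sqrt"` are assumptions chosen for illustration. The base tree is passed positionally because the keyword name for it differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Plain bagging: bootstrap the rows, every tree may split on any feature
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Random forest: bootstrap the rows and also consider only a random
# subset of the features at each split
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"Bagging: {bagging_acc:.3f}  Random forest: {forest_acc:.3f}")
```

Which of the two scores higher depends on the data; the point of the sketch is only the difference in how the trees are randomized.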

### Import Libraries

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, RandomForestRegressor
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

#### Evaluating Data

The dataset contains 14 predictor variables. Here's a brief description of them.

* `age`	Integer
* `workclass`	Categorical     Missing values
    1. Private
    2. Self-emp-not-inc
    3. Self-emp-inc
    4. Federal-gov
    5. Local-gov
    6. State-gov
    7. Without-pay
    8. Never-worked
* `fnlwgt`	Integer
* `education`	Categorical
    1. Bachelors
    2. Some-college
    3. 11th
    4. HS-grad
    5. Prof-school
    6. Assoc-acdm
    7. Assoc-voc
    8. 9th
    9. 7th-8th
    10. 12th
    11. Masters
    12. 1st-4th
    13. 10th
    14. Doctorate
    15. 5th-6th
    16. Preschool
* `education-num`	Integer
* `marital-status`	Categorical
    1. Married-civ-spouse
    2. Divorced
    3. Never-married
    4. Separated
    5. Widowed
    6. Married-spouse-absent
    7. Married-AF-spouse
* `occupation`	Categorical     Missing values
    1. Tech-support
    2. Craft-repair
    3. Other-service
    4. Sales
    5. Exec-managerial
    6. Prof-specialty
    7. Handlers-cleaners
    8. Machine-op-inspct
    9. Adm-clerical
    10. Farming-fishing
    11. Transport-moving
    12. Priv-house-serv
    13. Protective-serv
    14. Armed-Forces
* `relationship`	Categorical
    1. Wife
    2. Own-child
    3. Husband
    4. Not-in-family
    5. Other-relative
    6. Unmarried
* `race`	Categorical
    1. White
    2. Asian-Pac-Islander
    3. Amer-Indian-Eskimo
    4. Other
    5. Black
* `sex`	Binary
    1. Female
    2. Male
* `capital-gain`	Integer	
* `capital-loss`	Integer	
* `hours-per-week`	Integer	
* `native-country`	Categorical     Missing values
    1. United-States
    2. Cambodia
    3. England
    4. Puerto-Rico
    5. Canada
    6. Germany
    7. Outlying-US(Guam-USVI-etc)
    8. India
    9. Japan
    10. Greece
    11. South
    12. China
    13. Cuba
    14. Iran
    15. Honduras
    16. Philippines
    17. Italy
    18. Poland
    19. Jamaica
    20. Vietnam
    21. Mexico
    22. Portugal
    23. Ireland
    24. France
    25. Dominican-Republic
    26. Laos
    27. Ecuador
    28. Taiwan
    29. Haiti
    30. Columbia
    31. Hungary
    32. Guatemala
    33. Nicaragua
    34. Scotland
    35. Thailand
    36. Yugoslavia
    37. El-Salvador
    38. Trinadad&Tobago
    39. Peru
    40. Hong
    41. Holand-Netherlands


The outcome variable, `income`, is Boolean: `>50K` (True, 1) or `<=50K` (False, 0).

Missing values: Yes

Number of instances: 48842
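
One possible way to surface the missing values while loading is sketched below. It assumes that missing entries are encoded as `?` and that fields are separated by a comma followed by a space; both are assumptions about the raw file that should be checked against it. The two rows are made up and abbreviated to five columns for the example.

```python
import io
import pandas as pd

# Two made-up, abbreviated rows in a comma-plus-space layout
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, <=50K\n"
    "54, ?, 180211, Some-college, >50K\n"
)
cols = ["age", "workclass", "fnlwgt", "education", "income"]

# skipinitialspace drops the blank after each comma;
# na_values turns the "?" placeholder into NaN
sample_df = pd.read_csv(
    sample, header=None, names=cols, skipinitialspace=True, na_values="?"
)

# Count missing entries per column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

With the placeholder converted to NaN, `isna()`, `dropna()`, and `fillna()` can then be used to inspect or handle the affected columns.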

### Read and Convert Datasets

In [4]:
# 1. Read the CSV files into DataFrames
col_names = ['age', 'workclass', 'fnlwgt','education', 'education-num', 
'marital-status', 'occupation', 'relationship', 'race', 'sex',
'capital-gain','capital-loss', 'hours-per-week','native-country', 'income']
df = pd.read_csv('adult.data', header=None, names = col_names)

print(df.head())

   age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0              40   United-States   <=50

### Scale/Transform Data

Use `StandardScaler().fit()` to fit the scaler on the feature variables, then use `transform()` to obtain the scaled `X` that is passed to the model.
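
A minimal sketch of that fit-then-transform pattern follows. The tiny arrays and their names (`X_train_demo`, `X_test_demo`) are made up for illustration. The scaler is fitted on the training rows only and then reused for the test rows, so that no test-set statistics leak into the transformation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric features: two columns, three training rows, one test row
X_train_demo = np.array([[25, 0], [40, 2000], [55, 4000]], dtype=float)
X_test_demo = np.array([[40, 1000]], dtype=float)

# fit() learns each column's mean and standard deviation from the training data
scaler = StandardScaler().fit(X_train_demo)

# transform() applies (x - mean) / std with those learned statistics
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Tree-based models such as decision trees and random forests split on thresholds, so they do not depend on feature scale; scaling matters more for models like `LogisticRegression`.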

In [9]:
# 5. Create a Decision Tree model
# Named tree_model so it does not shadow the imported sklearn `tree` module
tree_model = DecisionTreeClassifier()

In [None]:
#Distribution of income
print(df.income.value_counts(normalize=True))

#Clean columns by stripping extra whitespace for columns of type "object"
for c in df.select_dtypes(include=['object']).columns:
    df[c] = df[c].str.strip()
    

feature_cols = ['age',
       'capital-gain', 'capital-loss', 'hours-per-week', 'sex','race']
#Create feature dataframe X with feature columns and dummy variables for categorical features
X = pd.get_dummies(df[feature_cols], drop_first=True)
#Create output variable y which is binary: 0 when income is at most 50K, 1 when it is greater than 50K
y = np.where(df.income=='<=50K', 0, 1)

#Split data into a train and test set
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=1, test_size=.2)

#Instantiate random forest classifier, fit and score with default parameters
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
rf.score(x_test, y_test)
print(f'Accuracy score for default random forest: {round(rf.score(x_test, y_test)*100,3)}%')

#Tune the hyperparameter max_depth over a range from 1-25, save scores for test and train set
np.random.seed(0)
accuracy_train=[]
accuracy_test = []
depths = range(1,26)
for i in depths:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(x_train, y_train)
    y_pred = rf.predict(x_test)
    accuracy_test.append(accuracy_score(y_test, y_pred))
    accuracy_train.append(accuracy_score(y_train, rf.predict(x_train)))
    
#Find the best accuracy and at what depth that occurs
best_acc= np.max(accuracy_test)
best_depth = depths[np.argmax(accuracy_test)]
print(f'The highest accuracy on the test set is achieved at depth: {best_depth}')
print(f'The highest accuracy on the test set is: {round(best_acc*100,3)}%')

#Plot the accuracy scores for the test and train set over the range of depth values  
plt.plot(depths, accuracy_test,'bo--',depths, accuracy_train,'r*:')
plt.legend(['test accuracy', 'train accuracy'])
plt.xlabel('max depth')
plt.ylabel('accuracy')
plt.show()

#Save the best random forest model and save the feature importances in a dataframe
best_rf = RandomForestClassifier(max_depth=best_depth)
best_rf.fit(x_train, y_train)
feature_imp_df = pd.DataFrame(zip(x_train.columns, best_rf.feature_importances_),  columns=['feature', 'importance'])
print('Top 5 random forest features:')
print(feature_imp_df.sort_values('importance', ascending=False).iloc[0:5])


#Create a new feature, education_bin, by binning education-num into three levels
df['education_bin'] = pd.cut(df['education-num'], [0,9,13,16], labels=['HS or less', 'College to Bachelors', 'Masters or more'])

feature_cols = ['age',
       'capital-gain', 'capital-loss', 'hours-per-week', 'sex', 'race','education_bin']
#Use this additional feature and recreate X and the train/test split
X = pd.get_dummies(df[feature_cols], drop_first=True)

x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=1, test_size=.2)

#Find the best max depth now with the additional feature
np.random.seed(0)
accuracy_train=[]
accuracy_test = []
depths = range(1,10)
for i in depths:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(x_train, y_train)
    y_pred = rf.predict(x_test)
    accuracy_test.append(accuracy_score(y_test, y_pred))
    accuracy_train.append(accuracy_score(y_train, rf.predict(x_train)))
    
best_acc= np.max(accuracy_test)
best_depth = depths[np.argmax(accuracy_test)]
print(f'The highest accuracy on the test set is achieved at depth: {best_depth}')
print(f'The highest accuracy on the test set is: {round(best_acc*100,3)}%')

plt.figure(2)
plt.plot(depths, accuracy_test,'bo--',depths, accuracy_train,'r*:')
plt.legend(['test accuracy', 'train accuracy'])
plt.xlabel('max depth')
plt.ylabel('accuracy')
plt.show()

#Save the best model and print the top 5 features with the new feature set
best_rf = RandomForestClassifier(max_depth=best_depth)
best_rf.fit(x_train, y_train)
feature_imp_df = pd.DataFrame(zip(x_train.columns, best_rf.feature_importances_),  columns=['feature', 'importance'])
print('Top 5 random forest features:')
print(feature_imp_df.sort_values('importance', ascending=False).iloc[0:5])
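
The loops above choose `max_depth` by looking at test-set accuracy, which tends to make the reported test score optimistic. One alternative, sketched here under stated assumptions, is to select the depth by cross-validation on the training set and touch the test set only once at the end. The synthetic data, the depth grid, the 30 trees, and the 3 folds are all arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (illustrative only, not the census data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Try each candidate depth with 3-fold cross-validation on the training set
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_tr, y_tr)

cv_best_depth = search.best_params_["max_depth"]
# The best model is refit on the full training set; score it once on held-out data
held_out_acc = search.score(X_te, y_te)
print(f"Depth chosen by CV: {cv_best_depth}, held-out accuracy: {held_out_acc:.3f}")
```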