# Lesson 3 Assignment

In this lab assignment, you will implement a simplified version of a Random Forest classifier and practice using and fine-tuning Random Forest, Extra Trees, and Gradient Boosted Trees. You will then compare the performance of these classifiers on the Internet Advertisements dataset.

In [1]:
# import packages
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

# make this notebook's output stable across runs
np.random.seed(0)

## Data Set Information:

This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement ("ad") or not ("nonad"). Additional information can be found [here](https://archive.ics.uci.edu/ml/datasets/internet%2Badvertisements).

## Attribute Information:

The dataset has 3 continuous (height, width, aratio) and 1555 binary (urls, tags, captions) features.

## Source:

Creator & donor: Nicholas Kushmerick <nick '@' ucd.ie>

In [2]:
# Load the data, skipping any whitespace that follows a delimiter
internetAd = pd.read_csv('internet_Ad_Data.csv', skipinitialspace=True)
#internetAd = internetAd.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
print(internetAd.info())
internetAd.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3279 entries, 0 to 3278
Columns: 1559 entries, height to Target
dtypes: int64(1554), object(5)
memory usage: 39.0+ MB
None




Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,...,caption*home,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you,Target
0,125,125,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
1,57,468,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
2,33,230,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
3,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
4,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
5,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
6,59,460,7.7966,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
7,60,234,3.9,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
8,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
9,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.


Question 1: Prepare the data and impute missing values with the median (missing values in this dataset are marked with "?"; the target labels are "nonad." and "ad.")

In [3]:
# convert the last column to 0 and 1 for binary classification
internetAd['Target'] = internetAd['Target'].map({'nonad.': 0, 'ad.': 1})
# replace '?' with NaN and convert the object columns to numeric
# (taking the median of object columns may raise an error in recent pandas versions)
internetAd = internetAd.replace('?', np.nan).apply(pd.to_numeric, errors='coerce')
# fill each missing value with the median of its column
internetAd = internetAd.fillna(internetAd.median())
internetAd.head(20)

Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,...,caption*home,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you,Target
0,125.0,125.0,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,57.0,468.0,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,33.0,230.0,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
6,59.0,460.0,7.7966,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
7,60.0,234.0,3.9,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
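
The imputation step above can be illustrated on a tiny, hypothetical frame. This is only a sketch: the column values are made up, and `errors="coerce"` is used as a shortcut that turns the "?" marker into NaN while converting the strings to numbers.

```python
import pandas as pd

# Hypothetical toy frame: "?" marks a missing value, as in the ad dataset.
toy = pd.DataFrame({
    "height": ["125", "?", "33", "60"],
    "width": ["125", "468", "?", "468"],
})

# Convert the string columns to numbers; anything unparseable ("?") becomes NaN.
toy = toy.apply(pd.to_numeric, errors="coerce")

# Fill each column's NaN entries with that column's median.
toy_imputed = toy.fillna(toy.median())
print(toy_imputed)
```

Here the missing height is filled with 60 (the median of 125, 33, 60) and the missing width with 468 (the median of 125, 468, 468).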


Question 2: Split dataset into training and test set

In [4]:
from sklearn.model_selection import train_test_split

# let X be the features (the first 1558 columns) and y the target values
X = internetAd.iloc[:, 0:1558]
y = internetAd['Target']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Question 3: Train and evaluate a Random Forest classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [5]:
parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

dtc_grid = GridSearchCV(RandomForestClassifier(), param_grid=parameters, cv=3, n_jobs=-1)
dtc_grid.fit(X_train, y_train)
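
As a rough sketch of what the grid search does, the example below runs the same parameter grid on synthetic data and inspects the fitted search object. The data from `make_classification`, the names `X_demo`, `y_demo`, and `demo_grid`, and the fixed `random_state` and `n_estimators` values are assumptions made for illustration; they are not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the ad data (an assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {
    "max_depth": [2, 4],
    # Float values are read as a fraction of the training samples.
    "min_samples_split": [0.05, 0.1, 0.2],
}

# random_state is fixed only so that this sketch is repeatable.
demo_grid = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid=param_grid,
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

n_candidates = len(demo_grid.cv_results_["params"])
print(demo_grid.best_params_)  # best parameter combination found
print(demo_grid.best_score_)   # its mean cross-validated accuracy
print(n_candidates)            # 2 * 3 = 6 combinations were tried
```

After fitting, calling `predict` or `predict_proba` on the search object uses the best estimator, refit on the full training data (the default `refit=True`).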

In [6]:
# make predictions with the trained random forest
test_z = dtc_grid.predict(X_test)
test_z_prob = dtc_grid.predict_proba(X_test)[:,1]

# evaluate the model performance - ACCURACY AND ROC AUC
print("RFC Accuracy: ", accuracy_score(y_test, test_z))
print("RFC ROC AUC: ", roc_auc_score(y_test, test_z_prob))

RFC Accuracy:  0.8984302862419206
RFC ROC AUC:  0.9504745966582119
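
The two metrics above consume different inputs: accuracy scores hard class predictions, while ROC AUC scores the ranking of predicted probabilities. A minimal sketch on four hypothetical samples (the labels and probabilities are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical P(class = 1)

# Accuracy needs hard labels, so threshold the probabilities at 0.5.
y_pred = (y_prob >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)

# ROC AUC uses the probabilities directly: it is the fraction of
# (positive, negative) pairs in which the positive gets the higher score.
auc = roc_auc_score(y_true, y_prob)
print(acc, auc)
```

Both values come out to 0.75 here: three of the four thresholded predictions are correct, and three of the four positive/negative pairs are ranked correctly.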


Question 4: Train and evaluate an Extra Trees classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [7]:
parameters = {
  "max_depth": [2, 4],
  "min_samples_split": [0.05, 0.1, 0.2]
}

etc_grid = GridSearchCV(ExtraTreesClassifier(), param_grid=parameters, cv=3, n_jobs=-1)
etc_grid.fit(X_train, y_train)

In [8]:
# make predictions with the trained extra trees classifier
test_z = etc_grid.predict(X_test)
test_z_prob = etc_grid.predict_proba(X_test)[:,1]

# evaluate the model performance - ACCURACY AND ROC AUC
print("ETC Accuracy: ", accuracy_score(y_test, test_z))
print("ETC ROC AUC: ", roc_auc_score(y_test, test_z_prob))

ETC Accuracy:  0.8947368421052632
ETC ROC AUC:  0.9419403550976649


Question 5: Train and evaluate a Gradient Boosted Trees classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [9]:
parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

gbc_grid = GridSearchCV(GradientBoostingClassifier(), param_grid=parameters, cv=3, n_jobs=-1)
gbc_grid.fit(X_train, y_train)

In [10]:
# make predictions with the trained gradient boosting classifier
test_z = gbc_grid.predict(X_test)
test_z_prob = gbc_grid.predict_proba(X_test)[:,1]

# evaluate the model performance - ACCURACY AND ROC AUC
print("GBC Accuracy: ", accuracy_score(y_test, test_z))
print("GBC ROC AUC: ", roc_auc_score(y_test, test_z_prob))

GBC Accuracy:  0.96398891966759
GBC ROC AUC:  0.9657322908767617


[Bonus] Question 6: Which algorithm performed better and why?


The Accuracy and ROC AUC values for the Random Forest Classifier, Extra Trees Classifier and Gradient Boosting Classifier are as follows:
    
    RFC Accuracy:  0.8984302862419206
    RFC ROC AUC:  0.9504745966582119

    ETC Accuracy:  0.8947368421052632
    ETC ROC AUC:  0.9419403550976649

    GBC Accuracy:  0.96398891966759
    GBC ROC AUC:  0.9657322908767617

The GradientBoostingClassifier has the highest accuracy and the highest ROC AUC, so it performed best among the three classifiers.

**Q: Why did GradientBoostingClassifier perform the best among the three classifiers?**

Gradient boosting is an ensemble learning method that builds a strong classifier by combining many weak learners, here shallow decision trees. Like other boosting methods, it builds the model in a stage-wise fashion, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. Trees are added one at a time, and each new tree helps to correct the errors made by the trees trained before it. Random Forest and Extra Trees, in contrast, train their trees independently and average them, which mainly reduces variance; with trees limited to a depth of 2 or 4, as in this grid, sequential error correction is a likely reason for the gap. Gradient boosting is computationally more expensive and takes longer to train, but on this dataset it achieved higher accuracy and a higher ROC AUC than the other two classifiers.
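
The stage-wise idea can be sketched by hand. The example below is not what `GradientBoostingClassifier` does internally; it is a simplified regression version with squared loss, where each new tree is fit to the current residuals. The toy data, the learning rate, and the names `X_toy`, `y_toy`, and `mse_per_stage` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (an assumption): a noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y_toy, y_toy.mean())  # stage 0: a constant model
mse_per_stage = [np.mean((y_toy - prediction) ** 2)]

for _ in range(10):
    # For squared loss, the negative gradient is the residual.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Add a damped correction from the new tree to the running prediction.
    prediction += learning_rate * tree.predict(X_toy)
    mse_per_stage.append(np.mean((y_toy - prediction) ** 2))

print([round(m, 4) for m in mse_per_stage])
```

The training error shrinks from stage to stage because every tree targets what the previous stages got wrong, which is the behavior the paragraph above describes.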

Question 7: Create a new text cell in your Notebook and complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and to allot points for your effort in thinking, planning, and making connections to real-world work.

This week's assignment seemed fairly straightforward compared to last week's; that is, we didn't need to implement a machine learning algorithm from scratch. It came down to the standard practice of reading in the data, cleaning it, and applying three different models to predict the target and measure accuracy.

Really it was just down to reading the documentation of each of the different model types and applying them to the given data set.

It is somewhat reassuring that the most computationally intensive model (GradientBoostingClassifier, GBC) was also the most accurate. That classifier took nearly 3 times as long to train as the other classifiers, so I expected the extra computation to buy better performance, and it did: accuracy rose from about 0.90 to about 0.96. With datasets this small, using it is a no-brainer. But for very large datasets that might require days or weeks of training, the improvement in accuracy may not be worth the cost.
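
One way to check a training-time comparison like this is to time each `fit` call. The sketch below uses synthetic data and fixed settings (`max_depth=4`, `random_state=0`), all of which are assumptions; the timings it prints depend on the machine and will not match the ones observed in this notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Synthetic stand-in for the ad data (an assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=50, random_state=0)

models = {
    "RFC": RandomForestClassifier(max_depth=4, random_state=0),
    "ETC": ExtraTreesClassifier(max_depth=4, random_state=0),
    "GBC": GradientBoostingClassifier(max_depth=4, random_state=0),
}

fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    fit_seconds[name] = time.perf_counter() - start

for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.3f} s")
```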