# Lesson 3 Assignment

In this lab assignment, you will implement a simplified version of a Random Forest classifier and practice using and fine-tuning Random Forest, Extra Trees, and Gradient Boosted Trees. You will then compare the performance of these classifiers on the Internet Advertisements dataset.
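
As a rough illustration of what a "simplified" Random Forest could look like, the sketch below bags scikit-learn decision trees on bootstrap samples and combines them by majority vote. The class name `SimpleRandomForest`, its parameters, and the `make_moons` demo data are assumptions made for this example, not part of the assignment, and binary 0/1 labels are assumed.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


class SimpleRandomForest:
    """Hypothetical teaching sketch: bagged decision trees with majority voting."""

    def __init__(self, n_trees=25, max_depth=4, random_state=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.random_state = random_state
        self.trees_ = []

    def fit(self, X, y):
        self.trees_ = []
        for i in range(self.n_trees):
            seed = self.random_state + i
            # Bootstrap sample: draw len(X) rows with replacement
            X_boot, y_boot = resample(X, y, replace=True, random_state=seed)
            # max_features="sqrt" makes each split consider a random feature subset
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth, max_features="sqrt", random_state=seed
            )
            self.trees_.append(tree.fit(X_boot, y_boot))
        return self

    def predict(self, X):
        # Each tree votes; the majority class wins (binary 0/1 labels assumed)
        votes = np.array([tree.predict(X) for tree in self.trees_])
        return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic demo data, unrelated to the ad dataset
X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=0)
forest = SimpleRandomForest().fit(X_demo[:400], y_demo[:400])
demo_predictions = forest.predict(X_demo[400:])
demo_accuracy = (demo_predictions == y_demo[400:]).mean()
print(f"held-out accuracy: {demo_accuracy:.3f}")
```

The two sources of randomness here (bootstrap rows and a random feature subset per split) are what distinguish a random forest from a single decision tree; scikit-learn's `RandomForestClassifier` averages class probabilities rather than counting hard votes.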

In [3]:
# import packages
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

# make this notebook's output stable across runs
np.random.seed(0)

## Data Set Information:

This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement ("ad") or not ("nonad"). Additional information can be found [here](https://archive.ics.uci.edu/ml/datasets/internet%2Badvertisements).

## Attribute Information:

The dataset has 3 continuous (height, width, aratio) and 1555 binary (urls, tags, captions) features.

## Source:

Creator & donor: Nicholas Kushmerick <nick '@' ucd.ie>

In [4]:
# Load the data
df = pd.read_csv('Internet_Ad_Data.csv', sep=',')
df.info()  # info() prints its summary directly and returns None
df.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3279 entries, 0 to 3278
Columns: 1559 entries, height to Target
dtypes: int64(1554), object(5)
memory usage: 39.0+ MB

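| | `height` | `width` | `aratio` | `local` | `url*images+buttons` | `url*likesbooks.com` | `url*www.slake.com` | `url*hydrogeologist` | `url*oso` | `url*media` | ... | `caption*home` | `caption*my` | `caption*your` | `caption*in` | `caption*bytes` | `caption*here` | `caption*click` | `caption*for` | `caption*you` | `Target` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 125 | 125 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 1 | 57 | 468 | 8.2105 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 2 | 33 | 230 | 6.9696 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 3 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 4 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 5 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 6 | 59 | 460 | 7.7966 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 7 | 60 | 234 | 3.9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 8 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 9 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |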


Question 1: Prepare the data and impute missing values with the median (missing values in this dataset are marked with `?`; the target labels are `nonad.` and `ad.`)

In [5]:
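# '?' marks a missing value in these columns
missing_val_columns = ['height', 'width', 'aratio', 'local']
for column in missing_val_columns:
    is_missing = df[column].astype(str).str.contains(r"\?")
    # median of the known values only, parsed as numbers
    median = pd.to_numeric(df.loc[~is_missing, column]).median()
    df[column] = df[column].replace(r"\?", median, regex=True)
df.head()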

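| | `height` | `width` | `aratio` | `local` | `url*images+buttons` | `url*likesbooks.com` | `url*www.slake.com` | `url*hydrogeologist` | `url*oso` | `url*media` | ... | `caption*home` | `caption*my` | `caption*your` | `caption*in` | `caption*bytes` | `caption*here` | `caption*click` | `caption*for` | `caption*you` | `Target` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 125 | 125 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 1 | 57 | 468 | 8.2105 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 2 | 33 | 230 | 6.9696 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 3 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |
| 4 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ad. |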
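
For comparison, a common alternative for median imputation is to coerce the column to a numeric dtype first, so that every `?` becomes `NaN`, and then fill the gaps. The sketch below shows this on a small made-up frame; the `toy` data and its values are assumptions for illustration and are not taken from the ad dataset.

```python
import pandas as pd

# Toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "height": ["125", "   ?", "60", "33"],
    "width": ["125", "468", "   ?", "230"],
})

for column in ["height", "width"]:
    # errors="coerce" turns anything unparseable, including '?', into NaN
    numeric = pd.to_numeric(toy[column], errors="coerce")
    # Series.median() skips NaN by default
    toy[column] = numeric.fillna(numeric.median())

print(toy.dtypes)
print(toy)
```

Here the missing height becomes 60 and the missing width becomes 230, the medians of the remaining values. Unlike a string replacement, this leaves the columns with a float dtype rather than a mix of strings and numbers.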


Question 2: Split the dataset into training and test sets

In [6]:
from sklearn.model_selection import train_test_split

X = df.drop('Target', axis=1)
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=99)

Question 3: Train and evaluate a Random Forest classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
parameters = {
    'max_depth': [2,4],
    'min_samples_split': [0.05, 0.1, 0.2]
}

dtc_grid = GridSearchCV(RandomForestClassifier(), param_grid=parameters)
dtc_grid.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [2, 4],
                         'min_samples_split': [0.05, 0.1, 0.2]})

In [8]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

# make predictions with the trained random forest
test_z = dtc_grid.predict(X_test)
test_z_prob = dtc_grid.predict_proba(X_test)[:, 1]

In [9]:
# evaluate the model performance - ACCURACY AND ROC AUC
accuracy_score(y_test, test_z)

0.9039704524469068

In [10]:
roc_auc_score(y_test, test_z_prob)

0.9749033075486797
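
The scores above come from whichever parameter combination the grid search selected: with the default `refit=True`, `GridSearchCV` refits the best combination on the full training set, and `predict` and `predict_proba` delegate to that refit estimator. A minimal sketch of how to inspect the selection, run on synthetic `make_moons` data rather than the ad dataset (the names `demo_grid`, `X_demo`, and `y_demo`, and the settings `n_estimators=20` and `cv=5`, are assumptions for this example):

```python
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, unrelated to the ad dataset
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

demo_grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4], "min_samples_split": [0.05, 0.1, 0.2]},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# Winning combination and its mean cross-validated accuracy
print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))

# One row per parameter combination (2 x 3 = 6 rows)
results = pd.DataFrame(demo_grid.cv_results_)
print(results[["param_max_depth", "param_min_samples_split",
               "mean_test_score", "rank_test_score"]])
```

The same attributes (`best_params_`, `best_score_`, `cv_results_`) are available on any fitted `GridSearchCV`, so they can be read from the grid fitted in this question as well.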

Question 4: Train and evaluate an Extra Trees classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [11]:
from sklearn.ensemble import ExtraTreesClassifier

parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

etc_grid = GridSearchCV(ExtraTreesClassifier(), param_grid=parameters)
etc_grid.fit(X_train, y_train)

GridSearchCV(estimator=ExtraTreesClassifier(),
             param_grid={'max_depth': [2, 4],
                         'min_samples_split': [0.05, 0.1, 0.2]})

In [12]:
# make predictions with the trained extra trees classifier
test_z = etc_grid.predict(X_test)
test_z_prob = etc_grid.predict_proba(X_test)[:, 1]

# evaluate the model performance - ACCURACY AND ROC AUC

In [13]:
accuracy_score(y_test, test_z)

0.8975069252077562

In [14]:
roc_auc_score(y_test, test_z_prob)

0.9687916777807415

Question 5: Train and evaluate a Gradient Boosted Trees classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [15]:
from sklearn.ensemble import GradientBoostingClassifier

parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

gbt_grid = GridSearchCV(GradientBoostingClassifier(), param_grid=parameters)
gbt_grid.fit(X_train, y_train)

GridSearchCV(estimator=GradientBoostingClassifier(),
             param_grid={'max_depth': [2, 4],
                         'min_samples_split': [0.05, 0.1, 0.2]})

In [16]:
# make predictions with the trained gradient boosted trees classifier
test_z = gbt_grid.predict(X_test)
test_z_prob = gbt_grid.predict_proba(X_test)[:, 1]

# evaluate the model performance - ACCURACY AND ROC AUC

In [17]:
accuracy_score(y_test, test_z)

0.976915974145891

In [18]:
roc_auc_score(y_test, test_z_prob)

0.9780374766604428

[Bonus] Question 6: Which algorithm performed better and why?


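The gradient boosted trees classifier performed best on both metrics. Its accuracy was much higher (0.977, versus 0.904 for the random forest and 0.898 for Extra Trees), and its ROC AUC was slightly higher (0.978, versus 0.975 and 0.969). My sense is that this is because gradient boosting builds its trees sequentially, with each new tree correcting the errors of the ones before it, so even the shallow trees allowed by this grid (max_depth of 2 or 4) can add up to a complex decision boundary, while its ensemble nature helps mitigate overfitting. Random forest and Extra Trees instead average independently grown shallow trees, which limits how much structure they can capture at these depths. It seems that, with this data, the added model complexity was useful for learning meaningful patterns that the relatively simpler random forest model did not capture.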
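
A side-by-side table makes this kind of comparison easier to read than separate cells. The sketch below uses a hypothetical helper, `compare_models`, that collects accuracy and ROC AUC for several fitted classifiers. It runs on synthetic `make_moons` data with assumed settings (`max_depth=2`, `random_state=0`), so its numbers say nothing about the ad dataset results.

```python
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def compare_models(models, X_eval, y_eval):
    """Hypothetical helper: accuracy and ROC AUC for each fitted classifier."""
    rows = []
    for name, model in models.items():
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_eval, model.predict(X_eval)),
            # column 1 holds the probability of the second class in model.classes_
            "roc_auc": roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1]),
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic stand-in data, unrelated to the ad dataset
X_demo, y_demo = make_moons(n_samples=600, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=99
)

demo_models = {
    "random forest": RandomForestClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr),
    "extra trees": ExtraTreesClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr),
    "gradient boosting": GradientBoostingClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr),
}

comparison = compare_models(demo_models, X_te, y_te)
print(comparison)
```

Because a fitted `GridSearchCV` exposes `predict` and `predict_proba`, fitted grid objects could be passed to the same helper in place of the plain classifiers.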

Question 7: Create a new text cell in your Notebook and complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and to allot points for your effort in thinking, planning, and making connections to real-world work.

I was not familiar with these models going into this assignment. I had always heard gradient boosted trees described as a type of model with impressive and, in some cases, state-of-the-art performance, so it was interesting to see its power firsthand.