# Lesson 3 Assignment

In this lab assignment, you will implement a simplified version of a Random Forest classifier and practice using and fine-tuning Random Forest, Extra Trees, and Gradient Boosted Trees. You will then compare the performance of these classifiers on the Internet Advertisements dataset.
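
The cells below do not include the simplified Random Forest itself, so here is a minimal sketch of what one could look like: bootstrap samples, one decision tree per sample, and a majority vote. The class name `SimpleRandomForest`, its parameters, and the `make_moons` toy data are assumptions made for illustration, and the vote assumes binary 0/1 labels.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


class SimpleRandomForest:
    """Hypothetical minimal forest: bootstrap samples plus majority vote."""

    def __init__(self, n_trees=25, max_features="sqrt", random_state=0):
        self.n_trees = n_trees
        self.max_features = max_features
        self.random_state = random_state
        self.trees_ = []

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        self.trees_ = []
        for _ in range(self.n_trees):
            seed = rng.randint(0, 2**31 - 1)
            # Bootstrap: draw as many rows as the training set, with replacement.
            X_boot, y_boot = resample(X, y, replace=True, random_state=seed)
            # Each tree considers a random subset of features at every split.
            tree = DecisionTreeClassifier(max_features=self.max_features, random_state=seed)
            self.trees_.append(tree.fit(X_boot, y_boot))
        return self

    def predict(self, X):
        # Majority vote across trees (assumes labels are 0 and 1).
        votes = np.stack([tree.predict(X) for tree in self.trees_])
        return (votes.mean(axis=0) >= 0.5).astype(int)


# Toy data, not the ad dataset: first 300 rows for training, the rest held out.
X_toy, y_toy = make_moons(n_samples=400, noise=0.25, random_state=0)
forest = SimpleRandomForest().fit(X_toy[:300], y_toy[:300])
toy_accuracy = (forest.predict(X_toy[300:]) == y_toy[300:]).mean()
print(f"held-out accuracy: {toy_accuracy:.3f}")
```

`RandomForestClassifier` in scikit-learn averages class probabilities rather than counting hard votes, so results from this sketch will not match it exactly.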

In [81]:
# import packages
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

# make this notebook's output stable across runs
np.random.seed(0)

## Data Set Information:

This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement ("ad") or not ("nonad"). Additional information can be found [here](https://archive.ics.uci.edu/ml/datasets/internet%2Badvertisements).

## Attribute Information:

The dataset has 3 continuous (height, width, aratio) and 1555 binary (urls, tags, captions) features.

## Source:

Creator & donor: Nicholas Kushmerick <nick '@' ucd.ie>

In [82]:
# (Mateusz Haligowski) the previous bad-lines argument is deprecated; use on_bad_lines instead
# Load the data
internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines='skip')
print(internetAd.info())
internetAd.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3279 entries, 0 to 3278
Columns: 1559 entries, height to Target
dtypes: int64(1554), object(5)
memory usage: 39.0+ MB
None




Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,...,caption*home,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you,Target
0,125,125,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
1,57,468,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
2,33,230,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
3,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
4,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
5,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
6,59,460,7.7966,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
7,60,234,3.9,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
8,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
9,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.


Question 1: Prepare the data and impute missing values with the median (in this dataset missing values are marked with `?`, and the target labels are `ad.` and `nonad.`)

In [83]:
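# Replace every cell that contains '?' with NaN
internetAd = internetAd.replace(r'\?', np.nan, regex=True)
# Encode the target: 'nonad.' -> 1, 'ad.' -> 0 (skipped if this cell is re-run)
if internetAd['Target'].dtype != 'int64':
  internetAd['Target'] = internetAd['Target'].str.contains('nonad', regex=False).astype('int64')
# Convert the remaining string columns to numeric, column by column
internetAd = internetAd.apply(pd.to_numeric)
# Fill the remaining NaNs with each column's median
internetAd = internetAd.fillna(internetAd.median())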

display(internetAd)

Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,...,caption*home,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you,Target
0,125.0,125.0,1.0000,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,57.0,468.0,8.2105,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,33.0,230.0,6.9696,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,60.0,468.0,7.8000,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,60.0,468.0,7.8000,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3274,170.0,94.0,0.5529,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3275,101.0,140.0,1.3861,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3276,23.0,120.0,5.2173,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3277,51.0,110.0,2.1020,1.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
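
To make the preparation steps above easier to follow, here is a small sketch that applies the same sequence (mark `?` as NaN, encode the target, convert to numeric, fill with the column median) to a three-row toy frame. The frame `toy` and its values are made up for illustration and are not taken from the ad dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with '?' placeholders, some padded with spaces.
toy = pd.DataFrame({
    "height": ["125", "   ?", "33"],
    "width": ["125", "468", "?"],
    "Target": ["ad.", "nonad.", "ad."],
})

# 1. Any cell containing '?' becomes NaN.
toy = toy.replace(r"\?", np.nan, regex=True)
# 2. Encode the label: 'nonad.' -> 1, 'ad.' -> 0.
toy["Target"] = toy["Target"].str.contains("nonad", regex=False).astype(int)
# 3. Convert the string columns to numbers.
toy = toy.apply(pd.to_numeric)
# 4. Fill each column's NaNs with that column's median.
toy = toy.fillna(toy.median())
print(toy)
```

Because the median is computed after the `?` cells have been turned into NaN, the placeholders themselves never influence the fill value.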


Question 2: Split the dataset into training and test sets

In [84]:
from sklearn.model_selection import train_test_split

print(internetAd)
X = internetAd.drop(columns=['Target'])
y = internetAd['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

      height  width  aratio  local  url*images+buttons  url*likesbooks.com  \
0      125.0  125.0  1.0000    1.0                   0                   0   
1       57.0  468.0  8.2105    1.0                   0                   0   
2       33.0  230.0  6.9696    1.0                   0                   0   
3       60.0  468.0  7.8000    1.0                   0                   0   
4       60.0  468.0  7.8000    1.0                   0                   0   
...      ...    ...     ...    ...                 ...                 ...   
3274   170.0   94.0  0.5529    0.0                   0                   0   
3275   101.0  140.0  1.3861    1.0                   0                   0   
3276    23.0  120.0  5.2173    1.0                   0                   0   
3277    51.0  110.0  2.1020    1.0                   0                   0   
3278    40.0   40.0  1.0000    1.0                   0                   0   

      url*www.slake.com  url*hydrogeologist  url*oso  url*media
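
The split above is purely random. If the two classes turn out to be imbalanced (this notebook does not check), passing `stratify=y` keeps the class proportions similar in both parts. The sketch below uses made-up data with a 15% minority class, which is an assumption chosen only to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 45 samples of class 0, 255 of class 1.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(300, 5))
y_demo = np.array([0] * 45 + [1] * 255)

# stratify=y_demo preserves the class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)
print("share of class 1 in train:", round(y_tr.mean(), 3))
print("share of class 1 in test: ", round(y_te.mean(), 3))
```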

Question 3: Train and evaluate a Random Forest classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [85]:
parameters = {
  'max_depth': [2,4],
  'min_samples_split': [0.05, 0.1, 0.2]
}

classifier = RandomForestClassifier()
dtc_grid = GridSearchCV(classifier, parameters)
dtc_grid.fit(X_train, y_train)

In [87]:
# make predictions with the trained random forest
test_z = dtc_grid.predict(X_test)
test_z_prob = dtc_grid.predict_proba(X_test)

# evaluate the model performance - ACCURACY AND ROC AUC
acc_score = accuracy_score(y_test, test_z)
display(acc_score)
roc_auc = roc_auc_score(y_test, test_z_prob[:, 1])
display(roc_auc)

0.9048938134810711

0.9543478545093219
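
The two numbers above do not show which parameter combination the grid search picked. A fitted `GridSearchCV` exposes that through `best_params_` and `best_score_`, and it selects by accuracy for classifiers unless `scoring` is set. The sketch below runs on synthetic data from `make_classification`; the smaller `n_estimators` and the choice of `scoring="roc_auc"` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the ad data.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"max_depth": [2, 4], "min_samples_split": [0.05, 0.1, 0.2]}

# scoring="roc_auc" makes the search pick parameters by cross-validated ROC AUC
# instead of the default accuracy; cv=5 is the default, written out for clarity.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean CV ROC AUC:", round(search.best_score_, 3))
print("combinations tried:", len(search.cv_results_["params"]))
```

A float `min_samples_split` is read as a fraction of the training samples, so 0.05 means a node needs at least 5% of the samples to be split.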

Question 4: Train and evaluate an Extra Trees classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [88]:
parameters = {
  'max_depth': [2,4],
  'min_samples_split': [0.05, 0.1, 0.2]
}

classifier = ExtraTreesClassifier()
dtc_grid = GridSearchCV(classifier, parameters)
dtc_grid.fit(X_train, y_train)

In [89]:
# make predictions with the trained Extra Trees model
test_z = dtc_grid.predict(X_test)
test_z_prob = dtc_grid.predict_proba(X_test)

# evaluate the model performance - ACCURACY AND ROC AUC
acc_score = accuracy_score(y_test, test_z)
display(acc_score)
roc_auc = roc_auc_score(y_test, test_z_prob[:, 1])
display(roc_auc)

0.8873499538319483

0.9285686269382631

Question 5: Train and evaluate a Gradient Boosted Trees classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [90]:
parameters = {
  'max_depth': [2,4],
  'min_samples_split': [0.05, 0.1, 0.2]
}

classifier = GradientBoostingClassifier()
dtc_grid = GridSearchCV(classifier, parameters)
dtc_grid.fit(X_train, y_train)

In [91]:
# make predictions with the trained Gradient Boosted Trees model
test_z = dtc_grid.predict(X_test)
test_z_prob = dtc_grid.predict_proba(X_test)

# evaluate the model performance - ACCURACY AND ROC AUC
acc_score = accuracy_score(y_test, test_z)
display(acc_score)
roc_auc = roc_auc_score(y_test, test_z_prob[:, 1])
display(roc_auc)

0.9630655586334257

0.9650622336113799
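
Questions 3 to 5 repeat the same fit-and-evaluate steps with a different estimator each time. One way to avoid the copy-paste is a loop that collects the scores in a single table, sketched below on synthetic data. The `models` dictionary, the reduced `n_estimators`, and the `results` frame are assumptions for illustration, so the numbers will differ from the ones reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the ad data.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

param_grid = {"max_depth": [2, 4], "min_samples_split": [0.05, 0.1, 0.2]}
models = {
    "Random Forest": RandomForestClassifier(n_estimators=30, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=30, random_state=0),
    "Gradient Boosted Trees": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

rows = []
for name, model in models.items():
    # Same grid for every model; the search refits the best one on all of X_tr.
    search = GridSearchCV(model, param_grid).fit(X_tr, y_tr)
    proba = search.predict_proba(X_te)[:, 1]
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, search.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, proba),
    })

results = pd.DataFrame(rows).set_index("model")
print(results)
```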

[Bonus] Question 6: Which algorithm performed better and why?


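All three classifiers performed well on the test set. The best one turned out to be the gradient boosting classifier (accuracy 0.963, ROC AUC 0.965), ahead of Random Forest (0.905, 0.954) and Extra Trees (0.887, 0.929).
A likely reason is that boosting builds its trees sequentially, each one fitted to correct the errors of the ensemble so far, so it can keep reducing bias even with the shallow trees (max_depth of 2 or 4) this grid allows. Random Forest and Extra Trees average independently grown trees, which mainly reduces variance.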
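
The iterative nature of gradient boosting can be observed directly: `staged_predict` yields the ensemble's predictions after each added tree. The sketch below tracks accuracy stage by stage on synthetic data; the dataset and the settings (`n_estimators=100`, `max_depth=2`) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ad data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

gbt = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
gbt.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
staged_train_acc = [accuracy_score(y_tr, pred) for pred in gbt.staged_predict(X_tr)]
staged_test_acc = [accuracy_score(y_te, pred) for pred in gbt.staged_predict(X_te)]

for n_trees in (1, 10, 50, 100):
    print(
        f"{n_trees:>3} trees: "
        f"train acc {staged_train_acc[n_trees - 1]:.3f}, "
        f"test acc {staged_test_acc[n_trees - 1]:.3f}"
    )
```

Training accuracy generally rises as trees are added, while test accuracy can flatten or dip once the ensemble starts to overfit.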

Question 7: Create a new text cell in your notebook and write a 50-100 word summary of your experience with this assignment (or a short description of how you applied this week's learning to the solution). Include: what your incoming experience with this model was, if any; what steps you took; what obstacles you encountered; and how you link this exercise to real-world machine learning problem-solving (what steps were missing? what else do you need to learn?). This summary lets your instructor know how you are doing and allot points for your effort in thinking, planning, and making connections to real-world work.

Compared to the previous exercise, this one was way easier :). I didn't have a lot of experience working with random forests coming into this exercise. It's really interesting to see how sklearn implemented the grid search. It makes the whole process incredibly easy, and the whole exercise was mostly copy-paste.

As for real-life applications, it's still one of the simplest algorithms to apply, and I'll definitely want to try it out on a real-world problem.

One thing I want to try out is NaN imputation with medians computed per class, instead of over the full dataset.
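
A minimal sketch of that idea, assuming a toy frame with one feature and a binary `Target` column (both made up for illustration): `groupby(...).transform("median")` gives each row the median of its own class, which then fills the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: class 0 images are short, class 1 images are tall.
toy = pd.DataFrame({
    "height": [60.0, np.nan, 58.0, 200.0, np.nan, 180.0],
    "Target": [0, 0, 0, 1, 1, 1],
})

# Full-data median: every gap gets the same value.
global_fill = toy["height"].fillna(toy["height"].median())

# Class-wise median: each gap gets the median of rows with the same Target.
class_medians = toy.groupby("Target")["height"].transform("median")
classwise_fill = toy["height"].fillna(class_medians)

print(pd.DataFrame({
    "Target": toy["Target"],
    "global": global_fill,
    "classwise": classwise_fill,
}))
```

One caveat: the target is unknown for new data, so class-wise medians can only be applied to labelled training rows, and using them on a test set would leak label information into the features.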