# Lesson 4 Assignment

In this lab assignment, you will build a non-tree-based ensemble classifier: you can take any base learner and pass it to a `BaggingClassifier` or to any of the other ensemble learners in `sklearn.ensemble`.

In [79]:
# import packages
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

# make this notebook's output stable across runs
np.random.seed(0)

## Data Set Information:

This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement ("ad") or not ("nonad"). Additional information can be found [here](https://archive.ics.uci.edu/ml/datasets/internet%2Badvertisements).

## Attribute Information:

The dataset has 3 continuous (height, width, aratio) and 1555 binary (urls, tags, captions) features.

## Source:

Creator & donor: Nicholas Kushmerick <nick '@' ucd.ie>

#### Note (MH) 

I'm using Python 3.11 with a recent version of pandas, so the `read_csv` call below passes the `on_bad_lines` parameter instead of the older `error_bad_lines`.

In [80]:
# Load the data
internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines="skip")
print(internetAd.info())
internetAd.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3279 entries, 0 to 3278
Columns: 1559 entries, height to Target
dtypes: int64(1554), object(5)
memory usage: 39.0+ MB
None




Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,...,caption*home,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you,Target
0,125,125,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
1,57,468,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
2,33,230,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
3,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
4,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
5,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
6,59,460,7.7966,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
7,60,234,3.9,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
8,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
9,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.


Question 1: Prepare and impute missing values with the median

### MH

1. Switch the `Target` variable to binary values (`0`/`1`).
1. Replace the `?` placeholders with `NA`.
1. Impute the missing values with each column's median.

In [81]:
# Switching the `Target` variable
if "Target_orig" not in internetAd:
    internetAd["Target_orig"] = internetAd["Target"]
internetAd["Target"] = internetAd["Target_orig"].map(
    lambda value: 1 if value == "ad." else 0
)

# Replace `?` with NA
internetAd = internetAd.replace(to_replace=r".*\?.*", value=np.nan, regex=True)

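# Convert every feature column to float and fill its gaps with the column median
for column in internetAd:
    if column == 'Target_orig':
        # keep the original string labels untouched
        continue
    internetAd[column] = internetAd[column].astype('float64')
    internetAd[column] = internetAd[column].fillna(internetAd[column].median())
# Inspect the result for good measure.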
internetAd.head(20)

Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,...,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you,Target,Target_orig
0,125.0,125.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
1,57.0,468.0,8.2105,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
2,33.0,230.0,6.9696,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
3,60.0,468.0,7.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
4,60.0,468.0,7.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
5,60.0,468.0,7.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
6,59.0,460.0,7.7966,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
7,60.0,234.0,3.9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
8,60.0,468.0,7.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
9,60.0,468.0,7.8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,ad.
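The loop above computes the medians on the full table before the train/test split. As an alternative, here is a minimal sketch using scikit-learn's `SimpleImputer`, which can be fitted on the training split only so that test rows do not influence the medians. The toy matrix `X_toy` is made up for illustration and is not taken from the ad data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (hypothetical values, not the ad data).
X_toy = np.array([
    [125.0, 125.0],
    [np.nan, 468.0],
    [33.0, np.nan],
    [60.0, 468.0],
])

# strategy="median" replaces each NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_toy)

print(imputer.statistics_)  # per-column medians learned during fit
print(X_filled)
```

In a real workflow the imputer would be fitted with `fit` on the training features and then applied with `transform` to both splits.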


Question 2: Split dataset into training and test set

In [82]:
from sklearn.model_selection import train_test_split

X = internetAd.drop(["Target", "Target_orig"], axis=1)
y = internetAd["Target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
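
If the two classes turn out to be imbalanced, the `stratify` argument of `train_test_split` keeps the class proportions roughly equal in both splits. The sketch below uses synthetic data (`X_toy`, `y_toy` are assumptions for illustration), not the ad dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 30 positives out of 200 rows (assumed values).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 30 + [0] * 170)

# stratify=y_toy preserves the positive rate in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

print(f"overall: {y_toy.mean():.3f}, train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```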

Question 3: Train and evaluate a classifier using LogisticRegression.

In [83]:
dtc_grid = LogisticRegression()
dtc_grid.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
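
The warning suggests two remedies: scaling the data or raising `max_iter`. A minimal sketch of both, assuming a pipeline of `StandardScaler` followed by `LogisticRegression`, is shown below on synthetic data. The value `max_iter=1000` and the toy data are assumptions, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one badly scaled feature (illustrative only).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_clusters_per_class=1, random_state=0
)
X_toy[:, 0] *= 1000.0

# Standardize the features, then fit with a larger iteration budget.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_toy, y_toy)

# n_iter_ reports how many solver iterations were actually used.
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print(f"iterations used: {n_iter}")
print(f"training accuracy: {scaled_lr.score(X_toy, y_toy):.3f}")
```

Because the scaler sits inside the pipeline, it is refitted on whatever data is passed to `fit`, which also makes the pipeline usable as a base estimator in the ensembles.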


In [84]:
model = dtc_grid

# make predictions with the trained logistic regression
test_z = dtc_grid.predict(X_test)
test_z_prob = dtc_grid.predict_proba(X_test)

accuracy = dtc_grid.score(X_test, y_test)
print(f"Accuracy: {accuracy}")
auc = roc_auc_score(y_test, test_z_prob[:,1])
print(f"AUC: {auc}")

Accuracy: 0.9602954755309326
AUC: 0.9781299845723399
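
The same evaluation steps are repeated for every model in this notebook. One way to avoid the repetition is a small helper; `evaluate` is a hypothetical name introduced here, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_eval, y_eval):
    """Return (accuracy, ROC AUC) for a fitted binary classifier."""
    predictions = model.predict(X_eval)
    # Column 1 of predict_proba holds the probability of the positive class.
    positive_prob = model.predict_proba(X_eval)[:, 1]
    return accuracy_score(y_eval, predictions), roc_auc_score(y_eval, positive_prob)


# Usage on synthetic data (illustrative only).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

acc, auc = evaluate(clf, X_te, y_te)
print(f"Accuracy: {acc}")
print(f"AUC: {auc}")
```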


Question 4: Use BaggingClassifier to train and evaluate an ensemble model of LogisticRegression base classifiers. Each base classifier should be trained only on a sample half the size of the training data, and using only half as many features as there are in total in the training data (read the documentation for the function to see how to do this).

In [85]:
bagOLR = BaggingClassifier(estimator=LogisticRegression(), max_samples=0.5, max_features=0.5)
bagOLR.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [86]:
model = bagOLR

# make predictions with the trained bagging ensemble
test_z = model.predict(X_test)
test_z_prob = model.predict_proba(X_test)

accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")
auc = roc_auc_score(y_test, test_z_prob[:,1])
print(f"AUC: {auc}")

Accuracy: 0.9547553093259464
AUC: 0.9812351279972804


Question 5: Use AdaBoostClassifier to train and evaluate an ensemble model of LogisticRegression base classifiers.

In [87]:
boostOkLR = AdaBoostClassifier(estimator=LogisticRegression())
boostOkLR.fit(X_train, y_train)



In [88]:
model = boostOkLR

# make predictions with the trained AdaBoost ensemble
test_z = model.predict(X_test)
test_z_prob = model.predict_proba(X_test)

accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")
auc = roc_auc_score(y_test, test_z_prob[:,1])
print(f"AUC: {auc}")

Accuracy: 0.9538319482917821
AUC: 0.9783195617498628


[Bonus] Question 6: Use StackingClassifier to train and evaluate an ensemble model of LogisticRegression base classifiers to get better accuracy than the previous classifiers.

In [89]:
stackingClassifier = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
        ("lr", LogisticRegression()),
    ]
)

stackingClassifier.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [90]:
model = stackingClassifier

# make predictions with the trained stacking ensemble
test_z = model.predict(X_test)
test_z_prob = model.predict_proba(X_test)

accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")
auc = roc_auc_score(y_test, test_z_prob[:,1])
print(f"AUC: {auc}")


Accuracy: 0.9695290858725761
AUC: 0.9771690243966216
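
The `StackingClassifier` above relies on its defaults: a `LogisticRegression` final estimator trained on 5-fold cross-validated predictions of the base models. The sketch below spells those defaults out and inspects the meta-features the final estimator receives. It uses synthetic data, and `max_iter=1000` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only).
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),  # the default, written out explicitly
    cv=5,                                  # the default 5-fold cross-validation
)
stack.fit(X_tr, y_tr)

# For a binary problem, each base model contributes one column:
# its predicted probability of the positive class.
meta_features = stack.transform(X_te)
print(meta_features.shape)
print(stack.final_estimator_.coef_)  # weight given to each base model's output
```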


Question 7: Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.

I had some experience working with logistic regression and it was definitely interesting to see how to improve the predictor in some weak cases.

I was surprised to see that the accuracy of the base logistic model was pretty good to start with. I should probably look a little more into the data and possibly select some subset of variables. Also, the default solver did not converge within its iteration limit, so I may need to take a deeper look at scaling, as recommended by the estimator's own warning.

Because the baseline already reaches about 96% accuracy and a ROC AUC of about 0.978, it is difficult to see a real improvement from the ensembles. I'm really glad to see that the estimators are so easy to use!