# Internet Advertisements Data Set

## Data Set Information:

This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement ("ad") or not ("nonad"). Additional information can be found [here](https://archive.ics.uci.edu/ml/datasets/internet%2Badvertisements).

## Attribute Information:

The dataset has 3 continuous (height, width, aratio) and 1555 binary (urls, tags, captions) features.

## Source:

Creator & donor: Nicholas Kushmerick <nick '@' ucd.ie>

# Learning Objectives
- Identify and impute missing data
- Use normalization as part of the modeling process: min max normalization.
- Use normalization as part of the modeling process: centering and scaling.
- Use hold-out validation to compare the performance of a pair of models using a large data set.

In [89]:
import pandas as pd

# Load the data
internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines='skip', low_memory=False)
print(internetAd.info())
internetAd.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3279 entries, 0 to 3278
Columns: 1559 entries, height to Target
dtypes: int64(1554), object(5)
memory usage: 39.0+ MB
None


Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,...,caption*home,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you,Target
0,125,125,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
1,57,468,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
2,33,230,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
3,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
4,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
5,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
6,59,460,7.7966,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
7,60,234,3.9,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
8,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
9,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.


Question 1.1: Replace all values of '   ?' with NaN, and replace ad. with 1 and nonad. with 0.

In [90]:
import numpy as np

# Replace ?

# Regex matching a string made of one or more whitespace characters followed by a single question mark.
# A bare "?" with no leading whitespace is not matched by this pattern.
pattern = r"^\s+\?$"

internetAd = internetAd.replace(pattern, np.nan, regex=True)
internetAd["Target"] = internetAd["Target"].map({"ad.": 1, "nonad.": 0})

Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,...,caption*home,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you,Target
0,125.0,125.0,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,57.0,468.0,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,33.0,230.0,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
6,59.0,460.0,7.7966,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
7,60.0,234.0,3.9,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,60.0,468.0,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Question 1.1 Part 2: Check that replacing worked

In [91]:
# Limit the view to two columns; row 10 has some missing values.
internetAd[['height', 'Target']].head(20)

Unnamed: 0,height,Target
0,125.0,1
1,57.0,1
2,33.0,1
3,60.0,1
4,60.0,1
5,60.0,1
6,59.0,1
7,60.0,1
8,60.0,1
9,60.0,1
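
Looking at the first rows only confirms the replacement for the cells that happen to be displayed. A minimal sketch of a more systematic check follows, assuming a small hypothetical stand-in frame `df` in place of `internetAd`; the `leftover` name is an illustrative choice. It counts the cells in each column that are still a question mark after the same regex replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for internetAd: `height` uses "   ?" (with leading
# spaces) as its placeholder, while `local` uses a bare "?".
df = pd.DataFrame({
    "height": ["125", "   ?", "60"],
    "local": ["1", "?", "0"],
})

# Same pattern as in the cell above: whitespace followed by a question mark.
df = df.replace(r"^\s+\?$", np.nan, regex=True)

# Count, per column, the cells that are still just a question mark.
leftover = df.astype(str).apply(lambda col: col.str.strip().eq("?").sum())
print(leftover)
```

A non-zero count for a column means some placeholders in that column were written differently from what the pattern expects.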


Make sure that "height", "width", and "aratio" are of type float.

In [95]:
internetAd[['height', 'width', 'aratio']] = internetAd[['height', 'width', 'aratio']].astype(float)
internetAd[['height', 'width', 'aratio']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3279 entries, 0 to 3278
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   height  2376 non-null   float64
 1   width   2378 non-null   float64
 2   aratio  2369 non-null   float64
dtypes: float64(3)
memory usage: 77.0 KB


Question 1.2: Describe the statistics for each of the columns ["height","width","aratio","local"] 

In [96]:
internetAd[['height', 'width', 'aratio', 'local']].describe()

Unnamed: 0,height,width,aratio
count,2376.0,2378.0,2369.0
mean,64.021886,155.344828,3.911953
std,54.868604,130.03235,6.042986
min,1.0,1.0,0.0015
25%,25.0,80.0,1.0357
50%,51.0,110.0,2.102
75%,85.25,184.0,5.3333
max,640.0,640.0,60.0


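The average height is about 64 pixels and the average width is about 155 pixels. The average aspect ratio is around 3.9, which is much wider than a typical image. The minimum values (a height and width of 1 pixel and an aspect ratio of 0.0015) might indicate errors in the dataset. The `local` column is absent from the output because it is still stored as text, and `describe` summarizes only numeric columns by default.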

Question 2.1: Calculate and display the mean, median, and mode for each of the columns ["height","width","aratio","local"], as well as the overall statistics using the describe method.

In [97]:
#Mean
internetAd[['height', 'width', 'aratio', 'local']].mean()

TypeError: Could not convert ['111111111111111101111111111100000000111000001010111111111111100000111000010111111111111111111101111101110011111111111111101111110111?111111000110001111111111111111111111110111100000010000001101111111111111110011111101110111111111????1111111111111100011110110111111111111111111101111101111111011100111100011011001001111000111111110111101011111111100000011111111110011111111011111111111110100111011001011111001110111111111101111111111011111111111111111111011111111100110111111101111101111101111111111110011111111111011111101111111101100101111111111101111110011101001011111111101110011100110101101110110010101111101101110010001110110010011000111?01111110110110111011110111001111010111000110110101110111110101001111001011011111111010110111011101111110111111010110011101011101011111011111101111111101110111111011111110110110111111110110111111011111101100111111111111111111111111000101111011100111011110011110111000111011110110001110110110111101111010110010111011011100101111101000111001010111111111101001011111101101111111101111101110110111111111011011111110111110110100100?11111111100000?110111110101111?111101101100110111111100100110111111111101101001010111101011111111111101110111110111010110011111110111111111111010101111010111111010011011011111111111111110111111111011011011111111111111101111100111111111111111101111?1111111110?1101111101101111111111111?11111101011111111101011111000111111011000111101011111110101101111110111110111111100111110011101111111011111110111111111111110111001111110101011111011011111111011110111001111101111111110100111111101?00111101011111111110101101111111111111111010111101111111011111111111100111111111100111111111101101111111011111111110111101100111110000101111111010111011111111110111111010011111101011111111111110011111111111011111111110011011111101101111111111111110011110101110011001111111101111111111111111110101111110101111111111111110101110110?111111110111111101011111011111111111111110011111100110011111111011111111111101111110111
011101111110110111011000110110111101101111111100111111011010101?1111111101111111101110101101110010111100111011010101111101010111011100111001110111011011011110111111110111011111010011111001110101110101110010111111111110111111110001111100111101101001111111111111111101010011111111111101111111110101101001111111111111101110111011111111101111011011101010111110111011111111011111101111100110110111111101111101101110111111010110101111101111111111111111110111110011111101011101011111110011011111110101111111101111111111011101101111100011111111111111011111011110110111111111111011111011111101101111101101111011101111111111111111110111111111111111111111111110111111111110111111111101111111111000010011011100110011000011001111100010010101111110111111111111011101110001100101001100110111101111101111111110111111100010001101110101100011110001110001101111011101001111100010110111100100101111000101110111111011111111010111011101111011101101101111011110111011101111111101011111101111111011010111011010101111111001110101001101110111110100011101110111001111111111011010011100100010111010100111101111101100111011111100011111101101111010110011111111101111011100011001001111111001111110111111111101101110101111110111101111100110110011111100111111111100111111111110011100111111101111111111010101111111111111111111111011111111001111'] to numeric

In [None]:
#Median

In [None]:
#Mode
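
The `TypeError` above suggests that `local` still holds strings, including bare "?" values that have no leading whitespace and were therefore never replaced. One possible way forward, sketched here on a small hypothetical stand-in frame `df` rather than on `internetAd` itself, is to coerce the column with `pd.to_numeric(..., errors="coerce")` and then compute the three statistics.

```python
import pandas as pd

# Hypothetical stand-in: `local` is still text and contains a bare "?".
df = pd.DataFrame({
    "height": [125.0, None, 60.0, 60.0],
    "local": ["1", "?", "0", "1"],
})

# errors="coerce" turns every value that cannot be parsed as a number into NaN.
df["local"] = pd.to_numeric(df["local"], errors="coerce")

cols = ["height", "local"]
means = df[cols].mean()
medians = df[cols].median()
# mode() can return several rows when values tie; keep the first row only.
modes = df[cols].mode().iloc[0]

print(means, medians, modes, sep="\n\n")
```

All three methods skip NaN by default, so the statistics are computed over the observed values only.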

Question 2.2: Replace the NaN values in each of the columns ["height","width","aratio","local"] with the respective median value.
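
A minimal sketch of median imputation, assuming the four columns are already numeric and using a small hypothetical frame `df` in place of `internetAd`. Passing a Series of medians to `fillna` aligns on column names, so each column is filled with its own median.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the four columns after conversion to numeric.
df = pd.DataFrame({
    "height": [125.0, np.nan, 60.0, 33.0],
    "width": [125.0, 468.0, np.nan, 230.0],
    "aratio": [1.0, np.nan, 7.8, 6.9696],
    "local": [1.0, 0.0, np.nan, 1.0],
})

cols = ["height", "width", "aratio", "local"]

# median() skips NaN; fillna() then matches each median to its column by name.
df[cols] = df[cols].fillna(df[cols].median())

print(df.isna().sum())
```

Computing the medians on the full data set before splitting lets the test rows influence the imputed values. A stricter variant would compute them on the training split only.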

Question 3.1: Plot the distribution of each of ["height","width","aratio","local"]

In [None]:
import seaborn as sns
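
One way to draw the four distributions side by side is sketched below with plain matplotlib on synthetic data; the frame `df` and its values are made up for illustration. With seaborn imported as above, `sns.histplot(data=df, x=col, ax=ax)` could be used in place of `ax.hist`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data, not the real advertisement features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.integers(1, 640, size=200).astype(float),
    "width": rng.integers(1, 640, size=200).astype(float),
    "local": rng.integers(0, 2, size=200).astype(float),
})
df["aratio"] = df["width"] / df["height"]

cols = ["height", "width", "aratio", "local"]

# One histogram per column, laid out in a single row.
fig, axes = plt.subplots(1, len(cols), figsize=(16, 3))
for ax, col in zip(axes, cols):
    ax.hist(df[col].dropna(), bins=30)
    ax.set_title(col)
fig.tight_layout()
```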



Question 3.2: You have noticed the wide variation across the different features. As a result, let's normalize the features using the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) method. Before we do that, we need to split the data into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split

X = #fill this in
y = #fill this in

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
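
A hedged sketch of one way to fill in `X` and `y`, assuming the label lives in a column named `Target` and every other column is a feature. The frame `df` is synthetic and stands in for `internetAd`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for internetAd with a binary Target column.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height": rng.uniform(1, 640, size=100),
    "width": rng.uniform(1, 640, size=100),
    "local": rng.integers(0, 2, size=100),
    "Target": rng.integers(0, 2, size=100),
})

X = df.drop(columns="Target")  # all feature columns
y = df["Target"]               # the label to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```

If the classes are imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions similar in both splits.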



In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(#fill this in)
X_train_minmax_scaled = #fill this in
X_test_minmax_scaled = #fill this in
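
The usual pattern is to fit the scaler on the training split only and then apply the same transformation to both splits, so that no information from the test set reaches the scaler. A minimal sketch with tiny made-up arrays standing in for `X_train` and `X_test`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-ins for X_train and X_test (two feature columns).
X_train = np.array([[10.0, 100.0], [20.0, 300.0], [30.0, 500.0]])
X_test = np.array([[15.0, 200.0], [40.0, 600.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)  # learns the per-column min and max from training data only
X_train_minmax_scaled = scaler.transform(X_train)
X_test_minmax_scaled = scaler.transform(X_test)

print(X_train_minmax_scaled)
print(X_test_minmax_scaled)
```

Because the minimum and maximum come from the training data, scaled test values can fall outside the [0, 1] range, as the second test row does here.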

Question 3.3: Let's build another training set where the features are normalized using [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [None]:
from sklearn.preprocessing import StandardScaler

sscaler = StandardScaler()
# Use separate names so the min-max scaled data from Question 3.2 is not overwritten.
X_train_std_scaled = X_train.copy()
X_test_std_scaled = X_test.copy()

#fill this in
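
A minimal sketch of centering and scaling, again on made-up arrays. The names `X_train_std_scaled` and `X_test_std_scaled` are chosen here so that the min-max scaled arrays from Question 3.2 are not overwritten.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for X_train and X_test (two feature columns).
X_train = np.array([[10.0, 100.0], [20.0, 300.0], [30.0, 500.0]])
X_test = np.array([[15.0, 200.0], [40.0, 600.0]])

sscaler = StandardScaler()
# fit_transform learns the training mean and standard deviation, then applies them.
X_train_std_scaled = sscaler.fit_transform(X_train)
# The test split reuses the training statistics.
X_test_std_scaled = sscaler.transform(X_test)

print(X_train_std_scaled.mean(axis=0), X_train_std_scaled.std(axis=0))
```

After the transformation each training column has mean 0 and unit standard deviation. The test columns generally do not, because they are scaled with the training statistics.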

Question 4.1: Apply [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to the above minmax scaled dataset with class_weight='balanced', solver='saga', and max_iter=1000. Calculate Accuracy, Confusion Matrix, Precision, and Recall.


In [None]:
from sklearn.linear_model import LogisticRegression


In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
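
The sketch below shows one way to fit the model and compute the requested metrics. It runs on synthetic data from `make_classification`, which is an assumption made only to keep the example self-contained; in the notebook the scaled splits from Question 3.2 would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in data (not the advertisement features).
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

scaler = MinMaxScaler().fit(X_train)
X_train_minmax_scaled = scaler.transform(X_train)
X_test_minmax_scaled = scaler.transform(X_test)

clf = LogisticRegression(class_weight="balanced", solver="saga", max_iter=1000)
clf.fit(X_train_minmax_scaled, y_train)
y_pred = clf.predict(X_test_minmax_scaled)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
print(cm)
print(classification_report(y_test, y_pred))
```

With an imbalanced target, accuracy alone can look good for a model that rarely predicts the minority class, which is why precision and recall are reported alongside it.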


Question 4.2: Repeat Question 4.1, setting penalty to each of {'l1', 'l2', 'elasticnet'}.
- Set C=0.1 for l1.
- Set l1_ratio=0.5 for elasticnet.

In [None]:
#penalty='l1'

In [None]:
#penalty='elasticnet'
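
One possible way to run the three penalties under identical conditions is to loop over their settings and collect the test metrics in a single table. The sketch uses synthetic data; the `configs` and `results` names are illustrative choices, not part of the assignment.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the advertisement features).
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Penalty-specific settings taken from the question text.
configs = {
    "l1": {"penalty": "l1", "C": 0.1},
    "l2": {"penalty": "l2"},
    "elasticnet": {"penalty": "elasticnet", "l1_ratio": 0.5},
}

rows = {}
for name, params in configs.items():
    clf = LogisticRegression(class_weight="balanced", solver="saga",
                             max_iter=1000, **params)
    clf.fit(X_train_s, y_train)
    y_pred = clf.predict(X_test_s)
    rows[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }

# One row per penalty, one column per metric, all on the same hold-out split.
results = pd.DataFrame(rows).T
print(results)
```

Evaluating every model on the same hold-out split keeps the comparison fair, since differences in the table then come from the penalty settings rather than from the data.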


Question 5: How do the three models compare? How did you make the comparison?

Question 6: Repeat Questions 4.1-4.2 with the standard scaled dataset.