#Demo: **Predict Ad Click Using Logistic Regression**

Explore the data and use **Logistic Regression** to predict which users are more likely to click the ad, based on the features of the user.



###**Dataset Description**



The dataset contains the following attributes:

- **Daily Time Spent on Site** - Consumer time on site in minutes
- **Age** - Customer age in years
- **Area Income** - Average income of the consumer's geographical area
- **Daily Internet Usage** - Average minutes per day the consumer is on the internet
- **Ad Topic Line** - Headline of the advertisement
- **City** - City of consumer
- **Male** - Whether or not the consumer was male
- **Country** - Country of consumer
- **Timestamp** - Time at which the consumer clicked on the ad or closed the window
- **Clicked on Ad** - 0 or 1, where 1 indicates the consumer clicked on the ad

###**Tasks to be performed**

- Import Required Libraries and analyze the dataset
  - Check the shape of the dataset
  - Check and deal with the Null Values present in the dataset
  - Analyze the dataset using Pandas Profiling and Sweetviz
- Perform Data Visualization on the dataset Using **Plotly Express**
- Split the dataset using train_test_split from the sklearn library
- Build a Logistic Regression model and fit the model
- Evaluate the Model
  - Use a Confusion Matrix and write your observations
  - Check the accuracy_score
  - Print a classification report


###**Task 1:**
Import Required Libraries and analyze the dataset

- Check the shape of the dataset
- Check and deal with the Null Values present in the dataset
- Analyze the dataset using Pandas Profiling and Sweetviz

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
print('Libraries Imported')

Libraries Imported


In [None]:
# Downloading the dataset from Dropbox

!wget https://www.dropbox.com/s/3u5l88jokdaj1wu/advertising.csv

--2020-10-19 10:26:06--  https://www.dropbox.com/s/3u5l88jokdaj1wu/advertising.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.1, 2620:100:6018:1::a27d:301
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/3u5l88jokdaj1wu/advertising.csv [following]
--2020-10-19 10:26:06--  https://www.dropbox.com/s/raw/3u5l88jokdaj1wu/advertising.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc9aad5e1f7a222f53efe5ecc9e9.dl.dropboxusercontent.com/cd/0/inline/BBgyqrq3Bd7GKBt4mqcNeRzRTKmjzk3eppDB70dEaFk-T3-tRW5-da55QC5wJRwmkXH0fYg1PKM4TZ6HZ5r1TipkWkxRGWQeu830jQNioIdl9Jvh_iK2EHx4aK9cIDQSFo8/file# [following]
--2020-10-19 10:26:07--  https://uc9aad5e1f7a222f53efe5ecc9e9.dl.dropboxusercontent.com/cd/0/inline/BBgyqrq3Bd7GKBt4mqcNeRzRTKmjzk3eppDB70dEaFk-T3-tRW5-da55QC5wJRwmkXH0fYg1PKM4TZ6HZ5r1TipkWkxRGWQeu830jQNio

In [None]:
df = pd.read_csv('advertising.csv')
df.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0


**Analyzing the data using Pandas Profiling**

In [None]:
!pip install pandas-profiling==2.7.1 



In [None]:
#Generating a Pandas Profiling Report 

import pandas_profiling
from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file='output.html')

Refer to the generated HTML file, **output.html**, for the full report.

**Analyzing the data using Sweetviz**

**Sweetviz** is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. **Output** is a fully self-contained **HTML** application.

The system is built around quickly visualizing target values and comparing datasets. Its goal is to support quick analysis of target characteristics, training vs. testing data, and other such data characterization tasks.

**[Click Here!](https://pypi.org/project/sweetviz/)** to learn more about Sweetviz

In [None]:
#Installing Sweetviz

!pip install sweetviz



In [None]:
# Importing sweetviz
import sweetviz as sv

#Analyzing the dataset
report = sv.analyze(df)

#Display the report
report.show_html('Ad-Click.html')

Report Ad-Click.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


Refer to the generated HTML file, **Ad-Click.html**, for the full report.

In [None]:
print('Columns present in the dataset:\n')
for i in df.columns:
  print(i)

Columns present in the dataset:

Daily Time Spent on Site
Age
Area Income
Daily Internet Usage
Ad Topic Line
City
Male
Country
Timestamp
Clicked on Ad


**Check the Shape of the Dataset**

In [None]:
df.shape

(1000, 10)

**Check the Null Values present in the dataset**

In [None]:
df.isnull().sum()

Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64
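
The output above shows no missing values, so nothing needs to be handled for this dataset. For the "deal with" part of the task, the following is a minimal sketch of one common approach, assuming a made-up miniature frame (`demo`) with gaps. The choice of median imputation for a numeric column and row removal for a text column is an assumption for illustration, not a step this notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with gaps; the real advertising data has none.
demo = pd.DataFrame({
    "Age": [35, np.nan, 26, 29],
    "City": ["Wrightburgh", "West Jodi", None, "West Terrifurt"],
})

# Count the missing values per column, as done for df above.
null_counts = demo.isnull().sum()
print(null_counts)

cleaned = demo.copy()
# Numeric column: fill gaps with the column median.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
# Text column: drop rows where the value is missing.
cleaned = cleaned.dropna(subset=["City"])
print(cleaned)
```

Whether to impute or drop depends on how many values are missing and on how important the column is for the model.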

In [None]:
df.describe()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Clicked on Ad
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,65.0002,36.009,55000.00008,180.0001,0.481,0.5
std,15.853615,8.785562,13414.634022,43.902339,0.499889,0.50025
min,32.6,19.0,13996.5,104.78,0.0,0.0
25%,51.36,29.0,47031.8025,138.83,0.0,0.0
50%,68.215,35.0,57012.3,183.13,0.0,0.5
75%,78.5475,42.0,65470.635,218.7925,1.0,1.0
max,91.43,61.0,79484.8,269.96,1.0,1.0


###**Task 2:**
Perform Data Visualization on the dataset Using **Plotly Express**


In [None]:
fig = px.histogram(df, x="Age")

fig.show()

___
**Observations:**
- Most users in the dataset are between 20 and 40 years old
___

In [None]:
fig = px.bar(df, x='Age', y='Daily Internet Usage')
fig.show()

___
**Observations:**
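- The summed daily internet usage is highest for ages 20 to 40. Because each bar adds up the usage of every user of that age, this partly reflects that most users fall in this age range
- Older and very young users account for less total daily internet usage than the other age groups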
___
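
A bar chart of raw rows adds up the usage of every user of a given age, so tall bars can reflect many users rather than heavy usage. One way to separate the two is to compare the average usage per age band. The sketch below assumes a small made-up frame (`sample`) and arbitrary age bands; neither comes from the advertising data.

```python
import pandas as pd

# Made-up values for illustration only; not taken from advertising.csv.
sample = pd.DataFrame({
    "Age": [22, 25, 31, 38, 45, 52, 60],
    "Daily Internet Usage": [240.0, 230.0, 200.0, 190.0, 150.0, 140.0, 120.0],
})

# Assumed age bands: (18, 30], (30, 40], (40, 50], (50, 65].
bands = pd.cut(sample["Age"], bins=[18, 30, 40, 50, 65])

# Mean usage per band removes the effect of how many users fall in each band.
mean_usage = sample.groupby(bands, observed=True)["Daily Internet Usage"].mean()
print(mean_usage)
```

Applying the same grouping to `df` would show whether usage per user, and not only the number of users, differs across age groups.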

In [None]:
fig = px.scatter(x= df['Daily Time Spent on Site'], y= df.Age, labels={'x':'Daily Time Spent on Site', 'y':'Age'})
fig.show()

___
**Observations:**
- People in their twenties and thirties spend the most time on the site
- Older people tend to spend less time on the site than younger people
___

###**Task 3:**
Split the dataset using **train_test_split** from the **sklearn** library 

In [None]:
from sklearn.model_selection import train_test_split

X = df[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage',  'Male']]
y = df['Clicked on Ad']

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size =  0.4, random_state= 42)

print('Splitting the Dataset Completed')

Splitting the Dataset Completed
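
With 1000 rows and `test_size=0.4`, the split yields 600 training rows and 400 test rows. The sketch below checks this on placeholder arrays (`X_demo`, `y_demo`, both hypothetical) and also passes `stratify`, an optional argument that the notebook's split does not use, to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count and class balance as the dataset.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([0, 1] * 500)

# stratify=y_demo is an addition for illustration; the notebook omits it.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))   # rows in the training and test parts
print(np.bincount(y_te))      # class counts in the test part
```

Without `stratify`, the class counts in the test part can drift slightly from the overall ratio.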


###**Task 4:**
Build a **Logistic Regression** model and fit the model

In [None]:
from sklearn.linear_model import LogisticRegression

# Creating a Logistic Regression Object

model = LogisticRegression()

# Fitting the Model 

model.fit(train_X, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
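
The features here span very different scales (Age is in tens, Area Income in tens of thousands), which can slow the convergence of the default `lbfgs` solver. A common option is to standardize the features inside a pipeline. The sketch below is not the notebook's method: it uses synthetic data and a made-up labelling rule, and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two features on very different scales.
rng = np.random.default_rng(0)
n = 1000
time_on_site = rng.normal(65, 15, n)
area_income = rng.normal(55000, 13000, n)
X_demo = np.column_stack([time_on_site, area_income])

# Made-up labelling rule, used only so the example has something to learn.
y_demo = ((time_on_site - 65) / 15 + (area_income - 55000) / 13000 < 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

# The scaler is fitted on the training part only, then applied before the model.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_tr, y_tr)

demo_accuracy = scaled_model.score(X_te, y_te)
print(f"Accuracy on synthetic test data: {demo_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the test data never influences the scaling parameters.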

In [None]:
prediction = model.predict(test_X)

###**Task 5:**
Evaluate the Model

- Use a Confusion Matrix and write your observations
- Check the **accuracy_score**
- Print a classification report

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(test_y, prediction)

array([[181,  10],
       [ 32, 177]])

___
**Observations:**

**Confusion Matrix:**

**Each row:** Actual Class

**Each column:** Predicted Class

**First row:** Non-clicked Ads, the **negative class**:

- 181 were correctly classified as Non-clicked Ads. True negatives
- The remaining 10 were wrongly classified as clicked Ads. False positives

**Second row:** The clicked Ads, the **positive class**:

- 32 were incorrectly classified as Non-clicked Ads. False negatives
- 177 were correctly classified as clicked Ads. True positives
___
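
To read the four counts without mixing up rows and columns, the matrix can be flattened row by row. The sketch below copies the array from the output above; for binary labels 0 and 1, scikit-learn orders the flattened values as TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class).
cm = np.array([[181, 10],
               [32, 177]])

# Flattening row by row gives TN, FP, FN, TP for labels 0 and 1.
tn, fp, fn, tp = cm.ravel()

print(f"True negatives:  {tn}")
print(f"False positives: {fp}")
print(f"False negatives: {fn}")
print(f"True positives:  {tp}")
```

The row sums, 191 and 209, are the numbers of actual non-clicked and clicked ads in the test set.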

In [None]:
from sklearn.metrics import accuracy_score

print(f"ACCURACY SCORE:\n{accuracy_score(test_y, prediction):.4f}")

ACCURACY SCORE:
0.8950


**Accuracy** is the fraction of predictions our model got right. Formally, accuracy has the following definition:

**Accuracy** = Number of Correct Predictions **/** Total Number of Predictions
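
As a quick check, the same value can be reproduced by hand from the confusion matrix counts shown earlier. The variable names below are chosen for illustration.

```python
# Counts taken from the confusion matrix output:
# [[181, 10], [32, 177]] with rows as actual and columns as predicted classes.
tn, fp, fn, tp = 181, 10, 32, 177

correct = tn + tp              # predictions on the diagonal
total = tn + fp + fn + tp      # all test examples
accuracy = correct / total

print(f"{correct} correct out of {total}: accuracy = {accuracy:.4f}")
```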



___
**Observations:**
- Accuracy comes out to 0.895, or 89.5% (358 correct predictions out of 400 test examples)
___

In [None]:
from sklearn import metrics

print(metrics.classification_report(test_y, prediction))

              precision    recall  f1-score   support

           0       0.85      0.95      0.90       191
           1       0.95      0.85      0.89       209

    accuracy                           0.90       400
   macro avg       0.90      0.90      0.89       400
weighted avg       0.90      0.90      0.89       400



___
- **Precision**: Ability of a classifier not to label as positive an instance that is actually negative. For each class, it is defined as the ratio of true positives (TP) to the sum of true positives (TP) and false positives (FP).

   Precision = TP/(TP + FP)

- **Recall**: Ability of a classifier to find all positive instances. For each class, it is defined as the ratio of true positives (TPs) to the sum of true positives (TPs) and false negatives (FNs).

   Recall = TP/(TP+FN)

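- **f1-score**: The harmonic mean of **precision** and **recall**, such that the best score is **1.0** and the worst is **0.0**. Because it combines precision and recall for a single class, the **F1 score** need not equal the accuracy.

   F1 = 2 * (Precision * Recall) / (Precision + Recall)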

- **Support**: Support is the number of actual occurrences of the class in the specified dataset
___
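
The definitions above can be checked against the classification report by computing each metric from the confusion matrix counts. The helper `precision_recall_f1` below is a hypothetical name introduced for this sketch, which uses only the counts printed earlier.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the confusion matrix [[181, 10], [32, 177]].
tn, fp, fn, tp = 181, 10, 32, 177

# Class 1 (clicked) as the positive class.
p1, r1, f1_1 = precision_recall_f1(tp=tp, fp=fp, fn=fn)

# Class 0 (non-clicked) as the positive class: the roles swap, so its
# "true positives" are tn, its "false positives" are fn, and so on.
p0, r0, f1_0 = precision_recall_f1(tp=tn, fp=fn, fn=fp)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
```

The rounded values match the per-class rows of the report, and the supports 191 and 209 are the row sums of the confusion matrix.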