## Advertising Ad Campaign



## 1. Introduction

The task is to predict who is most likely to click on the ad. Suppose that we are working for a marketing company. First, we have to understand what constitutes a profit and a loss.

Let's assume that we run a marketing campaign on which we spend 1000USD per potential customer. For each customer that we target with our ad campaign and who clicks on the ad, we get an overall profit of 100USD. However, if we target a customer who ends up not clicking on the ad, we incur a net loss of 1050USD. A customer who was not targeted by the campaign but clicks on the ad anyway brings in the same revenue without the 1000USD spend, so we get an overall profit of 1100USD. Unfortunately, we have no information about the advertised product; this information could have helped us understand the user behavior.
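
The payoffs described above can be written as a small lookup table, which makes it easy to score any targeting decision. This is only a sketch: the names `PAYOFF` and `campaign_profit` are hypothetical, and the payoff of 0USD for a customer who is neither targeted nor clicks is an assumption, since the text does not state it.

```python
# Payoff in USD per customer, keyed by (targeted, clicked).
# The first three values come from the text above.
PAYOFF = {
    (1, 1): 100,    # targeted and clicked
    (1, 0): -1050,  # targeted and did not click
    (0, 1): 1100,   # not targeted but clicked anyway
    (0, 0): 0,      # ASSUMPTION: not stated in the text
}


def campaign_profit(targeted, clicked):
    """Sum the per-customer payoff over two equal-length 0/1 sequences."""
    return sum(PAYOFF[(int(t), int(c))] for t, c in zip(targeted, clicked))


# Four hypothetical customers, one for each combination.
example_profit = campaign_profit(targeted=[1, 1, 0, 0], clicked=[1, 0, 1, 0])
print(example_profit)
```

A function like this could later be used to compare a model's targeting decisions by profit rather than by accuracy alone.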

In [None]:
# Importing Libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


# Defining categorical, numerical, and datetime variables that we will use later
categorical_vars = ["Ad Topic Line", "City", "Country"]
numerical_vars = ["Daily Time Spent on Site", "Area Income", "Daily Internet Usage", "Male", "log_age"]
datetime_vars = "Timestamp"
target = "Clicked on Ad"

## 2. Explore the dataset
### 2.1 Explore the data

In [None]:
advertising_df = pd.read_csv('advertising_dsdj.csv')
advertising_df.head()

Let's look at the main characteristics of our dataset, such as the number of observations, the types of the variables, the summary statistics for each variable, and the number of missing values.

In [None]:
advertising_df.info()

In [None]:
advertising_df.isnull().any()

In [None]:
# The target variable contains null values, so we drop the affected rows.
# Note that dropna(axis=0) drops every row that has a null in any column.
advertising_df = advertising_df.dropna(axis=0)
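
To see how much data each option would discard, it can help to count the missing values per column and to compare dropping on any column with dropping on the target only. The sketch below uses a toy frame with hypothetical values, not the real dataset; `toy_df` and the other variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) reusing two column names from the dataset.
toy_df = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Clicked on Ad": [1, 0, np.nan, 1],
})

# Count missing values per column instead of only flagging their presence.
missing_per_column = toy_df.isnull().sum()

# dropna(axis=0) drops a row if ANY column is missing.
dropped_any = toy_df.dropna(axis=0)

# dropna(subset=...) drops a row only if one of the listed columns is missing.
dropped_target_only = toy_df.dropna(subset=["Clicked on Ad"])

print(missing_per_column)
print(len(dropped_any), len(dropped_target_only))
```

Dropping on the target only keeps rows whose features could still be imputed, at the cost of handling those missing features later.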

In [None]:
# Assessing if duplicated records are found in the dataset
print("The number of duplicated records in the dataset:", advertising_df.duplicated().sum())

In [None]:
# Removing the duplicated rows from the dataset
advertising_df = advertising_df.drop_duplicates()

## 3. Exploratory Data Analysis
### 3.1 Describe Features

In [None]:
# Check for class imbalance
click_rate = advertising_df['Clicked on Ad'].value_counts()

In [None]:
click_rate

As we can see, this is a balanced dataset, so class imbalance is not an issue.
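
Raw counts can be harder to compare than proportions. As a small sketch on toy data (the values are hypothetical), `value_counts(normalize=True)` returns the share of each class directly:

```python
import pandas as pd

# Toy target column (hypothetical values), named like the real one.
toy_clicks = pd.Series([1, 0, 1, 0, 1, 0], name="Clicked on Ad")

# normalize=True returns fractions that sum to 1 instead of raw counts.
class_share = toy_clicks.value_counts(normalize=True)
print(class_share)
```

Shares close to 0.5 for both classes indicate a balanced binary target.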

In [None]:
# Descriptive Statistics 
advertising_df.describe()

As we can see, the mean and the median are fairly close, which suggests that there is little skewness in the data. Therefore we do not need to transform the data.
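
Comparing the mean and the median by eye can be backed up with the sample skewness, which pandas exposes as `DataFrame.skew()`. The sketch below uses two toy columns with hypothetical values to show what a symmetric and a right-skewed column look like; the column names are illustrative.

```python
import pandas as pd

# Toy columns (hypothetical values): one symmetric, one with a long right tail.
toy_df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 50],
})

# Put mean, median, and skewness side by side, one row per column.
summary = pd.DataFrame({
    "mean": toy_df.mean(),
    "median": toy_df.median(),
    "skew": toy_df.skew(),
})
print(summary)
```

A skewness near zero together with a mean close to the median supports leaving a column untransformed; a large positive value points to a long right tail or to outliers.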

The age column stands out, however: its minimum value is negative and its maximum is 999. Let's investigate.

In [None]:
# Plot the sorted ages against their rank to spot implausible values.
sorted_age_arr = sorted(advertising_df['Age'])
idx = list(range(len(sorted_age_arr)))

x = idx
y = sorted_age_arr

plt.figure(figsize = (15,12))
plt.scatter(x, y, s=10)
plt.axhline(y=0, linestyle='--', color='r')
plt.axhline(y=100, linestyle='--', color='r')

In [None]:
# Show the rows whose age falls outside a plausible range (below 18 or above 100).
advertising_df[(advertising_df['Age'] > 100) | (advertising_df['Age'] < 18)]

In [None]:
# Remove extreme age values: keep only rows where both bounds hold (18 <= Age <= 100)
advertising_df = advertising_df[(advertising_df['Age'] >= 18) & (advertising_df['Age'] <= 100)]
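
A range filter needs both bounds to hold at once, so the two conditions have to be combined with `&`. Combined with `|`, every non-null age satisfies at least one of them and nothing is removed. The sketch below checks this on a toy column with hypothetical values, assuming an inclusive range of 18 to 100, and shows `Series.between` as an equivalent, more compact form.

```python
import pandas as pd

# Toy ages (hypothetical values), including implausible ones at both ends.
toy_df = pd.DataFrame({"Age": [-5, 17, 18, 45, 100, 999]})

# Both bounds must hold: logical AND.
in_range_and = (toy_df["Age"] >= 18) & (toy_df["Age"] <= 100)

# Series.between is inclusive on both ends by default.
in_range_between = toy_df["Age"].between(18, 100)

# Logical OR is true for every value here, so it filters nothing.
in_range_or = (toy_df["Age"] >= 18) | (toy_df["Age"] <= 100)

kept_ages = toy_df.loc[in_range_and, "Age"].tolist()
print(kept_ages)
print(in_range_or.all())
```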

In [None]:
advertising_df.head()

In [None]:
# Let's check whether 'Daily Time Spent on Site'
# is actually smaller than or equal to 'Daily Internet Usage'

advertising_df['delta'] = advertising_df['Daily Internet Usage'] - advertising_df['Daily Time Spent on Site']
sum(advertising_df['delta'] < 0)

In [None]:
# Removing rows with a delta smaller than zero
advertising_df = advertising_df[advertising_df['delta'] >= 0]

# I'll remove the column that I just created, but you could definitely keep it :-) 
advertising_df = advertising_df.drop('delta', axis=1)