# Avazu User Click-Through Rate (CTR) prediction for ads 

1. create a subset of the data if it is huge and load it efficiently
2. import data and check for NaNs
3. explore data: explore all columns and various relations
4. check for multi-collinearity
5. perform correlation analysis
6. check for 


In [205]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# init_notebook_mode(connected=True)
config={'showLink': True, 'displayModeBar': True}
from IPython.display import IFrame
from IPython.display import HTML, display

## Importing Data

We have ad data provided by Avazu, taken from the Kaggle competition [Click-Through Rate Prediction](https://www.kaggle.com/c/avazu-ctr-prediction/data). We use pandas to import a subset of this very large dataset: the original data is ~7GB in size, so we have taken the first 200000 rows for our analysis.
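One way to take such a subset without loading the whole file is to let pandas stop reading after a fixed number of rows. The sketch below is only an illustration, not the code that produced `data/train_subset.csv`: the in-memory CSV stands in for the real file, and `read_head` is a hypothetical helper name.

```python
import io
import pandas as pd

def read_head(source, n_rows):
    """Read only the first n_rows data rows of a CSV (hypothetical helper)."""
    # nrows makes the parser stop early instead of loading the full file
    return pd.read_csv(source, nrows=n_rows)

# Tiny stand-in for the real training file (assumption: similar header)
fake_csv = io.StringIO(
    "id,click,hour\n"
    "10015140740686523448,0,14102100\n"
    "10070328440095985756,1,14102100\n"
    "10093977800236804132,1,14102101\n"
)
subset = read_head(fake_csv, n_rows=2)
print(subset.shape)
```

For the real data, `source` would be the path to the full training file and `n_rows` would be 200000.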

In [196]:
df = pd.read_csv("data/train_subset.csv")

In [197]:
df.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,10015140740686523448,0,2014-10-21 00:00:00,1005,0,85f751fd,c4e18dd6,50e219e0,c51f82bc,d9b5648e,...,1,0,21611,320,50,2480,3,297,100111,61
1,10070328440095985756,1,2014-10-21 00:00:00,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15701,320,50,1722,0,35,100084,79
2,10093977800236804132,1,2014-10-21 00:00:00,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,-1,79
3,10104245282042838695,0,2014-10-21 00:00:00,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15701,320,50,1722,0,35,100084,79
4,10105971003478261107,0,2014-10-21 00:00:00,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15701,320,50,1722,0,35,-1,79


Given the above data, it is necessary that we understand what each column means.

1. id: ad identifier
2. click: 0/1 for non-click/click (this is the target variable; it indicates whether the ad was clicked)
3. hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC. 
4. C1: anonymized categorical variable
5. banner_pos: position where the ad is displayed
6. site_id: ID of the website
7. site_domain: domain where the site is hosted
8. site_category: category to which the site belongs
9. app_id: application ID
10. app_domain: application domain
11. app_category: application category
12. device_id: ID of the device on which the ad was shown
13. device_ip: network IP to which the device was connected when the ad was shown (eg: 192.145.86.35)
14. device_model: model of the device on which the ad was shown
15. device_type: type of device on which the ad was shown (eg: laptop, desktop, mobile)
16. device_conn_type: connection type of the device (LAN, wifi, etc)
17. C14-C21: anonymized categorical variables (although they are anonymous, C15 and C16 seem to give the dimensions of the ad on the page in pixels)
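The `YYMMDDHH` format described for **hour** can be parsed with an explicit format string. This is a minimal sketch for the raw Kaggle encoding; the subset loaded above already appears to hold full timestamps, so it may not be needed here. The sample values are illustrative.

```python
import pandas as pd

# Raw values in the YYMMDDHH encoding; cast to str in case they were read as integers
raw_hours = pd.Series([14091123, 14102100]).astype(str)

# %y = two-digit year, %m = month, %d = day, %H = hour on a 24h clock
parsed = pd.to_datetime(raw_hours, format="%y%m%d%H")
print(parsed.iloc[0])  # 2014-09-11 23:00:00
```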

In [198]:
df.shape

(200000, 24)

Our data has 200000 rows / records and 24 features / columns. 

In [199]:
df.dtypes

id                  uint64
click                int64
hour                object
C1                   int64
banner_pos           int64
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type          int64
device_conn_type     int64
C14                  int64
C15                  int64
C16                  int64
C17                  int64
C18                  int64
C19                  int64
C20                  int64
C21                  int64
dtype: object

Our features can be broadly classified into the following categories:

1. **site features:** site_id, site_category, site_domain
2. **app features:** app_id, app_domain, app_category
3. **device features:** device_id, device_ip, device_model, device_type, device_conn_type
4. **anonymized categorical features:** C1 & C14-C21
5. **other features:** hour, banner_pos
6. **target variable:** click

In [200]:
df["hour"] = pd.to_datetime(df["hour"])

## Exploring and Preprocessing Data with Feature Engineering

### 1. Checking for NaNs and missing values

In [201]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 24 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                200000 non-null  uint64        
 1   click             200000 non-null  int64         
 2   hour              200000 non-null  datetime64[ns]
 3   C1                200000 non-null  int64         
 4   banner_pos        200000 non-null  int64         
 5   site_id           200000 non-null  object        
 6   site_domain       200000 non-null  object        
 7   site_category     200000 non-null  object        
 8   app_id            200000 non-null  object        
 9   app_domain        200000 non-null  object        
 10  app_category      200000 non-null  object        
 11  device_id         200000 non-null  object        
 12  device_ip         200000 non-null  object        
 13  device_model      200000 non-null  object        
 14  devi

The first thing we usually check is whether there are NaN values. Our data doesn't seem to have any null values: every column listed has 200000 non-null entries. If we had null values, we could replace them one column at a time, using the mean for numeric columns (or the most frequent value for categorical ones), or infer them from other related columns.
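The replacement strategy described above could be sketched as follows. This is an assumption-laden illustration rather than part of the analysis (our data has no NaNs): `fill_missing` is a hypothetical helper, the toy column `width` is made up, and the function assumes every column has at least one non-null value.

```python
import numpy as np
import pandas as pd

def fill_missing(frame):
    """Return a copy with NaNs filled: mean for numeric columns, mode otherwise."""
    filled = frame.copy()
    for column in filled.columns:
        if not filled[column].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].mean())
        else:
            # mode() returns the most frequent value(s); take the first one
            filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    return filled

# Toy data with one missing value per column (illustrative only)
toy = pd.DataFrame({
    "width": [320.0, np.nan, 300.0],
    "site_category": ["50e219e0", None, "50e219e0"],
})
filled = fill_missing(toy)
print(filled)
```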

### 2. Identify unique values for columns

In [202]:
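def count_unique(d, columns):
    """Print the number of distinct values in each of the given columns."""
    for column in columns:
        # dropna=False counts NaN as a value, matching len(d[column].unique())
        print("Number of Unique values in column {} is {}".format(column, d[column].nunique(dropna=False)))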

In [203]:
columns = list(df.columns)
count_unique(df, columns)

Number of Unique values in column id is 200000
Number of Unique values in column click is 2
Number of Unique values in column hour is 240
Number of Unique values in column C1 is 7
Number of Unique values in column banner_pos is 7
Number of Unique values in column site_id is 1788
Number of Unique values in column site_domain is 1707
Number of Unique values in column site_category is 21
Number of Unique values in column app_id is 1732
Number of Unique values in column app_domain is 117
Number of Unique values in column app_category is 22
Number of Unique values in column device_id is 33169
Number of Unique values in column device_ip is 143819
Number of Unique values in column device_model is 3756
Number of Unique values in column device_type is 4
Number of Unique values in column device_conn_type is 4
Number of Unique values in column C14 is 1884
Number of Unique values in column C15 is 8
Number of Unique values in column C16 is 9
Number of Unique values in column C17 is 404
Number of Un

### 3. Exploring Click-Through Rate (CTR)

Here, we would like to know the Click-Through Rate (CTR) of our dataset. CTR is defined as the ratio of the number of users who click an ad on a particular page to the total number of users who visit that page. A higher CTR indicates that many users were interested in the ads we hosted on a particular website / page.
Below we observe how many people in our data actually clicked the ad. We then calculate the CTR to estimate how well our ads are performing.

In [209]:
fig = px.histogram(df, x="click")
fig.update_layout(title="Click histogram")
# Save a standalone HTML copy of the figure, then render it inline
fig.write_html("plots/click_histogram.html")
fig.show()
# IFrame("plots/click_histogram.html", width=1000, height=600)

In [188]:
df["click"].value_counts()

0    165748
1     34252
Name: click, dtype: int64

In [189]:
CTR = len(df[df["click"] == 1]) / len(df)
print("Click-Through Rate (CTR): {}".format(str(CTR)))

Click-Through Rate (CTR): 0.17126


From the above histogram and statistic, we can observe that 165748 impressions did not lead to a click, while a much smaller 34252 did. Our CTR is **~17%**. This means that about **83%** of the time the ad is not clicked at all! 
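Because **click** is a 0/1 column, the CTR is simply its mean. The sketch below checks this on a toy series (illustrative values) and re-derives the dataset CTR from the two counts printed above.

```python
import pandas as pd

# Toy 0/1 click column: 2 clicks out of 10 impressions
clicks = pd.Series([0, 1, 0, 0, 1, 0, 0, 0, 0, 0], name="click")

ctr_ratio = (clicks == 1).sum() / len(clicks)  # clicks / impressions
ctr_mean = clicks.mean()                       # same quantity, as a mean
print(ctr_ratio, ctr_mean)

# Counts taken from the value_counts() output above
dataset_ctr = 34252 / (165748 + 34252)
print(dataset_ctr)
```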

### 4. Effects of Days and Time on Clicks

Here we explore our datetime feature **hour** and observe how CTR varies by day and by time of day. This matters because it tells us which ads to run and at what time of the day or week the traffic is most likely to be profitable. For example, online shopping is known to surge on **Black Friday** and **Cyber Monday**, which fall in late November (both lie outside the 21-30 October 2014 window covered by this data). Likewise, since most people work during the day, we might assume that more people visit websites or click ads in the evening, once official business hours end at around **5-6 pm**. Weekends and geographic location can also have a significant impact on CTR, as can events such as **election periods** and **festivals**, when we might expect people to click on ads more. Hence, a **datetime**-based analysis is needed to understand the CTR trend.

#### 4.1 Days vs Clicks per hour

In [13]:
df["hour"].describe()

count                  200000
unique                    240
top       2014-10-22 09:00:00
freq                     2238
first     2014-10-21 00:00:00
last      2014-10-30 23:00:00
Name: hour, dtype: object

In [14]:
click_day = df.groupby('hour').agg({'click':'sum'}).reset_index()
click_day.head()

Unnamed: 0,hour,click
0,2014-10-21 00:00:00,101
1,2014-10-21 01:00:00,133
2,2014-10-21 02:00:00,156
3,2014-10-21 03:00:00,189
4,2014-10-21 04:00:00,182


In [15]:
fig = go.Figure(go.Scatter(name="clicks/day",
    x = click_day['hour'],
    y = click_day['click'],
    hovertemplate='Date: %{x|%d %B %Y} <br>Time: %{x|%H:%M:%S} <br>Day: %{x|%A} <br>Clicks: %{y}'
))

fig.update_layout(
    title = 'Trend of clicks grouped by day for all hours',
    xaxis_tickformat = '%d %B <br>%Y',
    xaxis_title = "Hourly clicks for 10 days",
    yaxis_title = "Number of Clicks"
)

fig.show()

Above we have a plot of the number of clicks made every hour over the 10 days of data, from **21st October 2014** to **30th October 2014**. We see peaks on the **22nd** and **28th** of October somewhere around mid-day, and a surprising dip on the night of the **25th** of October. Apart from these 3 outliers (two peaks and one dip), the hourly click count looks fairly stationary, and the daily pattern is almost the same for the rest of the days.

#### 4.2 Hours vs Clicks 

Earlier we plotted clicks per hour across the days. Now we would like to see how many clicks were made at each hour of the day, summed over all the days. That is, we sum the clicks made in the first hour of every day, the second hour of every day, and so on. Our X-axis will be the 24 hours of the day. This gives us the trend of how clicks vary over the course of a day. We perform feature engineering to achieve our plot.

In [16]:
# Vectorised formatting of the timestamp as HH:MM (e.g. "00:00")
df['hour_of_day'] = df["hour"].dt.strftime("%H:%M")
click_hr = df.groupby('hour_of_day').agg({'click':'sum'}).reset_index()
click_hr.head()

Unnamed: 0,hour_of_day,click
0,00:00,738
1,01:00,948
2,02:00,1063
3,03:00,1228
4,04:00,1463


In [17]:
fig = go.Figure(go.Scatter(name="clicks/hr",
    x = click_hr['hour_of_day'],
    y = click_hr['click'],
    hovertemplate='Time: %{x} <br>Clicks: %{y}'
))

fig.update_layout(
    title = 'Trend of clicks grouped by hours for all the days',
    xaxis_title = "Hours",
    yaxis_title = "Number of Clicks"
)

fig.show()

From the above trend, the most clicks are made between **12:00 pm** and **2:00 pm**. Fewer clicks are made in the early and late hours of the day. This suggests that people are more active during business hours than at the start or end of the day, which matches the mid-day peaks in the earlier chart but not our initial assumption that activity would be highest after business hours. Note that the timestamps are in UTC, so the users' local time may differ.

#### 4.3 Hourly Impressions based on clicks

**Impressions** are counted when an ad is rendered on a user's screen or on any other digital media platform. Impressions are not action-based: they are defined merely by a user potentially seeing the advertisement. Hence, it doesn't matter whether someone clicked the ad or not; an **impression** only records that the ad was shown to a person, with or without any action taken on it.

In [18]:
df.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21,hour_of_day
0,10015140740686523448,0,2014-10-21,1005,0,85f751fd,c4e18dd6,50e219e0,c51f82bc,d9b5648e,...,0,21611,320,50,2480,3,297,100111,61,00:00
1,10070328440095985756,1,2014-10-21,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,15701,320,50,1722,0,35,100084,79,00:00
2,10093977800236804132,1,2014-10-21,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,15704,320,50,1722,0,35,-1,79,00:00
3,10104245282042838695,0,2014-10-21,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,15701,320,50,1722,0,35,100084,79,00:00
4,10105971003478261107,0,2014-10-21,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,15701,320,50,1722,0,35,-1,79,00:00


We group our data first by hour and then by click, which gives us a multi-level grouping. However, we would like to bring all the data to one level, so we unstack() it and plot a graph comparing clicks and non-clicks for every hour across all the days.
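To make the grouping and unstacking step concrete, here is a minimal sketch on a toy frame (the five rows are illustrative, not taken from the dataset). The `fill_value=0` argument is an addition that guards against hours with no clicks at all.

```python
import pandas as pd

toy = pd.DataFrame({
    "hour_of_day": ["00:00", "00:00", "00:00", "01:00", "01:00"],
    "click":       [0,       0,       1,       0,       1],
})

# size() counts rows per (hour_of_day, click) pair -> Series with a two-level index
counts = toy.groupby(["hour_of_day", "click"]).size()

# unstack() pivots the inner level (click) into the columns 0 and 1
wide = counts.unstack(fill_value=0).reset_index()
print(wide)
```

Each row of `wide` now holds one hour, with the non-click count in column `0` and the click count in column `1`.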

In [19]:
impressions = df.groupby(['hour_of_day', 'click']).size().unstack().reset_index()
impressions.head()

click,hour_of_day,0,1
0,00:00,3538,738
1,01:00,3952,948
2,02:00,5003,1063
3,03:00,5612,1228
4,04:00,7997,1463


In [20]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=impressions["hour_of_day"], y=impressions[1], 
           hovertemplate='Time: %{x} <br>Clicks: %{y}', marker_color='rgb(55, 83, 109)'),
    go.Bar(name='Not Clicked', x=impressions["hour_of_day"], y=impressions[0], 
           hovertemplate='Time: %{x} <br>Non-clicks: %{y}', marker_color='rgb(26, 118, 255)')
])
# Change the bar mode
fig.update_layout(
    title = 'Hourly Impressions based on Clicks',
    xaxis_title = "Hour of the day",
    yaxis_title = "Impressions / hr",
    barmode='group',
    )
fig.show()

The above figure shows hourly impressions: for every hour, a large number of people saw the ads, but only a fraction of them actually clicked and were forwarded to a landing page.

#### 4.4 Hourly Click-Through Rate (CTR)

Earlier we saw hourly and daily clicks made on our ads. Now we would like to observe the hourly Click-Through Rate (CTR). **Click-Through Rate** is the number of times the ad was clicked divided by the total impressions. We count how many times the ad was clicked in an hour and divide it by the total impressions of that hour, which gives us the hourly Click-Through Rate (expressed below as a percentage). 

In [21]:
# Count impressions (all rows) and clicks (sum of the 0/1 column) per hour in one pass,
# so both columns stay aligned on hour_of_day even if some hour has no clicks
hourly_ctr = df.groupby("hour_of_day")["click"].agg(impressions="count", clicks="sum").reset_index()
hourly_ctr["CTR"] = hourly_ctr["clicks"] / hourly_ctr["impressions"] * 100
hourly_ctr.head()

Unnamed: 0,hour_of_day,impressions,clicks,CTR
0,00:00,4276,738,17.259121
1,01:00,4900,948,19.346939
2,02:00,6066,1063,17.523904
3,03:00,6840,1228,17.953216
4,04:00,9460,1463,15.465116


In [39]:
fig = px.bar(hourly_ctr, x='hour_of_day', y='CTR',
             labels={"hour_of_day": "Time"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'Hourly CTR',
    xaxis_title = "Hour of the day",
    yaxis_title = "Click-Through Rate (CTR)",
    )
fig.show()

Contrary to the raw click counts, which were highest around the middle of the day, the CTR values show that, relative to impressions, users click most often shortly after midnight at around **1:00 am**, with the second-highest peaks at **3:00 pm - 4:00 pm** in the afternoon. Taken separately, both impressions and clicks are lower after midnight than at other times of the day; taken together, however, they give a comparatively high CTR in the early hours, which is an interesting trend.

#### 4.5 Daily CTR 

Now that we know the CTR trend for every hour of the day, let's observe it for every day of the week. We will observe 3 things:

1. Number of clicks made on each day, i.e. the trend of clicks per day
2. Number of impressions we had on each day, for both the click and no-click cases
3. Daily CTR, to observe how the trend is for each day

In [23]:
df.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21,hour_of_day
0,10015140740686523448,0,2014-10-21,1005,0,85f751fd,c4e18dd6,50e219e0,c51f82bc,d9b5648e,...,0,21611,320,50,2480,3,297,100111,61,00:00
1,10070328440095985756,1,2014-10-21,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,15701,320,50,1722,0,35,100084,79,00:00
2,10093977800236804132,1,2014-10-21,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,15704,320,50,1722,0,35,-1,79,00:00
3,10104245282042838695,0,2014-10-21,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,15701,320,50,1722,0,35,100084,79,00:00
4,10105971003478261107,0,2014-10-21,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,15701,320,50,1722,0,35,-1,79,00:00


In [24]:
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df["day_of_week"] = df["hour"].dt.day_name()
click_days = df.groupby("day_of_week").agg({"click": "sum"}).reset_index()
click_days['day_of_week'] = pd.Categorical(click_days['day_of_week'], categories=days, ordered=True)
click_days = click_days.sort_values('day_of_week')
click_days.head(7)

Unnamed: 0,day_of_week,click
1,Monday,2922
5,Tuesday,7693
6,Wednesday,7249
4,Thursday,6967
0,Friday,2873
2,Saturday,3079
3,Sunday,3469


In [25]:
fig = go.Figure(go.Scatter(name="clicks/day",
    x = click_days['day_of_week'],
    y = click_days['click'],
    hovertemplate='Day: %{x} <br>Clicks: %{y}',
    marker_color="darkolivegreen"
))

fig.update_layout(
    title = 'Trend of clicks grouped by days',
    xaxis_title = "Day of Week",
    yaxis_title = "Number of Clicks"
)

fig.show()

In [26]:
impressions = df.groupby(['day_of_week', 'click']).size().unstack().reset_index()
impressions['day_of_week'] = pd.Categorical(impressions['day_of_week'], categories=days, ordered=True)
impressions = impressions.sort_values('day_of_week')
impressions.head(7)

click,day_of_week,0,1
1,Monday,13083,2922
5,Tuesday,38951,7693
6,Wednesday,37999,7249
4,Thursday,32918,6967
0,Friday,13579,2873
2,Saturday,13605,3079
3,Sunday,15613,3469


In [27]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=impressions["day_of_week"], y=impressions[1], 
           hovertemplate='Day: %{x} <br>Clicks: %{y}', marker_color='indianred'),
    go.Bar(name='Not Clicked', x=impressions["day_of_week"], y=impressions[0], 
           hovertemplate='Day: %{x} <br>Non-clicks: %{y}', marker_color='lightsalmon')
])
# Change the bar mode
fig.update_layout(
    title = 'Daily Impressions based on Clicks',
    xaxis_title = "Day of week",
    yaxis_title = "Impressions / day",
    barmode='group',
    )
fig.show()

In [28]:
# Count impressions and clicks per weekday in one pass so the two columns stay aligned
daily_ctr = df.groupby("day_of_week")["click"].agg(impressions="count", clicks="sum").reset_index()
daily_ctr["CTR"] = daily_ctr["clicks"] / daily_ctr["impressions"] * 100
daily_ctr['day_of_week'] = pd.Categorical(daily_ctr['day_of_week'], categories=days, ordered=True)
daily_ctr = daily_ctr.sort_values('day_of_week')
daily_ctr.head(7)

Unnamed: 0,day_of_week,impressions,clicks,CTR
1,Monday,16005,2922,18.256795
5,Tuesday,46644,7693,16.493011
6,Wednesday,45248,7249,16.020598
4,Thursday,39885,6967,17.46772
0,Friday,16452,2873,17.462922
2,Saturday,16684,3079,18.454807
3,Sunday,19082,3469,18.179436


In [29]:
fig = px.bar(daily_ctr, x='day_of_week', y='CTR',
             labels={"day_of_week": "Day"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'Daily CTR',
    xaxis_title = "Day of week",
    yaxis_title = "Click-Through Rate (CTR)",
    )
fig.show()

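Our daily CTR table shows Saturday with the highest CTR (about 18.5%), followed by Monday (18.3%) and Sunday (18.2%), while Tuesday and Wednesday are lowest (16.5% and 16.0%). A higher weekend CTR is plausible, since people have more time on weekends to spend online, come across ads and click them. Two caveats apply. First, the spread is only about 2.4 percentage points. Second, the data covers 21-30 October 2014, which runs from a Tuesday to a Thursday, so Tuesday, Wednesday and Thursday each occur twice while every other weekday occurs only once. That is why their raw click and impression totals above are roughly double, and why the Friday to Monday figures each rest on a single day of data. 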
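Raw weekday totals are easier to compare once they are divided by how often each weekday occurs in the observed window. The following sketch assumes the window is exactly 21-30 October 2014 (as reported by `df["hour"].describe()`) and copies the click totals from the table above; the variable names are illustrative.

```python
import pandas as pd

# The ten calendar days covered by the data (21-30 October 2014)
dates = pd.Series(pd.date_range("2014-10-21", "2014-10-30"))

# How many times does each weekday occur in that window?
occurrences = dates.dt.day_name().value_counts()
print(occurrences)

# Total clicks per weekday, copied from the click_days table above
total_clicks = pd.Series({
    "Monday": 2922, "Tuesday": 7693, "Wednesday": 7249, "Thursday": 6967,
    "Friday": 2873, "Saturday": 3079, "Sunday": 3469,
})

# Average clicks per calendar day, which is comparable across weekdays
clicks_per_day = total_clicks / occurrences
print(clicks_per_day.round(1))
```

After this normalisation the Tuesday to Thursday averages fall much closer to the other weekdays than the raw totals suggest.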

Now that we have understood the effect of clicks based on hours and days in the week, with different combinations, let us understand effect of other variables on the target click.

### 5. Effect of Site features on Clicks

The kind of website hosting our ads has a huge impact on our clicks. Is the website a famous one, a commercial e-commerce one, or just a blogging site? Many site-related factors play an important role in whether a person clicks an ad rendered on it. As calculated earlier, we have **1788** unique websites. 

#### 5.1 Effects of Site id on clicks and CTR

In [30]:
print("Number of unique websites: {}".format(str(len(df["site_id"].unique()))))

Number of unique websites: 1788


In [113]:
# top5 websites based on number of ads displayed in them
siteids = df["site_id"].value_counts()[:5].index
site_impressions = df["site_id"].value_counts()[:5].values
print("Top5 websites based on impressions: \n{}".format(siteids))

Top5 websites based on impressions: 
Index(['85f751fd', '1fbe01fe', 'e151e245', 'd9750ee7', '5b08c53b'], dtype='object')


In [114]:
top5_sites = df[df["site_id"].isin(siteids)]
top5_sites_click = top5_sites.groupby(['site_id', 'click']).size().unstack(fill_value=0).reset_index()
top5_sites_click = top5_sites_click.sort_values(by=1, ascending=False).reset_index(drop=True)
# Derive impressions from each row's own counts. Assigning the value_counts() array
# positionally misaligns rows, because this frame is sorted by clicks, not impressions.
top5_sites_click["site_impressions"] = top5_sites_click[0] + top5_sites_click[1]
top5_sites_click = top5_sites_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_sites_click.columns.name = None
top5_sites_click.head()

Unnamed: 0,site_id,Not Clicked,Clicked,site_impressions
0,85f751fd,63793,8706,72499
1,1fbe01fe,25416,6749,32165
2,e151e245,9218,3901,13119
3,5b08c53b,2330,2123,4453
4,d9750ee7,3361,1380,4741


In [115]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=top5_sites_click["site_id"], y=top5_sites_click["Clicked"],
           hovertemplate='Site ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
    go.Bar(name='Not Clicked', x=top5_sites_click["site_id"], y=top5_sites_click["Not Clicked"], 
           hovertemplate='Site ID: %{x} <br>Non-clicks: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
    title = 'Top5 Sites based on Clicks',
    xaxis_title = "Top5 Site IDs",
    yaxis_title = "Impressions / site",
    barmode='group',
    )
fig.show()

Of the 1788 sites on which our ads are placed, the chart shows the top 5 by number of impressions. As before, many people see the ads but only a few end up clicking on them, as the green (Clicked) bars show.

In [116]:
top5_sites_click['CTR'] = top5_sites_click['Clicked'] / top5_sites_click['site_impressions'] * 100
top5_sites_click.head()

Unnamed: 0,site_id,Not Clicked,Clicked,site_impressions,CTR
0,85f751fd,63793,8706,72499,12.008441
1,1fbe01fe,25416,6749,32165,20.982434
2,e151e245,9218,3901,13119,29.735498
3,5b08c53b,2330,2123,4453,47.675724
4,d9750ee7,3361,1380,4741,29.107783


In [117]:
fig = px.bar(top5_sites_click, x='site_id', y='CTR',
             labels={"site_id": "Site Id"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'CTR values of Top5 Sites',
    xaxis_title = "Top5 Site Ids",
    yaxis_title = "Click-Through Rate (CTR)",
    )
fig.show()

We see that although site id **85f751fd** had the most impressions, site id **5b08c53b** had the highest CTR. One possible explanation is that this site's content matches the ads well, and that clicking an ad leads the user to an appropriate landing page. 

#### 5.2 Effects of Site Domain on Clicks and CTR

In [129]:
print("Number of unique domains: {}".format(str(len(df["site_domain"].unique()))))

Number of unique domains: 1707


In [131]:
# top5 domains based on number of ads displayed in them
sitedomains = df["site_domain"].value_counts()[:5].index
domain_impressions = df["site_domain"].value_counts()[:5].values
print("Top5 site domains based on impressions: \n{}".format(sitedomains))

Top5 site domains based on impressions: 
Index(['c4e18dd6', 'f3845767', '7e091613', '7687a86e', '98572c79'], dtype='object')


In [123]:
top5_domains = df[df["site_domain"].isin(sitedomains)]
top5_domains_click = top5_domains.groupby(['site_domain', 'click']).size().unstack(fill_value=0).reset_index()
top5_domains_click = top5_domains_click.sort_values(by=1, ascending=False).reset_index(drop=True)
# Derive impressions from each row's own counts, so rows cannot be misaligned
# when the order by clicks differs from the order by impressions
top5_domains_click["domain_impressions"] = top5_domains_click[0] + top5_domains_click[1]
top5_domains_click = top5_domains_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_domains_click.columns.name = None
top5_domains_click.head()

Unnamed: 0,site_domain,Not Clicked,Clicked,domain_impressions
0,c4e18dd6,65812,9307,75119
1,f3845767,25416,6749,32165
2,7e091613,12265,4288,16553
3,7687a86e,3381,2949,6330
4,98572c79,3500,1395,4895


In [124]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=top5_domains_click["site_domain"], y=top5_domains_click["Clicked"],
           hovertemplate='Domain ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
    go.Bar(name='Not Clicked', x=top5_domains_click["site_domain"], y=top5_domains_click["Not Clicked"], 
           hovertemplate='Domain ID: %{x} <br>Non-clicks: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
    title = 'Top5 Domains based on Clicks',
    xaxis_title = "Top5 Site Domains",
    yaxis_title = "Impressions / domain",
    barmode='group',
    )
fig.show()

Each website is hosted under a domain. If a domain is descriptive and apt, people are more likely to visit it, although that does not guarantee they will click the ad. If the ad is not relevant to the site's content or core idea, it will have a lower CTR.

In [125]:
top5_domains_click['CTR'] = top5_domains_click['Clicked'] / top5_domains_click['domain_impressions'] * 100
top5_domains_click.head()

Unnamed: 0,site_domain,Not Clicked,Clicked,domain_impressions,CTR
0,c4e18dd6,65812,9307,75119,12.389675
1,f3845767,25416,6749,32165,20.982434
2,7e091613,12265,4288,16553,25.90467
3,7687a86e,3381,2949,6330,46.587678
4,98572c79,3500,1395,4895,28.498468


In [128]:
fig = px.bar(top5_domains_click, x='site_domain', y='CTR',
             labels={"site_domain": "Domain ID"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'CTR values of Top5 Domains',
    xaxis_title = "Top5 Domains",
    yaxis_title = "Click-Through Rate (CTR)"
    )
fig.show()

Again, the 4th domain (**7687a86e**) has the highest CTR, although it had far fewer impressions than the 1st domain.

#### 5.3 Effects of Site Category on Clicks and CTR

In [130]:
print("Number of website categories: {}".format(str(len(df["site_category"].unique()))))

Number of website categories: 21


In [134]:
# top5 site categories based on number of ads displayed in them
sitecategories = df["site_category"].value_counts()[:5].index
category_impressions = df["site_category"].value_counts()[:5].values
print("Top5 site categories based on impressions: \n{}".format(sitecategories))

Top5 site categories based on impressions: 
Index(['50e219e0', 'f028772b', '28905ebd', '3e814130', 'f66779e6'], dtype='object')


In [135]:
top5_categories = df[df["site_category"].isin(sitecategories)]
top5_categories_click = top5_categories.groupby(['site_category', 'click']).size().unstack(fill_value=0).reset_index()
top5_categories_click = top5_categories_click.sort_values(by=1, ascending=False).reset_index(drop=True)
# Derive impressions from each row's own counts. Assigning the value_counts() array
# positionally misaligns rows, because this frame is sorted by clicks, not impressions.
top5_categories_click["category_impressions"] = top5_categories_click[0] + top5_categories_click[1]
top5_categories_click = top5_categories_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_categories_click.columns.name = None
top5_categories_click.head()

Unnamed: 0,site_category,Not Clicked,Clicked,category_impressions
0,f028772b,51392,11330,62722
1,50e219e0,71442,10622,82064
2,28905ebd,28778,7714,36492
3,3e814130,10608,4264,14872
4,f66779e6,1111,50,1161


In [138]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=top5_categories_click["site_category"], y=top5_categories_click["Clicked"],
           hovertemplate='Category ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
    go.Bar(name='Not Clicked', x=top5_categories_click["site_category"], y=top5_categories_click["Not Clicked"], 
           hovertemplate='Category ID: %{x} <br>Non-clicks: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
    title = 'Top5 Categories based on Clicks',
    xaxis_title = "Top5 Site Categories",
    yaxis_title = "Impressions / site category",
    barmode='group',
    )
fig.show()

Sites can belong to various categories: e-commerce websites, healthcare websites, education websites, etc. Each category contains websites from different domains. The graph above shows how impressions vary by site category. For instance, the 2nd site category (**50e219e0**) has the most impressions. Since the categories are anonymized we can only speculate, but it might represent e-commerce sites like Amazon or eBay, which have a larger footprint than a less visited website such as a hospital site or an educational blog catering to a specific audience.

In [139]:
top5_categories_click['CTR'] = top5_categories_click['Clicked'] / top5_categories_click['category_impressions'] * 100
top5_categories_click.head()

Unnamed: 0,site_category,Not Clicked,Clicked,category_impressions,CTR
0,f028772b,51392,11330,62722,18.063837
1,50e219e0,71442,10622,82064,12.943556
2,28905ebd,28778,7714,36492,21.13888
3,3e814130,10608,4264,14872,28.671329
4,f66779e6,1111,50,1161,4.306632


In [140]:
fig = px.bar(top5_categories_click, x='site_category', y='CTR',
             labels={"site_category": "Site Category ID"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'CTR values of Top5 Site Categories',
    xaxis_title = "Top5 Site Categories",
    yaxis_title = "Click-Through Rate (CTR)"
    )
fig.show()

As before, the CTR is highest for the 4th site category (**3e814130**), although its impressions are lower. 
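Sections 5.1 to 5.3 repeat the same steps for a different column each time. As a possible simplification, the steps could be wrapped in one function. `top_n_ctr` is a hypothetical helper, not part of the original notebook, and the toy frame is illustrative. Unlike the tables above, it orders rows by impressions rather than by clicks.

```python
import pandas as pd

def top_n_ctr(frame, column, n=5):
    """Impressions, clicks and CTR (%) for the n most-shown values of `column`."""
    stats = (
        frame.groupby(column)["click"]
        .agg(impressions="count", clicks="sum")   # one pass keeps both aligned
        .sort_values("impressions", ascending=False)
        .head(n)
        .reset_index()
    )
    stats["CTR"] = stats["clicks"] / stats["impressions"] * 100
    return stats

# Toy data: site "a" has 4 impressions and 1 click, "b" has 2 and 2, "c" has 1 and 0
toy = pd.DataFrame({
    "site_id": ["a", "a", "a", "a", "b", "b", "c"],
    "click":   [1,   0,   0,   0,   1,   1,   0],
})
result = top_n_ctr(toy, "site_id", n=2)
print(result)
```

On the real data, calls such as `top_n_ctr(df, "site_id")` or `top_n_ctr(df, "site_category")` would produce the per-value impressions, clicks and CTR in one step.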

### 6. Effect of Device Features on Clicks



#### 6.1 Effect of Device id on Clicks

#### 6.2 Effect of Device Type on Clicks

### 7. Effect of App Features on Clicks

#### 7.1 Effect of App Category on Clicks