First, let's read the relevant columns into memory.

In [31]:
from pandas import read_csv
ads = read_csv('data/ads.csv', usecols=['click', 'AdvertiserID', 'AdExchange', 'Adslotwidth',
                                           'Adslotheight', 'Adslotvisibility', 'Adslotformat', 'Biddingprice', 'Browser', 'imp', 'interest_news',
                                           'interest_eduation', 'interest_automobile', 'interest_realestate',
                                           'interest_IT', 'interest_electronicgame', 'interest_fashion',
                                           'interest_entertainment', 'interest_luxury', 'interest_homeandlifestyle',
                                           'interest_health', 'interest_food', 'interest_divine',
                                           'interest_motherhood_parenting', 'interest_sports',
                                           'interest_travel_outdoors',
                                           'interest_social', 'interest_art_photography_design',
                                           'interest_onlineliterature', 'interest_3c', 'interest_culture',
                                           'interest_sex',
                                           'Inmarket_3cproduct', 'Inmarket_appliances', 'Inmarket_clothing_shoes_bags',
                                           'Inmarket_Beauty_PersonalCare', 'Inmarket_infant_momproducts',
                                           'Inmarket_sportsitem', 'Inmarket_outdoor', 'Inmarket_healthcareproducts',
                                           'Inmarket_luxury', 'Inmarket_realestate', 'Inmarket_automobile',
                                           'Inmarket_finance', 'Inmarket_travel', 'Inmarket_education',
                                           'Inmarket_service', 'Inmarket_electronicgame', 'Inmarket_book',
                                           'Inmarket_medicine', 'Inmarket_food_drink', 'Inmarket_homeimprovement',
                                           'Demographic_gender_male', 'Demographic_gender_famale', 'Payingprice'])
if 'Unnamed: 0' in ads:
    ads.drop('Unnamed: 0', axis=1, inplace=True)
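
As a side note on `usecols`: to our knowledge, `read_csv` ignores the order of the names passed in `usecols` and returns the columns in file order, which is why `click` does not come first in the loaded frame even though it is listed first. A minimal sketch, assuming a made-up three-column CSV (the `raw` buffer and its `unused` column are hypothetical and not part of the real dataset):

```python
from io import StringIO

from pandas import read_csv

# Hypothetical miniature CSV standing in for data/ads.csv.
raw = StringIO("Browser,unused,click\nchrome,a,1\nfirefox,b,0\n")

# Only the listed columns are parsed; `unused` is never loaded.
sample = read_csv(raw, usecols=['click', 'Browser'])

# The columns come back in file order, not in `usecols` order.
print(list(sample.columns))
```

Restricting the columns at read time keeps memory usage down, which matters for a file with several million rows.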



Let's inspect the columns of the dataset.

In [2]:
ads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4621497 entries, 0 to 4621496
Data columns (total 55 columns):
Browser                            object
AdExchange                         float64
Adslotwidth                        float64
Adslotheight                       float64
Adslotvisibility                   object
Adslotformat                       object
Biddingprice                       float64
AdvertiserID                       float64
imp                                float64
click                              float64
interest_news                      float64
interest_eduation                  float64
interest_automobile                float64
interest_realestate                float64
interest_IT                        float64
interest_electronicgame            float64
interest_fashion                   float64
interest_entertainment             float64
interest_luxury                    float64
interest_homeandlifestyle          float64
interest_health               

We see that we have roughly 4.6 million observations and 55 variables. The most important variable is `click`, which indicates whether the user clicked on the ad. Let's see how many of its values are missing.

In [44]:
ads['click'].isnull().sum()

1182166

Around 1.2 million bids do not contain a value for the `click` variable. This variable is critical, because it is the target we will regress later on. Therefore, we need to remove these observations.

In [45]:
ads.dropna(subset=['click'], inplace=True)
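
To illustrate what `dropna(subset=...)` does, here is a small sketch on a toy frame (the values are made up for illustration). Only rows with a missing `click` are removed; missing values in other columns are left untouched.

```python
import numpy as np
import pandas as pd

# Toy data: one row lacks `click`, another lacks `Biddingprice`.
toy = pd.DataFrame({
    'click': [1.0, np.nan, 0.0],
    'Biddingprice': [300.0, 250.0, np.nan],
})

n_before = len(toy)
# Drop only the rows where the target `click` is missing.
toy = toy.dropna(subset=['click'])
n_dropped = n_before - len(toy)

print('Dropped rows:', n_dropped)
print('Remaining missing Biddingprice values:', toy['Biddingprice'].isnull().sum())
```

Counting the rows before and after is a cheap way to confirm that the number of dropped observations matches the `isnull().sum()` result.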

Additionally, let's move the `click` variable to the front.

In [22]:
cols = ['click'] + [col for col in ads if col != 'click']
ads = ads[cols]
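
For reference, the same reordering can also be done in place with `pop` and `insert`. The sketch below compares both variants on a toy frame with made-up values; it is an alternative, not what the notebook itself uses.

```python
import pandas as pd

# Toy frame with `click` in the middle.
toy = pd.DataFrame({
    'Browser': ['chrome', 'firefox'],
    'click': [1.0, 0.0],
    'Payingprice': [80.0, 60.0],
})

# Variant 1: rebuild the column order and select, returning a new frame.
reordered = toy[['click'] + [col for col in toy if col != 'click']]

# Variant 2: remove the column and re-insert it at position 0, in place.
toy.insert(0, 'click', toy.pop('click'))

print(list(reordered.columns))
print(list(toy.columns))
```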

Let's fix some typos.

In [23]:
ads.rename(
    columns={'interest_eduation': 'interest_education', 'Demographic_gender_famale': 'Demographic_gender_female'},
    inplace=True)

We see that most variables are detected as floating point numbers. However, most of them are actually boolean variables! We need to account for that and convert them. First, let's look at the average number of missing values per boolean variable.

In [9]:
boolean_cols = ['imp', 'click', 'interest_news',
                'interest_education', 'interest_automobile', 'interest_realestate',
                'interest_IT', 'interest_electronicgame', 'interest_fashion',
                'interest_entertainment', 'interest_luxury', 'interest_homeandlifestyle',
                'interest_health', 'interest_food', 'interest_divine',
                'interest_motherhood_parenting', 'interest_sports', 'interest_travel_outdoors',
                'interest_social', 'interest_art_photography_design',
                'interest_onlineliterature', 'interest_3c', 'interest_culture', 'interest_sex',
                'Inmarket_3cproduct', 'Inmarket_appliances', 'Inmarket_clothing_shoes_bags',
                'Inmarket_Beauty_PersonalCare', 'Inmarket_infant_momproducts',
                'Inmarket_sportsitem', 'Inmarket_outdoor', 'Inmarket_healthcareproducts',
                'Inmarket_luxury', 'Inmarket_realestate', 'Inmarket_automobile',
                'Inmarket_finance', 'Inmarket_travel', 'Inmarket_education',
                'Inmarket_service', 'Inmarket_electronicgame', 'Inmarket_book',
                'Inmarket_medicine', 'Inmarket_food_drink', 'Inmarket_homeimprovement',
                'Demographic_gender_male', 'Demographic_gender_female']
missings = 0
for col in boolean_cols:
    missings_col = ads[col].isnull().sum()
    if col == 'imp':
        print('Missing values in `imp`', missings_col)
    missings += missings_col  # reuse the count computed above
missings / len(boolean_cols)



Missing values in `imp` 0


1409622.2608695652
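
The same average can be computed without an explicit loop. A minimal sketch on a toy frame (the column subset and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the boolean columns, still stored as floats.
toy = pd.DataFrame({
    'imp': [1.0, 0.0, 1.0, 1.0],
    'interest_news': [1.0, np.nan, np.nan, 0.0],
    'interest_food': [np.nan, np.nan, np.nan, 1.0],
})
toy_boolean_cols = ['imp', 'interest_news', 'interest_food']

# Missing values per column, then their mean across columns.
missing_per_col = toy[toy_boolean_cols].isnull().sum()
avg_missing = missing_per_col.mean()

print(missing_per_col)
print('Average missing values per column:', avg_missing)
```

Keeping the per-column counts around also makes it easy to see which variables contribute most to the average.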

These are quite a lot, and we cannot afford to lose them. As the variables are simply dummy variables of a categorical variable, we decide to treat all missing values as `False`. To do so, we fill them with 0 before converting to `bool`, because a direct cast would turn `NaN` into `True`. Intuitively, if we don't know whether a user is interested in food, we simply assume that the user isn't interested in it. We expect that this won't skew the analysis significantly. Also note that the `imp` variable has no missing values.

In [25]:
ads[boolean_cols] = ads[boolean_cols].fillna(0)
ads[boolean_cols] = ads[boolean_cols].astype(bool)
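
Note that the `fillna(0)` step is what turns missing values into `False`. Casting a float column that contains `NaN` directly with `astype(bool)` yields `True` for the missing entries, since `NaN` is a nonzero float. A small sketch on a toy series:

```python
import numpy as np
import pandas as pd

flags = pd.Series([1.0, 0.0, np.nan])

# Direct cast: NaN is a nonzero float, so it becomes True.
naive = flags.astype(bool)

# Fill first, then cast: missing values become False.
filled = flags.fillna(0).astype(bool)

print(naive.tolist())
print(filled.tolist())
```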

How many bids did not win impressions?

In [54]:
(~ads['imp']).sum()

1473696

Obviously, if the advertiser didn't win the impression, the user cannot click on the ad. Therefore, we keep only the won impressions and then remove the `imp` column, which is constant afterwards. As an aside, it would be an interesting analysis to regress `imp` itself, i.e. whether an advertiser will win a bid. However, we figure that the provided variables are not conclusive enough to classify `imp` correctly.

In [30]:
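# Sanity check: a lost impression should never come with a click.
print('Lost impressions, but clicked', ((ads['imp'] == 0) & (ads['click'] == 1)).sum())
# Keep only won impressions; `imp` is constant afterwards, so drop it.
# Chaining avoids an in-place drop on a filtered slice.
ads = ads[ads['imp']].drop('imp', axis=1)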
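
The filtering step can be sketched on a toy frame as follows (values are made up for illustration). The filter and the drop are chained into one expression that returns a new frame, which sidesteps the `SettingWithCopyWarning` that pandas may raise when a filtered slice is modified in place.

```python
import pandas as pd

# Toy data: two won impressions, one lost.
toy = pd.DataFrame({
    'click': [True, False, False],
    'imp': [True, True, False],
    'Payingprice': [80.0, 60.0, 0.0],
})

# Sanity check: clicks on lost impressions should not occur.
impossible = (~toy['imp'] & toy['click']).sum()

# Keep won impressions only, then drop the now constant column.
won = toy[toy['imp']].drop('imp', axis=1)

print('Lost impressions, but clicked:', impossible)
print(won)
```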