Skip to content

jfraj/click-through

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

click-through

Analysis of click-through rate using the avazu data set from a finished Kaggle competition.

Definition of successful

Practically, the success is defined as: sum(clicks)/(sum(clicks) + sum(non-clicks)) for more details and notes about confidence intervals, see the notebook about successful app_domain

Code for this report

Each discrete questions is answered in an ipython notebook. These notebook go through the reasoning leading to the answer. The backend of these notebooks consists of two python modules:

Data storing with pickle

In order to save time, the dataframe created from the csv file can be saved (pickled) in a given directory with the following command

storing_data.save_df_by_chunk('data/train.csv')

the dataframe will by saved in 10 pickle files in the directory pkld_data under the names pkled_chunk_?.pkl, where ?=0,1,2....

The dataframe can be loaded in the following way

storing_data.get_df('pkld_data/')

The above code assumes that the storing_data.py module is imported:

import storing_data

It also assumes that the directory pkld_data/ exists and that the csv file is in the ./data/ directory.

As an example, the notebook notebook/distinct_ads.ipynb uses the pickled dataframe that was saved in pkld_data/. Finally, for testing purposes, the argument max_chunks=[some int < 10] can be provided to get_df to get a dataframe with less data.

Warning: make sure there are no pkled_chunk_?.pkl files already in pkld_data/. This check should be added to the code.

Language and tools

  • python 2.7
  • ipython 3.0.0
  • pandas 0.16.1
  • matplotlib 1.4.3

The code assumes the data to be in ./data/

Discrete questions

Data insight

(plots for this section were produced in this notebook) As a reference for any successful field, the following plot shows the overall success of all the adds for each day of the data set.

alt text

Here is a bar chart of the most successful value for each field

alt text

The most successful ones, device_id & device_ip, have a small number of clicks, reflected by the larger error bars, but seems to be credible. The small number of impressions, if true, makes those successful field values less interesting for prediction since they happens so rarely.

Field values correlations

Relationship between fields can be checked with variation correlations over time. Here is an example of highly correlated values,

alt text

This can be use to predict the success of one value when there is only data on a correlated value.

Some field contains more correlations between their values than others, here is the distributions of the 100 most successful paired values of the C20 field:

alt text

A closer look at the most successful app_domain

alt text

Clearly, app_domain 99b4c806 is only significantly successful one day, the 28th, but its number of impressions is so large that they overwhelm clicks from all other days of lower success. The following plot shows how the distributions of some fields on the 28th and the rest of the data set.

alt text

(code to produce this last plot in notebooks/app_domain_28th.ipynb) There are four fields that are clearly different on the 28th: C14, C17, C18 & C19. Some of them most likely influence the success. Indeed, C14 and C17 have values that can have very high success. This could be investigated by looking at which combinations are successful on the 28th and see if they are still successful (but less frequent) on other days.

Conclusion

The avazu data set is a clear example of the need for machine learning tools to achieve predictions on the click-through rate. Some trends have been shown above, but not all are not consistent in time. A multidimensional analysis could take into consideration all the fields of the data and learn the click behavior. Finally, these click-through rate is likely to be unreliable in "real-life" data since "click records and non-click records are subsampled based on different sampling strategies" as explained in the kaggle forum.

About

Analysis of click-through rate

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages