# Challenge 


We expect you to: 
  - finish the test in 2-3 hours
  - return the results within 1 week; the sooner the better
  - build an end-to-end pipeline for the task
  - showcase your understanding of various aspects of ML: ETL, model building and selection, evaluation, etc.
  - develop in Python (Jupyter notebooks) with reasonable comments
  - use version control with appropriate commit messages

The test is about building a CTR prediction model using one of the following datasets:
  - https://www.kaggle.com/c/online-advertising-challenge-spring-2018
  - https://www.kaggle.com/c/avazu-ctr-prediction
  - https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection
  - https://www.kaggle.com/c/criteo-display-ad-challenge

# Data 

For this exercise, the [Avazu dataset](https://www.kaggle.com/c/avazu-ctr-prediction/overview) is used. There is no particular reason for choosing this dataset over the others.

The data was downloaded from the link above and saved into the `data` folder. After decompressing, the files were renamed with the `.csv` extension. Please refer to the link for a fuller description of the data.

In [12]:
! head ./data/avazu-ctr-prediction/train.csv

id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,ddd2926e,44956a24,1,2,15706,320,50,1722,0,35,-1,79
10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,96809ac8,711ee120,1,0,15704,320,50,1722,0,35,100084,79
10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,b3cf8def,8a4875bd,1,0,15704,320,50,1722,0,35,100084,79
10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,e8275b8f,6332421a,1,0,15706,320,50,1722,0,35,100084,79
10000679056417042096,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,07d7df22,a99f214a,9644d0bf,779d90c2,1,0,18993,320,50,2161,0,35,-1,157
10000720757801103869,0,1
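
The `hour` values in the rows above (for example `14102100`) look like compact timestamps. A minimal sketch of decoding them, assuming the YYMMDDHH layout given in the competition's data description; the helper name `parse_hour` is hypothetical and not part of the original notebook:

```python
from datetime import datetime


def parse_hour(raw):
    """Decode an Avazu-style hour value, assumed to be laid out as YYMMDDHH."""
    return datetime.strptime(str(raw), "%y%m%d%H")


ts = parse_hour(14102100)
print(ts.isoformat())         # 2014-10-21T00:00:00
print(ts.hour, ts.weekday())  # hour of day and day of week (Monday = 0)
```

Hour of day and day of week derived this way are common candidates for feature engineering in a CTR model.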

In [1]:
# Read the data lazily and in parallel with Dask
import dask.dataframe as dd

# Schema overrides that are enforced when reading the data.
# Note: float64 keeps only about 15-17 significant digits,
# so the 19-20 digit ids are stored approximately.
dict_schema = {'id': 'float64'}

train = dd.read_csv(urlpath='./data/avazu-ctr-prediction/train.csv', dtype=dict_schema)
test = dd.read_csv(urlpath='./data/avazu-ctr-prediction/test.csv', dtype=dict_schema)

# Print the dataset shapes (compute() loads the full frame into memory), then show the first rows
print('train shape = {}'.format(train.compute().shape))
print('test shape = {}'.format(test.compute().shape))

display(train.head())

train shape = (40428967, 24)
test shape = (4577464, 23)


| | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_type | device_conn_type | C14 | C15 | C16 | C17 | C18 | C19 | C20 | C21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000009e+18 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 2 | 15706 | 320 | 50 | 1722 | 0 | 35 | -1 | 79 |
| 1 | 1.000017e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 2 | 1.000037e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15704 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 3 | 1.000064e+19 | 0 | 14102100 | 1005 | 0 | 1fbe01fe | f3845767 | 28905ebd | ecad2386 | 7801e8d9 | ... | 1 | 0 | 15706 | 320 | 50 | 1722 | 0 | 35 | 100084 | 79 |
| 4 | 1.000068e+19 | 0 | 14102100 | 1005 | 1 | fe8cc448 | 9166c161 | 0569f928 | ecad2386 | 7801e8d9 | ... | 1 | 0 | 18993 | 320 | 50 | 2161 | 0 | 35 | -1 | 157 |
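
The `id` column above is shown in scientific notation because it was read as `float64`. The following sketch, using only the standard library and two ids copied from the `head` output, illustrates that the float conversion does not round-trip, while keeping the ids as strings does. Reading the column as strings is a suggested alternative here, not what the notebook does:

```python
import csv
import io

# Two rows copied from the head of train.csv (first three columns only)
raw = (
    "id,click,hour\n"
    "1000009418151094273,0,14102100\n"
    "10000169349117863715,0,14102100\n"
)

rows = list(csv.DictReader(io.StringIO(raw)))
ids_as_str = [row["id"] for row in rows]

for text in ids_as_str:
    exact = int(text)
    via_float = int(float(text))  # what survives a float64 conversion
    print(text, "round-trips through float64:", exact == via_float)
```

If the ids are ever needed as join keys or for building a submission file, a string dtype would be the safer choice; as a pure row identifier the approximation may not matter.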


In [4]:
train.columns

Index(['id', 'click', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain',
       'site_category', 'app_id', 'app_domain', 'app_category', 'device_id',
       'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14',
       'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21'],
      dtype='object')

In [5]:
test.columns

Index(['id', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain',
       'site_category', 'app_id', 'app_domain', 'app_category', 'device_id',
       'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14',
       'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21'],
      dtype='object')
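
Comparing the two column listings above, the test set should differ from the training set only by the missing target. A quick check, with the column names copied from those outputs:

```python
train_columns = [
    'id', 'click', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain',
    'site_category', 'app_id', 'app_domain', 'app_category', 'device_id',
    'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14',
    'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21',
]
test_columns = [
    'id', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain',
    'site_category', 'app_id', 'app_domain', 'app_category', 'device_id',
    'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14',
    'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21',
]

# Columns present in one set but not in the other
missing_in_test = [c for c in train_columns if c not in test_columns]
extra_in_test = [c for c in test_columns if c not in train_columns]

print(missing_in_test)  # ['click']
print(extra_in_test)    # []
```

In the notebook itself the same comparison could be run directly on `train.columns` and `test.columns`.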

In [6]:
type(train)

dask.dataframe.core.DataFrame

**NOTE**

**The dataset is too large for even light ML training on a local machine. I therefore limit the analysis to about 1M observations from the training dataset (a 2.5% sample) and the same 2.5% fraction of the testing dataset.**

**This limits the accuracy of the model. I assume the task is less about accuracy (even though it is important) than about building an end-to-end ML pipeline with feature engineering. More compute resources in the cloud would lift this limit, but that is not explored here.**
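
As a quick check of the numbers in the note, the expected sample sizes follow from the shapes printed earlier and a 2.5% sampling fraction. The actual counts from a random sample will only be approximately equal to these values:

```python
train_rows = 40_428_967  # from the train shape printed earlier
test_rows = 4_577_464    # from the test shape printed earlier
sample_fraction = 0.025

expected_train = round(train_rows * sample_fraction)
expected_test = round(test_rows * sample_fraction)

print(expected_train)  # 1010724
print(expected_test)   # 114437
```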

In [56]:
# Draw a 2.5% random sample of each dataset and save it to disk
sample_fraction = 0.025
train_sample = train.sample(frac=sample_fraction)
train_sample.compute().to_csv('./data/avazu-ctr-prediction/train_sample.csv', index=False)

test_sample = test.sample(frac=sample_fraction)
test_sample.compute().to_csv('./data/avazu-ctr-prediction/test_sample.csv', index=False)
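
The sampling cell above sets no random seed, so each run draws a different sample, and `compute()` brings the sampled rows into memory before writing. A minimal standard-library sketch of a seeded, streaming alternative follows. It keeps each row with probability `frac`, so the sample size is only approximately `frac` times the row count. The function name `sample_csv` and the synthetic in-memory data are illustrative assumptions, not part of the notebook:

```python
import csv
import io
import random


def sample_csv(src, dst, frac, seed=0):
    """Copy the header and a random ~frac share of the rows from src to dst."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # always keep the header row
    kept = 0
    for row in reader:
        if rng.random() < frac:
            writer.writerow(row)
            kept += 1
    return kept


# Demo on synthetic data held in memory (10,000 rows)
source = io.StringIO("id,click\n" + "".join(f"{i},{i % 2}\n" for i in range(10_000)))
target = io.StringIO()
kept = sample_csv(source, target, frac=0.025, seed=42)
print(kept)  # number of sampled rows, close to 250
```

With real files, `src` and `dst` would be handles opened with `newline=""`. Within Dask itself, passing `random_state` to `sample` should give comparable reproducibility without leaving the DataFrame API.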

In [57]:
# Read the samples back from disk. Not needed in the same session; only when starting fresh.

# Same schema override as for the full data
dict_schema = {'id': 'float64'}

train_sample = dd.read_csv(urlpath='./data/avazu-ctr-prediction/train_sample.csv', dtype=dict_schema)
test_sample = dd.read_csv(urlpath='./data/avazu-ctr-prediction/test_sample.csv', dtype=dict_schema)