# Conversion Rate Prediction Service

This notebook contains EDA, data curation, and modeling experiments.

**Objective**: build model(s) that predict the user conversion rate from the activity features per entity and device type.

**DoD for the model performance**

The model's MAE must be lower than the MAE achieved by each of the following naïve baseline models:

- predicting 0 for all entities
- predicting the mean of the entire train data set
- predicting the mean of the train data set by device
- predicting the mean of the train data set by entity_id
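
The four baselines above could be scored along the following lines. This is only a minimal sketch: the helper name `baseline_mae`, the train/validation split, the toy data, and the fallback to the global mean for groups unseen in training are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def baseline_mae(train, valid, target="Conversions"):
    """Return the MAE of each naive baseline on the validation frame."""
    y_true = valid[target].to_numpy(dtype=float)
    global_mean = train[target].mean()

    preds = {
        "zero": np.zeros(len(valid)),
        "global_mean": np.full(len(valid), global_mean),
    }
    # Group means; groups unseen in train fall back to the global mean (assumption).
    for key in ("device", "entity_id"):
        group_mean = train.groupby(key)[target].mean()
        preds[f"mean_by_{key}"] = (
            valid[key].map(group_mean).fillna(global_mean).to_numpy(dtype=float)
        )
    return {name: float(np.mean(np.abs(y_true - p))) for name, p in preds.items()}


# Toy data, made up for illustration only.
train = pd.DataFrame({
    "entity_id": [1, 1, 2, 2],
    "device": ["computer", "tablet", "computer", "tablet"],
    "Conversions": [0, 2, 4, 2],
})
valid = pd.DataFrame({
    "entity_id": [1, 2, 3],
    "device": ["computer", "tablet", "computer"],
    "Conversions": [1, 3, 2],
})
scores = baseline_mae(train, valid)
print(scores)
```

A candidate model would then satisfy the DoD only if its validation MAE is below every value in `scores`.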

In [1]:
import os
import pandas as pd
import numpy as np
import pandas_profiling
import time

np.random.seed(2019)

In [2]:
DIR = os.getcwd()
PATH_DATA = os.path.join(os.path.dirname(DIR), 'bucket/data')
PATH_MODEL = os.path.join(os.path.dirname(DIR), 'bucket/model')

In [3]:
df = pd.read_csv(os.path.join(PATH_DATA, 'input/technical_test_training_data.csv'))

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56181 entries, 0 to 56180
Columns: 165 entries, entity_id to Conversions
dtypes: int64(164), object(1)
memory usage: 70.7+ MB


Let's have a look at the data profile:

In [None]:
df.profile_report()

Based on the data set profiling, the following observations can be made:

- the target class is unbalanced with only about 17% of data points corresponding to a user conversion
- device type distribution is also skewed with the ratio of classes tablet:smartphone:computer ~ 1:1:2
- the data set has a time-series dimension and covers two weeks of data
- the number of features can be reduced by omitting columns that are highly correlated with others:
    - att2 ~ att6 ~ att28
    - att10 ~ att11 ~ att44
    - att17 ~ att19
    - att9 = att35
    - att4 = att26
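
One possible way to find such redundant columns programmatically is to scan the upper triangle of the absolute correlation matrix. The sketch below is illustrative only: the helper `correlated_columns`, the 0.95 threshold, the use of Pearson correlation, and the synthetic data are assumptions; the groups listed above came from the profiling report.

```python
import numpy as np
import pandas as pd


def correlated_columns(frame, threshold=0.95):
    """List columns whose absolute Pearson correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic stand-in for a few att* columns (made-up values).
rng = np.random.default_rng(2019)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "att9": base,
    "att35": base,                  # exact duplicate, mimicking att9 = att35
    "att17": rng.normal(size=200),  # independent column
})
to_drop = correlated_columns(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

For each correlated group, this keeps the first column encountered and drops the later ones.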

## Type casting and downcasting


This step is advisable (especially for floating-point data) because it removes unnecessary precision, which reduces memory consumption and computational cost.
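
To illustrate the memory effect, the sketch below downcasts two integer columns with `pd.to_numeric(..., downcast="integer")`, a built-in alternative to checking `np.iinfo` ranges by hand. The frame and its values are synthetic and exist only for this example; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Synthetic frame (made-up values) stored as int64 by default.
demo = pd.DataFrame({
    "Clicks": np.arange(1000, dtype=np.int64),             # range fits in int16
    "Conversions": np.arange(1000, dtype=np.int64) % 100,  # range fits in int8
})
before = demo.memory_usage(deep=True).sum()

# Downcast each column to the smallest signed integer type that holds its values.
for col in demo.columns:
    demo[col] = pd.to_numeric(demo[col], downcast="integer")

after = demo.memory_usage(deep=True).sum()
print(demo.dtypes.to_dict(), before, after)
```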

In [None]:
[i for i in df.columns if not i.startswith('att')]

In [None]:
col_int = ['Conversions', 'Clicks']
col_cat = ['entity_id', 'device', 'week']

In [None]:
# Candidate integer types, from narrowest to widest.
int_types = [np.int8, np.int16, np.int32, np.int64]
for col in col_int:
    col_min = df[col].min()
    col_max = df[col].max()

    # Cast to the narrowest type whose range covers the column's values.
    for int_type in int_types:
        info = np.iinfo(int_type)
        if info.min <= col_min and col_max <= info.max:
            df[col] = df[col].astype(int_type)
            break

In [None]:
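# Assign the result back; astype returns a new frame and does not modify df in place.
df[col_cat] = df[col_cat].astype("category")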

In [None]:
type(np.int8)

In [None]:
df[['Conversions']].max()