# Conversion rate challenge



Contents
--------
1. [Data loading and preprocessing](#loading)
2. [Preliminary EDA](#eda)
3. [A first model](#model1)
4. [A second model](#model2)
5. [Conclusion and perspectives](#conclusion)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## <a name="loading"></a>Data loading and preprocessing

A few remarks on this dataset are in order before proceeding:
- `'country'`, `'source'` and `'total_pages_visited'` are easy to collect by monitoring server activity, but `'age'` is not. This hints that the dataset is somewhat artificial.
- A returning user must have been a `'new_user'` at some point. This raises the question of whether a given user could appear several times in the dataset until they subscribe (or not) to the newsletter.
- Only 4 countries appear in the dataset, which is unlikely for a real website.

In the analysis that follows, we set these issues aside and assume that the dataset is honest and free of user selection bias.

In [9]:
df = pd.read_csv('./conversion_data_train.csv')
df

Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
0,China,22,1,Direct,2,0
1,UK,21,1,Ads,3,0
2,Germany,20,0,Seo,14,1
3,US,23,1,Seo,3,0
4,US,28,1,Direct,3,0
...,...,...,...,...,...,...
284575,US,36,1,Ads,1,0
284576,US,31,1,Seo,2,0
284577,US,41,1,Seo,5,0
284578,US,31,1,Direct,4,0


In [10]:
df.describe(include='all')

Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
count,284580,284580.0,284580.0,284580,284580.0,284580.0
unique,4,,,3,,
top,US,,,Seo,,
freq,160124,,,139477,,
mean,,30.564203,0.685452,,4.873252,0.032258
std,,8.266789,0.464336,,3.341995,0.176685
min,,17.0,0.0,,1.0,0.0
25%,,24.0,0.0,,2.0,0.0
50%,,30.0,1.0,,4.0,0.0
75%,,36.0,1.0,,7.0,0.0
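
As a complement to the summary above, here is a minimal sketch of how missing values and extreme values could be checked explicitly. It runs on a toy frame built from four rows copied from the outputs in this notebook, not on the full file. The names `toy`, `missing_per_column` and `numeric_range` are illustrative.

```python
import pandas as pd

# Toy stand-in for the training data: four rows copied from the notebook outputs.
toy = pd.DataFrame({
    'country': ['China', 'UK', 'Germany', 'US'],
    'age': [22, 21, 123, 23],
    'new_user': [1, 1, 0, 1],
    'source': ['Direct', 'Ads', 'Seo', 'Seo'],
    'total_pages_visited': [2, 3, 15, 3],
    'converted': [0, 0, 1, 0],
})

# Number of missing values per column; all zeros means no missing data.
missing_per_column = toy.isna().sum()

# Minimum and maximum of each numeric column, to spot implausible values.
numeric_range = toy.select_dtypes('number').agg(['min', 'max'])

print(missing_per_column)
print(numeric_range)
```

Applied to `df`, the same two calls would presumably confirm the absence of missing data and expose the maximum `'age'`.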


There are 284 580 observations, with no missing data. Most feature values are consistent with what could be expected, except for `'age'`, whose maximum is 123. This exceeds the age of the oldest person ever verified (122 years). Let us check whether there are other outliers.

In [11]:
df.loc[df['age'] > 70]

Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
11331,UK,111,0,Ads,10,1
104541,US,72,1,Direct,4,0
175251,US,73,1,Seo,5,0
230590,US,79,1,Direct,1,0
233196,Germany,123,0,Seo,15,1
268311,US,77,0,Direct,4,0


There is a 32-year gap between ages 79 and 111, and only 2 website visitors are recorded as older than 80. We choose to remove those 2 records from the dataset.

In [13]:
# Keep only plausible ages; this drops the two records aged 111 and 123
df = df.loc[df['age'] < 80]
df.describe(include='all')

Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
count,284578,284578.0,284578.0,284578,284578.0,284578.0
unique,4,,,3,,
top,US,,,Seo,,
freq,160124,,,139476,,
mean,,30.563596,0.685457,,4.873198,0.032251
std,,8.263627,0.464334,,3.341939,0.176667
min,,17.0,0.0,,1.0,0.0
25%,,24.0,0.0,,2.0,0.0
50%,,30.0,1.0,,4.0,0.0
75%,,36.0,1.0,,7.0,0.0
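
The outlier removal can be sanity-checked on the six records older than 70 listed earlier. The sketch below is an illustration rather than part of the original analysis: it recomputes the gaps between consecutive ages and applies the same `age < 80` filter. The names `old`, `gaps`, `kept` and `removed` are illustrative.

```python
import pandas as pd

# The six records with age > 70, copied from the notebook output
# (only the columns needed for this check).
old = pd.DataFrame(
    {
        'country': ['UK', 'US', 'US', 'US', 'Germany', 'US'],
        'age': [111, 72, 73, 79, 123, 77],
    },
    index=[11331, 104541, 175251, 230590, 233196, 268311],
)

# Differences between consecutive ages once sorted in increasing order.
sorted_ages = old['age'].sort_values()
gaps = sorted_ages.diff().dropna()
largest_gap = int(gaps.max())

# Same threshold as in the notebook: keep ages strictly below 80.
kept = old.loc[old['age'] < 80]
removed = old.loc[old['age'] >= 80]

print(largest_gap)   # largest jump between consecutive ages
print(len(removed))  # number of records dropped by the filter
```

On these six rows the largest jump is the one between 79 and 111, and the filter drops exactly the two records aged 111 and 123, which matches the decrease in `count` from 284580 to 284578.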
