In [15]:
import numpy as np
import pandas as pd
import sklearn
import catboost
import shap

from pandas_profiling import ProfileReport

# Problem Statement

Work with the attached dataset to identify users whose sessions are more likely to result in a sale.

It's important for the business to understand **what makes someone more likely to buy** insurance.
For this purpose, we assembled a dataset with features of users and their sessions, plus an indicator of whether each session resulted in a sale.

Your goal is to work with this dataset and **identify users who are more likely to convert** so that we can personalise the experience for them.

We don't expect a perfect solution (actually, we believe there isn't one). But please take several hours to implement a well-structured approach in a Jupyter notebook.

Afterwards, we'll discuss your solution in detail. Be ready to answer questions such as: why did you choose this approach and not others, how will you measure the performance of your solution, and how do you interpret the final results?


# Data understanding

| field name   	| description                                                  	|
|--------------	|--------------------------------------------------------------	|
| id           	| unique identifier of the rows                                	|
| date         	| date of the session                                          	|
| campaign_id  	| id of the advertising campaign that led the user to the site 	|
| group_id     	| id of the group that led the user to the site                	|
| age_group    	| age range of the user                                        	|
| gender       	| gender of the user                                           	|
| user_type    	| internal id of the type of user                              	|
| platform     	| device type of the user                                      	|
| state_id     	| US state id of the user location                             	|
| interactions 	| number of interactions of the user                           	|
| sale         	| boolean indicator of whether the session ended in a sale     	|

## Read-in the data

In [13]:
sales_data = pd.read_csv('sales_data.csv.gz', compression='gzip', sep=',', header=0, index_col=0)

In [14]:
sales_data

id,date,campaign_id,group_id,age_group,gender,user_type,platform,state_id,interactions,sale
277133,2021-05-04,3,372,45-49,F,0,desktop,19,5249,0.0
270342,2021-05-04,3,313,35-39,F,4,desktop,16,1522,0.0
161280,2021-05-02,3,321,30-34,M,1,desktop,25,2,0.0
252773,2021-05-04,3,426,30-34,F,0,,27,2,1.0
118886,2021-05-01,3,337,40-44,M,8,desktop,16,2,0.0
...,...,...,...,...,...,...,...,...,...,...
71168,2021-05-01,3,404,30-34,F,2,,5,2,0.0
8348,2021-05-01,3,329,35-39,F,2,desktop,26,1625,0.0
242565,2021-05-04,3,361,35-39,F,7,desktop,12,2,0.0
199292,2021-05-03,3,377,40-44,M,7,desktop,20,1259,0.0


In [21]:
profile = ProfileReport(sales_data, title="Sales Data Report", explorative=True, dark_mode=True)

In [23]:
profile.to_widgets()


In [24]:
# profile.to_notebook_iframe()

**Notes**:

   - `sale` has missing values: drop those rows from the analysis
   - `sale` is imbalanced: need to rebalance
   - `platform` has missing values: drop those rows?
   - `platform` is imbalanced: rebalance?
   - encode the categorical string variables
   - log-transform `interactions`?
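
The cleaning steps in the notes above could be sketched roughly as follows. This is a minimal sketch, not a settled pipeline: `prepare_sales_data` and the `log_interactions` column are hypothetical names, the toy frame is made up for illustration, and keeping missing `platform` values under an explicit `"unknown"` level (instead of dropping those rows) is an assumption that still needs to be checked against the profile report.

```python
import numpy as np
import pandas as pd


def prepare_sales_data(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the cleaning steps from the notes."""
    out = df.copy()

    # Rows without a label cannot be used for supervised training.
    out = out.dropna(subset=["sale"])
    out["sale"] = out["sale"].astype(int)

    # Assumption: keep rows with a missing platform as their own level
    # rather than removing them.
    out["platform"] = out["platform"].fillna("unknown")

    # log1p compresses the long right tail and is defined at zero.
    out["log_interactions"] = np.log1p(out["interactions"])

    # Treat id-like and string columns as categorical (string) values.
    cat_cols = ["campaign_id", "group_id", "age_group", "gender",
                "user_type", "platform", "state_id"]
    for col in cat_cols:
        out[col] = out[col].astype(str)
    return out


# Toy rows for illustration only (not taken from the real dataset).
demo = pd.DataFrame({
    "date": ["2021-05-04", "2021-05-04", "2021-05-02"],
    "campaign_id": [3, 3, 3],
    "group_id": [372, 426, 321],
    "age_group": ["45-49", "30-34", "30-34"],
    "gender": ["F", "F", "M"],
    "user_type": [0, 0, 1],
    "platform": ["desktop", None, "desktop"],
    "state_id": [19, 27, 25],
    "interactions": [5249, 2, 2],
    "sale": [0.0, 1.0, None],
})
prepared = prepare_sales_data(demo)
```

On the real data this would be called as `prepare_sales_data(sales_data)`. Casting the id-like columns to strings is one option among several; whether `campaign_id` or `state_id` should instead be one-hot or target encoded depends on the model chosen later.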

## Re-balancing
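
One possible way to handle the imbalance in `sale` is to reweight the classes instead of resampling the rows. The sketch below is an assumption about the approach, not a decision already made in this notebook: `balanced_class_weights` is a hypothetical helper and the labels are toy values. It uses the inverse-frequency heuristic `n_samples / (n_classes * count)`, which is the same formula scikit-learn documents for `class_weight="balanced"`.

```python
import numpy as np


def balanced_class_weights(y):
    """Hypothetical helper: weight each class by n_samples / (n_classes * count)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    weights = len(y) / (len(classes) * counts)
    # Convert NumPy scalars to plain Python values for readability.
    return {cls.item(): float(w) for cls, w in zip(classes, weights)}


# Toy labels for illustration only: 8 non-sales and 2 sales.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(toy_labels)
```

With these toy labels the minority class gets four times the weight of the majority class. Reweighting keeps every row in the training set; random under- or over-sampling would be an alternative worth comparing.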

# Modelling

# Explainability

# Fairness