# Challenge : predict conversions 🏆🏆
In this project, you will participate in a machine learning competition like the ones organized by https://www.kaggle.com/. You will work with Jupyter notebooks as usual, but in the end you will submit your model's predictions to your teacher/TA, so that your model's performance is evaluated independently. The scores achieved by the different teams will be stored in a leaderboard 🏅🏅

## Description of a machine learning challenge 🚴🚴
- In machine learning challenges, the dataset is typically split into two files :
    - *data_train.csv* contains **labelled data**, which means there are both X (explanatory variables) and Y (the target to be predicted). You will use this file to train your model as usual : make the train/test split, preprocess the data, assess performance, try different models, fine-tune hyperparameters, etc.
    - *data_test.csv* contains "new" examples that have not been used to train the model, in the same format as in *data_train.csv*, but it is **unlabelled**, which means the target Y has been removed from the file. Once you've trained a model, you will use *data_test.csv* to make predictions that you will send to the organizing team. Because only the organizers hold the true labels, they can assess your model's performance independently, which prevents cheating 🤸
- The organizers compare your model's predictions to the true labels and publish a leaderboard that stores the scores of all the teams around the world
- All the participants are informed about the metric that will be used to compute the scores. Make sure you use the same metric to evaluate your train/test performance !

## Company's Description 📇
www.datascienceweekly.org is a famous newsletter curated by independent data scientists. Anyone can register their e-mail address on this website to receive weekly news about data science and its applications !

## Project 🚧
The data scientists who created the newsletter would like to better understand the behaviour of the users visiting their website. They would like to know whether it is possible to build a model that predicts if a given user will subscribe to the newsletter, using just a few pieces of information about the user. They would like to analyze the parameters of the model to highlight the features that best explain the users' behaviour, and maybe discover a new lever for action to improve the newsletter's conversion rate.

They designed a competition whose goal is to build a model that predicts *conversions* (i.e. whether a user will subscribe to the newsletter). To do so, they open-sourced a dataset containing data about the traffic on their website. To rank the competing teams, they decided to use the **f1-score**.
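
As a reminder, the f1-score is the harmonic mean of precision and recall computed on the positive class (here, the converted users). The sketch below uses made-up toy labels, not the competition data, and checks a manual computation from the confusion matrix against scikit-learn's `f1_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels made up for illustration: 1 = converted, 0 = not converted
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted conversions that are real
recall = tp / (tp + fn)     # share of real conversions that were found
f1_manual = 2 * precision * recall / (precision + recall)

print("precision :", precision)
print("recall :", recall)
print("f1 (manual) :", f1_manual)
print("f1 (sklearn) :", f1_score(y_true, y_pred))
```

With these toy labels there are 2 true positives, 1 false positive and 2 false negatives, so the f1-score is 4/7 (about 0.57), even though 7 predictions out of 10 are correct.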

## Goals 🎯
The project can be split into four steps :
- Part 1 : perform an EDA, preprocess the data and train a baseline model with the file *data_train.csv*
- Part 2 : improve your model's f1-score on your test set (you can try feature engineering, feature selection, regularization, non-linear models, hyperparameter optimization by grid search, etc.)
- Part 3 : Once you're satisfied with your model's score, you can use it to make predictions with the file *data_test.csv*. You will have to dump the predictions into a .csv file that will be sent to Kaggle (actually, to your teacher/TA 🤓). You can make as many submissions as you want, feel free to try different models !
- Part 4 : Take some time to analyze your best model's parameters. Are there any levers for action that would help to improve the newsletter's conversion rate ? What recommendations would you make to the team ?
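
For Part 2, one possible way to run a grid search is sketched below. It is only an illustration: the data is synthetic (generated with `make_classification`), and the grid of `C` values is a hypothetical choice. The key point is `scoring="f1"`, so that the search optimizes the same metric as the leaderboard.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the preprocessed features
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Hypothetical grid: C is the inverse of the regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1",  # same metric as the leaderboard
    cv=3,
)
grid.fit(X_tr, y_tr)

print("Best hyperparameters :", grid.best_params_)
print("Best cross-validated f1 :", grid.best_score_)
print("f1 on held-out test set :", grid.score(X_te, y_te))
```

Because `scoring` is set, `grid.score` also reports the f1-score rather than the accuracy.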

## Deliverable 📬
To complete this project, your team should: 
- Create some relevant figures for EDA
- Train at least one model that predicts the conversions and evaluate its performance (f1-score, confusion matrices)
- Make at least one submission to the leaderboard 
- Analyze your best model's parameters and try to make some recommendations to improve the conversion rate in the future


# Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_curve
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) # to avoid deprecation warnings

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# setting Jedha color palette as default
pio.templates["jedha"] = go.layout.Template(
    layout_colorway=["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]
)
pio.templates.default = "jedha"

pio.renderers.default = "iframe_connected" # to be replaced by "iframe" if working on JULIE

## Import dataset

In [2]:
# Import dataset

print("Loading dataset...")
dataset = pd.read_csv("conversion_data_train.csv")
print("...Done.")
print()

Loading dataset...
...Done.



In [3]:

# Separate target variable Y from features X
print("Separating labels from features...")
features_list = ['total_pages_visited','age','new_user','country','source']
target_variable = "converted"

X = dataset.loc[:,features_list]
Y = dataset.loc[:,target_variable]

print("...Done.")
print()

print('Y : ')
print(Y.head())
print()
print('X :')
print(X.head())


Separating labels from features...
...Done.

Y : 
0    0
1    0
2    1
3    0
4    0
Name: converted, dtype: int64

X :
   total_pages_visited  age  new_user  country  source
0                    2   22         1    China  Direct
1                    3   21         1       UK     Ads
2                   14   20         0  Germany     Seo
3                    3   23         1       US     Seo
4                    3   28         1       US  Direct


In [4]:
dataset.shape

(284580, 6)

In [5]:
dataset.describe(include='all')

        country        age  new_user  source  total_pages_visited  converted
count    284580   284580.0  284580.0  284580             284580.0   284580.0
unique        4        NaN       NaN       3                  NaN        NaN
top          US        NaN       NaN     Seo                  NaN        NaN
freq     160124        NaN       NaN  139477                  NaN        NaN
mean        NaN  30.564203  0.685452     NaN             4.873252   0.032258
std         NaN   8.266789  0.464336     NaN             3.341995   0.176685
min         NaN       17.0       0.0     NaN                  1.0        0.0
25%         NaN       24.0       0.0     NaN                  2.0        0.0
50%         NaN       30.0       1.0     NaN                  4.0        0.0
75%         NaN       36.0       1.0     NaN                  7.0        0.0
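
The mean of `converted` is about 0.032, so only around 3.2 % of the visits lead to a subscription. With such an imbalance, accuracy is misleading: a model that always predicts "no conversion" is already right about 97 % of the time. The sketch below illustrates this with synthetic labels that mimic this conversion rate (an assumption for the demo, not the real data) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with roughly the same conversion rate as the dataset (about 3.2 %)
n_samples = 10_000
n_converted = 323
y_demo = np.zeros(n_samples, dtype=int)
y_demo[:n_converted] = 1
X_demo = np.zeros((n_samples, 1))  # the features are ignored by this baseline

# Baseline that always predicts the most frequent class (0 = not converted)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_pred)
baseline_f1 = f1_score(y_demo, y_pred, zero_division=0)

print("accuracy of the baseline :", baseline_accuracy)  # 0.9677
print("f1-score of the baseline :", baseline_f1)        # 0.0
```

This is one reason why the competition is scored with the f1-score: the trivial baseline gets a high accuracy but an f1-score of 0.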


In [6]:

# Automatically detect positions of numeric/categorical features
idx = 0
numeric_features = []
numeric_indices = []
categorical_features = []
categorical_indices = []
for i,t in X.dtypes.items(): # .iteritems() was removed in pandas 2.0, .items() is the equivalent
    if ('float' in str(t)) or ('int' in str(t)) :
        numeric_features.append(i)
        numeric_indices.append(idx)
    else :
        categorical_features.append(i)
        categorical_indices.append(idx)

    idx = idx + 1

print('Found numeric features ', numeric_features,' at positions ', numeric_indices)
print('Found categorical features ', categorical_features,' at positions ', categorical_indices)

# Split the dataset into a train set and a test set
print("Dividing into train and test sets...")
# WARNING : don't forget stratify=Y for classification problems
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0, stratify = Y)
print("...Done.")
print()

Found numeric features  ['total_pages_visited', 'age', 'new_user']  at positions  [0, 1, 2]
Found categorical features  ['country', 'source']  at positions  [3, 4]
Dividing into train and test sets...
...Done.
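
To see what `stratify` changes, the sketch below splits a small synthetic label vector (30 positives out of 1000, made up for the demo) with and without stratification and compares the share of positives in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 30 positives out of 1000 (3 %)
y_demo = np.array([1] * 30 + [0] * 970)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Stratified split: each part keeps the 3 % positive rate
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Plain random split: the positive rate of each part depends on the draw
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print("stratified : train =", y_tr_s.mean(), " test =", y_te_s.mean())
print("random     : train =", y_tr_r.mean(), " test =", y_te_r.mean())
```

With stratification, the 200-row test part contains exactly 6 positives and the train part 24, so both sides have the same class balance. This matters for a rare target such as `converted`, because the f1-score measured on a test set with too few positives would be very noisy.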



In [7]:
# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_train = X_train.values
X_test = X_test.values
Y_train = Y_train.tolist()
Y_test = Y_test.tolist()
print("...Done")

print(X_train[0:5,:])
print(X_test[0:2,:])
print()
print(Y_train[0:5])
print(Y_test[0:2])


Convert pandas DataFrames to numpy arrays...
...Done
[[1 19 1 'China' 'Seo']
 [5 33 1 'US' 'Direct']
 [2 51 1 'US' 'Ads']
 [1 17 0 'China' 'Seo']
 [5 28 1 'China' 'Seo']]
[[1 34 1 'UK' 'Ads']
 [5 32 0 'UK' 'Ads']]

[0, 0, 0, 0, 0]
[0, 0]


In [8]:
# Encoding categorical features and standardizing numerical features
print("Encoding categorical features and standardizing numerical features...")
print()
print(X_train[0:5,:])

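# Standardization of numeric features (zero mean, unit variance)
numeric_transformer = StandardScaler()

# One-hot encoding of categorical features, dropping the first category of each
categorical_transformer = OneHotEncoder(drop='first')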

featureencoder = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_indices),    
        ('num', numeric_transformer, numeric_indices)
        ]
    )

X_train = featureencoder.fit_transform(X_train)
print("...Done")
print(X_train[0:5,:])

Encoding categorical features and standardizing numerical features...

[[1 19 1 'China' 'Seo']
 [5 33 1 'US' 'Direct']
 [2 51 1 'US' 'Ads']
 [1 17 0 'China' 'Seo']
 [5 28 1 'China' 'Seo']]
...Done
[[ 0.          0.          0.          0.          1.         -1.15935344
  -1.3990984   0.67651656]
 [ 0.          0.          1.          1.          0.          0.03743241
   0.29299544  0.67651656]
 [ 0.          0.          1.          0.          0.         -0.86015697
   2.46854467  0.67651656]
 [ 0.          0.          0.          0.          1.         -1.15935344
  -1.64082609 -1.47816042]
 [ 0.          0.          0.          0.          1.          0.03743241
  -0.31132378  0.67651656]]
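
An alternative to converting to numpy arrays and tracking column positions is to keep the DataFrame and chain the preprocessing and the model in a scikit-learn `Pipeline`. The sketch below is one possible layout, not the method used in this notebook. It runs on a tiny sample: the first five rows come from the `X.head()` output above, and the last three are made up for the demo.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny sample with the same columns as the dataset (last three rows are made up)
demo = pd.DataFrame({
    "total_pages_visited": [2, 3, 14, 3, 3, 16, 1, 12],
    "age": [22, 21, 20, 23, 28, 25, 40, 19],
    "new_user": [1, 1, 0, 1, 1, 0, 1, 0],
    "country": ["China", "UK", "Germany", "US", "US", "UK", "China", "US"],
    "source": ["Direct", "Ads", "Seo", "Seo", "Direct", "Seo", "Ads", "Direct"],
    "converted": [0, 0, 1, 0, 0, 1, 0, 1],
})
numeric_cols = ["total_pages_visited", "age", "new_user"]
categorical_cols = ["country", "source"]

# Columns are selected by name, so no positional indices are needed
preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

model = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression()),
])

X_demo = demo[numeric_cols + categorical_cols]
model.fit(X_demo, demo["converted"])
predictions = model.predict(X_demo)
print(predictions)
```

With this layout, calling `model.predict` on new data applies the fitted encoder and scaler automatically, which reduces the risk of forgetting a preprocessing step on the test or submission data.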


In [9]:
# Label encoding
print("Encoding labels...")
print(Y_train[0:5])
encoder = LabelEncoder()
Y_train = encoder.fit_transform(Y_train)
print("...Done")
print(Y_train[0:5])

Encoding labels...
[0, 0, 0, 0, 0]
...Done
[0 0 0 0 0]


In [10]:
# Train model
print("Train model...")
classifier = LogisticRegression()
classifier.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


In [11]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = classifier.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]



In [12]:
# Missing values : nothing to impute here (every column has 284580 non-null values), so X_test is only displayed

print("Imputing missing values...")
print(X_test[0:5,:])

# Encoding categorical features and standardizing numerical features
print("Encoding categorical features and standardizing numerical features...")
print(X_test[0:5,:])


X_test = featureencoder.transform(X_test)
print("...Done")
print(X_test[0:5,:])


# Label encoding
print("Encoding labels...")
print(Y_test[0:5])
Y_test = encoder.transform(Y_test)
print("...Done")
print(Y_test[0:5])

Imputing missing values...
[[1 34 1 'UK' 'Ads']
 [5 32 0 'UK' 'Ads']
 [1 44 1 'US' 'Ads']
 [1 35 1 'US' 'Direct']
 [3 29 1 'US' 'Direct']]
Encoding categorical features and standardizing numerical features...
[[1 34 1 'UK' 'Ads']
 [5 32 0 'UK' 'Ads']
 [1 44 1 'US' 'Ads']
 [1 35 1 'US' 'Direct']
 [3 29 1 'US' 'Direct']]
...Done
[[ 0.          1.          0.          0.          0.         -1.15935344
   0.41385929  0.67651656]
 [ 0.          1.          0.          0.          0.          0.03743241
   0.1721316  -1.47816042]
 [ 0.          0.          1.          0.          0.         -1.15935344
   1.62249775  0.67651656]
 [ 0.          0.          1.          1.          0.         -1.15935344
   0.53472314  0.67651656]
 [ 0.          0.          1.          1.          0.         -0.56096051
  -0.19045994  0.67651656]]
Encoding labels...
[0, 0, 0, 0, 0]
...Done
[0 0 0 0 0]


In [13]:

# Predictions on test set
print("Predictions on test set...")
Y_test_pred = classifier.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

# Print scores
print("accuracy on training set : ", accuracy_score(Y_train, Y_train_pred))
print("accuracy on test set : ", accuracy_score(Y_test, Y_test_pred))
print()

print("f1-score on training set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))
print()




Predictions on test set...
...Done.
[0 0 0 ... 0 0 0]

accuracy on training set :  0.9863351254480287
accuracy on test set :  0.9857685009487666

f1-score on training set :  0.765543748586932
f1-score on test set :  0.7554347826086957
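
The accuracy is close to 99 % while the f1-score is around 0.76, which is typical of an imbalanced target. One possible lever to raise the f1-score is the decision threshold: `predict` classifies a user as converted when the predicted probability exceeds 0.5, but another cut-off may balance precision and recall better. The sketch below shows the idea on synthetic data (not the competition data); the threshold grid is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the preprocessed features
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=6, weights=[0.95, 0.05], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_val = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

# Evaluate the f1-score for a range of candidate thresholds
thresholds = np.linspace(0.1, 0.9, 17)
scores = [
    f1_score(y_val, (proba_val >= t).astype(int), zero_division=0)
    for t in thresholds
]
best_idx = int(np.argmax(scores))
best_threshold = thresholds[best_idx]

print("f1 with default predict() :", f1_score(y_val, clf.predict(X_val), zero_division=0))
print("best threshold :", best_threshold, " f1 :", scores[best_idx])
```

The threshold should be chosen on validation data (or by cross-validation) rather than on the data used for the final score estimate, otherwise the reported f1-score will be optimistic.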



In [14]:
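# Retrain the classifier on all the labelled data (train + test) before predicting on the unlabelled file
X = np.append(X_train,X_test,axis=0)
Y = np.append(Y_train,Y_test)

classifier.fit(X,Y)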

LogisticRegression()
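
For Part 4, the coefficients of a logistic regression can be matched with feature names to see which variables push the prediction towards a conversion. The sketch below is a minimal illustration on a tiny sample (first five rows from `X.head()`, last three made up), so the values it prints say nothing about the real dataset. The names are rebuilt by hand from the encoder's `categories_`, skipping the first category of each feature because of `drop='first'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny sample with the same columns as the dataset (last three rows are made up)
demo = pd.DataFrame({
    "total_pages_visited": [2, 3, 14, 3, 3, 16, 1, 12],
    "age": [22, 21, 20, 23, 28, 25, 40, 19],
    "new_user": [1, 1, 0, 1, 1, 0, 1, 0],
    "country": ["China", "UK", "Germany", "US", "US", "UK", "China", "US"],
    "source": ["Direct", "Ads", "Seo", "Seo", "Direct", "Seo", "Ads", "Direct"],
    "converted": [0, 0, 1, 0, 0, 1, 0, 1],
})
numeric_cols = ["total_pages_visited", "age", "new_user"]
categorical_cols = ["country", "source"]

# Same transformer order as in the notebook: categorical first, then numeric
preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])
X_encoded = preprocessor.fit_transform(demo[numeric_cols + categorical_cols])
clf = LogisticRegression().fit(X_encoded, demo["converted"])

# Rebuild the output column names: the first category of each feature is dropped
ohe = preprocessor.named_transformers_["cat"]
cat_names = [
    f"{col}_{level}"
    for col, levels in zip(categorical_cols, ohe.categories_)
    for level in levels[1:]
]
feature_names = cat_names + numeric_cols

# Sort the coefficients by absolute value, largest effect first
coefs = pd.Series(clf.coef_[0], index=feature_names)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

Since the numeric features are standardized, their coefficients are expressed per standard deviation, which makes their magnitudes roughly comparable. The coefficients of the dummy variables are relative to the dropped reference category (here `China` and `Ads`).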

In [15]:
# Read data without labels
data_without_labels = pd.read_csv('conversion_data_test.csv')
print('Prediction set (without labels) :', data_without_labels.shape)

# Warning : check consistency of features_list (must be the same as the features 
# used by your best classifier)
features_list = ['total_pages_visited','age','new_user','country','source']
X_without_labels = data_without_labels.loc[:, features_list]

# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_without_labels = X_without_labels.values
print("...Done")

print(X_without_labels[0:5,:])

Prediction set (without labels) : (31620, 5)
Convert pandas DataFrames to numpy arrays...
...Done
[[16 28 0 'UK' 'Seo']
 [5 22 1 'UK' 'Direct']
 [1 32 1 'China' 'Seo']
 [6 32 1 'US' 'Ads']
 [3 25 0 'China' 'Seo']]


In [16]:
# WARNING : PUT HERE THE SAME PREPROCESSING AS FOR YOUR TEST SET
# CHECK YOU ARE USING X_without_labels
print("Encoding categorical features and standardizing numerical features...")

X_without_labels = featureencoder.transform(X_without_labels)
print("...Done")
print(X_without_labels[0:5,:])

Encoding categorical features and standardizing numerical features...
...Done
[[ 0.          1.          0.          0.          1.          3.3285935
  -0.31132378 -1.47816042]
 [ 0.          1.          0.          1.          0.          0.03743241
  -1.03650686  0.67651656]
 [ 0.          0.          0.          0.          1.         -1.15935344
   0.1721316   0.67651656]
 [ 0.          0.          1.          0.          0.          0.33662888
   0.1721316   0.67651656]
 [ 0.          0.          0.          0.          1.         -0.56096051
  -0.67391532 -1.47816042]]


In [17]:
# Make predictions and dump to file
# WARNING : MAKE SURE THE FILE IS A CSV WITH ONE COLUMN NAMED 'converted' AND NO INDEX !
# WARNING : FILE NAME MUST HAVE FORMAT 'conversion_data_test_predictions_[name].csv'
# where [name] is the name of your team/model separated by a '-'
# For example : [name] = AURELIE-model1
data = {
    'converted': classifier.predict(X_without_labels)
}

Y_predictions = pd.DataFrame(columns=['converted'],data=data)
Y_predictions.to_csv('conversion_data_test_predictions_PA_model_4.csv', index=False)
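
Before sending the file, it can be worth reading it back to check that it follows the expected format. The sketch below is a self-contained illustration: the predictions are stand-in values, the file name uses a hypothetical team name (`EXAMPLE-model1`), and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in predictions; in the notebook they come from classifier.predict(X_without_labels)
predictions = np.array([1, 0, 0, 0, 0])

# Hypothetical team/model name, written to a temporary directory for the demo
file_name = "conversion_data_test_predictions_EXAMPLE-model1.csv"
path = os.path.join(tempfile.mkdtemp(), file_name)

pd.DataFrame({"converted": predictions}).to_csv(path, index=False)

# Read the file back and check the expected format
check = pd.read_csv(path)
assert list(check.columns) == ["converted"], "expected a single column named 'converted'"
assert len(check) == len(predictions), "expected one prediction per row of the test file"
assert set(check["converted"].unique()) <= {0, 1}, "expected only 0/1 predictions"
print("Submission file looks valid :", check.shape)
```

If `index=False` were forgotten, the file would contain an extra unnamed index column and the first check would fail.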
