### Objective of the notebook:

In this notebook, we build a machine learning pipeline that predicts the probability that a customer returns. We compare the ROC AUC and accuracy scores of six models and choose the best classifier.

- Import libraries & datasets
- Change datatypes and reduce memory usage
- Label encode categorical features (Features already encoded for this task.)
- Fill null values with the forward-filling method
- Add recency and the number of days since the first order
- Add year, month, week, day, and is-weekend features
- Convert raw data to a session format
- Compare model results

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta 

from sklearn import metrics
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost
import lightgbm 

import warnings
warnings.filterwarnings("ignore") 

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
break_point = datetime(2017, 2, 28)


### Importing datasets

In [2]:
def read_data():
    
    print('Reading files...')    
    order_df = pd.read_csv('../input/machine_learning_challenge_order_data.csv')
    print('Order data has {} rows and {} columns'.format(order_df.shape[0], order_df.shape[1]))
    label_df = pd.read_csv('../input/machine_learning_challenge_labeled_data.csv')
    print('Label data has {} rows and {} columns'.format(label_df.shape[0], label_df.shape[1]))
    df = order_df.merge(label_df, on='customer_id')
    print('The final data has {} rows and {} columns'.format(df.shape[0], df.shape[1]))
    print("")
    return df

### Change data types and reduce memory usage

In [3]:
def reduce_mem_usage(df, verbose=False):
    
    start_mem = df.memory_usage().sum() / 1024 ** 2
    int_columns = df.select_dtypes(include=["int"]).columns
    float_columns = df.select_dtypes(include=["float"]).columns

    for col in int_columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")

    for col in float_columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    print("")
    return df
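
The downcasting step above can be checked on a small frame. This is a minimal sketch with made-up columns (`count`, `amount`), not the challenge data. It assumes the values fit the smaller types; downcasting floats to `float32` trades some precision for memory.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; pandas defaults to 64-bit types.
df = pd.DataFrame({
    "count": np.arange(1000, dtype="int64"),
    "amount": np.linspace(0.0, 100.0, 1000),
})
before = df.memory_usage(deep=True).sum()

# Pick the smallest integer / float dtype that can hold the values.
df["count"] = pd.to_numeric(df["count"], downcast="integer")
df["amount"] = pd.to_numeric(df["amount"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print("bytes before: {}, after: {}".format(before, after))
```

Values from 0 to 999 fit in `int16`, and the float column becomes `float32`, so the frame shrinks noticeably.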

### Label encode categorical features

In [4]:
def transform_data(df):

    labelencoder = LabelEncoder()

    for i in ['restaurant_id', 'city_id', 'payment_id', 'platform_id', 'transmission_id']:
        df[i] = labelencoder.fit_transform(df[i])

    return df
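
To see what the encoding does, here is a small sketch on invented values. It fits a fresh `LabelEncoder` per column and keeps it in a dictionary (`encoders` is a hypothetical helper, not part of the notebook) so the integer codes can be mapped back later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample values; only the column names come from the notebook.
df = pd.DataFrame({
    "city_id": [501, 17, 501, 230],
    "platform_id": ["ios", "web", "web", "android"],
})

encoders = {}  # hypothetical: one fitted encoder per column
for col in ["city_id", "platform_id"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])  # codes follow sorted class order
    encoders[col] = enc

print(df)
print(encoders["platform_id"].classes_)
```

The codes are assigned in sorted order of the original values, so they carry no meaning beyond identity. Tree-based models usually tolerate this, while linear and distance-based models may read the codes as an ordering.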

### Convert raw data to a session format
Fill missing order ranks with the forward-filling method. As a baseline, we keep only the last record of each customer, on the assumption that it carries the most information, such as recency, order count, and cancellations.

In [5]:
def feature_engineering(df, break_point):

    # fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent
    df['customer_order_rank'] = df['customer_order_rank'].ffill()

    df['order_date'] = pd.to_datetime(df['order_date']) 
    df['recency'] = (break_point - df['order_date']) / np.timedelta64(1, 'D')
    # 'min' gives the earliest order even if rows are not sorted by date
    df['first_order_date'] = df.groupby(['customer_id'])['order_date'].transform('min')
    df['age_of_user'] = (break_point - df['first_order_date']) / np.timedelta64(1, 'D')

    df['year'] = df['order_date'].dt.year
    df['month'] = df['order_date'].dt.month
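    # Series.dt.week was removed in pandas 2.0; isocalendar().week replaces it
    df['week'] = df['order_date'].dt.isocalendar().week.astype(int)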
    df['day'] = df['order_date'].dt.day
    df['dayofweek'] = df['order_date'].dt.dayofweek
    df["is_weekend"] = df["dayofweek"].isin([5, 6]).astype(np.int8)
   
    df = df.groupby('customer_id').last().reset_index()
    return df
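
The session conversion above can be traced on a toy order log. This is a sketch under stated assumptions: the values are made up, and it differs from the function above in two ways. It sorts by customer and date, and it forward-fills within each customer so a rank is never carried over from a different customer.

```python
import pandas as pd

# Toy order log; column names mirror the notebook, values are invented.
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "b"],
    "order_date": pd.to_datetime(
        ["2016-01-10", "2016-03-05", "2015-12-01", "2016-02-20", "2017-01-15"]
    ),
    "customer_order_rank": [1.0, None, 1.0, 2.0, None],
})
break_point = pd.Timestamp(2017, 2, 28)

orders = orders.sort_values(["customer_id", "order_date"])

# Forward-fill per customer (an assumption; the notebook fills globally).
orders["customer_order_rank"] = (
    orders.groupby("customer_id")["customer_order_rank"].ffill()
)

# Days between each order and the break point.
orders["recency"] = (break_point - orders["order_date"]).dt.days

# Days between the customer's first order and the break point.
first_order = orders.groupby("customer_id")["order_date"].transform("min")
orders["age_of_user"] = (break_point - first_order).dt.days

# One row per customer: the most recent order.
sessions = orders.groupby("customer_id").last().reset_index()
print(sessions)
```

Note that `groupby(...).last()` returns the last non-null value of each column, so a customer's row can mix values from different orders when the final record has gaps.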

### Run models and compare results

In [6]:
classifiers = ['KNN', 'Decision Tree', 'RandomForestClassifier',
               'Logistic Regression', 'XGBoostClassifier', 'LGBMClassifier']

models = [
            KNeighborsClassifier(), 
            DecisionTreeClassifier(random_state=42),
            RandomForestClassifier(random_state=42), 
            LogisticRegression(random_state=42),
            xgboost.XGBClassifier(random_state=42), 
            lightgbm.LGBMClassifier(random_state=42)]

auc_mean=[]
auc_std=[]
acc_mean=[]
acc_std=[]

In [7]:
def run_models(df):
    
    y = df['is_returning_customer']
    X = df.drop(columns=['customer_id', 'order_date', 'is_returning_customer', 'first_order_date'])    
    
    for model in models:
        cv_result = cross_val_score(model, X, y, cv = kfold, scoring = "roc_auc")
        auc_mean.append(cv_result.mean())
        auc_std.append(cv_result.std())

        cv_result = cross_val_score(model, X, y, cv = kfold, scoring = "accuracy")
        acc_mean.append(cv_result.mean())
        acc_std.append(cv_result.std())

    results=round(pd.DataFrame({'Roc-Auc Mean':auc_mean,'Roc-Auc Std':auc_std,
                                'Accuracy Mean':acc_mean, 'Accuracy Std':acc_std}, 
                                 index=classifiers), 4)  
    print(results)
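
As a possible variation on the loop above, scikit-learn's `cross_validate` accepts several metrics at once, so each fold is fitted once instead of twice. The sketch below uses synthetic data from `make_classification` and only two of the six models. The `candidates` and `rows` names are illustrative, and the results are collected locally rather than in module-level lists.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer table; not the challenge data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
}

rows = {}
for name, model in candidates.items():
    # One call fits each fold once and scores it with both metrics.
    scores = cross_validate(model, X, y, cv=kfold,
                            scoring=["roc_auc", "accuracy"])
    rows[name] = {
        "Roc-Auc Mean": scores["test_roc_auc"].mean(),
        "Roc-Auc Std": scores["test_roc_auc"].std(),
        "Accuracy Mean": scores["test_accuracy"].mean(),
        "Accuracy Std": scores["test_accuracy"].std(),
    }

results = pd.DataFrame.from_dict(rows, orient="index").round(4)
print(results)
```

Keeping the scores in a local dictionary also means the function can be called more than once without the result lists growing out of step with the classifier names.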

### Execute the full pipeline

In [8]:
def execute_pipeline():
    
    df = read_data()
    df = reduce_mem_usage(df, True)
    df = transform_data(df)
    df = feature_engineering(df, break_point)
    run_models(df)
    
execute_pipeline()

Reading files...
Order data has 786600 rows and 13 columns
Label data has 245455 rows and 2 columns
The final data has 786600 rows and 14 columns

Mem. usage decreased to 42.76 Mb (52.5% reduction)

                        Roc-Auc Mean  Roc-Auc Std  Accuracy Mean  Accuracy Std
KNN                           0.7263       0.0010         0.7980        0.0015
Decision Tree                 0.6428       0.0021         0.7415        0.0020
RandomForestClassifier        0.7669       0.0018         0.8230        0.0007
Logistic Regression           0.8017       0.0017         0.8220        0.0007
XGBoostClassifier             0.8160       0.0014         0.8363        0.0009
LGBMClassifier                0.8164       0.0013         0.8366        0.0008
