### Objective of the notebook:

In this notebook, we create a machine learning model pipeline to predict probabilities and calculate model scores.
- Importing libraries
- Importing datasets
- Change datatypes and reduce memory usage
- Label encode categorical features (Features already encoded for this task.)
- Fill null values
- Convert raw data to a session format
- Run the xgboost model
- Calculate model scores

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta 

from sklearn import metrics
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost
import lightgbm 

import warnings
warnings.filterwarnings("ignore") 

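# random_state only has an effect when shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=5, shuffle=True, random_state=42)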
break_point = datetime(2017, 2, 28)

Note for macOS users: when installing LightGBM from PyPI with `pip install lightgbm`, you no longer need to install the gcc compiler.
Instead, you need the OpenMP library, which LightGBM requires on systems that use the Apple Clang compiler.
You can install it with the following command: `brew install libomp`.


### Importing datasets

In [2]:
def read_data():
    
    print('Reading files...')    
    order_df = pd.read_csv('../input/machine_learning_challenge_order_data.csv')
    print('Order data has {} rows and {} columns'.format(order_df.shape[0], order_df.shape[1]))
    label_df = pd.read_csv('../input/machine_learning_challenge_labeled_data.csv')
    print('Label data has {} rows and {} columns'.format(label_df.shape[0], label_df.shape[1]))
    df = order_df.merge(label_df, on='customer_id')
    print('The final data has {} rows and {} columns'.format(df.shape[0], df.shape[1]))
    return df
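
The merge above relies on the pandas default inner join, so every order row picks up the label of its customer. A minimal sketch on made-up data (the toy values and the `validate` check are illustrative assumptions, not part of the original pipeline):

```python
import pandas as pd

# Toy stand-ins for the order and label files (all values are made up).
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c"],
    "order_date": ["2016-01-01", "2016-02-01", "2016-01-15", "2016-03-01"],
})
labels = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "is_returning_customer": [1, 0, 0],
})

# validate="many_to_one" raises if a customer appears more than once in the label table,
# which would otherwise silently duplicate order rows.
merged = orders.merge(labels, on="customer_id", how="inner", validate="many_to_one")
print(merged.shape)
```

If the merged row count equals the order row count, as in the output of this notebook, every order found exactly one matching label.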

### Change data types and reduce memory usage

In [3]:
def reduce_mem_usage(df, verbose=False):
    
    start_mem = df.memory_usage().sum() / 1024 ** 2
    int_columns = df.select_dtypes(include=["int"]).columns
    float_columns = df.select_dtypes(include=["float"]).columns

    for col in int_columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")

    for col in float_columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
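
To see what the downcasting step does to individual columns, here is a small sketch on a made-up frame (column names and values are assumptions for illustration only):

```python
import pandas as pd

# Toy frame with one integer and one float column (values are made up).
toy = pd.DataFrame({"count": [1, 2, 3], "amount": [1.5, 2.25, 10.0]})

# pd.to_numeric picks the smallest dtype that can hold the column's values.
toy["count"] = pd.to_numeric(toy["count"], downcast="integer")
toy["amount"] = pd.to_numeric(toy["amount"], downcast="float")

after = toy.dtypes.astype(str).to_dict()
print(after)
```

Note that `downcast="float"` never goes below `float32`, so float columns shrink by at most half.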

### Label encode categorical features

In [4]:
def transform_data(df):

    labelencoder = LabelEncoder()

    for i in ['restaurant_id', 'city_id', 'payment_id', 'platform_id', 'transmission_id']:
        df[i] = labelencoder.fit_transform(df[i])

    return df
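
As a quick illustration of what `LabelEncoder` produces, the sketch below encodes a few made-up platform names (the string values are hypothetical; the real columns in this task are already encoded):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, used only to show the mapping.
platforms = ["web", "ios", "android", "ios"]

encoder = LabelEncoder()
codes = encoder.fit_transform(platforms)

# classes_ holds the sorted unique values; each code is an index into it.
print(list(encoder.classes_))
print(list(codes))
```

The integer codes follow the sorted order of the categories and carry no numeric meaning, which tree-based models usually tolerate better than linear ones.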

### Convert raw data to a session format
Fill missing order ranks with the forward-filling method. As a baseline, we keep only the last record of each customer, on the assumption that it carries the most information (recency, order count, cancellations, etc.).

For modeling, we chose the tree-based xgboost model because it is easy to implement, straightforward to deploy on cloud services, and generally performs well on tabular datasets.

In [5]:
def feature_engineering(df, break_point):

    # .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
    df['customer_order_rank'] = df['customer_order_rank'].ffill()
    df = df.groupby('customer_id').last().reset_index()
    return df
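
The following sketch shows the forward fill and the "last record per customer" step on a made-up frame. It assumes rows are already sorted chronologically within each customer; the `is_failed` column is a hypothetical stand-in for the other order attributes.

```python
import numpy as np
import pandas as pd

# Toy order history (all values are made up).
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b"],
    "customer_order_rank": [1.0, np.nan, 1.0, 2.0],
    "is_failed": [0, 1, 0, 0],
})

# Forward fill copies the previous rank into the missing slot.
toy["customer_order_rank"] = toy["customer_order_rank"].ffill()

# groupby(...).last() keeps the last non-null value of every column per customer.
last_rows = toy.groupby("customer_id").last().reset_index()
print(last_rows)
```

Because the fill runs over the whole column rather than per customer, a missing rank in the first row of one customer would inherit the value from the previous customer; grouping before filling would avoid that.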

### Run xgb model and calculate scores

In [6]:
from sklearn.linear_model import LogisticRegression

LogisticRegression(random_state=42)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [16]:
classifiers=['KNN', 'Decision Tree','RandomForestClassifier', 
             'Logistic Regression','XGBoostClassifier', 'LGBMClassifier']

models = [KNeighborsClassifier(), 
       DecisionTreeClassifier(random_state=42),
       RandomForestClassifier(random_state=42), 
       LogisticRegression(random_state=42),
       xgboost.XGBClassifier(random_state=42), 
       lightgbm.LGBMClassifier()]

def cross_val(X, Y):
    
    auc_mean=[]
    auc_std=[]
    acc_mean=[]
    acc_std=[]
    
    for model in models:
        cv_result = cross_val_score(model, X, Y, cv = kfold, scoring = "roc_auc")
        auc_mean.append(cv_result.mean())
        auc_std.append(cv_result.std())

        cv_result = cross_val_score(model, X, Y, cv = kfold, scoring = "accuracy")
        acc_mean.append(cv_result.mean())
        acc_std.append(cv_result.std())

    results=round(pd.DataFrame({'Roc-Auc Mean':auc_mean,'Roc-Auc Std':auc_std,
                                'Accuracy Mean':acc_mean, 'Accuracy Std':acc_std}, index=classifiers), 4)  
    print(results)

In [17]:
def run_xgb(df):
    
    y = df['is_returning_customer']
    X = df.drop(columns=['customer_id', 'order_date', 'is_returning_customer'])    
    
    cross_val(X,y)

    clf = xgboost.XGBClassifier(objective= 'binary:logistic', n_jobs= -1, scale_pos_weight=2)
    
    y_pred = cross_val_predict(clf, X, y, cv=kfold)
    
    print('Accuracy Score:  ',round(metrics.accuracy_score(y, y_pred), 2))
    print('Roc Auc Score:  ',round(roc_auc_score(y, y_pred), 2))
    print('Classification Report: \n', classification_report(y, y_pred, target_names=['0', '1']))
    return df
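
The function above passes hard 0/1 predictions to `roc_auc_score`. ROC AUC is normally computed from scores or probabilities, and `cross_val_predict` can return those through `method="predict_proba"`. The sketch below compares both variants on synthetic data; the dataset and the logistic regression model are stand-ins chosen for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic binary classification data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
clf_demo = LogisticRegression(max_iter=1000)

# Out-of-fold probability of the positive class for every sample.
proba = cross_val_predict(clf_demo, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]

# Thresholding at 0.5 gives hard labels comparable to the default predict() output.
hard_labels = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_demo, proba)
auc_from_labels = roc_auc_score(y_demo, hard_labels)
print(round(auc_from_proba, 3), round(auc_from_labels, 3))
```

The two numbers generally differ, since the label-based value reflects a single threshold rather than the full ranking of customers.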

### Execute all pipeline

In [None]:
def execute_pipeline():
    
    df = read_data()
    print()
    df = reduce_mem_usage(df, True)
    df = transform_data(df)
    df = feature_engineering(df, break_point)
    print()
    run_xgb(df)
    
execute_pipeline()

Reading files...
Order data has 786600 rows and 13 columns
Label data has 245455 rows and 2 columns
The final data has 786600 rows and 14 columns

Mem. usage decreased to 42.76 Mb (52.5% reduction)



### What is next? 

The accuracy and ROC AUC scores are 0.8 and 0.7, respectively.

We give more weight to class 1 using the `scale_pos_weight` parameter so that returning customers are labeled correctly more often. In other words, we prioritize ROC AUC over accuracy.
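
A common starting point for `scale_pos_weight`, suggested in the XGBoost documentation, is the ratio of negative to positive samples. A minimal sketch on made-up labels (the label values are an assumption, not the real class distribution):

```python
import pandas as pd

# Made-up labels: six non-returning customers, three returning ones.
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1])

n_negative = int((y_toy == 0).sum())
n_positive = int((y_toy == 1).sum())

# Candidate value for scale_pos_weight: negatives divided by positives.
suggested_weight = n_negative / n_positive
print(suggested_weight)
```

The value can then be treated as a hyperparameter and tuned around this ratio.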

We will add recency and time-related features for the new version and compare model results with the baseline model.