### Objective of the notebook:

In this notebook, we build a baseline machine learning pipeline that predicts returning customers and calculates model scores. The steps are:
- Importing libraries
- Importing datasets
- Change data types and reduce memory usage
- Label encode categorical features (Features already encoded for this task.)
- Fill null values
- Convert raw data to a session format
- Run the xgboost model
- Calculate model scores

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

import xgboost
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings("ignore") 

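# random_state only takes effect with shuffle=True (recent scikit-learn versions raise an error otherwise),
# so it is dropped here; the folds stay unshuffled, as in the original run.
kfold = KFold(n_splits=5)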
break_point = datetime(2017, 2, 28)

### Importing datasets

In [2]:
def read_data():
    
    print('Reading files...')    
    order_df = pd.read_csv('../input/machine_learning_challenge_order_data.csv')
    print('Order data has {} rows and {} columns'.format(order_df.shape[0], order_df.shape[1]))
    label_df = pd.read_csv('../input/machine_learning_challenge_labeled_data.csv')
    print('Label data has {} rows and {} columns'.format(label_df.shape[0], label_df.shape[1]))
    df = order_df.merge(label_df, on='customer_id')
    print('The final data has {} rows and {} columns'.format(df.shape[0], df.shape[1]))
    return df

### Change data types and reduce memory usage

In [3]:
def reduce_mem_usage(df, verbose=False):
    
    start_mem = df.memory_usage().sum() / 1024 ** 2
    int_columns = df.select_dtypes(include=["int"]).columns
    float_columns = df.select_dtypes(include=["float"]).columns

    for col in int_columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")

    for col in float_columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
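
To illustrate what the downcasting step does, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from the order data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the order data (hypothetical columns and values).
toy = pd.DataFrame({
    "order_count": np.array([1, 2, 300], dtype="int64"),
    "amount_paid": np.array([10.5, 3.25, 99.0], dtype="float64"),
})

before = toy.memory_usage(index=False).sum()

# downcast picks the smallest dtype that can still hold every value in the column.
toy["order_count"] = pd.to_numeric(toy["order_count"], downcast="integer")
toy["amount_paid"] = pd.to_numeric(toy["amount_paid"], downcast="float")

after = toy.memory_usage(index=False).sum()

print(toy.dtypes)
print("bytes before:", before, "bytes after:", after)
```

The integer column fits in 16 bits because its largest value is 300, and the float column drops from 64 to 32 bits.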

### Label encode categorical features

In [4]:
def transform_data(df):

    labelencoder = LabelEncoder()

    for i in ['restaurant_id', 'city_id', 'payment_id', 'platform_id', 'transmission_id']:
        df[i] = labelencoder.fit_transform(df[i])

    return df
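
The function above reuses a single `LabelEncoder`, so after the loop it only remembers the classes of the last column. If the original IDs need to be recovered later, one option is to keep one encoder per column. The sketch below shows this on made-up IDs; the `encoders` dictionary is an assumption for illustration, not part of the original pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up IDs for illustration only.
toy = pd.DataFrame({
    "city_id": [30231, 10346, 30231, 20326],
    "payment_id": [1779, 1619, 1619, 1779],
})

# Keep one fitted encoder per column so each mapping can be inverted later.
encoders = {}
for col in ["city_id", "payment_id"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)

# Map the integer codes back to the original city IDs.
original_cities = encoders["city_id"].inverse_transform(toy["city_id"])
print(original_cities)
```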

### Convert raw data to a session format
Fill missing order ranks by forward filling. As a baseline model, we keep only the last record of each customer.

In [5]:
def feature_engineering(df, break_point):

    # Forward-fill missing ranks (fillna(method='ffill') is deprecated in recent pandas)
    df['customer_order_rank'] = df['customer_order_rank'].ffill()
    df = df.groupby('customer_id').last().reset_index()
    return df
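
The effect of forward filling followed by `groupby(...).last()` can be seen on a small example. The rows below are invented, and the sketch assumes the data is already ordered by date within each customer, since `last()` depends on row order.

```python
import numpy as np
import pandas as pd

# Invented order history: two customers, one missing order rank.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "b"],
    "order_date": ["2016-01-05", "2016-03-01", "2015-11-20", "2016-02-02", "2016-06-30"],
    "customer_order_rank": [1.0, 2.0, 1.0, np.nan, 2.0],
})

# The missing rank takes the value of the row directly above it.
toy["customer_order_rank"] = toy["customer_order_rank"].ffill()

# One row per customer: the last non-null value of every column.
sessions = toy.groupby("customer_id").last().reset_index()
print(sessions)
```

Note that `last()` returns the last non-null value per column, so columns with trailing nulls may combine values from different rows of the same customer.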

### Run xgb model and calculate scores

In [6]:
def run_xgb(df):
    
    y = df['is_returning_customer']
    X = df.drop(columns=['customer_id', 'order_date', 'is_returning_customer'])    
    
    clf = xgboost.XGBClassifier(objective= 'binary:logistic', n_jobs= -1, scale_pos_weight=2)
    
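    # Out-of-fold class labels (0/1) from 5-fold cross-validation
    y_pred = cross_val_predict(clf, X, y, cv=kfold)
    
    print('Accuracy Score:  ',round(metrics.accuracy_score(y, y_pred), 2))
    # roc_auc_score receives hard labels here, so the value equals balanced accuracy
    print('Roc Auc Score:  ',round(roc_auc_score(y, y_pred), 2))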
    print('Classification Report: \n', classification_report(y, y_pred, target_names=['0', '1']))
    return df
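
Because `run_xgb` passes hard class labels to `roc_auc_score`, the reported value is the mean of the two class recalls rather than a ranking-based AUC. The sketch below shows the difference on invented scores. In the notebook, out-of-fold probabilities could presumably be obtained with `cross_val_predict(clf, X, y, cv=kfold, method='predict_proba')[:, 1]`, which would change the reported number.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Invented labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.55])

# Hard labels at a 0.5 threshold, comparable to what predict() returns.
y_label = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_label)   # only one threshold
bal_acc = balanced_accuracy_score(y_true, y_label)

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", bal_acc)
```

With hard labels the ROC curve has a single operating point, so the area under it reduces to the average of the two class recalls.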

### Execute the full pipeline

In [7]:
def execute_pipeline():
    
    df = read_data()
    print()
    df = reduce_mem_usage(df, True)
    df = transform_data(df)
    df = feature_engineering(df, break_point)
    print()
    run_xgb(df)
    
execute_pipeline()

Reading files...
Order data has 786600 rows and 13 columns
Label data has 245455 rows and 2 columns
The final data has 786600 rows and 14 columns

Mem. usage decreased to 42.76 Mb (52.5% reduction)

Accuracy Score:   0.8
Roc Auc Score:   0.7
Classification Report: 
               precision    recall  f1-score   support

           0       0.86      0.88      0.87    189948
           1       0.55      0.52      0.53     55507

   micro avg       0.80      0.80      0.80    245455
   macro avg       0.71      0.70      0.70    245455
weighted avg       0.79      0.80      0.79    245455



### What is next? 

The accuracy and ROC AUC scores are 0.8 and 0.7, respectively.

We give more weight to class 1 with the `scale_pos_weight` parameter so that returning customers are labeled more accurately. In other words, we prioritize ROC AUC over accuracy.
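
As a rough reference point for choosing `scale_pos_weight`, a commonly suggested starting value is the ratio of negative to positive examples. The sketch below computes that ratio from the support counts in the classification report above. It is a heuristic rather than the setting used in this notebook, which uses 2.

```python
# Support counts copied from the classification report above.
n_negative = 189948  # class 0
n_positive = 55507   # class 1

# Share of returning customers in the labeled data.
positive_share = n_positive / (n_negative + n_positive)

# Heuristic starting value for scale_pos_weight: negatives per positive.
ratio = n_negative / n_positive

print("positive share:", round(positive_share, 3))
print("negative/positive ratio:", round(ratio, 2))
```

The ratio comes out near 3.4, so a weight of 2 sits between no reweighting and full class balancing.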

In the next version, we will add recency and other time-related features and compare the results with this baseline model.