## Food Demand Forecasting Challenge (Practice Problem at Analytics Vidhya)
 
* About: https://datahack.analyticsvidhya.com/contest/genpact-machine-learning-hackathon-1/#About
* Data description: https://datahack.analyticsvidhya.com/contest/genpact-machine-learning-hackathon-1/#ProblemStatement
* Forecasting target: `num_orders`
* Evaluation: 100 * RMSLE (root mean squared logarithmic error)
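
Assuming the usual definition of RMSLE based on $\log(1 + x)$ (the convention scikit-learn's `mean_squared_log_error` also follows), the evaluation score for $n$ predictions $\hat{y}_i$ against actual order counts $y_i$ can be written as below. The contest page should be treated as the authority if its definition differs.

$$
\text{score} = 100 \times \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(1+\hat{y}_i) - \log(1+y_i)\right)^2}
$$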


## 2. Baseline model (LightGBM)

In [74]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.metrics import mean_squared_error, r2_score


%matplotlib inline
warnings.filterwarnings("ignore")

In [4]:
df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')
df_info_meal = pd.read_csv('./data/meal_info.csv')
df_info_fulfil = pd.read_csv('./data/fulfilment_center_info.csv')

In [5]:
df_train = pd.merge(df_train, df_info_fulfil, how="left",
                    left_on='center_id',
                    right_on='center_id')

df_train = pd.merge(df_train, df_info_meal, how='left',
                    left_on='meal_id',
                    right_on='meal_id')

df_test = pd.merge(df_test, df_info_fulfil, how="left",
                   left_on='center_id',
                   right_on='center_id')

df_test = pd.merge(df_test, df_info_meal, how='left',
                   left_on='meal_id',
                   right_on='meal_id')
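
The left joins above only keep the row count unchanged if each lookup table holds one row per key. A minimal sketch of how that assumption could be checked with the `validate` argument of `pd.merge`, using small made-up frames (the values are illustrative and not taken from the contest files):

```python
import pandas as pd

# Toy stand-ins for train.csv and fulfilment_center_info.csv (made-up values).
orders = pd.DataFrame({"center_id": [55, 55, 24],
                       "num_orders": [177, 270, 189]})
centers = pd.DataFrame({"center_id": [55, 24],
                        "center_type": ["TYPE_C", "TYPE_B"]})

# validate="many_to_one" raises pandas.errors.MergeError if center_id
# appears more than once in the right-hand table.
merged = pd.merge(orders, centers, how="left", on="center_id",
                  validate="many_to_one")

# A left join against a unique key leaves the number of rows unchanged.
assert len(merged) == len(orders)
print(merged)
```

When both frames share the key name, `on='center_id'` is equivalent to passing the same name to `left_on` and `right_on`.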

In [6]:
label_encoder = preprocessing.LabelEncoder()

In [7]:
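# Separate the features from the target; `id` is only a row identifier.
# drop() returns a new frame, which avoids pandas' SettingWithCopyWarning.
X = df_train.drop(columns=['id', 'num_orders'])
y = df_train['num_orders'].values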

In [8]:
c_col = X.columns[X.dtypes == 'object'].tolist()

In [9]:
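# Encode each string column as integer codes. The shared encoder is refit
# on every pass, so only the last column's mapping is retained afterwards.
for col in c_col:
    X[col] = label_encoder.fit_transform(X[col].astype(str))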
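
Only the training features are encoded here, and a single `LabelEncoder` that is refit per column cannot later be applied to `df_test`. One possible approach, sketched on made-up frames, is to keep one fitted encoder per column. The `encoders` dictionary is a hypothetical name and not part of the original notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames with made-up values standing in for X and df_test.
train = pd.DataFrame({"center_type": ["TYPE_A", "TYPE_C", "TYPE_B"],
                      "cuisine": ["Thai", "Indian", "Thai"]})
test = pd.DataFrame({"center_type": ["TYPE_B", "TYPE_A"],
                     "cuisine": ["Indian", "Thai"]})

encoders = {}  # hypothetical name: column -> fitted LabelEncoder
for col in ["center_type", "cuisine"]:
    enc = LabelEncoder()
    train[col] = enc.fit_transform(train[col].astype(str))
    encoders[col] = enc

# Reuse the fitted mappings on the test frame. transform() raises
# ValueError for labels that were not seen during fit.
for col, enc in encoders.items():
    test[col] = enc.transform(test[col].astype(str))

print(test)
```

This keeps the integer codes consistent between the two frames, which matters because LightGBM treats them as category identities.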

In [10]:
feature_names = X.columns.tolist()

In [34]:
X.head()

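|   | week | center_id | meal_id | checkout_price | base_price | emailer_for_promotion | homepage_featured | city_code | region_code | center_type | op_area | category | cuisine |
|---|------|-----------|---------|----------------|------------|-----------------------|-------------------|-----------|-------------|-------------|---------|----------|---------|
| 0 | 1 | 55 | 1885 | 136.83 | 152.29 | 0 | 0 | 647 | 56 | 2 | 2.0 | 0 | 3 |
| 1 | 1 | 55 | 1993 | 136.83 | 135.83 | 0 | 0 | 647 | 56 | 2 | 2.0 | 0 | 3 |
| 2 | 1 | 55 | 2539 | 134.86 | 135.86 | 0 | 0 | 647 | 56 | 2 | 2.0 | 0 | 3 |
| 3 | 1 | 55 | 2139 | 339.5 | 437.53 | 0 | 0 | 647 | 56 | 2 | 2.0 | 0 | 1 |
| 4 | 1 | 55 | 2631 | 243.5 | 242.5 | 0 | 0 | 647 | 56 | 2 | 2.0 | 0 | 1 |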


In [11]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)
    
# LightGBM dataset formatting 
lgtrain = lgb.Dataset(X_train, y_train,
                feature_name=feature_names,
                categorical_feature = c_col)

lgvalid = lgb.Dataset(X_valid, y_valid,
                feature_name=feature_names,
                categorical_feature = c_col)
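
`train_test_split` shuffles rows, so the validation set contains weeks that also appear in the training set. For a forecasting task, holding out the most recent weeks is a common alternative. The sketch below illustrates that idea on made-up data. `split_by_week` and `n_valid_weeks` are hypothetical names, and the code assumes `week` holds consecutive integer week numbers, as the `X.head()` output suggests:

```python
import pandas as pd

def split_by_week(df, n_valid_weeks):
    """Hold out the last n_valid_weeks weeks as validation data.

    Assumes `week` contains consecutive integer week numbers.
    """
    cutoff = df["week"].max() - n_valid_weeks
    train_part = df[df["week"] <= cutoff]
    valid_part = df[df["week"] > cutoff]
    return train_part, valid_part

# Toy frame with made-up values.
toy = pd.DataFrame({"week": [1, 1, 2, 3, 4, 4],
                    "num_orders": [177, 270, 189, 54, 40, 28]})
train_part, valid_part = split_by_week(toy, n_valid_weeks=1)

print(len(train_part), len(valid_part))
```

A validation score from such a split is usually a closer proxy for performance on future weeks than one from a random split, at the cost of training on slightly less recent data.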

In [72]:
params = {'learning_rate': 0.01, 
          'max_depth': 16, 
          'boosting': 'gbdt', 
          'objective': 'regression', 
          'metric': 'l2', 
          'is_training_metric': True, 
          'num_leaves': 100
         }

In [73]:
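# `early_stopping_rounds` and `verbose_eval` were removed from lgb.train in
# LightGBM 4.0; the callbacks below are the equivalent. The categorical
# columns are already declared on the Dataset objects.
model = lgb.train(params, lgtrain, num_boost_round=3000,
                  valid_sets=[lgvalid],
                  callbacks=[lgb.early_stopping(stopping_rounds=100),
                             lgb.log_evaluation(period=100)])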

Training until validation scores don't improve for 100 rounds
[100]	valid_0's l2: 58847.9
[200]	valid_0's l2: 38637.4
[300]	valid_0's l2: 31358.8
[400]	valid_0's l2: 27347.3
[500]	valid_0's l2: 25101.4
[600]	valid_0's l2: 23577
[700]	valid_0's l2: 22489.5
[800]	valid_0's l2: 21759.9
[900]	valid_0's l2: 21128.9
[1000]	valid_0's l2: 20634.1
[1100]	valid_0's l2: 20231
[1200]	valid_0's l2: 19881.4
[1300]	valid_0's l2: 19583.5
[1400]	valid_0's l2: 19319.8
[1500]	valid_0's l2: 19077.2
[1600]	valid_0's l2: 18870.4
[1700]	valid_0's l2: 18662.8
[1800]	valid_0's l2: 18482.5
[1900]	valid_0's l2: 18315.9
[2000]	valid_0's l2: 18163.8
[2100]	valid_0's l2: 18006.4
[2200]	valid_0's l2: 17847.7
[2300]	valid_0's l2: 17699.4
[2400]	valid_0's l2: 17573.5
[2500]	valid_0's l2: 17446.8
[2600]	valid_0's l2: 17348.9
[2700]	valid_0's l2: 17249.8
[2800]	valid_0's l2: 17146.7
[2900]	valid_0's l2: 17058.8
[3000]	valid_0's l2: 16975.7
Did not meet early stopping. Best iteration is:
[3000]	valid_0's l2: 16975.7


In [78]:
predict_train = model.predict(X_train)
predict_valid = model.predict(X_valid)  # predictions on the held-out validation split

In [79]:
mse = mean_squared_error(y_valid, predict_valid)
r2 = r2_score(y_valid, predict_valid)

In [80]:
print('Mean squared error: ', mse)
print('R2 score: ', r2)

Mean squared error:  16975.689965497197
R2 score:  0.889041073025336
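
The contest is scored on 100 * RMSLE, while the cells above report MSE and R2. A minimal sketch of how the contest metric could be computed on the validation split, assuming the usual `log(1 + x)` form of RMSLE. `rmsle_100` is a hypothetical helper name. Predictions are clipped at zero because a plain regression objective can return negative values, for which the logarithm is not meaningful:

```python
import numpy as np

def rmsle_100(y_true, y_pred):
    """Return 100 * RMSLE, clipping negative predictions to zero first."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return 100 * np.sqrt(np.mean(squared_log_error))

# Made-up values for illustration only.
y_true_demo = [10, 100, 50]
y_pred_demo = [12, 90, -3]
print(rmsle_100(y_true_demo, y_pred_demo))
```

In the notebook, the helper would be called with `y_valid` and the validation predictions from the cell above.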
