## Food Demand Forecasting Challenge(Practice Problem at analytics Vidhya)
 
* about : https://datahack.analyticsvidhya.com/contest/genpact-machine-learning-hackathon-1/#About) 
* data description : https://datahack.analyticsvidhya.com/contest/genpact-machine-learning-hackathon-1/#ProblemStatement
* forecasting target : num_orders
* Evaluation : 100 * RMSLE ( root of mean squared logarithmic error )


## 2. Baseline model(LGBM)

In [89]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")

In [90]:
df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')
df_info_meal = pd.read_csv('./data/meal_info.csv')
df_info_fulfil = pd.read_csv('./data/fulfilment_center_info.csv')

In [91]:
df_train = pd.merge(df_train, df_info_fulfil,
                    how="left",
                    left_on='center_id',
                    right_on='center_id')

df_train = pd.merge(df_train, df_info_meal,
                    how='left',
                    left_on='meal_id',
                    right_on='meal_id')

In [92]:
df_train.head()

Unnamed: 0,id,week,center_id,meal_id,checkout_price,base_price,emailer_for_promotion,homepage_featured,num_orders,city_code,region_code,center_type,op_area,category,cuisine
0,1379560,1,55,1885,136.83,152.29,0,0,177,647,56,TYPE_C,2.0,Beverages,Thai
1,1466964,1,55,1993,136.83,135.83,0,0,270,647,56,TYPE_C,2.0,Beverages,Thai
2,1346989,1,55,2539,134.86,135.86,0,0,189,647,56,TYPE_C,2.0,Beverages,Thai
3,1338232,1,55,2139,339.5,437.53,0,0,54,647,56,TYPE_C,2.0,Beverages,Indian
4,1448490,1,55,2631,243.5,242.5,0,0,40,647,56,TYPE_C,2.0,Beverages,Indian


In [93]:
df_train.head()

Unnamed: 0,id,week,center_id,meal_id,checkout_price,base_price,emailer_for_promotion,homepage_featured,num_orders,city_code,region_code,center_type,op_area,category,cuisine
0,1379560,1,55,1885,136.83,152.29,0,0,177,647,56,TYPE_C,2.0,Beverages,Thai
1,1466964,1,55,1993,136.83,135.83,0,0,270,647,56,TYPE_C,2.0,Beverages,Thai
2,1346989,1,55,2539,134.86,135.86,0,0,189,647,56,TYPE_C,2.0,Beverages,Thai
3,1338232,1,55,2139,339.5,437.53,0,0,54,647,56,TYPE_C,2.0,Beverages,Indian
4,1448490,1,55,2631,243.5,242.5,0,0,40,647,56,TYPE_C,2.0,Beverages,Indian


In [94]:
df_test = pd.merge(df_test, df_info_fulfil,
                   how="left",
                   left_on='center_id',
                   right_on='center_id')

df_test = pd.merge(df_test, df_info_meal,
                   how='left',
                   left_on='meal_id',
                   right_on='meal_id')

* numeric형 피쳐 중 emailer_for_promotion, meal_id 등 value값이 category형태와 유사한 피쳐들과 category형 피쳐 label encoding
* lightgbm은 카테고리형태의 타입도 인식하지만(자동 분기), 일반적으로 label encoding처리를 했을 경우 정확도가 좀더 높아 지는 것으로 알려져 있음

In [95]:
label_encode_columns = ['center_id', 
                        'meal_id', 
                        'emailer_for_promotion', 
                        'homepage_featured',                          
                        'city_code', 
                        'region_code', 
                        'op_area',
                        'center_type',
                        'category',
                        'cuisine'
                       ]

In [96]:
le = preprocessing.LabelEncoder()

for col in label_encode_columns:
    le.fit(df_train[col])
    df_train[col + '_encoded'] = le.transform(df_train[col])
    df_test[col + '_encoded'] = le.transform(df_test[col])

In [97]:
feature_name = [col for col in df_train.columns if col not in label_encode_columns]

In [98]:
feature_name

['id',
 'week',
 'checkout_price',
 'base_price',
 'num_orders',
 'center_id_encoded',
 'meal_id_encoded',
 'emailer_for_promotion_encoded',
 'homepage_featured_encoded',
 'city_code_encoded',
 'region_code_encoded',
 'op_area_encoded',
 'center_type_encoded',
 'category_encoded',
 'cuisine_encoded']

* 학습에 불필요한 "id"컬럼과 target 컬럼인 "num_orders"를 제외

In [99]:
feature_name.remove('id')
feature_name.remove('num_orders')

In [100]:
feature_name

['week',
 'checkout_price',
 'base_price',
 'center_id_encoded',
 'meal_id_encoded',
 'emailer_for_promotion_encoded',
 'homepage_featured_encoded',
 'city_code_encoded',
 'region_code_encoded',
 'op_area_encoded',
 'center_type_encoded',
 'category_encoded',
 'cuisine_encoded']

In [101]:
categorical_columns = ['center_id_encoded',
                       'meal_id_encoded',
                       'emailer_for_promotion_encoded',
                       'homepage_featured_encoded',
                       'city_code_encoded',
                       'region_code_encoded',
                       'op_area_encoded',
                       'center_type_encoded',
                       'category_encoded',
                       'cuisine_encoded']

In [102]:
numerical_columns = [col for col in feature_name if col not in categorical_columns]

In [103]:
numerical_columns

['week', 'checkout_price', 'base_price']

* train set을  80 :20(train : valid) 으로 나눠 학습실행

In [116]:
X = df_train[categorical_columns + numerical_columns]
y = df_train['num_orders']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                    test_size=0.02, 
                                                    shuffle=False)

In [117]:
params = {'boosting_type' : 'gbdt',
          'objective': 'regression',
          'num_leaves':100,
          'learning_rate':0.01,
          'n_estimators':3000,
          'max_depth':20,
          'metric':'rmse',
          }

### LightGBM Modeling

* Gradient Boosting Decision Tree
* Ensemble

참고 : https://lsjsj92.tistory.com/548


In [118]:
model = lgb.LGBMRegressor(**params)

In [119]:
model

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.01, max_depth=20,
              metric='rmse', min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=3000, n_jobs=-1, num_leaves=100,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

In [120]:
params_fit = {'early_stopping_rounds':100,
             'feature_name':numerical_columns+categorical_columns,
             'categorical_feature':categorical_columns,
             'eval_set':[(X_train,y_train), (X_valid, y_valid)]
             }

In [109]:
model.fit(X_train, y_train, **params_fit, verbose=200)

Training until validation scores don't improve for 100 rounds
[200]	training's rmse: 182.969	valid_1's rmse: 144.97
[400]	training's rmse: 145.825	valid_1's rmse: 130.535
[600]	training's rmse: 129.312	valid_1's rmse: 125.242
[800]	training's rmse: 120.197	valid_1's rmse: 123.393
[1000]	training's rmse: 113.595	valid_1's rmse: 122.486
[1200]	training's rmse: 108.151	valid_1's rmse: 121.363
[1400]	training's rmse: 103.609	valid_1's rmse: 120.508
[1600]	training's rmse: 100.08	valid_1's rmse: 119.866
[1800]	training's rmse: 97.0136	valid_1's rmse: 119.181
[2000]	training's rmse: 94.4662	valid_1's rmse: 118.784
[2200]	training's rmse: 92.1993	valid_1's rmse: 118.709
[2400]	training's rmse: 90.0399	valid_1's rmse: 118.331
[2600]	training's rmse: 88.0985	valid_1's rmse: 118.018
[2800]	training's rmse: 86.3033	valid_1's rmse: 117.683
[3000]	training's rmse: 84.7414	valid_1's rmse: 117.52
Did not meet early stopping. Best iteration is:
[3000]	training's rmse: 84.7414	valid_1's rmse: 117.52


LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.01, max_depth=20,
              metric='rmse', min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=3000, n_jobs=-1, num_leaves=100,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

In [113]:
predict_valid = model.predict(X_valid)

In [114]:
mse_valid = mean_squared_error(y_valid, predict_test)

In [115]:
print('Mean Squared Error: ', mse_valid)
print('Root Mean Squared Error: ', mse_valid**0.5)

Mean Squared Error:  14192.004815076787
Root Mean Squared Error:  119.13020110398868
