## Food Demand Forecasting Challenge(Practice Problem at analytics Vidhya)
 
* about : https://datahack.analyticsvidhya.com/contest/genpact-machine-learning-hackathon-1/#About) 
* data description : https://datahack.analyticsvidhya.com/contest/genpact-machine-learning-hackathon-1/#ProblemStatement
* forecasting target : num_orders
* Evaluation : 100 * RMSLE ( root of mean squared logarithmic error )


## 2. Baseline model(LGBM)

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")

In [2]:
df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')
df_info_meal = pd.read_csv('./data/meal_info.csv')
df_info_fulfil = pd.read_csv('./data/fulfilment_center_info.csv')

In [3]:
df_train = pd.merge(df_train, df_info_fulfil,
                    how="left",
                    left_on='center_id',
                    right_on='center_id')

df_train = pd.merge(df_train, df_info_meal,
                    how='left',
                    left_on='meal_id',
                    right_on='meal_id')

In [4]:
df_train.head()

Unnamed: 0,id,week,center_id,meal_id,checkout_price,base_price,emailer_for_promotion,homepage_featured,num_orders,city_code,region_code,center_type,op_area,category,cuisine
0,1379560,1,55,1885,136.83,152.29,0,0,177,647,56,TYPE_C,2.0,Beverages,Thai
1,1466964,1,55,1993,136.83,135.83,0,0,270,647,56,TYPE_C,2.0,Beverages,Thai
2,1346989,1,55,2539,134.86,135.86,0,0,189,647,56,TYPE_C,2.0,Beverages,Thai
3,1338232,1,55,2139,339.5,437.53,0,0,54,647,56,TYPE_C,2.0,Beverages,Indian
4,1448490,1,55,2631,243.5,242.5,0,0,40,647,56,TYPE_C,2.0,Beverages,Indian


In [5]:
df_train.head()

Unnamed: 0,id,week,center_id,meal_id,checkout_price,base_price,emailer_for_promotion,homepage_featured,num_orders,city_code,region_code,center_type,op_area,category,cuisine
0,1379560,1,55,1885,136.83,152.29,0,0,177,647,56,TYPE_C,2.0,Beverages,Thai
1,1466964,1,55,1993,136.83,135.83,0,0,270,647,56,TYPE_C,2.0,Beverages,Thai
2,1346989,1,55,2539,134.86,135.86,0,0,189,647,56,TYPE_C,2.0,Beverages,Thai
3,1338232,1,55,2139,339.5,437.53,0,0,54,647,56,TYPE_C,2.0,Beverages,Indian
4,1448490,1,55,2631,243.5,242.5,0,0,40,647,56,TYPE_C,2.0,Beverages,Indian


In [6]:
df_test = pd.merge(df_test, df_info_fulfil,
                   how="left",
                   left_on='center_id',
                   right_on='center_id')

df_test = pd.merge(df_test, df_info_meal,
                   how='left',
                   left_on='meal_id',
                   right_on='meal_id')

* numeric형 피쳐 중 emailer_for_promotion, meal_id 등 value값이 category형태와 유사한 피쳐들과 category형 피쳐 label encoding
* lightgbm은 카테고리형태의 타입도 인식하지만(자동 분기), 일반적으로 label encoding처리를 했을 경우 정확도가 좀더 높아 지는 것으로 알려져 있음

In [7]:
label_encode_columns = ['center_id', 
                        'meal_id', 
                        'emailer_for_promotion', 
                        'homepage_featured',                          
                        'city_code', 
                        'region_code', 
                        'op_area',
                        'center_type',
                        'category',
                        'cuisine'
                       ]

In [8]:
le = preprocessing.LabelEncoder()

for col in label_encode_columns:
    le.fit(df_train[col])
    df_train[col + '_encoded'] = le.transform(df_train[col])
    df_test[col + '_encoded'] = le.transform(df_test[col])

In [9]:
feature_name = [col for col in df_train.columns if col not in label_encode_columns]

In [10]:
feature_name

['id',
 'week',
 'checkout_price',
 'base_price',
 'num_orders',
 'center_id_encoded',
 'meal_id_encoded',
 'emailer_for_promotion_encoded',
 'homepage_featured_encoded',
 'city_code_encoded',
 'region_code_encoded',
 'op_area_encoded',
 'center_type_encoded',
 'category_encoded',
 'cuisine_encoded']

* 학습에 불필요한 "id"컬럼과 target 컬럼인 "num_orders"를 제외

In [11]:
feature_name.remove('id')
feature_name.remove('num_orders')

In [12]:
feature_name

['week',
 'checkout_price',
 'base_price',
 'center_id_encoded',
 'meal_id_encoded',
 'emailer_for_promotion_encoded',
 'homepage_featured_encoded',
 'city_code_encoded',
 'region_code_encoded',
 'op_area_encoded',
 'center_type_encoded',
 'category_encoded',
 'cuisine_encoded']

In [13]:
categorical_columns = ['center_id_encoded',
                       'meal_id_encoded',
                       'emailer_for_promotion_encoded',
                       'homepage_featured_encoded',
                       'city_code_encoded',
                       'region_code_encoded',
                       'op_area_encoded',
                       'center_type_encoded',
                       'category_encoded',
                       'cuisine_encoded']

In [14]:
numerical_columns = [col for col in feature_name if col not in categorical_columns]

In [15]:
numerical_columns

['week', 'checkout_price', 'base_price']

* train set을  80 :20(train : valid) 으로 나눠 학습실행

In [16]:
X = df_train[categorical_columns + numerical_columns]
y = df_train['num_orders']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                    test_size=0.02, 
                                                    shuffle=False)

In [17]:
params = {'boosting_type' : 'gbdt',
          'objective': 'regression',
          'num_leaves':100,
          'learning_rate':0.01,
          'n_estimators':3000,
          'max_depth':20,
          'metric':'rmse',
          }

### LightGBM Modeling

* Gradient Boosting Decision Tree
* Ensemble

참고 : https://lsjsj92.tistory.com/548


In [18]:
model = lgb.LGBMRegressor(**params)

In [19]:
model

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.01, max_depth=20,
              metric='rmse', min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=3000, n_jobs=-1, num_leaves=100,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

In [20]:
params_fit = {'early_stopping_rounds':100,
             'feature_name':numerical_columns+categorical_columns,
             'categorical_feature':categorical_columns,
             'eval_set':[(X_train,y_train), (X_valid, y_valid)]
             }

In [21]:
model.fit(X_train, y_train, **params_fit, verbose=200)

Training until validation scores don't improve for 100 rounds.
[200]	training's rmse: 182.969	valid_1's rmse: 144.97
[400]	training's rmse: 145.825	valid_1's rmse: 130.535
[600]	training's rmse: 129.312	valid_1's rmse: 125.242
[800]	training's rmse: 120.197	valid_1's rmse: 123.393
[1000]	training's rmse: 113.595	valid_1's rmse: 122.486
[1200]	training's rmse: 108.151	valid_1's rmse: 121.363
[1400]	training's rmse: 103.609	valid_1's rmse: 120.508
[1600]	training's rmse: 100.08	valid_1's rmse: 119.866
[1800]	training's rmse: 97.0136	valid_1's rmse: 119.181
[2000]	training's rmse: 94.4662	valid_1's rmse: 118.784
[2200]	training's rmse: 92.1993	valid_1's rmse: 118.709
[2400]	training's rmse: 90.0399	valid_1's rmse: 118.331
[2600]	training's rmse: 88.0985	valid_1's rmse: 118.018
[2800]	training's rmse: 86.3033	valid_1's rmse: 117.683
[3000]	training's rmse: 84.7414	valid_1's rmse: 117.52
Did not meet early stopping. Best iteration is:
[3000]	training's rmse: 84.7414	valid_1's rmse: 117.52


LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.01, max_depth=20,
              metric='rmse', min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=3000, n_jobs=-1, num_leaves=100,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

In [22]:
predict_valid = model.predict(X_valid)

In [24]:
mse_valid = mean_squared_error(y_valid, predict_valid)

In [25]:
print('Mean Squared Error: ', mse_valid)
print('Root Mean Squared Error: ', mse_valid**0.5)

Mean Squared Error:  14192.004815076787
Root Mean Squared Error:  119.13020110398868


In [67]:
X = df_test[categorical_columns + numerical_columns]

In [68]:
pred = model.predict(X)

In [94]:
submission_df = df_test.copy()
submission_df['num_orders'] = pred

In [95]:
submission_df = submission_df[['id', 'num_orders','meal_id']]

In [96]:
submission_df.head()

Unnamed: 0,id,num_orders,meal_id
0,1028232,184.511291,1885
1,1127204,193.711861,1993
2,1212707,163.131424,2539
3,1082698,49.456387,2631
4,1400926,24.381996,1248


In [97]:
submission_df[submission_df['num_orders']<1]['num_orders']

14      -31.568857
1014    -33.056594
1241    -46.477392
1328     -5.947542
1568     -6.146581
1823      0.591158
1960      0.540960
2147     -0.737236
2350    -11.889428
2621     -7.643872
2622    -53.112200
3254     -0.164572
3257    -10.959458
3991      0.334014
4199     -6.364614
4487    -20.227197
4497     -2.980351
4575     -6.956172
5282     -5.058088
5614     -7.469956
5618    -12.029585
7264    -12.605886
7487     -2.793627
7663     -7.184854
7874     -7.637569
7875    -12.376226
8111     -4.788759
8263     -3.510491
8687    -28.460293
8731     -1.123356
           ...    
17684    -2.926241
18250     0.411843
18771    -3.123259
19166   -16.944840
20167     0.088579
20245   -12.384202
20901   -16.624062
22007   -56.905560
22397   -19.891944
22916   -18.360263
23867     0.536298
24040   -19.271312
24630   -19.290655
24937     0.460769
25270    -2.250804
25534     0.232657
26785    -9.162945
27110   -14.635522
27188   -13.981654
27304    -1.123845
27429    -6.813376
27877    -2.

* 예측한 'num_orders'값 중에서 1보다 작거나 음수 값이 존재
* 음수 또는 1보다 작은 예측한 'num_order'값을, meal_id 별 평균 주문량으로 대체 

In [75]:
meal_id_mean = pd.DataFrame(df_train.groupby(['meal_id'])['num_orders'].mean())

In [76]:
meal_id_mean.reset_index(inplace=True)

In [80]:
meal_id_mean.columns = ['meal_id', 'num_orders_mean']

In [81]:
meal_id_mean.head()

Unnamed: 0,meal_id,num_orders_mean
0,1062,423.165574
1,1109,571.921412
2,1198,242.101759
3,1207,166.653341
4,1216,55.034966


In [89]:
meal_id_mean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 2 columns):
meal_id            51 non-null int64
num_orders_mean    51 non-null float64
dtypes: float64(1), int64(1)
memory usage: 896.0 bytes


In [98]:
submission_df = submission_df.merge(meal_id_mean, how='left', left_on = 'meal_id', right_on='meal_id')

In [101]:
for i in range(len(submission_df)):
    
    if submission_df['num_orders'][i] < 1:
        
        submission_df['num_orders'][i] = submission_df['num_orders_mean'][i]
        
    else :
        
        pass

In [103]:
submission_df = submission_df[['id', 'num_orders']]
submission_df.to_csv('submission.csv', index=False)

* 평균주문량으로 대체 시 score = 60.6642916161, Rank 1,344명중 521등, 상위 38%