## Food Demand Forecasting Challenge(Practice Problem at analytics Vidhya)
 
* about : https://datahack.analyticsvidhya.com/contest/genpact-machine-learning-hackathon-1/#About) 
* data description : https://datahack.analyticsvidhya.com/contest/genpact-machine-learning-hackathon-1/#ProblemStatement
* forecasting target : num_orders
* Evaluation : 100 * RMSLE ( root of mean squared logarithmic error )


## 3. Feature Engineering  & Parameter tuning

In [23]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import lightgbm as lgb
from sklearn.model_selection import train_test_split,ParameterGrid
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")

In [24]:
df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')
df_info_meal = pd.read_csv('./data/meal_info.csv')
df_info_fulfil = pd.read_csv('./data/fulfilment_center_info.csv')

In [25]:
df_train = pd.merge(df_train, df_info_fulfil,
                    how="left",
                    left_on='center_id',
                    right_on='center_id')

df_train = pd.merge(df_train, df_info_meal,
                    how='left',
                    left_on='meal_id',
                    right_on='meal_id')

In [26]:
df_train.head()

Unnamed: 0,id,week,center_id,meal_id,checkout_price,base_price,emailer_for_promotion,homepage_featured,num_orders,city_code,region_code,center_type,op_area,category,cuisine
0,1379560,1,55,1885,136.83,152.29,0,0,177,647,56,TYPE_C,2.0,Beverages,Thai
1,1466964,1,55,1993,136.83,135.83,0,0,270,647,56,TYPE_C,2.0,Beverages,Thai
2,1346989,1,55,2539,134.86,135.86,0,0,189,647,56,TYPE_C,2.0,Beverages,Thai
3,1338232,1,55,2139,339.5,437.53,0,0,54,647,56,TYPE_C,2.0,Beverages,Indian
4,1448490,1,55,2631,243.5,242.5,0,0,40,647,56,TYPE_C,2.0,Beverages,Indian


In [27]:
df_train.head()

Unnamed: 0,id,week,center_id,meal_id,checkout_price,base_price,emailer_for_promotion,homepage_featured,num_orders,city_code,region_code,center_type,op_area,category,cuisine
0,1379560,1,55,1885,136.83,152.29,0,0,177,647,56,TYPE_C,2.0,Beverages,Thai
1,1466964,1,55,1993,136.83,135.83,0,0,270,647,56,TYPE_C,2.0,Beverages,Thai
2,1346989,1,55,2539,134.86,135.86,0,0,189,647,56,TYPE_C,2.0,Beverages,Thai
3,1338232,1,55,2139,339.5,437.53,0,0,54,647,56,TYPE_C,2.0,Beverages,Indian
4,1448490,1,55,2631,243.5,242.5,0,0,40,647,56,TYPE_C,2.0,Beverages,Indian


In [28]:
df_test = pd.merge(df_test, df_info_fulfil,
                   how="left",
                   left_on='center_id',
                   right_on='center_id')

df_test = pd.merge(df_test, df_info_meal,
                   how='left',
                   left_on='meal_id',
                   right_on='meal_id')

(Feature Engineering) base_price와 checkout_price(할인 적용 가격)의 percentage(%)기준 차이값 컬럼과 그 차이가 10%이상일 경우 "UP", -10%이하 일경우  "Down", 10% 와 -10% 사이일 경우 'not change"인 범주형 컬럼을 생성

In [29]:
df_train['price_diff_percent'] = (df_train['base_price'] - df_train['checkout_price']) / df_train['base_price']
df_test['price_diff_percent'] =  (df_test['base_price'] - df_test['checkout_price']) / df_test['base_price']

In [30]:
df_train.loc[df_train['price_diff_percent'] > 0.1, 'big_diff'] = "UP"
df_train.loc[df_train['price_diff_percent'] < -0.1, 'big_diff'] = "DOWM"
df_train['big_diff'] = df_train['big_diff'].fillna("notchange")

df_test.loc[df_test['price_diff_percent'] > 0.1, 'big_diff'] = "UP"
df_test.loc[df_test['price_diff_percent'] < -0.1, 'big_diff'] = "DOWM"
df_test['big_diff'] = df_test['big_diff'].fillna("notchange")

In [31]:
label_encode_columns = ['center_id', 
                        'meal_id', 
                        'emailer_for_promotion', 
                        'homepage_featured',                          
                        'city_code', 
                        'region_code', 
                        'op_area',
                        'center_type',
                        'category',
                        'cuisine',
                        'big_diff'
                       ]

In [32]:
df_train['big_diff']

0                UP
1         notchange
2         notchange
3                UP
4         notchange
            ...    
456543    notchange
456544    notchange
456545           UP
456546           UP
456547    notchange
Name: big_diff, Length: 456548, dtype: object

In [33]:
le = preprocessing.LabelEncoder()

for col in label_encode_columns:
    le.fit(df_train[col])
    df_train[col + '_encoded'] = le.transform(df_train[col])
    df_test[col + '_encoded'] = le.transform(df_test[col])

In [34]:
feature_name = [col for col in df_train.columns if col not in label_encode_columns]

In [35]:
feature_name

['id',
 'week',
 'checkout_price',
 'base_price',
 'num_orders',
 'price_diff_percent',
 'center_id_encoded',
 'meal_id_encoded',
 'emailer_for_promotion_encoded',
 'homepage_featured_encoded',
 'city_code_encoded',
 'region_code_encoded',
 'op_area_encoded',
 'center_type_encoded',
 'category_encoded',
 'cuisine_encoded',
 'big_diff_encoded']

* 학습에 불필요한 "id"컬럼과 target 컬럼인 "num_orders"를 제외

In [36]:
feature_name.remove('id')
feature_name.remove('num_orders')

In [37]:
feature_name

['week',
 'checkout_price',
 'base_price',
 'price_diff_percent',
 'center_id_encoded',
 'meal_id_encoded',
 'emailer_for_promotion_encoded',
 'homepage_featured_encoded',
 'city_code_encoded',
 'region_code_encoded',
 'op_area_encoded',
 'center_type_encoded',
 'category_encoded',
 'cuisine_encoded',
 'big_diff_encoded']

In [38]:
categorical_columns = ['center_id_encoded',
                       'meal_id_encoded',
                       'emailer_for_promotion_encoded',
                       'homepage_featured_encoded',
                       'city_code_encoded',
                       'region_code_encoded',
                       'op_area_encoded',
                       'center_type_encoded',
                       'category_encoded',
                       'cuisine_encoded',
                       'big_diff_encoded'
                     ]

In [39]:
numerical_columns = [col for col in feature_name if col not in categorical_columns]

In [40]:
#numerical_columns.remove('checkout_price')
#numerical_columns.remove('base_price')

In [41]:
numerical_columns

['week', 'checkout_price', 'base_price', 'price_diff_percent']

* train set을  80 :20(train : valid) 으로 나눠 학습실행

(Feature Engineering) 'num_orders'의 값을 log transform

In [42]:
X = df_train[categorical_columns + numerical_columns]
#y = df_train['num_orders']
y = np.log1p(df_train['num_orders'])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                    test_size=0.02, 
                                                    shuffle=False)

* (parameter tuning) Grid search 를 통해 최적의 파라미터를 계산함
* {'colsample_bytree': 0.4, 'min_child_samples': 5, 'num_leaves': 127'}

In [44]:
params = {'boosting_type' : 'gbdt',
          'objective': 'regression',
          'num_leaves':127,
          'learning_rate':0.01,
          'colsample_bytree': 0.4,
          'min_child_samples': 5,
          'n_estimators':10000,
          'max_depth':20,
          'metric':'rmse',
          }

### LightGBM Modeling

* Gradient Boosting Decision Tree
* Ensemble

참고 : https://lsjsj92.tistory.com/548


In [45]:
model = lgb.LGBMRegressor(**params)

In [46]:
model

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=0.4,
              importance_type='split', learning_rate=0.01, max_depth=20,
              metric='rmse', min_child_samples=5, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=10000, n_jobs=-1, num_leaves=127,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

In [47]:
params_fit = {'early_stopping_rounds':100,
             'feature_name':numerical_columns+categorical_columns,
             'categorical_feature':categorical_columns,
             'eval_set':[(X_train,y_train), (X_valid, y_valid)]
             }

In [48]:
model.fit(X_train, y_train, **params_fit, verbose=500)

Training until validation scores don't improve for 100 rounds
[500]	training's rmse: 0.489794	valid_1's rmse: 0.538008
[1000]	training's rmse: 0.448842	valid_1's rmse: 0.510759
[1500]	training's rmse: 0.430766	valid_1's rmse: 0.501639
[2000]	training's rmse: 0.42062	valid_1's rmse: 0.497379
[2500]	training's rmse: 0.413151	valid_1's rmse: 0.494632
[3000]	training's rmse: 0.407727	valid_1's rmse: 0.493009
[3500]	training's rmse: 0.402698	valid_1's rmse: 0.49139
[4000]	training's rmse: 0.398055	valid_1's rmse: 0.490194
[4500]	training's rmse: 0.39421	valid_1's rmse: 0.489397
[5000]	training's rmse: 0.390514	valid_1's rmse: 0.488749
[5500]	training's rmse: 0.387084	valid_1's rmse: 0.488259
[6000]	training's rmse: 0.383807	valid_1's rmse: 0.487633
[6500]	training's rmse: 0.380644	valid_1's rmse: 0.486929
[7000]	training's rmse: 0.377476	valid_1's rmse: 0.486452
[7500]	training's rmse: 0.374249	valid_1's rmse: 0.485705
[8000]	training's rmse: 0.371152	valid_1's rmse: 0.485236
[8500]	trainin

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=0.4,
              importance_type='split', learning_rate=0.01, max_depth=20,
              metric='rmse', min_child_samples=5, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=10000, n_jobs=-1, num_leaves=127,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

In [49]:
X = df_test[categorical_columns + numerical_columns]

In [50]:
pred = model.predict(X)

In [51]:
pred = np.expm1(pred)

In [52]:
submission_df = df_test.copy()
submission_df['num_orders'] = pred
submission_df = submission_df[['id', 'num_orders']]
submission_df.to_csv('submission_log_numorders_price_diff_big_diff_best_param.csv', index=False)

In [53]:
submission_df[submission_df['num_orders']<1]

Unnamed: 0,id,num_orders


(Result)
* Baseline modeling : score = 60.6642916161, 1,344명중 521등, 상위 38% rank
* 1차 피쳐엔지니어링 & parameter tuning : score = 55.5030926505, 1,344명중 267등, 상위 19% rank

(NEXT)
* 2차 피쳐엔지니어링 또는 다른 모델 적용?

### (grid search for hyperparameter)

In [22]:
scores = []
params = []

param_grid = {'num_leaves': [31, 127, 255],
              'min_child_samples': [5, 10, 30],
              'colsample_bytree': [0.4, 0.6, 0.8]}

for i, g in enumerate(ParameterGrid(param_grid)):
    print("param grid {}/{}".format(i, len(ParameterGrid(param_grid)) - 1))
    #pprint.pprint(g)
    
    estimator = lgb.LGBMRegressor(learning_rate=0.01,
                              n_estimators=10000,
                              silent=False,
                              **g)
    
    fit_params = {'feature_name': categorical_columns + numerical_columns,
                  'categorical_feature': categorical_columns,
                  'eval_set': [(X_train, y_train), (X_valid, y_valid)]}

    estimator.fit(X_train, y_train, **fit_params, verbose=1000)
    
    scores.append(estimator.best_score_['valid_1']['l2'])
    params.append(g)


print("Best score = {}".format(np.min(scores)))
print("Best params =")
print(params[np.argmin(scores)])

param grid 0/26
[1000]	training's l2: 0.256479	valid_1's l2: 0.26472
[2000]	training's l2: 0.230352	valid_1's l2: 0.237476
[3000]	training's l2: 0.218373	valid_1's l2: 0.22785
[4000]	training's l2: 0.209887	valid_1's l2: 0.222574
[5000]	training's l2: 0.203698	valid_1's l2: 0.219712
[6000]	training's l2: 0.198551	valid_1's l2: 0.217401
[7000]	training's l2: 0.194274	valid_1's l2: 0.215379
[8000]	training's l2: 0.190545	valid_1's l2: 0.213849
[9000]	training's l2: 0.18741	valid_1's l2: 0.212658
[10000]	training's l2: 0.184416	valid_1's l2: 0.211811
param grid 1/26
[1000]	training's l2: 0.214608	valid_1's l2: 0.232245
[2000]	training's l2: 0.192685	valid_1's l2: 0.217901
[3000]	training's l2: 0.181583	valid_1's l2: 0.211864
[4000]	training's l2: 0.173773	valid_1's l2: 0.209293
[5000]	training's l2: 0.167615	valid_1's l2: 0.207727
[6000]	training's l2: 0.162517	valid_1's l2: 0.207206
[7000]	training's l2: 0.158296	valid_1's l2: 0.207418
[8000]	training's l2: 0.154429	valid_1's l2: 0.20699

[9000]	training's l2: 0.117538	valid_1's l2: 0.20784
[10000]	training's l2: 0.113199	valid_1's l2: 0.207976
param grid 15/26
[1000]	training's l2: 0.246994	valid_1's l2: 0.251855
[2000]	training's l2: 0.223297	valid_1's l2: 0.228997
[3000]	training's l2: 0.211787	valid_1's l2: 0.22228
[4000]	training's l2: 0.203539	valid_1's l2: 0.219373
[5000]	training's l2: 0.197071	valid_1's l2: 0.217544
[6000]	training's l2: 0.191929	valid_1's l2: 0.216092
[7000]	training's l2: 0.187736	valid_1's l2: 0.213979
[8000]	training's l2: 0.184016	valid_1's l2: 0.212673
[9000]	training's l2: 0.180859	valid_1's l2: 0.211741
[10000]	training's l2: 0.177673	valid_1's l2: 0.211276
param grid 16/26
[1000]	training's l2: 0.205954	valid_1's l2: 0.223076
[2000]	training's l2: 0.185406	valid_1's l2: 0.213557
[3000]	training's l2: 0.174096	valid_1's l2: 0.209346
[4000]	training's l2: 0.165914	valid_1's l2: 0.207944
[5000]	training's l2: 0.159386	valid_1's l2: 0.207702
[6000]	training's l2: 0.153974	valid_1's l2: 0.2