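# Week 6: ML Basics Practice Assignment

## 1. Problem Description

- In this assignment you will build a prediction model for a food delivery service (Baemin, Coupang Eats, and the like). The model predicts the time a food delivery takes. Accurate delivery-time predictions have a large effect on the user experience.

- A delivery that takes longer than predicted (under-prediction) is known to hurt the user experience twice as much as the opposite case (over-prediction).

- A good model therefore predicts values as close as possible to the actual delivery time while also minimizing under-predictions.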
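
One way to make the "twice as bad" statement concrete is an error measure that doubles the weight of under-predictions. The sketch below is only an illustration: the helper `asymmetric_mae` and its `under_weight` parameter are hypothetical names, and the assignment itself evaluates with plain MAE.

```python
import numpy as np


def asymmetric_mae(y_true, y_pred, under_weight=2.0):
    """MAE variant that weights under-predictions more heavily (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_true - y_pred  # positive: the delivery took longer than predicted
    weights = np.where(error > 0, under_weight, 1.0)
    return float(np.mean(weights * np.abs(error)))


# Same 10-minute miss in both directions
print(asymmetric_mae([30], [20]))  # under-prediction, counted twice
print(asymmetric_mae([30], [40]))  # over-prediction, counted once
```

With `under_weight=2.0`, a 10-minute under-prediction costs as much as a 20-minute over-prediction.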

## 2. Training/Test Data
- The file “predict_delivery_time.txt” has the following attributes.

- Restaurant: A unique ID that represents a restaurant.
- Location: The location of the restaurant.
- Cuisines: The cuisines offered by the restaurant.
- AverageCost: The average cost for one person/order.
- MinimumOrder: The minimum order amount.
- Rating: Customer rating for the restaurant.
- Votes: The total number of customer votes for the restaurant.
- Reviews: The number of customer reviews for the restaurant.
- DeliveryTime: The order delivery time of the restaurant. (Target)

- Use the Restaurant, Location, Cuisines, AverageCost, MinimumOrder, Rating, Votes, and Reviews attributes as model inputs. The prediction target is DeliveryTime.

- Randomly hold out 20% of this data as the test set and use the rest as training data.

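## 3. Hands-On Practice

### 3-1. Looking at the Big Picture

- Problem to solve: predict the time a food delivery takes.
    - That is, a supervised, regression, batch-learning problem
    - The evaluation metric is MAE

### 3-2. Loading and Cleaning the Data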

In [70]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

DATA_PATH = "./predict_delivery_time.csv"

In [71]:
def load_data(data_path=DATA_PATH):
    return pd.read_csv(data_path)

In [72]:
data = load_data()

In [73]:
data.head()

| | Restaurant | Location | Cuisines | AverageCost | MinimumOrder | Rating | Votes | Reviews | DeliveryTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ID6321 | FTI College, Law College Road, Pune | Fast Food, Rolls, Burger, Salad, Wraps | 200 | 50 | 3.5 | 12.0 | 4.0 | 30 |
| 1 | ID2882 | Sector 3, Marathalli | Ice Cream, Desserts | 100 | 50 | 3.5 | 11.0 | 4.0 | 30 |
| 2 | ID1595 | Mumbai Central | Italian, Street Food, Fast Food | 150 | 50 | 3.6 | 99.0 | 30.0 | 65 |
| 3 | ID5929 | Sector 1, Noida | Mughlai, North Indian, Chinese | 250 | 99 | 3.7 | 176.0 | 95.0 | 30 |
| 4 | ID6123 | Rmz Centennial, I Gate, Whitefield | Cafe, Beverages | 200 | 99 | 3.2 | 521.0 | 235.0 | 65 |


In [74]:
data.info()

# AverageCost and Rating contain outliers (non-numeric entries, hence dtype object)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11094 entries, 0 to 11093
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Restaurant    11094 non-null  object 
 1   Location      11094 non-null  object 
 2   Cuisines      11094 non-null  object 
 3   AverageCost   11094 non-null  object 
 4   MinimumOrder  11094 non-null  int64  
 5   Rating        9903 non-null   object 
 6   Votes         9020 non-null   float64
 7   Reviews       8782 non-null   float64
 8   DeliveryTime  11094 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 780.2+ KB


In [76]:
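def outlier_to_NaN(x):
    """Convert x to float; values that cannot be parsed become NaN."""
    try:
        return float(x)
    except (ValueError, TypeError):
        return np.nan


# Both columns were read as object dtype because they contain non-numeric entries
outlier_features = ["AverageCost", "Rating"]
for feature in outlier_features:
    data[feature] = data[feature].apply(outlier_to_NaN)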
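
The same cleaning step can probably be written without a custom function, since `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN. The sketch below uses a made-up toy column; the placeholder strings in it are assumptions and are not taken from the assignment data.

```python
import pandas as pd

# Toy column mixing numeric strings with placeholder text (illustrative values only)
raw = pd.Series(["200", "3.5", "NEW", "-", None])

# errors="coerce" replaces anything that cannot be parsed with NaN
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned)
print("missing after cleaning:", int(cleaned.isna().sum()))
```

Applied to the notebook, this would be one call per column in `outlier_features`.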

In [77]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11094 entries, 0 to 11093
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Restaurant    11094 non-null  object 
 1   Location      11094 non-null  object 
 2   Cuisines      11094 non-null  object 
 3   AverageCost   11069 non-null  float64
 4   MinimumOrder  11094 non-null  int64  
 5   Rating        9131 non-null   float64
 6   Votes         9020 non-null   float64
 7   Reviews       8782 non-null   float64
 8   DeliveryTime  11094 non-null  int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 780.2+ KB


### 3-3. Creating the Training / Test Datasets

In [78]:
from sklearn.model_selection import StratifiedShuffleSplit


# 1. Split the dataset (stratified on DeliveryTime so both sets keep the same label mix)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["DeliveryTime"]):
    strat_train_set, strat_test_set = data.loc[train_index], data.loc[test_index]

In [79]:
# strat_train_set.info()
# strat_test_set.info()

# data["DeliveryTime"].value_counts() / len(data)
strat_train_set["DeliveryTime"].value_counts() / len(strat_train_set)
# strat_test_set["DeliveryTime"].value_counts() / len(strat_test_set)

30     0.667606
45     0.240225
65     0.083155
120    0.005634
20     0.001803
80     0.001239
10     0.000338
Name: DeliveryTime, dtype: float64
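
To check that stratification preserved the label mix, the proportions of the full data, the training set, and the test set can be placed side by side. The sketch below runs on a small synthetic label column (an assumption, not the assignment data) so that it is self-contained; the names `labels`, `features`, and `comparison` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels with an imbalanced mix, loosely imitating DeliveryTime
labels = pd.Series([30] * 70 + [45] * 20 + [65] * 10, name="DeliveryTime")
features = pd.DataFrame({"x": np.arange(len(labels))})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(features, labels))

# One column of class proportions per subset, aligned on the label value
comparison = pd.DataFrame({
    "overall": labels.value_counts(normalize=True),
    "train": labels.iloc[train_idx].value_counts(normalize=True),
    "test": labels.iloc[test_idx].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

If the three columns differ noticeably, the split was probably not stratified on the intended column.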

In [80]:
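# Inspect correlations (numeric columns only; recent pandas versions raise an error on object columns)

corr_matrix = data.select_dtypes(include="number").corr()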
corr_matrix["DeliveryTime"].sort_values(ascending=False)

DeliveryTime    1.000000
MinimumOrder    0.254186
Votes           0.198534
AverageCost     0.179804
Reviews         0.170745
Rating          0.130792
Name: DeliveryTime, dtype: float64

In [81]:
# Separate the label column

train_set = strat_train_set.drop("DeliveryTime", axis=1)
train_set_labels = strat_train_set["DeliveryTime"].copy()

test_set = strat_test_set.drop("DeliveryTime", axis=1)
test_set_labels = strat_test_set["DeliveryTime"].copy()

# Estimator, Transformer, Predictor

In [82]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer


num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

In [83]:
from sklearn.preprocessing import OneHotEncoder


cat_pipeline = Pipeline([
    # Sparse output is the default; the `sparse` keyword was renamed `sparse_output` in scikit-learn 1.2
    ('cat_encoder', OneHotEncoder(handle_unknown='ignore')),
])

In [84]:
from sklearn.compose import ColumnTransformer


num_attribs = ["AverageCost", "MinimumOrder", "Rating", "Votes", "Reviews"]
cat_attribs = ["Restaurant", "Location", "Cuisines"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

train_set_prepared = full_pipeline.fit_transform(train_set)
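
Fitting the preprocessing once on the whole training set means that any later cross-validation sees imputation medians and scaling statistics computed from the validation folds as well. A possible alternative is to chain preprocessing and model into one `Pipeline`, so each fold refits everything. The sketch below uses a tiny synthetic frame; the data, the column subset, and the names `toy`, `preprocess`, and `model` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the delivery data (values are made up)
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "MinimumOrder": rng.choice([50.0, 99.0, np.nan], size=n),
    "Location": rng.choice(["A", "B", "C"], size=n),
})
target = rng.choice([30, 45, 65], size=n)

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]), ["MinimumOrder"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Location"]),
])

# Preprocessing and estimator are refit together inside every CV fold
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=42)),
])

scores = cross_val_score(model, toy, target,
                         scoring="neg_mean_absolute_error", cv=5)
print("MAE per fold:", -scores)
```

The trade-off is extra computation per fold in exchange for validation scores that do not depend on held-out rows.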

# Model Training

# Linear Regression

In [85]:
from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(train_set_prepared, train_set_labels)

LinearRegression()

In [86]:
from sklearn.metrics import mean_absolute_error


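train_set_predictions = lin_reg.predict(train_set_prepared)
# mean_absolute_error returns the MAE; the value shown below is its square root, not an RMSE
lin_mae = mean_absolute_error(train_set_labels, train_set_predictions)
lin_sqrt_mae = np.sqrt(lin_mae)
lin_sqrt_mae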

0.8817393886194105
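
The cells in this section take `np.sqrt` of a value returned by `mean_absolute_error`. That quantity is the square root of the MAE, which is neither the MAE named as the metric in 3-1 nor an RMSE. The small example below, on made-up numbers, shows how the three values differ.

```python
import numpy as np

# Made-up actual and predicted delivery times, in minutes
y_true = np.array([30.0, 45.0, 65.0, 30.0])
y_pred = np.array([30.0, 30.0, 45.0, 45.0])

errors = y_true - y_pred                 # 0, 15, 20, -15
mae = np.mean(np.abs(errors))            # mean absolute error, in minutes
rmse = np.sqrt(np.mean(errors ** 2))     # root of the mean squared error, in minutes
sqrt_mae = np.sqrt(mae)                  # root of the MAE; not in minutes

print("MAE:", mae)
print("RMSE:", rmse)
print("sqrt(MAE):", sqrt_mae)
```

To recover the MAE from a value reported as sqrt(MAE), square it.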

# DecisionTreeRegressor

In [87]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(train_set_prepared, train_set_labels)

DecisionTreeRegressor(random_state=42)

In [88]:
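train_set_predictions = tree_reg.predict(train_set_prepared)
# Square root of the training MAE (not an RMSE); 0.0 means the tree reproduces the training labels exactly
tree_mae = mean_absolute_error(train_set_labels, train_set_predictions)
tree_sqrt_mae = np.sqrt(tree_mae)
tree_sqrt_mae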

0.0

# RandomForestRegressor

In [89]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(train_set_prepared, train_set_labels)

RandomForestRegressor(n_estimators=10, random_state=42)

In [90]:
train_set_predictions = forest_reg.predict(train_set_prepared)
# Square root of the training MAE (not an RMSE)
forest_mae = mean_absolute_error(train_set_labels, train_set_predictions)
forest_sqrt_mae = np.sqrt(forest_mae)
forest_sqrt_mae

1.3327721823855887

# Evaluation Using Cross-Validation

In [91]:
from sklearn.model_selection import cross_val_score


def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

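## 1. Linear Regression

In [92]:
lin_scores = cross_val_score(lin_reg, train_set_prepared, train_set_labels, scoring="neg_mean_absolute_error", cv=10)
# The scores are negated MAE values; the numbers shown are sqrt(MAE) per fold, not RMSE
lin_sqrt_mae_scores = np.sqrt(-lin_scores)
display_scores(lin_sqrt_mae_scores)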

Scores: [2.79523021 2.79878511 2.6612319  2.7398431  2.68852647 2.78795571
 2.91693192 2.7904857  2.75560755 2.98305829]
Mean: 2.79176559548901
Standard deviation: 0.09167205006247353


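## 2. Decision Tree Regression

In [93]:
tree_scores = cross_val_score(tree_reg, train_set_prepared, train_set_labels, scoring="neg_mean_absolute_error", cv=10)
# The scores are negated MAE values; the numbers shown are sqrt(MAE) per fold, not RMSE
tree_sqrt_mae_scores = np.sqrt(-tree_scores)
display_scores(tree_sqrt_mae_scores)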

Scores: [2.11972716 2.131647   2.1210549  2.18770108 2.04675089 2.18119461
 2.16302848 2.05477419 2.17213054 2.22342758]
Mean: 2.1401436424566977
Standard deviation: 0.054123507921772916


## 3. Random Forest Regression

In [94]:
forest_scores = cross_val_score(forest_reg, train_set_prepared, train_set_labels, scoring="neg_mean_absolute_error", cv=10)
# The scores are negated MAE values; the numbers shown are sqrt(MAE) per fold, not RMSE
forest_sqrt_mae_scores = np.sqrt(-forest_scores)
display_scores(forest_sqrt_mae_scores)

Scores: [2.20946979 2.21430647 2.17259238 2.20129978 2.09769822 2.20407535
 2.27381546 2.08554374 2.2161903  2.31253142]
Mean: 2.1987522913425352
Standard deviation: 0.06556547778140517


# Fine-Tuning the Model

In [95]:
from sklearn.model_selection import GridSearchCV


param_grid = [
    {'max_depth': [None, 2, 4, 8, 16, 32, 64, 128, 256, 512]},
]

tree_reg = DecisionTreeRegressor(random_state=42)

grid_search = GridSearchCV(tree_reg, param_grid, cv=5, scoring='neg_mean_absolute_error', return_train_score=True)
grid_search.fit(train_set_prepared, train_set_labels)

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(random_state=42),
             param_grid=[{'max_depth': [None, 2, 4, 8, 16, 32, 64, 128, 256,
                                        512]}],
             return_train_score=True, scoring='neg_mean_absolute_error')

In [96]:
grid_search.best_params_

{'max_depth': None}

In [97]:
grid_search.best_estimator_

DecisionTreeRegressor(random_state=42)

In [98]:
cvres = grid_search.cv_results_
# mean_test_score holds negated MAE; the printed value is sqrt(MAE) for each max_depth
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

2.1697017384430555 {'max_depth': None}
2.720925869276132 {'max_depth': 2}
2.6823048992419576 {'max_depth': 4}
2.648374319757425 {'max_depth': 8}
2.5558754005263222 {'max_depth': 16}
2.4304393359055814 {'max_depth': 32}
2.2499899470764397 {'max_depth': 64}
2.17193790554536 {'max_depth': 128}
2.1697017384430555 {'max_depth': 256}
2.1697017384430555 {'max_depth': 512}
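
The grid search above scores candidates with plain MAE, which treats both error directions alike. One possible way to reflect the 2x penalty from the problem description is a custom scorer built with `make_scorer`. The sketch below runs on synthetic data; `asymmetric_mae`, its `under_weight` parameter, and the toy `X` and `y` are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor


def asymmetric_mae(y_true, y_pred, under_weight=2.0):
    """Absolute error with under-predictions weighted more (hypothetical helper)."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    weights = np.where(error > 0, under_weight, 1.0)
    return float(np.mean(weights * np.abs(error)))


# greater_is_better=False makes scikit-learn negate the loss so that higher is better
scorer = make_scorer(asymmetric_mae, greater_is_better=False)

# Synthetic regression data standing in for the prepared training set
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 30 + 10 * X[:, 0] + rng.normal(scale=2.0, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {"max_depth": [2, 4, None]},
    cv=5,
    scoring=scorer,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Because the scorer is negated, `-search.best_score_` is the weighted error of the selected depth.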


# Final Evaluation on the Test Set

In [99]:
final_model = grid_search.best_estimator_
test_set_prepared = full_pipeline.transform(test_set)
final_predictions = final_model.predict(test_set_prepared)

final_mae = mean_absolute_error(test_set_labels, final_predictions)
final_mae

4.513294276701217

In [100]:
final_predictions

array([30., 45., 30., ..., 30., 30., 30.])

In [101]:
test_set_labels

2361     30
9283     45
7226     30
9930     30
5291     30
         ..
3955     30
9429     30
10082    30
7370     30
8858     30
Name: DeliveryTime, Length: 2219, dtype: int64

In [103]:
# Under-prediction rate: share of test orders whose actual time exceeded the prediction

result = np.array(test_set_labels) > final_predictions
under_prediction = np.count_nonzero(result) / len(result)
under_prediction

0.13835060838215413
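
The under-prediction rate alone does not show how the remaining predictions split between over-predictions and exact hits. A small helper such as the hypothetical `prediction_breakdown` below reports all three shares; the sample values are made up.

```python
import numpy as np


def prediction_breakdown(y_true, y_pred):
    """Shares of under-, over-, and exact predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "under": float(np.mean(y_true > y_pred)),  # actual time longer than predicted
        "over": float(np.mean(y_true < y_pred)),   # actual time shorter than predicted
        "exact": float(np.mean(y_true == y_pred)),
    }


rates = prediction_breakdown([30, 45, 30, 65], [30, 30, 45, 65])
print(rates)
```

In the notebook it could be called as `prediction_breakdown(test_set_labels, final_predictions)`.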