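# Week 6: ML Basics Lab Assignment

## 1. Problem Description

- In this assignment you will build a prediction model for a food delivery service (Baemin, Coupang Eats, etc.). The value the model predicts is the time a food delivery takes. Predicting delivery time accurately has a large effect on the user experience.

- An actual delivery that takes longer than predicted (under-prediction) is known to hurt the user experience twice as much as the opposite case (over-prediction).

- A good model predicts values as close as possible to the actual delivery time while also minimizing under-predictions.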
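
The 2:1 penalty described above can be made concrete as an asymmetric absolute loss. The sketch below is only an illustration: the function name `asymmetric_abs_loss`, the `under_weight` argument, and the sample delivery times are assumptions, not part of the assignment. It searches for the best constant prediction under that loss and prints it next to the 2/3 quantile of the targets, which is where a 2:1 weighted absolute loss is expected to be minimized.

```python
import numpy as np


def asymmetric_abs_loss(y_true, y_pred, under_weight=2.0):
    """Mean absolute error in which under-predictions cost `under_weight` times more."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred                      # positive when the prediction is too short
    weights = np.where(err > 0, under_weight, 1.0)
    return float(np.mean(weights * np.abs(err)))


# Made-up delivery times in minutes (not taken from the data file)
y = np.array([30, 30, 30, 30, 45, 45, 65, 30, 45, 30, 30, 120])

# Try every constant prediction from 10 to 120 minutes and keep the cheapest one
candidates = np.arange(10, 121)
losses = [asymmetric_abs_loss(y, np.full(len(y), c)) for c in candidates]
best_constant = int(candidates[int(np.argmin(losses))])

print("best constant prediction:", best_constant)
print("2/3 quantile of targets:", np.quantile(y, 2 / 3))
```

One reading of this is that a model tuned only for plain MAE will tend toward the median, while the assignment's preference pushes predictions somewhat higher.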

## 2. Training/Test Data
- The file "predict_delivery_time.txt" has the following attributes.

- Restaurant: A unique ID that represents a restaurant.
- Location: The location of the restaurant.
- Cuisines: The cuisines offered by the restaurant.
- Average_Cost: The average cost for one person/order.
- Minimum_Order: The minimum order amount.
- Rating: Customer rating for the restaurant.
- Votes: The total number of customer votes for the restaurant.
- Reviews: The number of customer reviews for the restaurant.
- Delivery_Time: The order delivery time of the restaurant. (Target Classes) 

- Use the Restaurant, Location, Cuisines, AverageCost, MinimumOrder, Rating, Votes, and Reviews attributes as the model's input features. The learning target is DeliveryTime.

- Randomly sample 20% of this data as the test set and use the rest as the training set.

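## 3. Lab

### 3-1. Looking at the Big Picture

Problem to solve: predict the time a food delivery takes.
- That is, supervised learning, a regression problem, and batch learning
- Evaluation metrics: MAE, plus the under-prediction rate (number of under-predictions / size of the test set)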
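
The two evaluation metrics named above can be written down directly. This is a minimal sketch; the helper names `mae` and `under_prediction_rate` and the four sample values are assumptions made for illustration.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted delivery times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def under_prediction_rate(y_true, y_pred):
    """Number of under-predictions (predicted < actual) divided by the set size."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_pred < y_true))


# Made-up values: two of the four predictions are shorter than the actual time
y_true = [30, 45, 65, 30]
y_pred = [35, 40, 65, 20]

print("MAE:", mae(y_true, y_pred))
print("Under-prediction rate:", under_prediction_rate(y_true, y_pred))
```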

### 3-2. Loading and Cleaning the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

DATA_PATH = "./predict_delivery_time.csv"

In [2]:
def load_data(data_path=DATA_PATH):
    return pd.read_csv(data_path)

In [3]:
data = load_data()

In [4]:
data.head()

Unnamed: 0,Restaurant,Location,Cuisines,AverageCost,MinimumOrder,Rating,Votes,Reviews,DeliveryTime
0,ID6321,"FTI College, Law College Road, Pune","Fast Food, Rolls, Burger, Salad, Wraps",200,50,3.5,12.0,4.0,30
1,ID2882,"Sector 3, Marathalli","Ice Cream, Desserts",100,50,3.5,11.0,4.0,30
2,ID1595,Mumbai Central,"Italian, Street Food, Fast Food",150,50,3.6,99.0,30.0,65
3,ID5929,"Sector 1, Noida","Mughlai, North Indian, Chinese",250,99,3.7,176.0,95.0,30
4,ID6123,"Rmz Centennial, I Gate, Whitefield","Cafe, Beverages",200,99,3.2,521.0,235.0,65


In [5]:
data.info()

# AverageCost and Rating contain anomalous (non-numeric) values, which is why their dtype is object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11094 entries, 0 to 11093
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Restaurant    11094 non-null  object 
 1   Location      11094 non-null  object 
 2   Cuisines      11094 non-null  object 
 3   AverageCost   11094 non-null  object 
 4   MinimumOrder  11094 non-null  int64  
 5   Rating        9903 non-null   object 
 6   Votes         9020 non-null   float64
 7   Reviews       8782 non-null   float64
 8   DeliveryTime  11094 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 780.2+ KB
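
An `object` dtype on a column that should be numeric usually means some entries are not parseable as numbers. A hedged sketch of how such entries could be listed is shown below. The helper `non_numeric_values` and the toy frame are assumptions; of the toy values, only `NEW` and `1,200` appear elsewhere in this notebook, and `for` is invented.

```python
import pandas as pd


def non_numeric_values(series):
    """Return the distinct non-null entries that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")   # unparseable entries become NaN
    mask = parsed.isna() & series.notna()             # NaN after parsing, but not missing before
    return sorted(series[mask].astype(str).unique())


# Toy stand-in for the real data file
toy = pd.DataFrame({
    "AverageCost": ["200", "1,200", "for", "150"],
    "Rating": ["3.5", "NEW", None, "4.1"],
})

for col in toy.columns:
    print(col, non_numeric_values(toy[col]))
```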


# Creating the test dataset

In [68]:
from sklearn.model_selection import StratifiedShuffleSplit


# 1. Split the dataset (stratified on DeliveryTime)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["DeliveryTime"]):
    strat_train_set, strat_test_set = data.loc[train_index], data.loc[test_index]

In [73]:
# strat_train_set.info()
# strat_test_set.info()

# data["DeliveryTime"].value_counts() / len(data)
strat_train_set["DeliveryTime"].value_counts() / len(strat_train_set)
# strat_test_set["DeliveryTime"].value_counts() / len(strat_test_set)

30     0.667606
45     0.240225
65     0.083155
120    0.005634
20     0.001803
80     0.001239
10     0.000338
Name: DeliveryTime, dtype: float64

In [74]:
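# Inspect correlations (numeric columns only; object columns are excluded explicitly)

corr_matrix = data.select_dtypes(include="number").corr()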
corr_matrix["DeliveryTime"].sort_values(ascending=False)

DeliveryTime    1.000000
MinimumOrder    0.254186
Votes           0.198534
Reviews         0.170745
Name: DeliveryTime, dtype: float64

In [78]:
# Separate the label column

delivery = strat_train_set.drop("DeliveryTime", axis=1)
delivery_labels = strat_train_set["DeliveryTime"].copy()

delivery

Unnamed: 0,Restaurant,Location,Cuisines,AverageCost,MinimumOrder,Rating,Votes,Reviews
10457,ID8173,"Dockyard Road, Mumbai CST Area","North Indian, Chinese, South Indian, Fast Food",250,50,3.7,189.0,84.0
2566,ID1774,"Sector 63A,Gurgaon","Mithai, North Indian, Chinese",200,50,3.1,29.0,9.0
7727,ID752,"MG Road, Pune","Fast Food, Continental",200,50,4.3,133.0,88.0
2406,ID2664,"Rmz Centennial, I Gate, Whitefield","North Indian, Chinese",200,50,3.0,4.0,
6774,ID5261,"Yerawada, Pune, Maharashtra","Cafe, Fast Food",100,50,,,
...,...,...,...,...,...,...,...,...
6656,ID7220,"Sector 1, Noida",Biryani,100,50,NEW,,
6191,ID3161,"MG Road, Pune","Chinese, North Indian, South Indian",250,50,3.4,27.0,11.0
9608,ID6528,"FTI College, Law College Road, Pune","North Indian, Biryani, Seafood",350,50,3.5,30.0,17.0
7344,ID3801,"Sector 1, Noida",Bakery,150,99,,,


In [79]:
# Convert anomalous (non-numeric) values to NaN
def remove_outlier(x):
    try:
        return float(x)
    except ValueError:
        return None

    
for feature in ["AverageCost", "Rating", "Votes", "Reviews"]:
    delivery[feature] = delivery[feature].apply(remove_outlier)
    
delivery

Unnamed: 0,Restaurant,Location,Cuisines,AverageCost,MinimumOrder,Rating,Votes,Reviews
10457,ID8173,"Dockyard Road, Mumbai CST Area","North Indian, Chinese, South Indian, Fast Food",250.0,50,3.7,189.0,84.0
2566,ID1774,"Sector 63A,Gurgaon","Mithai, North Indian, Chinese",200.0,50,3.1,29.0,9.0
7727,ID752,"MG Road, Pune","Fast Food, Continental",200.0,50,4.3,133.0,88.0
2406,ID2664,"Rmz Centennial, I Gate, Whitefield","North Indian, Chinese",200.0,50,3.0,4.0,
6774,ID5261,"Yerawada, Pune, Maharashtra","Cafe, Fast Food",100.0,50,,,
...,...,...,...,...,...,...,...,...
6656,ID7220,"Sector 1, Noida",Biryani,100.0,50,,,
6191,ID3161,"MG Road, Pune","Chinese, North Indian, South Indian",250.0,50,3.4,27.0,11.0
9608,ID6528,"FTI College, Law College Road, Pune","North Indian, Biryani, Seafood",350.0,50,3.5,30.0,17.0
7344,ID3801,"Sector 1, Noida",Bakery,150.0,99,,,
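
Note that `float("1,200")` raises `ValueError`, so `remove_outlier` turns a value written with a thousands separator into NaN along with genuinely non-numeric entries such as `NEW`. The sketch below shows one possible alternative that keeps such values. The helper name `to_number` is an assumption, and this is not the method used in the cells above.

```python
import pandas as pd


def to_number(series):
    """Parse a column as numbers, removing thousands separators first.

    Entries that still cannot be parsed (for example "NEW") become NaN.
    """
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


# Toy column mixing clean values, a thousands separator, a text flag and a missing value
raw = pd.Series(["200", "1,200", "NEW", None])
parsed = to_number(raw)
print(parsed)
```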


In [80]:
# Fill missing values (NaN) with the median
from sklearn.impute import SimpleImputer


delivery_num = delivery[["AverageCost", "Rating", "Votes", "Reviews"]]
delivery.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8875 entries, 10457 to 1995
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Restaurant    8875 non-null   object 
 1   Location      8875 non-null   object 
 2   Cuisines      8875 non-null   object 
 3   AverageCost   8857 non-null   float64
 4   MinimumOrder  8875 non-null   int64  
 5   Rating        7291 non-null   float64
 6   Votes         7211 non-null   float64
 7   Reviews       7017 non-null   float64
dtypes: float64(4), int64(1), object(3)
memory usage: 624.0+ KB


In [81]:
imputer = SimpleImputer(strategy="median")
imputer.fit(delivery_num)
# imputer.statistics_

tmp = imputer.transform(delivery_num)
delivery_tr = pd.DataFrame(tmp, columns=delivery_num.columns, index=delivery.index)

In [83]:
incomplete_rows = delivery[delivery.isnull().any(axis=1)]
incomplete_rows = incomplete_rows[["AverageCost", "Rating", "Votes", "Reviews"]]
incomplete_rows

Unnamed: 0,AverageCost,Rating,Votes,Reviews
2406,200.0,3.0,4.0,
6774,100.0,,,
7276,250.0,,,
9928,300.0,,,
2255,150.0,,,
...,...,...,...,...
570,200.0,,,
4172,150.0,,,
2988,200.0,,,
6656,100.0,,,


In [84]:
delivery_num.loc[incomplete_rows.index.values]

Unnamed: 0,AverageCost,Rating,Votes,Reviews
2406,200.0,3.0,4.0,
6774,100.0,,,
7276,250.0,,,
9928,300.0,,,
2255,150.0,,,
...,...,...,...,...
570,200.0,,,
4172,150.0,,,
2988,200.0,,,
6656,100.0,,,


In [85]:
delivery_tr.loc[incomplete_rows.index.values]

Unnamed: 0,AverageCost,Rating,Votes,Reviews
2406,200.0,3.0,4.0,26.0
6774,100.0,3.6,62.0,26.0
7276,250.0,3.6,62.0,26.0
9928,300.0,3.6,62.0,26.0
2255,150.0,3.6,62.0,26.0
...,...,...,...,...
570,200.0,3.6,62.0,26.0
4172,150.0,3.6,62.0,26.0
2988,200.0,3.6,62.0,26.0
6656,100.0,3.6,62.0,26.0


In [86]:
for feature in ["AverageCost", "Rating", "Votes", "Reviews"]:
    delivery[feature] = delivery_tr[feature]

delivery.info()
delivery

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8875 entries, 10457 to 1995
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Restaurant    8875 non-null   object 
 1   Location      8875 non-null   object 
 2   Cuisines      8875 non-null   object 
 3   AverageCost   8875 non-null   float64
 4   MinimumOrder  8875 non-null   int64  
 5   Rating        8875 non-null   float64
 6   Votes         8875 non-null   float64
 7   Reviews       8875 non-null   float64
dtypes: float64(4), int64(1), object(3)
memory usage: 944.0+ KB


Unnamed: 0,Restaurant,Location,Cuisines,AverageCost,MinimumOrder,Rating,Votes,Reviews
10457,ID8173,"Dockyard Road, Mumbai CST Area","North Indian, Chinese, South Indian, Fast Food",250.0,50,3.7,189.0,84.0
2566,ID1774,"Sector 63A,Gurgaon","Mithai, North Indian, Chinese",200.0,50,3.1,29.0,9.0
7727,ID752,"MG Road, Pune","Fast Food, Continental",200.0,50,4.3,133.0,88.0
2406,ID2664,"Rmz Centennial, I Gate, Whitefield","North Indian, Chinese",200.0,50,3.0,4.0,26.0
6774,ID5261,"Yerawada, Pune, Maharashtra","Cafe, Fast Food",100.0,50,3.6,62.0,26.0
...,...,...,...,...,...,...,...,...
6656,ID7220,"Sector 1, Noida",Biryani,100.0,50,3.6,62.0,26.0
6191,ID3161,"MG Road, Pune","Chinese, North Indian, South Indian",250.0,50,3.4,27.0,11.0
9608,ID6528,"FTI College, Law College Road, Pune","North Indian, Biryani, Seafood",350.0,50,3.5,30.0,17.0
7344,ID3801,"Sector 1, Noida",Bakery,150.0,99,3.6,62.0,26.0


# Estimator, Transformer, Predictor

In [90]:
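from sklearn.preprocessing import OneHotEncoder

# Select the categorical columns from the DataFrame (not just a list of their names)
delivery_cat = delivery[["Restaurant", "Location", "Cuisines"]]
cat_encoder = OneHotEncoder()
delivery_cat_1hot = cat_encoder.fit_transform(delivery_cat)
delivery_cat_1hot

# Output recorded from the original run, which encoded the list of column names itself (hence 1x3):
<1x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>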

In [94]:
from sklearn.base import BaseEstimator, TransformerMixin


class CombinedAttributesAdder():
    pass
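
The class above is left as a stub. For reference, a custom transformer that plugs into a scikit-learn `Pipeline` generally needs `fit` and `transform` methods. The sketch below is hypothetical: the class name `ReviewRatioAdder`, the reviews-per-vote feature, and the column positions are all assumptions chosen for illustration, not something the assignment prescribes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ReviewRatioAdder(BaseEstimator, TransformerMixin):
    """Append reviews / votes as an extra column (hypothetical feature)."""

    def __init__(self, votes_ix=2, reviews_ix=3):
        # Column positions assume the order AverageCost, Rating, Votes, Reviews
        self.votes_ix = votes_ix
        self.reviews_ix = reviews_ix

    def fit(self, X, y=None):
        return self  # nothing to learn; the transformer is stateless

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        votes = X[:, self.votes_ix]
        reviews = X[:, self.reviews_ix]
        # Leave the ratio at 0 wherever votes is 0 to avoid division by zero
        ratio = np.divide(reviews, votes, out=np.zeros_like(reviews), where=votes > 0)
        return np.c_[X, ratio]


# Two made-up rows: the second has no votes
X_demo = np.array([[200.0, 3.5, 12.0, 4.0],
                   [100.0, 3.6, 0.0, 0.0]])
X_out = ReviewRatioAdder().fit_transform(X_demo)
print(X_out)
```

Inheriting from `TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params`/`set_params`, which is what lets such a class take part in a grid search.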

In [96]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    # ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

In [160]:
from sklearn.compose import ColumnTransformer


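# Note: num_attribs omits MinimumOrder, so ColumnTransformer drops that column (remainder="drop" by default)
num_attribs = list(delivery_num)
cat_attribs = ["Restaurant", "Location", "Cuisines"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    # handle_unknown="ignore" lets transform() accept categories not seen during fit (e.g. in the test set)
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])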

delivery_prepared = full_pipeline.fit_transform(delivery)

delivery_prepared

<8875x8332 sparse matrix of type '<class 'numpy.float64'>'
	with 62125 stored elements in Compressed Sparse Row format>
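
Because the one-hot columns are learned from the training rows only, a category that first appears at prediction time needs explicit handling. The toy example below is a sketch under assumed data (the three-row frame and its values are made up). It includes `MinimumOrder` among the numeric columns and uses `handle_unknown="ignore"`, so an unseen location is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training and test rows
train = pd.DataFrame({
    "Location": ["Pune", "Noida", "Pune"],
    "MinimumOrder": [50, 99, 50],
    "Votes": [12.0, None, 99.0],
})
test = pd.DataFrame({"Location": ["Mumbai"], "MinimumOrder": [50], "Votes": [30.0]})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

prep = ColumnTransformer(
    [
        ("num", num_pipeline, ["MinimumOrder", "Votes"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Location"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

X_train = prep.fit_transform(train)
X_test = prep.transform(test)   # "Mumbai" was never seen during fit

print(X_train.shape, X_test.shape)
print(X_test[0, 2:])            # one-hot part for the unseen location
```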

# Training models

# Linear regression

In [167]:
from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(delivery_prepared, delivery_labels)

LinearRegression()

In [168]:
lin_reg.coef_

array([-0.84162521,  0.30991093,  8.45620258, ...,  5.03334011,
       13.71678158, -3.46492841])

In [169]:
# Predictions for a few samples

some_data = delivery.head()
some_labels = delivery_labels.head()
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared).round(decimals=1))

Predictions: [45.  30.  29.7 45.7 30. ]


In [170]:
print("Labels:", list(some_labels))

Labels: [45, 30, 30, 45, 30]


In [171]:
# Measure RMSE on the full training set
from sklearn.metrics import mean_squared_error


delivery_predictions = lin_reg.predict(delivery_prepared)
lin_mse = mean_squared_error(delivery_labels, delivery_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

2.9314264819459086

# DecisionTreeRegressor

In [172]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(delivery_prepared, delivery_labels)

DecisionTreeRegressor(random_state=42)

In [173]:
delivery_predictions = tree_reg.predict(delivery_prepared)
tree_mse = mean_squared_error(delivery_labels, delivery_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0

# RandomForestRegressor

In [174]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(delivery_prepared, delivery_labels)

RandomForestRegressor(random_state=42)

In [175]:
delivery_predictions = forest_reg.predict(delivery_prepared)
forest_mse = mean_squared_error(delivery_labels, delivery_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

3.5181517634954176

# Evaluation using cross-validation

In [176]:
from sklearn.model_selection import cross_val_score


def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

## 1. Linear regression

In [177]:
lin_scores = cross_val_score(lin_reg, delivery_prepared, delivery_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [13.62244532 13.48284314 12.03263418 13.40621338 11.36678296 13.16619669
 13.38141562 12.63874916 12.84822178 15.596999  ]
Mean: 13.154250123024076
Standard deviation: 1.0593342406067587


## 2. Decision tree regression

In [178]:
tree_scores = cross_val_score(tree_reg, delivery_prepared, delivery_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)
display_scores(tree_rmse_scores)

Scores: [10.52934552 11.13112874 11.2105163  11.48956636  9.07302705 11.18160028
 11.83311235  9.69663873 11.50829317 11.5973406 ]
Mean: 10.925056910286568
Standard deviation: 0.8497497272184871


## 3. Random forest regression

In [None]:
forest_scores = cross_val_score(forest_reg, delivery_prepared, delivery_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
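
The cells above score with RMSE, while section 3-1 names MAE and the under-prediction rate as the evaluation metrics. The sketch below shows how both could be passed to `cross_val_score`. It runs on synthetic data with a plain `LinearRegression`, and the `under_prediction_rate` helper is an assumption, so the numbers it prints say nothing about the delivery dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score


def under_prediction_rate(y_true, y_pred):
    """Fraction of samples whose prediction is shorter than the actual time."""
    return float(np.mean(np.asarray(y_pred) < np.asarray(y_true)))


# Lower is better, so greater_is_better=False makes scikit-learn negate the value
under_scorer = make_scorer(under_prediction_rate, greater_is_better=False)

# Synthetic regression data standing in for the prepared features and labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 30 + 10 * X[:, 0] + rng.normal(scale=5, size=200)

model = LinearRegression()
mae_scores = -cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5)
under_scores = -cross_val_score(model, X, y, scoring=under_scorer, cv=5)

print("MAE per fold:", mae_scores.round(2))
print("Under-prediction rate per fold:", under_scores.round(2))
```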

# Fine-tuning the model

In [None]:
from sklearn.model_selection import GridSearchCV


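# Note: the outputs recorded below list n_estimators up to 100 and max_features of 2, 3 and 4,
# so they appear to come from a run with a different grid than the one defined here.
param_grid = [
    {'n_estimators': [10, 20, 50], 'max_features': [2, 4, 6, 7]},
    {'bootstrap': [False], 'n_estimators': [10, 20, 50], 'max_features': [2, 4, 6, 7]},
]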

forest_reg = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(delivery_prepared, delivery_labels)

In [142]:
grid_search.best_params_

{'max_features': 2, 'n_estimators': 100}

In [143]:
grid_search.best_estimator_

RandomForestRegressor(max_features=2, random_state=42)

In [144]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

12.618525427249807 {'max_features': 2, 'n_estimators': 2}
11.342966923666337 {'max_features': 2, 'n_estimators': 5}
10.924658866308256 {'max_features': 2, 'n_estimators': 10}
10.694212134507566 {'max_features': 2, 'n_estimators': 20}
10.498894235066444 {'max_features': 2, 'n_estimators': 50}
10.42540299180319 {'max_features': 2, 'n_estimators': 100}
12.63815249140807 {'max_features': 3, 'n_estimators': 2}
11.398872649560431 {'max_features': 3, 'n_estimators': 5}
10.933156927408698 {'max_features': 3, 'n_estimators': 10}
10.70249452659193 {'max_features': 3, 'n_estimators': 20}
10.526967756195457 {'max_features': 3, 'n_estimators': 50}
10.463927611576361 {'max_features': 3, 'n_estimators': 100}
12.70515057152944 {'max_features': 4, 'n_estimators': 2}
11.535697641979777 {'max_features': 4, 'n_estimators': 5}
11.010766441039852 {'max_features': 4, 'n_estimators': 10}
10.751513347079001 {'max_features': 4, 'n_estimators': 20}
10.602625325825352 {'max_features': 4, 'n_estimators': 50}
10.53

# Final evaluation on the test dataset

In [146]:
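final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("DeliveryTime", axis=1)
y_test = strat_test_set["DeliveryTime"].copy()

# Apply the same cleaning as the training set before transforming
for feature in ["AverageCost", "Rating", "Votes", "Reviews"]:
    X_test[feature] = X_test[feature].apply(remove_outlier)

# Categories unseen during training also require OneHotEncoder(handle_unknown="ignore") in full_pipeline
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

final_mse

# Output recorded from the original run, where X_test was not cleaned before the transform:
ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: '1,200'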
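
An error of this kind can occur when a cleaning step is applied to the training frame by hand but not to the test frame. One way to rule that out is to move the cleaning inside the pipeline, so `fit` and `predict` always go through the same steps. The following is a sketch on made-up data, not the notebook's method: `coerce_numeric`, the column subset, and the toy rows are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def coerce_numeric(df):
    """Strip thousands separators and turn unparseable entries into NaN."""
    return df.apply(
        lambda col: pd.to_numeric(
            col.astype(str).str.replace(",", "", regex=False), errors="coerce"
        )
    )


num_cols = ["AverageCost", "Rating"]
cat_cols = ["Location"]

num_pipeline = Pipeline([
    ("coerce", FunctionTransformer(coerce_numeric)),   # cleaning lives inside the pipeline
    ("imputer", SimpleImputer(strategy="median")),
])

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", num_pipeline, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=42)),
])

# Made-up rows containing the kinds of raw strings seen in this notebook
train = pd.DataFrame({
    "AverageCost": ["200", "1,200", "150", "100"],
    "Rating": ["3.5", "NEW", "4.0", "3.0"],
    "Location": ["Pune", "Noida", "Pune", "Noida"],
})
y_train = [30, 65, 30, 45]
test = pd.DataFrame({"AverageCost": ["1,000"], "Rating": ["-"], "Location": ["Mumbai"]})
y_test = [45]

model.fit(train, y_train)
pred = model.predict(test)   # raw strings and an unseen location are both handled
print("prediction:", pred, "MAE:", mean_absolute_error(y_test, pred))
```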