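## Sacred for Repeated Experiments
- [Sacred Github](https://github.com/IDSIA/sacred)
- What is Sacred?
    - "Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA"
    - A tool that stores and manages your settings while you build machine learning models
- Why it is needed
    - Questions that come up often on Kaggle
        - Which features were used?
        - Which parameters were used?
        - What was the result?
    - When running many experiments quickly, you need a tool that records them automatically instead of by hand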

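- Main mechanisms of Sacred
    - ConfigScopes: define settings conveniently as the local variables of a function, using the @ex.config decorator
    - Config Injection: settings can be accessed from any function
    - Command-line interface: run the experiment with changed parameters from the command line
    - Observers: all information about an experiment is passed to observers, which store it (MongoDB, S3, etc.)
    - Automatic seeding: helps control randomness in experiments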
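
The idea behind config injection can be sketched in plain Python. This is only a conceptual illustration, not Sacred's actual implementation: the `capture` decorator, the `CONFIG` dict, and `describe_model` are hypothetical names made up for this example.

```python
import inspect

# Stand-in for an experiment configuration (hypothetical values)
CONFIG = {"fit_intercept": True, "normalize": False}


def capture(func):
    """Fill in arguments the caller did not pass, using values from CONFIG."""
    signature = inspect.signature(func)

    def wrapper(*args, **kwargs):
        # Find out which parameters the caller already supplied
        bound = signature.bind_partial(*args, **kwargs)
        for name in signature.parameters:
            if name not in bound.arguments and name in CONFIG:
                kwargs[name] = CONFIG[name]
        return func(*args, **kwargs)

    return wrapper


@capture
def describe_model(fit_intercept, normalize):
    return f"fit_intercept={fit_intercept}, normalize={normalize}"


print(describe_model())                     # both values come from CONFIG
print(describe_model(fit_intercept=False))  # an explicit argument wins over CONFIG
```

Explicitly passed arguments take priority, so a function can still be called normally while the rest is taken from the stored settings.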

In [1]:
!pip3 install sacred

Collecting sacred
  Downloading sacred-0.8.1.tar.gz (90 kB)

### Running Linear Regression with Sacred

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.linear_model import LinearRegression
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
from ipywidgets import interact
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
import os
from numpy.random import permutation
from sklearn import svm, datasets
from sacred import Experiment
from sacred.observers import FileStorageObserver

plt.style.use('ggplot')
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

PROJECT_ID='nyc-taxi-demand' 

In [3]:
ex = Experiment('nyc-taxi-demand-prediction', interactive=True)

# Create experiment_dir if it does not exist, then store runs there with FileStorageObserver
experiment_dir = os.path.join('./', 'experiments')
os.makedirs(experiment_dir, exist_ok=True)
# FileStorageObserver.create(...) is deprecated in Sacred 0.8, so the constructor is used instead
ex.observers.append(FileStorageObserver(experiment_dir))

### Preprocessing

In [4]:
%%time
query = """
WITH base_data AS 
(
  SELECT nyc_taxi.*, gis.* EXCEPT (zip_code_geom)
  FROM (
    SELECT *
    FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2015`
    WHERE 
        EXTRACT(MONTH from pickup_datetime) = 1
        and pickup_latitude  <= 90 and pickup_latitude >= -90
    ) AS nyc_taxi
  JOIN (
    SELECT zip_code, state_code, state_name, city, county, zip_code_geom
    FROM `bigquery-public-data.geo_us_boundaries.zip_codes`
    WHERE state_code='NY'
    ) AS gis 
  ON ST_CONTAINS(zip_code_geom, st_geogpoint(pickup_longitude, pickup_latitude))
)

SELECT 
    zip_code,
    DATETIME_TRUNC(pickup_datetime, hour) as pickup_hour,
    EXTRACT(MONTH FROM pickup_datetime) AS month,
    EXTRACT(DAY FROM pickup_datetime) AS day,
    CAST(format_datetime('%u', pickup_datetime) AS INT64) -1 AS weekday,
    EXTRACT(HOUR FROM pickup_datetime) AS hour,
    CASE WHEN CAST(FORMAT_DATETIME('%u', pickup_datetime) AS INT64) IN (6, 7) THEN 1 ELSE 0 END AS is_weekend,
    COUNT(*) AS cnt
FROM base_data 
GROUP BY zip_code, pickup_hour, month, day, weekday, hour, is_weekend
ORDER BY pickup_hour
"""

base_df = pd.read_gbq(query=query, dialect='standard', project_id=PROJECT_ID)

Downloading: 100%|██████████████████| 87020/87020 [00:07<00:00, 11761.97rows/s]

Wall time: 29.4 s





### Feature Engineering

In [6]:
import numpy as np

# One-Hot Encoding
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(base_df[['zip_code']])
ohe_output = enc.transform(base_df[['zip_code']]).toarray()
ohe_df = pd.concat([base_df, pd.DataFrame(ohe_output, columns='zip_code_'+enc.categories_[0])], axis=1)
ohe_df['log_cnt'] = np.log10(ohe_df['cnt'])
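
The `log_cnt` column is a base-10 log of `cnt`. A small sketch with the standard library shows the transform and its inverse, which would be needed to convert predictions made in log space back to counts. The three sample counts are taken from the tables in this notebook; the variable names are only for illustration.

```python
import math

counts = [1, 10, 25]

# Forward transform, same as np.log10 applied to the cnt column
log_counts = [math.log10(c) for c in counts]

# Inverse transform: raise 10 to the logged value to recover the count
restored = [10 ** v for v in log_counts]

for c, v, r in zip(counts, log_counts, restored):
    print(c, round(v, 6), round(r, 6))
```

Because `cnt` comes from `COUNT(*)` over grouped rows, it is at least 1 here, so the log is always defined. Data that can contain zeros would need a different treatment.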

In [7]:
def split_train_and_test_period(df, period):
    """
    Split a DataFrame into train_df and test_df.
    
    df : time-series DataFrame
    period : length of the test period in days (integer, e.g. 3 -> last 3 days)
    """
    criteria = max(df['pickup_hour']) - pd.Timedelta(days=period)  # cutoff timestamp
    # .copy() so that later column deletions do not act on a slice of df
    train_df = df[df['pickup_hour'] <= criteria].copy()
    test_df = df[df['pickup_hour'] > criteria].copy()
    return train_df, test_df

### Train / Test Split

In [8]:
train_df, test_df = split_train_and_test_period(ohe_df, 7)

In [9]:
train_df.tail()

| | zip_code | pickup_hour | month | day | weekday | hour | is_weekend | cnt | zip_code_10001 | zip_code_10002 | ... | zip_code_12729 | zip_code_12771 | zip_code_13029 | zip_code_13118 | zip_code_13656 | zip_code_13691 | zip_code_14072 | zip_code_14527 | zip_code_14801 | log_cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 68046 | 10468 | 2015-01-24 23:00:00 | 1 | 24 | 5 | 23 | 1 | 1 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 68047 | 10069 | 2015-01-24 23:00:00 | 1 | 24 | 5 | 23 | 1 | 18 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.255273 |
| 68048 | 11216 | 2015-01-24 23:00:00 | 1 | 24 | 5 | 23 | 1 | 27 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.431364 |
| 68049 | 10034 | 2015-01-24 23:00:00 | 1 | 24 | 5 | 23 | 1 | 4 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.60206 |
| 68050 | 11368 | 2015-01-24 23:00:00 | 1 | 24 | 5 | 23 | 1 | 3 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.477121 |


- Drop the columns that will not be used

In [10]:
del train_df['zip_code']
del train_df['pickup_hour']
del test_df['zip_code']
del test_df['pickup_hour']

In [11]:
train_df.head(2)

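| | month | day | weekday | hour | is_weekend | cnt | zip_code_10001 | zip_code_10002 | zip_code_10003 | zip_code_10004 | ... | zip_code_12729 | zip_code_12771 | zip_code_13029 | zip_code_13118 | zip_code_13656 | zip_code_13691 | zip_code_14072 | zip_code_14527 | zip_code_14801 | log_cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 3 | 0 | 0 | 25 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.39794 |
| 1 | 1 | 1 | 3 | 0 | 0 | 10 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |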


In [12]:
y_train_raw = train_df.pop('cnt')
y_train_log = train_df.pop('log_cnt')
y_test_raw = test_df.pop('cnt')
y_test_log = test_df.pop('log_cnt')

In [13]:
y_true = y_test_raw.values.copy()

In [14]:
x_train = train_df.copy()
x_test = test_df.copy()

In [15]:
def evaluation(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    score = pd.DataFrame([mape, mae, mse], index=['mape', 'mae', 'mse'], columns=['score']).T
    return score
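
The three metrics computed by `evaluation` can be checked by hand on a tiny example. The sketch below uses only built-in Python and made-up numbers (`y_true` and `y_pred` here are illustrative, not taken from the taxi data).

```python
# Made-up values for illustration only
y_true = [10, 20, 40]
y_pred = [12, 18, 50]
n = len(y_true)

# Per-sample errors: [-2, 2, -10]
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAPE: mean of |error / true value|, as a percentage
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100
# MAE: mean of |error|
mae = sum(abs(e) for e in errors) / n
# MSE: mean of error squared
mse = sum(e * e for e in errors) / n

print(round(mape, 2), round(mae, 2), round(mse, 2))
```

MAPE divides by the true value, so it is undefined when a true value is 0 and grows quickly when true values are small relative to the error.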

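### Experiment Setup
- The experiment was created above with ex = Experiment('nyc-taxi-demand-prediction', interactive=True); ex.config stores its settings
- ex.capture wraps a function so that its arguments are filled in from those settings
- ex.main marks the function that runs when the experiment is executed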

In [16]:
@ex.config
def config():
    fit_intercept=True
    normalize=False

In [17]:
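@ex.capture
def get_model(fit_intercept, normalize):
    # Pass by keyword: newer scikit-learn versions accept these as keyword arguments only.
    # Note: the normalize parameter was removed in scikit-learn 1.2.
    return LinearRegression(fit_intercept=fit_intercept, normalize=normalize)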

In [18]:
# _log and _run can be used as function arguments without being defined separately
@ex.main
def run(_log, _run):
    lr_reg = get_model()
    lr_reg.fit(x_train, y_train_raw)   # fit the model
    pred = lr_reg.predict(x_test)      # predict
    # write a message to the log
    _log.info("Predict End")
    _run.log_scalar('model_name', lr_reg.__class__.__name__)  
    
    score = evaluation(y_test_raw, pred)   # compute model performance
        
    # To store under Metrics, use the call below (recommended; saved locally, and can also be saved to a DB depending on the setup)
    _run.log_scalar('metrics', score)
    
    # To store under Result, return the value (saved locally, and can also be saved to a DB depending on the setup)
    return score.to_dict()

In [19]:
experiment_result = ex.run()

INFO - nyc-taxi-demand-prediction - Running command 'run'
INFO - nyc-taxi-demand-prediction - Started run with ID "1"
INFO - run - Predict End
INFO - nyc-taxi-demand-prediction - Result: {'mape': {'score': 213892460970.38083}, 'mae': {'score': 2138924693.5531514}, 'mse': {'score': 4.821285840609414e+21}}
INFO - nyc-taxi-demand-prediction - Completed after 0:00:02


In [20]:
experiment_result.config

{'fit_intercept': True, 'normalize': False, 'seed': 770910353}
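
The `seed` entry in the config above was not set in the config function; it was added automatically. A minimal sketch of why recording it matters, using only Python's `random` module rather than Sacred itself (`noisy_experiment` is a hypothetical stand-in for any code that draws random numbers):

```python
import random


def noisy_experiment(seed, n=5):
    """Draw n pseudo-random numbers from a generator initialised with seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


seed = 770910353  # the seed value recorded in the config above

first = noisy_experiment(seed)
second = noisy_experiment(seed)     # same seed: identical draws
other = noisy_experiment(seed + 1)  # different seed: different draws

print(first == second)
print(first == other)
```

With the seed stored next to the other settings, a run that depends on randomness can be repeated later with the same draws. Sacred's documentation describes fixing the value through the configuration, for example with `with seed=...` on the command line.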

### Parser for Inspecting Experiments
- Which function to use depends on how the Experiment logged its results
    - 1) Metrics stored with \_run.log\_scalar: recommended
    - 2) Results returned from the @ex.main function

In [59]:
import json
from io import StringIO  # the f-strings below already require Python 3, so no Python 2 fallback is needed


# 1) When metrics are stored with _run.log_scalar
def parsing_output(ex_id):
    with open(f'./experiments/{ex_id}/metrics.json') as json_file:
        json_data = json.load(json_file)
    with open(f'./experiments/{ex_id}/config.json') as config_file:
        config_data = json.load(config_file)
    
    output_df = pd.DataFrame(json_data['model_name']['values'], columns=['model_name'], index=['score'])
    output_df['experiment_num'] = ex_id
    output_df['config'] = str(config_data)
    # the logged score is read back as CSV text; a regex separator needs the python engine
    metric_df = pd.read_csv(StringIO(json_data['metrics']['values'][0]['values']), sep=',|\r\n', engine='python')
    metric_df.index = ['score']

    output_df = pd.concat([output_df, metric_df], axis=1)
    output_df = output_df.round(2)
    return output_df

In [27]:
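import json

# 2) When results are returned from the @ex.main function
# (named differently so that it does not overwrite the metrics-based parsing_output)
def parsing_result_output(ex_id):
    with open(f'./experiments/{ex_id}/run.json') as json_file:
        json_data = json.load(json_file)
    output = pd.DataFrame(json_data['result'])
    return output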

In [60]:
parsing_output(1)

| | model_name | experiment_num | config | mape | mae | mse |
|---|---|---|---|---|---|---|
| score | LinearRegression | 1 | {'fit_intercept': True, 'normalize': False, 's... | 213892500000.0 | 2138925000.0 | 4.821286e+21 |
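
Once several runs have accumulated, the same files can be gathered into one list for comparison. The sketch below assumes each run directory holds a `config.json` and a `run.json` with a `result` key, as the parsers above do. `collect_runs` is a hypothetical helper, and the demo writes made-up sample files into a temporary directory instead of reading real experiment output.

```python
import json
import os
import tempfile


def collect_runs(base_dir):
    """Return one dict per run directory that holds config.json and run.json."""
    rows = []
    for name in sorted(os.listdir(base_dir)):
        run_dir = os.path.join(base_dir, name)
        config_path = os.path.join(run_dir, "config.json")
        run_path = os.path.join(run_dir, "run.json")
        # Skip anything that is not a complete run directory
        if not (os.path.isfile(config_path) and os.path.isfile(run_path)):
            continue
        with open(config_path) as f:
            config = json.load(f)
        with open(run_path) as f:
            result = json.load(f).get("result")
        rows.append({"experiment_num": name, "config": config, "result": result})
    return rows


# Demo with made-up files (values are illustrative, not real results)
with tempfile.TemporaryDirectory() as base_dir:
    samples = {"1": (True, 4.5), "2": (False, 5.0)}
    for run_id, (fit_intercept, mae) in samples.items():
        run_dir = os.path.join(base_dir, run_id)
        os.makedirs(run_dir)
        with open(os.path.join(run_dir, "config.json"), "w") as f:
            json.dump({"fit_intercept": fit_intercept}, f)
        with open(os.path.join(run_dir, "run.json"), "w") as f:
            json.dump({"result": {"mae": {"score": mae}}}, f)
    runs = collect_runs(base_dir)

for row in runs:
    print(row)
```

Directory names are sorted as strings here, so a real comparison across ten or more runs may want to sort by the numeric ID instead.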


### Further Reading
- [Sacred Github](https://github.com/IDSIA/sacred)
- [Introduction to Python Sacred, a tool to help with machine learning experiments](https://zzsza.github.io/mlops/2019/07/21/python-sacred/)
- [Experiment and log monitoring with Sacred and Omniboard](https://zzsza.github.io/mlops/2019/07/22/sacred-with-omniboard/)