## First MLflow Model Logging Notebook
- Source: https://mlflow.org/docs/latest/getting-started/logging-first-model/notebooks/index.html
- Changes from the source: explanatory text was translated into Korean

In [1]:
from pprint import pprint

from sklearn.ensemble import RandomForestRegressor

from mlflow import MlflowClient

### Initializing the MLflow Client

How you initialize the MLflow client in the next cell depends on where you run this notebook.

This example uses a tracking server running locally, but other options are available.  
The easiest is to use the free managed service within [Databricks Community Edition](https://community.cloud.databricks.com/).

For details on setting the tracking server URL and configuring access to a managed or self-managed MLflow tracking server, see the [guide to running notebooks](https://www.mlflow.org/docs/latest/getting-started/running-notebooks/index.html).
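
Because the right tracking URI depends on the environment, one option is to read it from the `MLFLOW_TRACKING_URI` environment variable and fall back to the local server used in this notebook. The following is a minimal sketch; the helper name `resolve_tracking_uri` and the constant `DEFAULT_TRACKING_URI` are hypothetical and not part of MLflow.

```python
import os

# Assumption: the local server address from this notebook is the fallback.
DEFAULT_TRACKING_URI = "http://127.0.0.1:8080"


def resolve_tracking_uri(env=None):
    """Return the URI from MLFLOW_TRACKING_URI, or the local default."""
    env = os.environ if env is None else env
    # Treat an empty or whitespace-only value the same as an unset variable.
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or DEFAULT_TRACKING_URI


print(resolve_tracking_uri({}))
```

The result can be passed to the client as `MlflowClient(tracking_uri=resolve_tracking_uri())`, so the same notebook can run against a local or a managed server without editing the cell.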

In [2]:
# NOTE: review the links mentioned above for guidance on connecting to a managed tracking server, such as the free Databricks Community Edition

client = MlflowClient(tracking_uri="http://127.0.0.1:8080")

#### Searching experiments with the MLflow Client API

In [3]:
# Search experiments without providing query terms behaves effectively as a 'list' action

all_experiments = client.search_experiments()

print(all_experiments)

[<Experiment: artifact_location='file:///D:/강의자료/MLflow/code/mlruns/0', creation_time=1711231739749, experiment_id='0', last_update_time=1711231739749, lifecycle_stage='active', name='Default', tags={}>]


In [4]:
# Extract the experiment name and lifecycle_stage

default_experiment = [
    {"name": experiment.name, "lifecycle_stage": experiment.lifecycle_stage}
    for experiment in all_experiments
    if experiment.name == "Default"
][0]

pprint(default_experiment)

{'lifecycle_stage': 'active', 'name': 'Default'}
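
Indexing the list comprehension with `[0]` raises an `IndexError` when no experiment matches the name. A more defensive variant, sketched here with `SimpleNamespace` stand-ins instead of real `Experiment` objects so that it runs without a tracking server, returns `None` in that case. The helper name `summarize_experiment` is hypothetical.

```python
from types import SimpleNamespace

# Stand-ins for the Experiment objects returned by client.search_experiments().
experiments = [
    SimpleNamespace(name="Default", lifecycle_stage="active"),
]


def summarize_experiment(experiments, name):
    """Return name and lifecycle_stage of the first match, or None if absent."""
    match = next((e for e in experiments if e.name == name), None)
    if match is None:
        return None
    return {"name": match.name, "lifecycle_stage": match.lifecycle_stage}


print(summarize_experiment(experiments, "Default"))
print(summarize_experiment(experiments, "Missing"))
```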


### Creating a new experiment

* Create a new MLflow experiment
* Apply metadata in the form of experiment tags

In [5]:
experiment_description = (
    "This is the grocery forecasting project. "
    "This experiment contains the produce models for apples."
)

experiment_tags = {
    "project_name": "grocery-forecasting",
    "store_dept": "produce",
    "team": "stores-ml",
    "project_quarter": "Q3-2023",
    "mlflow.note.content": experiment_description,
}

produce_apples_experiment = client.create_experiment(name="Apple_Models", tags=experiment_tags)
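
Re-running the cell above against the same server is likely to fail, because an experiment named "Apple_Models" already exists. A get-or-create helper is one way around this. The sketch below relies only on `get_experiment_by_name` (which returns `None` when nothing matches) and `create_experiment` (which returns the new experiment ID). `get_or_create_experiment` and `InMemoryClient` are hypothetical names; the in-memory class exists only so the example runs without a server.

```python
from types import SimpleNamespace


def get_or_create_experiment(client, name, tags=None):
    """Return the ID of the experiment called `name`, creating it if needed."""
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    return client.create_experiment(name=name, tags=tags)


class InMemoryClient:
    """Hypothetical stand-in exposing the two MlflowClient methods used above."""

    def __init__(self):
        self._experiments = {}

    def get_experiment_by_name(self, name):
        return self._experiments.get(name)

    def create_experiment(self, name, tags=None):
        experiment_id = str(len(self._experiments) + 1)
        self._experiments[name] = SimpleNamespace(
            experiment_id=experiment_id, name=name, tags=dict(tags or {})
        )
        return experiment_id


demo_client = InMemoryClient()
first_id = get_or_create_experiment(demo_client, "Apple_Models", {"team": "stores-ml"})
second_id = get_or_create_experiment(demo_client, "Apple_Models")
print(first_id == second_id)
```

With a real server, the `client` object from the earlier cell would be passed in place of `demo_client`.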

In [6]:
# Use search_experiments() to search on the project_name tag key

apples_experiment = client.search_experiments(
    filter_string="tags.`project_name` = 'grocery-forecasting'"
)

pprint(apples_experiment[0])

<Experiment: artifact_location='mlflow-artifacts:/806769610813887394', creation_time=1711243179271, experiment_id='806769610813887394', last_update_time=1711243179271, lifecycle_stage='active', name='Apple_Models', tags={'mlflow.note.content': 'This is the grocery forecasting project. This '
                        'experiment contains the produce models for apples.',
 'project_name': 'grocery-forecasting',
 'project_quarter': 'Q3-2023',
 'store_dept': 'produce',
 'team': 'stores-ml'}>
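
The filter above matches on a single tag. Assuming the search syntax accepts `AND` between conditions, as the MLflow search documentation describes, several tag conditions can be combined into one `filter_string`. The helper `build_tag_filter` below is hypothetical and deliberately rejects keys or values that contain quoting characters rather than trying to escape them.

```python
def build_tag_filter(tags):
    """Join tag equality conditions with AND for use as a filter_string."""
    conditions = []
    for key, value in tags.items():
        # Assumption: keys contain no backticks and values no single quotes.
        if "`" in key or "'" in value:
            raise ValueError(f"unsupported character in tag {key!r}")
        conditions.append(f"tags.`{key}` = '{value}'")
    return " AND ".join(conditions)


filter_string = build_tag_filter(
    {"project_name": "grocery-forecasting", "team": "stores-ml"}
)
print(filter_string)
```

The resulting string can be passed as `client.search_experiments(filter_string=filter_string)`.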


In [7]:
# Access individual tag data

print(apples_experiment[0].tags["team"])

stores-ml


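### Running the first model training

* Generate a synthetic dataset for a simple demand forecasting task.
* Start an MLflow run.
* Log metrics, parameters, and tags to the run.
* Save the model to the run.
* Register the model during model logging.

#### Synthetic data generator for apple demand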

In [8]:
from datetime import datetime, timedelta

import numpy as np
import pandas as pd


def generate_apple_sales_data_with_promo_adjustment(base_demand: int = 1000, n_rows: int = 5000):
    """
    Generates a synthetic dataset for predicting apple sales demand with seasonality and inflation.

    This function creates a pandas DataFrame with features relevant to apple sales.
    The features include date, average_temperature, rainfall, weekend flag, holiday flag,
    promotional flag, price_per_kg, and the previous day's demand. The target variable,
    'demand', is generated based on a combination of these features with some added noise.

    Args:
        base_demand (int, optional): Base demand for apples. Defaults to 1000.
        n_rows (int, optional): Number of rows (days) of data to generate. Defaults to 5000.

    Returns:
        pd.DataFrame: DataFrame with features and target variable for apple sales prediction.

    Example:
        >>> df = generate_apple_sales_data_with_promo_adjustment(base_demand=1200, n_rows=6000)
        >>> df.head()
    """

    # Set seed for reproducibility
    np.random.seed(9999)

    # Create date range
    dates = [datetime.now() - timedelta(days=i) for i in range(n_rows)]
    dates.reverse()

    # Generate features
    df = pd.DataFrame(
        {
            "date": dates,
            "average_temperature": np.random.uniform(10, 35, n_rows),
            "rainfall": np.random.exponential(5, n_rows),
            "weekend": [(date.weekday() >= 5) * 1 for date in dates],
            "holiday": np.random.choice([0, 1], n_rows, p=[0.97, 0.03]),
            "price_per_kg": np.random.uniform(0.5, 3, n_rows),
            "month": [date.month for date in dates],
        }
    )

    # Introduce inflation over time (years)
    df["inflation_multiplier"] = 1 + (df["date"].dt.year - df["date"].dt.year.min()) * 0.03

    # Incorporate seasonality due to apple harvests
    df["harvest_effect"] = np.sin(2 * np.pi * (df["month"] - 3) / 12) + np.sin(
        2 * np.pi * (df["month"] - 9) / 12
    )

    # Modify the price_per_kg based on harvest effect
    df["price_per_kg"] = df["price_per_kg"] - df["harvest_effect"] * 0.5

    # Adjust promo periods to coincide with periods lagging peak harvest by 1 month
    peak_months = [4, 10]  # months following the peak availability
    df["promo"] = np.where(
        df["month"].isin(peak_months),
        1,
        np.random.choice([0, 1], n_rows, p=[0.85, 0.15]),
    )

    # Generate target variable based on features
    base_price_effect = -df["price_per_kg"] * 50
    seasonality_effect = df["harvest_effect"] * 50
    promo_effect = df["promo"] * 200

    df["demand"] = (
        base_demand
        + base_price_effect
        + seasonality_effect
        + promo_effect
        + df["weekend"] * 300
        + np.random.normal(0, 50, n_rows)  # adding random noise
    ) * df["inflation_multiplier"]

    # Add previous day's demand
    df["previous_days_demand"] = df["demand"].shift(1)
    df["previous_days_demand"] = df["previous_days_demand"].bfill()  # fill the first row

    # Drop temporary columns
    df.drop(columns=["inflation_multiplier", "harvest_effect", "month"], inplace=True)

    return df
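
One detail of the generator is worth checking. The two sine terms in `harvest_effect` are shifted by six months, which is half of the 12-month period, and since sin(x - π) = -sin(x) they appear to cancel for every month. The short check below evaluates the same expression with the standard library only; `max_abs_effect` is a name introduced here for illustration.

```python
import math

# Evaluate both sine terms from the generator for every calendar month.
effects = []
for month in range(1, 13):
    first_term = math.sin(2 * math.pi * (month - 3) / 12)
    second_term = math.sin(2 * math.pi * (month - 9) / 12)
    effects.append(first_term + second_term)

# The largest magnitude is at floating-point rounding level, not a seasonal swing.
max_abs_effect = max(abs(effect) for effect in effects)
print(f"largest |harvest_effect| over 12 months: {max_abs_effect:.2e}")
```

If that reading is right, the price adjustment and `seasonality_effect` derived from `harvest_effect` contribute essentially nothing to `demand` as written, and the variation in the data comes from price, promotions, weekends, noise, and inflation.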

In [9]:
# Generate the dataset!

data = generate_apple_sales_data_with_promo_adjustment(base_demand=1_000, n_rows=1_000)

data[-20:]



| | date | average_temperature | rainfall | weekend | holiday | price_per_kg | promo | demand | previous_days_demand |
|---|---|---|---|---|---|---|---|---|---|
| 980 | 2024-03-05 10:19:39.347354 | 34.130183 | 1.454065 | 0 | 0 | 1.449177 | 0 | 999.30629 | 1029.418398 |
| 981 | 2024-03-06 10:19:39.347354 | 32.353643 | 9.462859 | 0 | 0 | 2.856503 | 0 | 842.129427 | 999.30629 |
| 982 | 2024-03-07 10:19:39.347354 | 18.816833 | 0.39147 | 0 | 0 | 1.326429 | 0 | 990.616709 | 842.129427 |
| 983 | 2024-03-08 10:19:39.347354 | 34.533012 | 2.120477 | 0 | 0 | 0.970131 | 0 | 1068.802075 | 990.616709 |
| 984 | 2024-03-09 10:19:39.347354 | 23.057202 | 2.365705 | 1 | 0 | 1.049931 | 0 | 1346.486305 | 1068.802075 |
| 985 | 2024-03-10 10:19:39.347354 | 34.810165 | 3.089005 | 1 | 0 | 2.035149 | 0 | 1329.564672 | 1346.486305 |
| 986 | 2024-03-11 10:19:39.347354 | 29.208905 | 3.673292 | 0 | 0 | 2.518098 | 0 | 1086.143402 | 1329.564672 |
| 987 | 2024-03-12 10:19:39.347354 | 16.428676 | 4.077782 | 0 | 0 | 1.268979 | 0 | 1093.207186 | 1086.143402 |
| 988 | 2024-03-13 10:19:39.347354 | 32.067512 | 2.734454 | 0 | 0 | 0.762317 | 0 | 1069.939894 | 1093.207186 |
| 989 | 2024-03-14 10:19:39.347354 | 31.938203 | 13.883486 | 0 | 0 | 1.153301 | 0 | 994.40954 | 1069.939894 |


### Training and logging the model

Uses ``RandomForestRegressor``.

In [10]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import mlflow

# Use the fluent API to set the tracking uri and the active experiment
mlflow.set_tracking_uri("http://127.0.0.1:8080")

# Sets the current active experiment to the "Apple_Models" experiment and returns the Experiment metadata
apple_experiment = mlflow.set_experiment("Apple_Models")

# Define a run name for this iteration of training.
# If this is not set, a unique name will be auto-generated for your run.
run_name = "apples_rf_test"

# Define an artifact path that the model will be saved to.
artifact_path = "rf_apples"

In [11]:
# Split the data into features and target and drop irrelevant date field and target field
X = data.drop(columns=["date", "demand"])
y = data["demand"]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

params = {
    "n_estimators": 100,
    "max_depth": 6,
    "min_samples_split": 10,
    "min_samples_leaf": 4,
    "bootstrap": True,
    "oob_score": False,
    "random_state": 888,
}

# Train the RandomForestRegressor
rf = RandomForestRegressor(**params)

# Fit the model on the training data
rf.fit(X_train, y_train)

# Predict on the validation set
y_pred = rf.predict(X_val)

# Calculate error metrics
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_pred)

# Assemble the metrics we're going to write into a collection
metrics = {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Initiate the MLflow run context
with mlflow.start_run(run_name=run_name) as run:
    # Log the parameters used for the model fit
    mlflow.log_params(params)

    # Log the error metrics that were calculated during validation
    mlflow.log_metrics(metrics)

    # Log an instance of the trained model for later use
    mlflow.sklearn.log_model(sk_model=rf, input_example=X_val, artifact_path=artifact_path)



Go to the MLflow UI to see the run that was just created (named "apples_rf_test" and logged to the "Apple_Models" experiment).
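
To make the four logged metrics easier to interpret in the UI, the following sketch recomputes them by hand on a tiny made-up example. The values are illustrative only and are not taken from the model above; the function name `regression_metrics` is hypothetical, and the formulas are the standard definitions of MAE, MSE, RMSE, and R².

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up demand values and predictions, for illustration only.
toy = regression_metrics(
    [1000.0, 1200.0, 900.0, 1100.0],
    [1010.0, 1180.0, 930.0, 1100.0],
)
print(toy)
```

MAE and RMSE are in the same units as `demand`, while R² is unitless and compares the model against always predicting the mean.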