## Define a dataset generator

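For our example to work, we need data that a model can actually fit, but not fit too well. We will train multiple iterations to show the effect of changing the model's hyperparameters, so the feature set needs some amount of unexplained variance. At the same time, there must be some degree of correlation between the feature set and the target variable (here `demand`, from the apple sales data we want to predict).

To introduce this correlation, we craft a relationship between the features and the target. The random elements in some of the factors supply the unexplained variance.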
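
As a minimal sketch of this idea (not the generator defined below), the following builds a target as a linear function of one feature plus Gaussian noise. The names `price` and `noise_scale` and the coefficients are illustrative assumptions; the point is only that the noise term limits how well any model can fit.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 1_000

# Hypothetical feature and coefficients, chosen only for illustration
price = rng.uniform(0.5, 3.0, n)
noise_scale = 50.0

# Target = deterministic relationship + unexplained variance
demand = 1000 - 50 * price + rng.normal(0, noise_scale, n)

# The correlation is clearly negative but well short of -1
corr = np.corrcoef(price, demand)[0, 1]
print(f"correlation(price, demand) = {corr:.2f}")
```

Raising `noise_scale` weakens the correlation, and lowering it makes the relationship easier to fit.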

In [3]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

In [4]:
def generate_apple_sales_data_with_promo_adjustment(
    base_demand: int = 1000,
    n_rows: int = 5000,
    competitor_price_effect: float = -50.0,
):
    """
    Generates a synthetic dataset for predicting apple sales demand with multiple
    influencing factors.

    This function creates a pandas DataFrame with features relevant to apple sales.
    The returned columns are date, average_temperature, rainfall, weekend flag, holiday flag,
    promotional flag, price_per_kg, competitor's price, marketing intensity, and the
    previous day's demand. The target variable, 'demand', is generated from price, promo,
    weekend, competitor price, and marketing intensity, scaled by inflation, with added noise.

    Args:
        base_demand (int, optional): Base demand for apples. Defaults to 1000.
        n_rows (int, optional): Number of rows (days) of data to generate. Defaults to 5000.
        competitor_price_effect (float, optional): Effect of competitor's price being lower
                                                   on our sales. Defaults to -50.

    Returns:
        pd.DataFrame: DataFrame with features and target variable for apple sales prediction.

    Example:
        >>> df = generate_apple_sales_data_with_promo_adjustment(base_demand=1200, n_rows=6000)
        >>> df.head()
    """

    # Set seed for reproducibility
    np.random.seed(9999)

    # Create date range
    dates = [datetime.now() - timedelta(days=i) for i in range(n_rows)]
    dates.reverse()

    # Generate features
    df = pd.DataFrame(
        {
            "date": dates,
            "average_temperature": np.random.uniform(10, 35, n_rows),
            "rainfall": np.random.exponential(5, n_rows),
            "weekend": [(date.weekday() >= 5) * 1 for date in dates],
            "holiday": np.random.choice([0, 1], n_rows, p=[0.97, 0.03]),
            "price_per_kg": np.random.uniform(0.5, 3, n_rows),
            "month": [date.month for date in dates],
        }
    )

    # Introduce inflation over time (years)
    df["inflation_multiplier"] = 1 + (df["date"].dt.year - df["date"].dt.year.min()) * 0.03

    # Incorporate seasonality due to apple harvests
    df["harvest_effect"] = np.sin(2 * np.pi * (df["month"] - 3) / 12) + np.sin(
        2 * np.pi * (df["month"] - 9) / 12
    )

    # Modify the price_per_kg based on harvest effect
    df["price_per_kg"] = df["price_per_kg"] - df["harvest_effect"] * 0.5

    # Adjust promo periods to coincide with periods lagging peak harvest by 1 month
    peak_months = [4, 10]  # months following the peak availability
    df["promo"] = np.where(
        df["month"].isin(peak_months),
        1,
        np.random.choice([0, 1], n_rows, p=[0.85, 0.15]),
    )

    # Generate target variable based on features
    base_price_effect = -df["price_per_kg"] * 50
    seasonality_effect = df["harvest_effect"] * 50
    promo_effect = df["promo"] * 200

    df["demand"] = (
        base_demand
        + base_price_effect
        + seasonality_effect
        + promo_effect
        + df["weekend"] * 300
        + np.random.normal(0, 50, n_rows)  # adding random noise
    ) * df["inflation_multiplier"]  # scale by cumulative inflation

    # Add previous day's demand
    df["previous_days_demand"] = df["demand"].shift(1)
    df["previous_days_demand"].fillna(method="bfill", inplace=True)  # fill the first row

    # Introduce competitor pricing
    df["competitor_price_per_kg"] = np.random.uniform(0.5, 3, n_rows)
    df["competitor_price_effect"] = (
        df["competitor_price_per_kg"] < df["price_per_kg"]
    ) * competitor_price_effect

    # Stock availability based on past sales price (3 days lag with logarithmic decay)
    log_decay = -np.log(df["price_per_kg"].shift(3) + 1) + 2
    df["stock_available"] = np.clip(log_decay, 0.7, 1)

    # Marketing intensity based on stock availability
    # Identify where stock is above threshold
    high_stock_indices = df[df["stock_available"] > 0.95].index

    # For each high stock day, increase marketing intensity for the next week
    for idx in high_stock_indices:
        df.loc[idx : min(idx + 7, n_rows - 1), "marketing_intensity"] = np.random.uniform(0.7, 1)

    # If the marketing_intensity column already has values, this will preserve them;
    #  if not, it sets default values
    fill_values = pd.Series(np.random.uniform(0, 0.5, n_rows), index=df.index)
    df["marketing_intensity"].fillna(fill_values, inplace=True)

    # Adjust demand with new factors
    df["demand"] = df["demand"] + df["competitor_price_effect"] + df["marketing_intensity"]

    # Drop temporary columns
    df.drop(
        columns=[
            "inflation_multiplier",
            "harvest_effect",
            "month",
            "competitor_price_effect",
            "stock_available",
        ],
        inplace=True,
    )

    return df
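
One property of the `harvest_effect` term is worth checking before relying on it. The two sine terms are offset by six months, which is half a period, so by the identity sin(θ − π) = −sin(θ) they cancel for every month. The sketch below evaluates the same expression on its own. It suggests that, as written, the harvest-based price adjustment and `seasonality_effect` contribute essentially nothing, and that the seasonal pattern in `demand` comes mainly from the month-based `promo` flag.

```python
import numpy as np

months = np.arange(1, 13)

# Same expression as in the generator, evaluated for months 1..12
harvest_effect = np.sin(2 * np.pi * (months - 3) / 12) + np.sin(
    2 * np.pi * (months - 9) / 12
)

# Largest absolute value across all months; zero up to floating-point rounding
print(np.abs(harvest_effect).max())
```

If a genuine seasonal signal is wanted, one of the two terms would need a different phase or frequency. That would be a change to the generator and would alter the generated values.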

In [5]:
data = generate_apple_sales_data_with_promo_adjustment(
    base_demand=1000, n_rows=10_000, competitor_price_effect=-25.0
)
data

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df["previous_days_demand"].fillna(method="bfill", inplace=True)  # fill the first row

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df["marketing_intensity"].fillna(fill_values, inplace=True)
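
The warnings above come from calling `fillna(..., inplace=True)` on a column selected with `df[col]`. A minimal sketch of the assignment form the warning recommends is shown below on a toy frame; the column values are made up for illustration. `Series.bfill()` stands in for `fillna(method="bfill")`, whose `method` argument is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the generator's DataFrame (values are illustrative)
df = pd.DataFrame({"demand": [100.0, 120.0, 90.0, 110.0]})

# Lagged column: assign the result back instead of filling in place
df["previous_days_demand"] = df["demand"].shift(1).bfill()

# Partially filled column, as the marketing loop would leave it
df["marketing_intensity"] = [np.nan, 0.8, 0.8, np.nan]
fill_values = pd.Series(np.random.uniform(0, 0.5, len(df)), index=df.index)

# fillna aligns fill_values on the index and only touches the NaN entries
df["marketing_intensity"] = df["marketing_intensity"].fillna(fill_values)

print(df)
```

Applying the same two assignments inside `generate_apple_sales_data_with_promo_adjustment` should produce the same values without the warnings, since only the way the result is written back changes.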

| | date | average_temperature | rainfall | weekend | holiday | price_per_kg | promo | demand | previous_days_demand | competitor_price_per_kg | marketing_intensity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1996-12-02 23:45:11.240034 | 30.584727 | 1.831006 | 0 | 0 | 1.578387 | 1 | 1001.647352 | 1026.324266 | 0.755725 | 0.323086 |
| 1 | 1996-12-03 23:45:11.240032 | 15.465069 | 0.761303 | 0 | 0 | 1.965125 | 0 | 843.972638 | 1026.324266 | 0.913934 | 0.030371 |
| 2 | 1996-12-04 23:45:11.240031 | 10.786525 | 1.427338 | 0 | 0 | 1.497623 | 0 | 890.319248 | 868.942267 | 2.879262 | 0.354226 |
| 3 | 1996-12-05 23:45:11.240030 | 23.648154 | 3.737435 | 0 | 0 | 1.952936 | 0 | 811.206168 | 889.965021 | 0.826015 | 0.953000 |
| 4 | 1996-12-06 23:45:11.240029 | 13.861391 | 5.598549 | 0 | 0 | 2.059993 | 0 | 822.279469 | 835.253168 | 1.130145 | 0.953000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 2024-04-14 23:45:11.226286 | 23.358868 | 7.061220 | 1 | 0 | 1.556829 | 1 | 2566.432998 | 2676.279445 | 0.560507 | 0.889971 |
| 9996 | 2024-04-15 23:45:11.226284 | 14.859048 | 0.868655 | 0 | 0 | 1.632918 | 1 | 2032.827646 | 2590.543027 | 2.460766 | 0.884467 |
| 9997 | 2024-04-16 23:45:11.226283 | 17.941035 | 13.739986 | 0 | 0 | 0.827723 | 1 | 2167.417581 | 2031.943179 | 1.321922 | 0.884467 |
| 9998 | 2024-04-17 23:45:11.226281 | 14.533862 | 1.610512 | 0 | 0 | 0.589172 | 1 | 2099.505096 | 2166.533113 | 2.604095 | 0.812706 |


In [6]:
data.to_csv('apple_data.csv', index=False)

In [7]:
df_ = pd.read_csv('apple_data.csv')
df_

| | date | average_temperature | rainfall | weekend | holiday | price_per_kg | promo | demand | previous_days_demand | competitor_price_per_kg | marketing_intensity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1996-12-02 23:45:11.240034 | 30.584727 | 1.831006 | 0 | 0 | 1.578387 | 1 | 1001.647352 | 1026.324266 | 0.755725 | 0.323086 |
| 1 | 1996-12-03 23:45:11.240032 | 15.465069 | 0.761303 | 0 | 0 | 1.965125 | 0 | 843.972638 | 1026.324266 | 0.913934 | 0.030371 |
| 2 | 1996-12-04 23:45:11.240031 | 10.786525 | 1.427338 | 0 | 0 | 1.497623 | 0 | 890.319248 | 868.942267 | 2.879262 | 0.354226 |
| 3 | 1996-12-05 23:45:11.240030 | 23.648154 | 3.737435 | 0 | 0 | 1.952936 | 0 | 811.206168 | 889.965021 | 0.826015 | 0.953000 |
| 4 | 1996-12-06 23:45:11.240029 | 13.861391 | 5.598549 | 0 | 0 | 2.059993 | 0 | 822.279469 | 835.253168 | 1.130145 | 0.953000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 2024-04-14 23:45:11.226286 | 23.358868 | 7.061220 | 1 | 0 | 1.556829 | 1 | 2566.432998 | 2676.279445 | 0.560507 | 0.889971 |
| 9996 | 2024-04-15 23:45:11.226284 | 14.859048 | 0.868655 | 0 | 0 | 1.632918 | 1 | 2032.827646 | 2590.543027 | 2.460766 | 0.884467 |
| 9997 | 2024-04-16 23:45:11.226283 | 17.941035 | 13.739986 | 0 | 0 | 0.827723 | 1 | 2167.417581 | 2031.943179 | 1.321922 | 0.884467 |
| 9998 | 2024-04-17 23:45:11.226281 | 14.533862 | 1.610512 | 0 | 0 | 0.589172 | 1 | 2099.505096 | 2166.533113 | 2.604095 | 0.812706 |
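
One detail of this round trip: CSV stores no type information, so `pd.read_csv` returns the `date` column as strings unless told otherwise. The sketch below uses an in-memory buffer in place of `apple_data.csv`, with an illustrative two-row frame, to show the `parse_dates` option restoring the datetime dtype.

```python
import io

import pandas as pd

# Illustrative two-row frame; the real data comes from the generator above
original = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-04-16", "2024-04-17"]),
        "demand": [2167.4, 2099.5],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default read: the date column is not a datetime dtype
buffer.seek(0)
as_text = pd.read_csv(buffer)

# With parse_dates the column is converted back to datetime64
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"])

print(as_text["date"].dtype, restored["date"].dtype)
```

For the notebook's file, the equivalent call would be `pd.read_csv('apple_data.csv', parse_dates=['date'])`.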


In [8]:
data_ = generate_apple_sales_data_with_promo_adjustment(base_demand=1_000, n_rows=5_000)
data_


| | date | average_temperature | rainfall | weekend | holiday | price_per_kg | promo | demand | previous_days_demand | competitor_price_per_kg | marketing_intensity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-08-11 23:45:24.467224 | 30.584727 | 1.199291 | 0 | 0 | 1.726258 | 0 | 851.375336 | 851.276659 | 1.935346 | 0.098677 |
| 1 | 2010-08-12 23:45:24.467223 | 15.465069 | 1.037626 | 0 | 0 | 0.576471 | 0 | 906.855943 | 851.276659 | 2.344720 | 0.019318 |
| 2 | 2010-08-13 23:45:24.467222 | 10.786525 | 5.656089 | 0 | 0 | 2.513328 | 0 | 808.304909 | 906.836626 | 0.998803 | 0.409485 |
| 3 | 2010-08-14 23:45:24.467221 | 23.648154 | 12.030937 | 1 | 0 | 1.839225 | 0 | 1099.833810 | 857.895424 | 0.761740 | 0.872803 |
| 4 | 2010-08-15 23:45:24.467220 | 13.861391 | 4.303812 | 1 | 0 | 1.531772 | 0 | 1283.949061 | 1148.961007 | 2.123436 | 0.820779 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 2024-04-14 23:45:24.460841 | 21.643051 | 3.821656 | 1 | 0 | 2.391010 | 1 | 1875.882437 | 1880.799278 | 1.504432 | 0.756489 |
| 4996 | 2024-04-15 23:45:24.460840 | 13.808813 | 1.080603 | 0 | 1 | 0.898693 | 1 | 1596.870527 | 1925.125948 | 1.343586 | 0.742145 |
| 4997 | 2024-04-16 23:45:24.460839 | 11.698227 | 1.911000 | 0 | 0 | 2.839860 | 1 | 1271.065524 | 1596.128382 | 2.771896 | 0.742145 |
| 4998 | 2024-04-17 23:45:24.460836 | 18.052081 | 1.000521 | 0 | 0 | 1.188440 | 1 | 1681.886638 | 1320.323379 | 2.564075 | 0.742145 |
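
Comparing the two generated frames, `average_temperature` starts with the same values because of the fixed seed, but the `date` timestamps differ because they are derived from `datetime.now()` at call time. If fully repeatable dates are wanted, one option (an assumption, not part of the original function) is to build the range from a fixed end date with `pd.date_range`. The `end_date` value below is arbitrary.

```python
import pandas as pd

n_rows = 5
end_date = "2024-04-18"  # arbitrary fixed anchor, chosen for illustration

# Daily timestamps ending at end_date, already in ascending order
dates = pd.date_range(end=end_date, periods=n_rows, freq="D")

# Same weekend rule as in the generator: Saturday=5, Sunday=6
weekend = (dates.weekday >= 5).astype(int)

print(pd.DataFrame({"date": dates, "weekend": weekend}))
```

Because the range is built in ascending order, the `dates.reverse()` step from the generator would not be needed with this approach.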
