`Notebook Description:` This notebook contains different methods for generating forecasting datasets:
- 1. Random Number Generation with Uniform Distribution
- 2. Random Number Generation with Normal Distribution
- 3. Generation Incorporating Seasonality and Trends

## Setup 

In [3]:
import os 
import pandas as pd
import numpy as np 
import random

In [None]:
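# Define the output directory and create it if it does not exist yet,
# since to_csv fails when the target directory is missing
data_dir = '../data/generated/forecast/'
os.makedirs(data_dir, exist_ok=True)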

In [17]:
# Set parameters
num_products = 1000
start_date = '2024-08-01'
end_date = '2024-08-31'
min_batches_per_product = 5
date_range = pd.date_range(start=start_date, end=end_date)

In [None]:
# Generate sequential IDs for the products
products_ids = np.arange(1, num_products + 1)

In [38]:
date_range

DatetimeIndex(['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04',
               '2024-08-05', '2024-08-06', '2024-08-07', '2024-08-08',
               '2024-08-09', '2024-08-10', '2024-08-11', '2024-08-12',
               '2024-08-13', '2024-08-14', '2024-08-15', '2024-08-16',
               '2024-08-17', '2024-08-18', '2024-08-19', '2024-08-20',
               '2024-08-21', '2024-08-22', '2024-08-23', '2024-08-24',
               '2024-08-25', '2024-08-26', '2024-08-27', '2024-08-28',
               '2024-08-29', '2024-08-30', '2024-08-31'],
              dtype='datetime64[ns]', freq='D')

## Generate a toy forecast dataset
The dataset covers at least 1000 products over a 31-day prediction horizon.

### 1. Random Number Generation with Uniform Distribution
- Assign forecasted sales using a uniform distribution, where each product has an equal chance of selling any number within a defined range.

In [None]:
# Pick a minimum and a maximum to bound the forecast range (random.randint includes both ends)
min_forcasted_sales = 5 
max_forcasted_sales = 200

In [None]:
random_generated_data = []
for product_id in products_ids:
    for date in date_range:
        forecasted_sales = random.randint(min_forcasted_sales, max_forcasted_sales)
        random_generated_data.append((product_id, date, forecasted_sales))

print(len(random_generated_data))

31000


In [22]:
# Convert to DataFrame
random_forecast_df = pd.DataFrame(random_generated_data, columns=["PRODUCT_ID", "DATE", "FORECASTED_SALES"])


In [24]:
random_forecast_df.head()

| | PRODUCT_ID | DATE | FORECASTED_SALES |
|---|---|---|---|
| 0 | 1 | 2024-08-01 | 186 |
| 1 | 1 | 2024-08-02 | 137 |
| 2 | 1 | 2024-08-03 | 54 |
| 3 | 1 | 2024-08-04 | 119 |
| 4 | 1 | 2024-08-05 | 36 |


In [25]:
random_forecast_df.describe()

| | PRODUCT_ID | DATE | FORECASTED_SALES |
|---|---|---|---|
| count | 31000.0 | 31000 | 31000.0 |
| mean | 500.5 | 2024-08-16 00:00:00 | 102.225194 |
| min | 1.0 | 2024-08-01 00:00:00 | 5.0 |
| 25% | 250.75 | 2024-08-08 00:00:00 | 53.0 |
| 50% | 500.5 | 2024-08-16 00:00:00 | 102.0 |
| 75% | 750.25 | 2024-08-24 00:00:00 | 151.0 |
| max | 1000.0 | 2024-08-31 00:00:00 | 200.0 |
| std | 288.679646 | | 56.434726 |


**Considerations:**

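- This method is simple but may not reflect realistic sales patterns: each value is drawn independently, so a product's forecast can jump arbitrarily from one day to the next (186, 137, 54 in the sample above).
- All products share the same distribution (uniform between 5 and 200), so nothing differentiates one product from another.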
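
The nested loop above can also be written without Python-level loops. The following is a minimal sketch, assuming NumPy's `Generator` API is acceptable in place of the `random` module; the names `min_sales`, `max_sales`, and `uniform_df`, and the seed value, are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Assumed values, mirroring the parameters defined earlier in this notebook
num_products = 1000
date_range = pd.date_range(start="2024-08-01", end="2024-08-31")
min_sales, max_sales = 5, 200  # hypothetical names for the forecast range

# The seed is an arbitrary choice that makes the run reproducible
rng = np.random.default_rng(seed=42)

# One row per (product, date) pair, ordered by product and then by date
index = pd.MultiIndex.from_product(
    [np.arange(1, num_products + 1), date_range],
    names=["PRODUCT_ID", "DATE"],
)
uniform_df = index.to_frame(index=False)

# endpoint=True makes the upper bound inclusive, matching random.randint
uniform_df["FORECASTED_SALES"] = rng.integers(
    min_sales, max_sales, size=len(uniform_df), endpoint=True
)
```

Seeding the generator means the same dataset is produced on every run, which helps when downstream results need to be compared.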

### 2. Random Number Generation with Normal Distribution
- Use a normal (Gaussian) distribution to generate sales forecasts, allowing for more realistic variation around a mean value.

In [None]:
# Generate a random mean (10 to 50) and standard deviation (3 to 9) for each product
mean_sales = np.random.randint(10, 51, num_products)
std_dev_sales = np.random.randint(3, 10, num_products)

In [33]:
random_ND_generated_data = []
for product_id in range(1,num_products+1):
    mean = mean_sales[product_id-1]
    std_dev = std_dev_sales[product_id-1] 
    
    sales_forcast = np.random.normal(mean, std_dev, len(date_range))  # float samples, which may be negative
    
    sales_forcast = np.maximum(0, sales_forcast).astype(int)  # clip negatives to 0, then truncate to integers
    
    for date, forecast in zip(date_range, sales_forcast):
        random_ND_generated_data.append(
             {
                "PRODUCT_ID": product_id,
                "DATE": date,
                "FORECASTED_SALES": forecast
             })
    
random_ND_generated_data[:5]

[{'PRODUCT_ID': 1,
  'DATE': Timestamp('2024-08-01 00:00:00'),
  'FORECASTED_SALES': np.int64(44)},
 {'PRODUCT_ID': 1,
  'DATE': Timestamp('2024-08-02 00:00:00'),
  'FORECASTED_SALES': np.int64(45)},
 {'PRODUCT_ID': 1,
  'DATE': Timestamp('2024-08-03 00:00:00'),
  'FORECASTED_SALES': np.int64(49)},
 {'PRODUCT_ID': 1,
  'DATE': Timestamp('2024-08-04 00:00:00'),
  'FORECASTED_SALES': np.int64(49)},
 {'PRODUCT_ID': 1,
  'DATE': Timestamp('2024-08-05 00:00:00'),
  'FORECASTED_SALES': np.int64(50)}]

In [34]:
# Build a DataFrame from the generated records
random_ND_forecast_df = pd.DataFrame(random_ND_generated_data)
random_ND_forecast_df.head()

| | PRODUCT_ID | DATE | FORECASTED_SALES |
|---|---|---|---|
| 0 | 1 | 2024-08-01 | 44 |
| 1 | 1 | 2024-08-02 | 45 |
| 2 | 1 | 2024-08-03 | 49 |
| 3 | 1 | 2024-08-04 | 49 |
| 4 | 1 | 2024-08-05 | 50 |


In [35]:
random_ND_forecast_df.describe()

| | PRODUCT_ID | DATE | FORECASTED_SALES |
|---|---|---|---|
| count | 31000.0 | 31000 | 31000.0 |
| mean | 500.5 | 2024-08-16 00:00:00 | 29.516548 |
| min | 1.0 | 2024-08-01 00:00:00 | 0.0 |
| 25% | 250.75 | 2024-08-08 00:00:00 | 19.0 |
| 50% | 500.5 | 2024-08-16 00:00:00 | 29.0 |
| 75% | 750.25 | 2024-08-24 00:00:00 | 40.0 |
| max | 1000.0 | 2024-08-31 00:00:00 | 75.0 |
| std | 288.679646 | | 13.467324 |


**Considerations:**

- Provides more realistic variability.
- May still lack seasonality or trends unless further adjusted.
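
The per-product loop above can be replaced by a single broadcasted draw. This is a sketch under the same parameter ranges, assuming NumPy's `Generator` API; the names `samples` and `normal_df` and the seed are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed values, mirroring the parameters defined earlier in this notebook
num_products = 1000
date_range = pd.date_range(start="2024-08-01", end="2024-08-31")
rng = np.random.default_rng(seed=42)  # arbitrary seed for reproducibility

mean_sales = rng.integers(10, 51, size=num_products)    # means in [10, 50]
std_dev_sales = rng.integers(3, 10, size=num_products)  # std devs in [3, 9]

# Column vectors of shape (num_products, 1) broadcast against the
# requested (num_products, n_days) output, giving one row per product
samples = rng.normal(
    loc=mean_sales[:, None],
    scale=std_dev_sales[:, None],
    size=(num_products, len(date_range)),
)

# Clip negative draws to zero, then truncate to whole units
sales = np.clip(samples, 0, None).astype(int)

normal_df = pd.DataFrame({
    "PRODUCT_ID": np.repeat(np.arange(1, num_products + 1), len(date_range)),
    "DATE": np.tile(date_range.to_numpy(), num_products),
    "FORECASTED_SALES": sales.ravel(),  # row-major: product 1's days first
})
```

Note that clipping at zero slightly raises the mean for products whose mean is small relative to their standard deviation.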

### 3. Generation Incorporating Seasonality and Trends
- Simulate patterns such as weekdays vs. weekends, holidays, or promotional periods.

##### Steps:

1. **Identify Seasonal Factors**:
   - Weekends may have higher or lower sales.
   - Certain dates (e.g., August holidays) might influence demand.

2. **Define Baseline Sales**:
   - Establish a base sales number for each product.

3. **Apply Seasonal Adjustments**:
   - Modify the baseline sales using factors that represent seasonality (e.g., increase sales by 20% on weekends).

4. **Add Random Variation**:
   - Introduce randomness around the adjusted sales to mimic real-world unpredictability.
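
The four steps can be sketched for a single product as follows. The baseline of 100 units, the 10% noise level, and the names `seasonal_factor`, `noise`, and `one_product_df` are assumptions chosen for illustration; only the 20% weekend uplift comes from the steps above.

```python
import numpy as np
import pandas as pd

date_range = pd.date_range(start="2024-08-01", end="2024-08-31")
rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility

# Steps 1 and 3: a multiplicative factor per date.
# pandas numbers weekdays from Monday=0, so 4 is Friday and 5 is Saturday.
weekend_days = [4, 5]
seasonal_factor = np.where(date_range.weekday.isin(weekend_days), 1.2, 1.0)

# Step 2: baseline for one hypothetical product
base_sales = 100

# Step 4: multiplicative noise, assumed normal with a 10% standard deviation
noise = rng.normal(loc=1.0, scale=0.1, size=len(date_range))

# Combine, clip negatives to zero, and round to whole units
sales = np.clip(base_sales * seasonal_factor * noise, 0, None).round().astype(int)

one_product_df = pd.DataFrame({
    "PRODUCT_ID": 1,
    "DATE": date_range,
    "FORECASTED_SALES": sales,
})
```

Multiplicative noise keeps the variation proportional to the sales level; additive noise would be an equally valid choice for a toy dataset.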


In [49]:
# for date in date_range:
#     print(date, ' ', date.weekday())

In [65]:
seasonality_data = []
for product_id in products_ids:
    base_sales = np.random.randint(min_forcasted_sales, max_forcasted_sales)  # same bounds as method 1, but np.random.randint excludes the upper bound
    for date in date_range:
        if date.weekday() in [4, 5]:  # Egyptian weekend: Friday (4) and Saturday (5)
            sales = base_sales * 1.2  # Increase by 20%
        else:  # weekdays
            sales = base_sales
        
        seasonality_data.append({
            "PRODUCT_ID": product_id,
            "DATE": date,
            "FORECASTED_SALES": int(sales)  # truncate the float to an integer
        })
seasonality_data[:5]

[{'PRODUCT_ID': np.int64(1),
  'DATE': Timestamp('2024-08-01 00:00:00'),
  'FORECASTED_SALES': 22},
 {'PRODUCT_ID': np.int64(1),
  'DATE': Timestamp('2024-08-02 00:00:00'),
  'FORECASTED_SALES': 26},
 {'PRODUCT_ID': np.int64(1),
  'DATE': Timestamp('2024-08-03 00:00:00'),
  'FORECASTED_SALES': 26},
 {'PRODUCT_ID': np.int64(1),
  'DATE': Timestamp('2024-08-04 00:00:00'),
  'FORECASTED_SALES': 22},
 {'PRODUCT_ID': np.int64(1),
  'DATE': Timestamp('2024-08-05 00:00:00'),
  'FORECASTED_SALES': 22}]

In [66]:
# Create DataFrame
seasonality_df = pd.DataFrame(seasonality_data, columns=['PRODUCT_ID', 'DATE', 'FORECASTED_SALES'])
seasonality_df.head()

| | PRODUCT_ID | DATE | FORECASTED_SALES |
|---|---|---|---|
| 0 | 1 | 2024-08-01 | 22 |
| 1 | 1 | 2024-08-02 | 26 |
| 2 | 1 | 2024-08-03 | 26 |
| 3 | 1 | 2024-08-04 | 22 |
| 4 | 1 | 2024-08-05 | 22 |


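**Considerations:**
- Reflects more realistic sales patterns, since weekend demand differs from weekday demand.
- The code above does not yet apply step 4 (random variation), so each product takes only two distinct values: its baseline and its weekend value.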

### 4. Clustering Products into Categories with Different Sales Patterns
`(Not Implemented)`  **Future Enhancement**

- Group products into categories (e.g., high-demand, medium-demand, low-demand) and assign different sales behaviors.

**Considerations:**
- Introduces product differentiation.
- Mimics real-world scenarios where products have varying demand levels.
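
One possible shape for this enhancement is sketched below. It assigns each product to a category at random rather than clustering real data, and every number in it (category means, standard deviations, and the 20/50/30 mix) is a made-up assumption, as are the names `categories`, `assigned`, and `category_df`.

```python
import numpy as np
import pandas as pd

# Assumed values, mirroring the parameters defined earlier in this notebook
num_products = 1000
date_range = pd.date_range(start="2024-08-01", end="2024-08-31")
rng = np.random.default_rng(seed=7)  # arbitrary seed for reproducibility

# Hypothetical categories: (mean daily sales, standard deviation)
categories = {
    "high": (150, 20),
    "medium": (60, 10),
    "low": (10, 4),
}

# Assumed category mix: 20% high, 50% medium, 30% low
assigned = rng.choice(list(categories), size=num_products, p=[0.2, 0.5, 0.3])

# Look up each product's parameters from its category
means = np.array([categories[c][0] for c in assigned])
stds = np.array([categories[c][1] for c in assigned])

# One normal draw per (product, date), using the category's parameters
samples = rng.normal(
    means[:, None], stds[:, None], size=(num_products, len(date_range))
)

category_df = pd.DataFrame({
    "PRODUCT_ID": np.repeat(np.arange(1, num_products + 1), len(date_range)),
    "DATE": np.tile(date_range.to_numpy(), num_products),
    "FORECASTED_SALES": np.clip(samples, 0, None).astype(int).ravel(),
})
```

The resulting frame has the same three columns as the other methods, so it would pass the shape and column checks used later in this notebook.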

### Checking that the DataFrames from the different methods match in shape and columns

In [67]:
# Ensure all DataFrames have the same shape
assert random_forecast_df.shape == random_ND_forecast_df.shape == seasonality_df.shape

In [68]:
# Ensure all DataFrames have the same columns
assert list(random_forecast_df.columns) == list(random_ND_forecast_df.columns) == list(seasonality_df.columns)
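
Shape and column checks do not catch duplicated rows, missing dates, or negative values. A hypothetical helper such as `check_forecast_frame` below could cover those cases; it is a sketch, not part of the original notebook, and is demonstrated on a small toy frame.

```python
import pandas as pd

def check_forecast_frame(df, expected_dates):
    """Hypothetical helper: validate one generated forecast frame."""
    # Every (product, date) pair should appear exactly once
    assert not df.duplicated(["PRODUCT_ID", "DATE"]).any()
    # Each product should cover the full prediction horizon
    dates_per_product = df.groupby("PRODUCT_ID")["DATE"].nunique()
    assert (dates_per_product == len(expected_dates)).all()
    # Forecasted sales should never be negative
    assert (df["FORECASTED_SALES"] >= 0).all()
    return True

# Toy demonstration with two products and three days
dates = pd.date_range("2024-08-01", "2024-08-03")
toy_df = pd.DataFrame({
    "PRODUCT_ID": [1] * 3 + [2] * 3,
    "DATE": list(dates) * 2,
    "FORECASTED_SALES": [5, 6, 7, 0, 1, 2],
})
result = check_forecast_frame(toy_df, dates)
```

In the notebook, the same helper could be called once per method with `date_range` as the expected dates.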

### Saving the results as CSV files

In [69]:
random_forecast_df.to_csv(data_dir+'method_1_uniform_random_forecast.csv', index=False)
random_ND_forecast_df.to_csv(data_dir+'method_2_random_ND_forecast.csv', index=False)
seasonality_df.to_csv(data_dir+'method_3_seasonality_forecast.csv', index=False)
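
CSV files store dates as plain text, so the `DATE` column comes back as strings unless it is parsed on load. The sketch below shows a round trip through a temporary directory; the file name `example_forecast.csv` and the two-row frame are placeholders, not files produced by this notebook.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "PRODUCT_ID": [1, 1],
    "DATE": pd.to_datetime(["2024-08-01", "2024-08-02"]),
    "FORECASTED_SALES": [22, 26],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_forecast.csv")  # hypothetical file name
    df.to_csv(path, index=False)
    # parse_dates restores the datetime dtype of the DATE column
    loaded = pd.read_csv(path, parse_dates=["DATE"])
```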