<a href="https://colab.research.google.com/github/priyadharshini13/oxford_ml_project/blob/main/AI_ML_Project_Synthetic_data_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# Set seed for reproducibility
np.random.seed(42)
random.seed(42)

# Generate date range from the year 2000 to 2024
start_date = datetime(2000, 1, 1)
end_date = datetime(2024, 5, 31)
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Safeguard: extend the range if it has fewer than 5000 days
# (not triggered here, since 2000-01-01 to 2024-05-31 spans 8,918 days)
if len(date_range) < 5000:
    end_date = start_date + timedelta(days=5000-1)
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Generate synthetic data
data = {
    'Date of Sale': date_range,
    'Store ID': np.random.choice(range(1, 6), size=len(date_range)),  # 5 different stores
    'Country': np.random.choice(['USA', 'Canada', 'Mexico'], size=len(date_range)),
    'Product ID': np.random.choice(range(1, 21), size=len(date_range)),  # 20 different products
    'Product Category': np.random.choice(['Electronics', 'Clothing', 'Home & Kitchen', 'Sports', 'Books'], size=len(date_range)),
    'Units Sold': np.random.poisson(lam=20, size=len(date_range)),  # Average 20 units sold per day
    'Price Sold': np.round(np.random.uniform(10, 500, size=len(date_range)), 2),  # Price between 10 and 500
    'GDP Growth Rate': np.random.normal(loc=2, scale=0.5, size=len(date_range)),  # GDP growth rate around 2%
    'Inflation Rate': np.random.normal(loc=2, scale=0.5, size=len(date_range))  # Inflation rate around 2%
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Add some seasonality to the Units Sold (more sales in Q4)
df['Quarter'] = df['Date of Sale'].dt.quarter
df.loc[df['Quarter'] == 4, 'Units Sold'] += np.random.poisson(lam=5, size=len(df[df['Quarter'] == 4]))

# Add some promotional events effect (e.g., Black Friday)
promotional_dates = [
    datetime(2000, 11, 24),  # Black Friday 2000
    datetime(2001, 11, 23),  # Black Friday 2001
    datetime(2002, 11, 29),  # Black Friday 2002
    datetime(2003, 11, 28),  # Black Friday 2003
    datetime(2004, 11, 26),  # Black Friday 2004
    datetime(2005, 11, 25),  # Black Friday 2005
    datetime(2006, 11, 24),  # Black Friday 2006
    datetime(2007, 11, 23),  # Black Friday 2007
    datetime(2008, 11, 28),  # Black Friday 2008
    datetime(2009, 11, 27),  # Black Friday 2009
    datetime(2010, 11, 26),  # Black Friday 2010
    datetime(2011, 11, 25),  # Black Friday 2011
    datetime(2012, 11, 23),  # Black Friday 2012
    datetime(2013, 11, 29),  # Black Friday 2013
    datetime(2014, 11, 28),  # Black Friday 2014
    datetime(2015, 11, 27),  # Black Friday 2015
    datetime(2016, 11, 25),  # Black Friday 2016
    datetime(2017, 11, 24),  # Black Friday 2017
    datetime(2018, 11, 23),  # Black Friday 2018
    datetime(2019, 11, 29),  # Black Friday 2019
    datetime(2020, 11, 27),  # Black Friday 2020
    datetime(2021, 11, 26),  # Black Friday 2021
    datetime(2022, 11, 25),  # Black Friday 2022
    datetime(2023, 11, 24),  # Black Friday 2023
]

# Double units sold on every promotional date in one vectorized step
df.loc[df['Date of Sale'].isin(promotional_dates), 'Units Sold'] *= 2

# Ensure no negative values
df['Units Sold'] = df['Units Sold'].clip(lower=0)

# Drop the Quarter column as it's no longer needed
df.drop(columns=['Quarter'], inplace=True)

# Save to CSV
df.to_csv('synthetic_retail_data_for_demand_prediction.csv', index=False)

# Show a sample of the generated dataset
print(df.head())

# Display the DataFrame as the cell's output
df


  Date of Sale  Store ID Country  Product ID Product Category  Units Sold  \
0   2000-01-01         4     USA          15   Home & Kitchen          14   
1   2000-01-02         5  Canada           6      Electronics          21   
2   2000-01-03         3  Canada          11      Electronics          16   
3   2000-01-04         5     USA           8            Books          19   
4   2000-01-05         5  Mexico          17            Books          24   

   Price Sold  GDP Growth Rate  Inflation Rate  
0      229.25         1.716353        1.006865  
1      104.01         1.955608        2.417040  
2      169.17         1.774069        2.509185  
3      179.60         2.112682        2.636036  
4      286.81         1.943792        1.945556  


| | Date of Sale | Store ID | Country | Product ID | Product Category | Units Sold | Price Sold | GDP Growth Rate | Inflation Rate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-01 | 4 | USA | 15 | Home & Kitchen | 14 | 229.25 | 1.716353 | 1.006865 |
| 1 | 2000-01-02 | 5 | Canada | 6 | Electronics | 21 | 104.01 | 1.955608 | 2.417040 |
| 2 | 2000-01-03 | 3 | Canada | 11 | Electronics | 16 | 169.17 | 1.774069 | 2.509185 |
| 3 | 2000-01-04 | 5 | USA | 8 | Books | 19 | 179.60 | 2.112682 | 2.636036 |
| 4 | 2000-01-05 | 5 | Mexico | 17 | Books | 24 | 286.81 | 1.943792 | 1.945556 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8913 | 2024-05-27 | 5 | USA | 5 | Electronics | 14 | 11.33 | 2.346892 | 2.146812 |
| 8914 | 2024-05-28 | 4 | Canada | 3 | Clothing | 20 | 50.23 | 2.209269 | 1.042405 |
| 8915 | 2024-05-29 | 2 | USA | 8 | Home & Kitchen | 27 | 372.89 | 1.955724 | 1.857459 |
| 8916 | 2024-05-30 | 4 | Canada | 4 | Clothing | 21 | 173.40 | 1.754706 | 0.846825 |


In [None]:
df.describe(include='all')

| | Date of Sale | Store ID | Country | Product ID | Product Category | Units Sold | Price Sold | GDP Growth Rate | Inflation Rate |
|---|---|---|---|---|---|---|---|---|---|
| count | 8918 | 8918.0 | 8918 | 8918.0 | 8918 | 8918.0 | 8918.0 | 8918.0 | 8918.0 |
| unique | | | 3 | | 5 | | | | |
| top | | | USA | | Books | | | | |
| freq | | | 3005 | | 1915 | | | | |
| mean | 2012-03-16 12:00:00 | 2.996187 | | 10.538125 | | 21.305786 | 257.798725 | 1.996123 | 1.993854 |
| min | 2000-01-01 00:00:00 | 1.0 | | 1.0 | | 6.0 | 10.06 | 0.197802 | 0.11138 |
| 25% | 2006-02-07 06:00:00 | 2.0 | | 6.0 | | 18.0 | 135.4725 | 1.65993 | 1.658651 |
| 50% | 2012-03-16 12:00:00 | 3.0 | | 11.0 | | 21.0 | 258.56 | 1.996907 | 1.983493 |
| 75% | 2018-04-23 18:00:00 | 4.0 | | 16.0 | | 24.0 | 381.9575 | 2.3369 | 2.33174 |
| max | 2024-05-31 00:00:00 | 5.0 | | 20.0 | | 76.0 | 499.93 | 3.702603 | 3.829514 |


Let's break down the synthetic retail dataset generated by the code:

### Overview
The dataset is synthetic retail sales data with one record per calendar day from January 1, 2000, to May 31, 2024, for a total of 8,918 rows. Its columns cover the features commonly found in retail datasets: sale date, store and product identifiers, country, product category, units sold, price, and two economic indicators.
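
The row count follows directly from the date range, since both endpoints are included. A minimal check using only the standard library (the variable names are illustrative, not taken from the notebook):

```python
from datetime import date

# Endpoints of the daily date range used by the generator
start = date(2000, 1, 1)
end = date(2024, 5, 31)

# Inclusive day count: the difference in days plus one for the start date
n_records = (end - start).days + 1
print(n_records)  # 8918, matching the count row of df.describe()
```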

### Features
Here's a detailed explanation of each feature in the dataset:

1. **Date of Sale (`Date of Sale`):**
   - The date of each record. There is exactly one record per calendar day from January 1, 2000, to May 31, 2024.

2. **Store ID (`Store ID`):**
   - An integer store identifier from 1 to 5, drawn uniformly at random for each record.

3. **Country (`Country`):**
   - The country of the sale: 'USA', 'Canada', or 'Mexico'. The value is drawn uniformly at random for each record and independently of `Store ID`, so a given store can appear in more than one country.

4. **Product ID (`Product ID`):**
   - An integer product identifier from 1 to 20, drawn uniformly at random for each record.

5. **Product Category (`Product Category`):**
   - One of five categories: 'Electronics', 'Clothing', 'Home & Kitchen', 'Sports', and 'Books'. The value is drawn uniformly at random for each record and independently of `Product ID`, so a given product is not tied to a single category.

6. **Units Sold (`Units Sold`):**
   - The number of units sold in each daily record. The base value is drawn from a Poisson distribution with a mean of 20 and is then adjusted for seasonality and promotional events:
     - Records in Q4 (October to December) receive an additional Poisson-distributed uplift with a mean of 5.
     - Units sold are doubled on Black Friday, after the Q4 uplift has been applied.

7. **Price Sold (`Price Sold`):**
   - The selling price, drawn from a uniform distribution between $10 and $500 and rounded to two decimal places. Prices are drawn independently of `Product ID`.

8. **GDP Growth Rate (`GDP Growth Rate`):**
   - The GDP growth rate attached to the record, in percent, drawn independently for each day from a normal distribution with a mean of 2 and a standard deviation of 0.5.

9. **Inflation Rate (`Inflation Rate`):**
   - The inflation rate attached to the record, in percent, drawn independently for each day from a normal distribution with a mean of 2 and a standard deviation of 0.5.
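
The adjustments to `Units Sold` described in item 6 imply expected daily values of roughly 20 on an ordinary day, 25 in Q4, and 50 on Black Friday. The sketch below checks this by simulation. It is an illustration only: it uses NumPy's `default_rng` with an arbitrary seed rather than the notebook's `np.random.seed(42)`, and the names `base`, `q4`, and `promo` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for this illustration
n = 100_000                     # many draws so the sample means are stable

base = rng.poisson(lam=20, size=n)       # ordinary day
q4 = base + rng.poisson(lam=5, size=n)   # Q4 day: base plus seasonal uplift
promo = 2 * q4                           # Black Friday: Q4 value doubled

# Sample means should land close to 20, 25 and 50 respectively
print(base.mean(), q4.mean(), promo.mean())
```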

### Seasonal and Promotional Adjustments
1. **Seasonality:**
   - Every fourth-quarter record receives an extra Poisson-distributed count (mean 5) on top of its base units sold, reflecting the higher sales typically observed during the holiday season.

2. **Promotional Events:**
   - Units sold are doubled on Black Friday (the day after Thanksgiving in the USA) in each year from 2000 to 2023. The doubling applies to whichever record falls on that date, regardless of its country.
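
The notebook hardcodes the 24 Black Friday dates. As an alternative sketch, they can be derived from the rule "the day after the fourth Thursday of November". The helper `black_friday` is a hypothetical function written for this example and is not part of the notebook.

```python
from datetime import date, timedelta

def black_friday(year: int) -> date:
    """Return the day after US Thanksgiving (fourth Thursday of November)."""
    nov1 = date(year, 11, 1)
    # weekday(): Monday is 0, so Thursday is 3
    days_to_thursday = (3 - nov1.weekday()) % 7
    first_thursday = nov1 + timedelta(days=days_to_thursday)
    thanksgiving = first_thursday + timedelta(weeks=3)
    return thanksgiving + timedelta(days=1)

# Same years as the hardcoded list: 2000 through 2023
promotional_dates = [black_friday(y) for y in range(2000, 2024)]
print(promotional_dates[0], promotional_dates[-1])  # 2000-11-24 2023-11-24
```

These are `date` objects, while the `Date of Sale` column holds pandas timestamps, so converting the list with `pd.to_datetime(promotional_dates)` before comparing is likely the simplest way to use it with the DataFrame.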

### Data Integrity
1. **No Negative Sales:**
   - Units sold are clipped to a minimum of 0. Poisson draws are already non-negative, so this step is a safeguard rather than a correction.
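
Beyond clipping, a few invariants of the generated data can be asserted explicitly. The following is a minimal sketch: `validate_sales` is a hypothetical helper, the chosen checks are assumptions about what is worth verifying, and the small `demo` frame only mimics three columns of the real dataset.

```python
import pandas as pd

def validate_sales(df: pd.DataFrame) -> None:
    """Raise AssertionError if a basic invariant of the dataset is violated."""
    assert (df['Units Sold'] >= 0).all(), 'negative Units Sold'
    assert df['Price Sold'].between(10, 500).all(), 'Price Sold out of range'
    assert df['Date of Sale'].is_unique, 'duplicate dates'
    assert not df.isna().any().any(), 'missing values'

# Small stand-in frame using the same column names as the dataset
demo = pd.DataFrame({
    'Date of Sale': pd.date_range('2000-01-01', periods=3, freq='D'),
    'Units Sold': [14, 21, 16],
    'Price Sold': [229.25, 104.01, 169.17],
})
validate_sales(demo)  # passes silently when all checks hold
```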

### Sample of the Dataset
Here is a sample of the first few rows of the dataset:

```plaintext
  Date of Sale  Store ID Country  Product ID Product Category  Units Sold  Price Sold  GDP Growth Rate  Inflation Rate
0   2000-01-01         4     USA          15   Home & Kitchen          14      229.25         1.716353        1.006865
1   2000-01-02         5  Canada           6      Electronics          21      104.01         1.955608        2.417040
2   2000-01-03         3  Canada          11      Electronics          16      169.17         1.774069        2.509185
3   2000-01-04         5     USA           8            Books          19      179.60         2.112682        2.636036
4   2000-01-05         5  Mexico          17            Books          24      286.81         1.943792        1.945556
```
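
When the saved CSV is read back, the date column comes in as plain strings unless it is parsed explicitly. The sketch below shows one way to restore the datetime dtype with `parse_dates`. To stay self-contained it reads two rows inlined as text (copied from the notebook's printed output, at display precision) instead of the real file.

```python
import io
import pandas as pd

# Two rows inlined so the example runs without the CSV on disk
csv_text = """Date of Sale,Store ID,Country,Product ID,Product Category,Units Sold,Price Sold,GDP Growth Rate,Inflation Rate
2000-01-01,4,USA,15,Home & Kitchen,14,229.25,1.716353,1.006865
2000-01-02,5,Canada,6,Electronics,21,104.01,1.955608,2.417040
"""

# parse_dates converts the column back to a datetime dtype on load
loaded = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date of Sale'])
print(loaded.dtypes)
```

With the real file, the same call would take the path `'synthetic_retail_data_for_demand_prediction.csv'` in place of the `StringIO` buffer.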

### Summary
This synthetic dataset simulates a retail sales scenario with store locations, product categories, daily sales, pricing, and economic indicators. Units sold carry a Q4 seasonal uplift and a Black Friday doubling, which makes the data suitable for training machine learning models for demand forecasting, inventory management, and other retail analytics applications. All other columns are drawn independently of units sold, so the date-driven effects are the only demand signal a model can learn from this data.