# Order Predictive Modeling

This notebook demonstrates a step-by-step approach to creating a predictive model for forecasting orders for the month.

To guide the solution, I start with two main questions:

1. What are the expected order days for each client?
2. How confident can I be in the model that was built?



## Environment Setup

To begin the project, I created a Conda environment named `ps_bee` with Python 3.10. After setting up the environment, I installed the necessary packages listed in the `requirements.txt` file.

```bash
conda create --name ps_bee python=3.10
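conda activate ps_bee
# Install the pinned packages (assuming pip was used inside the Conda environment)
pip install -r requirements.txt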
```

The `requirements.txt` file:
```text
pandas==2.2.2
numpy==2.1.0
ipykernel==6.29.5
pyarrow==17.0.0
fastparquet==2024.5.0
pandas-profiling==3.6.6
```
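
As a quick sanity check after installation, the pins in `requirements.txt` can be compared with what is actually installed in the active environment. The sketch below is not part of the original workflow: the helper names `parse_pins` and `find_mismatches` are hypothetical, and it relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Parse 'name==version' lines into a {name: version} mapping."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed in the active environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Two pins copied from the requirements file; in practice the text would be
# read from requirements.txt itself.
sample = "pandas==2.2.2\nnumpy==2.1.0\n"
print(find_mismatches(parse_pins(sample)))
```

An empty dictionary means every pinned package is installed at the expected version.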

In [None]:
import os
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport


## Data Collection and Preparation

This step consists of the **ETL (Extract, Transform, Load)** process for data collection. Here, I prepare the data to be ready for consumption.

To assist in this analysis, I used the `pandas-profiling` framework, which is now maintained as `ydata-profiling` (hence the `ydata_profiling` import above). It produces an extensive analysis of each dataset. Because embedding that analysis would make the notebook very large, I save the generated reports as HTML files and show only the main insights here.

The input of this step is the raw data. The output is the processed data.

I organized the data folders as follows:
```text
data/
├── raw/
│   ├── august_total_sales.parquet
│   ├── august_with_missing_order_days.parquet
│   └── historical_orders.parquet
└── processed/
    ├── processed_sales.csv
    └── target_processed.csv
```


In [1]:
# Create the folders if they don't exist
folders = [
    "data/raw",
    "data/processed",
    "reports",  # the profiling reports are written here later in the notebook
]

for folder in folders:
    os.makedirs(folder, exist_ok=True)
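
Creating the folders does not guarantee that the raw files have been copied into them. A minimal sketch of a pre-flight check, assuming the three file names from the folder layout above; the helper `missing_raw_files` is hypothetical and not part of the original notebook:

```python
from pathlib import Path
from typing import List

# File names taken from the folder layout described above.
EXPECTED_RAW_FILES = [
    "august_total_sales.parquet",
    "august_with_missing_order_days.parquet",
    "historical_orders.parquet",
]


def missing_raw_files(raw_dir: str, expected: List[str]) -> List[str]:
    """Return the expected file names that are not present in raw_dir."""
    base = Path(raw_dir)
    return [name for name in expected if not (base / name).is_file()]


missing = missing_raw_files("data/raw", EXPECTED_RAW_FILES)
if missing:
    print(f"Missing raw files: {missing}")
else:
    print("All raw files are in place.")
```

Running this before the load step turns a late read error into an explicit list of what still needs to be copied.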

### Define ETL functions

The cell below contains the functions used in the data-processing step.

In [None]:
# Define ETL functions
def ETL_august_total_sales(file_path: str) -> pd.DataFrame:
    """
    Extracts, transforms, and loads data from the specified parquet file.
    
    :param file_path: Path to the parquet file for August total sales.
    :return: DataFrame containing the data from the parquet file.
    """
    df_read = pd.read_parquet(file_path)
    return df_read

def ETL_august_with_missing_order_days(file_path: str) -> pd.DataFrame:
    """
    Extracts, transforms, and loads data from the specified parquet file.
    
    :param file_path: Path to the parquet file for August with missing order days.
    :return: DataFrame containing the data from the parquet file.
    """
    df_read = pd.read_parquet(file_path)
    return df_read

def ETL_historical_orders(file_path: str) -> pd.DataFrame:
    """
    Extracts, transforms, and loads data from the specified parquet file.
    
    :param file_path: Path to the parquet file for historical orders.
    :return: DataFrame containing the data from the parquet file.
    """
    df_read = pd.read_parquet(file_path)
    return df_read


# Generate profiling reports
def generate_profile_report(df: pd.DataFrame, file_name: str):
    """
    Generates a pandas profiling report for the given DataFrame and saves it to an HTML file.
    
    :param df: DataFrame to profile.
    :param file_name: Name of the output HTML file for the profiling report.
    """
    profile = ProfileReport(df, title=file_name, explorative=True)
    profile.to_file(file_name)
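
The three ETL functions currently share the same body, so one possible simplification is a single loader that takes the reader as a parameter. This is only a sketch of that idea: `RAW_PATHS` and `load_datasets` are hypothetical names, and the stand-in reader below just echoes the path so the example runs without any data files. In the notebook, the reader would be `pd.read_parquet`.

```python
from typing import Any, Callable, Dict

# Dataset names mapped to the raw file paths used in this notebook.
RAW_PATHS = {
    "august_total_sales": "data/raw/august_total_sales.parquet",
    "august_with_missing_order_days": "data/raw/august_with_missing_order_days.parquet",
    "historical_orders": "data/raw/historical_orders.parquet",
}


def load_datasets(paths: Dict[str, str], reader: Callable[[str], Any]) -> Dict[str, Any]:
    """Apply one reader to every path and return the results keyed by dataset name."""
    return {name: reader(path) for name, path in paths.items()}


# Stand-in reader for illustration only; replace with pd.read_parquet
# to obtain real DataFrames.
frames = load_datasets(RAW_PATHS, reader=lambda path: f"<loaded {path}>")
for name, value in frames.items():
    print(name, value)
```

Keeping separate functions still makes sense if each dataset later needs its own transformation; the shared loader only covers the extract step.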


In [3]:
# Paths to the raw input files
DIR_AUGUST_TOTAL_SALES = "data/raw/august_total_sales.parquet"
DIR_AUGUST_WITH_MISSING_ORDER_DAYS = "data/raw/august_with_missing_order_days.parquet"
DIR_HISTORICAL_ORDERS = "data/raw/historical_orders.parquet"


# Load the data
df_august_total_sales = ETL_august_total_sales(DIR_AUGUST_TOTAL_SALES)
df_august_with_missing_order_days = ETL_august_with_missing_order_days(DIR_AUGUST_WITH_MISSING_ORDER_DAYS)
df_historical_orders = ETL_historical_orders(DIR_HISTORICAL_ORDERS)



The cell below generates and saves a profiling report for each dataset using `ProfileReport`.

In [5]:

# Save the reports
generate_profile_report(df_august_total_sales, "reports/august_total_sales_report.html")
generate_profile_report(df_august_with_missing_order_days, "reports/august_with_missing_order_days_report.html")
generate_profile_report(df_historical_orders, "reports/historical_orders_report.html")

Summarize dataset: 100%|██████████| 12/12 [00:04<00:00,  2.85it/s, Completed]                                                              
Generate report structure: 100%|██████████| 1/1 [00:12<00:00, 12.17s/it]
Render HTML: 100%|██████████| 1/1 [00:01<00:00,  1.69s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 100.05it/s]
Summarize dataset: 100%|██████████| 14/14 [00:13<00:00,  1.00it/s, Completed]                                    
Generate report structure: 100%|██████████| 1/1 [00:11<00:00, 11.42s/it]
Render HTML: 100%|██████████| 1/1 [00:02<00:00,  2.12s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 90.95it/s]
Summarize dataset: 100%|██████████| 13/13 [03:25<00:00, 15.81s/it, Completed]                                    
Generate report structure: 100%|██████████| 1/1 [00:12<00:00, 12.23s/it]
Render HTML: 100%|██████████| 1/1 [00:01<00:00,  1.57s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 90.98it/s]
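
The report file names above follow the pattern `reports/<dataset>_report.html`. Rather than typing each one by hand, the name could be derived from the raw file path, as in the sketch below. The helpers `report_path_for` and `ensure_parent_dir` are hypothetical, and the naming pattern is inferred from the three calls above.

```python
from pathlib import Path


def report_path_for(source_path: str, reports_dir: str = "reports") -> Path:
    """Build '<reports_dir>/<source stem>_report.html' for a raw data file."""
    stem = Path(source_path).stem  # file name without directory or extension
    return Path(reports_dir) / f"{stem}_report.html"


def ensure_parent_dir(path: Path) -> Path:
    """Create the parent folder of path if needed and return path unchanged."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


raw_files = [
    "data/raw/august_total_sales.parquet",
    "data/raw/august_with_missing_order_days.parquet",
    "data/raw/historical_orders.parquet",
]
for raw_file in raw_files:
    print(report_path_for(raw_file).as_posix())
```

Calling `ensure_parent_dir(report_path_for(...))` before writing a report would also guard against a missing output folder.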


## EDA - Exploratory Data Analysis
