### 🛒 Sales Sage

This project develops a machine learning model that predicts each store's total daily sales.

---
### Import Libraries
---

In [16]:
import sys

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import missingno as msno

from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score

sys.path.insert(0, '../')

from src.proccess.processor import DataProcessor
from src.predict.predictor import ModelPredictor
from src.train.trainer import ModelTrainer

---
### Generate Train Data
---

In [2]:
!python3 ../src/get_data.py 2022 01 01 2023 08 01 train

Simulate data ingestion!
Saving to ../data/train-2023-08-01.csv file...


---
### Preprocess Data
---

In [3]:
!python3 ../src/proccess/processor.py ../data/train-2023-08-01.csv


    [INFO] Reading data ...

    [INFO] Creating DataProcessor() ...

    [INFO] Actual data ...

        store_id        date  client_id  product_id       price
148999      5003  2022-09-06     180767        1032  227.897436
18356       5000  2022-06-02     284139        1242  331.629317
254136      5004  2022-05-31     154635        1709  435.382437
57010       5000  2023-05-05     389714        1223  348.823613
17605       5000  2022-05-28     134995        1630  342.627398 ...

    [INFO] Create total sales column ...

    [INFO] Actual data ...

            date  store_id   total_sales
1106  2022-07-04      5002  10417.842752
1713  2022-10-13      5003  66357.056058
50    2022-01-09      5002  11239.542293
474   2022-03-21      5000  35715.739790
3399  2023-07-21      5003  40929.499838 ...

    [INFO] Create weekday column ...

    [INFO] Actual data ...

            date  store_id   total_sales  weekday
430   2022-03-13      5004  79268.367916        6
823   2022-05-18      500
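
The preprocessing steps logged above can be sketched with plain pandas. This is a minimal illustration, not the actual `DataProcessor` code: it assumes `total_sales` is the sum of `price` per store per day, and it uses made-up transactions. The Monday=0 to Sunday=6 weekday convention matches the output above (2022-03-13, a Sunday, is labelled 6).

```python
import pandas as pd

# Toy transactions with the same columns as the raw data (values are made up).
raw = pd.DataFrame({
    "store_id": [5000, 5000, 5001, 5001],
    "date": ["2022-01-03", "2022-01-03", "2022-01-03", "2022-01-04"],
    "client_id": [1, 2, 3, 4],
    "product_id": [10, 11, 12, 13],
    "price": [100.0, 50.0, 20.0, 30.0],
})

# Assumption: total sales = sum of all transaction prices per (date, store).
daily = (
    raw.groupby(["date", "store_id"], as_index=False)["price"]
    .sum()
    .rename(columns={"price": "total_sales"})
)

# pandas' dayofweek convention: Monday=0 ... Sunday=6.
daily["weekday"] = pd.to_datetime(daily["date"]).dt.dayofweek

print(daily)
```

Aggregating before feature engineering leaves one row per store per day, which is the granularity the model is trained on.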

---
### Model
---

In [4]:
!python3 ../src/train/trainer.py ../data/train-2023-08-01.parquet


    [INFO] Reading processed data ...

    [INFO] Creating ModelTrainer() ...

    [INFO] Prepare data for training ...

    [INFO] Start training ...

------------------------------ 

 ------- Model Trainer ------- 

------------------------------ 

    [MODEL]
    RandomForestRegressor(random_state=195)

    [X__train] 
       store_id  year  month  day  weekday
963       5003  2022      6   10        4
94        5004  2022      1   16        6
2654      5002  2023      3   19        6
928       5004  2022      6    4        5
2177      5005  2022     12   29        3
...        ...   ...    ...  ...      ...
2342      5002  2023      1   26        3
3025      5001  2023      5   20        5
1103      5005  2022      7    3        6
3104      5002  2023      6    2        4
1787      5005  2022     10   25        1

[2427 rows x 5 columns]

    [Y__train] 
 963      46134.181142
94      101000.197613
2654     11615.895924
928      59545.653710
2177     70307.720174
            ...  
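
The training log shows a `RandomForestRegressor(random_state=195)` fit on five features. A rough equivalent using scikit-learn directly might look like the sketch below. The synthetic data, the 70/30 split ratio, and the use of `train_test_split` are assumptions; the internals of `ModelTrainer` are not shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed data (random values, illustration only).
rng = np.random.default_rng(0)
n = 200
dates = pd.date_range("2022-01-01", periods=n, freq="D")
df = pd.DataFrame({
    "store_id": rng.integers(5000, 5006, size=n),
    "year": dates.year,
    "month": dates.month,
    "day": dates.day,
    "weekday": dates.dayofweek,
})
df["total_sales"] = rng.uniform(1_000, 100_000, size=n)

# Same feature columns as the [X__train] output above.
X = df[["store_id", "year", "month", "day", "weekday"]]
y = df["total_sales"]

# Assumption: a random 70/30 split; the real ratio is not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=195
)

model = RandomForestRegressor(random_state=195)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Fixing `random_state` makes both the split and the forest reproducible between runs.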

---
### Generate new data for prediction
---

In [5]:
!python3 ../src/get_data.py 2023 08 02 2023 08 03 predict

Simulate data ingestion!
Saving to ../data/predict-2023-08-03.parquet file...


---
### Prediction and Model Evaluation
---

In [6]:
!python3 ../src/predict/predictor.py ../models/model-2023-08-01.pkl ../data/predict-2023-08-03.parquet


    [INFO] Reading new data ...

    [INFO] Creating ModelPredictor() ...

    [INFO] Read model ...

    [INFO] Make prediction ...

    [INFO] Predict value : 

[32009.92344386 32379.64315132  5130.99840417  4696.47413702
  9330.00200726  9706.85479706 43137.65076984 73424.11946406
 57536.13412892 57127.49738927 44582.02266862 44984.14493023] 
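
The prediction step loads a pickled model and scores new rows. The sketch below illustrates that round trip under several assumptions: the model is serialized with the standard `pickle` module, the six store IDs are 5000 to 5005 (the ones visible in the outputs above), and the feature rows are built one per store per day. Twelve predictions would be consistent with six stores over two days, but the row order of the real output is not stated.

```python
import pickle
from itertools import product

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in for the saved model: a tiny forest fit on made-up rows.
train = pd.DataFrame({
    "store_id": [5000, 5001, 5000, 5001],
    "year": [2023] * 4,
    "month": [7] * 4,
    "day": [30, 30, 31, 31],
    "weekday": [6, 6, 0, 0],
})
model = RandomForestRegressor(n_estimators=10, random_state=195)
model.fit(train, [100.0, 200.0, 110.0, 210.0])

# Serialize and restore in memory, as a .pkl file on disk would be.
restored = pickle.loads(pickle.dumps(model))

# One feature row per (date, store) for the two prediction days.
dates = pd.to_datetime(["2023-08-02", "2023-08-03"])
stores = range(5000, 5006)
rows = [(s, d.year, d.month, d.day, d.dayofweek) for d, s in product(dates, stores)]
X_new = pd.DataFrame(rows, columns=train.columns)

y_new = restored.predict(X_new)
print(len(y_new))
```

The new rows must carry the same columns, in the same order, as the training features.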



#### Evaluate Model

In [7]:
!python3 ../src/get_data.py 2022 01 01 2023 09 01 train

Simulate data ingestion!
Saving to ../data/train-2023-09-01.csv file...


In [8]:
df = pd.read_csv("../data/train-2023-09-01.csv")
df.sample(3)

        store_id        date  client_id  product_id       price
365571      5005  2022-09-26     272754        2049  881.209909
189465      5003  2023-02-02     375632        1101  220.534305
330132      5004  2023-05-06     362917        2867  410.588858


In [9]:
# Process the received data
processor = DataProcessor(df)
processor.create_total_sales_column()
processor.create_weekday_column()
processor.create_day_month_year_columns()
processor.remove_columns(column_name="date")
processor.reorder_columns(column_order=['store_id', 'year', 'month', 'day', 'weekday', 'total_sales'])

processed_df = processor.get_data()
processed_df.sample(3)

      store_id  year  month  day  weekday   total_sales
2731      5001  2023      4    1        5   2997.241082
2319      5003  2023      1   22        6  44476.351196
3085      5001  2023      5   30        1   5538.151288


In [10]:
# Train model
trainer = ModelTrainer(processed_df)
trainer.prepare_data()
trainer.train_model()
trainer.save_model("../models/model-2023-09-01.pkl")

X_train, Y_train, X_test, Y_test = trainer.get_split_data()

In [11]:
# Predict value
predictor = ModelPredictor()
predictor.read_model("../models/model-2023-09-01.pkl")
predictor.predict(X_test)
Y_predict = predictor.get_y_prediction()

In [17]:
mse = mean_squared_error(Y_test, Y_predict)
mae = mean_absolute_error(Y_test, Y_predict)
r2 = r2_score(Y_test, Y_predict)

print(f"Mean Squared Error: {mse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R-Squared: {r2:.2f}")

Mean Squared Error: 31011535.85
Mean Absolute Error: 3764.44
R-Squared: 0.96
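
To make the three metrics concrete, here is a small pure-Python sketch of their definitions on toy numbers (not the project's data). It also takes the square root of the reported MSE, which gives an RMSE of roughly 5,569 in the same units as `total_sales`. RMSE is larger than the reported MAE of 3764.44 because squaring weights large errors more heavily.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mse, mae, r2


# Toy values for illustration only.
mse, mae, r2 = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(f"MSE={mse:.4f} MAE={mae:.4f} R2={r2:.4f}")

# RMSE expresses the reported MSE in sales units.
reported_rmse = math.sqrt(31011535.85)
print(f"RMSE of reported MSE: {reported_rmse:.2f}")
```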
