# IRON KAGGLE

- **shop_ID**: Unique identifier for each shop.
- **day_of_the_week**: Encoded from 1 to 7, representing the day of the week.
- **date**: Day, month, and year of the data point.
- **number_of_customers**: Number of customers who visited the shop on that day.
- **open**: Binary variable; 0 means the shop was closed, while 1 means it was open.
- **promotion**: Binary variable; 0 means no promotions, 1 means there were promotions.
- **state_holiday**: Encoded as 0, 'a', 'b', or 'c'. 0 means no state holiday; 'a', 'b', and 'c' denote different state holidays.
- **school_holiday**: Binary variable; 0 means no school holiday, 1 means there was a school holiday.

In [85]:
import datetime

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score, root_mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

## Data Loading

In [86]:
data = pd.read_csv('data/sales.csv')
sales_df = data.copy()

## Data Exploration

In [None]:
display(sales_df.shape)
display(sales_df.head())
sales_df.info()  # info() prints its summary and returns None, so display() is not needed
display(sales_df.isna().sum())
display(sales_df.duplicated().sum())

## Feature Engineering

### Dates

In [88]:
sales_df['Date'] = pd.to_datetime(sales_df['Date'])

# Split the date into year, month, and day components
sales_df['Year'] = sales_df['Date'].dt.year
sales_df['Month'] = sales_df['Date'].dt.month
sales_df['Day'] = sales_df['Date'].dt.day
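
A note on parsing: the data dictionary describes the date as day, month, and year, and `pd.to_datetime` reads ambiguous strings such as `03/04/2015` month-first unless told otherwise. The sketch below passes an explicit format. The `%d/%m/%Y` layout and the sample values are assumptions for illustration and are not taken from `sales.csv`.

```python
import pandas as pd

# Hypothetical day-first date strings; the real file may use another layout.
raw_dates = pd.Series(["03/04/2015", "25/12/2015"])

# An explicit format avoids the month-first default for ambiguous strings.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

date_parts = pd.DataFrame({
    "Year": parsed.dt.year,
    "Month": parsed.dt.month,
    "Day": parsed.dt.day,
})
print(date_parts)
```

If the file already stores ISO dates (`2015-04-03`), the plain `pd.to_datetime` call is unambiguous and no format argument is needed.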

### Categories

In [89]:
# One-hot encode State_holiday into binary columns;
# drop_first=True drops the first category, which then acts as the baseline
sales_df = pd.get_dummies(sales_df, columns=['State_holiday'], drop_first=True)
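
When a column mixes the integer `0` with string codes, a CSV reader can end up holding both `0` and `'0'` as separate values. A minimal sketch, on a hypothetical toy frame rather than the real data, of casting to `str` before encoding so that only one "no holiday" category exists:

```python
import pandas as pd

# Hypothetical toy frame; the values mimic the encoding in the data dictionary.
toy = pd.DataFrame({"State_holiday": [0, "0", "a", "b", "c", "a"]})

# Cast to str so the integer 0 and the string "0" collapse into one category.
toy["State_holiday"] = toy["State_holiday"].astype(str)

# Categories are sorted, so "0" comes first and is the one dropped.
encoded = pd.get_dummies(toy, columns=["State_holiday"], drop_first=True)
print(encoded.columns.tolist())
```

Rows with no state holiday then have zeros (or `False`) in all three remaining columns.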

### DF Check

In [None]:
display(sales_df.head())

## Correlation Matrix

In [None]:
# numeric_only=True skips non-numeric columns such as Date
corr_matrix = sales_df.corr(numeric_only=True)

# just for the target column
corr_matrix_target = corr_matrix[['Sales']]

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix_target, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
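
A single-column correlation view is often easier to read when it is sorted and the target's correlation with itself is removed. A minimal sketch on a hypothetical toy frame; the column names `Nb_customers` and `Noise` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data; only the column name "Sales" mirrors the notebook.
toy = pd.DataFrame({
    "Sales": [10, 20, 30, 40, 50],
    "Nb_customers": [1, 2, 3, 4, 5],
    "Noise": [3, 1, 4, 1, 5],
    "Date": pd.date_range("2015-01-01", periods=5),
})

# Correlate numeric columns only, keep the target column,
# drop the trivial self-correlation, and sort from strongest positive down.
target_corr = (
    toy.corr(numeric_only=True)["Sales"]
    .drop("Sales")
    .sort_values(ascending=False)
)
print(target_corr)
```

The sorted series can be passed to `sns.heatmap` via `target_corr.to_frame()` if a plot is still wanted.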

## Data Split

### Define feature & target

In [92]:
# TODO: decide whether to drop rows for days when the shop was closed (open = 0)
features = sales_df.drop(columns=['Sales', 'Date', 'True_index'])
target = sales_df['Sales']

### Split the training and test

In [93]:
# TODO: try different test sizes
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)

## Scaling

In [94]:
# TODO: try different scalers

normalizer = MinMaxScaler()

normalizer.fit(X_train)

X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)
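
Because the scaler is fitted on the training split only, transformed test values are not guaranteed to stay inside the training range. A minimal sketch comparing `MinMaxScaler` and `StandardScaler` on small made-up arrays (the numbers are illustrative, not from the sales data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up training and test features with very different column ranges.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 50.0]])

scaled = {}
for scaler in (MinMaxScaler(), StandardScaler()):
    scaler.fit(X_train)  # statistics come from the training split only
    scaled[type(scaler).__name__] = scaler.transform(X_test)

# MinMaxScaler maps the test row to [1.5, -0.25], outside the [0, 1] training range.
print(scaled["MinMaxScaler"])
print(scaled["StandardScaler"])
```

Tree-based regressors split on thresholds, so they are largely insensitive to this kind of monotonic rescaling; the choice of scaler matters more for distance-based or gradient-based models.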

## Training
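
A minimal sketch of what this step could look like, using synthetic data in place of the sales data. The data-generating formula and the hyperparameters are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features and the Sales target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative hyperparameters, not tuned values.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)  # R^2 on the held-out split
print(f"Test R^2: {test_r2:.3f}")
```

With the real data, `X_train_norm` and `y_train` from the cells above would take the place of the synthetic arrays.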

## Tuning
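
One way to approach this step is an exhaustive search over a small grid with `GridSearchCV`. The sketch below runs on synthetic data; the grid values, the scoring metric, and the number of folds are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training features and target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, size=200)

# Illustrative grid; real searches usually cover more parameters.
param_grid = {"max_depth": [2, 4, 8]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence the sign
)
search.fit(X, y)

best_mae = -search.best_score_  # flip the sign back to a positive MAE
print(search.best_params_, best_mae)
```

`RandomizedSearchCV` accepts the same estimator and a `param_distributions` argument, and samples a fixed number of candidates instead of trying every combination.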

## Evaluation
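
A minimal sketch of the three regression metrics imported at the top of the notebook, computed on made-up values small enough to check by hand. RMSE is taken as the square root of `mean_squared_error`, which matches `root_mean_squared_error` in recent scikit-learn releases.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; every prediction is off by exactly 10.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

mae = mean_absolute_error(y_true, y_pred)           # 10.0
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # 10.0
r2 = r2_score(y_true, y_pred)                       # 1 - 400 / 50000 = 0.992

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
```

MAE and RMSE are in the units of the target, while R² is unitless, so reporting both kinds gives a fuller picture.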

## Cross Validation
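
A minimal sketch of k-fold cross-validation with the scaler placed inside a pipeline, so that it is refitted on the training part of every fold. The synthetic data, the model, and the number of folds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the features and the Sales target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, size=200)

# Scaling lives inside the pipeline, so no fold sees statistics from its own test part.
pipeline = make_pipeline(
    MinMaxScaler(),
    DecisionTreeRegressor(max_depth=5, random_state=0),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")

print(f"R^2 per fold: {np.round(scores, 3)}")
print(f"Mean R^2: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much a single train/test split could over- or understate performance.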