# IronKaggle

Team: John, Lovely, Alex

## Schedule:
- 9am: Introduction, Group organization, framing the problem
- 9:30am: Development time
- 1pm: Lunch break
- 2pm: Back from break
- 3:30pm: You will receive the “real-life” data
- 5pm: Delivery of the dataset with predictions + R2 score; finish your presentation
- 5:30pm: Presentations + winner announcement



## Expected Deliverable
- The “real-life” data set with an extra column called “sales” containing your predictions (as a .csv)
- The R2 value you expect your model to achieve
- A 5-minute presentation on the choices you made and the road you took

## Deliverables
- A .csv file named after your group (e.g., ‘G1.csv’, ‘G2.csv’)
- The value of R2 you are expecting to get
- Send both in a .zip file containing two elements: the csv file and a txt file with the R2 score inside
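
The packaging step described above could be sketched as follows. This is only an illustration: the helper name `package_submission`, the `G1_r2.txt` file name, and the four-decimal formatting are assumptions, not requirements stated by the organizers.

```python
import zipfile
from pathlib import Path

import pandas as pd


def package_submission(real_df, predictions, expected_r2, group_name, out_dir="."):
    """Write <group>.csv and <group>_r2.txt, then bundle both into <group>.zip.

    All file names except <group>.csv are hypothetical choices for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Add the predictions as a new "sales" column without mutating the input.
    submission = real_df.copy()
    submission["sales"] = predictions

    csv_path = out_dir / f"{group_name}.csv"
    txt_path = out_dir / f"{group_name}_r2.txt"
    zip_path = out_dir / f"{group_name}.zip"

    submission.to_csv(csv_path, index=False)
    txt_path.write_text(f"{expected_r2:.4f}\n")

    # The archive holds exactly two members: the csv and the txt file.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(csv_path, arcname=csv_path.name)
        archive.write(txt_path, arcname=txt_path.name)
    return zip_path
```

A call such as `package_submission(real_df, model.predict(X_real), 0.95, "G1")` would then produce `G1.zip`, assuming `real_df` and `X_real` hold the real-life data prepared the same way as the training features.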

## Decisions
- It is supervised learning
- It is a regression problem


# Workflow

## Install Dependencies

Import libraries

In [25]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

## Get Data into Python

In [26]:
df = pd.read_csv("sales.csv")

df

Unnamed: 0.1,Unnamed: 0,store_ID,day_of_week,date,nb_customers_on_day,open,promotion,state_holiday,school_holiday,sales
0,425390,366,4,2013-04-18,517,1,0,0,0,4422
1,291687,394,6,2015-04-11,694,1,0,0,0,8297
2,411278,807,4,2013-08-29,970,1,1,0,0,9729
3,664714,802,2,2013-05-28,473,1,1,0,0,6513
4,540835,726,4,2013-10-10,1068,1,1,0,0,10882
...,...,...,...,...,...,...,...,...,...,...
640835,359783,409,6,2013-10-26,483,1,0,0,0,4553
640836,152315,97,1,2014-04-14,987,1,1,0,0,12307
640837,117952,987,1,2014-07-07,925,1,0,0,0,6800
640838,435829,1084,4,2014-06-12,725,1,0,0,0,5344


## Select Features

In [28]:
print(df.dtypes)  # check the dtype of each column

print(df.shape)

Unnamed: 0              int64
store_ID                int64
day_of_week             int64
date                   object
nb_customers_on_day     int64
open                    int64
promotion               int64
state_holiday          object
school_holiday          int64
sales                   int64
dtype: object
(640840, 10)


In [29]:
df = pd.get_dummies(df, columns=["state_holiday"])
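
One thing to watch with `get_dummies`: if the real-life data does not contain every `state_holiday` category, its encoded columns will not match the training columns. A possible way to handle this, sketched on made-up toy frames (the helper `encode_like_training` is hypothetical), is to reindex the new data to the training layout:

```python
import pandas as pd


def encode_like_training(new_df, training_columns):
    """One-hot encode state_holiday and align columns to the training layout."""
    encoded = pd.get_dummies(new_df, columns=["state_holiday"])
    # Columns missing from the new data are added and filled with False;
    # columns that were not present during training are dropped.
    return encoded.reindex(columns=training_columns, fill_value=False)


# Toy stand-ins for the training and real-life data (not the actual dataset).
train_demo = pd.DataFrame({"promotion": [1, 0, 1], "state_holiday": ["0", "a", "b"]})
train_encoded = pd.get_dummies(train_demo, columns=["state_holiday"])

new_demo = pd.DataFrame({"promotion": [0, 1], "state_holiday": ["0", "0"]})
new_encoded = encode_like_training(new_demo, train_encoded.columns)
print(list(new_encoded.columns))
```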


In [30]:
df['date'] = pd.to_datetime(df['date'])

df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)

df = df.drop(columns=['date'])  # Drop the raw date only after deriving week_of_year from it
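
Note that `dt.isocalendar().week` uses ISO 8601 week numbering, so dates close to New Year can be assigned to a week of the neighbouring ISO year. A small sketch on two made-up dates (the frame `dates_demo` is purely illustrative):

```python
import pandas as pd

# Two illustrative dates: one mid-year, one just before New Year.
dates_demo = pd.DataFrame({"date": ["2013-04-18", "2014-12-29"]})
dates_demo["date"] = pd.to_datetime(dates_demo["date"])

# Same transformation as in the cell above.
dates_demo["week_of_year"] = dates_demo["date"].dt.isocalendar().week.astype(int)
print(dates_demo)
```

Here 2014-12-29 is a Monday that already belongs to ISO week 1 of 2015, so late-December rows can share a week number with early-January rows.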

In [31]:
print(df.dtypes)  # check the dtypes again after encoding

print(df.shape)

Unnamed: 0             int64
store_ID               int64
day_of_week            int64
nb_customers_on_day    int64
open                   int64
promotion              int64
school_holiday         int64
sales                  int64
state_holiday_0         bool
state_holiday_a         bool
state_holiday_b         bool
state_holiday_c         bool
week_of_year           int64
dtype: object
(640840, 13)


## Define X and y

In [32]:
from sklearn.model_selection import train_test_split

# Define features and target
X = df.drop("sales", axis=1)  # features (still includes the leftover "Unnamed: 0" index column)
y = df["sales"]               # target

## Split in Train and Test

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Standardization of Data

In [37]:
print(X_train.describe())

          Unnamed: 0       store_ID    day_of_week  nb_customers_on_day  \
count  512672.000000  512672.000000  512672.000000        512672.000000   
mean   355836.819327     557.885188       4.000798           633.505376   
std    205536.360934     321.898611       1.996979           464.131738   
min         0.000000       1.000000       1.000000             0.000000   
25%    177886.750000     280.000000       2.000000           405.000000   
50%    355650.500000     558.000000       4.000000           610.000000   
75%    533930.250000     836.000000       6.000000           838.000000   
max    712044.000000    1115.000000       7.000000          5458.000000   

                open      promotion  school_holiday   week_of_year  
count  512672.000000  512672.000000   512672.000000  512672.000000  
mean        0.830223       0.381780        0.178085      23.608877  
std         0.375437       0.485824        0.382584      14.442820  
min         0.000000       0.000000        0.000

In [40]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Only transform the test set (don't fit again)
X_test_scaled = scaler.transform(X_test)
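
The scaled arrays only have an effect if the model is actually trained on them. One way to make that hard to forget is to chain the scaler and the model in a scikit-learn pipeline. The following is a minimal sketch on synthetic data (the `_demo` arrays are made up, not the sales data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.05, 0.001]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() fits the scaler on the training split only; score() and predict()
# reuse those training statistics to transform the test split.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
demo_r2 = pipe.score(X_te, y_te)
print("Demo R²:", demo_r2)
```

For plain linear regression and tree-based models, scaling should not change R² much; it matters more for regularized or distance-based models.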

In [41]:
print(X_train.describe())  # X_train itself is unchanged; the scaled values are in X_train_scaled

          Unnamed: 0       store_ID    day_of_week  nb_customers_on_day  \
count  512672.000000  512672.000000  512672.000000        512672.000000   
mean   355836.819327     557.885188       4.000798           633.505376   
std    205536.360934     321.898611       1.996979           464.131738   
min         0.000000       1.000000       1.000000             0.000000   
25%    177886.750000     280.000000       2.000000           405.000000   
50%    355650.500000     558.000000       4.000000           610.000000   
75%    533930.250000     836.000000       6.000000           838.000000   
max    712044.000000    1115.000000       7.000000          5458.000000   

                open      promotion  school_holiday   week_of_year  
count  512672.000000  512672.000000   512672.000000  512672.000000  
mean        0.830223       0.381780        0.178085      23.608877  
std         0.375437       0.485824        0.382584      14.442820  
min         0.000000       0.000000        0.000

## Fit / Train

In [42]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()   # create model
model.fit(X_train, y_train)  # train on the unscaled features (X_train_scaled is not used here)

## Predictions

In [43]:
y_pred = model.predict(X_test)

## Evaluation

### Linear regression

In [None]:
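from sklearn.metrics import mean_squared_error, r2_score

# Compare the linear-regression predictions with the held-out targets
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MSE:", mse)
print("R²:", r2)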

MSE: 2194715.9234377993
R²: 0.8515244075004484
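
For reference, R² compares the model's squared error with that of always predicting the mean: R² = 1 − SS_res / SS_tot. A minimal sketch of that calculation on made-up numbers (the helper `r2_manual` is illustrative and should agree with `r2_score`):

```python
import numpy as np


def r2_manual(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot


y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_hat_demo = [2.5, 5.5, 7.0, 9.5]
print(r2_manual(y_true_demo, y_hat_demo))  # SS_res = 0.75, SS_tot = 20 → 0.9625
```

A model that always predicts the mean scores 0, and a model worse than that gets a negative R².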


## Try different models

### RandomForestRegressor

In [47]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
print("R²:", r2_score(y_test, model.predict(X_test)))

R²: 0.9538243220413481


### xgboost

In [51]:
import xgboost as xgb
from sklearn.metrics import r2_score


# Create and train the XGBoost model
model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
model.fit(X_train, y_train)

# Predict on the held-out test set and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print("R²:", r2)


R²: 0.9034551382064819


### Cross-validation

In [53]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print("Average R²:", scores.mean())

Average R²: 0.904656958580017
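
With an integer `cv` and a regressor, `cross_val_score` uses `KFold` without shuffling, so the folds follow the row order of the data. If that order is not random, passing an explicit shuffled `KFold` may give a more representative estimate. A small sketch on synthetic data (the `_demo` arrays and the choice of `LinearRegression` are only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, not the sales dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=300)

# Shuffled 5-fold split with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores_demo = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("Per-fold R²:", scores_demo)
print("Mean:", scores_demo.mean(), "Std:", scores_demo.std())
```

Reporting the spread across folds alongside the mean gives a rough idea of how much the expected R2 in the deliverable might vary.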
