# Moov AI - Data science test 

**Context**

Some store managers are trying to understand **how to increase their stores'
sales.** They have access to historical sales data from several stores.

**Objective**

Develop a Python ML approach (supervised and/or unsupervised) to help store
managers forecast future sales.



In [256]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import display
import plotly.express as px
import plotly.graph_objects as go

# Import xgboost and sklearn
from xgboost import XGBClassifier, XGBRegressor
from sklearn.model_selection import train_test_split

In [257]:
# Import csv
df = pd.read_csv(r'/Users/philippebeliveau/Desktop/Notebook/Moov AI/stores_sales_forecasting.csv', encoding='ISO-8859-1')

# Transform 'Order Date' and 'Ship Date' to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date']) 

# Treat 'Postal Code' and 'Row ID' as categorical identifiers rather than numbers
df['Postal Code'] = df['Postal Code'].astype('str')
df['Row ID'] = df['Row ID'].astype('str')

print(f"Shape of the dataset: {df.shape}")
display(df.head(5).style.set_sticky().set_properties(**{'overflow-x': 'auto'}))

df.columns

Shape of the dataset: (2121, 21)


Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,2016-11-08 00:00:00,2016-11-11 00:00:00,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2016-152156,2016-11-08 00:00:00,2016-11-11 00:00:00,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back",731.94,3,0.0,219.582
2,4,US-2015-108966,2015-10-11 00:00:00,2015-10-18 00:00:00,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
3,6,CA-2014-115812,2014-06-09 00:00:00,2014-06-14 00:00:00,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,FUR-FU-10001487,Furniture,Furnishings,"Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood",48.86,7,0.0,14.1694
4,11,CA-2014-115812,2014-06-09 00:00:00,2014-06-14 00:00:00,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,FUR-TA-10001539,Furniture,Tables,Chromcraft Rectangular Conference Tables,1706.184,9,0.2,85.3092


Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State',
       'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category',
       'Product Name', 'Sales', 'Quantity', 'Discount', 'Profit'],
      dtype='object')

# Question #1: Data preparation

As is often the case in projects, the dataset may need some manipulation
before an ML approach can use it.
- If you ran into data quality problems while manipulating the sales data,
how did you resolve them?
- Limit yourself to the three issues you consider most relevant (support each with a visual).
- Can the insights you found be turned into features that will make it easier
for the ML model to learn?

Feature creation

In [258]:
# Create a feature to compute the difference between the order date and the ship date
df['Order Ship Delta'] = (df['Ship Date'] - df['Order Date']).dt.days

# Create a feature regarding the profit margin 
df['Profit Margin'] = df['Profit'] / df['Sales']

## Issue #1 - NaN values

Add a row for every calendar date between the earliest and latest 'Order Date', so that days without any order appear explicitly as NaN rows.

In [259]:
def adjust_dataset_for_daily_entries(df, date_col):
    """Ensures that the dataset has a row for every single day, filling missing days with NaN values."""
    df[date_col] = pd.to_datetime(df[date_col])
    all_dates = pd.date_range(start=df[date_col].min(), end=df[date_col].max(), freq='D')
    
    # Ensure all columns are retained, filling missing values with NaN
    full_df = pd.DataFrame(all_dates, columns=[date_col])
    df = full_df.merge(df, on=date_col, how='left')
    
    return df

df = adjust_dataset_for_daily_entries(df, 'Order Date')
df

Unnamed: 0,Order Date,Row ID,Order ID,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Order Ship Delta,Profit Margin
0,2014-01-06,7475,CA-2014-167199,2014-01-10,Standard Class,ME-17320,Maria Etezadi,Home Office,United States,Henderson,...,FUR-CH-10004063,Furniture,Chairs,Global Deluxe High-Back Manager's Chair,2573.820,9.0,0.0,746.4078,4.0,0.2900
1,2014-01-07,7661,CA-2014-105417,2014-01-12,Standard Class,VS-21820,Vivek Sundaresam,Consumer,United States,Huntsville,...,FUR-FU-10004864,Furniture,Furnishings,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",76.728,3.0,0.6,-53.7096,5.0,-0.7000
2,2014-01-08,,,NaT,,,,,,,...,,,,,,,,,,
3,2014-01-09,,,NaT,,,,,,,...,,,,,,,,,,
4,2014-01-10,867,CA-2014-149020,2014-01-15,Standard Class,AJ-10780,Anthony Jacobs,Corporate,United States,Springfield,...,FUR-FU-10000965,Furniture,Furnishings,"Howard Miller 11-1/2"" Diameter Ridgewood Wall ...",51.940,1.0,0.0,21.2954,5.0,0.4100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2682,2017-12-29,5458,CA-2017-130631,2018-01-02,Standard Class,BS-11755,Bruce Stewart,Consumer,United States,Edmonds,...,FUR-FU-10004093,Furniture,Furnishings,Hand-Finished Solid Wood Document Frame,68.460,2.0,0.0,20.5380,4.0,0.3000
2683,2017-12-29,7633,US-2017-158526,2018-01-01,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Louisville,...,FUR-CH-10002602,Furniture,Chairs,DMI Arturo Collection Mission-style Design Woo...,1207.840,8.0,0.0,314.0384,3.0,0.2600
2684,2017-12-29,7636,US-2017-158526,2018-01-01,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Louisville,...,FUR-CH-10004495,Furniture,Chairs,"Global Leather and Oak Executive Chair, Black",300.980,1.0,0.0,87.2842,3.0,0.2900
2685,2017-12-29,7637,US-2017-158526,2018-01-01,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Louisville,...,FUR-CH-10001270,Furniture,Chairs,Harbour Creations Steel Folding Chair,258.750,3.0,0.0,77.6250,3.0,0.3000


In [260]:
# Print the number of nan values for the sales column 
print(f"Number of NaN values in the 'Sales' column: {df['Sales'].isna().sum()}")
print(f"Number of days in the dataset: {df['Order Date'].nunique()}")
# Calculate the number of days with missing sales values
missing_sales_days = df['Sales'].isna().sum()
# Calculate the percentage of days with missing sales values
missing_sales_percentage = missing_sales_days / df['Order Date'].nunique() * 100
print(f"Percentage of days with missing sales values: {missing_sales_percentage:.2f}%")

Number of NaN values in the 'Sales' column: 566
Number of days in the dataset: 1455
Percentage of days with missing sales values: 38.90%


About 39% of the days have no sales at all, so the daily series is too sparse. I therefore move the time series from a daily to a weekly frequency.

In [261]:
# Resample from daily to weekly frequency and count the remaining NaN values.
# Note: 'sum' returns 0 for a week without orders, so empty weeks show up as 0, not NaN.
df_weekly = df.resample('W', on='Order Date').agg({
    'Sales': 'sum',
    'Profit': 'sum',
    'Profit Margin': 'sum',
    'Order Ship Delta': 'mean',
    'Discount': 'mean'
}).reset_index()
print(f"Number of NaN values in the 'Sales' column: {df_weekly['Sales'].isna().sum()}")
print(f"Number of weeks in the dataset: {df_weekly['Sales'].shape[0]}")
df_weekly

Number of NaN values in the 'Sales' column: 0
Number of weeks in the dataset: 208


Unnamed: 0,Order Date,Sales,Profit,Profit Margin,Order Ship Delta,Discount
0,2014-01-12,2712.4280,717.0750,0.310000,4.250000,0.150000
1,2014-01-19,1250.4730,-254.0044,-2.229902,2.400000,0.310000
2,2014-01-26,1655.9580,355.6263,2.542500,4.888889,0.022222
3,2014-02-02,623.6660,-13.2304,-0.038235,4.000000,0.175000
4,2014-02-09,14.5600,5.5328,0.380000,1.000000,0.000000
...,...,...,...,...,...,...
203,2017-12-03,16008.1720,-178.6637,1.399087,4.031250,0.190625
204,2017-12-10,8794.4040,519.7188,2.803340,3.714286,0.121429
205,2017-12-17,4639.8190,-780.1243,-2.115747,4.052632,0.255263
206,2017-12-24,7274.0430,557.5274,1.499951,4.545455,0.125000
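A caveat on the weekly aggregation: `sum` skips NaN values, so a week with no orders is reported as 0 and the NaN count is 0 by construction. Summing per-order profit margins also yields a number that is hard to interpret. The sketch below, on made-up toy data, shows two possible adjustments: `min_count=1` keeps empty weeks as NaN, and the weekly margin is recomputed as total profit over total sales. The toy values and variable names are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy daily data (made up): the second week contains only a padded NaN row.
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-06", "2014-01-07", "2014-01-15", "2014-01-22"]),
    "Sales": [100.0, 300.0, np.nan, 50.0],
    "Profit": [20.0, -30.0, np.nan, 10.0],
})

# Default behaviour: an all-NaN week sums to 0.
weekly_default = toy.resample("W", on="Order Date")["Sales"].sum()

# With min_count=1, a week needs at least one real value, otherwise it stays NaN.
weekly_strict = toy.resample("W", on="Order Date")["Sales"].sum(min_count=1)

# Weekly margin as a ratio of weekly totals instead of a sum of per-order ratios.
weekly = toy.resample("W", on="Order Date")[["Sales", "Profit"]].sum(min_count=1)
weekly["Profit Margin"] = weekly["Profit"] / weekly["Sales"]

print(weekly_default)
print(weekly_strict)
print(weekly)
```

Whether an empty week should count as 0 sales or as missing depends on the model: 0 is a reasonable target for a forecasting model, while NaN is safer when computing averages.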


Weekly aggregation only works directly on numeric columns. For a categorical variable such as Region, I would need one column per category: either a binary indicator (did the region make a sale that week, 1 or 0) or a frequency encoding (how many sales it made that week).

This means that each region ends up in its own column rather than in a single one.

Open question: how can I aggregate by week while still being able to filter by segment?
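One way to keep a categorical column while aggregating by week is to group on the week and the category together. The sketch below uses made-up toy data and `pd.Grouper`; the long table keeps `Region` as a filterable key, and the wide tables give the frequency and binary encodings described above. Column names such as `total_sales` and `n_orders` are illustrative assumptions.

```python
import pandas as pd

# Toy order lines (made up for illustration)
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2014-01-06", "2014-01-07", "2014-01-08", "2014-01-14"]),
    "Region": ["West", "East", "West", "East"],
    "Sales": [100.0, 50.0, 25.0, 10.0],
})

# Long format: one row per (week, region), so Region can still be filtered on.
weekly_long = (
    toy.groupby([pd.Grouper(key="Order Date", freq="W"), "Region"])["Sales"]
       .agg(total_sales="sum", n_orders="count")
       .reset_index()
)

# Wide format, frequency encoding: number of orders per region and week.
weekly_wide = (
    weekly_long.pivot(index="Order Date", columns="Region", values="n_orders")
               .fillna(0)
               .astype(int)
)

# Wide format, binary encoding: did the region sell anything that week?
made_sale = (weekly_wide > 0).astype(int)

print(weekly_long)
print(weekly_wide)
print(made_sale)
```

The long format is usually easier to plot and filter; the wide format is closer to what a single-series model with extra features expects.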


## Choosing the Level of Granularity for the Time Series

**2. Visualizations to Determine the Right Level of Granularity**

(a) Overall Sales Trends
- Visualization: Line plot of total sales over time
- Purpose: Check if sales exhibit clear seasonal or trend-based patterns.

In [262]:
# Distribution of days with and without sales, by day of the week
df['Day of Week'] = df['Order Date'].dt.day_name()
df['Has NaN Sales'] = df['Sales'].isna()
df['Has NaN Sales'] = df['Has NaN Sales'].replace({True: 'Yes', False: 'No'})

fig = px.histogram(df, x='Day of Week', color='Has NaN Sales', barmode='group')
fig.update_layout(title='Distribution of NaN Sales values per day of the week')
fig.show()

In [263]:
# Weekly sales trend per region, with all regions on one chart and a legend.
# Grouping by region before resampling keeps the categorical column,
# which a plain weekly resample would drop.

import pandas as pd
import plotly.express as px

def plot_weekly_sales_trend_by_region(df, date_col, sales_col, region_col):
    """Plots the weekly sales trend by region."""
    df[date_col] = pd.to_datetime(df[date_col])
    
    # Resample the data by week within each region
    df_weekly = df.set_index(date_col).groupby(region_col).resample('W').agg({
        sales_col: 'sum'
    }).reset_index()
    
    # Draw the chart
    fig = px.line(df_weekly, x=date_col, y=sales_col, color=region_col, title='Weekly Sales Trend by Region')
    fig.update_layout(xaxis_title='Order Date', yaxis_title='Sales', legend_title='Region')
    fig.show()

# Example usage
plot_weekly_sales_trend_by_region(df, 'Order Date', 'Sales', 'Region')

In [276]:
import pandas as pd

def compute_zero_percentage_weekly_by_region(df, date_col, sales_col, region_col):
    """Computes the percentage of weeks with zero sales for each group in `region_col`."""
    df[date_col] = pd.to_datetime(df[date_col])
    
    # Resample the data by week within each group
    df_weekly = df.set_index(date_col).groupby(region_col).resample('W').agg({
        sales_col: 'sum'
    }).reset_index()
    
    # Count the total number of weeks per group
    total_weeks = df_weekly.groupby(region_col).size()
    
    # Count the zero-sales weeks per group (0 for groups that never hit zero)
    zero_counts = (
        df_weekly[df_weekly[sales_col] == 0].groupby(region_col).size()
        .reindex(total_weeks.index, fill_value=0)
    )
    
    # Compute the percentage of zero-sales weeks per group
    zero_percentage = (zero_counts / total_weeks) * 100
    
    return zero_percentage

# Example usage (the grouping column here is 'Sub-Category', not 'Region')
zero_percentage = compute_zero_percentage_weekly_by_region(df, 'Order Date', 'Sales', 'Sub-Category')
print(zero_percentage)

Sub-Category
Bookcases      34.299517
Chairs         12.500000
Furnishings     4.326923
Tables         24.878049
dtype: float64


In [None]:
import plotly.express as px

def plot_sales_distribution_by_region(df, sales_col, region_col):
    """Plots the sales distribution by region using a box plot."""
    fig = px.box(df, x=region_col, y=sales_col, title='Sales Distribution by Region')
    fig.update_layout(xaxis_title='Region', yaxis_title='Sales', legend_title='Region')
    fig.show()

# Example usage
plot_sales_distribution_by_region(df, 'Sales', 'Region')

In [None]:
def plot_overall_sales_trend(df, date_col, sales_col):
    """Plots the overall sales trend over time using Plotly."""
    df[date_col] = pd.to_datetime(df[date_col])
    sales_trend = df.groupby(date_col)[sales_col].sum().reset_index()
    
    fig = px.line(sales_trend, x=date_col, y=sales_col, title='Overall Sales Trend Over Time')
    fig.update_xaxes(title_text='Date')
    fig.update_yaxes(title_text='Total Sales')

    fig.show()

plot_overall_sales_trend(df_weekly, 'Order Date', 'Sales')

(b) Product Category Trends
- Visualization: Faceted line plots of sales per category over time.
- Purpose: Identify whether different product categories follow distinct trends.
- Action: If they follow similar trends, you can aggregate. If they differ significantly, separate forecasting may be needed.

In [None]:
def plot_category_sales_trend(df, date_col, sales_col, category_col):
    """Plots sales trends for each product category over time using Plotly, one line per category."""
    df[date_col] = pd.to_datetime(df[date_col])
    category_trend = df.groupby([date_col, category_col])[sales_col].sum().reset_index()
    
    fig = px.line(category_trend, x=date_col, y=sales_col, color=category_col, title='Sales Trends by Product Category')
    fig.update_xaxes(title_text='Date')
    fig.update_yaxes(title_text='Total Sales')
    
    fig.show()

plot_category_sales_trend(df, 'Order Date', 'Sales', 'Sub-Category')

(c) Regional Trends
- Visualization: Faceted line plots of sales per region.
- Purpose: Identify if regional trends exist.
- Action: If all regions behave similarly, no need for regional separation.

In [None]:
def plot_region_sales_trend(df, date_col, sales_col, region_col):
    """Plots sales trends for each region over time using Plotly, one line per region."""
    df[date_col] = pd.to_datetime(df[date_col])
    region_trend = df.groupby([date_col, region_col])[sales_col].sum().reset_index()
    
    fig = px.line(region_trend, x=date_col, y=sales_col, color=region_col, title='Sales Trends by Region')
    fig.update_xaxes(title_text='Date')
    fig.update_yaxes(title_text='Total Sales')
    
    fig.show()

plot_region_sales_trend(df, 'Order Date', 'Sales', 'Region')

(d) Heatmaps for Category-Region Interactions
- Visualization: Heatmap of total sales by category and region, summed over the whole period.
- Purpose: Detect if interactions between region and category drive differences.
- Action: If category and region interactions drive significant variations, a multi-series approach may be needed.

In [None]:
def plot_category_region_heatmap(df, date_col, sales_col, category_col, region_col):
    """Plots a heatmap showing sales variations by category and region using Plotly."""
    df[date_col] = pd.to_datetime(df[date_col])
    sales_pivot = df.pivot_table(index=category_col, columns=region_col, values=sales_col, aggfunc='sum').reset_index()
    
    fig = px.imshow(sales_pivot.set_index(category_col).values,
                     labels=dict(x='Region', y='Category', color='Sales'),
                     x=sales_pivot.columns[1:],
                     y=sales_pivot[category_col],
                     title='Sales Heatmap by Category and Region')
    fig.show()

plot_category_region_heatmap(df, 'Order Date', 'Sales', 'Sub-Category', 'Region')

(e) Variance Analysis
- Test: Compute variance in sales across categories and across regions.
- Purpose: If variance across categories is high but across regions is low, product categories should be the primary segmentation.

In [None]:
def compute_variance_analysis(df, sales_col, category_col, region_col):
    """Computes variance in sales across categories and regions."""
    category_variance = df.groupby(category_col)[sales_col].var()
    region_variance = df.groupby(region_col)[sales_col].var()
    
    print("Variance in Sales by Category:")
    print(category_variance.sort_values(ascending=False))
    print("\nVariance in Sales by Region:")
    print(region_variance.sort_values(ascending=False))

compute_variance_analysis(df, 'Sales', 'Sub-Category', 'Region')

Variance in Sales by Category:
Sub-Category
Bookcases      407999.675401
Tables         379178.426350
Chairs         302663.089138
Furnishings     21872.528677
Name: Sales, dtype: float64

Variance in Sales by Region:
Region
East       299896.711229
South      263359.971543
West       235458.278132
Central    215262.639907
Name: Sales, dtype: float64


### Seasonality analysis

In [None]:
from plotly.subplots import make_subplots
from statsmodels.tsa.seasonal import seasonal_decompose

def plot_time_series_decomposition(df, date_col, sales_col, period=52):
    """Plots an additive time series decomposition using Plotly.

    `period` is the number of observations per seasonal cycle
    (52 for yearly seasonality in weekly data).
    """
    # Build a date-indexed series without modifying the caller's DataFrame
    series = df.set_index(pd.to_datetime(df[date_col]))[sales_col]
    
    # Decompose the time series
    decomposition = seasonal_decompose(series, model='additive', period=period)
    
    # Extract the components
    trend = decomposition.trend
    seasonal = decomposition.seasonal
    residual = decomposition.resid
    
    # Create one subplot per component
    fig = make_subplots(rows=4, cols=1, shared_xaxes=True, 
                        subplot_titles=('Original', 'Trend', 'Seasonal', 'Residual'),
                        vertical_spacing=0.1)
    
    fig.add_trace(go.Scatter(x=series.index, y=series, mode='lines', name='Original'), row=1, col=1)
    fig.add_trace(go.Scatter(x=series.index, y=trend, mode='lines', name='Trend'), row=2, col=1)
    fig.add_trace(go.Scatter(x=series.index, y=seasonal, mode='lines', name='Seasonal'), row=3, col=1)
    fig.add_trace(go.Scatter(x=series.index, y=residual, mode='lines', name='Residual'), row=4, col=1)
    
    fig.update_layout(title='Time Series Decomposition', showlegend=False)
    fig.show()

# Example usage
plot_time_series_decomposition(df_weekly, 'Order Date', 'Sales')

In [None]:
import pandas as pd
import plotly.express as px

def plot_seasonality(df, date_col, sales_col):
    """Plots the average sales per calendar month using Plotly."""
    df[date_col] = pd.to_datetime(df[date_col])
    df['Month'] = df[date_col].dt.month
    df['Year'] = df[date_col].dt.year
    
    # Compute the average sales for each calendar month
    seasonality_trend = df.groupby('Month')[sales_col].mean().reset_index()

    # Draw the chart
    fig = px.line(seasonality_trend, x='Month', y=sales_col, title='Seasonality')
    fig.update_xaxes(title_text='Month', tickmode='array', tickvals=list(range(1, 13)), 
                     ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    fig.update_yaxes(title_text='Average Sales')
    fig.show()

# Example usage
plot_seasonality(df, 'Order Date', 'Sales')

In [None]:
def plot_seasonality_per_group(df, date_col, sales_col, group_col):
    """Plots the average monthly sales for each group, min-max normalized within the group."""
    df[date_col] = pd.to_datetime(df[date_col])
    df['Month'] = df[date_col].dt.month
    df['Year'] = df[date_col].dt.year
    
    # Compute the average sales for each group and calendar month
    seasonality_trend = df.groupby([group_col, 'Month'])[sales_col].mean().reset_index()
    
    # Normalize sales within each group to the [0, 1] range
    seasonality_trend[sales_col] = seasonality_trend.groupby(group_col)[sales_col].transform(lambda x: (x - x.min()) / (x.max() - x.min()))
    
    # Draw the chart
    fig = px.line(seasonality_trend, x='Month', y=sales_col, color=group_col, title='Seasonality by Group (Normalized)')
    fig.update_xaxes(title_text='Month', tickmode='array', tickvals=list(range(1, 13)), 
                     ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    fig.update_yaxes(title_text='Normalized Average Sales')
    fig.show()

plot_seasonality_per_group(df, 'Order Date', 'Sales', 'Sub-Category')

In [None]:
plot_seasonality_per_group(df, 'Order Date', 'Sales', 'Region')

In [None]:
plot_seasonality_per_group(df, 'Order Date', 'Sales', 'Segment')

In [None]:
# Box plot of sales by day of the week (0 = Monday, 6 = Sunday)
df['Day of Week'] = df['Order Date'].dt.dayofweek
fig = px.box(df, x='Day of Week', y='Sales', title='Sales Distribution by Day of Week')
fig.update_xaxes(title_text='Day of Week', tickmode='array', tickvals=list(range(7)), 
                 ticktext=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
fig.update_yaxes(title_text='Sales')
fig.show()


In [271]:
import plotly.express as px


# Add a column for the day of the week (0 = Monday, 6 = Sunday)
df['Day of Week'] = df['Order Date'].dt.dayofweek

# Box plot of sales by day of the week and by region
fig = px.box(df, x='Day of Week', y='Sales', color='Region', title='Sales Distribution by Day of Week and Region')
fig.update_xaxes(title_text='Day of Week', tickmode='array', tickvals=list(range(7)), 
                 ticktext=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
fig.update_yaxes(title_text='Sales')
fig.show()

## Issue #2 - Outliers

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import zscore

def adjust_dataset_for_daily_entries(df, date_col):
    """Aggregates the numeric columns by day and ensures there is one row for every single day."""
    # Convert the date column to datetime
    df[date_col] = pd.to_datetime(df[date_col])

    # Aggregate sales data by day
    daily_data = df.groupby(date_col).sum(numeric_only=True).reset_index()

    # Create a full date range
    all_dates = pd.date_range(start=daily_data[date_col].min(), end=daily_data[date_col].max(), freq='D')

    # Create a DataFrame with the full date range
    full_dates_df = pd.DataFrame(all_dates, columns=[date_col])

    # Merge the full date range with the aggregated daily data
    df = full_dates_df.merge(daily_data, on=date_col, how='left')
    
    return df

def identify_outliers(df, sales_col):
    """Identifies outliers in the sales data using Z-score method."""
    df['Z_Score'] = zscore(df[sales_col].fillna(0))  # Compute Z-scores for sales
    outliers = df[df['Z_Score'].abs() > 3]  # Z-score > 3 indicates an outlier
    return outliers

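def assess_holiday_effect(outliers, date_col, holidays):
    """Checks if outliers coincide with known holidays."""
    # Copy first: `outliers` is a slice of the full DataFrame, and assigning
    # columns on a slice triggers pandas' SettingWithCopyWarning
    outliers = outliers.copy()
    outliers[date_col] = pd.to_datetime(outliers[date_col])
    holidays = pd.to_datetime(holidays)
    outliers['Is_Holiday'] = outliers[date_col].isin(holidays)
    return outliers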

def plot_outliers_with_holidays(df, date_col, sales_col, outliers):
    """Plots sales data and highlights outliers and holidays."""
    fig = px.line(df, x=date_col, y=sales_col, title='Sales with Outliers and Holidays')
    
    # Highlight outliers
    fig.add_trace(go.Scatter(x=outliers[date_col], y=outliers[sales_col], 
                             mode='markers', 
                             marker=dict(color='red', size=8), 
                             name='Outliers'))
    
    # Highlight holidays
    holidays = outliers[outliers['Is_Holiday']]
    fig.add_trace(go.Scatter(x=holidays[date_col], y=holidays[sales_col], 
                             mode='markers', 
                             marker=dict(color='green', size=14), 
                             name='Holidays'))
    
    fig.update_xaxes(title_text='Date')
    fig.update_yaxes(title_text='Sales')
    fig.show()


In [None]:
holidays = [
    '2014-01-01',  # New Year's Day
    '2014-01-20',  # Martin Luther King Jr. Day
    '2014-02-14',  # Valentine's Day (Sales on gifts, chocolates, flowers)
    '2014-02-17',  # Presidents' Day (Major sales on furniture, appliances, cars)
    '2014-03-17',  # St. Patrick's Day (Sales on alcohol, party supplies)
    '2014-04-20',  # Easter Sunday (Sales on candy, clothing, decorations)
    '2014-05-11',  # Mother's Day (Sales on gifts, jewelry, beauty products)
    '2014-05-26',  # Memorial Day (Major sales on home goods, cars, mattresses)
    '2014-06-15',  # Father's Day (Sales on tools, electronics, clothing)
    '2014-07-04',  # Independence Day (Sales on grills, outdoor furniture, appliances)
    '2014-09-01',  # Labor Day (Major sales on furniture, appliances, clothing)
    '2014-10-13',  # Columbus Day (Retail sales, especially clothing and outdoor gear)
    '2014-10-31',  # Halloween (Sales on costumes, candy, decorations)
    '2014-11-11',  # Veterans Day (Military discounts, retail sales)
    '2014-11-27',  # Thanksgiving
    '2014-11-28',  # Black Friday (Biggest shopping day of the year)
    '2014-12-01',  # Cyber Monday (Major online retail discounts)
    '2014-12-25',  # Christmas Day (Post-Christmas sales)
    '2014-12-26',  # Boxing Day (Retail clearance sales)
    
    '2015-01-01',  # New Year's Day
    '2015-01-19',  # Martin Luther King Jr. Day
    '2015-02-14',  # Valentine's Day
    '2015-02-16',  # Presidents' Day
    '2015-03-17',  # St. Patrick's Day
    '2015-04-05',  # Easter Sunday
    '2015-05-10',  # Mother's Day
    '2015-05-25',  # Memorial Day
    '2015-06-21',  # Father's Day
    '2015-07-04',  # Independence Day
    '2015-09-07',  # Labor Day
    '2015-10-12',  # Columbus Day
    '2015-10-31',  # Halloween
    '2015-11-11',  # Veterans Day
    '2015-11-26',  # Thanksgiving
    '2015-11-27',  # Black Friday
    '2015-11-30',  # Cyber Monday
    '2015-12-25',  # Christmas Day
    '2015-12-26',  # Boxing Day

    '2016-01-01',  # New Year's Day
    '2016-01-18',  # Martin Luther King Jr. Day
    '2016-02-14',  # Valentine's Day
    '2016-02-15',  # Presidents' Day
    '2016-03-17',  # St. Patrick's Day
    '2016-03-27',  # Easter Sunday
    '2016-05-08',  # Mother's Day
    '2016-05-30',  # Memorial Day
    '2016-06-19',  # Father's Day
    '2016-07-04',  # Independence Day
    '2016-09-05',  # Labor Day
    '2016-10-10',  # Columbus Day
    '2016-10-31',  # Halloween
    '2016-11-11',  # Veterans Day
    '2016-11-24',  # Thanksgiving
    '2016-11-25',  # Black Friday
    '2016-11-28',  # Cyber Monday
    '2016-12-25',  # Christmas Day
    '2016-12-26',  # Boxing Day

    '2017-01-01',  # New Year's Day
    '2017-01-16',  # Martin Luther King Jr. Day
    '2017-02-14',  # Valentine's Day
    '2017-02-20',  # Presidents' Day
    '2017-03-17',  # St. Patrick's Day
    '2017-04-16',  # Easter Sunday
    '2017-05-14',  # Mother's Day
    '2017-05-29',  # Memorial Day
    '2017-06-18',  # Father's Day
    '2017-07-04',  # Independence Day
    '2017-09-04',  # Labor Day
    '2017-10-09',  # Columbus Day
    '2017-10-31',  # Halloween
    '2017-11-11',  # Veterans Day
    '2017-11-23',  # Thanksgiving
    '2017-11-24',  # Black Friday
    '2017-11-27',  # Cyber Monday
    '2017-12-25',  # Christmas Day
    '2017-12-26',  # Boxing Day
]
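As a cross-check on the hand-typed list above, part of it can be generated. The sketch below uses pandas' built-in `USFederalHolidayCalendar` and derives Black Friday and Cyber Monday from Thanksgiving. It is only a partial substitute: the calendar covers federal holidays only (no Valentine's Day, Easter, Halloween, and so on) and returns observed dates, so a holiday falling on a weekend is shifted to the nearest workday, unlike the list above.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Federal holidays over the period covered by the dataset (observed dates)
federal = USFederalHolidayCalendar().holidays(start="2014-01-01", end="2017-12-31")

# Thanksgiving is the only federal holiday falling on November 22-28
thanksgiving = federal[(federal.month == 11) & (federal.day >= 22)]

# Retail dates derived from Thanksgiving
black_friday = thanksgiving + pd.Timedelta(days=1)
cyber_monday = thanksgiving + pd.Timedelta(days=4)

generated_holidays = federal.union(black_friday).union(cyber_monday)
print(generated_holidays)
```

Dates from the manual list that are missing from `generated_holidays` are either non-federal retail events or weekend holidays that the calendar moved to their observed day.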


In [None]:
# Convert the holiday dates to datetime if needed
holidays = pd.to_datetime(holidays)

df = adjust_dataset_for_daily_entries(df, 'Order Date')
outliers = identify_outliers(df, 'Sales')
outliers_with_holidays = assess_holiday_effect(outliers, 'Order Date', holidays)
plot_outliers_with_holidays(df, 'Order Date', 'Sales', outliers_with_holidays)






In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import zscore

def identify_outliers(df, sales_col):
    """Creates a new column indicating if a row is an outlier based on Z-scores."""
    df['Z_Score'] = zscore(df[sales_col].fillna(0))  # Compute Z-scores for sales
    df['Is_Outlier'] = df['Z_Score'].abs() > 3  # Outliers have Z-score > 3
    return df

def outlier_distribution_by_weekday(df, date_col, outlier_col):
    """Analyzes the distribution of outliers across days of the week."""
    df[date_col] = pd.to_datetime(df[date_col])
    df['Day_of_Week'] = df[date_col].dt.day_name()  # Add day of week column
    outlier_distribution = df[df[outlier_col]].groupby('Day_of_Week').size().reset_index(name='Count')
    
    # Sort by day of the week order
    days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    outlier_distribution['Day_of_Week'] = pd.Categorical(outlier_distribution['Day_of_Week'], categories=days_order, ordered=True)
    outlier_distribution = outlier_distribution.sort_values('Day_of_Week')
    
    # Plot the distribution
    fig = px.bar(outlier_distribution, x='Day_of_Week', y='Count', title='Outlier Distribution by Day of the Week', 
                 labels={'Count': 'Number of Outliers', 'Day_of_Week': 'Day of the Week'})
    fig.show()
    
    return outlier_distribution

# Example usage
df = adjust_dataset_for_daily_entries(df, 'Order Date')
df = identify_outliers(df, 'Sales')
outlier_distribution = outlier_distribution_by_weekday(df, 'Order Date', 'Is_Outlier')


In [None]:
import pandas as pd
import plotly.express as px
from scipy.stats import zscore

def identify_outliers(df, sales_col):
    """Creates a new column indicating if a row is an outlier based on Z-scores."""
    df['Z_Score'] = zscore(df[sales_col].fillna(0))  # Compute Z-scores for sales
    df['Is_Outlier'] = df['Z_Score'].abs() > 3  # Outliers have Z-score > 3
    return df

def outlier_distribution_by_month_year(df, date_col, outlier_col):
    """Analyzes the distribution of outliers across months and years."""
    df[date_col] = pd.to_datetime(df[date_col])
    df['Month'] = df[date_col].dt.month_name()  # Add month column
    df['Year'] = df[date_col].dt.year  # Add year column
    outlier_distribution = df[df[outlier_col]].groupby(['Year', 'Month']).size().reset_index(name='Count')
    
    # Sort by month order
    months_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    outlier_distribution['Month'] = pd.Categorical(outlier_distribution['Month'], categories=months_order, ordered=True)
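    outlier_distribution = outlier_distribution.sort_values(['Year', 'Month'])
    # Cast Year to string so Plotly treats it as a discrete category (one bar group per year)
    outlier_distribution['Year'] = outlier_distribution['Year'].astype(str)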
    
    # Plot the distribution
    fig = px.bar(outlier_distribution, x='Month', y='Count', color='Year', barmode='group', 
                 title='Outlier Distribution by Month and Year', 
                 labels={'Count': 'Number of Outliers', 'Month': 'Month', 'Year': 'Year'})
    fig.show()
    
    return outlier_distribution

# Example usage
df = identify_outliers(df, 'Sales')
outlier_distribution = outlier_distribution_by_month_year(df, 'Order Date', 'Is_Outlier')

#### Examine Relationships with Other Variables
Correlations:
- Are outliers correlated with discounts, shipping delays, or specific products?

Categorical Breakdowns:
- Are certain regions or categories disproportionately associated with outliers?

Temporal Patterns:
- Are outliers more common on certain days of the week or months of the year?

Promotions:
- Check the distribution of the outliers with respect to the discount, to identify whether sales campaigns or discounts caused spikes in sales.
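The questions above could be explored along the following lines. This is a minimal sketch on synthetic order lines, not on the notebook's data: outliers are flagged with the same |z| > 3 rule, then compared with regular rows on discount and on region. All values and the `Outlier` label column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic order lines (made up for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Region": rng.choice(["East", "West", "South", "Central"], size=200),
    "Discount": rng.choice([0.0, 0.2, 0.5], size=200),
    "Sales": rng.gamma(shape=2.0, scale=150.0, size=200),
})
toy.loc[:4, "Sales"] = 10_000.0  # inject five obvious spikes (rows 0 to 4)

# Flag outliers with the |z| > 3 rule (population standard deviation)
z = (toy["Sales"] - toy["Sales"].mean()) / toy["Sales"].std(ddof=0)
toy["Outlier"] = np.where(z.abs() > 3, "outlier", "regular")

# Discount statistics for outliers versus regular rows
discount_summary = toy.groupby("Outlier")["Discount"].agg(["mean", "median", "count"])

# Share of outliers within each region (each row sums to 1)
region_breakdown = pd.crosstab(toy["Region"], toy["Outlier"], normalize="index")

print(discount_summary)
print(region_breakdown)
```

On the real data, a clearly higher mean discount among outliers would point to promotions as a driver, while a region with a much larger outlier share would suggest a regional effect.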

## Issue #3 - 