# Case 2: Siemens AI-Driven Sales Forecasting

## Overview
This case study builds a monthly sales forecasting model from real sales data of Siemens’ Smart Infrastructure Division in Germany. The objective is to apply machine learning techniques to predict future sales from historical sales and macroeconomic indicators.

## Business Problem
- Manual sales forecasting is time-consuming and relies on human judgment.
- Data is scattered across multiple sources, making it difficult to derive insights.
- Inaccurate forecasts lead to financial losses, such as inefficient inventory management and unsatisfied customers.

## Objective
- Develop an AI-driven predictive model to automate the forecasting process.
- Evaluate the model using Root Mean Squared Error (RMSE).
- Submit predictions for May 2022 to February 2023 in a structured CSV format.
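
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so large errors weigh more than small ones. A minimal sketch with made-up numbers (the `rmse` helper and both lists are illustrative assumptions, not Siemens data):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

score = rmse(actual, predicted)
print(score)
```

The errors here are -10, 10 and -30, and the single error of 30 contributes most of the score.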

## This notebook was developed by:

- João Venichand - 20211644
- Gonçalo Custódio - 20211643
- Diogo Correia - 20211586
- Duarte Emanuel - 20240564


# 1. Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# 2. Load Datasets

In [None]:
sales_data = pd.read_csv("sales_data.csv")
market_data = pd.read_excel("market_data.xlsx", skiprows=2)
test_set = pd.read_csv("test_set_template.csv")

In [None]:
print(sales_data.head())
sales_data.info()  # info() prints its own report and returns None
print(sales_data.describe())

In [None]:
print(market_data.head())
market_data.info()  # info() prints its own report and returns None
print(market_data.describe())

In [None]:
print(test_set.head())
test_set.info()  # info() prints its own report and returns None
print(test_set.describe())

# 3. Data Exploration & Quality Check

Data Types

In [None]:
print(sales_data.dtypes)

Check Missing Values

In [None]:
print(sales_data.isnull().sum())

In [None]:
print(market_data.isnull().sum())

In [None]:
print(test_set.isnull().sum())

Check for Duplicate Rows

In [None]:
print("Duplicate Rows in Sales Data:", sales_data.duplicated().sum())
print("Duplicate Rows in Market Data:", market_data.duplicated().sum())
print("Duplicate Rows in Test Set:", test_set.duplicated().sum())

Outliers Check

In [None]:
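num_cols = sales_data.select_dtypes(include=['number']).columns

# Size the grid from the number of numeric columns instead of assuming at most six
n_cols = 3
n_rows = max(1, int(np.ceil(len(num_cols) / n_cols)))

plt.figure(figsize=(12, 3 * n_rows))
for i, col in enumerate(num_cols):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.boxplot(y=sales_data[col])
    plt.title(f"Boxplot of {col}")
plt.tight_layout()  # adjust spacing once, after all subplots exist
plt.show()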

In [None]:
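numeric = sales_data.select_dtypes(include=['number'])

# Depending on the SciPy version, zscore may return a plain NumPy array;
# wrapping the result in a DataFrame keeps the column names.
z_values = np.abs(np.asarray(zscore(numeric, nan_policy='omit')))
z_scores = pd.DataFrame(z_values, columns=numeric.columns, index=numeric.index)
threshold = 3
outliers = (z_scores > threshold).sum()  # column-wise count

print("Number of Outliers per Column:")
print(outliers)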
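
The z-score rule can miss extreme values in small or heavily skewed samples, because the outlier itself inflates the standard deviation. As a cross-check, here is a sketch of the interquartile-range (IQR) rule on a toy series. This is a different technique from the z-score check above, and `iqr_outlier_count`, `k` and the data are illustrative assumptions:

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Toy data with one obviously extreme value
toy = pd.Series([10, 12, 11, 13, 12, 11, 500])
n_outliers = iqr_outlier_count(toy)
print(n_outliers)
```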

# 4. Data Cleaning and Preprocessing

Fill missing values in test_set

In [None]:
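# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
test_set['Sales_EUR'] = test_set['Sales_EUR'].fillna(0)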

Forward-fill missing values in market_data

In [None]:
# fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent call
market_data = market_data.ffill()

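Apply a log transformation to Sales_EUR to reduce the influence of outliers. The transformation is applied to every record, not only to the outliers, so the monthly totals computed below are sums of log-transformed values.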

In [None]:
sales_data['Sales_EUR'] = np.log1p(sales_data['Sales_EUR'])
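
Two properties of `log1p` matter for the later steps: predictions made on the log scale have to be mapped back with `np.expm1`, and a sum of log values is not the log of the sum. A small sketch with illustrative numbers (not taken from the dataset):

```python
import numpy as np

sales = np.array([0.0, 99.0, 999.0])  # illustrative values

logged = np.log1p(sales)      # log(1 + x), defined at x = 0
restored = np.expm1(logged)   # inverse transform: exp(y) - 1

# Summing logged values differs from logging the summed values
sum_of_logs = logged.sum()
log_of_sum = np.log1p(sales.sum())

print(restored)
print(sum_of_logs, log_of_sum)
```

If monthly totals on the original scale are needed, one option is to aggregate the raw values first and transform afterwards.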

Convert DATE column to datetime format

In [None]:
sales_data['DATE'] = pd.to_datetime(sales_data['DATE'], format="%d.%m.%Y", errors='coerce')
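
With `errors='coerce'`, any value that does not match the format silently becomes `NaT`, so it may be worth counting how many dates failed to parse. A sketch on a toy series (the sample strings are assumptions for illustration):

```python
import pandas as pd

raw_dates = pd.Series(["01.10.2018", "15.11.2018", "2018-12-01", None])
parsed = pd.to_datetime(raw_dates, format="%d.%m.%Y", errors="coerce")

# Values that were present in the input but could not be parsed
n_unparsed = int(parsed.isna().sum() - raw_dates.isna().sum())
print(n_unparsed)
```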

Extract year and month

In [None]:
sales_data['YearMonth'] = sales_data['DATE'].dt.to_period('M')

Aggregate sales to monthly level

In [None]:
monthly_sales = sales_data.groupby(['YearMonth', 'Mapped_GCK'])['Sales_EUR'].sum().reset_index()

Convert YearMonth back to datetime for merging

In [None]:
# Period -> Timestamp at the first day of each month
monthly_sales['YearMonth'] = monthly_sales['YearMonth'].dt.to_timestamp()

print(monthly_sales.head())

## 4.1 Merge with Market Data

Convert YearMonth column in market_data to datetime

In [None]:
market_data['YearMonth'] = pd.to_datetime(market_data['YearMonth'])

Merge datasets on YearMonth

In [None]:
df = pd.merge(monthly_sales, market_data, on='YearMonth', how='left')
print(df.head())
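
A left merge keeps every sales row even when no market row matches, which leaves NaN in the market columns. The sketch below shows one way to detect that, using the `validate` and `indicator` options of `pd.merge`. The toy frames and the `macro_index` column are hypothetical:

```python
import pandas as pd

sales = pd.DataFrame({
    "YearMonth": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
    "Sales_EUR": [10.0, 12.0, 11.0],
})
market = pd.DataFrame({
    "YearMonth": pd.to_datetime(["2021-01-01", "2021-02-01"]),
    "macro_index": [101.5, 102.0],  # hypothetical indicator column
})

# validate raises if a month appears more than once in the market table;
# indicator adds a _merge column showing where each row came from
checked = pd.merge(sales, market, on="YearMonth", how="left",
                   validate="many_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "YearMonth"]
print(unmatched)
```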

# 5. Feature Engineering

Sort data by Mapped_GCK and YearMonth

In [None]:
df = df.sort_values(by=['Mapped_GCK', 'YearMonth'])

Create lagged sales features

In [None]:
df['Sales_Lag_1M'] = df.groupby('Mapped_GCK')['Sales_EUR'].shift(1)
df['Sales_Lag_2M'] = df.groupby('Mapped_GCK')['Sales_EUR'].shift(2)

Create rolling average features

In [None]:
df['Sales_MA_3M'] = df.groupby('Mapped_GCK')['Sales_EUR'].rolling(window=3, min_periods=1).mean().reset_index(level=0, drop=True)
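
Note that a rolling mean computed directly on Sales_EUR includes the current month, which is the value the model is meant to predict. A possible variant, sketched here on toy data, shifts by one month within each group before averaging so that only past months are used. The column name `Sales_MA_3M_prior` and the group labels are illustrative assumptions:

```python
import pandas as pd

toy = pd.DataFrame({
    "Mapped_GCK": ["#1"] * 4 + ["#3"] * 3,  # illustrative group labels
    "Sales_EUR": [10.0, 20.0, 30.0, 40.0, 5.0, 7.0, 9.0],
})

# Shift first, then average: each row only sees earlier months of its own group
toy["Sales_MA_3M_prior"] = (
    toy.groupby("Mapped_GCK")["Sales_EUR"]
       .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)
print(toy)
```

The first row of each group has no prior month and stays NaN, as with the lag features.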

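Drop rows with NaN values (the first two months of each Mapped_GCK group have no lag values; rows with NaN in any other column are dropped as well)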

In [None]:
df.dropna(inplace=True)

print(df.head())