# 01 - Data Cleaning and Preparation

## Purpose
This notebook performs the initial quality assessment and cleaning of the dataset to ensure it is ready for exploratory data analysis and modeling.

## Objectives
1. Read the weekly dataset and convert the `datum` column to datetime.  
2. Set the `datum` column as the index and sort chronologically.  
3. Inspect data types and check for missing or invalid values.  
4. Summarize descriptive statistics for each ATC-level variable.
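
Objectives 1 and 2 can be sanity-checked once the date index is in place. The sketch below is illustrative and not part of the original notebook: the helper name `check_date_index` is hypothetical, the toy values are made up, and it assumes the frame is already indexed by `datum`. Because the loading cell parses dates with `errors="coerce"`, unparseable dates become `NaT` without raising, so the index itself is worth inspecting.

```python
import pandas as pd


def check_date_index(df: pd.DataFrame) -> dict:
    """Summarize basic health checks for a datetime index (hypothetical helper)."""
    idx = df.index
    return {
        # True only if the index was actually parsed as datetimes
        "is_datetime": isinstance(idx, pd.DatetimeIndex),
        # Dates that failed to parse under errors="coerce" show up as NaT
        "n_unparsed": int(idx.isna().sum()),
        # Repeated dates would count the same week more than once
        "n_duplicates": int(idx.duplicated().sum()),
        # After sort_index() the dates should be in non-decreasing order
        "is_sorted": bool(idx.is_monotonic_increasing),
    }


# Toy example with made-up values, not the real dataset
toy = pd.DataFrame(
    {"datum": ["2014-01-05", "2014-01-12", "not a date"], "N02BE": [1.0, 2.0, 3.0]}
)
toy["datum"] = pd.to_datetime(toy["datum"], errors="coerce")
toy = toy.set_index("datum").sort_index()
report = check_date_index(toy)
print(report)
```

On the real dataframe, a nonzero `n_unparsed` or `n_duplicates` would suggest fixing the raw file before moving on to the value checks.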

## Expected Outcome
- All variables are numeric (`float64`).  
- No missing or negative values exist.  
- The dataset is confirmed to be clean and ready for exploratory analysis.
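
The expectations above can be turned into explicit checks rather than read off printed output. The following is a minimal sketch, not the notebook's own code: the function name `validate_clean` is hypothetical, the toy values are made up, and it assumes the dataframe holds only the ATC columns with `datum` as the index.

```python
import pandas as pd


def validate_clean(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Expectation 1: every column is float64
    non_float = [col for col in df.columns if df[col].dtype != "float64"]
    if non_float:
        problems.append(f"non-float64 columns: {non_float}")

    # Expectation 2a: no missing values in any column
    n_missing = df.isna().sum()
    if n_missing.any():
        problems.append(f"missing values: {n_missing[n_missing > 0].to_dict()}")

    # Expectation 2b: no negative values (checked on numeric columns only)
    n_negative = (df.select_dtypes("number") < 0).sum()
    if n_negative.any():
        problems.append(f"negative values: {n_negative[n_negative > 0].to_dict()}")

    return problems


# Toy frame with made-up values: one deliberate negative and one gap
toy = pd.DataFrame({"N05C": [0.0, 2.0, -1.0], "R03": [2.0, None, 35.0]})
issues = validate_clean(toy)
print(issues)
```

On the real dataframe, `assert not validate_clean(df)` would stop the notebook if any expectation were violated.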


In [None]:
# 01_data_cleaning.ipynb
# Purpose: Inspect column types, missing values, and prepare cleaned dataframe for modeling

import pandas as pd

# Path to the dataset
path = "../data/raw/pharma_sales.csv"

# 1. Read the dataset
df = pd.read_csv(path)

# 2. Convert date column (errors="coerce" turns unparseable dates into NaT instead of raising)
df["datum"] = pd.to_datetime(df["datum"], errors="coerce")

# 3. Set date as index
df = df.set_index("datum").sort_index()

# 4. Check data types
print("Column data types:\n")
print(df.dtypes)

# 5. Check for missing values (data columns only; the datum index is not included)
print("\nMissing values per column:\n")
print(df.isna().sum())

# 6. Quick summary statistics
print("\nDescriptive statistics:\n")
print(df.describe().T)

Column data types:

M01AB    float64
M01AE    float64
N02BA    float64
N02BE    float64
N05B     float64
N05C     float64
R03      float64
R06      float64
dtype: object

Missing values per column:

M01AB    0
M01AE    0
N02BA    0
N02BE    0
N05B     0
N05C     0
R03      0
R06      0
dtype: int64

Descriptive statistics:

       count        mean        std     min       25%         50%       75%  \
M01AB  302.0   35.102441   8.617106   7.670   29.3875   34.565000   40.1750   
M01AE  302.0   27.167611   7.043491   6.237   22.3875   26.789500   31.0465   
N02BA  302.0   27.060295   8.086458   3.500   21.3000   26.500000   32.4750   
N02BE  302.0  208.627161  76.069221  86.250  149.3000  198.300000  252.4715   
N05B   302.0   61.740853  22.436970  18.000   47.0000   57.000000   71.0000   
N05C   302.0    4.138935   3.129265   0.000    2.0000    3.979167    6.0000   
R03    302.0   38.439811  22.900873   2.000   21.0000   35.000000   51.0000   
R06    302.0   20.224561  11.381464   1.00