# Lab 4: Data Quality Assessment & Preprocessing
## ARTI308 - Machine Learning

### Assignment Tasks:
1. Identify data quality issues.
2. Apply missing value strategy (Median).
3. Handle outliers using IQR.
4. Apply Min-Max and Z-score normalization.
5. Apply PCA and interpret variance.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA

sns.set_theme(style="whitegrid")

# 1. Load Dataset
df = pd.read_csv("Chocolate_Sales.csv")

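# Task 1: Data Quality Fixes
# 'Date' is read from the CSV as text; parse it into a proper datetime column.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# 'Amount' is read as text; strip the '$' and ',' characters so it can be cast to float.
df['Amount'] = df['Amount'].replace(r'[\$,]', '', regex=True).astype(float)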

print("Data Types Corrected:")
print(df.dtypes)
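
Task 1 asks to identify issues before fixing them. A minimal sketch of such a check is shown here on a small made-up frame; the helper name `quality_report` and the toy values are assumptions for illustration and are not taken from `Chocolate_Sales.csv`.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })


# Toy frame standing in for the real data (values are invented).
toy = pd.DataFrame({
    "Date": ["04-Jan-22", "01-Aug-22", "04-Jan-22", None],
    "Amount": ["$5,320", "$7,896", "$5,320", "$4,501"],
    "Boxes Shipped": [180, 94, 180, 91],
})

report = quality_report(toy)
# Count rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(report)
print("Duplicate rows:", n_duplicates)
```

Calling `quality_report(df)` on the loaded dataset would give the same kind of summary. A text dtype on a numeric column such as `Amount` is the signal that a type fix is needed.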

### Task 2: Missing Value Strategy
We use **Median Imputation** because the data contains outliers, and the median is more robust than the mean.

In [None]:
df_missing = df.copy()
df_missing.loc[0:5, 'Amount'] = np.nan  # Artificial missing values

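# Assign the result back instead of using inplace=True on a column selection,
# which triggers chained-assignment warnings in recent pandas versions.
df_missing['Amount'] = df_missing['Amount'].fillna(df_missing['Amount'].median())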
print("Missing values handled using Median.")
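
To illustrate why the median is preferred here, the following sketch compares the two fill values on a small invented series with one extreme value. The numbers are assumptions chosen for easy arithmetic, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Four typical amounts and one extreme value (invented for illustration).
amounts = pd.Series([100.0, 110.0, 120.0, 130.0, 10_000.0])

mean_fill = amounts.mean()      # pulled upward by the extreme value
median_fill = amounts.median()  # stays at the middle observation

# Append a missing entry and impute it with the median of the observed values.
with_gap = pd.concat([amounts, pd.Series([np.nan])], ignore_index=True)
filled = with_gap.fillna(with_gap.median())

print("Mean fill value:  ", mean_fill)
print("Median fill value:", median_fill)
print("Imputed entry:    ", filled.iloc[-1])
```

The mean lands far above every typical value, while the median stays inside the bulk of the data, so the imputed entry looks like an ordinary observation.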

### Task 3: Outlier Handling (IQR)
Values of `Amount` below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers and removed.

In [None]:
Q1 = df['Amount'].quantile(0.25)
Q3 = df['Amount'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df_no_outliers = df[(df['Amount'] >= lower) & (df['Amount'] <= upper)]

plt.figure(figsize=(6,4))
sns.boxplot(x=df['Amount'])
plt.title("Amount Outliers Boxplot")
plt.show()
print(f"Original shape: {df.shape}, Shape after outlier removal: {df_no_outliers.shape}")
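
The IQR rule above can be wrapped in a small reusable function. This is a sketch on invented values; the helper name `iqr_bounds` is hypothetical. It also shows clipping as an alternative to dropping rows, which is not what this lab does but keeps the row count unchanged.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented sample: eight evenly spaced values plus one extreme value.
toy_amounts = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 200.0])

low, high = iqr_bounds(toy_amounts)
is_outlier = (toy_amounts < low) | (toy_amounts > high)

# Option A (used in this lab): drop rows outside the fences.
kept = toy_amounts[~is_outlier]
# Option B (alternative): cap values at the fences instead of dropping them.
clipped = toy_amounts.clip(lower=low, upper=high)

print("Fences:", low, high)
print("Outliers found:", int(is_outlier.sum()))
print("Rows kept:", len(kept), "| rows after clipping:", len(clipped))
```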

### Task 4: Normalization & Standardization

In [None]:
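# Min-Max: rescale each column to the range [0, 1]
min_max = MinMaxScaler()
df_minmax = min_max.fit_transform(df[['Amount', 'Boxes Shipped']])

# Z-Score: rescale each column to mean 0 and standard deviation 1
std_scaler = StandardScaler()
df_std = std_scaler.fit_transform(df[['Amount', 'Boxes Shipped']])
# Note: fit_transform returns NumPy arrays, not DataFrames.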
print("Normalization and Standardization applied.")
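
As a check on what the two scalers compute, this sketch applies the formulas by hand to a small invented array and compares the results with scikit-learn. Min-Max uses `(x - min) / (max - min)`, and Z-score uses `(x - mean) / std` with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Invented two-column array standing in for ['Amount', 'Boxes Shipped'].
X = np.array([[100.0, 1.0], [200.0, 3.0], [300.0, 5.0], [400.0, 7.0]])

# Manual Min-Max scaling, column by column.
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Manual Z-score scaling; np.std defaults to ddof=0 (population std).
manual_z = (X - X.mean(axis=0)) / X.std(axis=0)

sk_minmax = MinMaxScaler().fit_transform(X)
sk_z = StandardScaler().fit_transform(X)

print("Min-Max matches sklearn:", np.allclose(manual_minmax, sk_minmax))
print("Z-score matches sklearn:", np.allclose(manual_z, sk_z))
```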

### Task 5: PCA Application
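Project the two standardized features onto their principal components and interpret the explained variance. Because only two features go in and two components are kept, this step rotates the data rather than reducing its dimensionality, and the explained variance ratios sum to 1.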

In [None]:
pca = PCA(n_components=2)
pcs = pca.fit_transform(df_std)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)

plt.figure(figsize=(6,4))
plt.scatter(pcs[:,0], pcs[:,1], alpha=0.5)
plt.title("PCA Projection")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
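
One way to interpret the explained variance ratio for two standardized features: the share captured by PC1 equals `(1 + |r|) / 2`, where `r` is the Pearson correlation between the two features. The sketch below checks this on synthetic data; the generated values and the seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic pair of correlated features (not the chocolate sales data).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)
X_std = StandardScaler().fit_transform(np.column_stack([x, y]))

pca_demo = PCA(n_components=2).fit(X_std)
ratios = pca_demo.explained_variance_ratio_

# Pearson correlation between the two standardized features.
r = np.corrcoef(X_std[:, 0], X_std[:, 1])[0, 1]
expected_pc1 = (1 + abs(r)) / 2

print("Explained variance ratio:", ratios)
print("Cumulative:", np.cumsum(ratios))
print("(1 + |r|) / 2:", expected_pc1)
```

A PC1 share near 0.5 means the two features are close to uncorrelated and neither component can be dropped without losing about half the variance. A share near 1.0 means the features are nearly redundant and one component would summarize both.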