
# IT326 – Phase 2: Data Summarization & Preprocessing

This notebook follows the Phase 2 requirements:
- Data Analysis (summaries, distributions, missing values, class distribution, outliers & boxplots, at least 3 plots)
- Data Preprocessing (apply at least three techniques other than removal/splitting: normalization, discretization, noise removal)

- **Dataset path:** `/mnt/data/Dataset/Raw_dataset.csv`
- **Preprocessed output:** `/mnt/data/Dataset/Preprocessed_dataset.csv`

## Contents
1. [Setup](#setup)
2. [Data Overview](#overview)
3. [Data Analysis](#analysis)
   - 3.1 Missing Values
   - 3.2 Statistical Summary (Five-number summary)
   - 3.3 Distributions (Histograms)
   - 3.4 Boxplots & Outliers
   - 3.5 Class Label Distribution
   - 3.6 Example Scatter Plot
4. [Data Preprocessing](#preprocessing)
   - 4.1 Outlier Treatment (IQR Capping)
   - 4.2 Normalization (Min-Max)
   - 4.3 Discretization (GPA → Low/Medium/High)
   - 4.4 (Optional) Feature Selection note
   - 4.5 Save Preprocessed Dataset
5. [Before vs After Snapshot](#snapshot)
6. [Notes & Justification](#notes)

## 1. Setup  <a id='setup'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

RAW_PATH = r"/mnt/data/Dataset/Raw_dataset.csv"
PREPROC_PATH = r"/mnt/data/Dataset/Preprocessed_dataset.csv"

df = pd.read_csv(RAW_PATH)
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/Dataset/Raw_dataset.csv'
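
The `FileNotFoundError` above indicates that `RAW_PATH` does not exist in the runtime where the cell was executed. One possible way to make the load step more forgiving is sketched below, assuming the CSV may sit in one of several locations. The helper name `first_existing` and every candidate path other than the notebook's own `RAW_PATH` are hypothetical and should be adapted to the actual environment.

```python
from pathlib import Path

def first_existing(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for p in candidates:
        if Path(p).is_file():
            return Path(p)
    return None

# Hypothetical search order; only the first entry comes from this notebook.
CANDIDATES = [
    "/mnt/data/Dataset/Raw_dataset.csv",
    "Dataset/Raw_dataset.csv",
    "Raw_dataset.csv",
]

raw_path = first_existing(CANDIDATES)
if raw_path is None:
    print("Raw_dataset.csv not found; upload it or update RAW_PATH.")
else:
    print("Loading from:", raw_path)
```

If a path is found, it can be passed to `pd.read_csv` in place of `RAW_PATH`.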

## 2. Data Overview  <a id='overview'></a>
- Print info, dtypes, shape, and a small sample.
- Confirm the class attribute and its value counts.

In [None]:
print('\nDataset Info:')
df.info()
print('\nDtypes:')
print(df.dtypes)
print('\nShape:', df.shape)
display(df.head())

label_col = 'GradeClass'
print('\nClass counts:')
print(df[label_col].value_counts().sort_index())

## 3. Data Analysis  <a id='analysis'></a>

In [None]:
# 3.1 Missing Values
missing_counts = df.isna().sum().sort_values(ascending=False)
display(missing_counts.to_frame('Missing'))
print('Any missing? ', df.isna().any().any())

In [None]:
# 3.2 Statistical Summary
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
desc = df[numeric_cols].describe(percentiles=[0.25,0.5,0.75]).T
display(desc[['min','25%','50%','75%','max','mean','std']])

In [None]:
# 3.3 Distributions (Histograms) – choose a few key numeric columns
cols_to_plot = [c for c in ['GPA','Absences','StudyTimeWeekly'] if c in df.columns]
for c in cols_to_plot:
    plt.figure()
    df[c].plot(kind='hist', bins=30, title=f'Histogram of {c}')
    plt.xlabel(c)
    plt.ylabel('Frequency')
    plt.show()

In [None]:
# 3.4 Boxplots & Outliers (IQR logic preview)
box_cols = [c for c in ['GPA','Absences','StudyTimeWeekly'] if c in df.columns]
for c in box_cols:
    plt.figure()
    df[c].plot(kind='box', title=f'Boxplot of {c}')
    plt.ylabel(c)
    plt.show()

def iqr_bounds(s, k=1.5):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k*iqr, q3 + k*iqr

for c in box_cols:
    lo, hi = iqr_bounds(df[c].astype(float))
    outliers = ((df[c] < lo) | (df[c] > hi)).sum()
    print(f'{c}: lower={lo:.3f}, upper={hi:.3f}, outliers={outliers}')
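
To make the IQR rule used above concrete, here is a small worked example on a toy series (the values are illustrative and are not taken from the dataset). It reuses the same `iqr_bounds` logic and counts the points that fall outside the fences.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data with one obvious extreme value (100).
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

lo, hi = iqr_bounds(toy)
n_out = int(((toy < lo) | (toy > hi)).sum())

# Q1 = 3.25, Q3 = 7.75, IQR = 4.5, so the fences are -3.5 and 14.5.
print(f"lower={lo}, upper={hi}, outliers={n_out}")
```

Only the value 100 lies beyond the upper fence, so it is the single point flagged as an outlier.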

In [None]:
# 3.5 Class Label Distribution (Bar Plot)
counts = df['GradeClass'].value_counts().sort_index()
plt.figure()
counts.plot(kind='bar', title='Class Label Distribution (GradeClass)')
plt.xlabel('GradeClass')
plt.ylabel('Count')
plt.show()


In [None]:
# 3.6 Example Scatter Plot: GPA vs StudyTimeWeekly
if set(['GPA','StudyTimeWeekly']).issubset(df.columns):
    plt.figure()
    plt.scatter(df['StudyTimeWeekly'], df['GPA'])
    plt.xlabel('StudyTimeWeekly')
    plt.ylabel('GPA')
    plt.title('Scatter: GPA vs StudyTimeWeekly')
    plt.show()

## 4. Data Preprocessing  <a id='preprocessing'></a>
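We apply three techniques: **Outlier Treatment (IQR capping)**, **Normalization (Min-Max)**, and **Discretization (GPA → bins)**. Each step writes its result to new columns (`*_capped`, `*_minmax`, `GPA_Bin`), so the raw columns stay unchanged.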

In [None]:
# 4.1 Outlier Treatment (IQR Capping)
df_prep = df.copy()
cont_cols = [c for c in ['Age','StudyTimeWeekly','Absences','GPA'] if c in df_prep.columns]
def iqr_cap(series, k=1.5):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower=lower, upper=upper)
for c in cont_cols:
    df_prep[f'{c}_capped'] = iqr_cap(df_prep[c].astype(float))
df_prep.head()

In [None]:
# 4.2 Normalization (Min-Max) on capped columns
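def minmax_scale(s):
    """Scale a numeric Series to [0, 1]; a constant Series maps to all zeros."""
    mn, mx = s.min(), s.max()
    if mx == mn:
        # Avoid division by zero when every value is identical
        return pd.Series(0.0, index=s.index)
    return (s - mn) / (mx - mn)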
for c in [f'{x}_capped' for x in cont_cols]:
    df_prep[f'{c.replace("_capped", "_minmax")}'] = minmax_scale(df_prep[c].astype(float))
df_prep[[c for c in df_prep.columns if c.endswith('_minmax')]].head()
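
The order of steps 4.1 and 4.2 matters: Min-Max scaling depends on the observed minimum and maximum, so a single extreme value can squeeze all other points toward 0. The sketch below illustrates this on toy data (illustrative values, not from the dataset) by scaling the same series with and without IQR capping.

```python
import pandas as pd

def iqr_cap(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def minmax_scale(s):
    """Scale to [0, 1]; a constant series maps to all zeros."""
    mn, mx = s.min(), s.max()
    if mx == mn:
        return pd.Series(0.0, index=s.index)
    return (s - mn) / (mx - mn)

toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

scaled_raw = minmax_scale(toy)               # extreme value dominates the range
scaled_capped = minmax_scale(iqr_cap(toy))   # 100 is clipped to 14.5 first

comparison = pd.DataFrame({
    "value": toy,
    "minmax_raw": scaled_raw.round(3),
    "minmax_after_cap": scaled_capped.round(3),
})
print(comparison)
```

Without capping, the value 9 scales to roughly 0.08; after capping it scales to roughly 0.59, so the bulk of the data uses more of the [0, 1] range.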

In [None]:
# 4.3 Discretization (GPA → Low/Medium/High) using quantiles
if 'GPA' in df_prep.columns:
    q = df_prep['GPA'].quantile([0.33, 0.66]).values
    bins = [-np.inf, q[0], q[1], np.inf]
    labels = ['Low','Medium','High']
    df_prep['GPA_Bin'] = pd.cut(df_prep['GPA'], bins=bins, labels=labels)
df_prep['GPA_Bin'].value_counts(dropna=False) if 'GPA_Bin' in df_prep.columns else 'GPA not found'
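
As a small illustration of the quantile-based binning above, the sketch below applies the same `pd.cut` logic to a toy series (illustrative values only). It also shows `pd.qcut` as a possible shorter alternative; that is a suggestion, not what the notebook uses, and the two can differ slightly on real data because `pd.qcut` with `q=3` cuts at exact thirds rather than at 0.33 and 0.66.

```python
import numpy as np
import pandas as pd

toy_gpa = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 3.8, 4.0])
labels = ["Low", "Medium", "High"]

# Same approach as the notebook: cut points at the 33rd and 66th percentiles.
q = toy_gpa.quantile([0.33, 0.66]).values
bins = [-np.inf, q[0], q[1], np.inf]
cut_bins = pd.cut(toy_gpa, bins=bins, labels=labels)

# Possible alternative: let pandas compute tercile edges directly.
qcut_bins = pd.qcut(toy_gpa, q=3, labels=labels)

print(cut_bins.value_counts().sort_index())
print("Same assignment on this toy data:", list(cut_bins) == list(qcut_bins))
```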

### 4.4 (Optional) Feature Selection note
- For **Decision Trees** (Phase 3), numerical scaling is not required but harmless.
- We do **not** drop features here to keep Phase 3 flexible.
- We will exclude obvious identifiers (e.g., `StudentID`) during model training.
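
A minimal sketch of how identifiers could be excluded at training time, assuming `StudentID` is the only identifier column and `GradeClass` is the label. The toy frame and the names `ID_COLS`, `X`, and `y` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the preprocessed dataset (illustrative values).
toy = pd.DataFrame({
    "StudentID": [1001, 1002, 1003],
    "GPA": [2.1, 3.4, 1.2],
    "Absences": [5, 0, 12],
    "GradeClass": [2, 1, 4],
})

ID_COLS = ["StudentID"]      # assumed identifier columns
LABEL_COL = "GradeClass"     # class label used throughout the notebook

# errors="ignore" keeps this safe if an identifier column is absent.
X = toy.drop(columns=ID_COLS + [LABEL_COL], errors="ignore")
y = toy[LABEL_COL]

print("Features:", list(X.columns))
print("Label:", y.name)
```

Dropping at training time rather than in preprocessing keeps the saved CSV complete, which matches the decision not to remove features in this phase.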

In [None]:
# 4.5 Save Preprocessed Dataset
import os
os.makedirs(os.path.dirname(PREPROC_PATH), exist_ok=True)
df_prep.to_csv(PREPROC_PATH, index=False)
print('Saved:', PREPROC_PATH)
df_prep.head()

## 5. Before vs After Snapshot  <a id='snapshot'></a>
- The table below contrasts a few rows of raw vs preprocessed columns to show the effect of capping and scaling.

In [None]:
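# df_prep is a copy of df plus the derived columns, so it already holds the raw
# values next to their capped and scaled versions. Indexing df with the derived
# column names would raise a KeyError.
cols_show = []
for base in ['GPA','Absences','StudyTimeWeekly']:
    for suffix in ['', '_capped', '_minmax']:
        col = base + suffix
        if col in df_prep.columns:
            cols_show.append(col)
display(df_prep[cols_show].head())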

## 6. Notes & Justification  <a id='notes'></a>
- **Missing Values:** No missing values were observed in the provided snapshot of the dataset; Section 3.1 re-checks this programmatically.
- **Outliers:** IQR-based capping (winsorization-style clipping to `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`) limits the influence of extreme values in the continuous columns (`Age`, `StudyTimeWeekly`, `Absences`, `GPA`) without dropping any rows.
- **Normalization:** Min-Max scaling is applied to the capped continuous columns to bring them into [0, 1], which helps scale-sensitive algorithms.
- **Discretization:** `GPA` is discretized into **Low/Medium/High** at its 33rd and 66th percentiles to support interpretable analysis and optional modeling variants.
- **No removal of raw columns:** We keep raw features intact to comply with the requirement of preserving the original dataset and to allow Phase 3 flexibility.
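
The claims in this list can be checked mechanically. The sketch below shows one possible set of sanity checks on a toy frame; the helper `check_preprocessed` is hypothetical, and in the notebook it would be called as `check_preprocessed(df, df_prep)`.

```python
import pandas as pd

def check_preprocessed(raw, prep):
    """Return boolean sanity checks comparing the raw and preprocessed frames."""
    minmax_cols = [c for c in prep.columns if c.endswith("_minmax")]
    return {
        # No NaNs anywhere in the preprocessed frame
        "no_missing": not prep.isna().any().any(),
        # Every Min-Max column lies inside [0, 1]
        "minmax_in_unit_interval": all(
            prep[c].between(0.0, 1.0).all() for c in minmax_cols
        ),
        # Every raw column is still present with identical values
        "raw_columns_unchanged": all(
            c in prep.columns and prep[c].equals(raw[c]) for c in raw.columns
        ),
    }

# Toy stand-ins for df and df_prep (illustrative values).
raw = pd.DataFrame({"GPA": [1.0, 2.0, 4.0]})
prep = raw.copy()
gpa = prep["GPA"]
prep["GPA_minmax"] = (gpa - gpa.min()) / (gpa.max() - gpa.min())

print(check_preprocessed(raw, prep))
```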
