This notebook accompanies Assignment 1. Complete the assignment by following the instructions in each section; every section includes a text cell describing its requirements. For additional details, see Canvas.

Use this first cell to import the necessary libraries.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import t
import numpy as np

# 1. **Data Management**


In this part, you need to:

1.   Analyse and prepare the data. Use plots, graphs, and tables (such as histograms, box plots, and scatter plots) to explore the dataset and identify issues or potential improvements, including (but not limited to) unnecessary features/variables that can be dropped, standardization, and encoding.
2.   Split the data and define your experimental protocol (such as a hold-out split or k-fold cross-validation).
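
One possible way to approach the second requirement is sketched below: a stratified hold-out split followed by stratified k-fold cross-validation on the training portion. The `demo` DataFrame, its feature names, the 80/20 split, and the choice of five folds are all assumptions made for illustration; only the `Pass` target name comes from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical stand-in for the prepared data: 100 rows, 3 numeric features,
# and a binary 'Pass' target (assumed values, not the real grades dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
demo["Pass"] = (rng.random(100) < 0.6).astype(int)

X = demo.drop(columns=["Pass"])
y = demo["Pass"]

# Hold out 20% as the final test set; stratify to preserve the pass/fail ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5-fold stratified cross-validation, applied to the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X_train, y_train)]

print(f"Train rows: {len(X_train)}, test rows: {len(X_test)}")
print("Validation fold sizes:", fold_sizes)
```

Keeping the test set aside until the final evaluation means that model selection and tuning rely only on the cross-validation folds.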

In [None]:
# Load Data
def load_data(filepath):
    print("Loading data...")
    df = pd.read_csv(filepath)
    print(f"Data loaded with {df.shape[0]} rows and {df.shape[1]} columns.\n")
    return df

# Check for Missing Values
def check_missing(df):
    print("Checking for missing values...")
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print(missing[missing > 0])
    else:
        print("No missing values found.")
    print("")

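# Visualize Outliers
def show_outliers(df, ignore_cols):
    print("Visualizing outliers...")
    # Plot the DataFrame passed in (not the global) and skip the ignored columns
    numeric_cols = [col for col in df.select_dtypes(include=['number']).columns if col not in ignore_cols]
    plt.figure(figsize=(12,6))
    sns.boxplot(data=df[numeric_cols], color='#00ff99')
    plt.xticks(rotation=90)
    plt.title('Box Plots to Visualise Outliers')
    plt.show()
    print("")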

# Remove Outliers
def clean_outliers(df, ignore_cols):
    print("Removing outliers...")
    original_shape = df.shape[0]
    for col in df.select_dtypes(include=['number']).columns:
        if col not in ignore_cols:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            df = df[(df[col] >= lower) & (df[col] <= upper)]
    print(f"Outliers removed, new dataset size: {df.shape[0]} rows (removed {original_shape - df.shape[0]} rows).\n")
    return df.reset_index(drop=True)

# Encode Categories
def encode_categories(df, categorical_cols):
    print("Encoding categorical variables...")
    encoder = LabelEncoder()
    for col in categorical_cols:
        df[col] = encoder.fit_transform(df[col])
    print("")
    return df

# Create Pass/Fail Target
def make_pass_fail(df):
    df['Pass'] = (df['Grade'] >= 12).astype(int)
    return df.drop(columns=['Grade'])

# Correlation Heatmap
def show_correlation(df, target_col):
    print("Visualizing correlation heatmap...")
    corr_matrix = df.corr()
    plt.figure(figsize=(20, 16))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title("Correlation Heatmap")
    plt.show()
    print("")

# Remove Weak Features
def remove_weak_features(df, target_col, alpha=0.01):
    print("Removing weakly correlated features...")
    corr_series = df.corr()[target_col].drop(target_col)
    n = len(df)
    # Smallest |r| that is significant at level alpha (two-tailed t-test, n - 2 degrees of freedom)
    t_crit = t.ppf(1 - alpha / 2, n - 2)
    critical_value = t_crit / np.sqrt(n - 2 + t_crit ** 2)
    threshold = round(critical_value, 3)
    relevant_cols = [target_col] + list(corr_series[abs(corr_series) > threshold].index)
    excluded_cols = list(corr_series[abs(corr_series) <= threshold].index)
    print("Excluded Columns due to Weak Correlation:", excluded_cols)
    print(f"Remaining columns: {len(relevant_cols)}\n")
    return df[relevant_cols]

# Main Program
grades_data = load_data('grades.csv')
categorical_cols = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
ignore_outlier_cols = ['failures', 'Grade']
check_missing(grades_data)
show_outliers(grades_data, ignore_outlier_cols)
grades_data = clean_outliers(grades_data, ignore_outlier_cols)
grades_data = encode_categories(grades_data, categorical_cols)
grades_data = make_pass_fail(grades_data)
show_correlation(grades_data, 'Pass')
grades_data = remove_weak_features(grades_data, 'Pass')
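
The threshold used in `remove_weak_features` can be read as the critical Pearson correlation: it comes from inverting the test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) for r. The sketch below isolates that calculation in a hypothetical helper, `critical_r` (a name not used elsewhere in this notebook), and shows how the threshold shrinks as the sample size grows. The sample sizes are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import t

def critical_r(n, alpha=0.01):
    """Smallest |r| significant at level alpha (two-tailed) for n samples."""
    t_crit = t.ppf(1 - alpha / 2, n - 2)
    return t_crit / np.sqrt(n - 2 + t_crit ** 2)

# Larger samples make weaker correlations statistically significant.
for n in (30, 100, 400):
    print(f"n = {n}: critical |r| = {critical_r(n):.3f}")
```

A statistically significant correlation is not necessarily a useful one, so this filter is best treated as a first pass over the features.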


---

# 2. **Model Training**

Here, you need to:

1.   Select and compare at least three machine learning models (covered in the lectures) that are appropriate for your modelling task.
2.   If a selected algorithm has hyperparameters, define a hyperparameter search protocol (you can define your own) and tune them.
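
As a rough illustration of both requirements, the sketch below tunes three classifiers with `GridSearchCV` and compares their cross-validated F1 scores. The synthetic training data, the parameter grids, the F1 scoring choice, and the scaling pipelines for the SVM and k-NN models are assumptions for demonstration, not prescribed settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training split (assumed, not the grades data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 4))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each entry: (estimator, parameter grid). Distance-based models are scaled
# inside a pipeline so the scaler is fitted on the training folds only.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
    "svm": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
    "knn": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 5, 7]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="f1")
    search.fit(X_train, y_train)
    results[name] = search
    print(f"{name}: best F1 = {search.best_score_:.3f}, params = {search.best_params_}")

# Select the model with the highest mean cross-validated F1.
best_name = max(results, key=lambda k: results[k].best_score_)
print("Best model by cross-validated F1:", best_name)
```

Using the same `cv` object for every model keeps the comparison fair, since all candidates are scored on identical folds.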


In [None]:
# Write your proposed solution code here. Create more code cells if you find it necessary





---

# 3. **Evaluate models**

Here, you need to:

1.   Test the best model obtained in the previous stage on the testing dataset.
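
A minimal sketch of this evaluation step follows, assuming the best model turned out to be a random forest. The synthetic data and the model settings are placeholders chosen for illustration. The metrics match the ones imported in the first cell.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data (assumed, not the grades data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Placeholder for the tuned model selected during training.
best_model = RandomForestClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train, y_train)

# Predict once on the held-out test set and compute every metric from it.
y_pred = best_model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)
```

The confusion matrix could also be plotted with `ConfusionMatrixDisplay(confusion_matrix=cm).plot()`, which is already imported in the first cell.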


In [None]:
# Write your proposed solution code here. Create more code cells if you find it necessary



