<a href="https://colab.research.google.com/github/nithuuu13/ANN-StudentDropout/blob/main/01_Data_Conditioning_StudentDropout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Conditioning: Acquisition, Exploration, Cleaning, Encoding & Splitting

This section performs the Data Conditioning phase of the Machine Learning Pipeline.

Neural networks need:

*   clean and consistent input patterns
*   numerical inputs
*   scaled features for stable gradient descent
*   separate training, validation, and test sets

This section prepares the UCI Predict Students Dropout and Academic Success dataset so it can be used as valid input to a Multilayer Perceptron (MLP) trained via backpropagation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pickle

# Data Collection

The dataset is obtained from the UCI Machine Learning Repository, fulfilling the “real-world data” criterion. This is also a supervised multi-class classification problem.

It contains demographic, academic, and socioeconomic features used to predict whether a student will:

*   drop out (`Dropout`)
*   remain enrolled (`Enrolled`)
*   graduate (`Graduate`)



In [None]:
url = "https://archive.ics.uci.edu/static/public/697/data.csv"
df = pd.read_csv(url)

df.head()


# Exploratory Data Analysis (EDA)

Networks learn from input patterns, so before the data is used it is important to:


*   understand feature types
*   inspect distributions
*   check for noise and inconsistencies
*   analyse target class balance
*   analyse relationships between features


This ensures the MLP receives well-understood, meaningful inputs.
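
As a minimal sketch of the first checklist item, the snippet below separates numeric from non-numeric columns and counts distinct values per column. The toy frame and its column names are hypothetical stand-ins, not the UCI data, and a low distinct count on an integer column is only a hint that it may be a coded category.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "age": [19, 22, 35, 41],
    "course_code": [1, 2, 1, 2],          # integer-coded category
    "grade": [12.5, 14.0, 10.2, 13.1],
    "outcome": ["Dropout", "Graduate", "Graduate", "Enrolled"],
})

# Split columns by dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Distinct values per column; small counts on numeric columns may indicate codes
distinct_counts = toy.nunique()

print(numeric_cols)
print(text_cols)
print(distinct_counts)
```

Columns flagged this way still need a manual check against the dataset documentation before they are treated as categorical.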



In [None]:
df.info()

In [None]:
df.describe(include='all')

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x=df['Target'])
plt.title('Distribution of Student Outcomes')
plt.show()

df['Target'].value_counts(normalize=True)

# Feature Correlation Analysis

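Although the ANN does not require feature selection, correlation analysis shows which numeric inputs are linearly associated with the output. Because the target is encoded as integer codes for this step, the coefficients are a rough indicator rather than a precise measure. This strengthens the EDA component.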
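
The idea can be illustrated on a small example. This is a sketch on hypothetical toy data (the column names `units_passed`, `age`, and `label` are invented for illustration); it ranks features by the absolute value of the correlation, an optional variation that keeps strong negative correlations near the top.

```python
import pandas as pd

# Hypothetical toy data: two numeric features and a three-class label
toy = pd.DataFrame({
    "units_passed": [0, 1, 5, 6, 3, 4],
    "age": [30, 20, 25, 35, 40, 22],
    "label": ["Dropout", "Dropout", "Graduate", "Graduate", "Enrolled", "Enrolled"],
})

# Category codes are assigned alphabetically: Dropout=0, Enrolled=1, Graduate=2
codes = toy["label"].astype("category").cat.codes

# Pearson correlation of each feature with the encoded label
corr = toy[["units_passed", "age"]].corrwith(codes)

# Rank by magnitude so strong negative correlations are not pushed to the bottom
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

The alphabetical codes impose an arbitrary order on the classes, so the ranking is best read as exploratory.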

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns

# encode the target temporarily to compute correlation
# (category codes are assigned alphabetically: Dropout=0, Enrolled=1, Graduate=2)
target_encoded = df['Target'].astype('category').cat.codes

correlation = df[numeric_cols].corrwith(target_encoded)
correlation.sort_values(ascending=False)


# Data Cleaning

Neural networks learn best when the data they are given is clean and consistent. Problems like missing values, repeated entries, or categories that are not labelled the same way can confuse the model and slow down learning. When the data is well-prepared, the training process tends to be smoother and more reliable.

To keep things in good shape, we make sure to:
*   Remove rows with missing values
*   Get rid of duplicate rows
*   Standardise column names
*   Check that categories are used consistently
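
The last item, checking category consistency, can be sketched as follows. The label values here are hypothetical examples of the kind of inconsistency to look for; the real dataset may not contain any.

```python
import pandas as pd

# Hypothetical toy labels with stray whitespace and inconsistent casing
labels = pd.Series(["Graduate", " graduate", "Dropout", "dropout ", "Enrolled"])

# Number of distinct raw spellings
raw_count = labels.nunique()

# Normalise whitespace and casing, then recount
normalised = labels.str.strip().str.title()
clean_values = sorted(normalised.unique())

print(raw_count)
print(clean_values)
```

If the raw and normalised counts differ, some categories were spelled more than one way.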









In [None]:
# checking for missing values
df.isnull().sum()


In [None]:
df = df.dropna()
df = df.drop_duplicates()
df.columns = df.columns.str.strip()

df.shape

# Data Encoding

The MLP computes weighted sums of inputs:

∑ᵢ (wᵢ · xᵢ)

Categorical inputs therefore cannot be used directly, because a perceptron cannot multiply a weight by a string. One-hot encoding is used to turn them into numerical input patterns. It:
*   converts categories → binary vectors
*   avoids implying any ordinal relationship
*   produces clean numerical inputs for ANN training
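
A small sketch on a hypothetical toy column (`outcome` is an invented name) shows what the encoding produces, and how `drop_first=True` differs from a full one-hot encoding.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"outcome": ["Dropout", "Enrolled", "Graduate", "Dropout"]})

# Full one-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy, columns=["outcome"], dtype=int)

# drop_first=True drops the first category, leaving k-1 indicator columns
dummy = pd.get_dummies(toy, columns=["outcome"], drop_first=True, dtype=int)

print(list(one_hot.columns))
print(list(dummy.columns))
```

With `drop_first=True`, the dropped category is represented by a row of all zeros, which removes one redundant column per encoded feature.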


In [None]:
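# drop_first=True keeps k-1 indicator columns per categorical column
# (the first category becomes the all-zeros baseline)
df_encoded = pd.get_dummies(df, drop_first=True)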
df_encoded.head()

# Data Splitting
A typical dataset split ensures that the model has enough data to learn while still providing separate sets to monitor performance.
The training set should be the largest, the validation set is used to track overfitting during training, and the test set provides an unbiased measure of how well the model generalises.

In this case, we use:
*   70% training
*   15% validation
*   15% testing

The validation and test sets are the same size, so metrics on the two are estimated from comparable amounts of data.
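
The 70/15/15 split comes from two successive splits: holding out 30% and then halving that hold-out gives 0.30 × 0.50 = 0.15 of the full data for each of validation and testing. The sketch below checks the arithmetic on hypothetical toy arrays (`X_toy`, `y_toy` are invented for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 rows, two balanced classes
X_toy = np.arange(400).reshape(200, 2)
y_toy = np.array([0, 1] * 100)

# First split: 70% training, 30% held out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy
)

# Second split: the held-out 30% is halved into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=0, stratify=y_tmp
)

n = len(X_toy)
fractions = (len(X_tr) / n, len(X_va) / n, len(X_te) / n)
print(fractions)
```

Passing `stratify` in both calls keeps the class proportions approximately equal across the three sets.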

In [None]:
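# features: everything except the target indicator columns
# (errors='ignore' skips any that get_dummies did not create, e.g. a dropped first category)
X = df_encoded.drop(columns=['Target_Dropout', 'Target_Enrolled', 'Target_Graduate'], errors='ignore')
# labels: the original string classes, also used for stratification
y = df['Target']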

# first split: training (70%) + temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# second split: validation (15%) + testing (15%)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

X_train.shape, X_val.shape, X_test.shape


# Feature Scaling (Standardization)

This step is important for getting stable, reliable performance from an ANN.
Because neural networks rely on gradient descent, the scale of the input features matters: very large values can cause unstable updates, while very small values can slow learning down.

Standardizing rescales each feature so that, on the training set, it has:
* mean = 0
* standard deviation = 1

This typically leads to:
* faster convergence
* more stable backpropagation

The scaler is fitted only on the training data to prevent data leakage.
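
Standardization computes z = (x − μ) / σ for each feature, where μ and σ are the mean and standard deviation of that feature in the training set. The sketch below, on hypothetical toy arrays, checks that `StandardScaler` applies the training statistics to unseen data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy splits with a single feature
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics are learned from the training split only
toy_scaler = StandardScaler().fit(train)

# Manual z-score using the training mean and population standard deviation
manual = (test - train.mean(axis=0)) / train.std(axis=0)

scaled_test = toy_scaler.transform(test)
print(scaled_test)
print(manual)
```

Because the test value is scaled with training statistics, the scaled validation and test sets will generally not have a mean of exactly 0 or a standard deviation of exactly 1.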

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# save the fitted scaler so the model development stage can reuse it
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)


# Saving Processed Files

These files will be used by the model development stage handled by the other teammate.

We save:
* Cleaned + encoded dataset (`cleaned_dataset.csv`)
* Scaler object (`scaler.pkl`, for consistent prediction inputs)
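
A minimal sketch of the save-and-reload round trip for the scaler is shown below. It uses a stand-in scaler fitted on hypothetical values and a temporary directory, so it does not touch the real `scaler.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the fitted scaler from this notebook
fitted = StandardScaler().fit(np.array([[0.0], [10.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.pkl")

    # Context managers close the file handles even if an error occurs
    with open(path, "wb") as f:
        pickle.dump(fitted, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored scaler applies the same transformation as the original
result = restored.transform(np.array([[5.0]]))
print(result)
```

Pickled scikit-learn objects are best loaded with the same scikit-learn version that saved them, and only from trusted sources.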

In [None]:
df_encoded.to_csv("cleaned_dataset.csv", index=False)