
Project Name: **Predict Student Performance**

Project by: **Nischitha K N**

GitHub link: https://github.com/nischithakn800-ux/import-export-dataset/blob/main/student_performance.ipynb


**1. Objective**

To develop a machine learning model using an Artificial Neural Network (ANN) that accurately classifies student performance into High, Medium, or Low categories based on academic, behavioral, and socio-demographic features. The aim is to identify key factors influencing performance and enable data-driven insights for personalized academic support and intervention strategies.

**2. Dataset**

The dataset used in this project is sourced from Kaggle and is designed to support machine learning models that predict student performance from academic, behavioral, and socio-demographic factors. It is structured as a CSV file in which each row represents one student and each column captures one attribute. As the preview in Section 3 shows, the named columns are study_hours, attendance, previous_grades, family_support, internet_usage_hours, health_status, extracurricular, parent_education, gender, and age, followed by 20 numeric columns labeled feature_1 through feature_20. The target variable, "Performance," takes three integer codes (0, 1, 2) that represent the High, Medium, and Low categories, which makes this a multi-class classification task. The categorical attributes already appear as integer codes in the preview, the numeric columns sit on very different scales and therefore need scaling before training, and the missing-value check reports no missing values in any column. Overall, the dataset offers a foundation for building predictive models that can help identify students who may benefit from additional academic support.

**Import Libraries**

In [None]:
# Data handling and preprocessing
import pandas as pd

# Numerical operations
import numpy as np

# Encoding, splitting, evaluation
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Building and training the ANN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

# Visualization (optional)
import matplotlib.pyplot as plt
import seaborn as sns


This project uses Pandas and NumPy for data handling and numerical operations, Scikit-learn for encoding, splitting, scaling, and evaluation, and Keras (TensorFlow) to build and train the ANN model. Matplotlib and Seaborn are optionally used for visualizing results like training curves and confusion matrices.

**3. Preprocessing**

In [None]:
df.head()  # Shows the first few rows

Unnamed: 0,study_hours,attendance,previous_grades,family_support,internet_usage_hours,health_status,extracurricular,parent_education,gender,age,...,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,feature_20,Performance
0,38,52,97,1,0,2,0,1,0,18,...,27.025083,-11.650198,10.871542,9.170781,7.055236,0.913864,-6.076116,-7.677381,-5.269146,1
1,28,50,99,0,6,2,0,3,1,17,...,5.907588,2.318134,-7.754289,15.113327,2.693625,13.611272,5.686748,-11.823287,-4.111926,2
2,14,52,52,0,8,0,1,2,0,19,...,-17.603919,-23.02404,2.019941,-15.902297,-15.072195,-0.730734,16.492616,-7.974912,18.853401,0
3,7,86,75,1,7,0,1,0,0,15,...,-16.442422,10.126945,7.430617,-4.658069,12.124363,7.919797,10.92898,-18.318071,-12.83432,0
4,20,56,83,0,6,0,0,2,1,18,...,-2.884741,2.483495,15.738393,8.884712,-6.336795,0.692728,16.53428,-2.14883,-12.869943,2


df.head() is a Pandas DataFrame method that displays the first five rows by default. It gives a quick preview of the dataset's structure, column names, and sample values, which helps confirm that the file loaded as expected.

In [None]:
print("Missing values after cleaning:")
print(df.isnull().sum())
print("Encoded Data Sample:")
print(df.head())
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("Scaled training data (first row):")
print(X_train[0])


Missing values after cleaning:
study_hours             0
attendance              0
previous_grades         0
family_support          0
internet_usage_hours    0
health_status           0
extracurricular         0
parent_education        0
gender                  0
age                     0
feature_1               0
feature_2               0
feature_3               0
feature_4               0
feature_5               0
feature_6               0
feature_7               0
feature_8               0
feature_9               0
feature_10              0
feature_11              0
feature_12              0
feature_13              0
feature_14              0
feature_15              0
feature_16              0
feature_17              0
feature_18              0
feature_19              0
feature_20              0
Performance             0
dtype: int64
Encoded Data Sample:
   study_hours  attendance  previous_grades  family_support  \
0           38          52               97               1   
1  

These print statements act as checks on each preprocessing step. They show the number of missing values per column after cleaning, a sample of the encoded data, the shapes of the training and test sets, and the first row of the scaled training data. In the output above, every column reports zero missing values.
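The scaling step can be illustrated with a minimal sketch. It assumes the features were standardized the way StandardScaler does it (subtract the mean, divide by the population standard deviation), with the statistics taken from the training rows only. The study_hours values come from the preview above; the held-out value 25 and the function names are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Return (mean, std) computed from training values only."""
    return fmean(values), pstdev(values)  # pstdev = population std (ddof=0)

def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

# study_hours from the five preview rows, treated here as a tiny training set
train_hours = [38, 28, 14, 7, 20]
mean, std = fit_scaler(train_hours)
scaled_train = transform(train_hours, mean, std)

# Hypothetical held-out value, scaled with the *training* statistics
scaled_test = transform([25], mean, std)

print(f"mean={mean:.2f}, std={std:.2f}")
print([round(z, 3) for z in scaled_train])
print([round(z, 3) for z in scaled_test])
```

Reusing the training mean and standard deviation for the test rows keeps information about the test set out of the preprocessing step.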

**4. Model Architecture**

In [None]:
model.summary()


model.summary() is a Keras function that displays a detailed summary of your neural network architecture. It shows the number of layers, the type of each layer (like Dense or Dropout), the output shape of each layer, and the total number of trainable parameters. This helps you quickly understand the structure and complexity of your model before training.
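As a rough illustration of where the parameter counts in such a summary come from, the sketch below computes them by hand. A Dense layer has one weight per input-unit pair plus one bias per unit, and a Dropout layer adds none. The 30 input features match the column list in the preprocessing output; the layer widths (64, 32, 3) are hypothetical, since the model definition is not shown in this notebook.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights + biases."""
    return n_inputs * n_units + n_units

n_features = 30             # 10 named columns + feature_1..feature_20
layer_units = [64, 32, 3]   # hypothetical widths; last layer = 3 classes

total = 0
n_inputs = n_features
for units in layer_units:
    count = dense_params(n_inputs, units)
    print(f"Dense({units}): {count} parameters")
    total += count
    n_inputs = units        # this layer's outputs feed the next layer

print("Total trainable parameters:", total)
```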

**5. Training**

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)


Epoch 1/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.5090 - loss: 0.9773 - val_accuracy: 0.3778 - val_loss: 1.1182
Epoch 2/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.5127 - loss: 0.9745 - val_accuracy: 0.3672 - val_loss: 1.1225
Epoch 3/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.5060 - loss: 0.9846 - val_accuracy: 0.3772 - val_loss: 1.1177
Epoch 4/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.5048 - loss: 0.9782 - val_accuracy: 0.3694 - val_loss: 1.1169
Epoch 5/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.5220 - loss: 0.9712 - val_accuracy: 0.3706 - val_loss: 1.1184
Epoch 6/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.5156 - loss: 0.9769 - val_accuracy: 0.3722 - val_loss: 1.1153
Epoch 7/10
400/400

The model.compile() line prepares the neural network for training. It sets the optimizer to 'adam' (an adaptive learning-rate variant of stochastic gradient descent), the loss function to 'categorical_crossentropy' (the standard loss for multi-class classification with one-hot encoded labels), and 'accuracy' as the metric to track during training.
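To make the loss concrete, here is a minimal sketch of categorical cross-entropy for a single one-hot label, written in plain Python rather than with Keras. It also computes the loss of a model that always predicts a uniform distribution over three classes, ln 3 ≈ 1.0986, which is a useful reference point when reading the logged loss values. The function name and the example probabilities are illustrative.

```python
import math

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    # Clip to avoid log(0); only the true class contributes for one-hot labels
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

label = [0, 1, 0]  # one-hot encoding of class 1

confident = categorical_crossentropy(label, [0.1, 0.8, 0.1])
uniform = categorical_crossentropy(label, [1 / 3, 1 / 3, 1 / 3])

print(f"confident and correct: {confident:.4f}")
print(f"uniform guess:         {uniform:.4f}")  # equals ln(3)
```

The validation loss of roughly 1.12 in the log above sits slightly above this uniform-guess value.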

The model.fit() line starts the training process. It trains the model using the training data (X_train, y_train) for 10 epochs with a batch size of 32, and reserves 20% of the training data for validation. The training progress is stored in the history object, which can be used to plot accuracy and loss curves later.
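The 400/400 counter in the training log follows from the batch size and the validation split. The sketch below reproduces it under the assumption that the training set has 16,000 rows; that figure is inferred from the log and from the 4,000-row test set, not stated in the notebook.

```python
import math

n_train = 16_000          # assumed training-set size (inferred, not stated)
validation_percent = 20   # corresponds to validation_split=0.2
batch_size = 32

n_val = n_train * validation_percent // 100   # rows held out for validation
n_fit = n_train - n_val                       # rows used for weight updates
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)
```

The same arithmetic explains the 125/125 counter during evaluation: 4,000 test rows at the Keras default batch size of 32.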

**6. Evaluation**

In [None]:
# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.2f}')

# Predictions
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true_classes = np.argmax(y_test.values, axis=1)

# Metrics
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_true_classes, y_pred_classes))
print(classification_report(y_true_classes, y_pred_classes))


125/125 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3840 - loss: 1.1039
Test Accuracy: 0.38
125/125 ━━━━━━━━━━━━━━━━━━━━ 0s 879us/step
[[585 369 409]
 [457 371 442]
 [458 335 574]]
              precision    recall  f1-score   support

           0       0.39      0.43      0.41      1363
           1       0.35      0.29      0.32      1270
           2       0.40      0.42      0.41      1367

    accuracy                           0.38      4000
   macro avg       0.38      0.38      0.38      4000
weighted avg       0.38      0.38      0.38      4000



The line model.evaluate(X_test, y_test) calculates the loss and accuracy of the trained model on the test data, which indicates how well it generalizes. Here the test accuracy is 0.38, only slightly above the roughly 0.33 expected from random guessing among three nearly balanced classes.

Next, model.predict(X_test) generates predicted probabilities for each class. np.argmax(y_pred, axis=1) converts these probabilities into actual class labels by selecting the index of the highest value. Similarly, np.argmax(y_test.values, axis=1) extracts the true class labels from the one-hot encoded test labels.
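The conversion from probabilities and one-hot labels to class indices can be sketched without NumPy. The helper below mimics what np.argmax(..., axis=1) does row by row; the probability values and labels are made up for illustration.

```python
def argmax(row):
    """Index of the largest value (first one on ties), like np.argmax on a 1-D array."""
    return max(range(len(row)), key=row.__getitem__)

# Illustrative softmax outputs for three students (each row sums to 1)
y_pred = [
    [0.50, 0.30, 0.20],
    [0.10, 0.25, 0.65],
    [0.34, 0.33, 0.33],
]
# Matching one-hot encoded true labels
y_true = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
]

y_pred_classes = [argmax(row) for row in y_pred]
y_true_classes = [argmax(row) for row in y_true]

print("predicted:", y_pred_classes)
print("true:     ", y_true_classes)
```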

Finally, using confusion_matrix and classification_report from Scikit-learn, you evaluate the model’s performance in more detail. The confusion matrix shows how many predictions were correct or incorrect for each class, while the classification report provides precision, recall, F1-score, and support for each class—giving a complete picture of how well your model performs across all categories.
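The numbers in the classification report can be recomputed directly from the confusion matrix, which is a useful cross-check. The sketch below uses the matrix printed above and assumes the Scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix from the evaluation output: rows = true, columns = predicted
cm = [
    [585, 369, 409],
    [457, 371, 442],
    [458, 335, 574],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / total
print(f"accuracy: {accuracy:.4f}")

report = []
for k in range(n_classes):
    true_pos = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])                                   # row sum = support
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report.append({"precision": precision, "recall": recall,
                   "f1": f1, "support": actual_k})
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={actual_k}")
```

Rounded to two decimals, these values agree with the report above.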

**7. Extensions**

**Model Comparison**

Try other algorithms like Random Forest, SVM, or XGBoost to compare performance with ANN.

**Hyperparameter Tuning**

Use GridSearchCV or Keras Tuner to optimize layer sizes, dropout rates, and learning rates.

**Feature Engineering**

Create new features such as average score or engagement index.

Analyze feature importance using SHAP or permutation importance.

**Interactive Dashboard**

Build a user-friendly interface using Streamlit or Gradio to explore predictions dynamically.

**Bilingual Presentation**

Translate insights and results into Kannada for regional audiences or academic outreach.

**Simulated Interventions**

Predict how changes in study time, attendance, or parental support could improve performance.

**Recommendation System**

Suggest personalized study plans or resources based on predicted performance class.

**Explainability & Ethics**

Add tools to interpret model decisions and check for bias across gender or socioeconomic groups.
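For the Model Comparison extension listed above, it can help to fix a trivial baseline before trying other algorithms. The sketch below computes the accuracy of always predicting the most frequent class, using the per-class support from the classification report in Section 6; the variable names are illustrative.

```python
# Per-class support on the test set, taken from the classification report
support = {0: 1363, 1: 1270, 2: 1367}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

ann_accuracy = 0.38  # test accuracy reported for the ANN

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"ANN margin over baseline: {ann_accuracy - baseline_accuracy:.4f}")
```

Any candidate model, including the ANN, should clear this baseline by a meaningful margin before its predictions are used for interventions.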

**8. Tools & Frameworks**

**Pandas**

For loading the dataset, handling missing values, and preprocessing features.

**NumPy**

For numerical operations and array manipulation during model preparation.

**Scikit-learn (Sklearn)**

For encoding categorical variables, splitting the dataset, scaling features, and evaluating model performance using metrics like accuracy, precision, recall, F1 score, and confusion matrix.

**Keras (via TensorFlow)**

For building, compiling, and training the Artificial Neural Network (ANN) using the Sequential API.

**Matplotlib & Seaborn (Optional)**

For visualizing training history, feature distributions, and evaluation results like confusion matrices.

**Conclusion**

This project demonstrates an end-to-end workflow for classifying student performance into High, Medium, and Low categories with an Artificial Neural Network (ANN), covering preprocessing, feature encoding, model training, and evaluation. The results show that the current model is not yet reliable: test accuracy is 0.38, only slightly above the roughly 0.33 expected from random guessing among three nearly balanced classes, and precision, recall, and F1 score fall between 0.29 and 0.43 for every class. During training, validation accuracy stayed near 0.37 while training accuracy was around 0.51, so the model does not generalize well in its current form. The evaluation metrics (accuracy, precision, recall, F1 score, and confusion matrix) are what make this shortfall visible. With further work such as model comparison, hyperparameter tuning, feature importance analysis, and interactive dashboards, this project could evolve into a practical tool that helps educators and institutions provide personalized academic support, targeted interventions, and data-driven decision-making.