
# 📊 Data Analysis for Student Performance Prediction

## 🔍 Overview
This Jupyter notebook performs an **exploratory data analysis (EDA)** of the *Students Performance Dataset*.
It examines how student performance relates to factors such as study habits, parental involvement, and demographic attributes.

The analysis lays the groundwork for two models:
1. **User Story 1: Supervised Learning Model** – Predicting a student’s likelihood of passing based on study habits and past scores.
2. **User Story 2: Clustering Model** – Grouping students based on learning styles for personalized teaching strategies.

---



## 📦 Step 1: Importing Required Libraries
We begin by importing essential Python libraries:
- **pandas** and **numpy** for data manipulation.
- **matplotlib** and **seaborn** for visualization.


In [None]:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



## 📂 Step 2: Loading the Dataset
The dataset is read into a pandas DataFrame for analysis.


In [None]:

# Load the dataset
file_path = "Student_performance_data _.csv"  # Ensure this file is in the same directory as the notebook
df = pd.read_csv(file_path)

# Display first few rows
print("First 5 rows of the dataset:")
display(df.head())



## 📝 Step 3: Understanding the Data
Let's explore the dataset structure, including:
- Column names and data types.
- Presence of missing values.
- Basic statistical summary.


In [None]:

# Basic information about the dataset
print("\nDataset Info:")
df.info()

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Summary statistics
print("\nSummary Statistics:")
display(df.describe())
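
A per-column percentage of missing values is often easier to read than raw counts. The following is a minimal sketch on a toy DataFrame; the column names (`Age`, `GPA`, `Gender`) and values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; column names and values are hypothetical
toy = pd.DataFrame({
    "Age": [15, 16, np.nan, 17],
    "GPA": [3.1, np.nan, np.nan, 2.4],
    "Gender": ["F", "M", "M", None],
})

# Count and percentage of missing entries per column, worst column first
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": toy.isna().mean() * 100,
}).sort_values("percent", ascending=False)

print(missing)
```

Replacing `toy` with `df` should give the same summary for the real data.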



## 🎭 Step 4: Exploring Categorical Features
We list the unique values in each categorical column to see which categories are present and to spot inconsistent labels.


In [None]:

# Check unique values in categorical columns
print("\nUnique values in categorical columns:")
for col in df.select_dtypes(include=['object']).columns:
    print(f"{col}: {df[col].unique()}")
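
Unique values show which categories exist but not how often each occurs. As a hedged sketch, `value_counts(normalize=True)` gives the share of rows per category. The toy columns below (`ParentalSupport`, `Tutoring`) are hypothetical, and `category_shares` is a helper introduced here for illustration only.

```python
import pandas as pd

# Toy stand-in for df; column names and values are hypothetical
toy = pd.DataFrame({
    "ParentalSupport": ["High", "Low", "High", "Moderate", "High"],
    "Tutoring": ["Yes", "No", "No", "No", "Yes"],
})

def category_shares(frame, columns):
    """Return a dict mapping each column to its category proportions."""
    return {col: frame[col].value_counts(normalize=True) for col in columns}

shares = category_shares(toy, ["ParentalSupport", "Tutoring"])
for col, dist in shares.items():
    print(f"{col}:\n{dist}\n")
```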



## 📊 Step 5: Visualizing Numerical Features
Understanding the distribution of numerical variables helps in feature selection and preprocessing.


In [None]:

# Visualize the distribution of numerical features
# include='number' selects every numeric dtype, not only int64 and float64
numerical_cols = df.select_dtypes(include='number').columns
df[numerical_cols].hist(figsize=(12, 8), bins=20, edgecolor='black')
plt.suptitle("Distribution of Numerical Features", fontsize=14)
plt.show()



## 🔗 Step 6: Correlation Analysis
A heatmap of the correlation matrix helps identify relationships between numerical features.


In [None]:

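# Plot the correlation matrix of the numeric columns.
# numeric_only=True (pandas >= 1.5) skips text columns, which would otherwise raise an error in pandas 2.x
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")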
plt.title("Correlation Matrix")
plt.show()
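
When the goal is prediction, it can help to rank features by the strength of their correlation with the target instead of reading the full heatmap. The sketch below uses synthetic data; the column names (`StudyTime`, `Absences`, `Score`, `Group`) and the formula that generates `Score` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for df; all columns are hypothetical
toy = pd.DataFrame({
    "StudyTime": rng.uniform(0, 20, n),
    "Absences": rng.integers(0, 30, n),
    "Group": rng.choice(["A", "B"], n),  # text column, ignored by corr()
})
# Score rises with study time, falls with absences, plus noise
toy["Score"] = (
    50 + 2.0 * toy["StudyTime"] - 1.0 * toy["Absences"] + rng.normal(0, 5, n)
)

# Correlation of each numeric feature with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["Score"]
    .drop("Score")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear relationships, so a weak value here does not rule out a useful feature.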



## 📊 Step 7: Visualizing Categorical Features
We use **count plots** to see how students are distributed across the levels of each categorical variable.


In [None]:

# Countplot for categorical variables
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    plt.figure(figsize=(8, 4))
    sns.countplot(y=df[col], palette="viridis")
    plt.title(f"Distribution of {col}")
    plt.show()



## 🔄 Step 8: Encoding Categorical Variables for Model Training
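Most machine learning models require numerical input. We convert categorical variables into **dummy variables** using one-hot encoding; `drop_first=True` drops one level per variable so that the remaining columns are not redundant.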


In [None]:

# Encoding categorical variables for model training (if needed)
df_encoded = pd.get_dummies(df, drop_first=True)

# Display encoded dataset
print("\nEncoded Dataset Sample:")
display(df_encoded.head())
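
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy DataFrame. The columns (`Tutoring`, `Support`, `Score`) are hypothetical, and `dtype=int` is an optional choice made here so the dummies print as 0/1 instead of booleans.

```python
import pandas as pd

# Toy stand-in for df; column names and values are hypothetical
toy = pd.DataFrame({
    "Tutoring": ["Yes", "No", "Yes"],
    "Support": ["Low", "High", "Moderate"],
    "Score": [70, 55, 88],
})

# One dummy column per category level, minus the first level of each variable
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

`Tutoring` has two levels and yields one column; `Support` has three levels and yields two. The dropped level (`No`, `High`) becomes the baseline, represented by zeros in the remaining columns.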


### 📌 Next Steps

Once the dataset has been analyzed, the next steps are:

- **Feature Selection:** Identify relevant features for prediction (study habits, past scores, tutoring, etc.).
- **Define Target Variable:** Convert grades into a **binary pass/fail** label for classification.
- **Build ML Models:** Start with **Logistic Regression** for prediction and **K-Means Clustering** for student grouping.
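
The steps above could be sketched roughly as follows, assuming scikit-learn is installed. Everything specific here is an assumption for illustration: the data is synthetic, the column names (`StudyTime`, `Absences`, `Score`) are hypothetical, the pass mark of 55 is arbitrary, and three clusters is a placeholder, not a tuned choice.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in for the real dataset; all columns are hypothetical
toy = pd.DataFrame({
    "StudyTime": rng.uniform(0, 20, n),
    "Absences": rng.integers(0, 30, n),
})
toy["Score"] = (
    50 + 2.0 * toy["StudyTime"] - 1.0 * toy["Absences"] + rng.normal(0, 5, n)
)

# Define the target: binary pass/fail label from an assumed pass mark
PASS_MARK = 55.0  # hypothetical threshold
toy["Passed"] = (toy["Score"] >= PASS_MARK).astype(int)

# Feature selection: here simply the two synthetic predictors
X = toy[["StudyTime", "Absences"]]
y = toy["Passed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Supervised model: scaled features into logistic regression
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")

# Clustering model: k-means on the scaled features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["Cluster"] = kmeans.fit_predict(StandardScaler().fit_transform(X))
print(toy["Cluster"].value_counts())
```

Scaling before both models keeps features with large ranges from dominating; for k-means this matters because it relies on Euclidean distance.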



