# 📊 Student Performance Analysis & Prediction

This project analyzes student exam performance in four stages: exploratory data analysis, visualization, classification of students as high or low performers, and regression to predict the average score.

## 📥 Step 1: Load the Data

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('StudentsPerformance.csv')
df.head()

## 🧹 Step 2: Data Cleaning & Basic EDA

In [None]:
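# info() prints its report directly; the other summaries need print() because
# a notebook cell only displays the value of its last expression
df.info()
print(df.describe())
print(df.isnull().sum())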
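
Beyond missing values, two other checks are common at this stage: duplicated rows and scores outside the valid range. The sketch below is illustrative only. It uses a small made-up frame `sample` rather than `df`, and it assumes the scores are on a 0 to 100 scale, which the notebook does not state.

```python
import pandas as pd

# Made-up rows with the notebook's score column names (illustrative values only)
sample = pd.DataFrame({
    'math score': [72, 105, 72],
    'reading score': [70, 88, 70],
    'writing score': [74, 90, 74],
})
score_cols = ['math score', 'reading score', 'writing score']

# Count rows that exactly repeat an earlier row
n_duplicates = int(sample.duplicated().sum())

# Flag rows where any score falls outside the assumed 0-100 range
bad_mask = ((sample[score_cols] < 0) | (sample[score_cols] > 100)).any(axis=1)
out_of_range = sample[bad_mask]

print("Duplicate rows:", n_duplicates)
print("Rows with out-of-range scores:", list(out_of_range.index))
```

The same two checks could be run on `df` directly once the data is loaded.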

## 📊 Step 3: Data Visualization

In [None]:

# Per-student mean of the three subject scores
df['average score'] = df[['math score', 'reading score', 'writing score']].mean(axis=1)

sns.barplot(x='gender', y='average score', data=df, hue='gender', palette='pastel', legend=False)
plt.title('Average Score by Gender')
plt.show()

sns.barplot(x='race/ethnicity', y='average score', data=df, hue='race/ethnicity', palette='muted', legend=False)
plt.title('Average Score by Race/Ethnicity')
plt.show()

sns.boxplot(x='parental level of education', y='average score', data=df)
plt.title('Score Distribution by Parental Education')
plt.xticks(rotation=45)
plt.show()
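
The plots above compare groups visually; a small table of the same statistics can make the differences easier to quote. This is a minimal sketch, assuming a made-up frame `sample` in place of `df` and a hypothetical helper named `group_summary`.

```python
import pandas as pd

# Made-up rows with the notebook's column names (illustrative values only)
sample = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'math score': [70, 60, 90, 80],
    'reading score': [80, 60, 90, 70],
    'writing score': [90, 60, 90, 60],
})
sample['average score'] = sample[['math score', 'reading score', 'writing score']].mean(axis=1)

def group_summary(frame, by, value='average score'):
    """Return the mean, median and row count of `value` for each level of `by`."""
    return frame.groupby(by)[value].agg(['mean', 'median', 'count']).round(2)

summary = group_summary(sample, 'gender')
print(summary)
```

Calling `group_summary(df, 'parental level of education')` would give the numeric counterpart of the box plot.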


## 🛠️ Step 4: Feature Engineering

In [None]:

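from sklearn.preprocessing import LabelEncoder

# Label a student 'High' when the average score is at least 70, otherwise 'Low'
df['performance level'] = df['average score'].apply(lambda x: 'High' if x >= 70 else 'Low')

# Integer-encode the categorical columns on a copy of the data.
# One fitted encoder is kept per column so the codes can be mapped back to labels later.
df_encoded = df.copy()
categorical_cols = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df_encoded[col] = encoders[col].fit_transform(df_encoded[col])

df_encoded.head()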
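
Label encoding assigns integers in alphabetical order, which implies an ordering that unordered categories do not have. Tree models tolerate this, but linear models can be misled by it. One common alternative is one-hot encoding with `pd.get_dummies`. The sketch below is not part of the original workflow, and the category values in `sample` are made up for illustration.

```python
import pandas as pd

# Made-up rows; the category values are illustrative, not taken from the CSV
sample = pd.DataFrame({
    'lunch': ['standard', 'free/reduced', 'standard'],
    'test preparation course': ['none', 'completed', 'none'],
    'math score': [72, 47, 90],
})

# One indicator column per category; drop_first removes the first (alphabetical)
# level of each feature, which is redundant given the remaining columns
one_hot = pd.get_dummies(
    sample,
    columns=['lunch', 'test preparation course'],
    drop_first=True,
)
print(one_hot.columns.tolist())
```

Columns that are not listed in `columns` (here `math score`) pass through unchanged.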


## 🤖 Step 5: Classification (Decision Tree and Random Forest)

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

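# Features: the encoded background columns plus the three subject scores.
# Note: 'performance level' is derived from the mean of these same three scores,
# so the classifiers mostly re-learn the 70-point threshold (target leakage).
X = df_encoded[categorical_cols + ['math score', 'reading score', 'writing score']]
y = df_encoded['performance level']
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)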

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)

print("Decision Tree Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print("Random Forest Accuracy:", accuracy_score(y_test, forest.predict(X_test)))
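
Because the label is computed from the three subject scores, it can be informative to compare accuracy with and without those scores among the features. The sketch below shows the mechanics on synthetic data: `demo` is a randomly generated stand-in for `df_encoded`, and `cv_accuracy` is a hypothetical helper. In this synthetic frame the background columns are unrelated to the label by construction, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for df_encoded (random values, for illustration only)
demo = pd.DataFrame({
    'test preparation course': rng.integers(0, 2, n),
    'lunch': rng.integers(0, 2, n),
    'math score': rng.integers(30, 100, n),
    'reading score': rng.integers(30, 100, n),
    'writing score': rng.integers(30, 100, n),
})
score_cols = ['math score', 'reading score', 'writing score']
background_cols = ['test preparation course', 'lunch']

# Same labeling rule as the notebook: average of at least 70 means 'High'
avg = demo[score_cols].mean(axis=1)
demo['performance level'] = np.where(avg >= 70, 'High', 'Low')

def cv_accuracy(columns):
    """Mean 5-fold cross-validated accuracy of a random forest on `columns`."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(model, demo[columns], demo['performance level'], cv=5)
    return scores.mean()

acc_with_scores = cv_accuracy(background_cols + score_cols)
acc_background_only = cv_accuracy(background_cols)

print("With subject scores:   ", round(acc_with_scores, 3))
print("Background columns only:", round(acc_background_only, 3))
```

Running the same comparison on `df_encoded` with `categorical_cols` would show how much of the reported accuracy comes from the background features alone.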


## 📈 Step 6: Regression - Predicting Average Score

In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

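# Target: the average score. X still contains the three subject scores, and the
# target is their exact mean, so a linear model can reproduce it almost perfectly.
y_reg = df_encoded['average score']
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X, y_reg, test_size=0.2, random_state=42)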
reg = LinearRegression().fit(X_train_r, y_train_r)
y_pred_r = reg.predict(X_test_r)

print("Mean Squared Error:", mean_squared_error(y_test_r, y_pred_r))
print("R2 Score:", r2_score(y_test_r, y_pred_r))
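
A near-perfect fit is what one would expect here, since the average score is an exact linear function of the three subject scores: each contributes with weight 1/3 and the intercept is 0. The sketch below checks this on randomly generated scores; `scores` and `reg_demo` are illustrative names, not objects from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Random stand-ins for the math, reading and writing scores
scores = rng.integers(0, 101, size=(200, 3)).astype(float)
average = scores.mean(axis=1)

# Fit a linear model that predicts the average from its own three components
reg_demo = LinearRegression().fit(scores, average)
r2 = reg_demo.score(scores, average)

print("Coefficients:", reg_demo.coef_)
print("Intercept:", reg_demo.intercept_)
print("R2:", r2)
```

Dropping the three score columns from the regression features would turn this into a genuine prediction task.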



## ✅ Step 7: Project Summary

- **Goal**: Analyze and predict student performance.
- **Models Used**:
  - Decision Tree Classifier → Accuracy: 0.97
  - Random Forest Classifier → Accuracy: 0.98
  - Linear Regression → R² Score: 1.00, MSE: 0.00
- **Insights**:
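  - Test preparation and math score appear to be strong performance indicators, although the notebook does not compute feature importances to confirm this.
  - Random Forest gave slightly better classification accuracy than Decision Tree (0.98 vs. 0.97). Both figures are high largely because the performance label is derived from the three subject scores used as features.
  - Linear Regression reproduced the average score exactly (R² of 1.00), which is expected because the target is the mean of three of its own input features.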

🎉 Project Complete!
