# 🎓 Student Performance Analysis Project (with B.Sc CSDA Students)
This notebook analyzes student performance using a synthetic dataset of Indian students, including B.Sc CSDA students. We train regression models to predict final exam scores based on study habits, attendance, and previous performance.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Display settings
pd.set_option('display.max_columns', None)


## 📂 Load the Dataset

In [None]:
# Load the CSV dataset (already generated with B.Sc CSDA students)
df = pd.read_csv("student_performance_india_with_csda.csv")
print("Dataset shape:", df.shape)
df.head(10)

## 🔍 Data Summary & Info

In [None]:
df.info()
df.describe(include='all')
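
Before modeling, it can help to confirm that the columns used later are present and free of missing values. The sketch below runs on a small hand-made frame named `demo` (an assumption for illustration, not the real CSV); in the notebook the same checks would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
demo = pd.DataFrame({
    "Hours_Studied_per_week": [10, 15, None, 20],
    "Attendance_pct": [80.0, 92.5, 75.0, 88.0],
    "Previous_Score": [65, 72, 58, 81],
    "Final_Score": [68, 78, 60, 85],
})

required = ["Hours_Studied_per_week", "Attendance_pct",
            "Previous_Score", "Final_Score"]

# Columns the later cells rely on but that are absent from the frame
missing_cols = [c for c in required if c not in demo.columns]

# Number of missing values in each required column
na_counts = demo[required].isna().sum()

print("Missing columns:", missing_cols)
print(na_counts)
```

If any count is non-zero, rows could be dropped with `dropna(subset=required)` or imputed before fitting, since `LinearRegression` does not accept missing values.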

## 📊 Correlation Heatmap

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Heatmap of Features")
plt.show()
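
The heatmap shows every pairwise correlation; for modeling, the column of correlations with `Final_Score` is usually the most relevant part. A minimal sketch of extracting and ranking that column, using a small made-up frame (`demo` is an assumption, not the notebook's data):

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
demo = pd.DataFrame({
    "Hours_Studied_per_week": [5, 10, 15, 20, 25],
    "Attendance_pct": [70, 60, 90, 80, 95],
    "Previous_Score": [50, 55, 70, 65, 85],
    "Final_Score": [52, 58, 72, 70, 88],
})

# Pearson correlation of each numeric feature with the target, strongest first
target_corr = (
    demo.corr(numeric_only=True)["Final_Score"]
    .drop("Final_Score")
    .sort_values(ascending=False)
)
print(target_corr)
```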

## 📈 Feature Relationships

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15,4))
sns.scatterplot(x="Hours_Studied_per_week", y="Final_Score", data=df, ax=axes[0])
sns.scatterplot(x="Attendance_pct", y="Final_Score", data=df, ax=axes[1])
sns.scatterplot(x="Previous_Score", y="Final_Score", data=df, ax=axes[2])
axes[0].set_title("Hours vs Final Score")
axes[1].set_title("Attendance vs Final Score")
axes[2].set_title("Previous Score vs Final Score")
plt.tight_layout()
plt.show()

## 🧑‍💻 Train/Test Split

In [None]:
# Predictors and target used by both models
features = ["Hours_Studied_per_week","Attendance_pct","Previous_Score"]
X = df[features]
y = df["Final_Score"]

# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train size:", X_train.shape, "Test size:", X_test.shape)
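
A single 80/20 split gives one estimate of model quality, and that estimate can shift with the random seed. As a possible complement, the sketch below uses 5-fold cross-validation. It runs on synthetic stand-in data with an assumed linear relationship (the coefficients and noise level are invented for illustration); in the notebook, `X` and `y` would take the place of the `demo` columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): the notebook would use X and y instead.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Hours_Studied_per_week": rng.uniform(0, 30, n),
    "Attendance_pct": rng.uniform(50, 100, n),
    "Previous_Score": rng.uniform(30, 100, n),
})
# Assumed linear relationship plus Gaussian noise, for illustration only
demo["Final_Score"] = (
    0.8 * demo["Hours_Studied_per_week"]
    + 0.2 * demo["Attendance_pct"]
    + 0.5 * demo["Previous_Score"]
    + rng.normal(0, 5, n)
)

features = ["Hours_Studied_per_week", "Attendance_pct", "Previous_Score"]

# Five shuffled folds; each fold serves once as the validation set
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    LinearRegression(), demo[features], demo["Final_Score"], cv=cv, scoring="r2"
)

print("R2 per fold:", np.round(cv_scores, 3))
print("Mean R2: %.3f (std %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

The spread across folds gives a rough sense of how much a single-split R² might vary.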

## 📐 Linear Regression Model

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Linear Regression:")
print("R2 score:", r2_score(y_test, y_pred_lr))
print("MAE:", mean_absolute_error(y_test, y_pred_lr))
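
One advantage of a linear model is that its fitted coefficients can be read directly: each one is the expected change in `Final_Score` for a one-unit change in that feature, holding the others fixed. A minimal sketch of inspecting them, again on synthetic stand-in data with invented coefficients (in the notebook, the fitted `lr` would be used instead of `lr_demo`):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption), generated from known coefficients
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Hours_Studied_per_week": rng.uniform(0, 30, n),
    "Attendance_pct": rng.uniform(50, 100, n),
    "Previous_Score": rng.uniform(30, 100, n),
})
demo["Final_Score"] = (
    0.8 * demo["Hours_Studied_per_week"]
    + 0.2 * demo["Attendance_pct"]
    + 0.5 * demo["Previous_Score"]
    + rng.normal(0, 5, n)
)

features = ["Hours_Studied_per_week", "Attendance_pct", "Previous_Score"]
lr_demo = LinearRegression().fit(demo[features], demo["Final_Score"])

# One coefficient per feature, in the units of Final_Score per unit of the feature
coefs = pd.Series(lr_demo.coef_, index=features)
print(coefs)
print("Intercept:", lr_demo.intercept_)
```

Because the features are on different scales (hours versus percentages versus scores), the raw coefficients are not directly comparable as measures of importance.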

## 🌳 Random Forest Model

In [None]:
rf = RandomForestRegressor(n_estimators=150, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest:")
print("R2 score:", r2_score(y_test, y_pred_rf))
print("MAE:", mean_absolute_error(y_test, y_pred_rf))
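
To compare the two models at a glance, their metrics can be collected into one table. The sketch below also adds RMSE, which penalizes large errors more heavily than MAE. It uses synthetic stand-in data (an assumption), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the notebook would reuse its own split.
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "Hours_Studied_per_week": rng.uniform(0, 30, n),
    "Attendance_pct": rng.uniform(50, 100, n),
    "Previous_Score": rng.uniform(30, 100, n),
})
y_demo = (
    0.8 * X_demo["Hours_Studied_per_week"]
    + 0.2 * X_demo["Attendance_pct"]
    + 0.5 * X_demo["Previous_Score"]
    + rng.normal(0, 5, n)
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=150, random_state=42),
}

# Fit each model on the same training rows and score it on the same test rows
rows = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows[name] = {
        "R2": r2_score(y_te, pred),
        "MAE": mean_absolute_error(y_te, pred),
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
    }

results = pd.DataFrame(rows).T  # one row per model, one column per metric
print(results.round(3))
```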

## ⭐ Feature Importance (Random Forest)

In [None]:
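# Impurity-based importances from the fitted forest, one value per feature
importances = pd.Series(rf.feature_importances_, index=features)
# Sort ascending so the most important feature appears at the top of the bar chart
importances.sort_values().plot(kind="barh", figsize=(6,4))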
plt.title("Feature Importance (Random Forest)")
plt.show()
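
The `feature_importances_` attribute is impurity-based and computed from the training data. Permutation importance is a commonly used alternative: it shuffles one feature at a time on held-out data and measures how much the score drops. The sketch below is not part of the original analysis; it uses scikit-learn's `permutation_importance` on synthetic stand-in data (an assumption), where the notebook would pass `rf`, `X_test`, and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), for illustration only
rng = np.random.default_rng(42)
n = 200
features = ["Hours_Studied_per_week", "Attendance_pct", "Previous_Score"]
X_demo = pd.DataFrame({
    "Hours_Studied_per_week": rng.uniform(0, 30, n),
    "Attendance_pct": rng.uniform(50, 100, n),
    "Previous_Score": rng.uniform(30, 100, n),
})
y_demo = (
    0.8 * X_demo["Hours_Studied_per_week"]
    + 0.2 * X_demo["Attendance_pct"]
    + 0.5 * X_demo["Previous_Score"]
    + rng.normal(0, 5, n)
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

rf_demo = RandomForestRegressor(n_estimators=150, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the mean drop in R2
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)
perm_importances = pd.Series(perm.importances_mean, index=features).sort_values(
    ascending=False
)
print(perm_importances)
```

If the two rankings disagree noticeably, that may be a sign that the impurity-based values are influenced by how the trees happened to split rather than by predictive value on unseen data.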

## ✅ Conclusion
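- Hours studied, attendance, and previous scores together explain most of the variance in final scores on this synthetic dataset.
- **Linear Regression** reached an R² of about 0.86 on the held-out test set.
- **Random Forest** served as a comparison model: its test scores were slightly lower, though it can capture non-linear relationships that a linear model cannot.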
- Dataset included **B.Sc CSDA students** along with other departments.
- Recommendation: Students should balance study hours and maintain high attendance for better academic performance.