<a href="https://colab.research.google.com/github/itinasharma/MachineLearning/blob/main/StudentPerformanceIndex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Student Performance Prediction Overview

In this notebook, we developed a predictive model to estimate student performance based on several key factors:

1. **Data Loading**: We loaded a dataset containing information about students' study habits, previous scores, and extracurricular activities.

2. **Data Preprocessing**: Categorical variables were encoded into numerical format to prepare the data for analysis.

3. **Feature Selection**: We identified the relevant features (e.g., hours studied, previous scores) and the target variable (performance index).

4. **Model Training**: A linear regression model was trained using the prepared dataset.

5. **Making Predictions**: We used the trained model to predict the performance index for new student data.

6. **Model Evaluation**: The model's performance was assessed using Mean Squared Error (MSE), the average squared difference between predicted and actual values.

The fitted coefficients indicate how strongly each academic factor is associated with the performance index, which can help identify where interventions might be targeted.
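
The six steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the notebook's exact code: the data frame `demo` is a small synthetic stand-in generated with NumPy (the notebook itself reads `Student_Performance.csv`), the target formula is made up, and the 80/20 hold-out split and seed values are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: synthetic stand-in for the real CSV (assumption: made-up values).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Hours Studied": rng.integers(1, 10, size=n),
    "Previous Scores": rng.integers(40, 100, size=n),
    "Extracurricular Activities": rng.choice(["Yes", "No"], size=n),
})
# Hypothetical target: a linear combination of two features plus noise.
demo["Performance Index"] = (
    3.0 * demo["Hours Studied"]
    + 1.0 * demo["Previous Scores"]
    + rng.normal(0.0, 2.0, size=n)
)

# Step 2: encode the categorical column as 0/1.
demo["Extracurricular Activities"] = demo["Extracurricular Activities"].map(
    {"Yes": 1, "No": 0}
)

# Step 3: separate the features from the target.
X_demo = demo.drop(columns="Performance Index")
y_demo = demo["Performance Index"]

# Steps 4-6: fit on a training split, predict on held-out rows, report MSE.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
demo_model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, demo_model.predict(X_test))
print(f"Hold-out MSE: {mse:.2f}")
```

Scoring on rows the model has not seen gives a more honest error estimate than scoring on the training rows.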


In [3]:
# 1. Import Necessary Libraries
import pandas as pd


# 2. Load the Data
file_path = '/content/sample_data/Student_Performance.csv'

# Column names, used only if the CSV has no header row (see the commented-out line below)
columns = [
    'Hours_Studied',                # Total hours spent studying (numerical)
    'Previous_Scores',              # Scores from previous tests (numerical)
    'Extracurricular_Activities',   # Participation in extracurricular activities (categorical)
    'Sleep_Hours',                  # Average sleep hours per day (numerical)
    'Sample_Question_Papers_Practiced',  # Number of question papers practiced (numerical)
    'Performance_Index'             # Target variable (numerical)
]

df = pd.read_csv(file_path)

# If the file doesn't have headers, you can manually assign the column names:
# df = pd.read_csv(file_path, header=None, names=columns)

# Display the first few rows of the DataFrame
print(df.head())


   Hours Studied  Previous Scores Extracurricular Activities  Sleep Hours  \
0              7               99                        Yes            9   
1              4               82                         No            4   
2              8               51                        Yes            7   
3              5               52                        Yes            5   
4              7               75                         No            8   

   Sample Question Papers Practiced  Performance Index  
0                                 1               91.0  
1                                 2               65.0  
2                                 2               45.0  
3                                 2               36.0  
4                                 5               66.0  


In [10]:
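from sklearn.preprocessing import StandardScaler

# Encode the categorical column as 0/1 so it can be scaled and used for regression
df['Extracurricular Activities'] = df['Extracurricular Activities'].map({'Yes': 1, 'No': 0})

numerical_columns = ['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours',
                     'Sample Question Papers Practiced', 'Performance Index']

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
normalized_data = scaler.fit_transform(df[numerical_columns])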

# Create a new DataFrame with normalized data
normalized_df = pd.DataFrame(normalized_data, columns=numerical_columns)

# Keep the unscaled 0/1 flag as a separate column (note the underscored name)
normalized_df['Extracurricular_Activities'] = df['Extracurricular Activities']

# Display the normalized DataFrame
print("\nNormalized Data:")
print(normalized_df.head())


Normalized Data:
   Hours Studied  Previous Scores  Extracurricular Activities  Sleep Hours  \
0       0.775188         1.704176                    1.010455     1.456205   
1      -0.383481         0.723913                   -0.989654    -1.492294   
2       1.161410        -1.063626                    1.010455     0.276805   
3       0.002742        -1.005963                    1.010455    -0.902594   
4       0.775188         0.320275                   -0.989654     0.866505   

   Sample Question Papers Practiced  Performance Index  \
0                         -1.249754           1.862167   
1                         -0.900982           0.508818   
2                         -0.900982          -0.532220   
3                         -0.900982          -1.000687   
4                          0.145333           0.560870   

   Extracurricular_Activities  
0                           1  
1                           0  
2                           1  
3                           1  
4                           0  
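
To make the scaling step concrete, the sketch below checks what `StandardScaler` computes: each value minus the column mean, divided by the column's population standard deviation. It uses only the five `Hours Studied` values shown in `df.head()`, so its numbers will not match the table above, which was fitted on the full dataset. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five 'Hours Studied' values from df.head(), as a single-column array.
hours = np.array([[7.0], [4.0], [8.0], [5.0], [7.0]])

# StandardScaler subtracts the column mean and divides by the standard deviation.
scaled = StandardScaler().fit_transform(hours)

# The same computation by hand (np.std defaults to ddof=0, as StandardScaler uses).
manual = (hours - hours.mean(axis=0)) / hours.std(axis=0)

print(np.allclose(scaled, manual))
```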

In [12]:
# Define features (X) and target (Y)
# The target must not appear among the features; otherwise the model would simply learn to copy it.
X = df[['Hours Studied', 'Previous Scores', 'Extracurricular Activities',
        'Sleep Hours', 'Sample Question Papers Practiced']]  # Features

# Y is the actual performance index, which is available for the training data.
Y = df['Performance Index']  # Target


In [15]:
X = df.drop('Performance Index', axis=1)  # Features
Y = df['Performance Index']


In [16]:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

model = LinearRegression()
model.fit(X, Y)
# Intercept (c) and per-feature coefficients (m) of the fitted model
print("c ", model.intercept_)
print("m ", model.coef_)

c  1.4370006738287273
m  [2.85298205 1.01843419 0.61289758 0.48055975 0.19380214]
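
The bare coefficient array is easier to read when each value is paired with its feature name. The sketch below copies the printed coefficients and assumes they follow the column order of `X` (the CSV column order without `Performance Index`). In a live session, `pd.Series(model.coef_, index=X.columns)` would give the same pairing directly.

```python
import pandas as pd

# Feature order assumed to match the columns of X.
feature_names = [
    "Hours Studied",
    "Previous Scores",
    "Extracurricular Activities",
    "Sleep Hours",
    "Sample Question Papers Practiced",
]

# Coefficients copied from the printed output above.
coefficients = pd.Series(
    [2.85298205, 1.01843419, 0.61289758, 0.48055975, 0.19380214],
    index=feature_names,
)

# Sort so the features with the largest per-unit effect come first.
print(coefficients.sort_values(ascending=False))
```

Each coefficient is the change in the predicted performance index for a one-unit increase in that feature, holding the others fixed.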


In [17]:
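# Caution: these are row 0's standardized values from normalized_df, but the model was fit on X from df.
# Inputs must be on the same scale as the training features for the prediction to be meaningful.
y_pred = model.predict([[0.775188, 1.704176, 1.010455, 1.456205, -1.249754]])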
print(y_pred)

[6.46108318]
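
A linear regression prediction is the intercept plus the sum of each coefficient times its feature value. The sketch below reproduces the prediction above by hand, using the intercept, coefficients, and input vector copied from the printed output. The variable names are illustrative.

```python
import numpy as np

# Intercept and coefficients copied from the model output above.
intercept = 1.4370006738287273
coef = np.array([2.85298205, 1.01843419, 0.61289758, 0.48055975, 0.19380214])

# The feature vector passed to model.predict above.
x_new = np.array([0.775188, 1.704176, 1.010455, 1.456205, -1.249754])

# Prediction = intercept + sum(coef_i * x_i)
manual_pred = intercept + coef @ x_new
print(round(manual_pred, 4))
```

The result agrees with `y_pred` up to the rounding of the printed coefficients.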


