## Load data

### Subtask:
Load the dataset from the provided CSV file into a pandas DataFrame.


**Reasoning**:
Import pandas, load the data, and display the head and info of the dataframe to understand its structure and content.



In [1]:
import pandas as pd

df = pd.read_csv("Performance.csv")
display(df.head())
df.info()  # info() prints its summary and returns None, so it needs no display() wrapper

| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | 9 | 1 | 91.0 |
| 1 | 4 | 82 | No | 4 | 2 | 65.0 |
| 2 | 8 | 51 | Yes | 7 | 2 | 45.0 |
| 3 | 5 | 52 | Yes | 5 | 2 | 36.0 |
| 4 | 7 | 75 | No | 8 | 5 | 66.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Hours Studied                     10000 non-null  int64  
 1   Previous Scores                   10000 non-null  int64  
 2   Extracurricular Activities        10000 non-null  object 
 3   Sleep Hours                       10000 non-null  int64  
 4   Sample Question Papers Practiced  10000 non-null  int64  
 5   Performance Index                 10000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 468.9+ KB

## Explore and preprocess data

### Subtask:
Explore the dataset to understand its structure, identify missing values, and perform necessary preprocessing steps such as encoding categorical features and splitting the data into training and testing sets.


**Reasoning**:
Examine descriptive statistics, check unique values and counts for the categorical column, encode the categorical feature, separate features and target, and split the data into training and testing sets.



In [2]:
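from sklearn.model_selection import train_test_split

# Summary statistics for the numeric columns and class balance of the categorical one
display(df.describe())
display(df['Extracurricular Activities'].value_counts())

# One-hot encode the Yes/No column; drop_first=True keeps a single indicator column
df_encoded = pd.get_dummies(df, columns=['Extracurricular Activities'], drop_first=True)

# Separate the features from the target
X = df_encoded.drop('Performance Index', axis=1)
y = df_encoded['Performance Index']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

display(X_train.head())
display(X_test.head())
display(y_train.head())
display(y_test.head())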

| | Hours Studied | Previous Scores | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 4.9929 | 69.4457 | 6.5306 | 4.5833 | 55.2248 |
| std | 2.589309 | 17.343152 | 1.695863 | 2.867348 | 19.212558 |
| min | 1.0 | 40.0 | 4.0 | 0.0 | 10.0 |
| 25% | 3.0 | 54.0 | 5.0 | 2.0 | 40.0 |
| 50% | 5.0 | 69.0 | 7.0 | 5.0 | 55.0 |
| 75% | 7.0 | 85.0 | 8.0 | 7.0 | 71.0 |
| max | 9.0 | 99.0 | 9.0 | 9.0 | 100.0 |


| Extracurricular Activities | count |
|---|---|
| No | 5052 |
| Yes | 4948 |


| | Hours Studied | Previous Scores | Sleep Hours | Sample Question Papers Practiced | Extracurricular Activities_Yes |
|---|---|---|---|---|---|
| 9254 | 5 | 49 | 7 | 5 | False |
| 1561 | 2 | 48 | 7 | 6 | True |
| 1670 | 2 | 81 | 7 | 2 | False |
| 6087 | 2 | 46 | 6 | 1 | False |
| 6669 | 8 | 47 | 9 | 0 | False |


| | Hours Studied | Previous Scores | Sleep Hours | Sample Question Papers Practiced | Extracurricular Activities_Yes |
|---|---|---|---|---|---|
| 6252 | 5 | 69 | 8 | 2 | False |
| 4684 | 2 | 46 | 4 | 8 | True |
| 1731 | 7 | 56 | 7 | 5 | True |
| 4742 | 6 | 42 | 8 | 5 | True |
| 4521 | 7 | 53 | 4 | 6 | False |


| | Performance Index |
|---|---|
| 9254 | 36.0 |
| 1561 | 25.0 |
| 1670 | 59.0 |
| 6087 | 22.0 |
| 6669 | 40.0 |


| | Performance Index |
|---|---|
| 6252 | 51.0 |
| 4684 | 20.0 |
| 1731 | 46.0 |
| 4742 | 28.0 |
| 4521 | 41.0 |
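
The preprocessing steps in this section (checking for missing values, one-hot encoding with `drop_first=True`, and an 80/20 split) can be sketched on a small synthetic frame. The `toy` data and every `*_toy` name below are hypothetical and only mimic the notebook's column layout; `dtype=int` is an optional choice made here so the indicator column holds 0/1 rather than the booleans shown in the output above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data that mimics part of the notebook's column layout.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 8, 5, 7, 3, 9, 2, 6, 1],
    "Extracurricular Activities": ["Yes", "No", "Yes", "Yes", "No",
                                   "No", "Yes", "No", "Yes", "No"],
    "Performance Index": [91.0, 65.0, 45.0, 36.0, 66.0,
                          30.0, 88.0, 20.0, 70.0, 15.0],
})

# Explicit per-column count of missing values.
missing_per_column = toy.isna().sum()

# For a two-level category, drop_first=True drops the "No" column and
# keeps a single "..._Yes" indicator; dtype=int stores it as 0/1.
toy_encoded = pd.get_dummies(
    toy, columns=["Extracurricular Activities"], drop_first=True, dtype=int
)

# Separate features from the target, then hold out 20% of the rows.
X_toy = toy_encoded.drop(columns="Performance Index")
y_toy = toy_encoded["Performance Index"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
```

With ten rows and `test_size=0.2`, eight rows go to training and two to testing, the same proportion the notebook uses on its 10000 rows.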


## Build and train model

### Subtask:
Choose an appropriate regression model for the prediction task and train it using the preprocessed training data.


**Reasoning**:
The target, Performance Index, is continuous, so linear regression is a reasonable first model. Import `LinearRegression`, instantiate it, and fit it on the training data.



In [3]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
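
For reference, `LinearRegression` fits an ordinary least-squares model. The notation below is introduced here for illustration and does not appear in the notebook: $x_{ij}$ is feature $j$ of training row $i$, $p$ is the number of features, and $n$ is the number of training rows.

$$
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
(\hat{\beta}_0, \ldots, \hat{\beta}_p) = \arg\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

After fitting, the estimated intercept and coefficients are available as `model.intercept_` and `model.coef_`.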

## Evaluate model

### Subtask:
Evaluate the performance of the trained model on the testing data using appropriate metrics.


**Reasoning**:
Evaluate the performance of the trained model by making predictions on the test set and calculating MSE, RMSE, and R-squared scores.



In [4]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'R-squared (R2) Score: {r2}')

Mean Squared Error (MSE): 4.082628398521851
Root Mean Squared Error (RMSE): 2.020551508505005
R-squared (R2) Score: 0.9889832909573145
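
The three metrics above are closely related: RMSE is the square root of MSE, and R-squared compares the model's squared error with that of always predicting the mean. A minimal sketch in plain Python makes the arithmetic explicit. The helper `regression_metrics` and the toy numbers are hypothetical and are not the notebook's data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R-squared) for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Made-up values for illustration only.
y_true_toy = [50.0, 60.0, 70.0, 80.0]
y_pred_toy = [52.0, 58.0, 71.0, 79.0]
mse_toy, rmse_toy, r2_toy = regression_metrics(y_true_toy, y_pred_toy)
print(mse_toy, rmse_toy, r2_toy)
```

Here the squared errors sum to 10, so the MSE is 2.5; the total sum of squares around the mean of 65 is 500, so R-squared is 1 - 10/500 = 0.98.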


## Summary:

### Data Analysis Key Findings

*   The dataset contains 10000 entries and 6 columns with no missing values.
*   The 'Extracurricular Activities' column was one-hot encoded into a single indicator column, `Extracurricular Activities_Yes` (`drop_first=True`).
*   The dataset was split into training (80%, 8000 rows) and testing (20%, 2000 rows) sets.
*   A Linear Regression model was trained on the training data.
*   The model achieved a Mean Squared Error (MSE) of 4.0826 on the test set.
*   The Root Mean Squared Error (RMSE) on the test set was 2.0206.
*   The R-squared (R2) score on the test set was 0.9890, indicating a strong fit.

### Insights or Next Steps

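*   The Linear Regression model predicts the Performance Index well from the provided features: it explains about 98.9% of the variance in the test set, with an RMSE of about 2 points on a target that ranges from 10 to 100.
*   Further analysis could examine the model's coefficients to see which factors most strongly influence the predicted performance. Because the features are measured on different scales, the coefficients are comparable only after the features are standardized.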
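
One way to approach the feature-importance question is to standardize the features before fitting and then rank the coefficients by absolute size. The sketch below runs on synthetic data: the generating rule, the noise level, and all `toy_*` names are assumptions made for illustration, not results from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic features drawn over ranges similar to the notebook's columns.
toy_X = pd.DataFrame({
    "Hours Studied": rng.integers(1, 10, size=n_rows),
    "Previous Scores": rng.integers(40, 100, size=n_rows),
    "Sleep Hours": rng.integers(4, 10, size=n_rows),
})

# Hypothetical generating rule (an assumption, not the real relationship).
toy_y = (
    3.0 * toy_X["Hours Studied"]
    + 1.0 * toy_X["Previous Scores"]
    + 0.5 * toy_X["Sleep Hours"]
    + rng.normal(0.0, 1.0, size=n_rows)
)

# Standardize each feature, then fit ordinary least squares.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(toy_X, toy_y)

# Each coefficient is now the change in the target per one standard
# deviation of the feature, so magnitudes are comparable across features.
std_coefs = pd.Series(
    pipe.named_steps["linearregression"].coef_, index=toy_X.columns
)
ranked = std_coefs.abs().sort_values(ascending=False)
print(ranked)
```

In this synthetic setup the raw weight on "Hours Studied" (3.0) is larger than the one on "Previous Scores" (1.0), yet "Previous Scores" ranks first after standardization because it varies over a much wider range. To apply the same idea to the notebook's data, one would presumably fit the pipeline on `X_train` and `y_train` instead.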
