## Problem Statement

**Goal:** Predict each student's Performance Index from study-related factors.

**Type:** Regression Problem

**Objective:** Build a machine learning model that predicts the Performance Index from the following features:
- Hours Studied
- Previous Scores
- Extracurricular Activities
- Sleep Hours
- Sample Question Papers Practiced

**Business Value:** Help identify factors affecting student performance and provide insights for educational interventions.

## Task 2: Import Required Libraries

In [20]:
import pandas as pd

In [21]:
import numpy as np

In [22]:
import matplotlib.pyplot as plt

In [23]:
import seaborn as sns

In [24]:
pd.__version__

'2.2.3'

## Task 3: Load the Student Performance Dataset

In [25]:
df = pd.read_csv("Student_Performance.csv")

In [26]:
df

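| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | NaN | 1 | 91 |
| 1 | 4 | 82 | No | 4.0 | 2 | 65 |
| 2 | 8 | 51 | Yes | 7.0 | 2 | 45 |
| 3 | 5 | 52 | Yes | 5.0 | 2 | 36 |
| 4 | 7 | 75 | No | NaN | 5 | 66 |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | 1 | 49 | Yes | 4.0 | 2 | 23 |
| 9996 | 7 | 64 | Yes | 8.0 | 5 | 58 |
| 9997 | 6 | 83 | Yes | 8.0 | 5 | 74 |
| 9998 | 9 | 97 | Yes | 7.0 | 0 | 95 |
| 9999 | 7 | 74 | No | 8.0 | 1 | 64 |

10000 rows × 6 columns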


In [27]:
df.head()

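| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | NaN | 1 | 91 |
| 1 | 4 | 82 | No | 4.0 | 2 | 65 |
| 2 | 8 | 51 | Yes | 7.0 | 2 | 45 |
| 3 | 5 | 52 | Yes | 5.0 | 2 | 36 |
| 4 | 7 | 75 | No | NaN | 5 | 66 |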


In [28]:
df.head(10)

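| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | NaN | 1 | 91 |
| 1 | 4 | 82 | No | 4.0 | 2 | 65 |
| 2 | 8 | 51 | Yes | 7.0 | 2 | 45 |
| 3 | 5 | 52 | Yes | 5.0 | 2 | 36 |
| 4 | 7 | 75 | No | NaN | 5 | 66 |
| 5 | 3 | 78 | No | 9.0 | 6 | 61 |
| 6 | 7 | 73 | Yes | 5.0 | 6 | 63 |
| 7 | 8 | 45 | Yes | 4.0 | 6 | 42 |
| 8 | 5 | 77 | No | 8.0 | 2 | 61 |
| 9 | 4 | 89 | No | 4.0 | 0 | 69 |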


In [29]:
df.tail()

| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 9995 | 1 | 49 | Yes | 4.0 | 2 | 23 |
| 9996 | 7 | 64 | Yes | 8.0 | 5 | 58 |
| 9997 | 6 | 83 | Yes | 8.0 | 5 | 74 |
| 9998 | 9 | 97 | Yes | 7.0 | 0 | 95 |
| 9999 | 7 | 74 | No | 8.0 | 1 | 64 |


## Task 4: Inspect the Dataset

### Shape of Dataset

In [30]:
df.shape

(10000, 6)

In [31]:
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

Number of rows: 10000
Number of columns: 6


### Column Names and What Each Column Represents

In [32]:
df.columns

Index(['Hours Studied', 'Previous Scores', 'Extracurricular Activities',
       'Sleep Hours', 'Sample Question Papers Practiced', 'Performance Index'],
      dtype='object')

In [33]:
list(df.columns)

['Hours Studied',
 'Previous Scores',
 'Extracurricular Activities',
 'Sleep Hours',
 'Sample Question Papers Practiced',
 'Performance Index']

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Hours Studied                     10000 non-null  int64  
 1   Previous Scores                   10000 non-null  int64  
 2   Extracurricular Activities        10000 non-null  object 
 3   Sleep Hours                       9998 non-null   float64
 4   Sample Question Papers Practiced  10000 non-null  int64  
 5   Performance Index                 10000 non-null  int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 468.9+ KB


### Column Descriptions

1. **Hours Studied** - Number of hours the student studied
2. **Previous Scores** - Student's previous academic scores
3. **Extracurricular Activities** - Whether student participates in extracurricular activities (Yes/No)
4. **Sleep Hours** - Average hours of sleep per day
5. **Sample Question Papers Practiced** - Number of sample question papers practiced
6. **Performance Index** - Target variable representing overall performance (0-100)
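The Yes/No column described above is stored as text, so most regression models need it converted to numbers first. A minimal sketch of one way to do that, using the five values shown by `df.head()`; the 1/0 mapping and the name `encoded` are illustrative assumptions, not something the notebook itself does:

```python
import pandas as pd

# Values of 'Extracurricular Activities' for rows 0-4, as shown by df.head().
activities = pd.Series(
    ["Yes", "No", "Yes", "Yes", "No"], name="Extracurricular Activities"
)

# Assumed encoding: Yes -> 1, No -> 0.
# Any value missing from the mapping would become NaN, so check for that afterwards.
encoded = activities.map({"Yes": 1, "No": 0})

print(encoded.tolist())
print(encoded.isna().sum())
```

On the full dataset the same mapping could be applied to `df['Extracurricular Activities']`, provided the column contains only these two labels.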

### Data Types

In [35]:
df.dtypes

Hours Studied                         int64
Previous Scores                       int64
Extracurricular Activities           object
Sleep Hours                         float64
Sample Question Papers Practiced      int64
Performance Index                     int64
dtype: object
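`Sleep Hours` is the only `float64` column even though its values look like whole numbers. One plausible reason is the missing entries: pandas represents them as `NaN`, which the default `int64` dtype cannot hold, so the column is read as floats. A small sketch on an in-memory CSV (the three rows are a cut-down stand-in for the real file, not the dataset itself):

```python
import io

import pandas as pd

# Stand-in CSV: the first data row has an empty 'Sleep Hours' field.
csv_text = "Hours Studied,Sleep Hours\n7,\n4,4\n8,7\n"
sample = pd.read_csv(io.StringIO(csv_text))

# The column with a missing value is parsed as float64; the complete one stays int64.
print(sample.dtypes)
print(sample["Sleep Hours"].isna().sum())
```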

In [36]:
df.select_dtypes(include='object').columns

Index(['Extracurricular Activities'], dtype='object')

In [37]:
df.select_dtypes(include=['int64','float64']).columns

Index(['Hours Studied', 'Previous Scores', 'Sleep Hours',
       'Sample Question Papers Practiced', 'Performance Index'],
      dtype='object')

### Apply Basic DataFrame Functions

In [None]:
df['Hours Studied']

In [None]:
df['Performance Index']

In [None]:
df[['Hours Studied','Previous Scores','Performance Index']]

In [None]:
df.iloc[0]

In [None]:
df.iloc[0:5]

In [None]:
df.iloc[0:10,0:3]

In [None]:
df.loc[0:5]

In [None]:
df.loc[0:5,['Hours Studied','Performance Index']]
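The `iloc` and `loc` cells above look alike but slice differently: `iloc[0:5]` is positional and excludes the end, while `loc[0:5]` is label-based and includes it. A short sketch on a toy frame built from the first ten rows shown by `df.head(10)`:

```python
import pandas as pd

# Two columns from the first ten rows shown by df.head(10).
sample = pd.DataFrame({
    "Hours Studied": [7, 4, 8, 5, 7, 3, 7, 8, 5, 4],
    "Performance Index": [91, 65, 45, 36, 66, 61, 63, 42, 61, 69],
})

by_position = sample.iloc[0:5]  # positions 0..4, end excluded
by_label = sample.loc[0:5]      # labels 0..5, end included

print(len(by_position), len(by_label))
```

With the default `RangeIndex` the labels happen to equal the positions, which is why the two calls are easy to confuse.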

### Additional Inspection

In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

In [None]:
df.nunique()
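`df.info()` reported 9998 non-null values for `Sleep Hours`, so two entries in that column are missing. The sketch below shows one way to locate and fill them, using the five rows from `df.head()`. Median imputation is an assumption chosen for illustration; dropping the rows would be an equally reasonable option.

```python
import numpy as np
import pandas as pd

# Toy frame: three columns from the first five rows shown by df.head().
sample = pd.DataFrame({
    "Hours Studied": [7, 4, 8, 5, 7],
    "Sleep Hours": [np.nan, 4.0, 7.0, 5.0, np.nan],
    "Performance Index": [91, 65, 45, 36, 66],
})

# Rows where 'Sleep Hours' is missing.
missing_rows = sample[sample["Sleep Hours"].isna()]

# Assumed strategy: replace missing values with the column median.
sleep_median = sample["Sleep Hours"].median()
filled = sample.copy()
filled["Sleep Hours"] = filled["Sleep Hours"].fillna(sleep_median)

print(missing_rows.index.tolist())
print(filled["Sleep Hours"].tolist())
```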

### Explore Categorical Features

In [None]:
df['Extracurricular Activities'].value_counts()

### Explore Numerical Features

In [None]:
df['Hours Studied'].value_counts()

In [None]:
df['Performance Index'].describe()

In [None]:
df['Hours Studied'].describe()

In [None]:
df['Previous Scores'].describe()

In [None]:
df['Sleep Hours'].describe()

In [None]:
df['Sample Question Papers Practiced'].describe()

In [None]:
df[['Hours Studied','Previous Scores','Sleep Hours','Sample Question Papers Practiced','Performance Index']].describe()

## Task 5: Identify Features (X) and Target (y)

### Predict Performance Index

In [None]:
X = df.drop('Performance Index', axis=1)

In [None]:
X.head()

In [None]:
y = df['Performance Index']

In [None]:
y.head()

In [None]:
X.shape

In [None]:
y.shape

### Feature Types Analysis

In [None]:
categorical_features = X.select_dtypes(include='object').columns.tolist()

In [None]:
categorical_features

In [None]:
numerical_features = X.select_dtypes(include=['int64','float64']).columns.tolist()

In [None]:
numerical_features

In [None]:
print(f"Categorical Features: {len(categorical_features)}")
print(f"Numerical Features: {len(numerical_features)}")
print(f"Total Features: {len(categorical_features)+len(numerical_features)}")
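Listing `int64` and `float64` explicitly works for this dataset, but it would skip numeric columns stored with another width such as `int32`. A sketch of a more dtype-agnostic split using `include='number'` and `exclude='number'`; the frame `X_sample` and its `int32` column are hypothetical, built only to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'Hours Studied' is deliberately stored as int32.
X_sample = pd.DataFrame({
    "Hours Studied": pd.Series([7, 4, 8], dtype="int32"),
    "Extracurricular Activities": ["Yes", "No", "Yes"],
    "Sleep Hours": [np.nan, 4.0, 7.0],
})

# Explicit dtype list: the int32 column is not matched.
strict = X_sample.select_dtypes(include=["int64", "float64"]).columns.tolist()

# 'number' matches every numeric dtype; exclude='number' keeps the rest.
numeric = X_sample.select_dtypes(include="number").columns.tolist()
categorical = X_sample.select_dtypes(exclude="number").columns.tolist()

print(strict)
print(numeric)
print(categorical)
```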

### Summary of Features and Target

In [None]:
print("="*60)
print("FEATURES (X):")
print("="*60)
print(f"Shape: {X.shape}")
print(f"\nCategorical Features ({len(categorical_features)}):")
for feat in categorical_features:
    print(f"  - {feat}")
print(f"\nNumerical Features ({len(numerical_features)}):")
for feat in numerical_features:
    print(f"  - {feat}")
print("\n" + "="*60)
print("TARGET (y):")
print("="*60)
print("Variable: Performance Index")
print(f"Shape: {y.shape}")
print("Type: Continuous (Regression)")
print(f"Range: {y.min()} to {y.max()}")
print(f"Mean: {y.mean():.2f}")
print("="*60)
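Since the target is continuous, a simple reference point for any later model is a baseline that always predicts the mean of `y`. The sketch below computes that baseline and its root mean squared error (RMSE) on the ten `Performance Index` values shown by `df.head(10)`. The choice of RMSE and the variable names are assumptions for illustration; the notebook has not picked a metric yet.

```python
import numpy as np

# Performance Index for rows 0-9, as shown by df.head(10).
y_sample = np.array([91, 65, 45, 36, 66, 61, 63, 42, 61, 69], dtype=float)

# Baseline: predict the mean for every student.
baseline_prediction = y_sample.mean()

# RMSE of the constant prediction; this equals the population standard deviation.
rmse = np.sqrt(np.mean((y_sample - baseline_prediction) ** 2))

print(f"Baseline prediction: {baseline_prediction:.2f}")
print(f"Baseline RMSE: {rmse:.2f}")
```

A trained model is only useful if its error on held-out data is clearly lower than this baseline's.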