## Problem Statement

**Goal:** Predict student performance (math/reading/writing scores) based on various demographic and academic factors.

**Type:** Regression Problem

**Objective:** Build a machine learning model that predicts one score (the math score in this notebook) from the remaining columns:
- Gender
- Race/Ethnicity
- Parental Level of Education
- Lunch Type
- Test Preparation Course
- Reading Score
- Writing Score

**Business Value:** Help identify factors affecting student performance and provide insights for educational interventions.

## Task 2: Import Required Libraries

In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns

In [None]:
pd.__version__

## Task 3: Load the Students Performance Dataset

In [None]:
df=pd.read_csv("StudentsPerformance.csv")

In [None]:
df

In [None]:
df.head()

In [None]:
df.head(10)

In [None]:
df.tail()

## Task 4: Inspect the Dataset

### Shape of Dataset

In [None]:
df.shape

In [None]:
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

### Column Names and What Each Column Represents

In [None]:
df.columns

In [None]:
list(df.columns)

In [None]:
df.info()

### Column Descriptions:

1. **gender** - Student's gender (male/female)
2. **race/ethnicity** - Student's racial/ethnic group (group A, B, C, D, E)
3. **parental level of education** - Highest education level of parents
4. **lunch** - Type of lunch (standard or free/reduced)
5. **test preparation course** - Whether student completed test prep course (completed/none)
6. **math score** - Score in mathematics (0-100)
7. **reading score** - Score in reading (0-100)
8. **writing score** - Score in writing (0-100)

### Data Types

In [None]:
df.dtypes

In [None]:
df.select_dtypes(include='object').columns

In [None]:
df.select_dtypes(include=['int64','float64']).columns
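As a hedged aside on the two selectors above: an explicit `['int64','float64']` list only matches those exact widths, while `include='number'` matches any numeric dtype. The sketch below uses a small made-up frame (`sample` is hypothetical, not the real dataset) in which one column is deliberately stored as `int32` to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
sample = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": np.array([72, 69, 90], dtype="int32"),
    "reading score": np.array([72.0, 90.0, 95.0], dtype="float64"),
})

# Explicit widths: the int32 column is not matched.
explicit = sample.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Generic selector: every numeric column is matched.
generic = sample.select_dtypes(include="number").columns.tolist()

print(explicit)
print(generic)
```

If the CSV loads with the default 64-bit dtypes, both selectors should return the same three score columns; the generic form is only a safeguard.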

### Apply Basic DataFrame Functions

In [None]:
df['gender']

In [None]:
df['math score']

In [None]:
df[['gender','math score','reading score']]

In [None]:
df.iloc[0]

In [None]:
df.iloc[0:5]

In [None]:
df.iloc[0:10,0:3]

In [None]:
df.loc[0:5]

In [None]:
df.loc[0:5,['gender','math score']]
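Note that `iloc` and `loc` treat the end of a slice differently: `iloc[0:5]` is position-based and excludes the end, while `loc[0:5]` is label-based and includes it. A minimal sketch on a made-up frame (`sample` is hypothetical and assumes the default integer index, as `read_csv` produces):

```python
import pandas as pd

# Hypothetical toy frame with the default RangeIndex 0..9; values are made up.
sample = pd.DataFrame({
    "gender": ["female", "male"] * 5,
    "math score": [72, 69, 90, 47, 76, 71, 88, 40, 64, 38],
})

by_position = sample.iloc[0:5]  # positions 0 through 4, end excluded
by_label = sample.loc[0:5]      # labels 0 through 5, end included

print(len(by_position), len(by_label))
```

So the `df.loc[0:5]` cells above return six rows, one more than `df.iloc[0:5]`.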

### Additional Inspection

In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

In [None]:
df.nunique()
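The missing-value, duplicate, and unique-value checks above can also be gathered into a single table, which is easier to scan. This is only a sketch on a made-up frame; `sample`, `quality`, and `duplicate_rows` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame with one missing value and one duplicated row.
sample = pd.DataFrame({
    "gender": ["female", "male", "male", None],
    "math score": [72, 69, 69, 90],
})

# One row per column: dtype, count of missing values, count of distinct values.
quality = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isnull().sum(),
    "unique": sample.nunique(),
})

# Number of rows that repeat an earlier row exactly.
duplicate_rows = int(sample.duplicated().sum())

print(quality)
print(f"Duplicate rows: {duplicate_rows}")
```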

### Explore Categorical Features

In [None]:
df['gender'].value_counts()

In [None]:
df['race/ethnicity'].value_counts()

In [None]:
df['parental level of education'].value_counts()

In [None]:
df['lunch'].value_counts()

In [None]:
df['test preparation course'].value_counts()
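The five cells above repeat the same call per column. One possible shortcut is to loop over the non-numeric columns and use `normalize=True` to get proportions instead of raw counts. The frame and the `shares` dictionary below are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical toy frame; values are made up.
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "female"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "math score": [72, 69, 90, 47],
})

# Treat every non-numeric column as categorical.
categorical_columns = sample.select_dtypes(exclude="number").columns

# Proportion of rows in each category, per column.
shares = {}
for column in categorical_columns:
    shares[column] = sample[column].value_counts(normalize=True)
    print(shares[column], end="\n\n")
```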

### Explore Numerical Features

In [None]:
df['math score'].describe()

In [None]:
df['reading score'].describe()

In [None]:
df['writing score'].describe()

In [None]:
df[['math score','reading score','writing score']].describe()

## Task 5: Identify Features (X) and Target (y)

### Option 1: Predict Math Score

In [None]:
X=df.drop('math score',axis=1)

In [None]:
X.head()

In [None]:
y=df['math score']

In [None]:
y.head()

In [None]:
X.shape

In [None]:
y.shape
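The split above fixes math score as the target. Since any of the three scores could play that role, the same step can be wrapped in a small helper. The function `split_features_target`, the constant `SCORE_COLUMNS`, and the toy frame are assumptions made for illustration; the notebook itself only uses the direct `drop` call.

```python
import pandas as pd

# Columns that may serve as the regression target (names taken from the dataset).
SCORE_COLUMNS = ["math score", "reading score", "writing score"]


def split_features_target(frame, target):
    """Return (X, y) with `target` removed from the feature columns."""
    if target not in SCORE_COLUMNS:
        raise ValueError(f"target must be one of {SCORE_COLUMNS}")
    return frame.drop(columns=target), frame[target]


# Hypothetical toy frame; values are made up.
sample = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

X_demo, y_demo = split_features_target(sample, "reading score")
print(X_demo.columns.tolist())
print(y_demo.name)
```

With this helper the other two scores stay in the feature set, which matches how `X` is built above for the math score.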

### Feature Types Analysis

In [None]:
categorical_features=X.select_dtypes(include='object').columns.tolist()

In [None]:
categorical_features

In [None]:
# 'number' matches every numeric dtype, not only int64/float64
numerical_features=X.select_dtypes(include='number').columns.tolist()

In [None]:
numerical_features

In [None]:
print(f"Categorical Features: {len(categorical_features)}")
print(f"Numerical Features: {len(numerical_features)}")
print(f"Total Features: {len(categorical_features)+len(numerical_features)}")

### Summary of Features and Target

In [None]:
print("="*60)
print("FEATURES (X):")
print("="*60)
print(f"Shape: {X.shape}")
print(f"\nCategorical Features ({len(categorical_features)}):")
for feat in categorical_features:
    print(f"  - {feat}")
print(f"\nNumerical Features ({len(numerical_features)}):")
for feat in numerical_features:
    print(f"  - {feat}")
print("\n" + "="*60)
print("TARGET (y):")
print("="*60)
print(f"Variable: {y.name}")
print(f"Shape: {y.shape}")
print("Type: Continuous (Regression)")
print(f"Range: {y.min()} to {y.max()}")
print(f"Mean: {y.mean():.2f}")
print("="*60)