# Problem Statement & Analytical Objective


Educational institutions collect large volumes of student data but often struggle to translate it into actionable decisions. Schools need to identify which factors truly influence academic performance, distinguish high-impact drivers from noise, and ensure that interventions (time, money, policy) are directed toward areas that produce measurable improvement in outcomes.

This project analyzes student performance data to:

* Identify behavioral, environmental, and demographic factors associated with academic success
* Evaluate whether commonly assumed drivers (e.g., internet access, school type) have practical significance
* Provide evidence-based guidance for academic planning, student support strategies, and policy decisions

The analysis follows an end-to-end analytics workflow:
Data Cleaning → EDA → Statistical Testing → Predictive Modeling → Actionable Insights

## Key Stakeholder Questions

1. Which factors have the strongest relationship with overall student performance?

2. Are students meeting institutional benchmarks for study time, attendance, and academic scores?

3. Do demographic factors (gender, age, school type) meaningfully influence outcomes, or are differences negligible?

4. Does access to resources (internet, travel time, study method) translate into measurable academic advantage?

5. Can student performance be predicted reliably using available data?

6. Which variables should schools prioritize for intervention, and which can be deprioritized?

# 1. Importing Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Importing data

In [2]:
raw_data = pd.read_csv('Student_performance.csv')

# 3. Data Check

1. Size of the data
2. Names of the columns
3. Data types
4. Null value snapshot
5. Sample rows
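
The five checks in this list can be bundled into one helper. The following is a minimal sketch, assuming a hypothetical function name `quick_check` and a small toy frame with made-up values standing in for `raw_data`:

```python
import pandas as pd

def quick_check(df: pd.DataFrame, n_rows: int = 5) -> dict:
    """Collect shape, column names, dtypes, null counts, and sample rows."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_counts": df.isna().sum().to_dict(),
        "sample": df.head(n_rows),
    }

# Toy frame (illustrative values, not the real dataset)
toy = pd.DataFrame({
    "student_id": [1, 2, 3],
    "study_hours": [3.1, None, 7.9],
    "final_grade": ["e", "d", "b"],
})

report = quick_check(toy)
print(report["shape"])
print(report["null_counts"])
```

Calling `quick_check(raw_data)` would return the same five pieces of information that the cells in this section inspect one at a time.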

## 3.1. Size of the data

In [3]:
raw_data.shape

(25000, 16)

The data has 25,000 rows and 16 columns.

## 3.2. Name of the columns

In [4]:
raw_data.columns

Index(['student_id', 'age', 'gender', 'school_type', 'parent_education',
       'study_hours', 'attendance_percentage', 'internet_access',
       'travel_time', 'extra_activities', 'study_method', 'math_score',
       'science_score', 'english_score', 'overall_score', 'final_grade'],
      dtype='object')

## 3.3. Data types

In [5]:
raw_data.describe()

Unnamed: 0,student_id,age,study_hours,attendance_percentage,math_score,science_score,english_score,overall_score
count,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0
mean,7493.0438,16.48276,4.253224,75.084084,63.785944,63.74532,63.681948,64.006172
std,4323.56215,1.703895,2.167541,14.373171,20.875262,20.970529,20.792693,18.932025
min,1.0,14.0,0.5,50.0,0.0,0.0,0.0,14.5
25%,3743.75,15.0,2.4,62.8,48.3,48.2,48.3,49.0
50%,7461.5,16.0,4.3,75.1,64.1,64.1,64.2,64.2
75%,11252.0,18.0,6.1,87.5,80.0,80.0,80.0,79.0
max,15000.0,19.0,8.0,100.0,100.0,100.0,100.0,100.0


In [6]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   student_id             25000 non-null  int64  
 1   age                    25000 non-null  int64  
 2   gender                 25000 non-null  object 
 3   school_type            25000 non-null  object 
 4   parent_education       25000 non-null  object 
 5   study_hours            25000 non-null  float64
 6   attendance_percentage  25000 non-null  float64
 7   internet_access        25000 non-null  object 
 8   travel_time            25000 non-null  object 
 9   extra_activities       25000 non-null  object 
 10  study_method           25000 non-null  object 
 11  math_score             25000 non-null  float64
 12  science_score          25000 non-null  float64
 13  english_score          25000 non-null  float64
 14  overall_score          25000 non-null  float64
 15  fi

In [7]:
raw_data.dtypes

student_id                 int64
age                        int64
gender                    object
school_type               object
parent_education          object
study_hours              float64
attendance_percentage    float64
internet_access           object
travel_time               object
extra_activities          object
study_method              object
math_score               float64
science_score            float64
english_score            float64
overall_score            float64
final_grade               object
dtype: object

## 3.4. Null Values 

In [8]:
raw_data.isna().sum()

student_id               0
age                      0
gender                   0
school_type              0
parent_education         0
study_hours              0
attendance_percentage    0
internet_access          0
travel_time              0
extra_activities         0
study_method             0
math_score               0
science_score            0
english_score            0
overall_score            0
final_grade              0
dtype: int64

## 3.5. Sample rows

In [9]:
raw_data.head(5)

Unnamed: 0,student_id,age,gender,school_type,parent_education,study_hours,attendance_percentage,internet_access,travel_time,extra_activities,study_method,math_score,science_score,english_score,overall_score,final_grade
0,1,14,male,public,post graduate,3.1,84.3,yes,<15 min,yes,notes,42.7,55.4,57.0,53.1,e
1,2,18,female,public,graduate,3.7,87.8,yes,>60 min,no,textbook,57.6,68.8,64.8,61.3,d
2,3,17,female,private,post graduate,7.9,65.5,no,<15 min,no,notes,84.8,95.0,79.2,89.6,b
3,4,16,other,public,high school,1.1,58.1,no,15-30 min,no,notes,44.4,27.5,54.7,41.6,e
4,5,16,female,public,high school,1.3,61.0,yes,30-60 min,yes,group study,8.9,32.7,30.0,25.4,f


In [10]:
raw_data.sample(5, random_state = 42)

Unnamed: 0,student_id,age,gender,school_type,parent_education,study_hours,attendance_percentage,internet_access,travel_time,extra_activities,study_method,math_score,science_score,english_score,overall_score,final_grade
6868,6869,18,other,private,phd,6.3,82.9,yes,30-60 min,no,notes,71.0,78.7,88.4,86.1,b
24016,12456,15,other,public,high school,2.1,76.1,yes,<15 min,no,notes,41.1,50.8,44.1,44.9,e
9668,9669,15,male,public,graduate,3.7,84.1,no,<15 min,no,notes,73.2,73.5,88.7,66.4,d
13640,13641,16,male,public,no formal,2.4,62.3,yes,30-60 min,yes,coaching,56.1,42.3,55.8,43.9,e
14018,14019,17,other,private,high school,1.0,91.0,yes,>60 min,yes,online videos,49.9,48.9,33.0,37.7,f


# 4. Handling Duplicates

## 4.1. Overall duplicates

1. Identify overall duplicates
2. Count duplicate rows
3. View duplicate rows
4. View duplicate rows including first occurrence
5. Drop overall duplicates

### 4.1.1. Identify overall duplicates

In [11]:
raw_data.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
24995     True
24996     True
24997     True
24998     True
24999     True
Length: 25000, dtype: bool

Some values are `True`, which means the data contains duplicate rows.

### 4.1.2. Count of duplicate rows

In [12]:
raw_data.duplicated().sum()

10000

There are 10,000 rows that repeat an earlier row. This count excludes the first occurrence of each record, so these 10,000 rows should be removed.

In [13]:
raw_data.duplicated(keep = False).sum()

17265

There are 17,265 rows that belong to a duplicated group when first occurrences are included. Subtracting the 10,000 repeats leaves 7,265 distinct entries that have at least one duplicate.
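
The relationship between the two counts can be checked on a toy frame. This is a sketch with made-up rows: `duplicated()` flags every repeat after the first occurrence, `duplicated(keep=False)` flags every member of a duplicated group, and the difference is the number of distinct records that have at least one copy.

```python
import pandas as pd

# Toy frame: student 2 appears twice, student 3 appears three times
toy = pd.DataFrame({
    "student_id": [1, 2, 2, 3, 3, 3, 4],
    "score":      [50, 60, 60, 70, 70, 70, 80],
})

extra_copies = int(toy.duplicated().sum())            # repeats after the first occurrence
all_members = int(toy.duplicated(keep=False).sum())   # every row in a duplicated group
records_with_copies = all_members - extra_copies      # distinct records that are repeated

# Dropping duplicates removes exactly the extra copies
rows_after_drop = len(toy.drop_duplicates())

print(extra_copies, all_members, records_with_copies, rows_after_drop)
```

Applied to the counts in this section, 17265 - 10000 = 7265 distinct records have at least one extra copy, and 25000 - 10000 = 15000 rows should remain after dropping duplicates.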

### 4.1.3. View duplicates

In [14]:
raw_data[raw_data.duplicated(keep = False)]

Unnamed: 0,student_id,age,gender,school_type,parent_education,study_hours,attendance_percentage,internet_access,travel_time,extra_activities,study_method,math_score,science_score,english_score,overall_score,final_grade
1,2,18,female,public,graduate,3.7,87.8,yes,>60 min,no,textbook,57.6,68.8,64.8,61.3,d
2,3,17,female,private,post graduate,7.9,65.5,no,<15 min,no,notes,84.8,95.0,79.2,89.6,b
4,5,16,female,public,high school,1.3,61.0,yes,30-60 min,yes,group study,8.9,32.7,30.0,25.4,f
5,6,19,male,public,no formal,3.8,69.6,yes,>60 min,yes,coaching,51.5,78.3,63.9,63.5,d
6,7,14,female,private,post graduate,1.8,81.6,yes,30-60 min,no,textbook,41.9,29.4,39.2,39.1,f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,12047,17,female,public,phd,1.8,55.2,yes,15-30 min,no,mixed,55.8,48.5,46.7,46.1,e
24996,1102,16,female,private,diploma,2.7,97.1,yes,<15 min,no,coaching,64.8,48.2,52.3,56.5,d
24997,4422,19,other,private,post graduate,1.0,63.0,yes,<15 min,no,group study,50.5,20.3,36.1,36.7,f
24998,7858,14,male,private,diploma,1.0,69.4,yes,15-30 min,yes,group study,13.0,34.2,7.3,34.1,f


### 4.1.4. Drop duplicates

In [15]:
data_deduped = raw_data.drop_duplicates()

In [16]:
data_deduped.shape

(15000, 16)

The raw data had 25,000 rows. After dropping the 10,000 duplicate rows, 15,000 rows remain.

## 4.2. Column based duplicates

1. Detect column based duplicates
2. View column based duplicates
3. Drop column based duplicates based on conditions ( maximum null counts, latest date, etc. )
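
Step 3 can be sketched as follows for the case where a key repeats across rows that are not identical. This is a minimal illustration, assuming the preferred row is the one with the fewest nulls and, on ties, the latest date. The `updated_at` column is hypothetical (the student dataset has no date column) and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "math_score": [42.7, None, 57.6, 58.0],
    "updated_at": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-01-01", "2024-03-01"]
    ),
})

# Count nulls per row, then sort so the preferred row comes first within each key:
# fewest nulls first, and the most recent date first among ties
ranked = (
    toy.assign(null_count=toy.isna().sum(axis=1))
       .sort_values(["student_id", "null_count", "updated_at"],
                    ascending=[True, True, False])
)

# keep="first" retains the preferred row for each student_id
deduped = (
    ranked.drop_duplicates(subset=["student_id"], keep="first")
          .drop(columns="null_count")
)
print(deduped)
```

For student 1 the complete row wins over the row with a missing score; for student 2 both rows are complete, so the later date wins.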

In [17]:
data_deduped.columns

Index(['student_id', 'age', 'gender', 'school_type', 'parent_education',
       'study_hours', 'attendance_percentage', 'internet_access',
       'travel_time', 'extra_activities', 'study_method', 'math_score',
       'science_score', 'english_score', 'overall_score', 'final_grade'],
      dtype='object')

Only `student_id` needs to be checked for column-based duplicates, because it is the one column that must be unique.

### 4.2.1. student_id

#### 4.2.1.1. Detect column based duplicates

In [18]:
data_deduped.duplicated(subset = ['student_id']).sum()

0

There are no duplicates in `student_id`, so it can serve as the primary key.
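
A primary key should be unique and free of missing values. The sketch below shows one way to assert both on a toy frame (made-up values); `set_index(..., verify_integrity=True)` is a pandas option that raises `ValueError` if the new index contains duplicates.

```python
import pandas as pd

toy = pd.DataFrame({"student_id": [1, 2, 3], "age": [14, 18, 17]})

# A usable key is unique and has no missing values
key = toy["student_id"]
is_valid_key = bool(key.is_unique and key.notna().all())
print(is_valid_key)

# Enforce uniqueness while setting the index
indexed = toy.set_index("student_id", verify_integrity=True)

# The same call raises ValueError when the key repeats
with_repeat = pd.DataFrame({"student_id": [1, 1], "age": [14, 15]})
try:
    with_repeat.set_index("student_id", verify_integrity=True)
    repeat_detected = False
except ValueError:
    repeat_detected = True
print(repeat_detected)
```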

# 5. Standardizing data

String Handling

1. Trim spaces
2. Case normalization
3. Split
4. Concatenate
5. Extract patterns

Date & Time Handling

1. String → date
2. Multiple formats
3. Extract year / month / day
4. Time differences
5. Invalid dates

Numerical Handling

1. Type coercion
2. Rounding
3. Scaling
4. Negative / impossible values

Categorical Handling

1. Case issues
2. Misspellings
3. Unknown / Other
4. Category consolidation

Flag Creation

1. Binary flags
2. Condition-based flags
3. Time-based flags
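
Several items on this checklist (trimming, case normalization, misspellings, unknown values) can be chained in one pass over a text column. This is a sketch on made-up values; the misspelling map and the `unknown` label are assumptions for illustration, not rules taken from the dataset.

```python
import pandas as pd

raw = pd.Series(["  Public", "private ", "PUBLIC", "pvt", None])

cleaned = (
    raw.str.strip()                    # trim leading and trailing spaces
       .str.lower()                    # normalize case
       .replace({"pvt": "private"})    # hypothetical misspelling map
       .fillna("unknown")              # explicit label for missing values
)
print(cleaned.tolist())
```

Normalizing to lowercase first keeps the misspelling map small, because each variant only has to be listed once.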


In [19]:
data_deduped.head(10)

Unnamed: 0,student_id,age,gender,school_type,parent_education,study_hours,attendance_percentage,internet_access,travel_time,extra_activities,study_method,math_score,science_score,english_score,overall_score,final_grade
0,1,14,male,public,post graduate,3.1,84.3,yes,<15 min,yes,notes,42.7,55.4,57.0,53.1,e
1,2,18,female,public,graduate,3.7,87.8,yes,>60 min,no,textbook,57.6,68.8,64.8,61.3,d
2,3,17,female,private,post graduate,7.9,65.5,no,<15 min,no,notes,84.8,95.0,79.2,89.6,b
3,4,16,other,public,high school,1.1,58.1,no,15-30 min,no,notes,44.4,27.5,54.7,41.6,e
4,5,16,female,public,high school,1.3,61.0,yes,30-60 min,yes,group study,8.9,32.7,30.0,25.4,f
5,6,19,male,public,no formal,3.8,69.6,yes,>60 min,yes,coaching,51.5,78.3,63.9,63.5,d
6,7,14,female,private,post graduate,1.8,81.6,yes,30-60 min,no,textbook,41.9,29.4,39.2,39.1,f
7,8,18,female,private,post graduate,5.6,59.4,yes,>60 min,yes,group study,56.7,60.1,53.4,69.6,d
8,9,15,other,private,high school,3.2,89.6,yes,15-30 min,yes,mixed,54.1,59.5,38.3,55.2,d
9,10,14,female,public,diploma,6.8,62.4,yes,>60 min,no,mixed,71.9,70.4,81.3,69.6,d


## 5.1 String handling

### 5.1.1. final_grade

All grade labels should be uppercase.

In [20]:
data_1 = data_deduped.copy()

In [21]:
data_1['final_grade'] =  data_1['final_grade'].str.upper()

In [22]:
data_1.head(5)

Unnamed: 0,student_id,age,gender,school_type,parent_education,study_hours,attendance_percentage,internet_access,travel_time,extra_activities,study_method,math_score,science_score,english_score,overall_score,final_grade
0,1,14,male,public,post graduate,3.1,84.3,yes,<15 min,yes,notes,42.7,55.4,57.0,53.1,E
1,2,18,female,public,graduate,3.7,87.8,yes,>60 min,no,textbook,57.6,68.8,64.8,61.3,D
2,3,17,female,private,post graduate,7.9,65.5,no,<15 min,no,notes,84.8,95.0,79.2,89.6,B
3,4,16,other,public,high school,1.1,58.1,no,15-30 min,no,notes,44.4,27.5,54.7,41.6,E
4,5,16,female,public,high school,1.3,61.0,yes,30-60 min,yes,group study,8.9,32.7,30.0,25.4,F


### 5.1.2. gender, school_type, study_method and parent_education

The first letter of each value should be capitalized.

In [23]:
data_1['gender'] = data_1['gender'].str.capitalize()

In [24]:
for i in ('school_type', 'study_method', 'parent_education'):
    data_1[i] = data_1[i].str.capitalize()

In [25]:
data_1.head(5)

Unnamed: 0,student_id,age,gender,school_type,parent_education,study_hours,attendance_percentage,internet_access,travel_time,extra_activities,study_method,math_score,science_score,english_score,overall_score,final_grade
0,1,14,Male,Public,Post graduate,3.1,84.3,yes,<15 min,yes,Notes,42.7,55.4,57.0,53.1,E
1,2,18,Female,Public,Graduate,3.7,87.8,yes,>60 min,no,Textbook,57.6,68.8,64.8,61.3,D
2,3,17,Female,Private,Post graduate,7.9,65.5,no,<15 min,no,Notes,84.8,95.0,79.2,89.6,B
3,4,16,Other,Public,High school,1.1,58.1,no,15-30 min,no,Notes,44.4,27.5,54.7,41.6,E
4,5,16,Female,Public,High school,1.3,61.0,yes,30-60 min,yes,Group study,8.9,32.7,30.0,25.4,F
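
`str.capitalize()` uppercases only the first character of the whole string and lowercases the rest, which is why the output shows `Post graduate` rather than `Post Graduate`. If title-style labels were preferred, `str.title()` would be an alternative. The comparison below is illustrative only and is not what the notebook applies.

```python
import pandas as pd

labels = pd.Series(["post graduate", "high school", "phd"])

capitalized = labels.str.capitalize().tolist()  # first character of the string only
titled = labels.str.title().tolist()            # first character of every word

print(capitalized)
print(titled)
```

Neither method restores acronym-style casing, so a label such as `PhD` would need an explicit mapping.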


## 5.2. Categorical Handling

The categorical columns are: gender, school_type, parent_education, internet_access, travel_time, extra_activities, study_method and final_grade.

### 5.2.1. Gender

In [26]:
data_1['gender'].unique()

array(['Male', 'Female', 'Other'], dtype=object)

In [27]:
data_1['gender'].value_counts(dropna= False)

Other     5042
Male      4979
Female    4979
Name: gender, dtype: int64

In [28]:
data_1['gender'].value_counts(dropna= False, normalize = True)

Other     0.336133
Male      0.331933
Female    0.331933
Name: gender, dtype: float64

### 5.2.2. School_type

In [29]:
data_1['school_type'].unique()

array(['Public', 'Private'], dtype=object)

In [30]:
data_1['school_type'].value_counts(dropna= False)

Private    7587
Public     7413
Name: school_type, dtype: int64

In [31]:
data_1['school_type'].value_counts(dropna= False, normalize = True)

Private    0.5058
Public     0.4942
Name: school_type, dtype: float64

### 5.2.3. parent_education

In [32]:
data_1['parent_education'].unique()

array(['Post graduate', 'Graduate', 'High school', 'No formal', 'Diploma',
       'Phd'], dtype=object)

In [33]:
data_1['parent_education'].value_counts(dropna= False)

Diploma          2581
Post graduate    2535
High school      2532
Graduate         2481
No formal        2445
Phd              2426
Name: parent_education, dtype: int64

In [34]:
data_1['parent_education'].value_counts(dropna= False, normalize = True)

Diploma          0.172067
Post graduate    0.169000
High school      0.168800
Graduate         0.165400
No formal        0.163000
Phd              0.161733
Name: parent_education, dtype: float64

### 5.2.4. internet access

In [35]:
data_1['internet_access'].unique()

array(['yes', 'no'], dtype=object)

In [36]:
data_1['internet_access'].value_counts(dropna = False) 

yes    12754
no      2246
Name: internet_access, dtype: int64

In [37]:
data_1['internet_access'].value_counts(dropna = False, normalize = True) 

yes    0.850267
no     0.149733
Name: internet_access, dtype: float64

### 5.2.5. travel_time

In [38]:
data_1['travel_time'].unique()

array(['<15 min', '>60 min', '15-30 min', '30-60 min'], dtype=object)

In [39]:
data_1['travel_time'].value_counts(dropna = False)

15-30 min    3823
30-60 min    3813
>60 min      3716
<15 min      3648
Name: travel_time, dtype: int64

In [40]:
data_1['travel_time'].value_counts(dropna = False, normalize = True)

15-30 min    0.254867
30-60 min    0.254200
>60 min      0.247733
<15 min      0.243200
Name: travel_time, dtype: float64

### 5.2.6. extra activities, study_method and final grade

In [41]:
for i in ('extra_activities', 'study_method', 'final_grade'):
    unique = data_1[i].unique()
    value_counts = data_1[i].value_counts(dropna = False)
    normalized_values = data_1[i].value_counts(dropna = False, normalize = True)
    print('unique values in ', i , ' are', unique)
    print('value counts in ', i , ' are', value_counts)
    print('normalized value counts in ', i , ' are', normalized_values)

unique values in  extra_activities  are ['yes' 'no']
value counts in  extra_activities  are no     7506
yes    7494
Name: extra_activities, dtype: int64
normalized value counts in  extra_activities  are no     0.5004
yes    0.4996
Name: extra_activities, dtype: float64
unique values in  study_method  are ['Notes' 'Textbook' 'Group study' 'Coaching' 'Mixed' 'Online videos']
value counts in  study_method  are Mixed            2602
Textbook         2546
Notes            2515
Online videos    2468
Group study      2447
Coaching         2422
Name: study_method, dtype: int64
normalized value counts in  study_method  are Mixed            0.173467
Textbook         0.169733
Notes            0.167667
Online videos    0.164533
Group study      0.163133
Coaching         0.161467
Name: study_method, dtype: float64
unique values in  final_grade  are ['E' 'D' 'B' 'F' 'C' 'A']
value counts in  final_grade  are D    3770
C    3697
E    3378
F    1796
B    1638
A     721
Name: final_grade, dtype: int64


## 5.3. Numerical Handling

The following cleaning rules are applied to ensure data consistency and analytical validity.


a) Scores must lie between 0 and 100

b) Attendance percentage must be between 0 and 100

c) Grades must follow consistent categorical representation

d) Identifiers will be excluded from analysis

e) No rows will be dropped without explicit justification
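
Rules (a), (b) and (e) can be expressed as a table of allowed ranges plus a function that counts violations without dropping anything. This is a minimal sketch: the `BOUNDS` mapping and the function name `count_out_of_range` are hypothetical, and the toy values are made up to trigger violations.

```python
import pandas as pd

# Hypothetical rule table: column -> (minimum allowed, maximum allowed)
BOUNDS = {
    "attendance_percentage": (0, 100),
    "math_score": (0, 100),
    "overall_score": (0, 100),
}

def count_out_of_range(df: pd.DataFrame, bounds: dict) -> pd.Series:
    """Count values outside the inclusive range for each column.

    Missing values are counted as violations because between() returns
    False for NaN. No rows are dropped.
    """
    counts = {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in bounds.items()
    }
    return pd.Series(counts, name="out_of_range")

# Toy frame with one invalid attendance value and one invalid score
toy = pd.DataFrame({
    "attendance_percentage": [84.3, 101.0, 65.5],
    "math_score": [42.7, 57.6, -1.0],
    "overall_score": [53.1, 61.3, 89.6],
})

violations = count_out_of_range(toy, BOUNDS)
print(violations)
```

Counting first and deciding later keeps rule (e) intact: any removal can then be justified against the specific rule a row breaks.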

In [42]:
data_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15000 entries, 0 to 14999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   student_id             15000 non-null  int64  
 1   age                    15000 non-null  int64  
 2   gender                 15000 non-null  object 
 3   school_type            15000 non-null  object 
 4   parent_education       15000 non-null  object 
 5   study_hours            15000 non-null  float64
 6   attendance_percentage  15000 non-null  float64
 7   internet_access        15000 non-null  object 
 8   travel_time            15000 non-null  object 
 9   extra_activities       15000 non-null  object 
 10  study_method           15000 non-null  object 
 11  math_score             15000 non-null  float64
 12  science_score          15000 non-null  float64
 13  english_score          15000 non-null  float64
 14  overall_score          15000 non-null  float64
 15  fi

Numerical columns : student_id, age, study_hours, attendance_percentage, math_score, science_score, english_score, overall_score

### 5.3.1. Detecting Invalid Values

#### 5.3.1.1. age

Age cannot be negative.

In [43]:
data_1[data_1['age']<0].shape[0]

0

#### 5.3.1.2. study hours

In [44]:
data_1[data_1['study_hours']<0].shape[0]

0

#### 5.3.1.3. attendance percentage

In [45]:
data_1[data_1['attendance_percentage']<0].shape[0]

0

In [46]:
data_1[data_1['attendance_percentage']> 100].shape[0]

0

#### 5.3.1.4. math_score, science_score, english_score, overall_score

In [47]:
for i in ('math_score', 'science_score', 'english_score', 'overall_score'):
    invalid_negative_values = data_1[data_1[i]<0].shape[0]
    invalid_large_values = data_1[data_1[i]>100].shape[0]
    print('negative invalid values in ', i, 'are', invalid_negative_values)
    print('large invalid values in ', i, 'are', invalid_large_values)

negative invalid values in  math_score are 0
large invalid values in  math_score are 0
negative invalid values in  science_score are 0
large invalid values in  science_score are 0
negative invalid values in  english_score are 0
large invalid values in  english_score are 0
negative invalid values in  overall_score are 0
large invalid values in  overall_score are 0


There are no invalid values

### 5.3.2. Outliers detection

Numerical columns : student_id, age, study_hours, attendance_percentage, math_score, science_score, english_score, overall_score

In [48]:
lower_outlier_columns = []
upper_outlier_columns = []

numeric_columns = ('age', 'study_hours', 'attendance_percentage', 'math_score',
                   'science_score', 'english_score', 'overall_score')

for i in numeric_columns:
    # IQR rule: values more than 1.5 * IQR beyond the quartiles count as outliers
    Q1 = data_1[i].quantile(0.25)
    Q3 = data_1[i].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    lower_outliers = data_1[data_1[i] < lower_bound].shape[0]
    upper_outliers = data_1[data_1[i] > upper_bound].shape[0]

    print('for ', i, ' first quartile is', Q1)
    print('for ', i, ' third quartile is', Q3)
    print('for ', i, ' inter quartile range is', IQR)
    print('for ', i, ' lower_bound is', lower_bound)
    print('for ', i, ' upper_bound is', upper_bound)
    print('for ', i, ' number of lower outlying values are', lower_outliers)
    print('for ', i, ' number of upper outlying values are', upper_outliers)

    # Track which columns need a closer look
    if lower_outliers > 0:
        lower_outlier_columns.append(i)

    if upper_outliers > 0:
        upper_outlier_columns.append(i)

print('columns to check for lower_outliers ', lower_outlier_columns)
print('columns to check for upper_outliers ', upper_outlier_columns)

for  age  first quartile is 15.0
for  age  third quartile is 18.0
for  age  inter quartile range is 3.0
for  age  lower_bound is 10.5
for  age  upper_bound is 22.5
for  age  number of lower outlying values are 0
for  age  number of upper outlying values are 0
for  study_hours  first quartile is 2.4
for  study_hours  third quartile is 6.1
for  study_hours  inter quartile range is 3.6999999999999997
for  study_hours  lower_bound is -3.15
for  study_hours  upper_bound is 11.649999999999999
for  study_hours  number of lower outlying values are 0
for  study_hours  number of upper outlying values are 0
for  attendance_percentage  first quartile is 62.6
for  attendance_percentage  third quartile is 87.4
for  attendance_percentage  inter quartile range is 24.800000000000004
for  attendance_percentage  lower_bound is 25.4
for  attendance_percentage  upper_bound is 124.60000000000001
for  attendance_percentage  number of lower outlying values are 0
for  attendance_percentage  number of upper out

Lower outliers indicate that some students scored far below the rest of the population. A complete analysis follows in the univariate analysis.
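
The IQR rule used in the loop above can be wrapped in a small reusable function. This is a sketch on made-up scores; `iqr_bounds` is a hypothetical helper name, and `k = 1.5` mirrors the multiplier used in the notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower_bound, upper_bound) under the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy scores: one very low value among otherwise typical scores
scores = pd.Series([0.0, 48.0, 55.0, 60.0, 64.0, 70.0, 75.0, 80.0, 90.0])

lower, upper = iqr_bounds(scores)
lower_outliers = scores[scores < lower]
upper_outliers = scores[scores > upper]

print(lower, upper, len(lower_outliers), len(upper_outliers))
```

In this toy series the quartiles are 55 and 75, so the bounds are 25 and 105. Only the score of 0 falls below the lower bound, and because scores are capped at 100 nothing can exceed the upper bound.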

In [49]:
# creating a checkpoint

data_2 = data_1.copy()

The cleaned dataset covers **15,000 students across 16 academic and demographic variables**, after removing 10,000 duplicate records and validating value ranges and category labels.

### **Data Quality & Reliability**

The dataset contains:

* **No missing values**
* **No invalid numeric values** (all scores between 0–100, valid attendance rates, no negative values)
* **Consistent categorical formats** (standardized casing and cleaned category labels)
* **Verified primary key integrity** using `student_id`.

---

### **Population Composition**

* **Gender** is nearly evenly distributed: Male 33.2%, Female 33.2%, Other 33.6%.
* **School Type** is balanced: ~50% Public, ~50% Private.
* **Parental Education** shows broad socio-economic diversity across six levels from "No formal" to "PhD".
* **Internet Access** is high: **85% of students** have access.
* **Study Methods** are diversified, with no dominant single approach.
* **Final Grades** concentrate in the middle bands: C and D together account for about 50% of students, while grade A represents only **~5% of students**.

These distributions show that most groups are of similar size, which supports comparisons across categories without one group dominating the results.
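
The grade shares quoted in this summary can be reproduced from the `final_grade` counts printed in section 5.2.6. The sketch below copies those counts into a Series by hand rather than reading them from `data_1`.

```python
import pandas as pd

# Counts copied from the final_grade value_counts output in section 5.2.6
grade_counts = pd.Series(
    {"A": 721, "B": 1638, "C": 3697, "D": 3770, "E": 3378, "F": 1796}
)

total = int(grade_counts.sum())
grade_share = (grade_counts / total * 100).round(1)            # percent per grade
middle_band_share = round((3697 + 3770) / total * 100, 1)      # C and D combined

print(total)
print(grade_share)
print(middle_band_share)
```

The counts sum to 15,000, grade A comes to roughly 4.8%, and C and D together come to roughly 49.8%.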

---

### **Risk & Opportunity Signals**

* **Lower outliers in subject scores** represent a small but critical at-risk group of students requiring academic intervention.
* The **absence of upper outliers** suggests performance ceilings are consistent across the population.
* Balanced demographic composition allows fair evaluation of policy and intervention strategies without demographic distortion.

---

### **Business & Policy Implications**

1. **Early Intervention Programs**
   Low-performing students can be reliably detected and supported using subject score distributions.

2. **Attendance & Study Optimization**
   Attendance and study hours are candidate levers for academic improvement; their effect on scores is to be quantified in the statistical testing and modeling stages.

3. **Equitable Policy Design**
   Balanced demographic representation ensures insights can guide broad educational planning.

4. **Model Readiness**
   With clean, stable, and validated data, the dataset is now fully prepared for:

   * predictive modeling,
   * academic outcome forecasting,
   * institutional performance dashboards.

---

### **Conclusion**

This dataset is **analytically sound, statistically stable, and operationally reliable**.
The structure of the data, absence of quality defects, and consistency across key variables provide an excellent foundation for the next phase of analysis: **predictive modeling and strategic educational decision-making**.
