## Student Performance Factors Dataset
First, let's take a look at the dataset.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('../datasets/StudentPerformanceFactors.csv')
df

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,69,High,Medium,No,7,76,Medium,Yes,1,High,Medium,Public,Positive,2,No,High School,Near,Female,68
6603,23,76,High,Medium,No,8,81,Medium,Yes,3,Low,High,Public,Positive,2,No,High School,Near,Female,69
6604,20,90,Medium,Low,Yes,6,65,Low,Yes,3,Low,Medium,Public,Negative,2,No,Postgraduate,Near,Female,68
6605,10,86,High,High,Yes,6,91,High,Yes,2,Low,Medium,Private,Positive,3,No,High School,Far,Female,68


We can start by asking a few questions about this data:

- Do sleep hours affect exam score?
- Should low-scoring students prefer tutoring?
- Does family income affect exam score?
- How does peer influence affect exam score?

To answer these questions, we first need to check whether the predictors are linearly related to the outcome (target).
Then we can use multiple regression and t-tests to analyze the relationships between the target and the independent variables.
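As a rough illustration of the linearity check described above, the sketch below computes a Pearson correlation and the matching t-statistic for a single predictor. The data is synthetic and the column names `hours` and `score` are hypothetical; nothing here is read from StudentPerformanceFactors.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from StudentPerformanceFactors.csv
rng = np.random.default_rng(0)
toy = pd.DataFrame({"hours": rng.integers(1, 40, size=200)})
toy["score"] = 55 + 0.3 * toy["hours"] + rng.normal(0, 2, size=200)

# Pearson correlation measures the strength of a linear relationship
r = toy["hours"].corr(toy["score"])

# t-statistic for H0: no linear relationship (n - 2 degrees of freedom)
n = len(toy)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
print(f"r = {r:.3f}, t = {t_stat:.2f}")
```

A large absolute t-statistic suggests a linear association, but a scatter plot is still worth a look, since a correlation coefficient can hide curved relationships.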

Before that, we need to look at the data. It contains both numerical and categorical columns, and regression models require numerical input, so the categorical columns must first be converted to numerical values.

There are two types of categorical variables:
1. Nominal -> no inherent order, such as Yes/No or Gender
2. Ordinal -> has some hierarchy, such as parental involvement, access to resources, peer influence, distance from home
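The two kinds of categorical variable are usually encoded differently. The sketch below uses a tiny hand-made frame (the rows are hypothetical, only the column names come from the dataset): an ordinal column is mapped to ordered integers, while a nominal column is one-hot encoded with `pd.get_dummies`. This is one common approach, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "Motivation_Level": ["Low", "High", "Medium"],
    "School_Type": ["Public", "Private", "Public"],
})

# Ordinal: map the levels onto integers that preserve their order
toy["Motivation_Level"] = toy["Motivation_Level"].map(
    {"Low": 1, "Medium": 2, "High": 3}
)

# Nominal: one indicator column per category; drop_first avoids
# perfect collinearity with the regression intercept
encoded = pd.get_dummies(toy, columns=["School_Type"], drop_first=True, dtype=int)
print(encoded)
```

For a two-level nominal column, the single remaining indicator carries the same information as a manual 0/1 mapping.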

In [5]:
df.dtypes

Hours_Studied                  int64
Attendance                     int64
Parental_Involvement          object
Access_to_Resources           object
Extracurricular_Activities    object
Sleep_Hours                    int64
Previous_Scores                int64
Motivation_Level              object
Internet_Access               object
Tutoring_Sessions              int64
Family_Income                 object
Teacher_Quality               object
School_Type                   object
Peer_Influence                object
Physical_Activity              int64
Learning_Disabilities         object
Parental_Education_Level      object
Distance_from_Home            object
Gender                        object
Exam_Score                     int64
dtype: object

In [7]:
# extract all object-dtype (categorical) columns
cat_cols = df.select_dtypes(include=["object"])
cat_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Parental_Involvement        6607 non-null   object
 1   Access_to_Resources         6607 non-null   object
 2   Extracurricular_Activities  6607 non-null   object
 3   Motivation_Level            6607 non-null   object
 4   Internet_Access             6607 non-null   object
 5   Family_Income               6607 non-null   object
 6   Teacher_Quality             6529 non-null   object
 7   School_Type                 6607 non-null   object
 8   Peer_Influence              6607 non-null   object
 9   Learning_Disabilities       6607 non-null   object
 10  Parental_Education_Level    6517 non-null   object
 11  Distance_from_Home          6540 non-null   object
 12  Gender                      6607 non-null   object
dtypes: object(13)
memory usage: 671.2+ KB


In [None]:
# count null values per column
df.isnull().sum()

Hours_Studied                  0
Attendance                     0
Parental_Involvement           0
Access_to_Resources            0
Extracurricular_Activities     0
Sleep_Hours                    0
Previous_Scores                0
Motivation_Level               0
Internet_Access                0
Tutoring_Sessions              0
Family_Income                  0
Teacher_Quality               78
School_Type                    0
Peer_Influence                 0
Physical_Activity              0
Learning_Disabilities          0
Parental_Education_Level      90
Distance_from_Home            67
Gender                         0
Exam_Score                     0
dtype: int64

In [16]:
df_dropped_na = df.dropna()
# print(df_dropped_na.isnull().sum())
# info() prints its report directly and returns None, so the outer print() adds a trailing "None"
print(df_dropped_na.select_dtypes(include=["object"]).info())

<class 'pandas.core.frame.DataFrame'>
Index: 6378 entries, 0 to 6606
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Parental_Involvement        6378 non-null   object
 1   Access_to_Resources         6378 non-null   object
 2   Extracurricular_Activities  6378 non-null   object
 3   Motivation_Level            6378 non-null   object
 4   Internet_Access             6378 non-null   object
 5   Family_Income               6378 non-null   object
 6   Teacher_Quality             6378 non-null   object
 7   School_Type                 6378 non-null   object
 8   Peer_Influence              6378 non-null   object
 9   Learning_Disabilities       6378 non-null   object
 10  Parental_Education_Level    6378 non-null   object
 11  Distance_from_Home          6378 non-null   object
 12  Gender                      6378 non-null   object
dtypes: object(13)
memory usage: 697.6+ KB
None
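Dropping incomplete rows removes 229 of the 6607 rows here (6607 - 6378). The sketch below, on a hypothetical four-row frame, shows how to report that loss and, as an alternative that this notebook does not use, how to fill a categorical column with its most frequent value instead.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "Teacher_Quality": ["High", np.nan, "Medium", "Medium"],
    "Exam_Score": [70, 65, 68, 72],
})

# Option 1: drop incomplete rows and report how many were lost
dropped = toy.dropna()
print(f"rows lost: {len(toy) - len(dropped)}")

# Option 2 (an alternative, not what this notebook does):
# fill missing labels with the most frequent category
most_common = toy["Teacher_Quality"].mode()[0]
filled = toy.fillna({"Teacher_Quality": most_common})
print(filled["Teacher_Quality"].tolist())
```

Which option is preferable depends on why the values are missing; with only a few percent of rows affected, dropping them is a reasonable default.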


In [23]:
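# shared label -> integer mappings
low_high = {'Low':1,'Medium':2,'High':3}
pos_neg = {'Positive':1, 'Neutral':0, 'Negative':-1}
binary = {'Yes':1, 'No':0}
# one mapping per categorical column (ordinal and binary alike, despite the name)
ordinal_mappings = {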
    'Parental_Involvement': low_high,
    'Access_to_Resources': low_high,
    'Extracurricular_Activities': binary,
    'Motivation_Level': low_high,
    'Internet_Access': binary,
    'Family_Income': low_high,
    'Teacher_Quality': low_high,
    'School_Type': {'Private':1, 'Public':0},
    'Peer_Influence': pos_neg,
    'Learning_Disabilities':binary,
    'Parental_Education_Level': {'High School':1, 'College':2, 'Postgraduate':3},
    'Distance_from_Home': {'Near':3,'Moderate':2,'Far':1},
    'Gender':{'Male':1,'Female':0}
}

In [24]:
for col in cat_cols:
    df_dropped_na[col] = df_dropped_na[col].map(ordinal_mappings[col])
df_dropped_na

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_dropped_na[col] = df_dropped_na[col].map(ordinal_mappings[col])

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,1,3,0,7,73,1,1,0,1,2,0,1,3,0,1,3,1,67
1,19,64,1,2,0,8,59,1,1,2,2,2,0,-1,4,0,2,2,0,61
2,24,98,2,2,1,7,91,2,1,2,2,2,0,0,4,0,3,3,1,74
3,29,89,1,2,1,8,98,2,1,1,2,2,0,-1,4,0,1,2,1,71
4,19,92,2,2,1,6,65,2,1,3,2,3,0,0,4,0,2,3,0,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,69,3,2,0,7,76,2,1,1,3,2,0,1,2,0,1,3,0,68
6603,23,76,3,2,0,8,81,2,1,3,1,3,0,1,2,0,1,3,0,69
6604,20,90,2,1,1,6,65,1,1,3,1,2,0,-1,2,0,3,3,0,68
6605,10,86,3,3,1,6,91,3,1,2,1,2,1,1,3,0,1,1,0,68
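The "A value is trying to be set on a copy of a slice" warning printed by the mapping cell is commonly avoided by taking an explicit `.copy()` right after `dropna()`. The sketch below shows that pattern on a hypothetical three-row frame; the names `toy`, `mappings`, and `clean` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "Internet_Access": ["Yes", "No", np.nan],
    "Peer_Influence": ["Positive", "Negative", "Neutral"],
})
mappings = {
    "Internet_Access": {"Yes": 1, "No": 0},
    "Peer_Influence": {"Positive": 1, "Neutral": 0, "Negative": -1},
}

# .copy() makes the result an independent DataFrame,
# so the column assignments below are unambiguous
clean = toy.dropna().copy()
for col, mapping in mappings.items():
    clean[col] = clean[col].map(mapping)

# map() yields NaN for any label missing from its mapping, so verify none appeared
assert not clean.isnull().any().any()
print(clean)
```

The final null check doubles as a test that every label in the data has an entry in its mapping.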


In [25]:
df_encoded = df_dropped_na.copy()

In [None]:
# confirm no nulls remain (map() would produce NaN for any unmapped label)
df_encoded.isnull().sum()

Hours_Studied                 0
Attendance                    0
Parental_Involvement          0
Access_to_Resources           0
Extracurricular_Activities    0
Sleep_Hours                   0
Previous_Scores               0
Motivation_Level              0
Internet_Access               0
Tutoring_Sessions             0
Family_Income                 0
Teacher_Quality               0
School_Type                   0
Peer_Influence                0
Physical_Activity             0
Learning_Disabilities         0
Parental_Education_Level      0
Distance_from_Home            0
Gender                        0
Exam_Score                    0
dtype: int64

In [28]:
# no nulls remain, so split into predictors (x) and target (y)
x = df_encoded.drop('Exam_Score', axis=1)
y = df_encoded['Exam_Score']

In [31]:
import statsmodels.api as sm

In [33]:
# add an intercept column ('const') to the predictors
X = sm.add_constant(x)

In [34]:
model = sm.OLS(y, X).fit()
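To make the fitted quantities less opaque, here is a NumPy-only sketch of the standard textbook OLS computation: coefficients from least squares, standard errors from the residual variance, and t-statistics as their ratio. It runs on synthetic data with hypothetical variables `x1` and `x2`, and it is not a reproduction of statsmodels internals.

```python
import numpy as np

# Synthetic data: y depends on x1 but not on x2
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + 0.0 * x2 + rng.normal(scale=1.0, size=n)

# Design matrix with a leading column of ones (the role of sm.add_constant)
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance with n - p degrees of freedom, then standard errors
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistics: each coefficient divided by its standard error
t_stats = beta / se
print(np.round(beta, 2), np.round(t_stats, 1))
```

The same degrees-of-freedom arithmetic explains the summary for this notebook's model: 6378 observations minus 20 estimated parameters (19 predictors plus the constant) gives 6358 residual degrees of freedom.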

In [35]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:             Exam_Score   R-squared:                       0.721
Model:                            OLS   Adj. R-squared:                  0.721
Method:                 Least Squares   F-statistic:                     866.8
Date:                Sat, 04 Oct 2025   Prob (F-statistic):               0.00
Time:                        22:30:42   Log-Likelihood:                -13677.
No. Observations:                6378   AIC:                         2.739e+04
Df Residuals:                    6358   BIC:                         2.753e+04
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const               