## EDA 

1. Chi-Square Test of Independence
2. One-way ANOVA Test for Comparing Group Means

In [239]:
# import dependencies
import pandas as pd
import yaml

Load the dataset and preview the first rows, which also lists the column names.

In [240]:
data_mat = pd.read_csv("artifacts/raw/middle-student-mat.csv",sep=";")
data_mat.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


Because the dataset has a large number of attributes, we narrowed our scope to features that students cannot control (i.e. their inherent background). These features are:

- School
- Sex
- Age
- Address
- Famsize
- Pstatus
- Medu
- Fedu
- Mjob
- Fjob
- Reason
- Nursery (attended nursery school)


In [241]:
independent_feat = ['school','sex','age','address','famsize','Pstatus','Medu','Fedu','Mjob', 'Fjob','reason','nursery']
independent_feat

['school',
 'sex',
 'age',
 'address',
 'famsize',
 'Pstatus',
 'Medu',
 'Fedu',
 'Mjob',
 'Fjob',
 'reason',
 'nursery']

In [242]:
dependent_feat = ['G3']
dependent_feat

['G3']

In [243]:
# unique values for age
data_mat['age'].unique()

array([18, 17, 15, 16, 19, 22, 20, 21])

## 1. Testing for correlation between Independent variables

Before testing the relationship between our independent and dependent variables, we need to check whether the independent variables are actually independent of each other.

We will use the chi-square test of independence to check this for categorical variables. Most of the variables are already categorical. `Age` is numeric, but it has only 8 unique values, so we treat it as a categorical variable for this test.

- Null Hypothesis: There is **no relationship** between the 2 variables
- Alternate Hypothesis: There is **a relationship** between the 2 variables
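
To make the test concrete, here is a minimal sketch on toy data. The DataFrame `toy` and its columns are hypothetical and not part of the student dataset; only `pd.crosstab` and `scipy.stats.chi2_contingency` match what the notebook uses.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: two categorical columns with a clear association.
toy = pd.DataFrame({
    "sex": ["F"] * 20 + ["M"] * 20,
    "passed": ["yes"] * 15 + ["no"] * 5 + ["yes"] * 5 + ["no"] * 15,
})

# Contingency table of observed counts (rows: sex, columns: passed).
table = pd.crosstab(toy["sex"], toy["passed"])

# chi2_contingency returns the statistic, p-value, degrees of freedom and
# the expected counts under the null hypothesis of independence.
chi2, pvalue, dof, expected = stats.chi2_contingency(table)

print(table)
print("p-value:", pvalue, "dof:", dof)

# A small p-value (below 0.05 here) leads us to reject the null hypothesis.
reject_null = pvalue < 0.05
```

For a 2x2 table such as this one, `chi2_contingency` applies Yates' continuity correction by default. The `expected` array is also worth inspecting: a common rule of thumb is that the chi-square approximation becomes unreliable when expected counts are very small (often quoted as below 5).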

In [244]:
import scipy.stats as stats

In [245]:
def calculate_chi_square_results(features, df):
    """
    Calculates chi-square test results for all possible pairs of features in a given DataFrame.
    
    Parameters:
        features (list): List of feature names to compare
        df (pandas.DataFrame): DataFrame containing the features
    
    Returns:
        chi_sqr_res_lst (list): List of dictionaries containing chi-square test results for each pair of features
    """
    chi_sqr_res_lst = []  # Initialize an empty list to store the results
    
    # Loop over all possible pairs of features
    for i in range(len(features)):
        for j in range(i+1, len(features)):
            # Calculate crosstabulation for the two features
            crosstab = pd.crosstab(df[features[i]],df[features[j]])
            # Perform chi-square test on the crosstabulation
            res = stats.chi2_contingency(crosstab)
            # Store the results in a dictionary
            results_dict = {
                'variables': (features[i], features[j]),
                'pvalue': res.pvalue,
                'statistics': res.statistic,
                'dof': res.dof,
                'expected_freq': res.expected_freq,
            }
            # Append the dictionary to the results list
            chi_sqr_res_lst.append(results_dict)
    
    return chi_sqr_res_lst

def print_significant_results(chi_sqr_res_lst):
    """
    Prints the variables and p-values for all significant chi-square test results in a list of results.
    A result is considered significant if its p-value is less than 0.05.
    
    Parameters:
        chi_sqr_res_lst (list): List of dictionaries containing chi-square test results for each pair of features
    
    Returns:
        None
    """
    count = 0
    # Loop over all results in the list
    for results in chi_sqr_res_lst:
        # Get the p-value and variables for the current result
        pvalue = float(results.get('pvalue'))
        variables = results.get('variables')
        
        # Check if the result is significant
        if pvalue < 0.05:
            # Print the variable pair and its (unrounded) p-value
            print(variables, pvalue)
            count += 1

    if count == 0:
        print("No pair of variables has a significant relationship")

In [246]:
chi_sqr_res_lst = calculate_chi_square_results(features=independent_feat, df=data_mat)
print_significant_results(chi_sqr_res_lst)

('school', 'age') 2.8253973276676154e-13
('school', 'address') 7.77068354957483e-08
('school', 'Medu') 0.00031360912544192357
('school', 'reason') 0.005947834009263454
('sex', 'Mjob') 0.0015564385627536573
('age', 'address') 0.0034680030378527907
('address', 'Medu') 0.019257159259757555
('address', 'Mjob') 0.01013724484686267
('address', 'reason') 0.021804456830089435
('famsize', 'Pstatus') 0.00524754575877353
('Pstatus', 'Medu') 0.010369983537476098
('Medu', 'Fedu') 8.014562451932377e-34
('Medu', 'Mjob') 7.752732260542515e-39
('Medu', 'Fjob') 4.8280932164911185e-06
('Medu', 'nursery') 0.00026084822887275803
('Fedu', 'Mjob') 4.206183361043177e-08
('Fedu', 'Fjob') 9.007362084326286e-16
('Fedu', 'nursery') 0.02068060288889864
('Mjob', 'Fjob') 2.533576541234461e-09
('Mjob', 'reason') 0.02818547173558085
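
One possible way to decide which features to drop is to count how often each feature appears in a significant pair. The sketch below does this for the pairs printed above; the names `significant_pairs` and `counts` are illustrative and do not come from the notebook, and this is only one heuristic among several.

```python
from collections import Counter

# Significant pairs copied from the chi-square output above (p < 0.05).
significant_pairs = [
    ("school", "age"), ("school", "address"), ("school", "Medu"),
    ("school", "reason"), ("sex", "Mjob"), ("age", "address"),
    ("address", "Medu"), ("address", "Mjob"), ("address", "reason"),
    ("famsize", "Pstatus"), ("Pstatus", "Medu"), ("Medu", "Fedu"),
    ("Medu", "Mjob"), ("Medu", "Fjob"), ("Medu", "nursery"),
    ("Fedu", "Mjob"), ("Fedu", "Fjob"), ("Fedu", "nursery"),
    ("Mjob", "Fjob"), ("Mjob", "reason"),
]

# Count how many significant pairs each feature takes part in.
counts = Counter(feat for pair in significant_pairs for feat in pair)

# Features that appear most often are the strongest candidates for removal.
for feat, n in counts.most_common():
    print(feat, n)
```

For these pairs, Medu appears most often (7 pairs), followed by Mjob (6) and address (5).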


Let's drop school, address, Fedu, Medu, Fjob and Pstatus, then check for relationships using the chi-square test again.

In [247]:
new_categorical_feats = [feat for feat in independent_feat if feat not in ['school', 'address', 'Fedu', 'Medu','Fjob','Pstatus']]
new_categorical_feats

['sex', 'age', 'famsize', 'Mjob', 'reason', 'nursery']

In [248]:
new_chi_sqr_res_lst = calculate_chi_square_results(features=new_categorical_feats, df=data_mat)
print_significant_results(new_chi_sqr_res_lst)

('sex', 'Mjob') 0.0015564385627536573
('Mjob', 'reason') 0.02818547173558085


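After dropping school, address, Fedu, Medu, Fjob and Pstatus, the chi-square test of independence flags only two significant pairs among the remaining features *('sex', 'age', 'famsize', 'Mjob', 'reason', 'nursery')*: ('sex', 'Mjob') and ('Mjob', 'reason'), both of which involve Mjob. For every other pair, we fail to reject the null hypothesis of independence at the 0.05 level.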

## 2. ANOVA Test between Independent and Dependent Variable

We will use a one-way ANOVA to compare the mean final grade (G3) across the categories of each student-background feature.

- Null Hypothesis: There is no difference in mean grades (G3) between the categories of a feature
- Alternate Hypothesis: There is a difference in mean grades (G3) between the categories of a feature
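
As an illustration of these hypotheses, here is a minimal sketch of a one-way ANOVA on a feature with three levels. The toy DataFrame and the `group` column are hypothetical; the sketch assumes one sample of grades per category level, all passed to `scipy.stats.f_oneway` together.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: a three-level categorical feature and a grade column.
toy = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "G3": [10, 11, 9, 10, 10, 9, 11, 10, 15, 16, 14, 15],
})

# One Series of grades per category level.
samples = [grades for _, grades in toy.groupby("group")["G3"]]

# Passing every group tests whether any level's mean differs from the others.
res_all = f_oneway(*samples)

# Passing only two groups tests just those two levels against each other.
res_two = f_oneway(samples[0], samples[1])

print("all groups p-value:", res_all.pvalue)
print("first two groups p-value:", res_two.pvalue)
```

In this toy data, groups `a` and `b` have the same mean while `c` is higher, so the two-group comparison finds nothing and the all-group comparison does. The number of groups passed to `f_oneway` therefore determines which hypothesis is actually being tested.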

In [249]:
from scipy.stats import f_oneway

def calculate_anova_results(feats, df):
    """
    Calculates ANOVA test results for each categorical feature in a given DataFrame.
    
    Parameters:
        feats (list): List of categorical feature names
        df (pandas.DataFrame): DataFrame containing the categorical features and the G3 column
    
    Returns:
        anova_res_lst (list): List of dictionaries containing ANOVA test results for each feature
    """
    anova_res_lst = []  # Initialize an empty list to store the results
    
    # Loop over all categorical features
    for feat in feats:
        # Extract the feature and the dependent variable (G3) from the data matrix
        df_feats = df[[feat,'G3']]
        # Get the unique categories for the feature
        categories = df_feats[feat].unique()
        # Create a list of dataframes, one for each category
        feats_lst = [df_feats.loc[df_feats[feat]== cat] for cat in categories]
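        # Perform ANOVA on the first two categories only (in order of appearance);
        # features with more than two levels are not fully compared here
        res = f_oneway(feats_lst[0]['G3'],feats_lst[1]['G3'])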
        # Store the results in a dictionary
        anova_dict = {
            'variable': feat,
            'statistic': res.statistic,
            'pvalue': res.pvalue
        }
        # Append the dictionary to the results list
        anova_res_lst.append(anova_dict)
    
    return anova_res_lst

In [250]:
anova_res_lst = calculate_anova_results(independent_feat, data_mat)
anova_res_lst

[{'variable': 'school',
  'statistic': 0.7980416422082741,
  'pvalue': 0.3722262371311368},
 {'variable': 'sex',
  'statistic': 4.251814371189991,
  'pvalue': 0.039865332341527955},
 {'variable': 'age',
  'statistic': 1.1027568354139103,
  'pvalue': 0.2950855248030284},
 {'variable': 'address',
  'statistic': 4.445163854236396,
  'pvalue': 0.035632679756558636},
 {'variable': 'famsize',
  'statistic': 2.621832377243357,
  'pvalue': 0.10620482783859679},
 {'variable': 'Pstatus',
  'statistic': 1.3269268029203884,
  'pvalue': 0.25005293926392647},
 {'variable': 'Medu',
  'statistic': 20.965272048704946,
  'pvalue': 8.49017278323326e-06},
 {'variable': 'Fedu',
  'statistic': 10.087338049374276,
  'pvalue': 0.0017630928393755144},
 {'variable': 'Mjob',
  'statistic': 9.313348035987374,
  'pvalue': 0.002981454150613827},
 {'variable': 'Fjob',
  'statistic': 3.741017809408278,
  'pvalue': 0.05424841968457241},
 {'variable': 'reason',
  'statistic': 2.3843787145700515,
  'pvalue': 0.124319682

In [251]:
for result in anova_res_lst:
    variable = result.get('variable')
    pvalue = float(result.get('pvalue'))
    if pvalue < 0.05:
        print(variable)

sex
address
Medu
Fedu
Mjob


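With one-way ANOVA, five features (sex, address, Medu, Fedu and Mjob) show a statistically significant difference in mean G3 at the 0.05 level, which suggests that some background factors are associated with students' grades. Because `calculate_anova_results` passes only the first two categories of each feature to `f_oneway`, the results for features with more than two levels (for example Medu, Fedu and Mjob) reflect only those two levels.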