# **EduExcluders: When Data Plays the Role of the Gatekeeper in Schools**

---
This notebook contains experiments for a Black Mirror scenario in which educational institutions screen students based on their background. The data is first analysed. A classification model is then created to carry out this scenario. We also apply at least two Explainable AI methods, which are then evaluated and compared. Finally, we use these results to provide recommendations for the future.

The prospective student data comes from https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success

# Exploratory Data Analysis

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
base_dir = 'data/'
data_file = 'data.csv'

In [3]:
# Read the data
data = pd.read_csv(f'{base_dir}{data_file}',sep=";")

## Getting Acquainted with the Dataset

In [4]:
# Get the first 5 entries
data.head()

| | Marital status | Application mode | Application order | Course | Daytime/evening attendance\t | Previous qualification | Previous qualification (grade) | Nacionality | Mother's qualification | Father's qualification | ... | Curricular units 2nd sem (credited) | Curricular units 2nd sem (enrolled) | Curricular units 2nd sem (evaluations) | Curricular units 2nd sem (approved) | Curricular units 2nd sem (grade) | Curricular units 2nd sem (without evaluations) | Unemployment rate | Inflation rate | GDP | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 6 | 10 | 5 | 12.4 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 6 | 6 | 6 | 13.0 | 0 | 13.9 | -0.3 | 0.79 | Graduate |


In [5]:
# Information about the size of the dataset and its datatypes
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Marital status                                  4424 non-null   int64  
 1   Application mode                                4424 non-null   int64  
 2   Application order                               4424 non-null   int64  
 3   Course                                          4424 non-null   int64  
 4   Daytime/evening attendance	                     4424 non-null   int64  
 5   Previous qualification                          4424 non-null   int64  
 6   Previous qualification (grade)                  4424 non-null   float64
 7   Nacionality                                     4424 non-null   int64  
 8   Mother's qualification                          4424 non-null   int64  
 9   Father's qualification                   

In [6]:
# Print the number of unique values in each column
data.nunique()

Marital status                                      6
Application mode                                   18
Application order                                   8
Course                                             17
Daytime/evening attendance\t                        2
Previous qualification                             17
Previous qualification (grade)                    101
Nacionality                                        21
Mother's qualification                             29
Father's qualification                             34
Mother's occupation                                32
Father's occupation                                46
Admission grade                                   620
Displaced                                           2
Educational special needs                           2
Debtor                                              2
Tuition fees up to date                             2
Gender                                              2
Scholarship holder          

Having explored these outputs, we can group the columns as follows:

Categorical features: Marital status, Application mode, Application order, Course, Daytime/evening attendance, Previous qualification, Nacionality, Mother's qualification, Father's qualification, Mother's occupation, Father's occupation, Displaced, Educational special needs, Debtor, Tuition fees up to date, Gender, Scholarship holder, International

Numeric: Previous qualification (grade), Admission grade, Age at enrollment, Curricular units 1st/2nd sem (credited),  Curricular units 1st/2nd sem (enrolled), Curricular units 1st/2nd sem (evaluations), Curricular units 1st/2nd sem (approved), Curricular units 1st/2nd sem (grade), Curricular units 1st/2nd sem (without evaluations), Unemployment rate, Inflation rate, GDP

Outcome: Target (categorical)
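
One way to make this grouping explicit in code is to cast the integer-coded categorical columns to pandas' `category` dtype, so that they are not mistaken for quantities. The sketch below is only illustrative: `toy` is a hypothetical stand-in for `data`, it uses a handful of the column names listed above, and the admission grades are made-up values.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the admission grades are made up.
toy = pd.DataFrame({
    "Marital status": [1, 1, 2, 1],
    "Course": [171, 9254, 9070, 8014],
    "Admission grade": [127.3, 142.5, 124.8, 119.6],
    "Target": ["Dropout", "Graduate", "Dropout", "Graduate"],
})

categorical_cols = ["Marital status", "Course"]
numeric_cols = ["Admission grade"]

# Cast the integer-coded categories and the outcome to the `category` dtype.
typed = toy.astype({col: "category" for col in categorical_cols + ["Target"]})

dtype_summary = typed.dtypes.astype(str).to_dict()
print(dtype_summary)
```

On the real dataset, `categorical_cols` and `numeric_cols` would hold the full lists given above.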

## Data Imbalance

In [7]:
# Count the rows per class; naming both columns explicitly keeps the result
# independent of the pandas version's default column names
imbalance_check = (
    data['Target']
    .value_counts()
    .rename_axis('Class')
    .reset_index(name='Occurrences')
)

In [8]:
# Bar chart of the number of students in each target class
sns.barplot(x='Class', y='Occurrences', data=imbalance_check, palette='colorblind')
plt.ylabel('Occurrences')
plt.xlabel('Class')
plt.title('Class Occurrences')
plt.show()

The graph above shows that the data is imbalanced, with Graduate students making up the largest share of the dataset.

However, the imbalance between Graduate and Dropout may not be significant (**NOTE**: Provide proof of this).

Addressing the class imbalance may therefore be necessary to ensure good model performance and generalization across all classes. This could be done through under-sampling or over-sampling techniques; SMOTE is one example of the latter.
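
As a rough illustration of over-sampling, the sketch below uses plain random over-sampling with pandas. This is not SMOTE: it duplicates existing minority rows, whereas SMOTE synthesises new points by interpolating between neighbours. The frame `toy`, its values, and the class sizes are made up for the example.

```python
import pandas as pd

# Made-up data that mimics an imbalanced three-class target.
toy = pd.DataFrame({
    "Admission grade": [120.0, 131.5, 118.2, 140.1, 125.0, 133.3, 110.4, 128.9, 122.7],
    "Target": ["Graduate"] * 5 + ["Dropout"] * 3 + ["Enrolled"] * 1,
})

# Share of each class before resampling.
class_share = toy["Target"].value_counts(normalize=True)

# Random over-sampling: draw from every class, with replacement,
# until each class is as large as the largest one.
largest = toy["Target"].value_counts().max()
balanced = pd.concat(
    [
        group.sample(n=largest, replace=True, random_state=0)
        for _, group in toy.groupby("Target")
    ],
    ignore_index=True,
)

balanced_counts = balanced["Target"].value_counts().to_dict()
print(class_share.round(2).to_dict())
print(balanced_counts)
```

Any resampling of this kind would normally be applied to the training split only, so that duplicated or synthetic rows do not leak into the evaluation data.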



## Univariate Analysis

Explore each feature, analyze its distribution and examine summary statistics.
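
A minimal sketch of what this could look like for one numeric and one categorical feature. The frame `toy` and its values are made up; on the real data the same calls would be applied to the columns of `data`.

```python
import pandas as pd

# Made-up values for two of the columns listed earlier.
toy = pd.DataFrame({
    "Admission grade": [127.3, 142.5, 124.8, 119.6, 141.5],
    "Gender": [1, 0, 1, 0, 0],
})

# Numeric feature: count, mean, spread, quartiles, plus skewness.
grade_stats = toy["Admission grade"].describe()
grade_skew = toy["Admission grade"].skew()

# Categorical feature: relative frequency of each level.
gender_share = toy["Gender"].value_counts(normalize=True)

print(grade_stats)
print(f"skew: {grade_skew:.3f}")
print(gender_share)
```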

### Correlation Matrix

In [None]:
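# Restrict to numeric columns: the string-valued Target column cannot be correlated
correlation_matrix = data.corr(numeric_only=True)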

In [None]:
sns.set(rc = {'figure.figsize':(32,16)})
sns.heatmap(correlation_matrix, annot = True, fmt='.2g',cmap= 'coolwarm')

## Distribution Plots

In [None]:
# Select pairs of distinct features whose absolute correlation exceeds 0.6
high_corr_features = (
    correlation_matrix[correlation_matrix.abs() > 0.6]
    .stack()
    .dropna()  # discard the entries masked out by the threshold
    .reset_index()
    .rename(columns={0: 'correlation'})
)
high_corr_features = high_corr_features[high_corr_features['level_0'] != high_corr_features['level_1']]

# Store each pair in sorted order so that (A, B) and (B, A) become identical
high_corr_features['pairs'] = [
    tuple(sorted(pair))
    for pair in zip(high_corr_features['level_0'], high_corr_features['level_1'])
]
high_corr_features = high_corr_features.drop(columns=['level_0', 'level_1'])

# Keep one row per unordered pair
high_corr_features = high_corr_features.drop_duplicates(subset='pairs')

In [None]:
# Create a pairplot for each pair of highly correlated features
for x_col, y_col in high_corr_features['pairs']:
    sns.pairplot(data, x_vars=[x_col], y_vars=[y_col], kind='scatter', diag_kind='kde')
    plt.title(f'Pairplot for {x_col} and {y_col}')
    plt.show()

## Bivariate Analysis
Explore the relationship between each feature and the target variable, and perform statistical analysis to assess the significance of relationships between categorical variables and the target variable.
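
For a categorical feature, one common choice is Pearson's chi-square test of independence on the feature-by-target contingency table. The sketch below computes the statistic by hand with pandas and NumPy and adds Cramér's V as an effect size. The frame `toy` and its values are made up, and the choice of `Scholarship holder` is only an example. A p-value is not computed here; `scipy.stats.chi2_contingency` could provide one, though for 2x2 tables it applies Yates' continuity correction by default, so its statistic would differ from the uncorrected one below.

```python
import numpy as np
import pandas as pd

# Made-up observations for one binary feature and a two-class target.
toy = pd.DataFrame({
    "Scholarship holder": [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
    "Target": ["Graduate", "Graduate", "Graduate", "Dropout", "Dropout",
               "Graduate", "Dropout", "Graduate", "Dropout", "Graduate"],
})

# Contingency table of observed counts.
observed = pd.crosstab(toy["Scholarship holder"], toy["Target"])
obs = observed.to_numpy()
n = obs.sum()

# Expected counts under independence: row total * column total / n.
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n

# Pearson chi-square statistic (no continuity correction) and its degrees of freedom.
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

# Cramér's V rescales chi-square to the range [0, 1].
cramers_v = np.sqrt(chi2 / (n * (min(obs.shape) - 1)))

print(observed)
print(f"chi2={chi2:.3f}, dof={dof}, Cramér's V={cramers_v:.3f}")
```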


## Multivariate Analysis

Examine interactions between multiple features, and look for correlations between features to identify multicollinearity issues.
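
One possible way to quantify multicollinearity is the variance inflation factor (VIF). For feature j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features; this equals the j-th diagonal entry of the inverse correlation matrix. The sketch below uses synthetic columns `a`, `b` and `c`, which are assumptions for illustration and not columns of the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)

# Synthetic features: "c" is almost a copy of "a", "b" is independent.
toy = pd.DataFrame({
    "a": x1,
    "b": x2,
    "c": x1 + 0.1 * rng.normal(size=200),
})

corr = toy.corr().to_numpy()

# The diagonal of the inverse correlation matrix holds the VIF of each feature.
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy.columns)
print(vif.round(2))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, although the appropriate cut-off depends on the model.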

## Feature Engineering

If applicable, generate new features.
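
As one hypothetical example, a pass rate could be derived from the curricular-unit columns. The sketch below uses the 2nd-semester values of the five rows shown by `data.head()`; the feature name `approval_rate_2nd_sem` and the handling of students with zero enrolled units are assumptions, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Values taken from the five rows shown by data.head().
toy = pd.DataFrame({
    "Curricular units 2nd sem (enrolled)": [0, 6, 6, 6, 6],
    "Curricular units 2nd sem (approved)": [0, 6, 0, 5, 6],
})

enrolled = toy["Curricular units 2nd sem (enrolled)"]
approved = toy["Curricular units 2nd sem (approved)"]

# Hypothetical feature: share of enrolled units that were approved.
# Zero enrolled units would divide by zero, so those rows are set to 0.
toy["approval_rate_2nd_sem"] = (approved / enrolled.replace(0, np.nan)).fillna(0.0)

print(toy)
```

Whether such a feature is appropriate depends on whether semester results would be available at the time a prediction is made.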

## Dimensionality Reduction
If applicable, reduce the number of features through dimensionality reduction techniques.
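
Principal component analysis (PCA) is one candidate technique for the numeric features. The sketch below implements it with NumPy's SVD on a synthetic matrix; the data, the 95% explained-variance threshold, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic matrix: 100 samples, 5 features, where the last two features
# are noisy copies of the first two.
base = rng.normal(size=(100, 3))
X = np.hstack([base, base[:, :2] + 0.05 * rng.normal(size=(100, 2))])

# Standardise each feature, then decompose with the SVD.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Squared singular values are proportional to the variance of each component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Keep the smallest number of components reaching 95% cumulative variance.
cumulative = np.cumsum(explained_variance_ratio)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the standardised data onto the retained components.
X_reduced = X_std @ Vt[:n_components].T

print(explained_variance_ratio.round(3))
print(n_components, X_reduced.shape)
```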

# Models

ideas:
- Ensemble Learning Techniques
- kNN
-
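
To make the kNN idea concrete, here is a minimal NumPy sketch of a k-nearest-neighbours classifier with Euclidean distance and majority voting. The helper `knn_predict`, the two-dimensional features, and the labels are all hypothetical; a library implementation would normally be used on the real data.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)

    # Indices of the k closest training points for every query point.
    nearest = np.argsort(dists, axis=1)[:, :k]

    predictions = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        # On a tie, argmax picks the label that sorts first.
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Made-up, already scaled features for two well-separated groups.
X_train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.9, 1.1], [1.2, 1.0]]
y_train = ["Dropout", "Dropout", "Dropout", "Graduate", "Graduate", "Graduate"]

preds = knn_predict(X_train, y_train, [[0.1, 0.1], [1.0, 1.0]], k=3)
print(preds)
```

Because kNN relies on distances, features would typically be scaled first, and integer-coded categorical columns would need an encoding for which distance is meaningful.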