### Predict students' dropout and academic success

Investigating the Impact of Social and Economic Factors

By [source](https://zenodo.org/record/5777340#.Y7FJotJBwUE)

## About this dataset

This dataset provides a comprehensive view of students enrolled in various undergraduate degrees offered at a higher education institution. It includes demographic data, socio-economic factors and academic performance information that can be used to analyze the possible predictors of student dropout and academic success. The dataset was built from multiple disjoint databases and contains information available at the time of enrollment, such as application mode, marital status and course chosen. It can also be used to estimate overall student performance at the end of each semester from the curricular units credited, enrolled, evaluated and approved, together with their respective grades. Finally, it includes the unemployment rate, inflation rate and GDP of the region, which can help show how economic factors relate to dropout rates and academic success. Together, these variables can offer insight into why students stay in school or abandon their studies across a wide range of disciplines, such as agronomy, design, education, nursing, journalism, management, social service and technologies.

### How to use the dataset
This dataset can be used to understand and predict student dropouts and academic outcomes. The data includes a variety of demographic, social-economic and academic performance factors related to the students enrolled in higher education institutions. The dataset provides valuable insights into the factors that affect student success and could be used to guide interventions and policies related to student retention.

Using this dataset, researchers can investigate two key questions:

- Which specific predictive factors are linked with student dropout or completion?
- How do different features interact with each other?

For example, researchers could explore whether demographic characteristics (e.g., gender or age at enrollment) or surrounding conditions (e.g., the unemployment rate in the region) are associated with higher student success rates, and what implications poverty has for educational outcomes. Answering these questions can give administrators critical information for formulating strategies that promote successful degree completion among students from diverse backgrounds in their institutions.

To use this dataset effectively, researchers should familiarize themselves with all of the variables it provides: categorical (qualitative) variables such as gender, application mode or marital status; numerical variables such as the number of curricular units at the beginning of each semester or age at enrollment; ordinal variables such as application order; macroeconomic indicators that change over time, such as inflation rate or GDP; and binary indicators such as scholarship holder, from which percentages can be computed. Researchers should also be aware of potential bias in the data before running any analysis. For example, if one population is underrepresented compared with another, the results can be misleading unless this is taken into account. Finally, practitioners should keep in mind that this Kaggle dataset covers only a limited window for each admission intake (the columns shown here cover the first two semesters). Studies that follow students for longer, across several year-long admission seasons, might provide more accurate estimates of retention and achievement.
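The check for underrepresented groups described above can be sketched roughly as follows. This is a minimal illustration on a hypothetical toy frame (the values are made up, and only the `Gender` and `Target` column names come from the dataset); it is not the original authors' procedure.

```python
import pandas as pd

# Hypothetical toy frame: values are invented purely for illustration.
toy = pd.DataFrame({
    "Gender": [0, 0, 0, 1, 1, 0, 0, 1],
    "Target": ["Dropout", "Graduate", "Graduate", "Dropout",
               "Graduate", "Dropout", "Graduate", "Dropout"],
})

# Share of rows per group: a very small share hints at underrepresentation.
group_share = toy["Gender"].value_counts(normalize=True)

# Outcome distribution within each group (each row sums to 1).
outcome_by_group = pd.crosstab(toy["Gender"], toy["Target"], normalize="index")

print(group_share)
print(outcome_by_group)
```

On the real data, the same two calls could be repeated for any categorical column before modelling, so that small groups are noticed early.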

### Research Ideas
- Prediction of Student Retention: This dataset can be used to develop predictive models that identify risk factors for dropout, so that institutions can intervene early and improve student retention rates.
- Improved Academic Performance: By using this data, higher education institutions could better understand their students' academic progress and identify areas of improvement from both an individual and institutional perspective. This will enable them to develop targeted courses, activities, or initiatives that enhance academic performance more effectively and efficiently.
- Accessibility Assistance: Using the demographic information included in the dataset, institutions could develop specific initiatives that help certain groups access higher education services or resources that may not typically be available in their area or for their socio-economic class, helping close existing accessibility gaps across different student populations.

### Acknowledgements
If you use this dataset in your research, please credit the original authors. [Data Source](https://zenodo.org/record/5777340#.Y7FJotJBwUE)

### License
License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/) No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](https://creativecommons.org/publicdomain/zero/1.0/).

### Columns
File: dataset.csv


| Column name | Description |
| --- | --- |
| Marital status | 	The marital status of the student. (Categorical) |
| Application mode | 	The method of application used by the student. (Categorical) |
| Application order | 	The order in which the student applied. (Numerical) |
| Course | 	The course taken by the student. (Categorical) |
| Daytime/evening attendance | 	Whether the student attends classes during the day or in the evening. (Categorical) |
| Previous qualification | 	The qualification obtained by the student before enrolling in higher education. (Categorical) |
| Nacionality | 	The nationality of the student. (Categorical) |
| Mother's qualification | 	The qualification of the student's mother. (Categorical) |
| Father's qualification | 	The qualification of the student's father. (Categorical) |
| Mother's occupation | 	The occupation of the student's mother. (Categorical) |
| Father's occupation | 	The occupation of the student's father. (Categorical) |
| Displaced | 	Whether the student is a displaced person. (Categorical) |
| Educational special needs | 	Whether the student has any special educational needs. (Categorical) |
| Debtor | 	Whether the student is a debtor. (Categorical) |
| Tuition fees up to date | 	Whether the student's tuition fees are up to date. (Categorical) |
| Gender | 	The gender of the student. (Categorical) |
| Scholarship holder | 	Whether the student is a scholarship holder. (Categorical) |
| Age at enrollment | 	The age of the student at the time of enrollment. (Numerical) |
| International | 	Whether the student is an international student. (Categorical) |
| Curricular units 1st sem (credited) | 	The number of curricular units credited by the student in the first semester. (Numerical) |
| Curricular units 1st sem (enrolled) | 	The number of curricular units enrolled by the student in the first semester. (Numerical) |
| Curricular units 1st sem (evaluations) | 	The number of curricular units evaluated by the student in the first semester. (Numerical) |
| Curricular units 1st sem (approved) | The number of curricular units approved by the student in the first semester. (Numerical) |

## Import Libraries

In [1]:
import re
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
pd.set_option('display.max_columns', None)

## Data Loading

In [3]:
data_path = '../data/data.csv'

In [4]:
df = pd.read_csv(filepath_or_buffer=data_path,
                 sep=';')

In [5]:
df.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,Mother's occupation,Father's occupation,Admission grade,Displaced,Educational special needs,Debtor,Tuition fees up to date,Gender,Scholarship holder,Age at enrollment,International,Curricular units 1st sem (credited),Curricular units 1st sem (enrolled),Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,5,9,127.3,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,3,3,142.5,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,9,9,124.8,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,5,3,119.6,1,0,0,1,0,0,20,0,0,6,8,6,13.428571,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,9,9,141.5,0,0,0,1,0,0,45,0,0,6,9,5,12.333333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [6]:
df.shape

(4424, 37)

In [7]:
columns = df.columns

In [8]:
re.sub(r'[^0-9a-zA-Z]+', ' ', 'Daytime/evening attendance\t')

'Daytime evening attendance '

In [9]:
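renamed_columns = {}
for column in columns:
    # Replace every run of non-alphanumeric characters with a single space
    new_column = re.sub(r'[^0-9a-zA-Z]+', ' ', column)
    new_column = new_column.strip()
    # Join the remaining words with underscores (raw string avoids an invalid-escape warning)
    new_column = '_'.join(re.split(r'\W+', new_column))
    print(new_column)
    renamed_columns[column] = new_column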

Marital_status
Application_mode
Application_order
Course
Daytime_evening_attendance
Previous_qualification
Previous_qualification_grade
Nacionality
Mother_s_qualification
Father_s_qualification
Mother_s_occupation
Father_s_occupation
Admission_grade
Displaced
Educational_special_needs
Debtor
Tuition_fees_up_to_date
Gender
Scholarship_holder
Age_at_enrollment
International
Curricular_units_1st_sem_credited
Curricular_units_1st_sem_enrolled
Curricular_units_1st_sem_evaluations
Curricular_units_1st_sem_approved
Curricular_units_1st_sem_grade
Curricular_units_1st_sem_without_evaluations
Curricular_units_2nd_sem_credited
Curricular_units_2nd_sem_enrolled
Curricular_units_2nd_sem_evaluations
Curricular_units_2nd_sem_approved
Curricular_units_2nd_sem_grade
Curricular_units_2nd_sem_without_evaluations
Unemployment_rate
Inflation_rate
GDP
Target
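The renaming loop above could also be written as a small reusable helper. The sketch below is one possible variant, not the notebook's own code: `clean_column_name` is a hypothetical name, and it collapses each run of non-alphanumeric characters straight into an underscore instead of going through an intermediate space.

```python
import re

def clean_column_name(name: str) -> str:
    """Collapse each run of non-alphanumeric characters into one underscore.

    Hypothetical helper: leading and trailing underscores are stripped so that
    names such as 'Daytime/evening attendance\t' do not end with '_'.
    """
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")

print(clean_column_name("Daytime/evening attendance\t"))
print(clean_column_name("Mother's qualification"))
print(clean_column_name("Curricular units 1st sem (credited)"))
```

Because `DataFrame.rename` accepts a callable for `columns`, something like `df.rename(columns=clean_column_name)` should give the same result as building the dictionary by hand.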


In [10]:
renamed_columns

{'Marital status': 'Marital_status',
 'Application mode': 'Application_mode',
 'Application order': 'Application_order',
 'Course': 'Course',
 'Daytime/evening attendance\t': 'Daytime_evening_attendance',
 'Previous qualification': 'Previous_qualification',
 'Previous qualification (grade)': 'Previous_qualification_grade',
 'Nacionality': 'Nacionality',
 "Mother's qualification": 'Mother_s_qualification',
 "Father's qualification": 'Father_s_qualification',
 "Mother's occupation": 'Mother_s_occupation',
 "Father's occupation": 'Father_s_occupation',
 'Admission grade': 'Admission_grade',
 'Displaced': 'Displaced',
 'Educational special needs': 'Educational_special_needs',
 'Debtor': 'Debtor',
 'Tuition fees up to date': 'Tuition_fees_up_to_date',
 'Gender': 'Gender',
 'Scholarship holder': 'Scholarship_holder',
 'Age at enrollment': 'Age_at_enrollment',
 'International': 'International',
 'Curricular units 1st sem (credited)': 'Curricular_units_1st_sem_credited',
 'Curricular units 1st

In [11]:
df.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,Mother's occupation,Father's occupation,Admission grade,Displaced,Educational special needs,Debtor,Tuition fees up to date,Gender,Scholarship holder,Age at enrollment,International,Curricular units 1st sem (credited),Curricular units 1st sem (enrolled),Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,5,9,127.3,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,3,3,142.5,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,9,9,124.8,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,5,3,119.6,1,0,0,1,0,0,20,0,0,6,8,6,13.428571,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,9,9,141.5,0,0,0,1,0,0,45,0,0,6,9,5,12.333333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [12]:
df = df.rename(mapper=renamed_columns,
               axis=1)

In [13]:
df.head()

Unnamed: 0,Marital_status,Application_mode,Application_order,Course,Daytime_evening_attendance,Previous_qualification,Previous_qualification_grade,Nacionality,Mother_s_qualification,Father_s_qualification,Mother_s_occupation,Father_s_occupation,Admission_grade,Displaced,Educational_special_needs,Debtor,Tuition_fees_up_to_date,Gender,Scholarship_holder,Age_at_enrollment,International,Curricular_units_1st_sem_credited,Curricular_units_1st_sem_enrolled,Curricular_units_1st_sem_evaluations,Curricular_units_1st_sem_approved,Curricular_units_1st_sem_grade,Curricular_units_1st_sem_without_evaluations,Curricular_units_2nd_sem_credited,Curricular_units_2nd_sem_enrolled,Curricular_units_2nd_sem_evaluations,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Curricular_units_2nd_sem_without_evaluations,Unemployment_rate,Inflation_rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,5,9,127.3,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,3,3,142.5,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,9,9,124.8,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,5,3,119.6,1,0,0,1,0,0,20,0,0,6,8,6,13.428571,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,9,9,141.5,0,0,0,1,0,0,45,0,0,6,9,5,12.333333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [14]:
df.shape

(4424, 37)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Marital_status                                4424 non-null   int64  
 1   Application_mode                              4424 non-null   int64  
 2   Application_order                             4424 non-null   int64  
 3   Course                                        4424 non-null   int64  
 4   Daytime_evening_attendance                    4424 non-null   int64  
 5   Previous_qualification                        4424 non-null   int64  
 6   Previous_qualification_grade                  4424 non-null   float64
 7   Nacionality                                   4424 non-null   int64  
 8   Mother_s_qualification                        4424 non-null   int64  
 9   Father_s_qualification                        4424 non-null   i

## Features

https://www.kaggle.com/code/yingxuansu/predict-students-dropout-by-ml

In [19]:
df.nunique()

Marital_status                                    6
Application_mode                                 18
Application_order                                 8
Course                                           17
Daytime_evening_attendance                        2
Previous_qualification                           17
Previous_qualification_grade                    101
Nacionality                                      21
Mother_s_qualification                           29
Father_s_qualification                           34
Mother_s_occupation                              32
Father_s_occupation                              46
Admission_grade                                 620
Displaced                                         2
Educational_special_needs                         2
Debtor                                            2
Tuition_fees_up_to_date                           2
Gender                                            2
Scholarship_holder                                2
Age_at_enrol

In [20]:
df.head()

Unnamed: 0,Marital_status,Application_mode,Application_order,Course,Daytime_evening_attendance,Previous_qualification,Previous_qualification_grade,Nacionality,Mother_s_qualification,Father_s_qualification,Mother_s_occupation,Father_s_occupation,Admission_grade,Displaced,Educational_special_needs,Debtor,Tuition_fees_up_to_date,Gender,Scholarship_holder,Age_at_enrollment,International,Curricular_units_1st_sem_credited,Curricular_units_1st_sem_enrolled,Curricular_units_1st_sem_evaluations,Curricular_units_1st_sem_approved,Curricular_units_1st_sem_grade,Curricular_units_1st_sem_without_evaluations,Curricular_units_2nd_sem_credited,Curricular_units_2nd_sem_enrolled,Curricular_units_2nd_sem_evaluations,Curricular_units_2nd_sem_approved,Curricular_units_2nd_sem_grade,Curricular_units_2nd_sem_without_evaluations,Unemployment_rate,Inflation_rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,5,9,127.3,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,3,3,142.5,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,9,9,124.8,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,5,3,119.6,1,0,0,1,0,0,20,0,0,6,8,6,13.428571,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,9,9,141.5,0,0,0,1,0,0,45,0,0,6,9,5,12.333333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [21]:
df.columns

Index(['Marital_status', 'Application_mode', 'Application_order', 'Course',
       'Daytime_evening_attendance', 'Previous_qualification',
       'Previous_qualification_grade', 'Nacionality', 'Mother_s_qualification',
       'Father_s_qualification', 'Mother_s_occupation', 'Father_s_occupation',
       'Admission_grade', 'Displaced', 'Educational_special_needs', 'Debtor',
       'Tuition_fees_up_to_date', 'Gender', 'Scholarship_holder',
       'Age_at_enrollment', 'International',
       'Curricular_units_1st_sem_credited',
       'Curricular_units_1st_sem_enrolled',
       'Curricular_units_1st_sem_evaluations',
       'Curricular_units_1st_sem_approved', 'Curricular_units_1st_sem_grade',
       'Curricular_units_1st_sem_without_evaluations',
       'Curricular_units_2nd_sem_credited',
       'Curricular_units_2nd_sem_enrolled',
       'Curricular_units_2nd_sem_evaluations',
       'Curricular_units_2nd_sem_approved', 'Curricular_units_2nd_sem_grade',
       'Curricular_units_2nd_sem

In [None]:
# Use the renamed (underscore) column names, since df was renamed above
Categorical_features = ['Marital_status', 'Application_mode', 'Course', 'Daytime_evening_attendance',
            'Previous_qualification', 'Nacionality', 'Mother_s_qualification', 'Father_s_qualification',
            'Mother_s_occupation', 'Father_s_occupation', 'Displaced', 'Educational_special_needs',
            'Debtor', 'Tuition_fees_up_to_date', 'Gender', 'Scholarship_holder', 'International']
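A list of categorical columns like the one above is typically used to stop integer codes from being treated as quantities. The sketch below shows one way this might be done, on a hypothetical toy frame with made-up values; `categorical_columns` is an illustrative name, and the column names assume the renamed (underscore) form.

```python
import pandas as pd

# Hypothetical toy frame using three of the renamed column names; values are made up.
toy = pd.DataFrame({
    "Marital_status": [1, 1, 2, 1],
    "Gender": [1, 0, 0, 1],
    "Admission_grade": [127.3, 142.5, 124.8, 119.6],
})
categorical_columns = ["Marital_status", "Gender"]

# Guard against typos or stale names before touching the frame.
missing = [c for c in categorical_columns if c not in toy.columns]
if missing:
    raise KeyError(f"columns not found: {missing}")

# Cast the integer codes to pandas' category dtype; other columns are left alone.
toy = toy.astype({c: "category" for c in categorical_columns})
print(toy.dtypes)
```

The explicit membership check is a design choice: it fails early with a readable message when the list and the frame's column names have drifted apart.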

In [None]:
target_column = 'Target'
df[target_column].value_counts()
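Raw counts are easier to interpret next to relative shares, especially when checking for class imbalance. The sketch below is a minimal illustration that uses only the five `Target` values visible in the `df.head()` output above, so the numbers say nothing about the full dataset; `summary` is a hypothetical name.

```python
import pandas as pd

# The five Target values shown in the df.head() output; not the full dataset.
target = pd.Series(
    ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"], name="Target"
)

counts = target.value_counts()                        # absolute frequency per class
shares = target.value_counts(normalize=True).round(3)  # relative frequency per class

# Align both views on the class label for a compact overview.
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```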
