# University Student Dropout Prediction Project

### Introduction
In this project, we aim to develop a machine learning model to predict the likelihood of university students dropping out. The challenge of student dropouts is a critical issue in higher education, impacting both the students' future and the educational institutions' effectiveness. Through predictive modeling, we seek to understand the key factors influencing dropout rates and identify at-risk students early in their academic journey.

Our methodology is informed by the approach used by Niyogisubizo et al. in their two-layer ensemble machine learning model. However, our focus expands to the broader context of university-wide student retention, rather than individual classes.

### Project Objectives:

1. **Data Collection:** Acquire comprehensive and relevant datasets from universities, encompassing various factors like student demographics, academic records, engagement levels, and more.
2. **Data Preprocessing:** Clean and preprocess the data to ensure accuracy and reliability for our predictive analysis.
3. **Exploratory Data Analysis (EDA):** Perform in-depth analysis to uncover trends and insights within the data, guiding our feature selection and modeling approach.
4. **Model Development:** Construct a predictive model using a stacking ensemble. The ensemble will combine several base algorithms, such as Random Forest, XGBoost, Gradient Boosting, and feed-forward neural networks, so that a meta-learner can draw on their complementary strengths.
5. **Model Evaluation and Tuning:** Utilize relevant performance metrics to evaluate and refine the model, aiming for enhanced predictive accuracy and robustness.
6. **Interpretation and Reporting:** Interpret the results to provide meaningful insights and recommendations, focusing on strategies to improve student retention rates at the university level.
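
The stacking approach named in objective 4 can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn is available; it is not the project's final model. The synthetic data from `make_classification`, the helper name `build_stacking_model`, the hyperparameters, and the logistic-regression meta-learner are all assumptions chosen for illustration. XGBoost is left out so the sketch depends on scikit-learn only; an `xgboost.XGBClassifier` could be added to the list of base learners in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_stacking_model(random_state=42):
    """Return an unfitted two-layer stacking classifier (hypothetical helper)."""
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
        ("gb", GradientBoostingClassifier(random_state=random_state)),
        # The neural network is scaled first because MLPs are sensitive to feature scale.
        ("mlp", make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=random_state),
        )),
    ]
    # Layer 2: a simple meta-learner trained on out-of-fold base predictions.
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )


# Synthetic, imbalanced stand-in data (assumption: roughly 90% positive class).
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.1, 0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = build_stacking_model()
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {test_accuracy:.3f}")
```

The `cv=5` argument makes the meta-learner train on out-of-fold predictions, which reduces the risk of it simply memorising base learners that overfit the training set.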

## 1. Data Collection

**Sources include:** student dropout data requested from the university (file `OIR_Student Data Request.csv`).

## 2. Data Preprocessing

In [4]:
# Importing essential libraries

import pandas as pd
import numpy as np

In [7]:
url = "https://raw.githubusercontent.com/kflemming30/Student-Drop-Out-Prediction/main/OIR_Student%20Data%20Request.csv"
student_df = pd.read_csv(url)
student_df.head()


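|   | PIDM | Cohort | SEX | Degree | Major 1 | 1st Year GPA | Dorm | 1st Year Retention | College | Total Earned Hours | SAT | Major 2 | Advisor |
|---|------|--------|-----|--------|---------|--------------|------|--------------------|---------|--------------------|-----|---------|---------|
| 0 | 1 | 202109F | M | BS | Mechanical Engineering | 2.49 | Campion Hall | 1 | SEC | 36 | NaN | NaN | 1.0 |
| 1 | 2 | 202109F | M | BS | Biology | 3.18 | Commuter | 1 | CAS | 47 | NaN | NaN | 2.0 |
| 2 | 3 | 202109F | M | BS | Chemistry | 2.86 | Regis Hall | 1 | CAS | 46 | NaN | NaN | 3.0 |
| 3 | 4 | 202109F | M | BS | DSB Undeclared | 3.84 | Gonzaga Hall | 1 | DSB | 45 | 1300.0 | NaN | 4.0 |
| 4 | 5 | 202109F | M | BS | Management | 2.69 | Commuter | 1 | DSB | 42 | NaN | NaN | 5.0 |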


In [8]:
student_df.shape

(2584, 13)

In [9]:
student_df.describe()

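|       | PIDM | 1st Year GPA | 1st Year Retention | Total Earned Hours | SAT | Advisor |
|-------|------|--------------|--------------------|--------------------|-----|---------|
| count | 2584.0 | 2576.0 | 2584.0 | 2584.0 | 632.0 | 2576.0 |
| mean  | 1292.5 | 3.360839 | 0.903638 | 45.852167 | 1306.977848 | 56.611413 |
| std   | 746.080871 | 0.549521 | 0.295144 | 10.109841 | 88.935666 | 47.163027 |
| min   | 1.0 | 0.0 | 0.0 | 0.0 | 980.0 | 0.0 |
| 25%   | 646.75 | 3.11 | 1.0 | 45.0 | 1240.0 | 11.0 |
| 50%   | 1292.5 | 3.49 | 1.0 | 46.0 | 1310.0 | 53.0 |
| 75%   | 1938.25 | 3.75 | 1.0 | 51.0 | 1370.0 | 88.0 |
| max   | 2584.0 | 4.0 | 1.0 | 81.0 | 1550.0 | 166.0 |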
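
The mean of `1st Year Retention` in the summary above is about 0.904, which means roughly 90% of students in the data were retained and the dropout class is the minority. A quick way to make that imbalance explicit is sketched below. The helper name `class_balance` and the ten-row toy frame are assumptions for illustration; on the real data the function would be called as `class_balance(student_df)`.

```python
import pandas as pd


def class_balance(df, target="1st Year Retention"):
    """Return the count and share of each target class (hypothetical helper)."""
    counts = df[target].value_counts().sort_index()
    return pd.DataFrame({
        "count": counts,
        "share": (counts / counts.sum()).round(3),
    })


# Toy stand-in with a 9:1 split, similar in spirit to the real data.
toy_df = pd.DataFrame({"1st Year Retention": [1] * 9 + [0]})
balance = class_balance(toy_df)

# Accuracy of a model that always predicts the majority class.
majority_baseline = balance["share"].max()
print(balance)
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

With a split this uneven, a model that always predicts "retained" already scores close to the majority share on accuracy, so metrics such as recall on the dropout class are likely to be more informative during evaluation.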


In [10]:
student_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2584 entries, 0 to 2583
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   PIDM                2584 non-null   int64  
 1   Cohort              2584 non-null   object 
 2   SEX                 2584 non-null   object 
 3   Degree              2584 non-null   object 
 4   Major 1             2584 non-null   object 
 5   1st Year GPA        2576 non-null   float64
 6   Dorm                2584 non-null   object 
 7   1st Year Retention  2584 non-null   int64  
 8   College             2584 non-null   object 
 9   Total Earned Hours  2584 non-null   int64  
 10  SAT                 632 non-null    float64
 11  Major 2             6 non-null      object 
 12  Advisor             2576 non-null   float64
dtypes: float64(3), int64(3), object(7)
memory usage: 262.6+ KB
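
The `info()` output shows that missing data is concentrated in a few columns: `SAT` has 632 non-null values out of 2584, `Major 2` has only 6, and `1st Year GPA` and `Advisor` are each missing 8. One possible next step is to tabulate the missing share per column and drop columns that are mostly empty. The sketch below uses a small toy frame rather than the real data; the helper names `missing_summary` and `drop_sparse_columns` and the 50% threshold are assumptions, not choices made in this project.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


def drop_sparse_columns(df, max_missing_frac=0.5):
    """Keep only columns whose missing fraction is at or below the threshold."""
    keep = df.columns[df.isna().mean() <= max_missing_frac]
    return df[keep]


# Toy stand-in that mimics the pattern seen in the real data (values are illustrative).
toy_df = pd.DataFrame({
    "1st Year GPA": [2.49, 3.18, np.nan, 3.84],
    "SAT": [np.nan, np.nan, np.nan, 1300.0],
    "Major 2": [np.nan, np.nan, np.nan, np.nan],
    "1st Year Retention": [1, 1, 0, 1],
})

summary = missing_summary(toy_df)
reduced = drop_sparse_columns(toy_df)
print(summary)
print(list(reduced.columns))
```

Columns with only a handful of gaps, such as `1st Year GPA`, are usually better handled by imputation or row removal than by dropping the whole column.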
