# Introduction

As students, we were interested in finding which demographic and socioeconomic factors play a part in a student's educational success. While demographic information such as race and socioeconomic status plays an important role in educational outcomes, the primary focus of this investigation is identifying the most significant actionable risk factors: aspects of the educational system that could be modified to improve outcomes for students. We intend to comprehensively explore a dataset, collected in Portugal and provided by UCI, that describes students' academic outcomes.

In this report we use data cleaning and exploratory data analysis to identify the most significant factors in determining a student's educational success. We also compare several models and analyze how well each performs.



# Exploratory Data Analysis

![alt text](images/data_hist.png "Data Distribution")

![alt text](images/coef_matrix_before.png "Coefficient Matrix")

# Data Cleaning/Preparation

### Handling Missing Values
We started by checking the dataset for any null or missing values. We found no issues as the authors of the dataset had already done basic cleaning and preparation.
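
A check of this kind can be sketched with pandas as follows. The small DataFrame is a made-up stand-in for the real dataset, and its column names are assumptions used only for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "Age at enrollment": [19, 22, 20],
    "Debtor": [0, 1, 0],
    "Target": ["graduate", "dropout", "enrolled"],
})

# Count missing values per column; all zeros means nothing needs imputing or dropping.
missing_per_column = df.isna().sum()
print(missing_per_column)

total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)
```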

### Encoding Categorical Variables
We identified a number of columns in the dataset as categorical and it was necessary to convert these columns to a format that could be easily interpreted and processed by the algorithms used in our analysis. This involved one-hot encoding these variables to convert them into a binary matrix representation of the categories.
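
As a minimal sketch of one-hot encoding, `pandas.get_dummies` can expand each categorical column into one binary column per category. The column names and category values below are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy data; "Course" and "Marital status" stand in for the categorical columns.
df = pd.DataFrame({
    "Course": ["Nursing", "Design", "Nursing"],
    "Marital status": ["single", "married", "single"],
    "Age at enrollment": [19, 30, 21],
})

categorical_columns = ["Course", "Marital status"]

# Each listed column is replaced by one 0/1 column per distinct category.
encoded = pd.get_dummies(df, columns=categorical_columns, dtype=int)
print(encoded.columns.tolist())
```

Numeric columns that are not listed in `categorical_columns` pass through unchanged.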

### Encoding Target Variable
We identified an issue with the presence of 'enrolled' students in our dataset. Since every student is enrolled at some point, this status does not provide clear insight into the educational outcomes we aim to predict. Our objective is to discern the factors that contribute to a student’s successful completion (graduate) or what might lead to their departure (dropout).

To address this issue and simplify our dataset, we focused only on students who had left school either via graduating or dropping out. We did this by discarding all rows associated with students still enrolled in their courses. We then transformed our target variable into a binary format, encoding 'graduate' as 1 and 'dropout' as 0 for the modeling process.
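
The filtering and encoding steps described above might look like the following sketch. The column name `Target` and the exact label spellings are assumptions; the actual file may use different names or capitalization.

```python
import pandas as pd

# Toy data; the "Target" column name and label spellings are assumed.
df = pd.DataFrame({
    "Debtor": [0, 1, 0, 0],
    "Target": ["graduate", "dropout", "enrolled", "graduate"],
})

# Keep only students with a final outcome (drop those still enrolled).
finished = df[df["Target"] != "enrolled"].copy()

# Encode the binary target: graduate -> 1, dropout -> 0.
finished["Target"] = finished["Target"].map({"graduate": 1, "dropout": 0})
print(finished)
```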

### Feature Selection
After encoding the categorical variables, most of the features were usable in our models. However, we identified two columns that posed some concerns: "previous qualification" and "previous qualification (grade)". The "previous qualification" column is categorical, indicating the type of qualification a student has obtained, while "previous qualification (grade)" is numerical, reflecting the grade or score achieved in that qualification. We decided to drop both columns because the grade is directly tied to the type of qualification, creating a dependency that we felt might be problematic to represent and interpret within the model.
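
Dropping the two columns is a one-line operation in pandas. The sketch below uses toy values, and the exact column spellings are an assumption.

```python
import pandas as pd

# Toy data; values are invented and column spellings are assumed.
df = pd.DataFrame({
    "Previous qualification": [1, 3, 1],
    "Previous qualification (grade)": [122.0, 160.0, 133.1],
    "Debtor": [0, 1, 0],
})

columns_to_drop = ["Previous qualification", "Previous qualification (grade)"]

# drop() returns a new DataFrame; the original is left untouched.
reduced = df.drop(columns=columns_to_drop)
print(reduced.columns.tolist())
```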

![alt text](images/coef_matrix_after.png "Title")

# Model Selection

We experimented with a variety of models to achieve the highest accuracy in predicting student outcomes. These included Support Vector Machines, Logistic Regression, Decision Tree Classifier and Random Forest Classifier.

All models were evaluated using 10-fold cross-validation repeated 30 times, resulting in a total of 300 different training/testing splits. Averaging the 300 resulting accuracy scores gives a mean that is less likely to be skewed by any single split, and is therefore a more stable estimate of the model's performance.
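
One way to set up this evaluation scheme is with scikit-learn's `RepeatedKFold`, sketched below on synthetic data. The data, the choice of logistic regression, the scaling step, and the random seeds are all assumptions for illustration, not the configuration used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real features and target would go here.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 10 folds repeated 30 times -> 300 train/test splits.
cv = RepeatedKFold(n_splits=10, n_repeats=30, random_state=0)

# Scaling inside the pipeline keeps test folds out of the scaler's fit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Number of scores:", len(scores))
print("Mean accuracy:", scores.mean())
```

`RepeatedStratifiedKFold` is a drop-in alternative that preserves the class balance in every fold.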

### Support Vector Machine
- Mean Accuracy: The average accuracy across all cross-validation runs was approximately 0.845.
- Standard Deviation: The standard deviation of the accuracy scores was approximately 0.017, suggesting a relatively low variability in the model’s performance across different subsets of the data.
- 95% Confidence Interval: The 95% confidence interval for the accuracy scores ranged from approximately 0.807 to 0.888. We can be 95% confident that the true mean accuracy for the model falls within that range.
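
The three statistics reported for each model can be derived from the array of cross-validation scores. This report does not state how its intervals were computed, so the sketch below shows two common choices on synthetic scores: a percentile interval covering the central 95% of individual scores, and a normal-approximation interval for the mean itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 300 cross-validation accuracy scores (not real results).
scores = rng.normal(loc=0.85, scale=0.017, size=300).clip(0, 1)

mean = scores.mean()
std = scores.std(ddof=1)  # sample standard deviation

# Interval containing the central 95% of the individual scores.
low, high = np.percentile(scores, [2.5, 97.5])

# Normal-approximation 95% interval for the mean of the scores.
sem = std / np.sqrt(len(scores))
mean_low, mean_high = mean - 1.96 * sem, mean + 1.96 * sem

print(f"mean={mean:.3f} std={std:.3f}")
print(f"95% of scores fall in [{low:.3f}, {high:.3f}]")
print(f"95% interval for the mean: [{mean_low:.3f}, {mean_high:.3f}]")
```

The interval for the mean is much narrower than the spread of individual scores. Because repeated cross-validation splits share training data, the scores are not independent, so the interval for the mean is likely optimistic.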

![](images/support_hist.png)

### Decision Tree Classifier

- Mean Accuracy: The average accuracy across all cross-validation runs was approximately 0.855.
- Standard Deviation: The standard deviation of the accuracy scores was approximately 0.017, reflecting a relatively low variability in the model’s performance across different subsets of the data.
- 95% Confidence Interval: The 95% confidence interval for the accuracy scores ranged from approximately 0.819 to 0.884. We can be 95% confident that the true mean accuracy for the model falls within that range.

![](images/decision_hist.png)

### Random Forest Classifier

- Mean Accuracy: The average accuracy across all cross-validation runs for the Random Forest Classifier was approximately 0.907.
- Standard Deviation: The standard deviation of the accuracy scores was approximately 0.015, indicating a very low variability in the model’s performance.
- 95% Confidence Interval: The 95% confidence interval for the accuracy scores ranged from approximately 0.879 to 0.937. We can be 95% confident that the true mean accuracy for the model falls within that range.

![](images/forest_hist.png)

### Logistic Regression

- Mean Accuracy: The Logistic Regression model demonstrated excellent performance with an average accuracy of approximately 0.908 across all cross-validation runs.
- Standard Deviation: With a standard deviation of approximately 0.015, the model's performance scores exhibit minimal variability.
- 95% Confidence Interval: The 95% confidence interval for the accuracy scores extends from approximately 0.879 to 0.937. We can be 95% confident that the true mean accuracy for the model falls within that range.

![](images/logistic_hist.png)

### Model Selection Conclusion
While all of the models did well, Logistic Regression narrowly outperformed the others, achieving the highest mean cross-validation accuracy (though the Random Forest Classifier was a close second). Logistic Regression also demonstrated one of the highest levels of consistency and stability across different data subsets.

# Model Analysis

### Significant Positive Predictors

- Tuition fees up to date (coef = 2.6991, p-value < 0.001): Students who have their tuition fees paid and up to date are significantly more likely to graduate.

- Scholarship holder (coef = 0.9121, p-value < 0.001): Students who are scholarship holders are more likely to succeed.

- Curricular units 1st sem (approved) (coef = 0.5928, p-value < 0.001): The number of approved curricular units in the first semester is a positive predictor of success.

- Curricular units 2nd sem (approved) (coef = 0.9671, p-value < 0.001): The number of approved curricular units in the second semester is an even stronger positive predictor of student success than the number approved in the first semester.

- Curricular units 2nd sem (without evaluations) (coef = 0.3946, p-value = 0.020): A higher number of curricular units in the second semester without evaluations is associated with a positive outcome.

### Significant Negative Predictors
- Debtor (coef = -0.7099, p-value = 0.010): Being in debt is a significant negative predictor of educational success.

- Gender (coef = -0.4059, p-value = 0.013): The coefficient suggests a gender disparity, with men having a lower likelihood of educational success.

- Curricular units 2nd sem (enrolled) (coef = -0.8864, p-value < 0.001): A higher number of enrolled curricular units in the second semester is negatively associated with student success.
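
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which can be easier to read. The sketch below does this for a few of the coefficients listed above. How to read the result depends on how each predictor was scaled, which this report does not state, so "per one-unit increase" is an assumption.

```python
import math

# Coefficients copied from the lists above (log-odds scale).
coefficients = {
    "Tuition fees up to date": 2.6991,
    "Scholarship holder": 0.9121,
    "Curricular units 2nd sem (approved)": 0.9671,
    "Debtor": -0.7099,
    "Gender": -0.4059,
    "Curricular units 2nd sem (enrolled)": -0.8864,
}

# exp(coef) is the multiplicative change in the odds of graduating
# per one-unit increase in the predictor, holding the others fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

If "Tuition fees up to date" is an unscaled 0/1 indicator, its odds ratio of roughly 15 means the odds of graduating are about 15 times higher for students whose fees are up to date. Odds ratios below 1, as for "Debtor", indicate lower odds.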

### Note: Model Simplification for Analysis
We decided to remove one-hot encoded categorical variables from the analysis for the logistic regression model. Including these variables had a minimal impact on the accuracy of the model while significantly increasing the complexity due to the addition of a large number of columns. Focusing on just the key predictors made evaluating the model more straightforward.

# Conclusions and Recommendations

### Predictive Modeling for Targeted Intervention

The logistic regression model analyzed in this report has shown high accuracy in predicting educational outcomes for students. With a mean cross-validation accuracy of approximately 0.908 and consistent performance across various subsets of the data, the model appears to be a reliable tool for identifying at-risk students. It could be used to help educational institutions implement targeted support to improve student success.

### Financial Stability

The analysis of the model highlights several important predictors related to financial stability, including tuition fee status, debt, and whether or not the student holds a scholarship. Students without debt and students who have additional financial support drop out less frequently. This pattern suggests that student success at this institution may be disrupted by financial challenges. Moving forward, the institution might focus on providing more opportunities for financial aid, scholarships, or other resources to reduce financial stress on students.

### Curricular Units

The model suggests that several variables related to curricular units have a high impact. A high number of approved units in the first and second semesters is a strong predictor of student success. It could be the case that the number of approved units shows that a student is committed to their academic success. The greater importance of second-semester approved units might indicate the importance of maintaining academic momentum throughout the school year. Identifying students with fewer approved units and offering advising and support may help keep these students on track.

### Gender

The model also reveals a disparity in outcomes based on gender, with male students being less likely to succeed than female students. It may be the case that male students face unique challenges in modern-day education and that support is needed to help bridge this gap. Further investigation should be done to identify the underlying reasons for this imbalance to help ensure equal access to opportunities for students regardless of gender.

### Further Investigation

The relationships between curricular units, gender, and financial stability may be intertwined. Many of the strongest predictors in the model were related to financial factors, which may be driving lower numbers of approved units. It may also be the case that male students are more likely to be in debt than female students, meaning the gender gap could ultimately be driven by financial factors as well. Further analysis would be required to disentangle these relationships and to provide a more comprehensive view of how finances relate to various facets of academic success.
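
A simple first step toward disentangling these factors could be to compare debt rates across genders. The sketch below uses a handful of made-up rows, not results from the dataset, and the 0/1 coding of `Gender` is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only; these are not results from the dataset.
df = pd.DataFrame({
    "Gender": [1, 1, 1, 0, 0, 0],  # assumed coding: 1 = male, 0 = female
    "Debtor": [1, 0, 1, 0, 0, 1],  # 1 = student is in debt
})

# The mean of a 0/1 column within each group is the share of debtors.
debt_rate_by_gender = df.groupby("Gender")["Debtor"].mean()
print(debt_rate_by_gender)
```

A large difference between the groups on the real data would support examining an interaction between gender and debt in the model.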