# Introduction

As students, we were interested in finding which demographic and socioeconomic factors play a part in a student's educational success. While demographic information such as race and socioeconomic status plays an important role in educational outcomes, the primary focus of this investigation is to identify the most significant actionable risk factors: aspects of the educational system that could be modified to improve outcomes for students. We intend to comprehensively explore a dataset, collected in Portugal and provided by UCI, that describes students' academic success.

In this report, we use data cleaning and exploratory data analysis to identify the most significant factors in determining a student's educational success. We also compare several models and analyze how well each performs.



# Exploratory Data Analysis

![Data Distribution](images/data_hist.png "Data Distribution")

![Coefficient Matrix](images/coef_matrix_before.png "Coefficient Matrix")

# Data Cleaning/Preparation

### Handling Missing Values
We started by checking the dataset for null or missing values. We found none, as the authors of the dataset had already done basic cleaning and preparation.
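
A check like the one described above can be sketched with pandas. This is a minimal illustration, not our exact code: the small `df` below is a hypothetical stand-in for the real dataset, and its column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; column names are illustrative only.
df = pd.DataFrame({
    "age at enrollment": [19, 22, 20],
    "admission grade": [127.3, 142.5, 124.8],
    "target": ["graduate", "dropout", "enrolled"],
})

# Count missing values per column, then add them up for the whole table.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero means no imputation or row removal is needed before moving on.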

### Encoding Categorical Variables
We identified a number of categorical columns in the dataset and needed to convert them into a format that the algorithms in our analysis could process. We one-hot encoded these variables, converting each category into its own binary column.
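
As a rough sketch of one-hot encoding, assuming pandas and a hypothetical categorical column named `course` (the real column names and categories may differ):

```python
import pandas as pd

# Hypothetical example rows; "course" stands in for any categorical column.
df = pd.DataFrame({
    "course": ["nursing", "design", "nursing"],
    "admission grade": [127.3, 142.5, 124.8],
})

categorical_columns = ["course"]  # assumption: the list of columns treated as categorical

# Each category becomes its own 0/1 column; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=categorical_columns, dtype=int)

print(encoded)
```

Each original categorical column is replaced by one binary column per category, so the number of features grows with the number of distinct categories.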

### Encoding Target Variable
We identified an issue with the presence of 'enrolled' students in our dataset. Since every student is enrolled at some point, this status does not provide clear insight into the educational outcomes we aim to predict. Our objective is to discern the factors that contribute to a student’s successful completion (graduate) or what might lead to their departure (dropout).

To address this issue and simplify our dataset, we focused only on students who had left school either via graduating or dropping out. We did this by discarding all rows associated with students still enrolled in their courses. We then transformed our target variable into a binary format, encoding 'graduate' as 1 and 'dropout' as 0 for the modeling process.
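
The filtering and encoding steps above might look like the following sketch. The column name `target` and the sample rows are assumptions made for illustration; the labels match those named in the text.

```python
import pandas as pd

# Hypothetical example rows; "target" is an assumed name for the outcome column.
df = pd.DataFrame({
    "admission grade": [127.3, 142.5, 124.8, 119.6],
    "target": ["graduate", "dropout", "enrolled", "graduate"],
})

# Discard students who are still enrolled, since they have no final outcome yet.
finished = df[df["target"] != "enrolled"].copy()

# Encode the remaining outcomes as a binary target: graduate -> 1, dropout -> 0.
finished["target"] = finished["target"].map({"graduate": 1, "dropout": 0})

print(finished)
```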

### Feature Selection
After encoding the categorical variables, most of the features were usable in our models. However, we identified two columns that posed some concerns: "previous qualification" and "previous qualification (grade)". The "previous qualification" column is categorical, indicating the type of qualification a student has obtained, while "previous qualification (grade)" is numerical, reflecting the grade or score achieved in that qualification. We decided to drop these columns because the grade is directly tied to the type of qualification, creating a dependency that we felt might be problematic to represent and interpret within the model.
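
Dropping the two columns could be sketched as below. The prefix match is an assumption on our part: it is used so that the same line also removes any one-hot columns derived from "previous qualification". The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical example rows using the two column names discussed above.
df = pd.DataFrame({
    "previous qualification": [1, 19, 1],
    "previous qualification (grade)": [122.0, 160.0, 133.1],
    "admission grade": [127.3, 142.5, 124.8],
})

# Select every column whose name starts with the shared prefix, which also
# catches one-hot columns such as "previous qualification_1" if they exist.
columns_to_drop = [c for c in df.columns if c.startswith("previous qualification")]

reduced = df.drop(columns=columns_to_drop)

print(reduced.columns.tolist())
```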

![Coefficient Matrix After Feature Selection](images/coef_matrix_after.png "Coefficient Matrix After Feature Selection")

# Model Selection

We experimented with a variety of models to achieve the highest accuracy in predicting student outcomes. These included Support Vector Machines, Logistic Regression, Decision Tree Classifier and Random Forest Classifier.
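
For illustration, the four candidate models could be set up with scikit-learn as follows. The library choice, the default hyperparameters, and the scaling step for the SVM and logistic regression are assumptions for this sketch, not a record of our exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scale-sensitive models are wrapped in a pipeline with a scaler,
# so scaling is fit only on the training part of each split.
models = {
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
}

for name in models:
    print(name)
```

Keeping the models in one dictionary makes it easy to run the same evaluation procedure on each of them.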

All models were evaluated using 10-fold cross-validation repeated 30 times, giving a total of 300 different training/testing splits per model. Averaging the 300 accuracy scores yields a mean that is less likely to be skewed by any single split, which makes it a more stable and reliable estimate of the model's performance.
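
The evaluation procedure can be sketched with scikit-learn's `RepeatedKFold`. This is a hedged example: it runs on synthetic data from `make_classification` rather than the student dataset, it uses logistic regression as a stand-in model, and the percentile-based interval is only one possible way to summarize the 300 scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (assumption); the real features and target differ.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 10 folds repeated 30 times gives 300 training/testing splits.
cv = RepeatedKFold(n_splits=10, n_repeats=30, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="accuracy", cv=cv
)

# Summarize the 300 accuracy scores.
mean_accuracy = scores.mean()
std_accuracy = scores.std(ddof=1)
lower, upper = np.percentile(scores, [2.5, 97.5])  # middle 95% of the scores

print(f"splits: {len(scores)}")
print(f"mean accuracy: {mean_accuracy:.3f}")
print(f"standard deviation: {std_accuracy:.3f}")
print(f"95% interval of scores: {lower:.3f} to {upper:.3f}")
```

Fixing `random_state` makes the splits reproducible, so different models can be compared on exactly the same 300 splits.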

### Support Vector Machine
- Mean Accuracy: The average accuracy across all cross-validation runs was approximately 0.845.
- Standard Deviation: The standard deviation of the accuracy scores was approximately 0.017, suggesting a relatively low variability in the model’s performance across different subsets of the data.
- 95% Confidence Interval: The reported 95% interval for the accuracy scores ranged from approximately 0.807 to 0.888. This range reflects the spread of scores across the 300 splits; the mean accuracy itself is estimated more precisely than the width of the range suggests.

![](images/support_hist.png)

### Decision Tree Classifier

- Mean Accuracy: The average accuracy across all cross-validation runs was approximately 0.855.
- Standard Deviation: The standard deviation of the accuracy scores was approximately 0.017, reflecting a relatively low variability in the model’s performance across different subsets of the data.
- 95% Confidence Interval: The reported 95% interval for the accuracy scores ranged from approximately 0.819 to 0.884. This range reflects the spread of scores across the 300 splits; the mean accuracy itself is estimated more precisely than the width of the range suggests.

![](images/decision_hist.png)

### Random Forest Classifier

- Mean Accuracy: The average accuracy across all cross-validation runs for the Random Forest Classifier was approximately 0.907.
- Standard Deviation: The standard deviation of the accuracy scores was approximately 0.015, indicating a very low variability in the model’s performance.
- 95% Confidence Interval: The reported 95% interval for the accuracy scores ranged from approximately 0.879 to 0.937. This range reflects the spread of scores across the 300 splits; the mean accuracy itself is estimated more precisely than the width of the range suggests.

![](images/forest_hist.png)

### Logistic Regression

- Mean Accuracy: The average accuracy across all cross-validation runs was approximately 0.908, the highest of the four models by a narrow margin.
- Standard Deviation: With a standard deviation of approximately 0.015, the model's performance scores exhibit minimal variability.
- 95% Confidence Interval: The reported 95% interval for the accuracy scores ranged from approximately 0.879 to 0.937. This range reflects the spread of scores across the 300 splits; the mean accuracy itself is estimated more precisely than the width of the range suggests.

![](images/logistic_hist.png)

# Model Analysis

# Conclusion and Recommendations