
This repository includes detailed data analyses and prediction models for students' on-time graduation using various machine learning algorithms.


janasatvika/Exploratory-Data-Analysis-and-Modeling-for-on-Time-Graduation-Prediction


On-Time Graduation Prediction: Data Analytics (EDA) and Machine Learning Approaches

👉 Introduction

On-time graduation is not only an intriguing research topic but also a key indicator of the quality and success of a university and its programs. Improving it requires a comprehensive and measurable strategy. To that end, an Exploratory Data Analysis (EDA) was conducted on the Informatics Engineering Education, or Pendidikan Teknik Informatika (PTI), Program. The objective was to analyze student graduation data, identify patterns and trends, and examine the relationships between variables in order to understand the factors that influence timely graduation. In addition, a Decision Tree model was built to predict the likelihood of a student graduating on time. The analysis aims to give decision makers insights for designing strategies that support on-time graduation and raise education quality standards.

👉 Method

The exploratory data analysis (EDA) consists of four steps: data identification, univariate analysis, bivariate analysis, and multivariate analysis. After the analysis, the data is pre-processed with feature encoding and data normalization so that it is in a suitable state for model building. The resulting final data is used to train the prediction model, which is based on the Decision Tree algorithm. To find a well-performing configuration, hyperparameter tuning is applied using GridSearchCV, which evaluates combinations of Decision Tree parameters and selects the combination with the best classification score.
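The pre-processing and tuning steps described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the data is synthetic, the column names and the labelling rule are hypothetical, and the choice of one-hot encoding, `MinMaxScaler`, the parameter grid, the F1 scoring, and the mapping 1 = On-Time are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical values, NOT the real graduation dataset).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "shs_type": rng.choice(["SMA", "SMK", "MA"], size=n),
    "ips_1": rng.uniform(2.0, 4.0, size=n).round(2),
    "ips_2": rng.uniform(2.0, 4.0, size=n).round(2),
    "retake_total": rng.integers(0, 3, size=n),
})
# Hypothetical labelling rule so that the tree has a pattern to learn.
df["status"] = np.where((df["ips_1"] + df["ips_2"]) / 2 > 3.3, "On-Time", "Late")

# Feature encoding (assumed: one-hot encoding of the categorical columns).
X = pd.get_dummies(df.drop(columns="status"), columns=["gender", "shs_type"], dtype=float)
y = (df["status"] == "On-Time").astype(int)  # assumed mapping: 1 = On-Time, 0 = Late

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Normalization (assumed: min-max scaling), fitted on the training split only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Hyperparameter tuning: every combination in the grid is cross-validated.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7],
    "max_leaf_nodes": [5, 9, 13],
    "min_samples_split": [2, 6, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="f1"
)
search.fit(X_train_scaled, y_train)

print("Best parameters:", search.best_params_)
print("Test F1-score:", search.score(X_test_scaled, y_test))
```

Fitting the scaler on the training split only keeps information from the test split out of the model. Decision Trees are insensitive to feature scaling, so the normalization step here mainly mirrors the workflow described above.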

📌 Graduation dataset

The dataset used comprises graduation data for PTI program students from 2014-2017, totaling 455 rows of data.

| Feature | Data type | Description |
| --- | --- | --- |
| 🔑 Student ID | object | Unique student ID |
| Gender | object | Student gender |
| SHS type | object | Student's previous educational background (senior high school type) |
| UKT | int64 | Tuition fee paid by the student every semester, adjusted to the parents' economic ability |
| Parents' income | int64 | Monthly income of the parents |
| IPS 1 | float64 | Grade point average in the 1st semester |
| IPS 2 | float64 | Grade point average in the 2nd semester |
| IPS 3 | float64 | Grade point average in the 3rd semester |
| IPS 4 | float64 | Grade point average in the 4th semester |
| Retake total | int64 | Number of courses repeated in the first four semesters |
| Graduation status | object | Dependent attribute (class label); values: On-Time and Late |

👉 Results & Discussion

💡 EDA insights

  • The high percentage of students unable to complete their studies on time suggests significant challenges within the study program that require immediate attention.
  • The majority of students in the PTI program from 2014-2017 graduated from SMA (general senior high school), followed by those from SMK (vocational high school), and the fewest from MA (Islamic senior high school). Most PTI students therefore have a general high school background, while a significant number come from vocational schools.
  • During the first four semesters, students repeated a class at most two times.
  • The proportion of female students who complete their studies on time is higher than that of male students.
  • Students who can graduate on time have an average IPS above 3.30. Meanwhile, students who do not manage to graduate on time tend to have an average IPS below 3.00, with a tendency for the average IPS to decrease over time.
  • A higher number of students with a vocational (SMK) background completed their studies on time in the PTI program.
  • Repeating classes is correlated with late graduation, which suggests that repeating classes may hinder timely completion of studies.
  • Many students who maintain an average IPS above 3.00 for four consecutive semesters can graduate on time. It should be noted, however, that maintaining an average IPS above 3.00 does not guarantee that students will be able to graduate on time.
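Comparisons like the ones listed above can be computed with a few pandas group-by operations. The sketch below uses a tiny made-up sample, not rows from the real dataset, and the column names are hypothetical. It only illustrates the kind of univariate and bivariate summaries involved.

```python
import pandas as pd

# Tiny hypothetical sample (made-up rows, NOT taken from the real dataset).
df = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "shs_type": ["SMA", "SMK", "SMA", "SMK", "MA", "SMA"],
    "ips_1": [3.6, 3.4, 2.9, 3.1, 2.7, 3.5],
    "ips_2": [3.5, 3.5, 2.8, 3.0, 2.6, 3.2],
    "ips_3": [3.7, 3.3, 2.7, 2.9, 2.5, 3.1],
    "ips_4": [3.6, 3.4, 2.6, 2.8, 2.4, 3.0],
    "retake_total": [0, 0, 2, 1, 2, 0],
    "status": ["On-Time", "On-Time", "Late", "Late", "Late", "Late"],
})

ips_cols = ["ips_1", "ips_2", "ips_3", "ips_4"]
df["ips_mean"] = df[ips_cols].mean(axis=1)             # average IPS over four semesters
df["on_time"] = (df["status"] == "On-Time").astype(int)

# Univariate: share of each graduation status.
status_share = df["status"].value_counts(normalize=True)

# Bivariate: on-time rate per gender and per school type.
on_time_rate_by_gender = df.groupby("gender")["on_time"].mean()
on_time_rate_by_school = df.groupby("shs_type")["on_time"].mean()

# Bivariate: average IPS per graduation status.
mean_ips_by_status = df.groupby("status")["ips_mean"].mean()

# Correlation between the number of retakes and graduating on time.
retake_corr = df["retake_total"].corr(df["on_time"])

print(status_share, on_time_rate_by_gender, on_time_rate_by_school,
      mean_ips_by_status, retake_corr, sep="\n\n")
```

In this toy sample, the last row has an average IPS above 3.00 but is still labelled Late, mirroring the observation that a good IPS alone does not guarantee on-time graduation.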

💡 Modeling process insights

  • Best hyperparameters of the Decision Tree model: criterion = entropy, max_depth = 5, max_leaf_nodes = 9, min_samples_split = 6
  • Confusion matrix:

    | | Predicted 0 | Predicted 1 |
    | --- | --- | --- |
    | Actual 0 | 80 | 4 |
    | Actual 1 | 5 | 2 |

  • Performance scores: accuracy = 90.11%, recall = 28.57%, precision = 33.33%, and F1-score = 30.77%
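The reported scores follow directly from the confusion matrix counts when class 1 is treated as the positive class (an assumption here, but it is the reading that reproduces the reported figures). A short worked calculation:

```python
# Confusion matrix counts reported above (rows: actual, columns: predicted).
tn, fp = 80, 4   # actual class 0
fn, tp = 5, 2    # actual class 1 (assumed positive class)

total = tn + fp + fn + tp

accuracy = (tp + tn) / total            # share of all predictions that are correct
recall = tp / (tp + fn)                 # share of actual class 1 that was found
precision = tp / (tp + fp)              # share of predicted class 1 that is correct
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2%} recall={recall:.2%} "
      f"precision={precision:.2%} f1={f1:.2%}")
# accuracy=90.11% recall=28.57% precision=33.33% f1=30.77%
```

Only 7 of the 91 test samples belong to class 1, so a model can reach high accuracy while finding just 2 of them.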

👉 Conclusion

While the model's accuracy score is high, its precision, recall, and F1-score remain low. The low recall indicates that the model predicts the Late class well but is much less effective at predicting the On-Time class. This is most likely due to the unbalanced distribution between the Late and On-Time classes in the training data, a problem that requires careful handling. One way to address it is to apply an imbalanced-dataset handling technique, which balances the training data so that the model can learn from both classes. The impact of this technique on overall model performance should then be tested. With this step, the model is expected to provide more accurate and reliable predictions for both the Late and On-Time classes.
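The source does not name a specific balancing technique. As one possible illustration, the sketch below applies random oversampling of the minority class using NumPy only. The labels, features, and class sizes are hypothetical placeholders, and other approaches (for example class weighting or synthetic sampling) may suit the real data better.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training labels: 0 = majority class, 1 = minority class.
y_train = np.array([0] * 90 + [1] * 10)
X_train = rng.normal(size=(len(y_train), 4))  # placeholder feature matrix

minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)

# Draw minority rows with replacement until both classes have the same size.
extra_idx = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
balanced_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
rng.shuffle(balanced_idx)

X_balanced = X_train[balanced_idx]
y_balanced = y_train[balanced_idx]

print("Before:", np.bincount(y_train))
print("After: ", np.bincount(y_balanced))
```

Resampling should be applied to the training split only. The test split keeps its original class distribution so that the evaluation still reflects real conditions.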
