# APEX STATS Dataset
Prepared by Michelle Baca Reinke

## Source Attribution

Author: Data Copyright &copy; Realinho et al. (2021), All Rights Reserved

Title: Predict students' dropout and academic success

Source: [Predict students' dropout and academic success](https://zenodo.org/records/5777340#.Y7FJotJBwUE)

License: CC0: Public Domain

Changes: Data have been adapted for APEX STATS by Michelle Baca Reinke; the example version condenses the data into a subset of specific variables.


## Description of the Original Data


The original dataset provides a comprehensive view of students enrolled in various undergraduate degrees offered at a higher education institution in Portugal. It includes demographic data, socioeconomic factors, and academic performance information that can be used to analyze the possible predictors of student dropout and academic success. The dataset contains multiple disjoint databases consisting of relevant information available at the time of enrollment, such as application mode, marital status, course chosen, and more. Data can be used to estimate overall student performance at the end of each semester by assessing curricular units credited/enrolled/evaluated/approved as well as their respective grades.

Unemployment rate, inflation rate, and GDP from the region can help further understand how economic factors play into student dropout rates or academic success outcomes. They can also provide insight into what motivates students to stay in school or abandon their studies across a wide range of disciplines, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

The full dataset can be downloaded [here](https://www.kaggle.com/datasets/thedevastator/higher-education-predictors-of-student-retention/data).

The original study can be found [here](https://www.mdpi.com/2306-5729/7/11/146) along with detailed descriptors of all variables.

## Description of the Example

Access this example using the file `example.csv`.

The example file is a subset of the original data.

<br>

The following variables are included:

y: Age at the time of enrollment (Discrete)   
x: Grade 1st semester (Continuous: See note below)     
x1: Grade 2nd semester (Continuous: See note below)   
x2: Gender of the student (Dichotomous: 1 male, 0 female)    
x3: Scholarship holder (Dichotomous: 1 yes, 0 no)  
x4: International student (Dichotomous: 1 yes, 0 no)  
x5: End of Semester Status (Categorical: Dropout/Enrolled/Graduate)

<br>

Note: Grades follow the Portuguese grading system, which ranges from 0 to 20. More details on the grading system, along with international equivalences, can be found [here](https://www.upt.pt/en/home/internationals/portuguese-grading-system-2/#:~:text=Notes%3A%20A%20grade%20below%20to,are%20even%20more%20rarely%20used.).
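The short column names can be hard to read in analysis output. The following is a minimal sketch of relabeling them based on the variable list above. The descriptive names (`age_at_enrollment`, `grade_sem1`, and so on) and the helper `label_columns` are illustrative assumptions, not part of the dataset. The five rows are copied from the Dataset Preview section as a stand-in for the full file.

```python
import pandas as pd

# Stand-in sample: the five rows shown in the Dataset Preview section.
# For real use, replace this with the DataFrame loaded from example.csv.
sample = pd.DataFrame({
    "y": [20, 19, 19, 20, 45],
    "x": [0.0, 14.0, 0.0, 13.428571, 12.333333],
    "x1": [0.0, 13.666667, 0.0, 12.4, 13.0],
    "x2": [1, 1, 1, 0, 0],
    "x3": [0, 0, 0, 0, 0],
    "x4": [0, 0, 0, 0, 0],
    "x5": ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"],
})

# Hypothetical descriptive names (an assumption for readability only).
COLUMN_NAMES = {
    "y": "age_at_enrollment",
    "x": "grade_sem1",
    "x1": "grade_sem2",
    "x2": "gender",
    "x3": "scholarship_holder",
    "x4": "international",
    "x5": "status",
}


def label_columns(df):
    """Return a copy with descriptive column names and labeled 0/1 codes."""
    labeled = df.rename(columns=COLUMN_NAMES)
    labeled["gender"] = labeled["gender"].map({1: "male", 0: "female"})
    for col in ("scholarship_holder", "international"):
        labeled[col] = labeled[col].map({1: "yes", 0: "no"})
    return labeled


labeled = label_columns(sample)

# Grades are on the 0-20 scale, so values outside it would signal a data problem.
grades = labeled[["grade_sem1", "grade_sem2"]]
grades_in_range = bool(((grades >= 0) & (grades <= 20)).all().all())

print(labeled)
print("All grades within 0-20:", grades_in_range)
```

Keeping the original codes in the source DataFrame and labeling a copy leaves the 0/1 columns available for numeric analyses.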


## Discipline(s) Represented

- Education
- Sociology

## Dataset Preview

```python
#@title Setup Example Data: Academic Success

# Import library
import pandas as pd

# Read data file: Academic Success
data = pd.read_csv('https://raw.githubusercontent.com/livid2nite/APEX/main/academic_success/example.csv')

# Preview data
data.head()
```

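| Unnamed: 0 | y | x | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|---|---|
| 0 | 20 | 0.0 | 0.0 | 1 | 0 | 0 | Dropout |
| 1 | 19 | 14.0 | 13.666667 | 1 | 0 | 0 | Graduate |
| 2 | 19 | 0.0 | 0.0 | 1 | 0 | 0 | Dropout |
| 3 | 20 | 13.428571 | 12.4 | 0 | 0 | 0 | Graduate |
| 4 | 45 | 12.333333 | 13.0 | 0 | 0 | 0 | Graduate |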


## Exploratory Analyses (untested)


This is a dataset with different types of variables, which opens up the possibility for a wide range of statistical analyses:

1. **Descriptive Statistics**:
   - Calculate summary statistics (mean, median, and standard deviation) for the age variable (y) to understand the age distribution at enrollment.
   - Examine the distribution of 1st and 2nd-semester grades (x and x1) using summary statistics and graphical representations (histograms, box plots).

2. **Gender Differences**:
   - Perform a t-test to compare the mean age at enrollment (y) between male and female students.
   - Explore if there are significant differences in 1st and 2nd-semester grades (x and x1) based on gender using t-tests.

3. **Scholarship Impact**:
   - Conduct a comparison of means or regression analysis to investigate how scholarship status (x3) affects 1st and 2nd-semester grades (x and x1).

4. **International Students and Citizens**:
   - Compare the distribution of age at enrollment (y) between international students and students who are citizens using t-tests.
   - Analyze whether international students (x4) have significantly different grade outcomes (x and x1) compared to students who are citizens.

5. **End of Semester Status**:
   - Create frequency tables and bar charts to visualize the distribution of end-of-semester status (x5) categories.


6. **Correlation Analysis**:
   - Calculate correlation coefficients (e.g., Pearson) to assess the relationships between age at enrollment (y) and 1st/2nd-semester grades (x and x1).

7. **Data Visualization**:
   - Create scatter plots to visualize the relationships between variables like age, 1st/2nd-semester grades, and end-of-semester status.
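Several of the suggestions above can be sketched with pandas alone. The example below is a rough illustration rather than a tested analysis: it uses the five rows from the dataset preview as a stand-in for the full file, so the numbers it produces are not meaningful results. The variable names `age_summary`, `status_counts`, `correlations`, and `grade_means_by_scholarship` are illustrative.

```python
import pandas as pd

# Stand-in sample: the five rows shown in the dataset preview.
# For real use, replace this with the DataFrame loaded from example.csv.
sample = pd.DataFrame({
    "y": [20, 19, 19, 20, 45],
    "x": [0.0, 14.0, 0.0, 13.428571, 12.333333],
    "x1": [0.0, 13.666667, 0.0, 12.4, 13.0],
    "x2": [1, 1, 1, 0, 0],
    "x3": [0, 0, 0, 0, 0],
    "x4": [0, 0, 0, 0, 0],
    "x5": ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"],
})

# Descriptive statistics: mean, median, and standard deviation of age (y).
age_summary = sample["y"].agg(["mean", "median", "std"])

# Scholarship impact: mean 1st/2nd-semester grades per scholarship status (x3).
# The preview rows contain only non-holders, so a single group appears here.
grade_means_by_scholarship = sample.groupby("x3")[["x", "x1"]].mean()

# End-of-semester status: frequency table of the x5 categories.
status_counts = sample["x5"].value_counts()

# Correlation analysis: Pearson correlations among age and both grades.
correlations = sample[["y", "x", "x1"]].corr(method="pearson")

print(age_summary)
print(grade_means_by_scholarship)
print(status_counts)
print(correlations)
```

On the full dataset, `status_counts.plot(kind="bar")` and `sample.plot.scatter(x="y", y="x")` would be one way to produce the suggested bar chart and scatter plot, assuming matplotlib is installed.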

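The group comparisons suggested above (by gender, scholarship status, or international status) could be run with SciPy, assuming it is installed. This sketch uses Welch's t-test (`equal_var=False`), which is a choice made here and not something the source specifies. The helper `compare_by_group` is hypothetical, and the five preview rows again stand in for the full file, so the output only demonstrates the mechanics.

```python
import pandas as pd
from scipy import stats

# Stand-in sample: the five rows shown in the dataset preview.
# For real use, replace this with the DataFrame loaded from example.csv.
sample = pd.DataFrame({
    "y": [20, 19, 19, 20, 45],
    "x": [0.0, 14.0, 0.0, 13.428571, 12.333333],
    "x1": [0.0, 13.666667, 0.0, 12.4, 13.0],
    "x2": [1, 1, 1, 0, 0],
    "x3": [0, 0, 0, 0, 0],
    "x4": [0, 0, 0, 0, 0],
    "x5": ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"],
})


def compare_by_group(df, value_col, group_col):
    """Welch's t-test of value_col between rows coded 1 and rows coded 0."""
    coded_one = df.loc[df[group_col] == 1, value_col]
    coded_zero = df.loc[df[group_col] == 0, value_col]
    return stats.ttest_ind(coded_one, coded_zero, equal_var=False)


# Compare age at enrollment (y) between male (1) and female (0) students.
result = compare_by_group(sample, "y", "x2")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

The same helper applies to `x3` and `x4` on the full dataset. In the preview rows, both columns contain only zeros, so one group would be empty.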

