The main dataset being used contains academic scores for student athletes on NCAA Division 1 teams. The granularity of the data is at the school, sport, and gender level. The goal for this project is to determine if the academic scores for sports teams are correlated with the physicality of the sport. In other words, is there a correlation between contact sports and poor academic performance.
The data was extracted from different sources (csv, web scraping), cleaned and transformed to uniformity, and then loaded into a PostgreSQL database according to the below star schema.
Dimension Tables: date_dim
, location_dim
, school_dim
, sport_dim
Fact Table: academic_score_snapshot_fact
The dimensional model is implemented using a star schema.
Create and activate a virtual environment, then install the dependencies. All example code below is using Powershell.
Note: venv_name is the name of your virtual environment
PS C:\> python -m venv venv_name
PS C:\> venv_name\Scripts\Activate.ps1
PS C:\> pip install -r packages.txt
To create PostgreSQL database and dimension and fact tables according to the above star schema, run the create_star_schema.py file.
PS C:\> python create_star_schema.py
Finally, to execute the ETL (Extract, Transform, Load) pipeline and populate the data warehouse according to the above star schema, run the loader.py file.
PS C:\> python loader.py
To test the code using pytest, run the following command in PowerShell:
PS C:\> pytest -q tests.py
Note: "-q" is used to condense the output of the above command.