Skip to content

ETL pipeline that extracts and transforms student athlete academic performance data, then populates a data warehouse using a star schema dimensional model.

Notifications You must be signed in to change notification settings

oiannace/ETL-pipeline

Repository files navigation

ETL Data Pipeline

Project Context

The main dataset being used contains academic scores for student athletes on NCAA Division 1 teams. The granularity of the data is at the school, sport, and gender level. The goal for this project is to determine if the academic scores for sports teams are correlated with the physicality of the sport. In other words, is there a correlation between contact sports and poor academic performance.

The data was extracted from different sources (csv, web scraping), cleaned and transformed to uniformity, and then loaded into a PostgreSQL database according to the below star schema.

Dimensional Model

Dimension Tables: date_dim, location_dim, school_dim, sport_dim
Fact Table: academic_score_snapshot_fact

The dimensional model is implemented using a star schema.

alt_text

Using the code

Create and activate a virtual environment, then install the dependencies. All example code below is using Powershell.
Note: venv_name is the name of your virtual environment

PS C:\> python -m venv venv_name
PS C:\> venv_name\Scripts\Activate.ps1
PS C:\> pip install -r packages.txt

To create PostgreSQL database and dimension and fact tables according to the above star schema, run the create_star_schema.py file.

PS C:\> python create_star_schema.py

Finally, to execute the ETL (Extract, Transform, Load) pipeline and populate the data warehouse according to the above star schema, run the loader.py file.

PS C:\> python loader.py

Testing the code

To test the code using pytest, run the following command in PowerShell:

PS C:\> pytest -q tests.py

Note: "-q" is used to condense the output of the above command.

About

ETL pipeline that extracts and transforms student athlete academic performance data, then populates a data warehouse using a star schema dimensional model.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages