GitHub - lorijta92/wdi-sports-etl: Apply ETL skills by extracting data from 3 different sources, transforming data with Python, and loading data into Postgres with SQLAlchemy.

Goal

Practice ETL skills by extracting data from multiple sources, transforming that data using Pandas and SQLAlchemy, and loading the data into Postgres.

Extract

Our team pulled data from 4 sources:

All data sources were available in CSV format from the source websites, so we downloaded the data to our local directory and imported it into a Jupyter Notebook using Pandas to perform the Transform and Load processes.
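As a rough illustration of the import step, the sketch below reads one CSV into a DataFrame. The column names and the sample row are made up for the example, and an in-memory buffer stands in for a downloaded file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for a downloaded CSV file. In a notebook you would pass a
# local file path to load_csv instead; the columns here are hypothetical.
csv_text = io.StringIO(
    "date,home_team,away_team\n"
    "2018-06-14,Team A,Team B\n"
)


def load_csv(source):
    """Read one CSV source (path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


matches = load_csv(csv_text)
```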

Transform

After the datasets were loaded into Jupyter Notebooks, we needed to perform standard data cleansing: dropping unnecessary columns, renaming columns, changing data types, and creating new columns. More specific steps for various tables are as follows:

Country Data

The goal of this table was to obtain yearly World Development Indicator (WDI) data per country and country code. To accomplish that with the raw data provided by the WDI, we first used the Pandas pd.melt() function to create a row for each year and then used pd.pivot_table() to get a column for each indicator.
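The melt-then-pivot reshape can be sketched as follows. This is a minimal example, not the notebook's actual code: the column names ("Country Name", "Country Code", "Indicator Name"), the year columns, and all values are assumptions chosen to resemble a wide indicator table.

```python
import pandas as pd

# Made-up sample in a wide layout: one row per country and indicator,
# one column per year.
raw = pd.DataFrame({
    "Country Name": ["Aruba", "Aruba", "Chile", "Chile"],
    "Country Code": ["ABW", "ABW", "CHL", "CHL"],
    "Indicator Name": ["Population", "GDP", "Population", "GDP"],
    "2000": [1.0, 2.0, 3.0, 4.0],
    "2001": [1.5, 2.5, 3.5, 4.5],
})


def reshape_wdi(df):
    """Return one row per country and year, one column per indicator."""
    id_cols = ["Country Name", "Country Code", "Indicator Name"]
    # Step 1: melt the year columns into rows.
    long = pd.melt(df, id_vars=id_cols, var_name="Year", value_name="Value")
    # Step 2: pivot the indicator names into columns.
    wide = pd.pivot_table(
        long,
        index=["Country Name", "Country Code", "Year"],
        columns="Indicator Name",
        values="Value",
    )
    wide.columns.name = None  # drop the leftover "Indicator Name" label
    return wide.reset_index()


country_data = reshape_wdi(raw)
```

Note that pd.pivot_table() aggregates with the mean by default, which is harmless here because each country, year, and indicator combination appears once.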

Soccer Data

The data set for this table was already very clean, so little transformation was needed. However, we wanted to be able to filter this table by year and month, so the date column was split into two new columns, one for the year and one for the month. The month was then converted from a numerical format to a text format.

Then, in order to make this table easier to compare against the summer and winter tables, a season column was added. Months April through September were categorized as summer months, and months October through March were categorized as winter months. Then the index was reset to make the table comparable to the table created in Postgres before the data was loaded in.
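The year, month, and season steps could look roughly like the sketch below. It uses pd.to_datetime and the .dt accessor rather than string splitting, which is a substitution on our part, and the column names, the "summer"/"winter" labels, and the choice to discard the old index are all assumptions.

```python
import pandas as pd

# April through September count as summer; every other month is winter.
SUMMER_MONTHS = {"April", "May", "June", "July", "August", "September"}


def add_date_parts(df, date_col="date"):
    """Add year, month (as text), and season columns, then reset the index."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month_name()  # e.g. 6 -> "June"
    out["season"] = out["month"].map(
        lambda m: "summer" if m in SUMMER_MONTHS else "winter"
    )
    # Assumption: the old index is discarded rather than kept as a column.
    return out.reset_index(drop=True)


# Made-up sample dates.
soccer = add_date_parts(pd.DataFrame({"date": ["2018-06-14", "2017-11-10"]}))
```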

Olympic Data

We decided to keep four of the five tables here. Data transformation included splitting athlete names into first and last name columns, dropping unnecessary columns, adding unique id columns for primary keys, and joining the regions table with a subset of the country table to create our regions table with noc values, country code, and country name. Most notably, we had to add extra noc columns to ensure we could use country name as a primary key in the regions table.
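Two of these steps, splitting names and adding an id column, can be sketched as below. The column names (Name, first_name, last_name, athlete_id) and the sample names are hypothetical, and splitting on the first space is a simplifying assumption that will not suit every naming convention.

```python
import pandas as pd


def split_names(df, name_col="Name"):
    """Split a full-name column on the first space and add a sequential id."""
    out = df.copy()
    # partition always returns three columns: before, separator, after.
    parts = out[name_col].str.partition(" ")
    out["first_name"] = parts[0]
    out["last_name"] = parts[2]  # empty string for single-word names
    # Sequential integer id intended to serve as a primary key.
    out.insert(0, "athlete_id", range(1, len(out) + 1))
    return out


# Made-up sample names.
athletes = split_names(pd.DataFrame({"Name": ["Jane Doe", "Juan Carlos Perez"]}))
```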

Load

To load all the data, we used the PostgreSQL driver psycopg2 to connect to the Postgres server. Individual passwords were imported from an external config file to complete the connection. After the engine was created, we loaded each data frame into its corresponding table in the database.
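A minimal sketch of the load step, assuming a SQLAlchemy engine with the psycopg2 driver and DataFrame.to_sql. The helper names, the database name, and the if_exists="append" and index=False settings are assumptions, not the notebooks' actual code. In practice the password would come from the external config file rather than being written inline.

```python
from urllib.parse import quote


def build_postgres_url(user, password, host, port, database):
    """Build a SQLAlchemy connection URL for the psycopg2 driver."""
    # Percent-encode credentials so characters like "@" or "/" are safe.
    user_q = quote(user, safe="")
    password_q = quote(password, safe="")
    return f"postgresql+psycopg2://{user_q}:{password_q}@{host}:{port}/{database}"


def load_frames(frames, url, if_exists="append"):
    """Write each DataFrame in `frames` (table name -> frame) to the database."""
    # Imported here so the URL helper works without SQLAlchemy installed.
    from sqlalchemy import create_engine

    engine = create_engine(url)
    for table_name, frame in frames.items():
        # "append" assumes the tables already exist, e.g. created from schema.sql.
        frame.to_sql(table_name, con=engine, if_exists=if_exists, index=False)


# Hypothetical credentials and database name, for illustration only.
url = build_postgres_url("postgres", "p@ss/word", "localhost", 5432, "wdi_sports")
```

Calling load_frames({"soccer": soccer_df}, url) would then need a running Postgres server and a matching table.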

