Overview

This is a simple ETL data pipeline example that uses the Airflow TaskFlow API with two tasks, extract and transform. The primary aim of the project is to demonstrate using Terraform to provision data infrastructure on Google Cloud Platform, such as BigQuery and Cloud Storage.

About the Data

The data comes from this website, which covers many different cricket tournaments. It is saved in JSON format as a fairly large file. The data contains a lot of information about every game, along with metadata that helps with processing and normalization. It requires considerable inspection and experimentation before preprocessing and normalization can proceed.
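As a rough illustration of the inspection and normalization step, the sketch below flattens one nested match record into a single row with dotted column names. The record layout (`info`, `venue`, `teams`, `toss`) and the `flatten` helper are assumptions made up for this example, not the actual schema of the source files or the code in this repository.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict whose keys are joined with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as-is; lists often need their own table.
            flat[full_key] = value
    return flat


# Hypothetical match record; the real files contain many more fields.
raw = '{"info": {"venue": "Example Ground", "teams": ["A", "B"], "toss": {"winner": "A"}}}'
match = json.loads(raw)
row = flatten(match)

# Listing the flattened keys is a quick way to inspect what a record contains.
for column in sorted(row):
    print(column, "->", row[column])
```

Pandas offers `pandas.json_normalize` for the same kind of flattening. If the columns are headed for a warehouse table, a separator such as `_` may be a safer choice than `.` for column names.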

Technologies

Python: Data extraction
Pandas: Data normalization and preprocessing
Airflow: Orchestration using Astronomer
Terraform: IaC tool for provisioning infrastructure
BigQuery: Data warehouse
Cloud Storage: Can store Terraform state files and stage the normalized data for further preprocessing before it moves to the provisioned BigQuery instance.
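One way to stage normalized rows before loading them into BigQuery is to write them as newline-delimited JSON, a format BigQuery load jobs accept. The sketch below only writes a local file; the row fields, the file name `matches.ndjson`, and the `to_ndjson` helper are assumptions for illustration, and the upload to the bucket is left out.

```python
import json
import os
import tempfile


def to_ndjson(rows):
    """Serialize an iterable of dicts as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)


# Hypothetical normalized rows; the real columns depend on the normalization step.
rows = [
    {"match_id": 1, "venue": "Example Ground"},
    {"match_id": 2, "venue": "Sample Oval"},
]
payload = to_ndjson(rows)

# Write to a temporary directory here; in the pipeline this file would be
# uploaded to the staging bucket instead.
staging_path = os.path.join(tempfile.mkdtemp(), "matches.ndjson")
with open(staging_path, "w", encoding="utf-8") as fh:
    fh.write(payload)
```

Keeping one JSON object per line means the staged file can be inspected or split line by line without parsing the whole file.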

Folders and files

IAC
dags
include/scripts
.dockerignore
.gitignore
Dockerfile
README.md
cricket.png
packages.txt
requirements.txt

judeleonard/Cricket-Data-infra
