METER-Weather-reports

Technologies

  • Apache Spark: Batch-processes and merges data from GCS buckets into BigQuery via Dataproc.
  • BigQuery: Hosts the materialized tables with aggregated data.
  • Dataproc: GCP's managed Spark service, used for batch processing from GCS buckets to BigQuery tables.
  • GCS Buckets: For storing datasets and Spark scripts.
  • Mage: For orchestrating ETL from API calls to GCS buckets.
  • PySpark: Library for implementing the Spark scripts.
  • Terraform: For creating the GCS buckets, BigQuery dataset, and Dataproc clusters.


About the project

An educational project to build an end-to-end pipeline for near real-time and batch processing of data and visualisation.

The project prepares an analytical summary of how METAR weather reports for airports in European countries vary over the years.

The dataset consists of Meteorological Aerodrome Reports (METAR), which are observations of current surface weather reported in a standard international format.

Here is a Looker Studio demo of the analysis using a few stations: CA_BC_ASOS

[Screenshot: Looker Studio dashboard]

Setup overview:

[Diagram: setup/architecture overview]

Dataset

We will make API calls and save data to GCS buckets using parquet format with the following file path: <network_name>/<station_name>/<station_name>.parquet.

Here is a sample API call:

https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?station=EPKK&data=all&year1=2023&month1=1&day1=1&year2=2023&month2=3&day2=26&tz=Etc%2FUTC&format=onlycomma&latlon=no&elev=no&missing=null&trace=T&direct=no&report_type=3&report_type=4
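
For illustration, here is a minimal sketch (not code from this repo) of one fetch-and-save step: it reads the comma-separated response with pandas and writes it out locally in the <network_name>/<station_name>/<station_name>.parquet layout. In the actual pipeline this is done by Mage blocks that write to a GCS bucket, and the network name below is a placeholder.

import os
import pandas as pd

BASE_URL = "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py"

def fetch_station(station: str, network: str, out_root: str = ".") -> str:
    # Same query parameters as the sample API call above; format=onlycomma
    # returns plain CSV, so pandas can read the URL directly.
    params = (
        f"?station={station}&data=all"
        "&year1=2023&month1=1&day1=1&year2=2023&month2=3&day2=26"
        "&tz=Etc%2FUTC&format=onlycomma&latlon=no&elev=no"
        "&missing=null&trace=T&direct=no&report_type=3&report_type=4"
    )
    df = pd.read_csv(BASE_URL + params, na_values=["null"])
    path = os.path.join(out_root, network, station, f"{station}.parquet")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_parquet(path, index=False)  # requires pyarrow or fastparquet
    return path

# Example: the station from the sample call, with a placeholder network name.
print(fetch_station("EPKK", "PL__ASOS"))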

More details are in doc/Dataset.md.

Cloud

The project is developed in the cloud using scalable infrastructure provided by Google Cloud Platform. Infrastructure as Code (IaC) tools such as Terraform are utilized to provision and manage the cloud resources efficiently.

Data Ingestion

Data ingestion involves batch processing: data is collected, processed, and uploaded to the data lake periodically, and subsequently loaded into the data warehouse. This ensures that the latest METAR observations are readily available for analysis.

Workflow Orchestration

An end-to-end pipeline is orchestrated using Mage to automate data workflows. This pipeline efficiently manages multiple steps in a Directed Acyclic Graph (DAG), ensuring seamless execution and reliability of the data processing tasks.
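
As a rough illustration of what a block in such a pipeline can look like, below is a sketch modeled on Mage's standard Google Cloud Storage data exporter template (not code copied from this repo); the bucket name and object key are placeholders, and credentials are read from Mage's io_config.yaml.

from os import path
from pandas import DataFrame

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

@data_exporter
def export_station_to_gcs(df: DataFrame, **kwargs) -> None:
    # Credentials and project settings come from Mage's io_config.yaml.
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    bucket_name = 'your-datalake-bucket'          # placeholder
    object_key = 'CA_BC_ASOS/CYVR/CYVR.parquet'   # placeholder station path

    GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        bucket_name,
        object_key,
    )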

Data Lake & Data Warehouse

In this project, Google Cloud Storage is used as the data lake where the data is initially stored after ingestion from the source. Google BigQuery is used as the data warehouse, storing and optimizing structured data for analysis.
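
Once the aggregated tables are materialized, they can be queried directly. A small sketch assuming the google-cloud-bigquery client, with placeholder project, dataset, table, and column names:

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project id

query = """
    SELECT station, year, avg_temp_f
    FROM `your-gcp-project.metar_dataset.yearly_summary`
    ORDER BY station, year
"""

# Print each aggregated row from the materialized table.
for row in client.query(query).result():
    print(row.station, row.year, row.avg_temp_f)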

Transformations

Data transformations are performed using Apache Spark on Dataproc. The transformation logic is defined in a PySpark script and executed as a job on a Dataproc cluster.
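
A minimal sketch of this kind of job (not the actual script in scripts/; the bucket, project, dataset, and table names are placeholders, and the valid/tmpf columns follow the IEM ASOS export):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("metar-yearly-summary").getOrCreate()

# Read all stations of one network from the GCS data lake.
raw = spark.read.parquet("gs://your-datalake-bucket/CA_BC_ASOS/*/*.parquet")

# Yearly temperature summary per station; 'valid' is the observation time
# and 'tmpf' the temperature in Fahrenheit in the IEM ASOS export.
summary = (
    raw.withColumn("year", F.year(F.to_timestamp("valid")))
       .groupBy("station", "year")
       .agg(F.avg("tmpf").alias("avg_temp_f"),
            F.count("*").alias("num_reports"))
)

# Write to BigQuery via the spark-bigquery connector available on Dataproc.
(summary.write.format("bigquery")
    .option("table", "your-gcp-project.metar_dataset.yearly_summary")
    .option("temporaryGcsBucket", "your-temp-bucket")
    .mode("overwrite")
    .save())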

Dashboard

Finally, a dashboard is created using Looker Studio to visualize key insights derived from the processed data. The dashboard comprises tiles that summarize the variability of the METAR weather reports across stations and over the years.

Setting up the project

Clone this repo: git clone git@github.com:prantoran/METER-Weather-reports.git.

GCP

  • Set up a Google Cloud Platform account.
  • Create a GCP project and set up a service account and authentication as per these instructions.
  • Set up Terraform in your local environment. Check out the Terraform installation instructions here.
  • See terraform/README.md for instructions on where to copy the service account JSON credentials.
    • Rename the JSON file to service-acc-cred.json.
    • Create a keys folder in the root directory and copy the JSON file into it.

Terraform

Check out terraform/README.md.

Mage

Check out mage/README.md.

The contents of the Mage folder were copied from github.com/mage-ai/maze-zoomcamp. In addition, dynamic blocks are used to trigger multiple instances of child blocks based on the number of stations returned by the API calls. After running docker-compose up, the Mage UI will be accessible at localhost:6789.
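
Roughly, a dynamic block returns a list of items (plus optional metadata), and Mage spawns one downstream child block run per item. A hypothetical sketch with hard-coded stations (the real block builds this list from an API call; the station and network names are placeholders):

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

@data_loader
def list_stations(*args, **kwargs):
    # Placeholder stations; the real block builds this list from an API call.
    stations = [
        {"network": "CA_BC_ASOS", "station": "CYVR"},
        {"network": "CA_BC_ASOS", "station": "CYYJ"},
    ]
    # Optional metadata gives each spawned child block run a distinct uuid.
    metadata = [{"block_uuid": f"station_{s['station']}"} for s in stations]
    return [stations, metadata]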

Mage blocks/tasks setup:

[Screenshot: Mage pipeline blocks]

Note: If there are existing Docker containers (e.g. Postgres) running:

  • Remove the existing containers:
docker stop $(docker ps -aq)
docker rm $(docker ps -aq)
  • Stop the PostgreSQL service (e.g. on Ubuntu):
sudo systemctl stop postgresql

Process from GCS Bucket to BigQuery using Dataproc/Spark

Note: Adjust the bucket paths in scripts/upload_pyspark_script_sql_to_gcs.sh and script/submit_dataproc_job.sh to match the config in terraform/dev.auto.tfvars.

Upload the PySpark script and SQL commands:

./scripts/upload_pyspark_script_sql_to_gcs.sh

Submit jobs to Dataproc:

./script/submit_dataproc_job.sh

Analyze the generated BigQuery tables using Looker Studio

Here is a sample Looker dashboard: CA_BC_ASOS.

  • The plots use the generated BigQuery tables as their data source; the tables are produced by running the PySpark script as a job on the Dataproc cluster.
