Data Engineering using PySpark and BigQuery (Google Cloud Platform)

This repository contains a PySpark job that reads data from a CSV file stored in a Google Cloud Storage bucket and writes it into a BigQuery table.

Additional files include shell scripts with basic Cloud Shell commands, SQL queries that can be run in BigQuery to check data quality, and the source CSV file.

Note:

This project was built to get acquainted with the Dataproc and BigQuery services.

Diagram:

(Architecture diagram: data flows from Cloud Storage through a PySpark job on Dataproc into BigQuery.)

File Descriptions

job.py: This is the main PySpark script that reads the CSV data, performs some transformations, and loads the data into a BigQuery table.
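
A minimal sketch of what such a script looks like, assuming the spark-bigquery connector is available on the cluster; the variable values and the example transformation are placeholders, not the repository's actual configuration:

```python
from pyspark.sql import SparkSession

# Placeholder configuration -- replace with your own project, dataset,
# table, and bucket details (these are not the repo's real values).
PROJECT_ID = "your-project-id"
DATASET_ID = "your_dataset"
TABLE_NAME = "invoices"
BUCKET_NAME = "your-staging-bucket"
URI = f"gs://{BUCKET_NAME}/data/invoices.csv"

spark = SparkSession.builder.appName("gcs-to-bigquery").getOrCreate()

# Read the CSV file from Cloud Storage, inferring the schema from the data.
df = spark.read.option("header", True).option("inferSchema", True).csv(URI)

# Example transformation: drop rows that are entirely null.
df_clean = df.dropna(how="all")

# Write to BigQuery via the spark-bigquery connector
# (the connector jar is supplied at job-submission time).
(
    df_clean.write.format("bigquery")
    .option("table", f"{PROJECT_ID}.{DATASET_ID}.{TABLE_NAME}")
    .option("temporaryGcsBucket", BUCKET_NAME)
    .mode("overwrite")
    .save()
)
```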

gcloud_commands.sh: This shell script contains general configuration commands for Google Cloud Shell, including authentication, project setup, and service listing.
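
For reference, these are the kinds of commands such a script typically contains; the project ID, region, and zone below are placeholders, not the script's exact contents:

```sh
gcloud auth login                              # authenticate your account
gcloud config set project your-project-id     # set the default project
gcloud config set compute/region us-central1  # set the default compute region
gcloud config set compute/zone us-central1-a  # set the default compute zone
gcloud services list --enabled                # list enabled services
```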

dataproc_commands.sh: This shell script contains commands for creating, listing, and deleting Dataproc clusters, as well as submitting PySpark jobs to a cluster.
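
An illustrative version of those commands is shown below; the cluster name, region, and connector jar are placeholders and may differ from what the script actually uses:

```sh
# Create a small single-node cluster
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 --single-node

# List clusters in the region
gcloud dataproc clusters list --region=us-central1

# Submit the PySpark job, supplying the BigQuery connector jar
gcloud dataproc jobs submit pyspark job.py \
    --cluster=demo-cluster --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

# Delete the cluster when finished
gcloud dataproc clusters delete demo-cluster --region=us-central1
```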

*.sql: These SQL scripts are used for validating the data in the BigQuery table. They check for records with 'Return' invoices, null values in certain fields, and invalid quantities.
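
The queries below illustrate checks of this kind; the table and column names are assumptions, not the repository's actual schema:

```sql
-- Records flagged as returns
SELECT COUNT(*) AS return_invoices
FROM `your-project-id.your_dataset.invoices`
WHERE invoice_type = 'Return';

-- Null values in a key field
SELECT COUNT(*) AS null_customer_ids
FROM `your-project-id.your_dataset.invoices`
WHERE customer_id IS NULL;

-- Invalid (non-positive) quantities
SELECT COUNT(*) AS invalid_quantities
FROM `your-project-id.your_dataset.invoices`
WHERE quantity <= 0;
```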

*.csv: Contains invoice data for an e-commerce platform.

Prerequisites

  • Google Cloud SDK
  • PySpark
  • Google Cloud Storage bucket
  • Google BigQuery

Setup

  • Authenticate your Google Cloud account using the command in gcloud_commands.sh.
  • Set your default project, compute region, and zone.
  • Create a Dataproc cluster using the command in dataproc_commands.sh.
  • Optional: run pip install -r requirements.txt to install the required libraries in your local environment.
  • Submit the PySpark job to the cluster using the Google Cloud console or Cloud Shell.

Usage

  • Update the PROJECT_ID, DATASET_ID, TABLE_NAME, URI, and BUCKET_NAME variables in job.py with your specific Google Cloud project, dataset, table, and bucket details.

  • After the job completes, run the *.sql scripts in BigQuery to validate the data; a minimal example of invoking one from Cloud Shell is shown below.
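
For example, a validation script can be run from Cloud Shell with the bq tool (the file name here is illustrative, not necessarily a file in this repository):

```sh
# Run a data-quality check against BigQuery using standard SQL
bq query --use_legacy_sql=false < data_quality_check.sql
```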

Cleanup

  • Delete the Dataproc cluster using the command in dataproc_commands.sh.
  • Optionally, delete the Google Cloud project to avoid incurring charges.
