This repository contains an end-to-end data engineering project built on AWS Cloud and PySpark, using Spotify data. The project extracts Spotify data, transforms it with PySpark, stores it in AWS S3, and performs analysis on the results. Key components include:
- Data Extraction: Retrieving the Spotify data by uploading files into S3. This step can also be done via APIs if the files are stored elsewhere.
- Data Transformation: Processing the data with PySpark. AWS Glue provides a Visual ETL editor, so you do not necessarily need to write code, but you do need to understand the pipeline architecture.
- Data Storage: Storing the transformed data in AWS S3; you need to create a bucket with designated folders to place the transformed files.
- Data Analysis: In AWS, you can load the data we just transformed and analyze it with Amazon Athena (using SQL) and visualize it with QuickSight.
- AWS Cloud: S3, IAM, Glue, Athena, QuickSight
- PySpark
- Python
To get started with this project, follow the instructions below:
- Clone this repository.
- Set up AWS credentials and configure S3 buckets.
- Install required Python packages.
- Run the data extraction scripts.
- Execute the data transformation pipeline using PySpark.
- Analyze the transformed data by following the Athena and QuickSight steps in this README.
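Before running anything, it helps to confirm your AWS credentials actually work. This is a minimal sanity-check sketch using boto3; the bucket and folder names are the examples used later in this README, so swap in your own.

```python
def list_keys(bucket, prefix, client=None):
    """Return the object keys under an S3 prefix (empty list if none)."""
    if client is None:
        import boto3  # real AWS call; requires configured credentials
        client = boto3.client("s3")
    resp = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]

if __name__ == "__main__":
    # Names below are this README's examples; replace with your own bucket.
    for prefix in ("staging/", "datawarehouse/"):
        print(prefix, list_keys("my-spotify-de-project", prefix))
```

If the call raises a credentials error, revisit the IAM setup step before continuing.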
I'm using a dataset from Kaggle: Spotify Dataset 2023, created by Tony Gordon Jr. It includes:
- albums
- artists
- tracks
| AWS Source | Description |
|---|---|
| S3 - Staging Folder | Raw dataset used for this project, collected from Kaggle. |
| Glue - ETL | Even though AWS provides a visual ETL pipeline architecture, these Python scripts might be handy for reference. |
| S3 - Data Warehouse | Transformed CSV files converted into Parquet format. |
| Athena - Result Output Saved in S3 | Query results and logs written to a dedicated S3 output location. |
CREATE IAM USER: As a first step, we will create a new user from the root account, then log in with the new IAM user as a security measure. Click here for how to set up an IAM user. Then attach the necessary policies directly for this project, which include S3, Glue, Athena & QuickSight access. Before finishing, review the account details prior to creation; refer to the image below.
Once complete, you can sign in with the IAM account you just created; the system will request the Account ID, username, and password. On first login, it will automatically ask you to change the password.
CREATE S3 BUCKET: Go to S3 and create a bucket. For this project I've named my bucket `my-spotify-de-project`; you can name it whatever you want, but note that each bucket name must be globally unique, so choose one that reflects its purpose.
Then create two new folders: `staging` and `datawarehouse`.
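The console clicks above can also be scripted. Here is a hedged boto3 sketch that creates the bucket and the two folder placeholders (S3 "folders" are just zero-byte objects whose key ends in `/`); the bucket name is this README's example.

```python
def folder_key(name):
    """S3 folders are zero-byte objects whose key ends with a slash."""
    return name if name.endswith("/") else name + "/"

def create_layout(bucket, folders, client=None):
    """Create the bucket and its folder placeholders; returns the keys made."""
    if client is None:
        import boto3  # real AWS call; requires configured credentials
        client = boto3.client("s3")
    # Note: outside us-east-1, create_bucket also needs a
    # CreateBucketConfiguration with your region.
    client.create_bucket(Bucket=bucket)
    keys = [folder_key(f) for f in folders]
    for key in keys:
        client.put_object(Bucket=bucket, Key=key)
    return keys

if __name__ == "__main__":
    print(create_layout("my-spotify-de-project", ["staging", "datawarehouse"]))
```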
DOWNLOAD THE DATASET: Before we begin processing, we must acquire the dataset. Usually the data would be fetched from DynamoDB or a database instance, but since this project uses an external dataset, we will upload it manually.
Get the Dataset HERE
UPLOAD THE DATASET: For this project, I've uploaded the files into the `staging/` folder created earlier, selecting three of the five files in the dataset: artists, albums & tracks.
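The manual upload can equally be done with boto3. A small sketch, assuming the three CSVs sit in your working directory (file names are illustrative):

```python
import os

def staging_key(path, prefix="staging/"):
    """Map a local file path to its key under the staging/ folder."""
    return prefix + os.path.basename(path)

def upload_dataset(bucket, paths, client=None):
    """Upload the selected CSVs into staging/; returns the uploaded keys."""
    if client is None:
        import boto3  # real AWS call; requires configured credentials
        client = boto3.client("s3")
    keys = []
    for path in paths:
        key = staging_key(path)
        client.upload_file(path, bucket, key)
        keys.append(key)
    return keys

if __name__ == "__main__":
    files = ["artists.csv", "albums.csv", "tracks.csv"]  # the three chosen files
    print(upload_dataset("my-spotify-de-project", files))
```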
ETL JOB WITH AWS GLUE: Now for the tricky part, and be aware that charges apply: this tool is billed. Below is a snippet of the AWS Glue ETL job, which I named `DE-Spotify-ETL`.
Next, based on the Glue architecture, you can follow along and create a new job.
- Source: Select the Amazon S3 bucket. In this case we have three files, `artists.csv`, `tracks.csv` & `albums.csv`, so we add three S3 source nodes.
- Transform: Join the relevant items as shown in the images, then remove redundancy using Drop Fields.
- Destination Target: There are a variety of targets to choose from; I chose the S3 bucket created earlier, with the `datawarehouse` folder as the destination. I use `parquet` as the format with `snappy` compression to keep the files lightweight.
- Creating a new IAM role: We need to create a new role from the root account to allow Glue to run the ETL jobs. From the root account, go to IAM, then click Roles in the left navigation pane. This role will allow `S3FullAccess`.
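The same role can be created programmatically. This is a hedged boto3 sketch: it creates a role that Glue can assume and attaches the managed `AmazonS3FullAccess` policy; a production Glue job would typically also need a Glue service policy attached, and the role name here is a placeholder.

```python
import json

GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

def create_glue_role(name, iam=None):
    """Create a role Glue can assume and attach AmazonS3FullAccess to it."""
    if iam is None:
        import boto3  # real AWS call; run as an admin-level user
        iam = boto3.client("iam")
    iam.create_role(RoleName=name,
                    AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY))
    iam.attach_role_policy(RoleName=name,
                           PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess")
    return name

if __name__ == "__main__":
    create_glue_role("spotify-glue-role")  # placeholder role name
```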
Components of the ETL Job Details: before you can save the ETL job, you need to pre-configure its settings. Here are my settings for this project; the red lines indicate the key elements to focus on when creating the job, and the yellow highlights are settings suited to this small project.
RUN THE JOB!!: once everything has been configured, you can run the ETL job and visit Job run monitoring to view its status.
This process might take a while, so you might want a coffee break here. It took a few minutes for the ETL job to finish running.
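Instead of watching the console, you can start the job and poll its status with boto3. A small sketch (the job name matches the one used in this README):

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def run_and_wait(job_name, glue=None, poll_seconds=30):
    """Start a Glue job run and poll until it reaches a terminal state."""
    if glue is None:
        import boto3  # real AWS call; requires configured credentials
        glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if run["JobRunState"] in TERMINAL_STATES:
            return run["JobRunState"]
        time.sleep(poll_seconds)

if __name__ == "__main__":
    print(run_and_wait("DE-Spotify-ETL"))
```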
CRAWLING WITH AWS GLUE: This process creates a data catalog and database (db). Now, let's revisit AWS Glue, go to Crawlers under Data Catalog, and hit Create Crawler.
Set crawler properties: I've named it `ndl_`
Choose data sources and classifiers: choose the S3 location of the files we transformed earlier, ../datawarehouse
Configure security settings: I'm choosing the IAM role created earlier.
Set output and scheduling: We need to create a new target database. Open a new tab, go to AWS Glue, choose Databases under Data Catalog, and click Add database. I've named it `spotify`.
Review and create: once the crawler is fully configured, you can create it.
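The crawler steps above can be sketched in boto3 as well. The role name below is a placeholder; the crawler name, S3 path, and database match this README's examples.

```python
def create_and_start_crawler(name, role, s3_path, database, glue=None):
    """Create a crawler over the transformed data and kick it off."""
    if glue is None:
        import boto3  # real AWS call; requires configured credentials
        glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,
        Role=role,                        # the IAM role created earlier
        DatabaseName=database,            # e.g. the "spotify" database
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=name)
    return name

if __name__ == "__main__":
    create_and_start_crawler(
        "ndl_", "spotify-glue-role",      # role name is a placeholder
        "s3://my-spotify-de-project/datawarehouse/", "spotify")
```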
QUERYING WITH ATHENA: In this section we query the data through Amazon Athena. In Athena, open the Query Editor; this tool uses the SQL language. Before querying, a few settings need adjusting: first, create a new S3 bucket as the query output location.
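Queries can also be submitted programmatically. A hedged sketch with boto3: the table name (`datawarehouse`), the `artist_name` column, and the output bucket are assumptions; use whatever table the crawler actually registered and your own output bucket.

```python
def run_query(sql, database, output_s3, athena=None):
    """Submit a SQL query to Athena; results land in the output S3 location."""
    if athena is None:
        import boto3  # real AWS call; requires configured credentials
        athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]

# Table and column names below are assumptions; check what the crawler created.
EXAMPLE_SQL = """
SELECT artist_name, COUNT(*) AS n_tracks
FROM datawarehouse
GROUP BY artist_name
ORDER BY n_tracks DESC
LIMIT 10
"""

if __name__ == "__main__":
    print(run_query(EXAMPLE_SQL, "spotify",
                    "s3://my-athena-output-bucket/results/"))  # hypothetical bucket
```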
DATA VISUALIZATION WITH QUICKSIGHT: This tool helps you create proper data visualizations, similar to Power BI or Tableau. Now go to QuickSight. If this is your first time accessing QuickSight, you need to log in via your root account and create a QuickSight account. Please note that charges apply for this tool (it is quite expensive).