This repository contains an end-to-end data engineering project built on AWS Cloud and PySpark, using Spotify data. The project extracts Spotify data, transforms it with PySpark, stores it in AWS S3, and performs analysis on the results. Key components include:
- Data Extraction: Retrieving the Spotify data by uploading files into S3. This step can also be done via APIs if the files are stored elsewhere.
- Data Transformation: Processing the data with PySpark. AWS Glue provides a Visual ETL editor, so you do not necessarily need to write code, but you do need to understand the pipeline architecture.
- Data Storage: Storing the transformed data in AWS S3; you need to create a bucket with designated folders to place the transformed files.
- Data Analysis: In AWS, you can load the data we just transformed and analyze it with Amazon Athena (using SQL) and visualize it with QuickSight.
- AWS Cloud: S3, IAM, Glue, Athena, QuickSight
- PySpark
- Python
To get started with this project, follow the instructions below:
- Clone this repository.
- Set up AWS credentials and configure S3 buckets.
- Install required Python packages.
- Run the data extraction scripts.
- Execute the data transformation pipeline using PySpark.
- Analyze the transformed data by following the Athena and QuickSight steps in this README.
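Before running anything, it helps to confirm your AWS credentials actually work. This is a minimal sanity-check sketch using boto3; the bucket and folder names are the examples used later in this README, so swap in your own.

```python
def list_keys(bucket, prefix, client=None):
    """Return the object keys under an S3 prefix (empty list if none)."""
    if client is None:
        import boto3  # real AWS call; requires configured credentials
        client = boto3.client("s3")
    resp = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]

if __name__ == "__main__":
    # Names below are this README's examples; replace with your own bucket.
    for prefix in ("staging/", "datawarehouse/"):
        print(prefix, list_keys("my-spotify-de-project", prefix))
```

If the call raises a credentials error, revisit the IAM setup step before continuing.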
I'm using a dataset from Kaggle: Spotify Dataset 2023, created by Tony Gordon Jr. It includes:
- albums
- artists
- tracks
| AWS Source | Description |
|---|---|
| S3 - Staging Folder | Raw dataset used for this project, collected from Kaggle. |
| Glue - ETL | Even though AWS provides a visual ETL pipeline architecture, these Python scripts might be handy for reference. |
| S3 - Data Warehouse | Transformed CSV files converted into Parquet format. |
| Athena - Result Output Saved in S3 | Query results and logs written to a dedicated S3 output location. |
CREATE IAM USER: As a first step, we will create a new user from the root account, then log in with the new IAM user as a security measure. Click here for how to set up an IAM user. Then attach the necessary policies directly for this project, which include S3, Glue, Athena & QuickSight access. Before finishing, review the account details prior to creation; refer to the image below.
Once complete, you can sign in with the IAM account you just created; the system will request the Account ID, username, and password. On first login, it will automatically ask you to change the password.
CREATE S3 BUCKET: Go to S3 and create a bucket. For this project I've named my bucket `my-spotify-de-project`; you can name it whatever you want, but note that each bucket name must be globally unique, so choose one that reflects its purpose.
Then create two new folders: `staging` and `datawarehouse`.
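The console clicks above can also be scripted. Here is a hedged boto3 sketch that creates the bucket and the two folder placeholders (S3 "folders" are just zero-byte objects whose key ends in `/`); the bucket name is this README's example.

```python
def folder_key(name):
    """S3 folders are zero-byte objects whose key ends with a slash."""
    return name if name.endswith("/") else name + "/"

def create_layout(bucket, folders, client=None):
    """Create the bucket and its folder placeholders; returns the keys made."""
    if client is None:
        import boto3  # real AWS call; requires configured credentials
        client = boto3.client("s3")
    # Note: outside us-east-1, create_bucket also needs a
    # CreateBucketConfiguration with your region.
    client.create_bucket(Bucket=bucket)
    keys = [folder_key(f) for f in folders]
    for key in keys:
        client.put_object(Bucket=bucket, Key=key)
    return keys

if __name__ == "__main__":
    print(create_layout("my-spotify-de-project", ["staging", "datawarehouse"]))
```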
DOWNLOAD THE DATASET: Before we begin processing, we must acquire the dataset. Usually the data would be fetched from DynamoDB or a database instance, but since this project uses an external dataset, we will upload it manually.
Get the Dataset HERE
UPLOAD THE DATASET: For this project, I've uploaded the files into the `staging/` folder created earlier, selecting three of the five files in the dataset: artists, albums & tracks.
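The manual upload can equally be done with boto3. A small sketch, assuming the three CSVs sit in your working directory (file names are illustrative):

```python
import os

def staging_key(path, prefix="staging/"):
    """Map a local file path to its key under the staging/ folder."""
    return prefix + os.path.basename(path)

def upload_dataset(bucket, paths, client=None):
    """Upload the selected CSVs into staging/; returns the uploaded keys."""
    if client is None:
        import boto3  # real AWS call; requires configured credentials
        client = boto3.client("s3")
    keys = []
    for path in paths:
        key = staging_key(path)
        client.upload_file(path, bucket, key)
        keys.append(key)
    return keys

if __name__ == "__main__":
    files = ["artists.csv", "albums.csv", "tracks.csv"]  # the three chosen files
    print(upload_dataset("my-spotify-de-project", files))
```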
ETL JOB WITH AWS GLUE: Now for the tricky part, and be aware that charges apply: this tool is billed. Below is a snippet of the AWS Glue ETL job, which I named `DE-Spotify-ETL`.
Next, based on the Glue architecture, you can follow along and create a new job.
- Source: Select the Amazon S3 bucket. In this case we have three files, `artists.csv`, `tracks.csv` & `albums.csv`, so we add three S3 source nodes.
- Transform: Join the relevant items as shown in the images, then remove redundancy using Drop Fields.
- Destination Target: There are a variety of targets to choose from; I chose the S3 bucket created earlier, with the `datawarehouse` folder as the destination. I use `parquet` as the format with `snappy` compression to keep the files lightweight.
- Creating a new IAM role: We need to create a new role from the root account to allow Glue to run the ETL jobs. From the root account, go to IAM, then click Roles in the left navigation pane. This role will allow `S3FullAccess`.
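The same role can be created programmatically. This is a hedged boto3 sketch: it creates a role that Glue can assume and attaches the managed `AmazonS3FullAccess` policy; a production Glue job would typically also need a Glue service policy attached, and the role name here is a placeholder.

```python
import json

GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

def create_glue_role(name, iam=None):
    """Create a role Glue can assume and attach AmazonS3FullAccess to it."""
    if iam is None:
        import boto3  # real AWS call; run as an admin-level user
        iam = boto3.client("iam")
    iam.create_role(RoleName=name,
                    AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY))
    iam.attach_role_policy(RoleName=name,
                           PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess")
    return name

if __name__ == "__main__":
    create_glue_role("spotify-glue-role")  # placeholder role name
```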
Components of the ETL Job Details: before you can save the ETL job, you need to pre-configure its settings. Here are my settings for this project; the red lines indicate the key elements to focus on when creating the job, and the yellow highlights are settings suited to this small project.
RUN THE JOB!!: once everything has been configured, you can run the ETL job and visit Job run monitoring to view its status.
This process might take a while, so you might want a coffee break here. It took a few minutes for the ETL job to finish running.
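Instead of watching the console, you can start the job and poll its status with boto3. A small sketch (the job name matches the one used in this README):

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def run_and_wait(job_name, glue=None, poll_seconds=30):
    """Start a Glue job run and poll until it reaches a terminal state."""
    if glue is None:
        import boto3  # real AWS call; requires configured credentials
        glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if run["JobRunState"] in TERMINAL_STATES:
            return run["JobRunState"]
        time.sleep(poll_seconds)

if __name__ == "__main__":
    print(run_and_wait("DE-Spotify-ETL"))
```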
CRAWLING WITH AWS GLUE: This process creates a data catalog and database (db). Now, let's revisit AWS Glue, go to Crawlers under Data Catalog, and hit Create Crawler.
Set crawler properties: I've named it `ndl_`
Choose data sources and classifiers: choose the S3 location of the files we transformed earlier, ../datawarehouse
Configure security settings: I'm choosing the IAM role created earlier.
Set output and scheduling: We need to create a new target database. Open a new tab, go to AWS Glue, choose Databases under Data Catalog, and click Add database. I've named it `spotify`.
Review and create: once the crawler is fully configured, you can create it.
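The crawler steps above can be sketched in boto3 as well. The role name below is a placeholder; the crawler name, S3 path, and database match this README's examples.

```python
def create_and_start_crawler(name, role, s3_path, database, glue=None):
    """Create a crawler over the transformed data and kick it off."""
    if glue is None:
        import boto3  # real AWS call; requires configured credentials
        glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,
        Role=role,                        # the IAM role created earlier
        DatabaseName=database,            # e.g. the "spotify" database
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=name)
    return name

if __name__ == "__main__":
    create_and_start_crawler(
        "ndl_", "spotify-glue-role",      # role name is a placeholder
        "s3://my-spotify-de-project/datawarehouse/", "spotify")
```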
QUERYING WITH ATHENA: In this section we query the data through Amazon Athena. In Athena, open the Query Editor; this tool uses the SQL language. Before querying, a few settings need adjusting: first, create a new S3 bucket as the query output location.
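Queries can also be submitted programmatically. A hedged sketch with boto3: the table name (`datawarehouse`), the `artist_name` column, and the output bucket are assumptions; use whatever table the crawler actually registered and your own output bucket.

```python
def run_query(sql, database, output_s3, athena=None):
    """Submit a SQL query to Athena; results land in the output S3 location."""
    if athena is None:
        import boto3  # real AWS call; requires configured credentials
        athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]

# Table and column names below are assumptions; check what the crawler created.
EXAMPLE_SQL = """
SELECT artist_name, COUNT(*) AS n_tracks
FROM datawarehouse
GROUP BY artist_name
ORDER BY n_tracks DESC
LIMIT 10
"""

if __name__ == "__main__":
    print(run_query(EXAMPLE_SQL, "spotify",
                    "s3://my-athena-output-bucket/results/"))  # hypothetical bucket
```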
DATA VISUALIZATION WITH QUICKSIGHT: This tool helps you create proper data visualizations, similar to Power BI or Tableau. Now go to QuickSight. If this is your first time accessing QuickSight, you need to log in via your root account and create a QuickSight account. Please note that charges apply for this tool (it is quite expensive).