# Project Report
## Big data for Music Recommendation System
### Team Members:  
- Huỳnh Minh Thuận - 22110217  
- Trương Minh Thuật - 22110218  
- Nguyễn Phạm Anh Trí - 22110236  
- Nguyễn Minh Trí - 22110235  
- Nguyễn Đình Tiến - 22110230 

### Table of Contents
1. [Introduction](#1-introduction)  
2. [Data Collection and Ingestion](#2-data-collection-and-ingestion)  
    2.1 [Data Retrieval Functions and Execution Process](#21-data-retrieval-functions-and-execution-process)  
    2.2 [Daily Data Scraping and Storing Strategy](#22-daily-data-scraping-and-storing-strategy)
3. [Three-Layer Data Lake Processing](#3-three-layer-data-lake-processing)

### 1. Introduction 
- Music is an essential part of everyday life, offering both entertainment and emotional connection. Our team aims to build an end-to-end data pipeline that covers data collection, processing, storage, analysis, and reporting, and that feeds a music recommendation system driven by user input.

- **Data sources**: The initial data comes from https://kworb.net/itunes/extended.html, which lists the top 15,000 artist names and changes daily. We then use the Spotify API to retrieve each artist's information, albums, tracks, and track features, based on the list of artist names from the Kworb website.

- **Tools**:
    - **Python**: Main programming language.
    - **Docker**: Runs containers, ensuring consistent and scalable environments.
    - **MongoDB**: The database used for data storage.
    - **HDFS**: Part of the Hadoop architecture, used as the data lake.
    - **Snowflake**: Cloud-based data warehouse.
    - **PowerBI**: A tool for visualizing data and providing a comprehensive overview.
    - **Airflow**: A Python-based framework for scheduling and running tasks.

- **Architecture**:    
![Architecture diagram](./images/Architecture.png)
- **Link**: To explore the full source code, feel free to check out our GitHub repository:  
*https://github.com/mjngxwnj/Big-Data-for-Music-Recommendation-System*

### 2. Data Collection and Ingestion  
We start by collecting **15,000** artist names from **Kworb.net**, then use the **Spotify API** to fetch more music-related details. The data is stored in **MongoDB** for further processing and analysis.

#### 2.1 Data Retrieval Functions and Execution Process
##### 2.1.1 Get Data from **Kworb.net**  
- **Step 1**: We pass the **Kworb.net** URL to the pandas HTML reader (**pandas.read_html(url)**), which parses every table on the returned web page and returns them as a list of DataFrames.  

- **Step 2**: The information we need is in the first table, so we take that table and keep 2 columns: **Pos**, the artist's position in the ranking, and **Artist**, the artist's name.  

- **Step 3**: We save these 2 columns to **MongoDB** for use in the next steps.  
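
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `fetch_ranking_rows` and `rows_to_documents` are hypothetical, and the sketch assumes the first table on the page still exposes the `Pos` and `Artist` columns.

```python
KWORB_URL = "https://kworb.net/itunes/extended.html"


def fetch_ranking_rows(url=KWORB_URL):
    """Read the first table on the page and return (Pos, Artist) tuples."""
    # pandas is imported here so the helper below stays usable without it;
    # read_html also needs an HTML parser such as lxml to be installed.
    import pandas as pd

    tables = pd.read_html(url)  # one DataFrame per <table> on the page
    ranking = tables[0][["Pos", "Artist"]]
    return list(ranking.itertuples(index=False, name=None))


def rows_to_documents(rows):
    """Turn (Pos, Artist) tuples into plain dicts ready for MongoDB."""
    return [
        {"Pos": int(pos), "Artist": str(artist).strip()}
        for pos, artist in rows
    ]
```

The resulting list of dicts could then be written with pymongo's `collection.insert_many(documents)`; the database and collection names depend on the project setup and are not shown here.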

##### 2.1.2 Get Information of **Artist**:  
- **Step 1**: We read the artist names from the **MongoDB** database, after filtering out artists with no information. We use the spotipy library to connect to the Spotify API and retrieve each artist's information. Connecting requires 2 credentials: a client ID and a client secret.  

- **Step 2**: We use the **sp.search** function to look up each artist. It returns a dictionary containing the artist's information.  

- **Step 3**: We map the information to the corresponding columns and write it back to MongoDB.  
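
As a rough sketch of these steps, the snippet below builds a spotipy client from a client ID and client secret and flattens the first search hit into a record. The helper names and the output field names (`Artist_ID`, `Followers`, and so on) are assumptions made for illustration; the real column names may differ.

```python
def build_client(client_id, client_secret):
    """Create a spotipy client using the client-credentials flow."""
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(
        client_id=client_id, client_secret=client_secret
    )
    return spotipy.Spotify(auth_manager=auth)


def search_artist(sp, name):
    """Return a flat record for the best-matching artist, or None."""
    result = sp.search(q=f"artist:{name}", type="artist", limit=1)
    items = result["artists"]["items"]
    if not items:
        return None  # Spotify has no match for this name
    artist = items[0]
    return {
        "Artist_ID": artist["id"],
        "Artist_Name": artist["name"],
        "Genres": artist.get("genres", []),
        "Followers": artist.get("followers", {}).get("total"),
        "Popularity": artist.get("popularity"),
    }
```

Returning `None` for an empty result makes it easy to skip names that Spotify does not recognize before writing back to MongoDB.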
    
##### 2.1.3 Get Information of **Artist's Album** and **Track**:  

- **Step 1**: After obtaining the artist data, including the artist ID, we use the artist ID to get the artist's album IDs and save them in a list for later use.  

- **Step 2**: To reduce the number of API calls (and avoid overloading the API), we split the album ID list into smaller sublists and call the API once per sublist. With spotipy's batch album call (**albums**, which accepts up to 20 album IDs), we can get data for 20 albums and the tracks they contain in a single API call.  

- **Step 3**: We save all the data to the **MongoDB** database.  
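
The batching idea can be sketched as below. The sketch assumes a spotipy client `sp` and uses spotipy's `artist_albums`, `next`, and batch `albums` calls; the helper names `chunked`, `list_album_ids`, and `fetch_albums` are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def list_album_ids(sp, artist_id):
    """Collect every album ID of one artist, following result pages."""
    page = sp.artist_albums(artist_id, limit=50)
    album_ids = []
    while page:
        album_ids.extend(item["id"] for item in page["items"])
        page = sp.next(page) if page.get("next") else None
    return album_ids


def fetch_albums(sp, album_ids, batch_size=20):
    """Fetch full album objects, 20 IDs per API call."""
    albums = []
    for batch in chunked(album_ids, batch_size):
        response = sp.albums(batch)
        # Unknown IDs come back as None entries, so skip them.
        albums.extend(a for a in response["albums"] if a is not None)
    return albums
```

With this layout, 45 album IDs cost 3 API calls (20 + 20 + 5) instead of 45 single-album calls.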

##### 2.1.4 Get Information of **Track Feature**:  
- **Step 1**: Using the track IDs obtained in the previous step, we retrieve additional features for each track.  

- **Step 2**: Again to reduce the number of API calls, we split the track list into sublists and call the Spotify API once per sublist. With the **spotipy.audio_features** function, we can get the audio features of 100 tracks in a single API call.  

- **Step 3**: We save all the data to the **MongoDB** database.
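
A minimal sketch of the track-feature step, assuming a spotipy client `sp`. The function name `fetch_audio_features` is hypothetical, and the sketch assumes each returned feature object carries the track's `id`, which is used as the dictionary key.

```python
def fetch_audio_features(sp, track_ids, batch_size=100):
    """Return {track_id: features} using one API call per 100 tracks."""
    features = {}
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        for item in sp.audio_features(batch):
            # Tracks without audio features are returned as None.
            if item is not None:
                features[item["id"]] = item
    return features
```

Keying the result by track ID keeps the later join with the track collection simple and drops tracks for which Spotify returns no features.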

#### 2.2 Daily Data Scraping and Storing Strategy
##### 2.2.1 Initial Data Scraping and Storing  
- After the first day of data scraping, we realized that requesting data for 15,000 **artist names** in a single day was too much for the **Spotify API** and led to our access being blocked.  

- To address this, we split the 15,000 **artist names** over 3 days. Each day, we call the **Spotify API** for 5,000 **artist names**, get information about the **artists**, **albums**, **tracks**, and **track features**, and save all the results to a **CSV file**. Once all 15,000 artist names have been processed, we upload the entire **CSV** to **MongoDB** so that it integrates with the query data flow of subsequent days.  
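
The three-day split could look roughly like this. It is only a sketch under assumptions: `daily_batch` and `append_rows` are hypothetical helpers, and the CSV layout (one file that grows day by day, header written once) is one possible reading of the strategy above.

```python
import csv
from pathlib import Path


def daily_batch(names, day_index, batch_size=5000):
    """Return the slice of names to process on day 0, 1, 2, ..."""
    start = day_index * batch_size
    return names[start:start + batch_size]


def append_rows(csv_path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only once."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Because each day only appends, a failed run on day 2 does not require repeating the Spotify calls already made on day 1.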

##### 2.2.2 Subsequent Data Scraping and Storing  
- On the following days, if we crawled all 15,000 artist names again and then called the **Spotify API** for albums, tracks, and so on, we would risk duplicating data for popular artists such as **Taylor Swift**, because the name **Taylor Swift** is already in the **initial load**.  

- To avoid this, we scrape the 15,000 **daily artist names** and then perform a `Left Anti Join` against the 15,000 **old artist names** in **MongoDB** to identify **new artist names**. This way, only artists not already present in the database are processed.  
<p style="text-align: center;">
    <img src="./images/leftanti_join_artistname.png" alt="Image">
</p>  

- This also reduces the number of Spotify API calls, since only about 3,000 new artist names need to be processed each day.  
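
The semantics of the `Left Anti Join` can be illustrated in plain Python: keep every daily name that has no match among the stored names. This is a simplified sketch, not the project's implementation; the function name is hypothetical and names are compared by exact string match.

```python
def left_anti_join(daily_names, stored_names):
    """Return the daily names that do not appear in the stored names."""
    stored = set(stored_names)  # set lookup keeps the filter linear
    return [name for name in daily_names if name not in stored]


daily = ["Taylor Swift", "New Artist", "Drake"]
stored = ["Drake", "Taylor Swift"]
print(left_anti_join(daily, stored))  # ['New Artist']
```

If the names are held in PySpark DataFrames (an assumption about the pipeline), the same result can be obtained with `daily_df.join(old_df, on="Artist", how="left_anti")`, where `Artist` is the assumed name of the join column.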

- Next, the new artist names are saved to **MongoDB**, then read back from **MongoDB** to call the **Spotify API** for artist data, which is stored back in **MongoDB**. The same process applies to album and track data. An `execute_date` column is added to record the day on which each record was loaded.
<p style="text-align: center;">
    <img src="./images/daily_crawl_data.png" alt="Image">
</p>  

- As shown above, we query the **artist names** and **artists** that were crawled on a given day and use them to call the **Spotify API** for the corresponding **albums** and **tracks**.
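
The role of `execute_date` can be sketched as follows. The helper names are hypothetical, and storing the date as an ISO string (for example `2024-10-01`) is an assumption; the project may store it in a different format.

```python
from datetime import date


def tag_with_execute_date(documents, run_date):
    """Return copies of the documents stamped with the run date."""
    stamp = run_date.isoformat()
    return [{**doc, "execute_date": stamp} for doc in documents]


def execute_date_filter(run_date):
    """Build a MongoDB query filter for records loaded on one day."""
    return {"execute_date": run_date.isoformat()}
```

With pymongo, the filter would be passed to `collection.find(execute_date_filter(date.today()))` to select only the artists crawled that day before calling the Spotify API for their albums and tracks.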

### 3. Three-Layer Data Lake Processing