<div style="font-family: Arial, Helvetica, sans-serif;">
    <div style="display: flex;padding-top: 20px">
        <div><strong>Course:</strong> Introduction to Data Science</div>
    </div>
    <div style="display: flex;padding-top: 20px">
        <div style="padding-right: 10px;"><strong>Class:</strong> KHDL1</div>
        <div></div>
    </div>
    <div style="display: flex;padding-top: 20px">
        <div style="padding-right: 10px;"><strong>Group:</strong> 11 - HAHA</div>
    </div>
    <div>
        <div style="display: flex;padding-top: 20px">
            <div style="padding-right: 10px;"><strong>Members:</strong></div>
            <div></div>
        </div>
        <table style="font-size: 15px; display:flex;padding-top: 20px">
            <tr>
                <th>No.</th>
                <th>Student ID</th>
                <th>Name</th>
            </tr>
            <tr>
                <td>1</td>
                <td>22127008</td>
                <td style="text-align:left;">Đặng Châu Anh</td>
            </tr>
            <tr>
                <td>2</td>
                <td>22127014</td>
                <td style="text-align:left;">Nguyễn Kim Anh</td>
            </tr>
            <tr>
                <td>3</td>
                <td>22127147</td>
                <td style="text-align:left;">Đỗ Minh Huy</td>
            </tr>
            <tr>
                <td>4</td>
                <td>22127170</td>
                <td style="text-align:left;">Trần Dịu Huyền</td>
            </tr>
        </table>
    </div>
    <div style="font-size: 25px ;font-weight: 800; text-align: center;padding-top: 20px;">FINAL PROJECT</div>
    <div style="font-size: 20px ;font-weight: 800; text-align: center;padding-top: 20px;">SPOTIFY 2024 REWIND - DATA COLLECTION</div>
</div>

# **Data Description**
## **1. Objective**
Spotify, a leading music streaming platform, has gained significant popularity in Vietnam. Its charts offer near real-time insight into the evolving music preferences of Vietnamese listeners. This project involves three steps:
-    Gathering Spotify streaming data specific to Vietnam.
-    Analyzing trends, identifying popular songs, albums, and artists.
-    Presenting the findings in a way that provides insights into local music preferences.
## **2. Purpose**
The Vietnamese music industry has seen significant growth in 2024, with a surge in both the quantity and the quality of music releases. This project uses Spotify data to better understand the changing tastes of Vietnamese listeners. By identifying popular songs, albums, and emerging trends, it aims to provide insights that help industry professionals, artists, and fans stay up to date on current music preferences.
## **3. Data Source**
- The data are scraped from Spotify Charts, which publishes the weekly top 200 music chart for Vietnam: https://charts.spotify.com/charts/overview/vn
- The data spans January 1, 2024 to October 31, 2024.

# **Import Library**

- The data source for this project is a dynamic website, so we use `Selenium` to scrape it.
- We use `BeautifulSoup` to parse the HTML content.
- We use `pandas` to store the scraped data in a DataFrame.
- We use `time` to add delays between requests so that we do not exceed the site's request limit.
- We use `os` to work with the file system.

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd
from bs4 import BeautifulSoup
import os

# **Data Collection**

First, we set up ChromeDriver so that Selenium can control Chrome.

In [None]:
def initialize_chrome_driver(user_name="ACER"):
   """
   Initialize Chrome WebDriver with specific configurations
   
   Args:
       user_name (str): Windows username for Chrome profile path
       
   Returns:
       webdriver.Chrome: Configured Chrome WebDriver instance
   """
   # Set Chrome profile path based on Windows username
   chrome_profile_path = f"C:/Users/{user_name}/AppData/Local/Google/Chrome/User Data"
   
   # Configure Chrome options
   chrome_options = Options()
   chrome_options.add_argument("--no-sandbox")
   chrome_options.add_argument("window-size=1920,1080")
   chrome_options.add_argument(f"user-data-dir={chrome_profile_path}")
   chrome_options.add_argument("profile-directory=Default")
   
   # Configure download preferences
   chrome_options.add_experimental_option("prefs", {
       "download.default_directory": os.path.join(os.getcwd(), "Data"),
       "download.prompt_for_download": False,
       "profile.default_content_setting_values.automatic_downloads": 1,
       "profile.default_content_setting_values.popups": 0,
   })
   
   # Initialize and return WebDriver
   return webdriver.Chrome(
       # Your path that contains chromedriver
       service=Service("C:/Program Files/chromedriver/chromedriver.exe"), 
       options=chrome_options
   )

Next, we want the top 200 songs on Spotify throughout 2024, so we scrape the weekly chart for the last week of each month.

In [1]:
def calculate_week_details(month, last_day, months):
   """
   Build the Spotify charts URL for the weekly chart that covers the end of a month
   
   Args:
       month (int): Month whose final week is being located
       last_day (int): Day of the month on which the previous chart week ended
                       (0 if it ended on the last day of the previous month)
       months (dict): Number of days in each month, keyed by month number
       
   Returns:
       tuple: (month of the chart date, last_day value for the next call, url)
   """
   # First day of the week that contains the end of this month
   begin_day = ((months[month] - last_day - 1)//7)*7 + last_day + 1
   
   # Day of the month on which that week ends; 0 means it ends exactly on
   # the last day of this month, 1-6 means it spills into the next month
   last_day = 7 - (months[month] - begin_day) - 1
   save_last_day = last_day
   
   # A week ending exactly at month end is dated with the month's last day
   last_day = last_day if last_day > 0 else months[month]
   
   # A week that spills over is dated in the next month
   month = month + 1 if 1 <= last_day < 7 else month
   
   # Construct URL
   url = f"https://charts.spotify.com/charts/view/regional-vn-weekly/2024-{month:02d}-{last_day:02d}"
   
   return month, save_last_day, url
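
The same chart dates can also be derived with the standard `datetime` and `calendar` modules. The sketch below assumes that each weekly chart URL is dated by the Thursday on which the chart week ends, which is consistent with the `2024-02-01` URL used in this notebook. The names `chart_date_for_month_end` and `month_end_chart_urls` are hypothetical helpers and are not part of the notebook.

```python
import calendar
from datetime import date, timedelta

# Base URL taken from the notebook; the date suffix is appended below.
BASE_URL = "https://charts.spotify.com/charts/view/regional-vn-weekly"
THURSDAY = 3  # date.weekday() counts Monday as 0

def chart_date_for_month_end(year, month):
    """Return the first Thursday on or after the last day of the month.

    Assumption: weekly charts are dated by the Thursday ending the week.
    """
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    days_ahead = (THURSDAY - last_day.weekday()) % 7
    return last_day + timedelta(days=days_ahead)

def month_end_chart_urls(year=2024, end_month=10):
    """Build one chart URL per month, from January through end_month."""
    return [
        f"{BASE_URL}/{chart_date_for_month_end(year, m).isoformat()}"
        for m in range(1, end_month + 1)
    ]

if __name__ == "__main__":
    for url in month_end_chart_urls():
        print(url)
```

Working from calendar dates avoids having to carry `last_day` from one call to the next, at the cost of hard-coding the Thursday assumption.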

Finally, we use `Selenium` to automate the browser and download the chart data from the website. The downloaded CSV files are saved in the `./SOURCECODE/data` folder.

In [None]:
def download_spotify_charts(end_month=10):
   """
   Download Spotify charts data for specified months
   
   Args:
       end_month (int): Last month to download (default: 10)
   """
   # Dictionary of days in each month (2024 leap year)
   months = {1:31, 2:29, 3:31, 4:30, 5:31, 6:30, 7:31, 8:31, 9:30, 10:31, 11:30, 12:31}
   
   # Initialize Chrome
   browser = initialize_chrome_driver()
   
   # Initialize tracking variables
   last_day, month, current_month = 1, 1, 1
   
   # Process charts from January to specified end month
   while month <= end_month:
       if month == 1:
           url = "https://charts.spotify.com/charts/view/regional-vn-weekly/2024-02-01"
           month += 1
       else:
           month, last_day, url = calculate_week_details(month, last_day, months)
       
       # Download chart data
       browser.get(url)
       download_button = WebDriverWait(browser, 10).until(
           EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[data-encore-id="buttonTertiary"]'))
       )
       download_button.click()
       
       # Update month tracking
       if current_month < month:
           current_month += 1
       elif current_month == month:
           month += 1
           current_month += 1

download_spotify_charts(end_month=10)
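
Clicking the download button only starts a download, and the loop above moves to the next URL straight away. One way to make this more robust is to wait until the expected file appears before continuing. The sketch below assumes Chrome's convention of keeping in-progress downloads under a `.crdownload` suffix; `wait_for_download` is a hypothetical helper, not part of the notebook or of Selenium.

```python
import os
import time

def wait_for_download(directory, expected_name, timeout=30.0, poll_interval=0.5):
    """Poll `directory` until `expected_name` exists and no partial downloads remain.

    Returns True when the file is ready, False if `timeout` seconds pass first.
    Assumption: the browser marks unfinished downloads with '.crdownload'.
    """
    target = os.path.join(directory, expected_name)
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        has_partial = any(name.endswith(".crdownload") for name in names)
        if os.path.exists(target) and not has_partial:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)

# Possible use inside the download loop, right after download_button.click():
#     ok = wait_for_download(download_dir, "regional-vn-weekly-2024-02-01.csv")
#     if not ok:
#         print("Download did not finish in time")
```

Here `download_dir` stands for whichever folder Chrome is configured to download into.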

All the CSV data files share the name format `regional-vn-weekly-2024-xx-xx.csv`, where `xx-xx` is the month and day of the chart for the last week of each month in 2024.

All the data files are stored in the `./SOURCECODE/data` folder.

In the Data Exploration step, all the data files will be merged into a single file named `spotify-2024-vn.csv`, with an added `week` column indicating which week each row belongs to.
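
A minimal sketch of that merge, assuming the file name pattern described above: the date is read from each file name and stored in the `week` column. The function name `merge_weekly_charts` is hypothetical, and the actual Data Exploration step may do this differently.

```python
import os
import re

import pandas as pd

# Matches e.g. regional-vn-weekly-2024-02-01.csv and captures the date part.
FILE_PATTERN = re.compile(r"regional-vn-weekly-(\d{4}-\d{2}-\d{2})\.csv$")

def merge_weekly_charts(data_dir):
    """Concatenate all weekly chart files in data_dir, tagging rows with their week."""
    frames = []
    for name in sorted(os.listdir(data_dir)):
        match = FILE_PATTERN.match(name)
        if match is None:
            continue  # skip files that are not weekly chart downloads
        df = pd.read_csv(os.path.join(data_dir, name))
        df["week"] = pd.to_datetime(match.group(1))
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"No weekly chart files found in {data_dir}")
    return pd.concat(frames, ignore_index=True)

# Possible usage:
#     merged = merge_weekly_charts("./SOURCECODE/data")
#     merged.to_csv("./SOURCECODE/data/spotify-2024-vn.csv", index=False)
```

Because the merged file's name does not match the pattern, re-running the function on the same folder skips it rather than merging it back in.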

# **Verify Data**

Read all downloaded files into a single DataFrame.

In [9]:
data = pd.DataFrame()
for file in os.listdir('./SOURCECODE/data'):
    if file.endswith('.csv'):
        df = pd.read_csv(f'./SOURCECODE/data/{file}')
        data = pd.concat([data, df], ignore_index=True)  # Concatenate all csv files into one DataFrame

Take a look at the data information by using the `info()` function.

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   rank            1800 non-null   int64 
 1   uri             1800 non-null   object
 2   artist_names    1800 non-null   object
 3   track_name      1800 non-null   object
 4   source          1800 non-null   object
 5   peak_rank       1800 non-null   int64 
 6   previous_rank   1800 non-null   int64 
 7   weeks_on_chart  1800 non-null   int64 
 8   streams         1800 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 126.7+ KB


Verify that the data meets the project requirements: at least 1000 rows and 5 columns.

In [13]:
def verify_dataset(data):
    """
    Verify dataset for project requirements: at least 1000 rows and 5 columns
   
    Args:
        data (pd.DataFrame): Input dataset
    """
    assert data.shape[0] >= 1000, "Dataset must have at least 1000 rows"
    assert data.shape[1] >= 5, "Dataset must have at least 5 columns"
    print("Dataset verified successfully")
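
Note that `assert` statements are skipped when Python runs with the `-O` flag. As an alternative sketch, the same check can raise explicit exceptions. The name `check_requirements` and its parameters are hypothetical and not part of the notebook.

```python
import pandas as pd

def check_requirements(data, min_rows=1000, min_cols=5):
    """Raise ValueError if the DataFrame is smaller than the project requires.

    Returns the (rows, columns) shape when both checks pass.
    """
    n_rows, n_cols = data.shape
    if n_rows < min_rows:
        raise ValueError(f"Dataset has {n_rows} rows, need at least {min_rows}")
    if n_cols < min_cols:
        raise ValueError(f"Dataset has {n_cols} columns, need at least {min_cols}")
    return n_rows, n_cols

# Possible usage with the DataFrame built above:
#     rows, cols = check_requirements(data)
#     print(f"Dataset verified: {rows} rows, {cols} columns")
```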