# Spotify Playlist Scraper - Project Summary
This notebook scrapes a Spotify playlist using Selenium, retrieves track details via the Spotify API, stores data in PostgreSQL, and visualizes playlist popularity with a Plotly gauge chart. Track popularity is analyzed and displayed, showing whether the playlist is obscure, niche, known, popular, or a hit.

## Step 0: Install Necessary Packages and Set Up PostgreSQL Server
Install the required Python packages and set up a PostgreSQL database before running the scraping, storage, and visualization steps.

### 0.1 Set Up PostgreSQL Server in pgAdmin:

1. Launch pgAdmin and create a new server connection with the default settings (host: localhost, port: 5432, user: postgres).

2. Create a new database (e.g., spotify_db).

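3. Create a table (e.g., playlist_data) with SQL. This step is optional: the notebook later writes the table with `to_sql(..., if_exists='replace')`, which drops and recreates it using the DataFrame's own columns (`Index`, `Song_Title`, `Artist`, `Album`).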
```
CREATE TABLE playlist_data (
    index SERIAL PRIMARY KEY,
    song_title VARCHAR(255),
    artist VARCHAR(255),
    album VARCHAR(255)
);
```

### 0.2 Install Necessary Packages
Install required libraries for PostgreSQL interaction, Spotify API access, and Selenium automation:
- **psycopg2-binary**: For PostgreSQL database interaction
- **requests**: For HTTP requests to access the Spotify Web API
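- **spotipy**: For Spotify Web API integration (the client library imported in section 0.3)
- **sqlalchemy**: For SQL toolkit (database interactions)
- **selenium**: For automating web browser interaction
- **pandas**: For tabular data handling and CSV export
- **plotly**: For the popularity gauge chart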

Ensure these are installed before proceeding.

In [24]:
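# Installing psycopg2-binary (PostgreSQL driver), requests (HTTP), and spotipy (Spotify Web API client).
!pip install psycopg2-binary requests spotipy

# Installing pandas and Plotly, which later cells import for data handling and charting.
!pip install pandas plotly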

# Installing SQLAlchemy, a toolkit to interact with SQL databases (including PostgreSQL) and ORM support.
!pip install sqlalchemy

# Installing Selenium to automate web browser interactions in Python (used for scraping).
!pip install selenium

# Upgrading nbformat, which Plotly relies on to render figures inside Jupyter.
!pip install --upgrade nbformat
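
After installation, it can help to confirm that each package is importable under the name the notebook actually uses, since pip names and import names can differ (for example, `psycopg2-binary` installs the `psycopg2` module). The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the `REQUIRED_MODULES` list are illustrative assumptions, not part of the original notebook.

```python
from importlib.util import find_spec

# Import names used later in this notebook (assumed list; adjust as needed).
REQUIRED_MODULES = [
    "psycopg2", "requests", "spotipy", "sqlalchemy",
    "selenium", "pandas", "plotly",
]

def missing_modules(module_names):
    """Return the top-level module names that cannot be located."""
    return [name for name in module_names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
```

An empty result suggests the imports in Step 1 should succeed; any name listed still needs a `pip install`.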



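### 0.3 Establish Spotify API Credentials
Authenticates with the Spotify Web API using a Client ID and Client Secret (the client credentials flow) to access detailed song metadata. The `authenticate_spotify` helper below builds a Spotipy client; the pipeline in Step 7 obtains its token through the `requests`-based function from Step 5 instead.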

In [27]:
# Import Spotipy library for simplified Spotify Web API access
import os  # Read credentials from environment variables
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials  # Handles token-based client authentication

# Spotify API credentials, read from the environment so secrets stay out of the notebook.
# SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are the variable names Spotipy itself looks for.
client_id = os.environ.get("SPOTIPY_CLIENT_ID", "")
client_secret = os.environ.get("SPOTIPY_CLIENT_SECRET", "")

def authenticate_spotify(client_id, client_secret):
    """Authenticate with Spotify API using client credentials."""
    # Set up the authorization manager with the provided credentials
    auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    
    # Create a Spotipy client instance with the authorization manager
    sp = spotipy.Spotify(auth_manager=auth_manager)
    
    return sp  # Return the authenticated Spotify client


## Step 1: Import Required Packages
Imports the necessary libraries for web scraping, database interaction, and data visualization tasks.

In [28]:
# Import standard libraries
import time  # Time delays for loading web content

# Data manipulation and storage
import pandas as pd  # Data handling
import psycopg2  # PostgreSQL connection
import requests  # Sending HTTP requests

# SQLAlchemy for database interaction
from sqlalchemy import create_engine, text, inspect  # DB engine and query tools

# Selenium for web scraping automation
from selenium import webdriver
from selenium.webdriver.common.by import By  # Element locator
from selenium.webdriver.support.ui import WebDriverWait  # Wait conditions
from selenium.webdriver.support import expected_conditions as EC  # For dynamic loading

# Plotly for data visualization
import plotly.graph_objects as go  # Gauge chart plotting

## Step 2: Database Configuration and Helper Functions
Configures the PostgreSQL connection and provides helper functions to interact with the database, such as storing and retrieving data.

In [None]:
# Database connection URL with credentials
DB_URL = "postgresql://postgres:admin@localhost:5432/spotify_db"  # Update with your credentials

# Helper function to establish DB connection
def get_db_engine(db_url):
    """Returns SQLAlchemy engine for the provided DB URL."""
    return create_engine(db_url) # Connect to the PostgreSQL database

# Store DataFrame to PostgreSQL
def store_data_in_db(df, db_url, table_name='playlist_data'):
    """Store DataFrame in PostgreSQL, replacing the table if it already exists."""
    print(f"Storing data in PostgreSQL table '{table_name}'...")  # Print starting message
    try:
        # Create a database engine 
        engine = create_engine(db_url) # Connect to the PostgreSQL database
        
        # Check if the table exists
        inspector = inspect(engine)  # Inspector to check database schema
        if not inspector.has_table(table_name):  # If the table doesn't exist
            print(f"Table '{table_name}' does not exist. Creating it...")  # Notify user

        # Store the DataFrame in the table
        df.to_sql(table_name, engine, if_exists='replace', index=False)  # Save DataFrame to the table
        print(f"Data successfully stored in PostgreSQL table '{table_name}'.")  # Confirmation message
    except Exception as e:
        print(f"Error while storing data in DB: {e}")  # Catch any errors during storage

# Retrieve song titles from PostgreSQL
def get_song_titles_from_postgres(db_url, table_name):
    """Retrieve song titles from PostgreSQL."""
    engine = create_engine(db_url)  # Create engine to connect to the DB
    query = f"SELECT DISTINCT \"Song_Title\" FROM {table_name};"  # SQL query to get distinct song titles
    df = pd.read_sql(query, engine)  # Execute the query and store the result in a DataFrame
    return df["Song_Title"].tolist()  # Return the song titles as a list
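
The query above interpolates `table_name` directly into the SQL string, and it quotes `"Song_Title"` because PostgreSQL folds unquoted identifiers to lowercase. One way to make that interpolation safer is to validate identifiers before building the query. The sketch below is an assumption-laden illustration: `quote_identifier` and `build_titles_query` are hypothetical helpers, and the allowed character set is deliberately conservative.

```python
import re

# Conservative pattern: letters, digits, underscore; must not start with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def quote_identifier(name):
    """Validate a table or column name and wrap it in double quotes."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return f'"{name}"'

def build_titles_query(table_name, column="Song_Title"):
    """Build the DISTINCT-titles query from validated, quoted identifiers."""
    return (
        f"SELECT DISTINCT {quote_identifier(column)} "
        f"FROM {quote_identifier(table_name)};"
    )

if __name__ == "__main__":
    print(build_titles_query("playlist_data"))
```

The resulting string could be passed to `pd.read_sql` in place of the f-string in `get_song_titles_from_postgres`. Note that quoting makes the table name case-sensitive as well, so it has to match the name used when the table was written.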


## Step 3: Define Scrolling and Loading Function
Implements a function to scroll through the Spotify playlist, triggering the lazy loading of additional tracks.

In [30]:
# Define the scroll_and_load function to handle scrolling behavior
def scroll_and_load(driver):
    """Scroll to the bottom of the playlist and trigger loading more songs."""
    try:
        # Scroll down to the last visible row to trigger lazy loading
        last_element = driver.find_elements(By.XPATH, "//div[@role='row']")[-1]  # Get the last row element
        driver.execute_script("arguments[0].scrollIntoView(true);", last_element)  # Scroll the page to it
        time.sleep(3)  # Wait for new songs to load
    except IndexError:
        print("No more rows found to scroll.")  # In case no more rows are found
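
The scroll helper performs a single scroll; deciding when to stop is left to the caller. A possible stopping rule is "keep scrolling until the row count stops growing". The sketch below expresses that rule with plain callables so it can be tried without a browser; `scroll_until_stable`, `count_rows`, and `scroll_once` are hypothetical names, and treating one quiet round as the end of the list is an assumption that may stop too early on a slow connection.

```python
import time

def scroll_until_stable(count_rows, scroll_once, max_rounds=50, pause=0.0):
    """Call scroll_once() until count_rows() stops growing or max_rounds is hit.

    Returns the largest row count observed.
    """
    previous = count_rows()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give lazily loaded rows time to appear
        current = count_rows()
        if current <= previous:
            break  # nothing new appeared after this scroll
        previous = current
    return previous

if __name__ == "__main__":
    # Simulated page: each scroll reveals up to 20 more rows, capped at 51.
    state = {"rows": 10}
    total = scroll_until_stable(
        lambda: state["rows"],
        lambda: state.update(rows=min(state["rows"] + 20, 51)),
    )
    print(total)
```

With Selenium, `count_rows` could be `lambda: len(driver.find_elements(By.XPATH, "//div[@role='row']"))` and `scroll_once` could be `lambda: scroll_and_load(driver)`, which already includes its own 3-second wait.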

## Step 4: Scraping Functions
Defines functions to scrape Spotify playlist data, including handling popups, extracting song details, and saving the data to a CSV file.

In [31]:
# Scraping functions
chrome_options = webdriver.EdgeOptions()  # scrape_playlist launches Edge, so build Edge options (variable name kept)
chrome_options.add_argument("--inprivate")  # Open Edge in InPrivate mode to avoid cached data

def close_popup(driver):
    """Close any pop-up dialogs like cookie consent or login prompts."""
    try:
        close_button = driver.find_element(By.XPATH, "//button[text()='Close']")
        if close_button.is_displayed():
            close_button.click()  # Click the 'Close' button to dismiss the pop-up
            print("Popup dialog detected and closed.")
    except Exception:
        pass  # No pop-up to close, continue execution

def extract_song_data(driver, start_index):
    """Extract song data (title, artist, album) from the playlist."""
    songs = []  # List to hold extracted song data
    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@role='row']")))
    rows = driver.find_elements(By.XPATH, "//div[@role='row']")  # Get all rows that may contain songs

    # Iterate through the rows and extract details for each song starting from the given index
    for idx, row in enumerate(rows[start_index:], start=start_index):
        try:
            # Extract song details using helper function `extract_text`
            song_title = extract_text(row, ".//div[@aria-colindex='2']//div[contains(@class, 'standalone-ellipsis-one-line')]")
            artist_name = extract_text(row, ".//div[@aria-colindex='2']//a[@draggable='true']")
            album = extract_text(row, ".//div[@aria-colindex='3']//a")

            # Append extracted song details to the list
            songs.append({
                "Index": idx,
                "Song_Title": song_title,
                "Artist": artist_name,
                "Album": album
            })
        except Exception as e:
            print(f"Error extracting row {idx}: {e}")
    return songs

def extract_text(element, xpath):
    """Safely extracts text content from an XPath element."""
    try:
        sub_element = element.find_element(By.XPATH, xpath)
        return sub_element.text.strip() if sub_element else "N/A"
    except Exception:
        return "N/A"  # Return "N/A" if text can't be extracted

def scrape_playlist(spotify_playlist_url, output_csv):
    """Scrapes song data from a Spotify playlist using Selenium."""
    spotify_playlist_url = spotify_playlist_url.split('?')[0]  # Remove query parameters from URL
    print(f"Stripped URL: {spotify_playlist_url}")
    print("Starting to scrape playlist...")

    # Initialize the browser driver and open the Spotify playlist
    driver = webdriver.Edge(options=chrome_options)
    driver.get(spotify_playlist_url)
    print("Page loaded. Waiting for content...")
    time.sleep(10)  # Allow page to load

    close_popup(driver)  # Handle any popups like cookies or login prompts

    all_songs, seen_songs = [], set()  # List of collected songs and a set of keys used to skip duplicates
    global_index = 1  # Global index to track the position in the playlist

    # Loop through the playlist and extract songs until all are loaded
    while True:
        print(f"Extracting songs starting from index {global_index}...")  # Inform user about data extraction
        songs = extract_song_data(driver, global_index)  # Extract song data from the page

        # Add new songs to the list if they haven't been seen before
        for song in songs:
            key = (song["Song_Title"], song["Artist"], song["Album"])
            if key not in seen_songs:
                seen_songs.add(key)
                all_songs.append(song)

        global_index += len(songs)  # Update the global index
        print(f"Loaded {len(songs)} new songs...")

        scroll_and_load(driver)  # Scroll to load more songs

        # If no new songs were loaded, break the loop
        if len(driver.find_elements(By.XPATH, "//div[@role='row']")) == global_index:
            print("All songs loaded.")
            break

        print("Waiting for new songs to load...")  # Inform user during wait
        time.sleep(3)  # Pause to allow new songs to load

    driver.quit()  # Close the browser session after scraping
    print("Scraping complete.")
    return save_to_csv(all_songs, output_csv)  # Save the data to a CSV file

def save_to_csv(songs, output_csv):
    """Clean and save song data to CSV."""
    print("Saving data to CSV...")
    df = pd.DataFrame(songs)
    df = df[df["Song_Title"] != "N/A"].drop_duplicates(subset=["Song_Title", "Artist", "Album"]).sort_values(by="Index")
    df.to_csv(output_csv, index=False)
    print(f"Data saved to {output_csv}")
    return df
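
`scrape_playlist` strips query parameters with `split('?')`. An alternative is to parse the URL with the standard library, which also drops fragments and makes it easy to pull out the playlist ID. This is a minimal sketch; the helper names are hypothetical, and the `si=abc123` value in the demo is a made-up tracking parameter.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_playlist_url(url):
    """Drop the query string and fragment from a playlist URL."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))

def playlist_id_from_url(url):
    """Return the path segment after 'playlist', or None if there is none."""
    segments = [s for s in urlsplit(url.strip()).path.split("/") if s]
    if "playlist" in segments:
        position = segments.index("playlist")
        if position + 1 < len(segments):
            return segments[position + 1]
    return None

if __name__ == "__main__":
    raw = "https://open.spotify.com/playlist/37i9dQZF1DXe3opFF4aPDr?si=abc123"
    print(normalize_playlist_url(raw))
    print(playlist_id_from_url(raw))
```

Returning `None` for URLs without a `playlist` segment gives the caller a simple way to reject album or track links before opening a browser.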


## Step 5: Authenticate and Fetch Spotify Song Metadata
This step authenticates with the Spotify API and retrieves detailed metadata for each song in the playlist. It consists of:

1. `get_spotify_access_token`: Authenticates using client credentials and returns an access token for API requests.

2. `get_song_data_from_spotify`: Uses the access token to fetch song details (title, artist, album, popularity, and Spotify URL) for each track.

This metadata adds a popularity score and a Spotify URL to each scraped song. Because the search uses only the song title with `limit=1`, the first match returned may be a different recording that shares the same title.

In [32]:
def get_spotify_access_token(client_id, client_secret):
    """Request an access token from Spotify using the client credentials grant."""
    url = "https://accounts.spotify.com/api/token"  # Spotify API token endpoint
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    data = {"grant_type": "client_credentials"}  # Request for client credentials grant
    response = requests.post(url, headers=headers, data=data, auth=(client_id, client_secret))
    
    if response.status_code == 200:
        # Return access token if authentication is successful
        return response.json()["access_token"]
    else:
        raise Exception(f"Failed to authenticate with Spotify API: {response.json()}")

def get_song_data_from_spotify(song_title, access_token):
    """Search for a song on Spotify and return its details."""
    url = "https://api.spotify.com/v1/search"  # Spotify API search endpoint
    headers = {"Authorization": f"Bearer {access_token}"}  # Set authorization header with access token
    params = {"q": song_title, "type": "track", "limit": 1}  # Search query for a track (song)
    response = requests.get(url, headers=headers, params=params)  # Send the GET request to the API

    if response.status_code == 200:
        results = response.json()
        if results["tracks"]["items"]:
            track = results["tracks"]["items"][0]
            return {
                "Song_Title": track["name"],
                "Artist": ", ".join([artist["name"] for artist in track["artists"]]),  # Get the artists' names
                "Album": track["album"]["name"],
                "Spotify URL": track["external_urls"]["spotify"],  # Get the URL to the track on Spotify
                "Popularity": track["popularity"]  # Include the track's popularity
            }
        else:
            return None  # Return None if no results are found
    else:
        raise Exception(f"Failed to search for song: {response.json()}")  # Raise error if request fails
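
Since the scraper already collects the artist for each row, the search could be narrowed with the `track:` and `artist:` field filters that the Spotify search endpoint accepts in the `q` parameter. The sketch below only builds the parameter dictionary; `build_search_params` is a hypothetical helper, and how well the filters behave for every title (for example, titles with long parenthetical suffixes) is not verified here.

```python
def build_search_params(song_title, artist=None, limit=1):
    """Build query parameters for the Spotify search endpoint.

    When an artist is known, the track: and artist: field filters narrow
    the match; otherwise the plain title is used, as in the function above.
    """
    if artist and artist != "N/A":  # "N/A" is the scraper's placeholder value
        query = f"track:{song_title} artist:{artist}"
    else:
        query = song_title
    return {"q": query, "type": "track", "limit": limit}

if __name__ == "__main__":
    # "Example Artist" is a placeholder, not data from the playlist.
    print(build_search_params("Letting Go", "Example Artist"))
    print(build_search_params("小半"))
```

The returned dictionary has the same shape as `params` in `get_song_data_from_spotify`, so it could be passed to `requests.get` unchanged.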


## Step 6: Plotting a Popularity Gauge Chart
Analyzes the playlist's popularity score and visualizes it using a gauge chart, highlighting popularity ranges from obscure to hit.

In [None]:
def analyze_playlist(db_url, table_name):
    """Analyze the playlist data stored in the PostgreSQL table."""
    from sqlalchemy import create_engine
    import pandas as pd

    # Create a database engine
    engine = create_engine(db_url) # Connect to the PostgreSQL database

    # Query the data from the table
    query = f"SELECT * FROM {table_name};"
    df = pd.read_sql(query, engine) # Execute the query and store the result in a DataFrame

    # Perform analysis (e.g., calculate average popularity)
    if 'Popularity' in df.columns:
        avg_popularity = df['Popularity'].mean()
        print(f"Average popularity of songs in the playlist: {avg_popularity:.2f}")
    else:
        print("The 'Popularity' column is not found in the table.")
        
# Plotting a gauge chart for Popularity Score
def plot_popularity_gauge(df):
    # Calculate average, minimum, and maximum popularity values from the DataFrame
    avg_popularity = df['Popularity'].mean()  # Average popularity score for the playlist
    min_popularity = df['Popularity'].min()  # Minimum popularity score
    max_popularity = df['Popularity'].max()  # Maximum popularity score

    # Create the gauge chart    
    fig = go.Figure(go.Indicator(
        mode="gauge+number+delta",  # Display gauge, number, and delta (change from minimum)
        value=avg_popularity,  # Set the needle position to the average popularity score
        delta={'reference': min_popularity, 'increasing': {'color': "green"}},  # Show delta (change) from the minimum, color green if increasing
        gauge={
            'axis': {'range': [0, 100], 'tickwidth': 1, 'tickcolor': "white"},  # Define the range from 0 to 100 and customize ticks
            'bgcolor': "white",  # Set the background color of the gauge
            'borderwidth': 2,  # Define border thickness
            'bordercolor': "gray",  # Set border color to gray
            'steps': [  # Define the color ranges for different popularity tiers
                {'range': [0, 25], 'color': "#B3B3B3", 'name': "Obscure"},  # Obscure range (0-25)
                {'range': [25, 50], 'color': "#808080", 'name': "Niche"},  # Niche range (25-50)
                {'range': [50, 70], 'color': "#535353", 'name': "Known"},  # Known range (50-70)
                {'range': [70, 85], 'color': "#1ED760", 'name': "Popular"},  # Popular range (70-85)
                {'range': [85, 100], 'color': "#1DB954", 'name': "Hit"},  # Hit range (85-100)
            ],
            'threshold': {  # Threshold line for visual indication of the current value
                'line': {'color': "red", 'width': 4},  # Red threshold line
                'thickness': 0.75,  # Thickness of the threshold line
                'value': avg_popularity  # Set the threshold line to the average popularity
            },
            'bar': {'color': 'cyan',  # Color of the value bar that fills up to the average
                    'thickness' : 0.5      # Bar thickness as a fraction of the gauge width (between 0 and 1)
                   },
            'shape': "angular"  # Draw a dial-style (angular) gauge rather than a bullet bar
        },
        title={
            'text': "Average Popularity Score",  # Title displayed at the top of the gauge
            'font': {'size': 24}  # Font size for the title
        },
        number={'font': {'size': 28}, 'suffix': ""},  # Display the average score as a number with no suffix
        domain={'x': [0, 1], 'y': [0, 1]}  # Set the domain of the gauge chart (full width and height)
    ))

    # Add multi-line annotation (legend) below the chart for clarity
    fig.update_layout(
        paper_bgcolor="white",  # Set the paper background color to white
        font={'color': "black"},  # Set the font color for the chart to black
        annotations=[  # Add a text annotation as a legend below the chart
            dict(
                x=0.5,  # Position the annotation in the center horizontally
                y=-0.2,  # Lower the annotation to position it below the chart
                showarrow=False,  # Don't show an arrow for this annotation
                text="🎶 <b>Popularity Scale:</b><br>"
                     "0–25: Obscure | 25–50: Niche | 50–70: Known | 70–85: Popular | 85–100: Hit",  # Text description for each range
                font=dict(size=12, color="blue"),  # Font size and color for the annotation text
                xref="paper",  # Set x-axis reference to the paper coordinate system
                yref="paper",  # Set y-axis reference to the paper coordinate system
                align="center"  # Center the annotation text
            )
        ]
    )

    fig.show()  # Display the gauge chart
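
The gauge colors each tier, but the code never reports the tier as text. A small helper can map the average score to the same labels. This sketch assumes each range includes its lower bound and excludes its upper bound (so 25 counts as Niche), a choice the original does not state; `classify_popularity` and `POPULARITY_TIERS` are hypothetical names.

```python
# Exclusive upper bounds and labels mirroring the gauge steps above.
POPULARITY_TIERS = [
    (25, "Obscure"),
    (50, "Niche"),
    (70, "Known"),
    (85, "Popular"),
    (100, "Hit"),
]

def classify_popularity(score):
    """Map a 0-100 popularity score to its tier label."""
    if not 0 <= score <= 100:
        raise ValueError(f"Popularity must be between 0 and 100, got {score}")
    for upper_bound, label in POPULARITY_TIERS:
        if score < upper_bound:
            return label
    return "Hit"  # a score of exactly 100 belongs to the top tier

if __name__ == "__main__":
    for sample in (12.5, 25, 69.9, 85, 100):
        print(sample, classify_popularity(sample))
```

Calling `classify_popularity(df['Popularity'].mean())` would give a label that could be shown in the chart title or printed next to the average.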

## Step 7: Executing the Entire Automation with the `main()` Function
The `main()` function orchestrates the whole pipeline:
- It scrapes the playlist from Spotify and saves it to a CSV file (`spotify_playlist.csv`). If no URL is entered, it skips scraping and reuses the data already in PostgreSQL.
- It stores the scraped data in the PostgreSQL table `playlist_data`.
- It retrieves song metadata (including popularity scores) from the Spotify Web API and saves the enriched data to a second CSV (`playlist_metadata.csv`).
- It stores the metadata in the PostgreSQL table `spotify_song_data`.
- Finally, it prints the average popularity and visualizes it with a gauge chart.

In [36]:
# Main execution logic
def main():
    # Step 1: Prompt for Spotify playlist URL and scrape the playlist (if needed)
    spotify_url = input("Please input Spotify playlist URL here: ")
    if spotify_url:
        # Scrape the playlist and save to CSV ("spotify_playlist.csv")
        playlist_df = scrape_playlist(spotify_url, "spotify_playlist.csv")
        print(f"Scraped {len(playlist_df)} songs from the Spotify playlist.")

        # Step 2: Store scraped playlist DataFrame to PostgreSQL (Table: 'playlist_data')
        store_data_in_db(playlist_df, DB_URL, 'playlist_data')
    else:
        print("No Spotify URL provided. Proceeding with existing data in PostgreSQL.")

    # Step 3: Retrieve song titles from PostgreSQL
    song_titles = get_song_titles_from_postgres(DB_URL, 'playlist_data')
    print(f"Retrieved {len(song_titles)} song titles from PostgreSQL.")

    # Step 4: Authenticate with Spotify API
    access_token = get_spotify_access_token(client_id, client_secret)

    # Step 5: Fetch metadata from Spotify API for each song
    song_data = []
    for song_title in song_titles:
        try:
            data = get_song_data_from_spotify(song_title, access_token)
            if data:
                song_data.append(data)
                print(f"Retrieved data for: {song_title}")
            else:
                print(f"No data found for: {song_title}")
        except Exception as e:
            print(f"Error retrieving data for {song_title}: {e}")

    # Step 6: Convert fetched metadata to DataFrame
    song_data_df = pd.DataFrame(song_data)

    # Step 7: Save metadata to second CSV file
    song_data_df.to_csv("playlist_metadata.csv", index=False)
    print("Saved enriched Spotify metadata to 'playlist_metadata.csv'.")

    # Step 8: Store metadata DataFrame to PostgreSQL (Table: 'spotify_song_data')
    store_data_in_db(song_data_df, DB_URL, 'spotify_song_data')

    # Step 9: Analyze the data and plot a popularity gauge chart
    analyze_playlist(DB_URL, 'spotify_song_data')
    plot_popularity_gauge(song_data_df)

if __name__ == "__main__":
    main()


Stripped URL: https://open.spotify.com/playlist/37i9dQZF1DXe3opFF4aPDr
Starting to scrape playlist...
Page loaded. Waiting for content...
Extracting songs starting from index 1...
Loaded 51 new songs...
All songs loaded.
Scraping complete.
Saving data to CSV...
Data saved to spotify_playlist.csv
Scraped 51 songs from the Spotify playlist.
Storing data in PostgreSQL table 'playlist_data'...
Data successfully stored in PostgreSQL table 'playlist_data'.
Retrieved 51 song titles from PostgreSQL.
Retrieved data for: 你要的全拿走
Retrieved data for: 無人知曉
Retrieved data for: 我還是愛著你
Retrieved data for: 輸情歌
Retrieved data for: 給我一個理由忘記
Retrieved data for: 一個人想著一個人
Retrieved data for: 讓我留在你身邊
Retrieved data for: 關於愛的定義
Retrieved data for: 小半
Retrieved data for: 那些年
Retrieved data for: 我多喜歡你,你會知道(網劇<致我們單純的小美好>推廣曲)
Retrieved data for: 最後一次
Retrieved data for: Letting Go
Retrieved data for: 偉大的渺小
Retrieved data for: 匆匆那年
Retrieved data for: 沒那麽簡單
Retrieved data for: 連名帶姓
Retrieved data for: Why You Gonna