# Part 01: Feature Backfill for **Spotify Sweden Daily Top 200**

## Goal
- Backfill daily chart observations (Top 200, region = Sweden)
- Enrich with Spotify Web API metadata (optional)
- Validate the dataset
- Push to **Hopsworks Feature Store** as a time-travel enabled Feature Group

✅ Output Feature Group example: `spotify_se_daily_top200_v1`

## Imports

In [2]:
# If you run this in a clean environment, you may need:
# !pip install pandas requests tqdm python-dotenv hopsworks spotipy great-expectations

import sys
import os
import re
import json
import time
import math
import lxml
import requests
from pathlib import Path
from dotenv import load_dotenv
import pandas as pd
from tqdm.auto import tqdm
from datetime import datetime, timedelta, date

root_dir = Path().absolute()
# Step up to the repository root when the notebook runs from a subfolder
if root_dir.name == "spotify":
    root_dir = root_dir.parent
if root_dir.name == "notebooks":
    root_dir = root_dir.parent

env_path = root_dir / ".env"
print("Loading .env from:", env_path)

load_dotenv(env_path)

pd.set_option("display.max_columns", 200)

Loading .env from: C:\Users\lppap\Documents\master\scalable_ML\ID2223-Final-Project-Spotify\.env


In [3]:
# TODO: PASS TO ENV

# Region and chart type
REGION = "se"          # Sweden
FREQUENCY = "daily"    # daily charts
CHART = "regional"     # Top 200 chart pages on spotifycharts.com

# Backfill window
START_DATE = date(2024, 1, 1)
END_DATE   = date(2024, 3, 31)   # inclusive in our loop

# Safety: top N rows (Top 200)
TOP_N = 200

# Optional: enrich with Spotify Web API
ENRICH_WITH_SPOTIFY_API = True
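
The `TODO: PASS TO ENV` note could be addressed along the following lines. This is a minimal sketch, not the notebook's actual configuration code: the helper `load_backfill_config` and the variable names `CHART_REGION`, `BACKFILL_START` and `BACKFILL_END` are hypothetical, and the defaults simply mirror the constants in the cell above.

```python
import os
from datetime import date


def load_backfill_config(env=None) -> dict:
    """Read backfill settings from environment variables, with defaults.

    The variable names are hypothetical; the defaults mirror the
    hard-coded constants from the configuration cell.
    """
    env = os.environ if env is None else env
    start = date.fromisoformat(env.get("BACKFILL_START", "2024-01-01"))
    end = date.fromisoformat(env.get("BACKFILL_END", "2024-03-31"))
    if end < start:
        raise ValueError("BACKFILL_END must not be earlier than BACKFILL_START")
    flag = env.get("ENRICH_WITH_SPOTIFY_API", "true").lower()
    return {
        "region": env.get("CHART_REGION", "se").lower(),
        "start_date": start,
        "end_date": end,
        # "1", "true" or "yes" (case-insensitive) enables enrichment
        "enrich": flag in {"1", "true", "yes"},
    }


# Passing a plain dict instead of os.environ makes the function easy to try out
config = load_backfill_config({"CHART_REGION": "SE", "BACKFILL_END": "2024-02-29"})
print(config)
```

Accepting the mapping as an argument keeps the function testable without touching the real process environment.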

## Validate required secrets / environment variables

You can store secrets locally in a `.env` file or in Hopsworks Secrets.

**Needed if you enable enrichment**:
- `SPOTIFY_CLIENT_ID`
- `SPOTIFY_CLIENT_SECRET`

In [4]:
def require_env(var_name: str) -> str:
    val = os.getenv(var_name)
    if not val:
        raise ValueError(f"Missing environment variable: {var_name}")
    return val

if ENRICH_WITH_SPOTIFY_API:
    # These must exist for Spotipy client credentials flow
    _ = require_env("SPOTIFY_CLIENT_ID")
    _ = require_env("SPOTIFY_CLIENT_SECRET")

print("✅ Environment looks OK (or enrichment disabled).")


✅ Environment looks OK (or enrichment disabled).


## Download Spotify Charts Data

Spotify does not provide chart rankings (Daily Top 200 / Weekly Top 50) through the official Spotify Web API. While the API exposes track metadata and audio features, chart positions are not available programmatically.

For this reason, we use Kworb, a public and widely used mirror of Spotify charts. Kworb publishes daily and weekly Spotify rankings as HTML tables that can be reliably parsed and are updated shortly after the official Spotify Charts release.

The Kworb daily page only shows the most recent daily chart. It does not expose earlier dates, so historical daily charts cannot be retrieved from it.
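
Because only the latest chart is available, a historical series has to be accumulated by saving each day's snapshot as it is fetched. The sketch below shows one way to do that idempotently. It is an illustration under stated assumptions: the helper `append_snapshot` is hypothetical, the key `(chart_date, region, rank)` is an assumed primary key, and the rows are placeholders rather than real chart data.

```python
import pandas as pd

KEY_COLUMNS = ["chart_date", "region", "rank"]  # assumed primary key


def append_snapshot(history: pd.DataFrame, snapshot: pd.DataFrame) -> pd.DataFrame:
    """Add one day's chart to the history, replacing rows that were re-fetched."""
    if history.empty:
        combined = snapshot.copy()
    else:
        combined = pd.concat([history, snapshot], ignore_index=True)
    return (
        combined.drop_duplicates(subset=KEY_COLUMNS, keep="last")
        .sort_values(KEY_COLUMNS)
        .reset_index(drop=True)
    )


def make_day(day: str) -> pd.DataFrame:
    # Placeholder rows purely for illustration (not real chart data)
    return pd.DataFrame({
        "chart_date": pd.to_datetime([day, day]),
        "region": ["se", "se"],
        "rank": [1, 2],
        "track_name": ["Track A", "Track B"],
    })


history = append_snapshot(pd.DataFrame(), make_day("2024-01-01"))
history = append_snapshot(history, make_day("2024-01-02"))
history = append_snapshot(history, make_day("2024-01-02"))  # re-run of the same day
print(len(history))
```

Running the same day twice leaves the history unchanged, which matters if the notebook is scheduled and occasionally retried.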

Ingestion process:
- Build the chart URL for the selected region (daily or weekly).
- Download the HTML page from Kworb.
- Parse the chart table using `pandas.read_html`.
- Normalize column names and map them to a consistent schema.
- Split artist and track names.
- Clean numeric fields and attach metadata (date, region).
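
The last three steps can be tried without network access on a small hand-written frame. This is a sketch, not the notebook's fetch function: the column names follow those the fetch functions in this notebook expect from Kworb (`Pos`, `Artist and Title`, `Streams`), and the stream values are placeholders rather than real chart data.

```python
import pandas as pd

raw = pd.DataFrame({
    "Pos": ["1", "2"],
    "Artist and Title": [
        "Wham! - Last Christmas",
        "Mariah Carey - All I Want for Christmas Is You",
    ],
    "Streams": ["1,000", "900"],  # placeholder values
})

# Normalize column names and map them to the target schema
df = raw.copy()
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.rename(columns={"pos": "rank", "artist_and_title": "track_raw"})

# Split "Artist - Title" on the first separator only
parts = df["track_raw"].str.split(" - ", n=1, expand=True)
df["artist_name"] = parts[0].str.strip()
df["track_name"] = parts[1].str.strip()

# Clean numeric fields and attach metadata
df["rank"] = pd.to_numeric(df["rank"], errors="coerce")
df["streams"] = pd.to_numeric(
    df["streams"].str.replace(",", "", regex=False), errors="coerce"
)
df["region"] = "se"

print(df[["rank", "artist_name", "track_name", "streams", "region"]])
```

Splitting with `n=1` keeps titles that themselves contain " - " intact, since only the first separator is used.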

Important Properties:

| Field         | Description                           |
| ------------- | ------------------------------------- |
| `rank`        | Position in the chart (1 = best)      |
| `track_name`  | Song title                            |
| `artist_name` | Primary artist                        |
| `streams`     | Number of streams in the chart period |
| `days`        | Days on chart (daily charts)          |
| `weeks`       | Weeks on chart (weekly charts)        |
| `peak`        | Best historical chart position        |
| `region`      | Country code (e.g. `se`)              |
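
The properties in this table also suggest simple checks for the validation step listed in the goals. The function below is a plain-pandas sketch for a single chart snapshot. It is not Great Expectations and not the notebook's own validation code: the name `validate_chart` and the specific rules are assumptions, and the rows are placeholders.

```python
import pandas as pd


def validate_chart(df: pd.DataFrame, top_n: int = 200) -> list[str]:
    """Return human-readable problems; an empty list means the snapshot passed."""
    required = {"rank", "track_name", "artist_name", "streams", "region"}
    missing = required - set(df.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    if not df["rank"].between(1, top_n).all():
        problems.append(f"rank outside 1..{top_n}")
    if df["rank"].duplicated().any():
        problems.append("duplicate rank values")
    if (df["streams"] < 0).any():
        problems.append("negative stream counts")
    if df["track_name"].isna().any():
        problems.append("missing track names")
    return problems


# Placeholder rows purely for illustration (not real chart data)
good = pd.DataFrame({
    "rank": [1, 2],
    "track_name": ["Track A", "Track B"],
    "artist_name": ["Artist A", "Artist B"],
    "streams": [1000, 900],
    "region": ["se", "se"],
})
bad = good.assign(rank=[0, 0])

print(validate_chart(good))
print(validate_chart(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to report everything that is wrong with a snapshot in a single run.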


In [15]:
def kworb_url(region: str, frequency: str) -> str:
    if region == "global":
        return f"https://kworb.net/spotify/{frequency}.html"
    return f"https://kworb.net/spotify/country/{region}_{frequency}.html"

def _normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df

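def _split_artist_track(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # avoid mutating the caller's frame
    s = df["track_raw"].astype(str)

    # Try an en dash separator first, then fall back to a plain hyphen
    parts = s.str.split(" – ", n=1, expand=True)
    if parts.shape[1] < 2:
        parts = s.str.split(" - ", n=1, expand=True)

    if parts.shape[1] == 2:
        df["artist_name"] = parts[0].str.strip()
        df["track_name"] = parts[1].str.strip()
    else:
        # No separator found: keep the raw text as the title and leave the
        # artist empty so the later column selection does not raise KeyError
        df["artist_name"] = None
        df["track_name"] = s

    return df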

In [16]:
def fetch_kworb_daily_top200(region: str) -> pd.DataFrame:
    url = kworb_url(region, "daily")
    df = pd.read_html(url)[0]

    df = _normalize_columns(df)

    df = df.rename(columns={
        "pos": "rank",
        "artist_and_title": "track_raw",
        "days": "days",
        "pk": "peak",
        "streams": "streams",
        "total": "total_streams",
    })

    df = _split_artist_track(df)
    # Kworb only exposes the latest chart, so the fetch date stands in for the chart date
    df["chart_date"] = pd.to_datetime(date.today())
    df["region"] = region

    for c in ["rank", "streams", "days", "peak"]:
        df[c] = pd.to_numeric(df[c], errors="coerce")

    df = (
        df.dropna(subset=["rank"])
          .sort_values("rank")
          .head(200)
          .reset_index(drop=True)
    )

    return df[
        ["rank", "track_name", "artist_name",
         "streams", "days", "peak",
         "chart_date", "region"]
    ]


In [17]:
def fetch_kworb_weekly_top50(region: str) -> pd.DataFrame:
    url = kworb_url(region, "weekly")
    df = pd.read_html(url)[0]

    df = _normalize_columns(df)

    df = df.rename(columns={
        "pos": "rank",
        "artist_and_title": "track_raw",
        "wks": "weeks",
        "pk": "peak",
        "streams": "streams",
        "total": "total_streams",
    })

    df = _split_artist_track(df)

    # NOTE: this is the fetch date, not the actual first day of the chart week
    df["week_start"] = pd.to_datetime(date.today())
    df["region"] = region

    for c in ["rank", "streams", "weeks", "peak"]:
        df[c] = pd.to_numeric(df[c], errors="coerce")

    df = (
        df.dropna(subset=["rank"])
          .sort_values("rank")
          .head(50)
          .reset_index(drop=True)
    )

    return df[
        ["rank", "track_name", "artist_name",
         "streams", "weeks", "peak",
         "week_start", "region"]
    ]


In [19]:
# Latest daily and weekly Spotify chart data (fetched from Kworb)
df_daily = fetch_kworb_daily_top200(REGION)
df_weekly = fetch_kworb_weekly_top50(REGION)

In [22]:
# Universe of tracks (unique per artist + track)
df_tracks_universe = (
    df_weekly[["track_name", "artist_name"]]
    .dropna()
    .drop_duplicates()
    .reset_index(drop=True)
)

print(f"Unique tracks to resolve: {len(df_tracks_universe)}")
print(len(df_weekly))
df_tracks_universe.head()


Unique tracks to resolve: 50
50


|     | track_name                        | artist_name  |
| --- | --------------------------------- | ------------ |
| 0   | Last Christmas                    | Wham!        |
| 1   | All I Want for Christmas Is You   | Mariah Carey |
| 2   | Rockin' Around The Christmas Tree | Brenda Lee   |
| 3   | Tänd ett ljus                     | Triad        |
| 4   | Snowman                           | Sia          |


## Download YouTube Charts Data

In [25]:
from datetime import date, timedelta

def generate_week_starts(
    end_date: date,
    n_weeks: int,
):
    # YouTube Charts weeks start on Friday
    # We align to previous Friday
    d = end_date
    while d.weekday() != 4:  # Friday = 4
        d -= timedelta(days=1)

    return [d - timedelta(weeks=i) for i in range(n_weeks)]
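
The Friday alignment can also be written without a loop, using modular arithmetic on `date.weekday()`. The sketch below is an equivalent alternative rather than a replacement: the names `previous_friday` and `week_starts` are hypothetical, and the Friday convention is taken from the comment in the cell above.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday(): Monday = 0 ... Sunday = 6


def previous_friday(d: date) -> date:
    """Return d itself if it is a Friday, otherwise the most recent Friday before it."""
    return d - timedelta(days=(d.weekday() - FRIDAY) % 7)


def week_starts(end_date: date, n_weeks: int) -> list[date]:
    """List n_weeks Friday week starts, newest first."""
    first = previous_friday(end_date)
    return [first - timedelta(weeks=i) for i in range(n_weeks)]


# 2024-02-14 is a Wednesday, so the first week start is Friday 2024-02-09
print(week_starts(date(2024, 2, 14), 3))
# [datetime.date(2024, 2, 9), datetime.date(2024, 2, 2), datetime.date(2024, 1, 26)]
```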

In [30]:
import requests
import pandas as pd

def fetch_youtube_weekly_chart(
    country: str,
    week_start: date,
    top_n: int = 100,
) -> pd.DataFrame:

    url = (
        "https://charts.youtube.com/api/charts/TopSongs"
        f"?hl=en&gl={country.upper()}"
        f"&date={week_start.isoformat()}"
        f"&limit={top_n}"
    )

    r = requests.get(url, timeout=30)
    r.raise_for_status()
    data = r.json()

    print(data)

    # Guard against a missing or empty "charts" list instead of indexing blindly
    charts = data.get("charts") or [{}]
    rows = []
    for entry in charts[0].get("entries", []):
        rows.append({
            "rank": entry["rank"],
            "track_title": entry["title"],
            "artist_name": entry["artists"][0]["name"] if entry["artists"] else None,
            "video_id": entry["videoId"],
            "weekly_views": entry["views"],
            "week_start": pd.to_datetime(week_start),
            "country": country.lower(),
        })

    return pd.DataFrame(rows)


In [31]:
from tqdm.auto import tqdm

def backfill_youtube_weekly_charts(
    country: str,
    n_weeks: int = 12,
) -> pd.DataFrame:

    weeks = generate_week_starts(date.today(), n_weeks)

    all_weeks = []
    for w in tqdm(weeks, desc="Fetching YouTube weekly charts"):
        try:
            df = fetch_youtube_weekly_chart(country, w)
            all_weeks.append(df)
        except Exception as e:
            print(f"Failed week {w}: {e}")

    return pd.concat(all_weeks, ignore_index=True)


In [32]:
df_youtube_weekly = backfill_youtube_weekly_charts(
    country="se",
    n_weeks=12,
)

df_youtube_weekly.head()


Fetching YouTube weekly charts:   0%|          | 0/12 [00:00<?, ?it/s]

Failed week 2025-12-26: Expecting value: line 1 column 1 (char 0)
Failed week 2025-12-19: Expecting value: line 1 column 1 (char 0)
Failed week 2025-12-12: Expecting value: line 1 column 1 (char 0)
Failed week 2025-12-05: Expecting value: line 1 column 1 (char 0)
Failed week 2025-11-28: Expecting value: line 1 column 1 (char 0)
Failed week 2025-11-21: Expecting value: line 1 column 1 (char 0)
Failed week 2025-11-14: Expecting value: line 1 column 1 (char 0)
Failed week 2025-11-07: Expecting value: line 1 column 1 (char 0)
Failed week 2025-10-31: Expecting value: line 1 column 1 (char 0)
Failed week 2025-10-24: Expecting value: line 1 column 1 (char 0)
Failed week 2025-10-17: Expecting value: line 1 column 1 (char 0)
Failed week 2025-10-10: Expecting value: line 1 column 1 (char 0)


ValueError: No objects to concatenate
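
The `ValueError` is raised by `pd.concat`, which refuses an empty list: every week failed, so `all_weeks` stayed empty. One way to let the backfill degrade gracefully is sketched below. The helper `concat_or_empty` is hypothetical, and the column list is copied from the row dictionary built in `fetch_youtube_weekly_chart`.

```python
import pandas as pd

YOUTUBE_COLUMNS = [
    "rank", "track_title", "artist_name", "video_id",
    "weekly_views", "week_start", "country",
]


def concat_or_empty(frames: list[pd.DataFrame], columns: list[str]) -> pd.DataFrame:
    """Concatenate non-empty frames, or return an empty frame with the expected columns."""
    frames = [f for f in frames if not f.empty]
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)


result = concat_or_empty([], YOUTUBE_COLUMNS)
print(result.empty, list(result.columns))
```

Returning an empty frame with the expected columns lets downstream cells such as `df_youtube_weekly.head()` run, while the per-week failure messages still show what went wrong.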

In [33]:
import requests
from datetime import date

test_week = date(2024, 2, 9)  # any Friday

url = (
    f"https://charts.youtube.com/charts/TopSongs/se/weekly/"
    f"{test_week.isoformat()}?hl=en&format=csv"
)

print(url)

r = requests.get(url, timeout=30)
print("Status:", r.status_code)
print(r.text[:500])


https://charts.youtube.com/charts/TopSongs/se/weekly/2024-02-09?hl=en&format=csv
Status: 200
<!DOCTYPE html><html lang="en" dir="ltr"><head><script nonce="AhvJ7TIqVFVvXLvw8WeX9w">if ('undefined' == typeof Symbol || 'undefined' == typeof Symbol.iterator) {delete Array.prototype.entries;}</script><script nonce="AhvJ7TIqVFVvXLvw8WeX9w">var ytcsi={gt:function(n){n=(n||"")+"data_";return ytcsi[n]||(ytcsi[n]={tick:{},info:{},gel:{preLoggedGelInfos:[]}})},now:window.performance&&window.performance.timing&&window.performance.now&&window.performance.timing.navigationStart?function(){return windo
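
This output shows the site answering with status 200 and an HTML page even though CSV was requested. The earlier `Expecting value: line 1 column 1 (char 0)` errors likewise mean the response body was not JSON. Checking the response before parsing makes this kind of failure explicit. The helper `parse_json_response` below is a hypothetical sketch. It works on the raw header and body values so that it can be tried without a network call.

```python
import json


def parse_json_response(content_type: str, body: str) -> dict:
    """Parse a response body as JSON, failing with a descriptive error otherwise."""
    if "json" not in content_type.lower():
        preview = body[:60].replace("\n", " ")
        raise ValueError(f"expected JSON but got {content_type!r}: {preview!r}")
    return json.loads(body)


# With requests, the inputs would be r.headers.get("Content-Type", "") and r.text
print(parse_json_response("application/json; charset=utf-8", '{"charts": []}'))

try:
    parse_json_response("text/html; charset=utf-8", "<!DOCTYPE html><html>")
except ValueError as err:
    print("Rejected:", err)
```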


---