## Data Cleaning & Preparation (10k Songs)

#### Goal: Produce a per-track dataframe including information about:
* key audio attributes (P0 = valence; P1 / nice to have for additional analysis = danceability, wordiness, popularity, BPM)
* a lyrics excerpt of 100-1,000 characters per song
* **Target: 10,000 songs** with checkpointing every 500 songs
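
A minimal sketch of the checkpoint-and-resume idea, assuming progress is keyed by `track_id` rather than by row position. The helper `run_with_checkpoints`, the stub fetcher, and the toy tracks are hypothetical and stand in for the real Genius lookup and Spotify data.

```python
import os
import tempfile

import pandas as pd


def run_with_checkpoints(tracks, fetch_fn, checkpoint_file, checkpoint_interval=500):
    """Fetch a snippet per track, saving progress every `checkpoint_interval` rows.

    On restart, track_ids already present in the checkpoint file are skipped.
    """
    if os.path.exists(checkpoint_file):
        records = pd.read_csv(checkpoint_file).to_dict("records")
    else:
        records = []
    done = {r["track_id"] for r in records}

    for _, row in tracks.iterrows():
        if row["track_id"] in done:
            continue
        records.append({
            "track_id": row["track_id"],
            "lyrics_snippet": fetch_fn(row["track_name"], row["track_artist"]),
        })
        done.add(row["track_id"])
        if len(records) % checkpoint_interval == 0:
            pd.DataFrame(records).to_csv(checkpoint_file, index=False)

    # Save once more so a partial final batch is not lost.
    pd.DataFrame(records).to_csv(checkpoint_file, index=False)
    return pd.DataFrame(records)


# Toy data (hypothetical) and a stub fetcher that records which titles were requested.
toy_tracks = pd.DataFrame({
    "track_id": ["a", "b", "c"],
    "track_name": ["Song A", "Song B", "Song C"],
    "track_artist": ["Artist 1", "Artist 2", "Artist 3"],
})
calls = []


def stub_fetch(title, artist):
    calls.append(title)
    return f"lyrics for {title}"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.csv")
    # First run is "interrupted" after two songs; the second run resumes from the file.
    first = run_with_checkpoints(toy_tracks.head(2), stub_fetch, path, checkpoint_interval=2)
    second = run_with_checkpoints(toy_tracks, stub_fetch, path, checkpoint_interval=2)

print(calls)
print(len(second))
```

Keying on `track_id` means a resumed run stays correct even if the input rows are re-sorted or re-filtered between runs, which a position-based resume cannot guarantee.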

#### Access statement: set up the Genius (lyrics repository) API

In [None]:
!pip install lyricsgenius

In [None]:

import requests
import os

# SECURITY: read the API token from an environment variable; never hardcode it.
# Set in your terminal: export GENIUS_TOKEN="your_new_token_here"
GENIUS_TOKEN = os.environ.get('GENIUS_TOKEN')

if not GENIUS_TOKEN:
    raise RuntimeError("GENIUS_TOKEN is not set. Export it before running this notebook.")
print("✓ Using custom Genius token from environment")

headers = {"Authorization": f"Bearer {GENIUS_TOKEN}"}

def get_lyrics(song, artist):
    """Return the URL of the top Genius search hit (not the lyrics text), or None."""
    search_url = "https://api.genius.com/search"
    params = {"q": f"{song} {artist}"}
    response = requests.get(search_url, headers=headers, params=params).json()
    hits = response["response"]["hits"]
    if not hits:
        return None
    song_url = hits[0]["result"]["url"]
    return song_url  # link to Genius lyrics page

✓ Using custom Genius token from environment


In [5]:
import time
import lyricsgenius
import os

# Use environment variable for token security (never hardcode the token)
genius = lyricsgenius.Genius(os.environ["GENIUS_TOKEN"])
genius.remove_section_headers = True
genius.skip_non_songs = True
genius.timeout = 15  # Add timeout to prevent hanging
genius.retries = 3   # Built-in retry mechanism

def get_lyrics(track_name, artist_name, string_length):
    """Fetch the first `string_length` characters of lyrics for a song title and artist from Genius."""
    try:
        song = genius.search_song(track_name, artist_name)
        if song and song.lyrics:
            snippet = song.lyrics.replace("\n", " ")[:string_length]
            return snippet
        else:
            return None
    except Exception as e:
        print(f"Error fetching {track_name} by {artist_name}: {e}")
        return None

In [6]:
get_lyrics(track_name = 'Stain', artist_name = 'Twin Peaks', string_length= 200)

Searching for "Stain" by Twin Peaks...
Done.


"Can’t help but piss all my youth down the well And wave my hand watchin' it go If you gotta hold onto something You better hold onto yourself Your whole life is just space between holes  Oh I’d be thr"

#### Audio feature data source: Sample of 30,000 Spotify songs from [Kaggle dataset](https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs)


In [7]:
import pandas as pd

# Load Spotify dataset
spotify_df = pd.read_csv("/Users/pedro.josealvarez/w266/final-project/data/original_song_sample.csv")

print(f"✓ Loaded {len(spotify_df):,} songs")
print(f"Columns: {list(spotify_df.columns)}")
print(f"\nFirst few rows:")
spotify_df.head()

✓ Loaded 32,833 songs
Columns: ['track_id', 'track_name', 'track_artist', 'track_popularity', 'track_album_id', 'track_album_name', 'track_album_release_date', 'playlist_name', 'playlist_id', 'playlist_genre', 'playlist_subgenre', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']

First few rows:


Unnamed: 0,track_id,track_name,track_artist,track_popularity,track_album_id,track_album_name,track_album_release_date,playlist_name,playlist_id,playlist_genre,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,6f807x0ima9a1j3VPbc7VN,I Don't Care (with Justin Bieber) - Loud Luxur...,Ed Sheeran,66,2oCs0DGTsRO98Gh5ZSl2Cx,I Don't Care (with Justin Bieber) [Loud Luxury...,2019-06-14,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,6,-2.634,1,0.0583,0.102,0.0,0.0653,0.518,122.036,194754
1,0r7CVbZTWZgbTCYdfa2P31,Memories - Dillon Francis Remix,Maroon 5,67,63rPSO264uRjW1X5E6cWv6,Memories (Dillon Francis Remix),2019-12-13,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,11,-4.969,1,0.0373,0.0724,0.00421,0.357,0.693,99.972,162600
2,1z1Hg7Vb0AhHDiEmnDE79l,All the Time - Don Diablo Remix,Zara Larsson,70,1HoSmj2eLcsrR0vE9gThr4,All the Time (Don Diablo Remix),2019-07-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-3.432,0,0.0742,0.0794,2.3e-05,0.11,0.613,124.008,176616
3,75FpbthrwQmzHlBJLuGdC7,Call You Mine - Keanu Silva Remix,The Chainsmokers,60,1nqYsOef1yKKuGOVchbsk6,Call You Mine - The Remixes,2019-07-19,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,7,-3.778,1,0.102,0.0287,9e-06,0.204,0.277,121.956,169093
4,1e8PAfcKUYoKkxPhrHqw4x,Someone You Loved - Future Humans Remix,Lewis Capaldi,69,7m7vv9wlQ4i0LFuJiE2zsQ,Someone You Loved (Future Humans Remix),2019-03-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-4.672,1,0.0359,0.0803,0.0,0.0833,0.725,123.976,189052


### Initial testing of parsing script
#### Purpose: verify that the helper function below fetches and truncates lyrics correctly

In [None]:
# --- CONFIG ---
batch_save_interval = 100  # save progress every N songs
lyric_snippet_length = 500  # number of chars to keep in lyric snippet
sleep_time = 5.0  # set to 5 seconds to avoid Cloudflare rate limiting
max_retries = 3  # Number of retries for failed requests

def get_lyrics(track_name, artist_name, retry_count=0):
    """Fetch lyrics for a given song and artist, truncated to lyric_snippet_length."""
    try:
        song = genius.search_song(track_name, artist_name)
        if song and song.lyrics:
            snippet = song.lyrics.replace("\n", " ")[:lyric_snippet_length]
            return snippet
        else:
            return None
    except Exception as e:
        error_msg = str(e)
        
        # Check if it's a rate limiting / Cloudflare issue
        if '403' in error_msg or 'Cloudflare' in error_msg or 'rate' in error_msg.lower():
            if retry_count < max_retries:
                wait_time = (retry_count + 1) * 10  # Linear backoff: 10s, 20s, 30s
                print(f"rate limited! Waiting {wait_time}s before retry {retry_count + 1}/{max_retries}...")
                time.sleep(wait_time)
                return get_lyrics(track_name, artist_name, retry_count + 1)
            else:
                print(f"❌ Failed after {max_retries} retries: {track_name} by {artist_name}")
                return None
        else:
            print(f"Error fetching {track_name} by {artist_name}: {error_msg[:100]}")
            return None

# --- Use only 100 songs for testing ---
spotify_subset = spotify_df[:100]

# --- Fetch lyrics for subset ---
lyrics_data = []
for i, row in spotify_subset.iterrows():
    title = row["track_name"]
    artist = row["track_artist"]
    lyrics_snippet = get_lyrics(title, artist)

    lyrics_data.append({
        "track_id": row["track_id"],
        "track_name": title,
        "track_artist": artist,
        "lyrics_snippet": lyrics_snippet
    })
    time.sleep(sleep_time)

# --- Collect the fetched lyrics into a dataframe ---
lyrics_df = pd.DataFrame(lyrics_data)
lyrics_df.head()

Searching for "I Don't Care (with Justin Bieber) - Loud Luxury Remix" by Ed Sheeran...
No results found for: 'I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran'
Searching for "Memories - Dillon Francis Remix" by Maroon 5...
Done.
Searching for "All the Time - Don Diablo Remix" by Zara Larsson...
Done.
Searching for "Call You Mine - Keanu Silva Remix" by The Chainsmokers...
Done.
Searching for "Someone You Loved - Future Humans Remix" by Lewis Capaldi...
Done.
Searching for "Beautiful People (feat. Khalid) - Jack Wins Remix" by Ed Sheeran...
Done.
Searching for "Never Really Over - R3HAB Remix" by Katy Perry...
Done.
Searching for "Post Malone (feat. RANI) - GATTÜSO Remix" by Sam Feldt...
Done.
Searching for "Tough Love - Tiësto Remix / Radio Edit" by Avicii...
Done.
Searching for "If I Can't Have You - Gryffin Remix" by Shawn Mendes...
Done.
Searching for "Cross Me (feat. Chance the Rapper & PnB Rock) - M-22 Remix" by Ed Sheeran...
No results found for: 'Cross Me (feat. 

Unnamed: 0,track_id,track_name,track_artist,lyrics_snippet
0,6f807x0ima9a1j3VPbc7VN,I Don't Care (with Justin Bieber) - Loud Luxur...,Ed Sheeran,
1,0r7CVbZTWZgbTCYdfa2P31,Memories - Dillon Francis Remix,Maroon 5,Here's to the ones that we got Cheers to the w...
2,1z1Hg7Vb0AhHDiEmnDE79l,All the Time - Don Diablo Remix,Zara Larsson,Summertime and I'm caught in the feeling Getti...
3,75FpbthrwQmzHlBJLuGdC7,Call You Mine - Keanu Silva Remix,The Chainsmokers,Two kids with their hearts on fire Who's gonna...
4,1e8PAfcKUYoKkxPhrHqw4x,Someone You Loved - Future Humans Remix,Lewis Capaldi,I'm going under and this time I fear there’s n...


In [None]:
print(f"missing lyric count: {lyrics_df['lyrics_snippet'].isna().sum()}")

missing lyric count: 27


### Data Quality Issues & Strategy

Observations from test run:
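* 27 of the 100 test songs (27%) had missing lyrics (remixes, instrumental versions, etc.)
* Need higher quality data for better model performance

**New Strategy:**
* Filter for likely **English-language songs**, using playlist genre and playlist name as a proxy (better lyrics availability)
* Drop remixes, instrumentals, and duplicate track+artist pairs
* Sort by **popularity** and keep the top 10k most popular songs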
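
One caveat when filtering out remixes by title: a plain substring pattern such as `Mix|Edit` with `case=False` also matches inside unrelated words. The sketch below compares a substring pattern with a word-boundary variant on a few hypothetical titles chosen only to illustrate the difference. The word-boundary pattern is a suggested alternative, not the filter used for the recorded run.

```python
import pandas as pd

# Hypothetical titles, chosen only to show the difference between the two patterns.
titles = pd.Series([
    "Memories - Dillon Francis Remix",
    "Tough Love - Radio Edit",
    "Credit",
    "Mixed Signals",
    "Blinding Lights",
])

# Plain substring match: also flags "Credit" (contains "edit") and "Mixed Signals".
substring_pattern = r"Remix|Instrumental|Karaoke|Mix|Edit"
# Word-boundary match: flags the keywords only when they appear as whole words.
word_pattern = r"\b(?:Remix|Instrumental|Karaoke|Mix|Edit)\b"

substring_hits = titles[titles.str.contains(substring_pattern, case=False, na=False)].tolist()
word_hits = titles[titles.str.contains(word_pattern, case=False, na=False)].tolist()

print(substring_hits)
print(word_hits)
```

The non-capturing group `(?:...)` keeps pandas from warning about match groups in `str.contains`.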

In [10]:
import time
import os
import random

# --- CONFIGURATION ---
start_time = time.time()
alpha = 0.2  # smoothing factor for the exponential moving average of iteration time
avg_iteration_time = None
target_size = 10000  # Target: 10k songs
string_length = 1000
checkpoint_interval = 500  # Save every 500 songs
checkpoint_file = "lyrics_df_checkpoint.csv"

# Increase sleep time to avoid Cloudflare blocking
base_sleep_time = 5.0  # Base 5 seconds between requests
sleep_jitter = 2.0  # Add random 0-2s jitter to appear more human-like

# Retry configuration
max_retries = 3

print("="*80)
print("FILTERING FOR TOP 10K MOST POPULAR ENGLISH SONGS")
print("="*80)

# Heuristic filter for English songs: keep genres and playlists that are mostly English.
# The playlist-name clause is an OR, so tracks from other genres (e.g. latin, edm) can still pass.
english_indicators = ['pop', 'rock', 'rap', 'r&b', 'country']
spotify_df_filtered = spotify_df[
    (spotify_df['playlist_genre'].str.lower().isin(english_indicators)) |
    (spotify_df['playlist_name'].str.contains('Top|Hits|Pop|Rock|Country|Rap', case=False, na=False))
].copy()

print(f"\nAfter filtering for likely English songs: {len(spotify_df_filtered)} songs")

# Remove obvious remixes/instrumentals (lower lyrics success rate)
spotify_df_filtered = spotify_df_filtered[
    ~spotify_df_filtered['track_name'].str.contains('Remix|Instrumental|Karaoke|Mix|Edit', case=False, na=False)
]

print(f"After removing remixes/instrumentals: {len(spotify_df_filtered)} songs")

# Sort by popularity (descending) and take top 10k
spotify_df_filtered = spotify_df_filtered.sort_values('track_popularity', ascending=False)
spotify_df_filtered = spotify_df_filtered.drop_duplicates(subset=['track_name', 'track_artist'], keep='first')

print(f"After deduplicating by track+artist: {len(spotify_df_filtered)} songs")

# Take top 10k most popular
spotify_subset = spotify_df_filtered.head(target_size).reset_index(drop=True)

print(f"\n✓ Final dataset: {len(spotify_subset)} songs")
print(f"  Popularity range: {spotify_subset['track_popularity'].min()} - {spotify_subset['track_popularity'].max()}")
print(f"  Average popularity: {spotify_subset['track_popularity'].mean():.1f}")
print(f"  Genres: {spotify_subset['playlist_genre'].value_counts().to_dict()}")
print("="*80)

# Load checkpoint if it exists (for resume capability)
if os.path.exists(checkpoint_file):
    print(f"\nFound checkpoint file! Loading existing data...")
    lyrics_df_existing = pd.read_csv(checkpoint_file)
    lyrics_data = lyrics_df_existing.to_dict('records')
    start_idx = len(lyrics_data)
    print(f"Resuming from song #{start_idx}")
else:
    lyrics_data = []
    start_idx = 0
    print("\nStarting fresh collection...")

# Improved get_lyrics function with retry logic
def get_lyrics_with_retry(track_name, artist_name, retry_count=0):
    """Fetch lyrics, retrying with a linearly increasing wait when rate limited."""
    try:
        song = genius.search_song(track_name, artist_name)
        if song and song.lyrics:
            snippet = song.lyrics.replace("\n", " ")[:string_length]
            return snippet
        else:
            return None
    except Exception as e:
        error_msg = str(e)
        
        # Check for rate limiting / Cloudflare / 403 errors (case-insensitive match)
        if any(x in error_msg.lower() for x in ['403', 'cloudflare', 'rate', 'too many']):
            if retry_count < max_retries:
                wait_time = (retry_count + 1) * 15  # 15s, 30s, 45s backoff
                print(f"      ⚠️  Rate limited! Waiting {wait_time}s... (retry {retry_count + 1}/{max_retries})")
                time.sleep(wait_time)
                return get_lyrics_with_retry(track_name, artist_name, retry_count + 1)
            else:
                print(f"      ❌ Failed after {max_retries} retries - skipping")
                return None
        else:
            # Other errors - just return None
            return None

# Start collection
for i, row in spotify_subset.iterrows():
    # Skip already processed rows
    if i < start_idx:
        continue
        
    iter_start = time.time()
    
    title = row["track_name"]
    artist = row["track_artist"]
    valence = row["valence"]
    lyrics_snippet = get_lyrics_with_retry(title, artist)

    lyrics_data.append({
        "track_id": row["track_id"],
        "track_name": title,
        "track_artist": artist,
        "valence": valence,
        "lyrics_snippet": lyrics_snippet
    })
    
    # Variable sleep time with jitter (appear more human-like)
    sleep_duration = base_sleep_time + random.uniform(0, sleep_jitter)
    time.sleep(sleep_duration)
    
    # compute iteration time
    iter_time = time.time() - iter_start
    if avg_iteration_time is None:
        avg_iteration_time = iter_time
    else:
        avg_iteration_time = alpha * iter_time + (1 - alpha) * avg_iteration_time
    
    remaining_iters = target_size - (i + 1)
    remaining_time = remaining_iters * avg_iteration_time
    
    # Progress update
    if (i + 1) % 50 == 0:  # Print every 50 songs
        print(f'Song {i+1}/{target_size} ({100*(i+1)/target_size:.1f}%) | Est. remaining: {remaining_time/60:.1f} mins')
    
    # CHECKPOINT: Save progress every N songs
    if (i + 1) % checkpoint_interval == 0:
        temp_df = pd.DataFrame(lyrics_data)
        temp_df.to_csv(checkpoint_file, index=False)
        print(f"✓ CHECKPOINT: Saved {len(lyrics_data)} songs to {checkpoint_file}")

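# Final save: write everything collected, including the last partial batch
pd.DataFrame(lyrics_data).to_csv(checkpoint_file, index=False)
print("\n" + "="*80)
print("COLLECTION COMPLETE!")
print(f"Total time: {(time.time() - start_time)/3600:.2f} hours")
print(f"Total songs collected: {len(lyrics_data)}")
print("="*80)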


FILTERING FOR TOP 10K MOST POPULAR ENGLISH SONGS

After filtering for likely English songs: 24772 songs
After removing remixes/instrumentals: 23166 songs
After deduplicating by track+artist: 18665 songs

✓ Final dataset: 10000 songs
  Popularity range: 44 - 100
  Average popularity: 60.6
  Genres: {'rap': 2673, 'pop': 2231, 'rock': 2054, 'r&b': 2003, 'latin': 672, 'edm': 367}

Starting fresh collection...
Searching for "Dance Monkey" by Tones and I...
Done.
Searching for "ROXANNE" by Arizona Zervas...
Done.
Searching for "The Box" by Roddy Ricch...
Done.
Searching for "Blinding Lights" by The Weeknd...
Done.
Searching for "Memories" by Maroon 5...
Done.
Searching for "Tusa" by KAROL G...
Done.
Searching for "Circles" by Post Malone...
Done.
Searching for "everything i wanted" by Billie Eilish...
Done.
Searching for "Falling" by Trevor Daniel...
Done.
Searching for "Don't Start Now" by Dua Lipa...
Done.
Searching for "RITMO (Bad Boys For Life)" by The Black Eyed Peas...
Done.
Searching 

KeyboardInterrupt: 

In [None]:
# Build the final dataframe from the lyrics collected in memory
import pandas as pd
final_df = pd.DataFrame(lyrics_data)


In [None]:
# Reload the Spotify data if it is not in memory (e.g. after a kernel restart);
# without this, the cell fails with "NameError: name 'spotify_df' is not defined".
if 'spotify_df' not in globals():
    spotify_df = pd.read_csv("/Users/pedro.josealvarez/w266/final-project/data/original_song_sample.csv")

# Deduplicate spotify_df first so the merge cannot multiply rows
spotify_dedup = spotify_df.drop_duplicates(subset='track_id', keep='first')

# Then merge the remaining audio features onto the lyrics dataframe
final_df_merged = final_df.merge(
    spotify_dedup[['track_id', 'danceability', 'energy', 'loudness', 'speechiness', 
                   'acousticness', 'instrumentalness', 'liveness', 'tempo', 
                   'duration_ms', 'key', 'mode']], 
    on='track_id', 
    how='left'
)
print(f"Merged dataset: {len(final_df_merged)} rows")



### Final summary

**Collection Details:**
- **Target**: 10,000 songs
- **Actual output**: 5,291 songs
- **Sleep time**: 5-7 seconds between requests (to avoid Cloudflare blocking)
- **Final elapsed time**: ~8 hours before rate-limiting protection cut the run short
- **Output**: `lyrics_df_5k.csv`
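
As a rough sanity check on these numbers, the sleep time alone accounts for most of the elapsed time. The sketch below assumes the 5-7 second sleep is uniform (mean 6 seconds) and ignores request latency and retry waits, so it is a lower bound, not a measurement. `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(n_songs, base_sleep=5.0, jitter=2.0, request_seconds=0.0):
    """Expected wall-clock hours: n_songs * (base sleep + mean jitter + request time)."""
    mean_sleep = base_sleep + jitter / 2  # uniform jitter in [0, jitter] averages jitter / 2
    return n_songs * (mean_sleep + request_seconds) / 3600


collected = estimate_hours(5291)   # songs actually collected
full = estimate_hours(10000)       # original target

print(f"Sleep-only time for 5,291 songs: {collected:.1f} hours")
print(f"Sleep-only time for 10,000 songs: {full:.1f} hours")
```

Under these assumptions the collected 5,291 songs correspond to roughly 8.8 hours of sleep time, and the full 10,000-song target would need about 16.7 hours.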
