## Task 4.2: Title vs. Lyrics Relationship

**Cosine over TF-IDF vectors**<br>
I wrote a function that takes a song ID and projects that song's title into the TF-IDF space fitted on the lyrics.<br>All lyrics in the dataset are vectorized once up front, and the function then computes the cosine similarity between the title vector and every lyrics vector.<br>The output prints the `top_n` songs (5 by default) whose lyrics are most similar to the title, followed by the `top_n` least similar. The results are also saved to a CSV file for further analysis.
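
As a minimal, self-contained sketch of the idea (using a made-up three-song corpus rather than the project database; `toy_lyrics`, `toy_title`, and `toy_vectorizer` are hypothetical names), the vectorizer is fitted on the lyrics only and the title is then projected into that same vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data, not taken from the songs table
toy_lyrics = [
    "sing me a lullaby tonight, a sweet lullaby",
    "driving down the highway with the radio on",
    "rain keeps falling on the broken road",
]
toy_title = "Death Whispered a Lullaby"

# Fit on the lyrics only, then project the title into the same vocabulary
toy_vectorizer = TfidfVectorizer(lowercase=True)
toy_X = toy_vectorizer.fit_transform(toy_lyrics)
toy_title_vec = toy_vectorizer.transform([toy_title])

# One score per lyric; title words that never occur in the lyrics are ignored
toy_sims = cosine_similarity(toy_title_vec, toy_X).flatten()
best = int(toy_sims.argmax())
print(best, toy_sims.round(3))
```

Only "lullaby" is shared between the title and the fitted vocabulary, so the first lyric is the only one with a non-zero score. Title words missing from the lyrics vocabulary contribute nothing, which is a consequence of reusing the lyrics vectorizer for titles.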


In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import os

# --- 1. Connect to SQL Server ---
with open("db_config.txt", "r") as f:
    db_target = f.read().strip()
engine = create_engine(f"mssql+pyodbc://{db_target}?driver=ODBC+Driver+17+for+SQL+Server")

# --- 2. Pull song data including genre ---
df = pd.read_sql("""
    SELECT song_id, name, lyrics, genre
    FROM songs
    WHERE lyrics IS NOT NULL
""", engine)

# --- 3. TF-IDF Vectorizer for lyrics ---
vectorizer_lyrics = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
    lowercase=True,
    strip_accents="unicode"
)
X_lyrics = vectorizer_lyrics.fit_transform(df['lyrics'])

# --- 4. Function to analyze title vs lyrics ---
def title_lyrics_cosine(song_id, top_n=5):
    """
    Computes cosine similarity between a song's title and all lyrics,
    prints top N most similar and N least similar lyrics with genre,
    and saves the combined results to a CSV.
    """
    try:
        idx = df.index[df['song_id'] == song_id][0]
    except IndexError:
        print(f"Song ID {song_id} not found.")
        return

    # Transform the title using the lyrics vectorizer
    title_text = df.iloc[idx]['name']
    title_vec = vectorizer_lyrics.transform([title_text])

    # Cosine similarity between the title and all lyrics
    sims = cosine_similarity(title_vec, X_lyrics).flatten()

    # Exclude the song itself (np.argsort places NaN last in both orderings)
    sims_masked = np.copy(sims)
    sims_masked[idx] = np.nan

    # Top N most similar lyrics
    top_idx = np.argsort(-sims_masked)[:top_n]

    # Top N least similar lyrics
    bottom_idx = np.argsort(sims_masked)[:top_n]

    # --- Prepare results for printing ---
    def clean_lyrics(text):
        return re.sub(r'\s+', ' ', str(text)).strip()  # remove newlines and extra spaces

    top_results = []
    bottom_results = []

    print(f"\n=== Top {top_n} lyrics most similar to title: {title_text} (ID={song_id}) ===")
    for rank, i in enumerate(top_idx, start=1):
        snippet = clean_lyrics(df.iloc[i]['lyrics'])
        print(f"{rank}. [{df.iloc[i]['song_id']}] {df.iloc[i]['name']} "
              f"(Genre: {df.iloc[i]['genre']}, similarity={sims[i]:.3f})\n   Lyrics: {snippet}\n")
        top_results.append({
            'song_id': df.iloc[i]['song_id'],
            'name': df.iloc[i]['name'],
            'genre': df.iloc[i]['genre'],
            'similarity': sims[i],
            'lyrics': snippet
        })

    print(f"\n=== Top {top_n} lyrics least similar to title: {title_text} (ID={song_id}) ===")
    for rank, i in enumerate(bottom_idx, start=1):
        snippet = clean_lyrics(df.iloc[i]['lyrics'])
        print(f"{rank}. [{df.iloc[i]['song_id']}] {df.iloc[i]['name']} "
              f"(Genre: {df.iloc[i]['genre']}, similarity={sims[i]:.3f})\n   Lyrics: {snippet}\n")
        bottom_results.append({
            'song_id': df.iloc[i]['song_id'],
            'name': df.iloc[i]['name'],
            'genre': df.iloc[i]['genre'],
            'similarity': sims[i],
            'lyrics': snippet
        })

    # Save the combined top and bottom results to CSV (once, after both loops)
    out_dir = "SimilarityData"
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "cosine_title_similarity.csv")
    all_results = top_results + bottom_results
    pd.DataFrame(all_results).to_csv(out_path, index=False)


# --- call function ---
title_lyrics_cosine('13741824864810098075', top_n=5)



=== Top 5 lyrics most similar to title: Death Whispered a Lullaby (ID=13741824864810098075) ===
1. [5506909473774015947] Lullaby (Genre: grunge, metal, similarity=0.271)
   Lyrics: I know the feeling Of finding yourself stuck out on the ledge And there ain't no healing From cuttin' yourself with the jagged edge I'm tellin' you that it's never that bad And take it from someone who's been where your at You're laid out on the floor and you're not sure You can take this anymore So just give it one more try With a lullaby And turn this up on the radio If you can hear me now I'm reachin' out to let you know That you're not alone And you can't tell, I'm scared as hell 'Cause I can't get you on the telephone So just close your eyes Well honey here comes a lullaby Your very own lullaby Please let me take you Out of the darkness and into the light 'Cause I have faith in you That you're gonna make it through another night Stop thinkin' about the easy way out There's no need to go and blow the ca

**Jaccard on Clean Tokens**

I wrote a function that takes a song ID and converts that song's title into a token set.<br>
All `cleanTokens` entries in the dataset are also converted into token sets, and the function computes the Jaccard similarity between the title set and each lyrics token set.<br>
The output prints the `top_n` songs (5 by default) whose lyrics are most similar to the title, followed by the `top_n` least similar. The results are also saved to a CSV file for further analysis.

In [2]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import os

# --- Connect to SQL Server ---
engine = create_engine(f"mssql+pyodbc://{db_target}?driver=ODBC+Driver+17+for+SQL+Server")

# --- Pull song data ---
df = pd.read_sql("""
    SELECT song_id, name, genre, cleanTokens
    FROM songs
    WHERE cleanTokens IS NOT NULL
""", engine)

# --- Convert tokens to sets ---
def tokens_to_set(tokens_str):
    return set(str(tokens_str).split())

df['token_set'] = df['cleanTokens'].apply(tokens_to_set)

# --- Jaccard similarity function ---
def title_lyrics_jaccard(song_id, top_n=5):
    try:
        idx = df.index[df['song_id'] == song_id][0]
    except IndexError:
        print(f"Song ID {song_id} not found.")
        return

    title_tokens = tokens_to_set(df.iloc[idx]['name'])

    def jaccard(set1, set2):
        if not set1 or not set2:
            return 0.0
        return len(set1 & set2) / len(set1 | set2)

    sims = df['token_set'].apply(lambda x: jaccard(title_tokens, x))

    # Exclude the song itself by dropping it; a -1 placeholder would rank first in the "least similar" list
    sims = sims.drop(index=idx)

    # Sort by similarity
    top_idx = sims.sort_values(ascending=False).head(top_n).index
    bottom_idx = sims.sort_values(ascending=True).head(top_n).index

    # --- Print top results ---
    print(f"\n=== Top {top_n} lyrics most similar to title: {df.iloc[idx]['name']} (ID={song_id}) ===")
    top_results = []
    for rank, i in enumerate(top_idx, start=1):
        snippet = " ".join(df.iloc[i]['token_set'])
        print(f"{rank}. [{df.iloc[i]['song_id']}] {df.iloc[i]['name']} "
              f"(Genre: {df.iloc[i]['genre']}, Jaccard={sims[i]:.3f})\n   Tokens: {snippet}\n")
        top_results.append({
            'song_id': df.iloc[i]['song_id'],
            'name': df.iloc[i]['name'],
            'genre': df.iloc[i]['genre'],
            'jaccard_similarity': sims[i],
            'tokens': snippet
        })

    # --- Print bottom results ---
    print(f"\n=== Top {top_n} lyrics least similar to title: {df.iloc[idx]['name']} (ID={song_id}) ===")
    bottom_results = []
    for rank, i in enumerate(bottom_idx, start=1):
        snippet = " ".join(df.iloc[i]['token_set'])
        print(f"{rank}. [{df.iloc[i]['song_id']}] {df.iloc[i]['name']} "
              f"(Genre: {df.iloc[i]['genre']}, Jaccard={sims[i]:.3f})\n   Tokens: {snippet}\n")
        bottom_results.append({
            'song_id': df.iloc[i]['song_id'],
            'name': df.iloc[i]['name'],
            'genre': df.iloc[i]['genre'],
            'jaccard_similarity': sims[i],
            'tokens': snippet
        })

    # --- Save CSV (once, after both loops) ---
    out_dir = "SimilarityData"
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "jaccard_title_similarity.csv")
    all_results = top_results + bottom_results
    pd.DataFrame(all_results).to_csv(out_path, index=False)


# --- call function ---
title_lyrics_jaccard('13860806082654952241', top_n=5)



=== Top 5 lyrics most similar to title: Hallelujah I Love Her So (ID=13860806082654952241) ===
1. [10018510778896022773] How Can You Mend A Broken Heart (Genre: disco, Jaccard=0.000)
   Tokens: "man", "but", "makes", "living", "win", "see", "life", "never", "round", "you", "still", "no", "feel", "my", "think", "word", "please", "days", "one", "let", "younger", "ever", "rustles", "misty", "stop", "mend", "sun", "world", "heart", "shining", "trees", "tomorrow", "told", "loser", "help", "broken", "said", "breeze", "sorrow", "man"] "i", "everything", "could", "falling", ["i", "memories", "live", "want", "rain", "gone",

2. [4597666000887757323] Everlong (Genre: alt-rock, alternative, grunge, metal, Jaccard=0.000)
   Tokens: "sang", "everlong", "along", "sang"] "sing", "know", "you", "throw", "head", "feel", "anything", "my", "real", "waste", "good", "always", "hold", "wonder", "she", "breathe", "not", "ever", "promise", "stop", "ask", "come", "thing", "wanted", "red", "away", "forever", "
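
The tokens printed above still carry quotes, commas, and brackets (for example `"man"]` and `["i",`). This suggests, though the source does not confirm it, that `cleanTokens` is stored as a JSON-style list string. If so, whitespace splitting leaves punctuation attached to every token, and a raw, mixed-case title cannot match any of them, which may contribute to the 0.000 scores. The sketch below shows one possible way to parse both sides into comparable sets. The helper names `parse_clean_tokens` and `normalize_title` are hypothetical, and the storage format is an assumption:

```python
import json
import re

def parse_clean_tokens(tokens_str):
    """Parse a token string into a set of lowercase tokens.

    Assumes a JSON-style list such as '["i", "love"]' (an assumption about
    the column format); falls back to whitespace splitting otherwise.
    """
    text = str(tokens_str)
    try:
        parsed = json.loads(text)
        if isinstance(parsed, list):
            return {str(tok).lower() for tok in parsed}
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    return set(text.lower().split())

def normalize_title(title):
    """Lowercase a title and keep only alphabetic words (apostrophes allowed)."""
    return set(re.findall(r"[a-z']+", str(title).lower()))

# Made-up token string for illustration, not a row from the database
title_set = normalize_title("Hallelujah I Love Her So")
lyric_set = parse_clean_tokens('["i", "love", "her", "every", "morning"]')
shared = title_set & lyric_set
print(sorted(shared))  # ['her', 'i', 'love']
```

Whether titles should also pass through the same cleaning pipeline that produced `cleanTokens` (for instance stop-word removal) depends on how that column was built, which is not shown here.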

## Final Thoughts:

Although Jaccard is cheaper to compute, its results are far less informative than those of TF-IDF + cosine similarity.<br> This is because a title contains only a few words while lyrics contain many, so the intersection is tiny relative to the union and the scores collapse toward zero (every Jaccard result shown above scores 0.000). <br>In contrast, TF-IDF + cosine similarity takes word frequency into account, and TF-IDF assigns higher weights to rarer, more distinctive words, <br>making it more sensitive to meaningful overlap between a title and lyrics.
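
The size-imbalance argument can be checked with a small sketch. The token sets below are invented for illustration. Even when every title word appears in the lyrics, the Jaccard score cannot exceed the ratio of the title size to the lyrics size:

```python
def jaccard(set1, set2):
    """Jaccard similarity: |intersection| / |union|, or 0.0 if either set is empty."""
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

# Hypothetical example: a 3-word title fully contained in 150 distinct lyric tokens
title_tokens = {"broken", "heart", "mend"}
lyric_tokens = title_tokens | {f"word{i}" for i in range(147)}

score = jaccard(title_tokens, lyric_tokens)
upper_bound = len(title_tokens) / len(lyric_tokens)
print(f"{score:.3f} (upper bound {upper_bound:.3f})")  # 0.020 (upper bound 0.020)
```

A perfect containment of the title therefore yields only 0.020 here, so Jaccard scores between titles and full lyrics are compressed into a very narrow range near zero.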