## Information Retrieval & Advanced Insights
**Task 4.1: Similarity Search** - TF-IDF + Cosine Similarity <br>I wrote a function that takes two inputs: a song ID and the number of similar songs to return.<br>
The raw lyrics are vectorized with TF-IDF, and the function scores every song against the query song using cosine similarity,<br>
then retrieves the top-N most similar songs from the entire dataset.<br>Finally, the query song and its matches are saved to a CSV file for further analysis of lyric similarity.
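
To make the scoring step concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity on three invented lyric snippets. The helper names (`tfidf_vectors`, `cosine`) and the toy data are assumptions for illustration only; the IDF formula is the smoothed variant that scikit-learn documents as its default, but this sketch is not the vectorizer used in the cell below and skips its n-gram and `max_df` settings.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Invented toy "lyrics"; index 0 plays the role of the query song
toy_lyrics = [
    "money money cash",
    "cash money dreams",          # shares two terms with the query
    "silent night holy night",    # shares no terms with the query
]
vecs = tfidf_vectors(toy_lyrics)
scores = [cosine(vecs[0], v) for v in vecs]

# Rank every song except the query itself, most similar first
ranked = sorted(range(1, len(toy_lyrics)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Because the score is driven entirely by shared terms, the third snippet scores exactly zero against the query: TF-IDF has no notion that two different words might mean similar things.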




In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os

# --- 1. Connect to SQL Server ---
# read from file
with open("SQL_DB/db_config.txt", "r") as f:
    db_target = f.read().strip()
engine = create_engine(f"mssql+pyodbc://{db_target}?driver=ODBC+Driver+17+for+SQL+Server")

# --- 2. Pull lyrics from DB ---
df = pd.read_sql("""
    SELECT song_id, name, lyrics, cleanGenre
    FROM songs
    WHERE lyrics IS NOT NULL
""", engine)

# --- 3. TF-IDF Vectorizer on full lyrics ---
vectorizer = TfidfVectorizer(
    max_features=20000,       # keep more vocab for longer lyrics
    ngram_range=(1, 2),       # capture unigrams + bigrams
    min_df=1,
    max_df=0.9,
    lowercase=True,
    strip_accents="unicode"
)

# Transform all lyrics into TF-IDF vectors
X_tfidf = vectorizer.fit_transform(df['lyrics'])

# --- 4. Similarity search function by song_id ---
def top_similar(song_id, top_n=5, return_df=False):
    """
    Input:
        song_id  -> integer or string, the unique ID of the song
        top_n    -> number of top similar songs to print
        return_df -> if True, also return DataFrame of top results
    Output:
        Prints the top similar songs with song_id, name, and similarity
        Optionally returns a DataFrame
        Also saves results to 'SimilarityData/tfidf_cosine_similarity.csv'
    """
    # Find the DataFrame index corresponding to the song_id
    try:
        idx = df.index[df['song_id'] == song_id][0]
    except IndexError:
        print(f"Song ID {song_id} not found in the dataset.")
        return None

    # Extract full lyrics for the query
    query_lyrics = df.iloc[idx]['lyrics']

    # Transform the query into TF-IDF space
    query_vec = vectorizer.transform([query_lyrics])

    # Compute cosine similarity
    sims = cosine_similarity(query_vec, X_tfidf).flatten()

    # Exclude the song itself
    sims[idx] = -1

    # Get top N indices
    top_idx = np.argsort(-sims)[:top_n]

    # Collect results into DataFrame and include similarity
    results = df.iloc[top_idx][['song_id', 'cleanGenre','name', 'lyrics']].copy()
    results['similarity'] = sims[top_idx]

    # Include the query song as the first row
    query_song = df[df['song_id'] == song_id][['song_id', 'cleanGenre','name', 'lyrics']].copy()
    query_song['similarity'] = 1.0
    updated_results = pd.concat([query_song, results], ignore_index=True)

    # Print results ranked 1..top_n. The row labels of `results` are the
    # original DataFrame positions, so enumerate instead of reusing them.
    print(f"\n=== Top {top_n} songs similar to: {df.iloc[idx]['name']} (ID={song_id}) ===")
    for rank, (_, row) in enumerate(results.iterrows(), start=1):
        print(f"{rank}. [{row['song_id']}] {row['name']} {row['cleanGenre']} (similarity={row['similarity']:.3f})")

    # Save results to CSV
    out_dir = "SimilarityData"
    os.makedirs(out_dir, exist_ok=True)  # create folder if it doesn't exist
    out_path = os.path.join(out_dir, "tfidf_cosine_similarity.csv")
    updated_results[['song_id', 'name', 'lyrics']].to_csv(out_path, index=False, encoding='utf-8')

    # Honor the return_df flag described in the docstring
    return updated_results if return_df else None


# --- 5. Function Call ---
# Pass a song_id to print its top 5 most similar songs
top_similar('8036593539990052832', top_n=5)


=== Top 5 songs similar to: Love of Money (ID=8036593539990052832) ===
1. [11190317389501666971] Mi Remember dancehall (similarity=0.400)
2. [2091181825240645407] My Crew j-dance (similarity=0.312)
3. [10528770166993144871] Voglio restare cosi opera (similarity=0.174)
4. [18377019154707697331] Wine Pon Me j-dance (similarity=0.140)
5. [6923628035955983211] Hey Mama edm (similarity=0.116)



**Task 4.1: Similarity Search** - Word2Vec + Cosine Similarity <br>I wrote a second function that takes the same two inputs: a song ID and the number of similar songs to return.<br>
Each song is represented by the average of the Word2Vec vectors of its tokens, which aims to capture the semantic meaning of the lyrics; the function then computes similarity scores with cosine similarity,<br>
and retrieves the top-N most similar songs from the entire dataset.<br>Finally, the query song and its matches are saved to a CSV file for further analysis of lyric similarity.
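
The averaging idea can be sketched without training a model. The 3-dimensional embeddings below are invented for illustration (real Word2Vec vectors are learned from the corpus and, in the cell that follows, have 100 dimensions), and `average_vector` and `cosine` are hypothetical helper names rather than functions from this notebook.

```python
import math

# Hypothetical embeddings, hand-written for illustration only
toy_embeddings = {
    "money":  [0.9, 0.1, 0.0],
    "cash":   [0.8, 0.2, 0.1],
    "gold":   [0.7, 0.3, 0.0],
    "snow":   [0.0, 0.1, 0.9],
    "sleigh": [0.1, 0.0, 0.8],
}

def average_vector(tokens, embeddings, dim=3):
    """Average the vectors of all known tokens; zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

song_a = average_vector(["money", "cash"], toy_embeddings)
song_b = average_vector(["gold", "unknownword"], toy_embeddings)  # unknown token is skipped
song_c = average_vector(["snow", "sleigh"], toy_embeddings)

print(f"a vs b: {cosine(song_a, song_b):.3f}")
print(f"a vs c: {cosine(song_a, song_c):.3f}")
```

Songs `a` and `b` share no words, yet they score as similar because their words sit close together in the embedding space. That is the property a purely lexical method such as TF-IDF cannot provide.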




In [4]:
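import os
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
from sqlalchemy import create_engine

# --- 1. Connect to SQL Server ---
# db_target is the connection target read from SQL_DB/db_config.txt in the TF-IDF cell above
engine = create_engine(f"mssql+pyodbc://{db_target}?driver=ODBC+Driver+17+for+SQL+Server")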

# --- 2. Pull cleanTokens and lyrics from DB ---
df = pd.read_sql("""
    SELECT song_id, name, cleanTokens, lyrics , cleanGenre
    FROM songs
    WHERE cleanTokens IS NOT NULL
""", engine)

# --- 3. Prepare tokens ---
df['tokens'] = df['cleanTokens'].apply(lambda x: x if isinstance(x, list) else str(x).split())

# --- 4. Train Word2Vec model ---
w2v_model = Word2Vec(sentences=df['tokens'], vector_size=100, window=5, min_count=1, workers=4, sg=1)

# --- 5. Function to compute average Word2Vec vector for a song ---
def song_vector(tokens, model):
    vecs = [model.wv[word] for word in tokens if word in model.wv]
    if len(vecs) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

# Precompute vectors for all songs
song_vectors = np.array([song_vector(tokens, w2v_model) for tokens in df['tokens']])

# --- 6. Similarity search function by song_id ---


def top_similar_word2vec(song_id, top_n=5):
    """
    Input:
        song_id     -> integer or string, the unique ID of the song
        top_n       -> number of top similar songs to print
    Output:
        Prints the top similar songs with song_id, name, genre, and similarity
        Also saves results to 'SimilarityData/word2vec_cosine_similarity.csv'
    """
    # --- 1. Find the song index ---
    try:
        idx = df.index[df['song_id'] == song_id][0]
    except IndexError:
        print(f"Song ID {song_id} not found in dataset.")
        return None

    # --- 2. Compute cosine similarity ---
    query_vec = song_vectors[idx].reshape(1, -1)
    sims = cosine_similarity(query_vec, song_vectors).flatten()
    sims[idx] = -1  # exclude itself
    top_idx = np.argsort(-sims)[:top_n]

    # --- 3. Collect results into DataFrame ---
    results = df.iloc[top_idx][['song_id', 'cleanGenre' ,'name', 'lyrics']].copy()
    results['similarity'] = sims[top_idx]

    # Include the query song as the first row
    query_song = df[df['song_id'] == song_id][['song_id', 'cleanGenre','name', 'lyrics']].copy()
    query_song['similarity'] = 1.0
    updated_results = pd.concat([query_song, results], ignore_index=True)

    # --- 4. Print results ---
    # Rank 1..top_n with enumerate; the row labels of `results` are the
    # original DataFrame positions, not ranks.
    print(f"\n=== Top {top_n} songs similar to: {df.iloc[idx]['name']} (ID={song_id}) ===")
    for rank, (_, row) in enumerate(results.iterrows(), start=1):
        print(f"{rank}. [{row['song_id']}] {row['name']} {row.get('cleanGenre','')} "
              f"(similarity={row['similarity']:.3f})")

    # --- 5. Save CSV ---
    out_dir = "SimilarityData"
    os.makedirs(out_dir, exist_ok=True)  # create folder if it doesn't exist
    out_path = os.path.join(out_dir, "word2vec_cosine_similarity.csv")
    updated_results[['song_id', 'name', 'lyrics']].to_csv(out_path, index=False, encoding='utf-8')

    return None

# Function call:
top_similar_word2vec('8036593539990052832', top_n=5)


=== Top 5 songs similar to: Love of Money (ID=8036593539990052832) ===
1. [6959893070932530374] Clarks j-dance (similarity=0.998)
2. [10929755647275482411] Maybellene rockabilly (similarity=0.996)
3. [6291538727295026626] Why We Thugs funk (similarity=0.996)
4. [10094682406723622078] mOBSCENE industrial (similarity=0.996)
5. [8830279411611115885] X hardcore (similarity=0.995)


## Conclusion
When I tested the system with an extreme genre such as death metal, both TF-IDF and Word2Vec performed well, consistently returning songs from the same genre with similarly harsh lyrics. Likewise, when I chose a Christmas song, every returned song had a Christmas theme, showing that both methods can capture strong thematic signals. For more nuanced semantic similarity, however, Word2Vec outperformed TF-IDF in my tests. For example, when I passed the song “Love of Money”, TF-IDF returned weakly relevant songs, including one with Italian lyrics, because it relies on simple word overlap, while Word2Vec identified modern hip-hop and RnB tracks centered on money, luxury, and lifestyle. This suggests that TF-IDF is effective for lexical similarity, whereas Word2Vec is better at capturing deeper semantic meaning.
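
One way to put a number on observations like "returned songs from the same genre" is the fraction of the top-N results whose genre matches the query's genre. The sketch below is an assumption about how such a check could look, not something computed in this notebook; `genre_agreement` is a hypothetical helper and the genre lists are invented.

```python
def genre_agreement(query_genre, result_genres):
    """Fraction of retrieved songs whose genre label equals the query's genre."""
    if not result_genres:
        return 0.0
    matches = sum(1 for genre in result_genres if genre == query_genre)
    return matches / len(result_genres)

# Invented example: a "metal" query and the genres of five hypothetical results
example_results = ["metal", "metal", "rock", "metal", "pop"]
print(genre_agreement("metal", example_results))
```

Genre match is only a rough proxy: two songs can share a theme across genres, so a low score does not by itself mean the retrieved lyrics are unrelated.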