# Problem Set 3 CS174
---
The following problem set will focus on the lessons on Bag of Words and TFIDF. Answer the following guide questions with your groups by writing functions or code in cells that accomplish the requirements.

1. Take your favorite Spotify or Apple Music playlist, and use their APIs to programmatically extract the first 20 song titles. Build a dictionary that has the song index as the key, and the title as the value. Please make sure that the playlist is **NOT** an instrumental playlist. **3 pts.**

2. Use the Genius API to programmatically download the lyrics of these songs and store them in a dictionary with the index as the key, and the lyrics as the value. **3 pts.**

3. Build a 20x20 matrix containing the cosine similarities of the songs to each other. Use Bag of Words to determine these similarities. **4 pts.**


4. Build a 20x20 matrix containing the cosine similarities of the songs to each other. Use TFIDF scores to determine these similarities. **5 pts.**

5. Compare the similarities using BoW and TFIDF. Analyze the results and discuss any findings that interest you. You can use heatmaps or other plots to present your analysis. **3 pts**

Guide Questions:
- Can this be used to determine playlist quality?
- What does this say about the homogeneity of the playlist themes?
- What does this say about how songs (in that playlist genre) are written?

6. Identify the top 5 most important words for each song using TFIDF. Discuss the relationship of these words to their respective songs and analyze if they can be used as passable summaries of the songs. **2 pts** 

Make sure to remove stopwords. Lemmatizing or stemming is not required, but is very welcome.
       
### Bonus:
Do a TFIDF analysis for a single artist's top 5 songs. Identify the top 10 most important words and discuss an analysis of these. **5 pts.**

Deadline **March 17 11:59PM**. Submission link to be posted in Moodle. 
Submit a .zip file containing the notebook and a 'data/' directory containing the songs with name <SURNAME>_<ID NUMBER>.ipynb.
Make sure to remove or obfuscate any API keys you include in the final submission.
    
Sample: **"BAUTISTA_110464.zip"**

# 1. Spotify API to extract 20 song titles

1. Take your favorite Spotify or Apple Music playlist, and use their APIs to programmatically extract the first 20 song titles. Build a dictionary that has the song index as the key, and the title as the value. Please make sure that the playlist is **NOT** an instrumental playlist. **3 pts.**

## Setting up Spotify API

In [1]:
!pip install colab-env -qU
import colab_env

ImportError: colab-env only works in a Google Colab notebook

In [None]:
# To add the Spotify client ID and secret, replace the placeholder values below with your own.
# The real credentials are redacted here; never include them in a submission.
from colab_env import envvar_handler

envvar_handler.add_env('SPOTIPY_CLIENT_ID', '<YOUR_SPOTIFY_CLIENT_ID>')
envvar_handler.add_env('SPOTIPY_CLIENT_SECRET', '<YOUR_SPOTIFY_CLIENT_SECRET>')

## Getting the playlist

In [None]:
!pip install spotipy

In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Gets the credentials from the environment variables. Make sure that SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET
# are in your environment.
client_credentials_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [None]:
# Editing titles (i.e. removing the "Remastered" parts of the title)
def edit_title(title):
  if title == 'You Make My Dreams - Remastered':
    return 'You Make My Dreams'
  elif title == 'Let It Be - Remastered 2009':
    return 'Let It Be'
  elif title == "Don't Stop Me Now - 2011 Mix":
    return "Don't Stop Me Now"
  else:
    return title
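
The hard-coded comparisons above only cover the three titles in this playlist. A more general approach could strip the suffix with a regular expression. The sketch below is only an illustration: the function name `strip_release_suffix` and the pattern are assumptions, and the pattern only covers suffixes that mention a remaster, a mix, or a version.

```python
import re

# Hypothetical pattern: a trailing " - ..." suffix that mentions a remaster,
# a mix, or a version, optionally with a four-digit year on either side.
SUFFIX_PATTERN = re.compile(
    r"\s+-\s+(?:\d{4}\s+)?(?:remaster(?:ed)?|mix|version)(?:\s+\d{4})?\s*$",
    re.IGNORECASE,
)

def strip_release_suffix(title):
    """Remove a trailing release tag such as ' - Remastered 2009'."""
    return SUFFIX_PATTERN.sub("", title)

for raw in ["You Make My Dreams - Remastered",
            "Let It Be - Remastered 2009",
            "Don't Stop Me Now - 2011 Mix",
            "Take On Me"]:
    print(strip_release_suffix(raw))
```

Titles without a matching suffix are returned unchanged, so the function is safe to apply to every track.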

In [None]:
PLAYLIST_URI = 'spotify:playlist:0fLLZQlgUdCUbfiDZ3kJq7'
NUM_TRACKS = 20

# Will modify the dictionary so as to include the artist as well as the song title.
# The structure will be a dictionary of two-element tuples. Example: {1 : ('Take On Me', 'a-ha')}

playlist_tracks = sp.playlist_tracks(PLAYLIST_URI)['items']
playlist_dict = {}
for i, track in enumerate(playlist_tracks[:NUM_TRACKS]):
  title = track['track']['name']
  title = edit_title(title)
  artists = ', '.join([artist['name'] for artist in track['track']['artists']])
  playlist_dict[i] = (title, artists)

print(playlist_dict)
songtitles = [playlist_dict[x][0] for x in playlist_dict]
songtitles

# 2. Retrieving Lyrics from Genius
We obtain the lyrics of songs in the Spotify playlist using the Genius API.

In [None]:
dummy_song = ['Sun goes down, sun comes up\n', 'Days can drown in a plastic cup\n', 'In this town\n', 'In this town\n', '\n', "Don't know how, don't know when\n", 'Something came but it left again\n', "And I'm down\n", 'On this town\n', '\n', "'Cause no matter what I see\n", 'People love to disagree\n', '\n', 'Every time I say what I want to be\n', "Someone says that's not how it's going to be\n", 'Come on, baby, quit your dreaming\n', 'Grab your things, the train is leaving\n', 'Time to pick that somewhere you want to go\n', 'Get there quick or drown in the undertow\n', 'Come on baby, time is wasting\n', 'Choose a wheel and get to racing\n', '\n', 'Sun goes up and down again\n', 'Hard to stop feeling broken in\n', 'And worn down\n', 'By this town\n', '\n', 'All the same\n', 'You find me here\n', 'Placing blame and escaping fear and self-doubt\n', 'And this town\n', '\n', 'But no matter where I go\n', 'People love to tell me, "No"\n', '\n', 'Every time I say what I want to be\n', "Someone says that's not how it's going to be\n", 'Come on, baby, quit your dreaming\n', 'Grab your things, the train is leaving\n', 'Time to pick that somewhere you want to go\n', 'Get there quick or drown in the undertow\n', 'Come on, baby, time is wasting\n', 'Choose a wheel and get to racing\n', '\n', 'Every time I say what I want to be\n', "Someone says that's not how it's going to be\n", 'Come on baby, quit your dreaming\n', 'Grab your things, the train is leaving\n', 'And if you feel like nobody understands\n', 'You just smile and pray that they clap their hands\n', 'Come on, baby, drown your sorrow\n', 'Work today and live tomorrow\n', 'Time to pick that somewhere you want to go\n', 'Get there quick or drown in the undertow\n', 'Come on baby, time is wasting\n', 'Choose a wheel and get to racing ']

In [None]:
! pip install lyricsgenius

In [None]:
import lyricsgenius

#songs = {'Green Day':'Holiday', 'The All-American Rejects':'Dirty Little Secret', 'Blink-182':'First Date'}
songs = playlist_dict
lyrics_dict = {}
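# Replace the placeholder with your own Genius access token (redacted for submission).
genius = lyricsgenius.Genius("<YOUR_GENIUS_ACCESS_TOKEN>")

for song in songs:
    lyric = genius.search_song(songs[song][0], songs[song][1])
    # search_song returns None when no match is found, so guard before reading .lyrics.
    if lyric is None:
        print("No lyrics found for", songs[song])
        continue
    lyrics_dict[songs[song]] = lyric.lyrics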

print(lyrics_dict)

Below this line, we process lyrics data.


# Pre-processing
The functions in this section take a list of lyrics (assumed to be one line per element), then perform word splitting, normalization (lowercasing and punctuation removal), and stopword removal. Despite its name, the `lemmatize` function does not perform true lemmatization.

To make this a list of songs, we simply have one **song** per element.
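
As a rough illustration of these pre-processing steps, the sketch below tokenizes a short lyric snippet using only the standard library. The names `normalize_word` and `tokenize` and the tiny `SAMPLE_STOPWORDS` set are assumptions made for this example; the notebook itself relies on NLTK's English stopword list.

```python
import string

# Hypothetical, tiny stopword list used only for this illustration;
# the notebook itself uses NLTK's English stopword list.
SAMPLE_STOPWORDS = {"in", "a", "the", "can", "this"}

def normalize_word(word):
    """Lowercase a word and strip ASCII punctuation."""
    return word.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text, stop_words=SAMPLE_STOPWORDS):
    """Split on whitespace, normalize each word, and drop stopwords and empties."""
    tokens = (normalize_word(word) for word in text.split())
    return [token for token in tokens if token and token not in stop_words]

tokens = tokenize("Days can drown in a plastic cup\nIn this town\n")
print(tokens)  # ['days', 'drown', 'plastic', 'cup', 'town']
```

Because punctuation is stripped before the stopword check, contractions such as "Don't" become "dont", which may no longer match an entry in a stopword list.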

Libraries for NLP.

In [None]:
!pip install -U nltk
import nltk
nltk.download('stopwords')


In [None]:
import math
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords 
from sklearn.metrics.pairwise import cosine_similarity

stopWords = set(stopwords.words('english'))

In [None]:
def lemmatize(word):
    # Despite the name, this only normalizes: lowercase, strip punctuation,
    # and map stopwords to '' so that callers can filter them out.
    res = word.lower().translate(str.maketrans('', '', string.punctuation))
    return res if res not in stopWords else ''

def clean_tokens(words):
    # Normalize each word, then drop the empty strings left behind by stopwords.
    return [token for token in (lemmatize(word) for word in words) if len(token) > 0]

def get_song_lines_from_dict(lyrics_dict):
    # One token list per song: each dictionary value holds a whole song.
    song_lines = []
    for index in lyrics_dict:
        song_line = clean_tokens(lyrics_dict[index].split())
        if len(song_line) > 0:
            song_lines.append(song_line)
    return song_lines

def get_song_lines_from_file(lyrics_file):
    # One token list per non-empty line of the file.
    with open(lyrics_file, 'r') as lyrics:
        return get_song_lines_from_list(lyrics)

def get_song_lines_from_list(lyrics_list):
    # One token list per non-empty element of the list.
    song_lines = []
    for line in lyrics_list:
        song_line = clean_tokens(line.split())
        if len(song_line) > 0:
            song_lines.append(song_line)
    return song_lines

In [None]:
lyrics_dict

# Bag-of-words
Given a list of tokenized sentences, returns the binary BOW vector representation of each sentence (where a sentence can be a line in a song, or an entire song).
# Cosine Similarity Matrix
Returns a matrix of the pairwise cosine similarities given a set of lyrics vectors.
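
To make the two definitions above concrete, here is a minimal standard-library sketch of binary BoW vectors and their pairwise cosine similarities. The helper names `binary_bow` and `cosine` and the toy documents are assumptions for illustration only; the notebook cells use pandas and scikit-learn instead.

```python
import math

def binary_bow(token_lists):
    """Return (vocabulary, vectors) with 1 where a word occurs in a document."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    vectors = []
    for tokens in token_lists:
        present = set(tokens)
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents: the first two share one word, the third shares none.
docs = [["sun", "goes", "down"], ["sun", "comes", "up"], ["train", "leaving"]]
vocabulary, vectors = binary_bow(docs)
matrix = [[cosine(u, v) for v in vectors] for u in vectors]
for row in matrix:
    print([round(value, 3) for value in row])
```

The first two documents share one of three words each, so their similarity is about 1/3, while the third document shares nothing with the others and scores 0.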

In [None]:
def bag_of_words(list_of_sentences):
    # Vocabulary over all sentences, with the empty strings left by stopwords removed.
    wordset = sorted({lemmatize(word) for sentence in list_of_sentences for word in sentence} - {''})
    results = []
    for sentence in list_of_sentences:
        present = set(sentence)
        # Binary BoW: 1 if the word occurs in the sentence, 0 otherwise (counts are ignored).
        results.append([1 if word in present else 0 for word in wordset])
    return results

def cosine_similarity_matrix(bow_vectors):
    # cosine_similarity computes every pairwise similarity between rows in one call.
    return pd.DataFrame(cosine_similarity(bow_vectors))#, index=songtitles, columns=songtitles)

def cosine_similarity_matrix_tfidf(tfidf_df):
    # Same computation, applied to the rows of the TF-IDF DataFrame.
    return pd.DataFrame(cosine_similarity(tfidf_df.to_numpy()))#, index=songtitles, columns=songtitles)

In [None]:
lines_from_dict = get_song_lines_from_dict(lyrics_dict)

In [None]:
testing_vectors = bag_of_words(lines_from_dict)
cosine_bow = cosine_similarity_matrix(testing_vectors)
cosine_bow

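# TF-IDF
Computes a TF-IDF score per word and song, where TF is the word's count divided by the number of tokens in the song, and IDF is the natural log of the number of songs divided by the number of songs containing the word.
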
In [None]:
# Get TF of a line.
def get_frequencyBoW(unique_words, line):
    bow = dict.fromkeys(unique_words, 0)
    for word in line:
        bow[word] += 1
    return bow
  
def get_TF(dictionary, line):
    tf_dict = {}
    bow_count = len(line)
    for word, count in dictionary.items():
        tf_dict[word] = count / float(bow_count)
    return tf_dict

def idf(documents):
    n = len(documents)
    idf_dict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idf_dict[word] += 1
    
    for word, val in idf_dict.items():
        idf_dict[word] = math.log(n / float(val))
    return idf_dict

def tfidf(song_lines):
    # Vocabulary across all songs.
    unique_words = set()
    for song_line in song_lines:
        unique_words = unique_words.union(set(song_line))
    # Raw term counts per song, computed once and reused for both IDF and TF.
    frequencyBoWs = [get_frequencyBoW(unique_words, song_line) for song_line in song_lines]
    IDFs = idf(frequencyBoWs)
    TFIDFs = []
    for song_line, frequencyBoW in zip(song_lines, frequencyBoWs):
        TF = get_TF(frequencyBoW, song_line)
        # A word found in every song has IDF = log(1) = 0, so its TF-IDF is 0.
        TFIDFs.append({word: val * IDFs[word] for word, val in TF.items()})
    return pd.DataFrame(TFIDFs)

In [None]:
tfidf_df = tfidf(lines_from_dict)
cosine_tfidf = cosine_similarity_matrix_tfidf(tfidf_df)
cosine_tfidf

# Analysis

In [None]:
# Bag of Words Heatmap
import seaborn as sns
sns.heatmap(cosine_bow)

In [None]:
# TFIDF Heatmap
sns.heatmap(cosine_tfidf)

In [None]:
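def normalize(df):
    # Min-max scale the similarities to [0, 1], ignoring entries equal to 1
    # (within a small tolerance), i.e. each song compared with itself.
    mn = 1
    mx = 0
    for i, j in df.iterrows():
        for x in j:
            if (abs(1-x) > 0.0000002):
                mn = min(mn, x)
                mx = max(mx, x)
    # Self-similarities stay at 1; every other entry is rescaled.
    return df.applymap(lambda x : (x-mn)/(mx-mn) if abs(1-x) > 0.0000002 else 1)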

In [None]:
sns.heatmap(normalize(cosine_bow))

In [None]:
sns.heatmap(normalize(cosine_tfidf))

To give a bit of context, the playlist is a "nostalgia" playlist containing songs from the 60s through the 00s. The songs therefore span different eras of music, which means they may differ considerably in vocabulary and theme. To illustrate this difference, we look at the similarities in language through the cosine similarity matrices computed above.

Initially, the pairwise cosine similarities for both BoW and TFIDF were very low, mostly within the range of 0.0 to 0.3. Taken in isolation, this means that the songs aren't really similar to each other, at least with respect to their lyrics. Again, the songs vary in their release dates and thus their "generations", so the vocabulary and themes vary as well, from friendship to romance to breakups to even infatuation with a place.

We can see that the BoW heatmap has a slightly lighter tone than the TFIDF one, indicating slightly higher similarity scores. However, since we wanted to look at the relative similarities rather than the raw values, we decided to normalize these scores. After normalizing, we noticed that there were more _relatively_ similar pairs of songs in the BoW matrix, such as I Wanna Dance with Somebody by Whitney Houston and My Girl by The Temptations. This is most likely because of words that appear in both songs of the pair, such as "feel" and "way". Closely associated words, such as "hot" in the former song and "cold" in the latter, suggest a thematic link, but BoW only counts exact matches, so they do not raise the score by themselves. In a nutshell, BoW yields more relatively similar pairs because it looks at which words are used without gauging their importance in each song, so songs that share common words come out as similar.

This is where TFIDF comes in. The TFIDF heatmap looks far less encouraging, but that is because TFIDF takes the importance of each word into account. There are fewer relatively similar pairs of songs, and most of them still have a low cosine similarity score. Going back to our example of Whitney Houston and The Temptations, this time the score is a lot lower because the words mentioned (hot, cold, feel, way) have differing importance in the two songs, or aren't important at all ("hot" and "cold" are only mentioned once or twice in their respective songs).
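
The effect described above can be reproduced on a toy corpus. The sketch below is an illustration under stated assumptions: the three "songs" are made up, the helper names are hypothetical, and TF-IDF is defined as in this notebook (TF as count over document length, IDF as the natural log of the number of documents over the document frequency).

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def binary_vectors(docs, vocabulary):
    """1 if the word occurs in the document, 0 otherwise."""
    vectors = []
    for doc in docs:
        present = set(doc)
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vectors

def tfidf_vectors(docs, vocabulary):
    """TF = count / document length, IDF = ln(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vectors

# Made-up "songs": the first two share only "love", which appears in all three.
docs = [
    ["love", "dance", "dance", "somebody"],
    ["love", "girl", "sunshine", "girl"],
    ["love", "town", "train", "train"],
]
vocabulary = sorted({word for doc in docs for word in doc})
bow_vecs = binary_vectors(docs, vocabulary)
tfidf_vecs = tfidf_vectors(docs, vocabulary)
bow_sim = cosine(bow_vecs[0], bow_vecs[1])
tfidf_sim = cosine(tfidf_vecs[0], tfidf_vecs[1])
print(f"BoW: {bow_sim:.3f}  TF-IDF: {tfidf_sim:.3f}")
```

Under BoW the shared word counts fully toward the similarity, whereas under this TF-IDF definition a word that appears in every document gets a weight of zero, so the pair no longer looks similar at all.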

# Five Most Important Words

The 5 most important words for each song are listed below. Most of the time, the words in the title also appear among the most important words, provided they aren't stopwords. When the title words do not appear, words from the chorus do, since the chorus is usually the most repeated part of a song.

As summaries, however, these words still fall short: many of them (e.g. ooh, gogo, woah, im) are not in the stopword list even though they behave like stopwords, so they convey little meaning, if any.
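
One possible way to address the filler words mentioned above is to treat them as extra stopwords when picking the top words. The sketch below is only an assumption about how this could look: `FILLER_WORDS`, `top_words`, and the scores are made up for illustration and are not results from the playlist.

```python
# Hypothetical filler words, taken from the examples in the text above.
FILLER_WORDS = {"ooh", "gogo", "woah", "im"}

def top_words(scores, k=5, extra_stopwords=FILLER_WORDS):
    """Return the k highest-scoring words with a positive score, skipping fillers."""
    candidates = [
        (word, score) for word, score in scores.items()
        if score > 0 and word not in extra_stopwords
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in candidates[:k]]

# Made-up TF-IDF scores for a single song, for illustration only.
sample_scores = {"ooh": 0.30, "dance": 0.25, "somebody": 0.20,
                 "im": 0.15, "heat": 0.10, "love": 0.0}
print(top_words(sample_scores, k=3))  # ['dance', 'somebody', 'heat']
```

This sketch filters after scoring for simplicity. Removing the filler words during pre-processing instead would also change each song's token count, and therefore the TF values of the remaining words.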

In [None]:
impt_df = pd.DataFrame(columns=['title', 'important_words'])
for index, row in tfidf_df.iterrows():
    impt_df.loc[index] = {
        'title': playlist_dict[index][0],
        'important_words': ' '.join(row[row > 0].sort_values(ascending=False).head().index)
    }

impt_df

# **Bonus**

The following is a TF-IDF analysis of the top 5 songs of the band **Bread**. We identify the top 10 most important words and discuss an analysis of these.

In [None]:
# Searches for artist URI
results = sp.search("artist:bread")
artist_id = results['tracks']['items'][0]['artists'][0]['uri']
print(artist_id)

#sp.artist_top_tracks(artist_id)

In [None]:
ARTIST_URI = 'spotify:artist:70ZTdbPEcEugBNay4MvxfL' # Artist URI of the band Bread
ARTIST = 'Bread'
TOP_SONGS = 5

artist_tracks = sp.artist_top_tracks(ARTIST_URI)
track_dict = {}

for i, track in enumerate(artist_tracks['tracks'][:TOP_SONGS]):
    title = track['name']
    track_dict[i] = title

print(track_dict)

In [None]:
# import lyricsgenius

songs = track_dict
lyrics_dict_2 = {}
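# Replace the placeholder with your own Genius access token (redacted for submission).
genius = lyricsgenius.Genius("<YOUR_GENIUS_ACCESS_TOKEN>")

for song in songs:
    lyric = genius.search_song(songs[song], ARTIST)
    # search_song returns None when no match is found, so guard before reading .lyrics.
    if lyric is None:
        print("No lyrics found for", songs[song])
        continue
    lyrics_dict_2[songs[song]] = lyric.lyrics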

print(lyrics_dict_2)

In [None]:
# Using functions from above, we use TF-IDF on the artist's top 5 songs
addtl_stopwords = ['chorus','verse','1','2','3','4','5']

# Build new lists instead of calling remove() while iterating, which skips elements.
lines_from_dict = [
    [word for word in line if word not in addtl_stopwords]
    for line in get_song_lines_from_dict(lyrics_dict_2)
]

tfidf_df = tfidf(lines_from_dict)
cosine_tfidf = cosine_similarity_matrix_tfidf(tfidf_df)
cosine_tfidf

In [None]:
tfidf_df

## Analysis

In [None]:
sns.heatmap(cosine_tfidf)

In [None]:
sns.heatmap(normalize(cosine_tfidf))


Bread is a 1970s soft rock band. Most of their songs, at least in the lyrics, capture the culture of the 70s. Using Spotify's API, we retrieved the band's five most popular songs, and mined each song's lyrics from Genius.com.

We used TF-IDF to analyze the songs' lyrics and obtained a heatmap similar to the ones from the playlist analysis. The lyrics of the songs are not very similar to each other, as the pairwise TF-IDF cosine similarities were very low. This isn't surprising, since an artist would most probably write each song around a different topic, so no two songs are inherently similar lyrics-wise. Writing songs with similar topics and word content would make the artist sound bland.

As before, normalizing the cosine similarities lightened the tone of the heatmap. This means that some pairs of songs, such as *Everything I Own* and *If*, became relatively more 'similar'.

Below is the normalized TF-IDF matrix for the top 5 songs from Bread:

In [None]:
normalize(cosine_tfidf)

### Top 10 words

Given below are the most important words per song according to TF-IDF. This, however, won't effectively show the most important words when the top 5 songs are taken together. Thus, in the cell after it, we gather the 10 words with the greatest summed TF-IDF scores.
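
The aggregation described above can be sketched with the standard library alone. The scores below are made up and `aggregate_scores` is a hypothetical helper; the point is only to show how per-song scores are summed into one ranking.

```python
from collections import Counter

def aggregate_scores(per_song_scores):
    """Sum each word's score over all songs."""
    totals = Counter()
    for scores in per_song_scores:
        for word, score in scores.items():
            totals[word] += score
    return totals

# Made-up per-song TF-IDF scores, for illustration only.
per_song = [
    {"ima": 0.40, "want": 0.10, "baby": 0.05},
    {"want": 0.15, "own": 0.20},
    {"baby": 0.10, "own": 0.25},
]
totals = aggregate_scores(per_song)
top_two = [word for word, _ in totals.most_common(2)]
print(top_two)  # ['own', 'ima']
```

A word can rank highly either by scoring moderately in several songs ("own" here) or by scoring very highly in a single song ("ima" here).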

In [None]:
# Most important words for each song

impt_df = pd.DataFrame(columns=['title', 'important_words'])
for index, row in tfidf_df.iterrows():
    impt_df.loc[index] = {
        'title': track_dict[index],
        'important_words': ', '.join(tfidf_df.iloc[index][tfidf_df.iloc[index] > 0].sort_values(ascending=False).head(10).index)
    }

impt_df

In [None]:
# 10 most important words for the top 5 songs
from collections import Counter

impt_dict = {}
for (colName, colData) in tfidf_df.items():  # iteritems() was removed in pandas 2.0
    impt_dict[colName] = colData.sum()

top = Counter(impt_dict)
top.most_common(10)


The top 10 words of Bread's top 5 songs are given above. The top word "I'm-a", which appears here as *'ima'* after punctuation removal, only appeared in the song *Baby I'm-a Want You*, although it appeared in that song many times. The other words appeared in 2 or more songs and rank above the remaining words because of their aggregated TF-IDF scores. Unsurprisingly, some of these words are already found in their respective songs' titles or within the choruses of certain songs.