# L01: Corpus exploration: Song lyrics
In this lab, we explore the song lyrics of our favorite artists 🥳

Using the **Levenshtein distance**, we first match a search string for an artist name against the artist names in the data:
1. Given a search string, display the 5 most similar names.

Using **TF-IDF vectorization** of the text and the **cosine similarity**, we implement the following functionality:

2. Given an artist, print out the 5 most similar artists based on their lyrics.
3. Print the 10 most important words/phrases of an artist.

## Data loading
The data consists of lyrics crawled from the website MetroLyrics, which no longer seems to be online.

The data is available here: https://github.com/hiteshyalamanchili/SongGenreClassification/raw/master/dataset/original_cleaned_lyrics.zip

In [4]:
import pandas as pd
from zipfile import ZipFile

In [19]:
zfile = ZipFile('L01_original_cleaned_lyrics.zip')
df = pd.read_csv(zfile.open('original_cleaned_lyrics.csv'))
df

Unnamed: 0.1,Unnamed: 0,index,song,year,artist,genre,lyrics
0,0,0,ego-remix,2009,beyonce-knowles,Pop,Oh baby how you doing You know I'm gonna cut r...
1,1,1,then-tell-me,2009,beyonce-knowles,Pop,playin everything so easy it's like you seem s...
2,2,2,honesty,2009,beyonce-knowles,Pop,If you search For tenderness It isn't hard to ...
3,3,3,you-are-my-rock,2009,beyonce-knowles,Pop,Oh oh oh I oh oh oh I If I wrote a book about ...
4,4,4,black-culture,2009,beyonce-knowles,Pop,Party the people the people the party it's pop...
...,...,...,...,...,...,...,...
227444,362232,362232,who-am-i-drinking-tonight,2012,edens-edge,Country,I gotta say Boy after only just a couple of da...
227445,362233,362233,liar,2012,edens-edge,Country,I helped you find her diamond ring You made me...
227446,362234,362234,last-supper,2012,edens-edge,Country,Look at the couple in the corner booth Looks a...
227447,362235,362235,christ-alone-live-in-studio,2012,edens-edge,Country,When I fly off this mortal earth And I'm measu...


In [22]:
# TODO
# Create the variable artists_set which contains a set of all the artists in the data
artists_set = set(df['artist'])
print(len(artists_set))
sorted_artists = sorted(artists_set)

# Print the first 10 artists from the sorted list
print(sorted_artists[:10])

11117
['009-sound-system', '047', '1-800-zombie', '10-cc', '10-years', '100-demons', '100-monkeys', '10000-maniacs', '1000mods', '104']


Let's look at the genre distribution.

In [24]:
from collections import Counter
# TODO 
# use the Counter class to count and print the number of songs per genre in descending number of counts
# Hint: check the most_common() function of the Counter class
counter = Counter(df['genre'])
print(counter.most_common())

[('Rock', 104137), ('Pop', 36439), ('Hip-Hop', 23215), ('Metal', 22420), ('Country', 14182), ('Jazz', 7520), ('Electronic', 7231), ('Other', 3989), ('R&B', 3354), ('Indie', 2970), ('Folk', 1992)]


Which 50 artists have the most song lyrics in the data?

In [26]:
# TODO
# use a Counter to count how many songs each artist has and print the 50 artists with the most songs.
counter = Counter(df['artist'])
print(counter.most_common(50))

[('dolly-parton', 742), ('elton-john', 666), ('chris-brown', 617), ('barbra-streisand', 598), ('bee-gees', 590), ('eddy-arnold', 585), ('bob-dylan', 584), ('eminem', 569), ('american-idol', 568), ('ella-fitzgerald', 566), ('dean-martin', 552), ('david-bowie', 546), ('b-b-king', 541), ('elvis-costello', 516), ('bruce-springsteen', 508), ('beach-boys', 477), ('bill-anderson', 460), ('eric-clapton', 454), ('frank-zappa', 431), ('chumbawamba', 422), ('frank-sinatra', 412), ('celine-dion', 410), ('britney-spears', 402), ('chicago', 398), ('beatles', 396), ('chamillionaire', 396), ('diana-ross', 396), ('50-cent', 392), ('emmylou-harris', 384), ('bon-jovi', 373), ('barry-manilow', 368), ('babyface', 367), ('fall', 366), ('2pac', 361), ('elvis-presley', 359), ('drake', 359), ('conway-twitty', 355), ('game', 326), ('electric-light-orchestra', 326), ('buck-owens', 321), ('alice-cooper', 319), ('fleetwood-mac', 316), ('billie-holiday', 315), ('beck', 312), ('ferlin-husky', 303), ('cliff-richard',

## 1. Search function for artist names
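We use the Levenshtein distance to compare a search string to the artist names in the data. It counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other, so a smaller value means a closer match.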
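For intuition, the distance can be computed with a small dynamic program. The following is a minimal pure-Python sketch, not the implementation used by the `Levenshtein` package; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(levenshtein("billy-hollidays", "billie-holiday"))
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the current and the previous row.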

In [33]:
from Levenshtein import distance

def search_artist(artist_name: str, num_print: int = 5):
    artist_name = artist_name.replace(' ', '-')  # artist names don't have whitespaces in the data
    
    # TODO
    # If artist_name is in the set of artists, don't perform the Levenshtein comparison (saves compute!)
    if artist_name in artists_set:
        print('Artist ' + artist_name + ' found in set.')
    else:
        string_similarities = []
        # TODO 
        # iterate the artists_set and compare the search string artist_name to each entry
        # store the result of each comparison as a tuple (Levenshtein distance value, artist) 
        # in the list string_similarities
        for artist in artists_set:
            string_similarities.append((distance(artist, artist_name), artist))
        
        # TODO
        # print the num_print most similar names to the searched artist_name with a "Did you mean:" prefix
        string_similarities.sort()
        print("Did you mean:")
        for similarity in string_similarities[:num_print]:
            print(f"{similarity[1]}? (Distance: {similarity[0]})")

In [34]:
search_artist('dolly parton')
search_artist('billy hollidays')

Artist dolly-parton found in set.
Did you mean:
billie-holiday? (Distance: 4)
billy-childs? (Distance: 6)
billy-jo-spears? (Distance: 6)
billy-jonas? (Distance: 6)
ally-hills? (Distance: 7)


## 2. Search function to find similar artists based on lyrics
First, concatenate the lyrics per artist.

In [35]:
from collections import defaultdict
artist_lyrics = defaultdict(str)
for artist, track in zip(df['artist'], df['lyrics']):
    artist_lyrics[artist] += ' ' + track  #  here's a good place to do preprocessing as you touch each lyrics string
len(artist_lyrics)

11117

### Text vectorization
Now vectorize the lyrics using TF-IDF from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Study the parameters carefully to improve your outputs!
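To make the weighting concrete, here is a minimal pure-Python sketch of what TF-IDF computes. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn documents as defaults of `TfidfVectorizer`. The whitespace tokenization, the helper name `tfidf_vectors`, and the toy documents are simplifications for illustration, so the numbers will not match the real vectorizer on real lyrics.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [doc.lower().split() for doc in docs]  # simplification: split on whitespace
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # smoothed inverse document frequency (assumed to mirror sklearn's default)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        vectors.append([v / norm for v in raw])
    return vocab, vectors


toy_docs = ["love love baby", "love road truck", "road metal scream"]
vocab, vectors = tfidf_vectors(toy_docs)
for word, weight in zip(vocab, vectors[1]):
    print(f"{word:8s} {weight:.3f}")
```

In the second toy document every word occurs once, yet `truck` receives a higher weight than `love` or `road` because it appears in no other document.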

Note that the TfidfVectorizer accepts an iterable of strings as input, so we need to get them from the artist_lyrics dict. The values() method below returns an iterable that preserves the insertion order of the dict, so we don't lose the link between the lyrics and the artists they belong to.

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer
# TODO
# instantiate the TfidfVectorizer with the desired parameters from the documentation
tfidf = TfidfVectorizer(stop_words='english')
tfidf_vecs = tfidf.fit_transform(artist_lyrics.values())
tfidf_vecs.shape

(11117, 313178)

### Similarity calculation
Now calculate the cosine distance between the TF-IDF vector of a given artist and the vectors of all other artists; smaller values mean more similar lyrics.
Use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
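As a rough illustration of what gets computed, cosine distance can be written as 1 minus the cosine of the angle between two vectors. The sketch below uses plain Python lists; the artist names and vectors are made up, and returning 1.0 for an all-zero vector is a convention assumed for this sketch.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption of this sketch: an empty vector is similar to nothing
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for TF-IDF rows
toy_vectors = {
    "artist-a": [1.0, 2.0, 0.0],
    "artist-b": [2.0, 4.0, 0.0],  # same direction as artist-a
    "artist-c": [0.0, 0.0, 3.0],  # shares no words with artist-a
}

query = "artist-a"
ranking = sorted(
    (cosine_distance(toy_vectors[query], vec), name)
    for name, vec in toy_vectors.items()
    if name != query
)
for dist, name in ranking:
    print(name, round(dist, 4))
```

Because only the direction of the vectors matters, an artist with many songs and an artist with few songs can still be close as long as their word weights are proportional.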

In [59]:
from sklearn.metrics.pairwise import cosine_distances
import numpy as np

artist_list = list(artist_lyrics.keys())  # we cannot use the set artists_set here as we need to keep the artist names and their lyrics aligned!

def find_similar_artists(artist_name: str, num_print: int = 10):
    
    # TODO
    # Make sure we have artist_name in our data!
    # If not, call your function search_artist() to suggest similar artist names
    artist_name = artist_name.replace(' ', '-')  # normalize the name the same way search_artist() does
    if artist_name not in artist_list:
        search_artist(artist_name)
    else:
        # We first need to get the correct vector using the list index of artist_name in artist_list
        artist_tfidf_vec = tfidf_vecs[artist_list.index(artist_name)]
        # TODO
        # apply cosine_distances() between the tfidf vector of the artist and those of the other artists
        distances = cosine_distances(artist_tfidf_vec, tfidf_vecs)
    
        sorted_indices = np.argsort(distances)[0]  # we sort the indices of the vectors to map them to the artist_list
        for i in sorted_indices[1: num_print + 1]:  # skip the closest artist, as it is itself!
            print(artist_list[i], distances[0][i])

In [60]:
find_similar_artists('anthraxxxx')
print("====================")
find_similar_artists('anthrax')

Did you mean:
anthrax? (Distance: 3)
ataraxia? (Distance: 5)
ataraxie? (Distance: 5)
a-trak? (Distance: 6)
abraxas? (Distance: 6)
avenged-sevenfold 0.3635765521870937
black-sabbath 0.366487267098607
ani-difranco 0.3831536538537017
atmosphere 0.3853595791430938
bon-jovi 0.3861465679145868
blues-traveler 0.3927929390116326
elvis-costello 0.3967996499913934
311 0.3986935511449764
genesis 0.3991778397447068
alice-cooper 0.3998815037834318


## 3. Most important words for an artist
Now, let's find the most distinctive words that an artist uses, based on the TF-IDF values. To map the TF-IDF values to the words, we first need to extract the vocabulary of the TF-IDF vectorizer.

Note that TfidfVectorizer returns a sparse matrix. This makes sense: most values in the vectors are zeros, so a sparse representation is far more memory-efficient. Sparse rows are less convenient for operations such as np.argsort, though, so we use toarray() to convert the artist's row into a dense (i.e. "normal") array.
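The top-weighted words can also be found without building a dense vector, by looking only at the non-zero entries. The sketch below models one sparse row as two parallel lists, `indices` and `data`, on the assumption that this mirrors the attributes of the same names on a SciPy CSR row. The helper name `top_k_sparse` and the toy values are hypothetical.

```python
import heapq


def top_k_sparse(indices, data, vocab, k):
    """Return the k (word, weight) pairs with the largest weights in a sparse row."""
    # nlargest only scans the non-zero entries and keeps the k best of them
    best = heapq.nlargest(k, zip(data, indices))
    return [(vocab[index], weight) for weight, index in best]


# Toy vocabulary and one sparse row: only positions 1, 3 and 4 are non-zero
toy_vocab = ["ain", "dre", "road", "shady", "slim"]
toy_indices = [1, 3, 4]
toy_data = [0.10, 0.18, 0.12]

print(top_k_sparse(toy_indices, toy_data, toy_vocab, k=2))
```

With a vocabulary of several hundred thousand words and only a few thousand non-zero weights per artist, scanning just the stored entries avoids sorting a mostly empty array.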

In [61]:
tfidf_vocab_list = tfidf.get_feature_names_out()

def find_important_words(artist_name: str, num_print: int = 20):
    # TODO
    # Make sure we have artist_name in our data!
    # If not, call your function search_artist() to suggest similar artist names
    artist_name = artist_name.replace(' ', '-')  # normalize the name the same way search_artist() does
    if artist_name not in artist_list:
        search_artist(artist_name)
    else:
        artist_tfidf_vec = tfidf_vecs[artist_list.index(artist_name)].toarray()[0]  # as the array is two dimensional, we have to access the 1st dimension which contains the actual vectors
        # TODO
        # Sort the word indices by their TF-IDF values in ascending order
        sorted_word_indices = np.argsort(artist_tfidf_vec)

        # TODO
        # use np.flip() to reverse sorted_word_indices, so the words with the highest weights come first
        sorted_word_indices = np.flip(sorted_word_indices)
        for index in sorted_word_indices[:num_print]:
            print(tfidf_vocab_list[index], artist_tfidf_vec[index])        

In [62]:
find_important_words('eminem')

shit 0.268050720546989
like 0.25187611021022027
fuck 0.2363867206629078
just 0.21709599921613354
don 0.19643115885087775
shady 0.18121457402527022
bitch 0.175857471709649
ain 0.164104307211459
cause 0.15772494551348668
got 0.14618851659084042
ass 0.13918616244872614
em 0.13810585508888273
fuckin 0.12840186761574435
know 0.1281132653877803
slim 0.11723684951392975
ll 0.11066274841617774
eminem 0.09995538485961142
dre 0.09919221526998229
man 0.09533451935249489
say 0.09030341835623744
