Looking up the lyrics of a song and trying to relate them to our own lives is something a lot of us have done. I have been doing that a lot this year. Often, I go to a website called [songmeanings.com](www.songmeanings.com) and look at other people's interpretations of a song's lyrics. After I'm done with a song, I end up longing for more songs on the same "topic". This belongs to the area of "subjective recommendations". Over the years, I have tried all sorts of recommendation methods, including but not limited to Spotify radio, Reddit's [ifyoulikeblank](reddit.com/r/ifyoulikeblank/), asking friends, and plain googling. I've always wondered how cool it would be to have a program that does this, customized for my style, instead of going through these four methods manually.

Well, the obvious question is: why not just use Spotify's Discover Weekly recommendations? After all, it is supposed to be one of the best in the business. The catch is that Spotify apparently combines 3 different methods - [list methods] - and this blog post is based on only one of them: lyrics.

This post consists of 3 parts - Data collection and preprocessing, 

## Data collection and preprocessing

I have discussed how to get the lyrics of artists from Genius' API in detail in my previous [blog post on sentiment analysis of music lyrics](https://kvsingh.github.io/lyrics-sentiment-analysis.html).
For preprocessing our text data, we will apply a number of basic NLP cleaning techniques, customized for our particular domain.

Here are the data preprocessing steps:

* Remove song section identifiers (like "[Verse]", "[Chorus]", etc)
* Tokenize and convert to lowercase
* Remove punctuation
* Lemmatize
* Remove "common" words

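In [None]:
import os
import re

def clean_raw_data(raw_text):
    # Remove section identifiers such as "[Verse]" or "[Chorus]"; the pattern
    # also drops parenthesised text like "(x2)"
    cleaned_text = re.sub(r'[\(\[].*?[\)\]]', '', raw_text)
    # Drop the empty lines left behind by the removed identifiers
    cleaned_text = os.linesep.join([s for s in cleaned_text.splitlines() if s])
    return cleaned_text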
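
To make the effect of that regular expression concrete, here is a minimal sketch that applies the same pattern to a made-up snippet. The function name `strip_section_tags` and the sample text are hypothetical, and it joins lines with `"\n"` instead of `os.linesep` so the result is the same on every platform.

```python
import re

def strip_section_tags(raw_text):
    """Drop bracketed or parenthesised tags, then drop the empty lines left behind."""
    without_tags = re.sub(r'[\(\[].*?[\)\]]', '', raw_text)
    return "\n".join(line for line in without_tags.splitlines() if line.strip())

# Made-up lyrics, used only for illustration
sample = "[Verse 1]\nFirst line (echo)\n\n[Chorus]\nSecond line"
print(strip_section_tags(sample))
```

Note that the pattern also removes parenthesised text such as "(echo)", which leaves a trailing space on that line. That is harmless here, because tokenization discards whitespace anyway.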

After this, we "tokenize" the words (a process of splitting the text into smaller chunks). Consider the following lyrics from Eminem's "Lose Yourself".

> There's vomit on his sweater already: Mom's spaghetti

This gets split into:

> ['There', "'s", 'vomit', 'on', 'his', 'sweater', 'already', ':', 'Mom', "'s", 'spaghetti']

We also convert the text to lowercase.

In [1]:
from nltk.tokenize import word_tokenize

# `text` is the cleaned lyrics string returned by clean_raw_data
tokens = word_tokenize(text)
tokens = [t.lower() for t in tokens]

Now, remove the punctuation and lemmatize the tokens.

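In [None]:
import string
from nltk.stem import WordNetLemmatizer

# Remove punctuation characters, then drop tokens that end up empty
translate_table = dict((ord(char), None) for char in string.punctuation)
tokens = [t.translate(translate_table) for t in tokens]
tokens = list(filter(bool, tokens))

# Reduce each token to its WordNet lemma (treated as a noun by default)
wordnet_lemmatizer = WordNetLemmatizer()
tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]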
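
As a quick check of the punctuation step, the sketch below runs it on the token list from the "Lose Yourself" example above. The tokens are hard-coded so the snippet runs without NLTK, the helper name `strip_punctuation` is hypothetical, and lemmatization is left out because it needs the WordNet data.

```python
import string

def strip_punctuation(tokens):
    """Lowercase tokens, delete punctuation characters, and drop empty results."""
    table = str.maketrans('', '', string.punctuation)
    lowered = [t.lower() for t in tokens]
    stripped = [t.translate(table) for t in lowered]
    return [t for t in stripped if t]

# Token list copied from the tokenization example above
tokens = ['There', "'s", 'vomit', 'on', 'his', 'sweater', 'already', ':',
          'Mom', "'s", 'spaghetti']
print(strip_punctuation(tokens))
# ['there', 's', 'vomit', 'on', 'his', 'sweater', 'already', 'mom', 's', 'spaghetti']
```

The standalone ":" disappears entirely, while "'s" shrinks to "s", which the later length filter takes care of.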

Now, we remove stop words (commonly occurring words in the English language), because they do not contribute to any particular topic but may still show up simply because they occur so often. We also remove tokens with two or fewer characters. The list of stopwords used is available at [this link](https://gist.github.com/sebleier/554280).

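In [None]:
from nltk.corpus import stopwords

# Build the stop word set once instead of reloading the list for every token
english_stopwords = set(stopwords.words('english'))
tokens = [t for t in tokens if len(t) > 2 and t not in english_stopwords]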

In addition to the English stopwords, there are other common words that are specific to our domain (song lyrics). How do we find such words? I decided to remove the 50 most commonly occurring words in our lyrics corpus.
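
One way to do this is with `collections.Counter`. The following is a minimal sketch rather than the exact code used for this post: the function name `remove_most_common`, the assumption that the corpus is a list of token lists (one per song), and the tiny made-up corpus are all illustrative.

```python
from collections import Counter

def remove_most_common(tokenized_songs, n=50):
    """Remove the n most frequent tokens across the whole corpus from every song."""
    counts = Counter(tok for song in tokenized_songs for tok in song)
    common = {word for word, _ in counts.most_common(n)}
    filtered = [[tok for tok in song if tok not in common] for song in tokenized_songs]
    return filtered, common

# Made-up corpus: each inner list stands for one song's cleaned tokens
songs = [["love", "baby", "night", "road"],
         ["love", "baby", "rain"],
         ["love", "fire"]]
filtered, common = remove_most_common(songs, n=2)
print(common)    # the two most frequent words: 'love' and 'baby'
print(filtered)  # [['night', 'road'], ['rain'], ['fire']]
```

Counting over the whole corpus, rather than per song, is what makes the removed words domain-specific: a word has to be frequent across many songs to make the list.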

I played around with different orders for the rest of the steps. Initially, I tried removing the punctuation before tokenization. This made words lose their inherent structure, and some stop words would not get removed. Consider the word "I'm". If we remove punctuation before tokenizing, it gets reduced to "Im". Lemmatization doesn't change this word, and it is not detected as a stop word, so it ends up as a prominent word in the topics. To fix this, I tokenize first. "I'm" gets split into "I" and "'m", in the same way that "There's" became "There" and "'s" in the earlier example. Removing punctuation then reduces "'m" to "m". Further, "I" gets recognized as a stopword (and "m" gets removed when I drop all tokens with two or fewer characters).
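
The difference between the two orders can be reproduced with a small sketch. It makes a few assumptions: the stop word set is a tiny hand-written one rather than NLTK's list, the "tokenize first" tokens are hard-coded to mimic a tokenizer that splits contractions, and it uses "you're" instead of "I'm" so that the length filter in the same function does not hide the effect.

```python
import string

STOPWORDS = {"you", "are", "the"}  # tiny illustrative set, not NLTK's list
PUNCT = str.maketrans('', '', string.punctuation)

def filter_tokens(tokens):
    """Lowercase, strip punctuation, then drop short tokens and stop words."""
    cleaned = [t.lower().translate(PUNCT) for t in tokens]
    return [t for t in cleaned if len(t) > 2 and t not in STOPWORDS]

# Order A: remove punctuation from the raw text first, then split on whitespace
punct_first = filter_tokens("You're lost".translate(PUNCT).split())

# Order B: tokenize first (hard-coded, contraction-splitting token list)
tokenize_first = filter_tokens(["You", "'re", "lost"])

print(punct_first)     # ['youre', 'lost']
print(tokenize_first)  # ['lost']
```

In order A the fused token "youre" matches no stop word and survives, while in order B "you" is caught by the stop word set and the leftover "re" is dropped by the length filter.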