# Predicting Logic's Lyrics with Machine Learning
### Hans Kamin

---

**Logic** has been a remarkable influence on my life since middle school, when I heard his song [“All I Do”](https://www.youtube.com/watch?v=eIGh4Nc1fAM) for the first time. All avid music fans have those select few artists in their library whom they’ll never stop listening to; the connections these musicians make with our emotions & memories become so firmly rooted that we simply can’t help but enjoy each piece of work they put out. Such is absolutely the case in my relationship with Logic’s music, so I instantly knew whose lyrics I’d choose when I was presented with this project. 

More than any other coding assignment I’ve been given, this project was by far the most intriguing and exhilarating to me, so I decided to write about how one might go about implementing the algorithm. I’ll walk through and provide the Python code I wrote and then discuss some of the strengths and weaknesses of this implementation, as well as how it can be improved in the future. And, of course, a *humongous* shoutout to Professor Dennis Sun at Cal Poly SLO for providing excellent solutions and help, and for assigning such an awesome lab through which to explore data science!

Before we begin, though, it would be wise to _**[visit this webpage](http://setosa.io/ev/markov-chains/)**_ for a detailed, visual explanation of Markov Chains and how they work—this is crucial to understanding how to approach this problem.

## Building the Algorithm
The crux of this implementation is a Bigram Markov Chain that represents the English language. More specifically, our chain will be a dictionary object in which each key is a unique tuple consisting of a word and the word that follows it, and each value is a list of the words observed immediately after that pair. Using bigrams rather than single words as keys makes our generated lines more readable, because the next word in our sentence is predicted from the previous two words rather than just the immediately preceding one.
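To make that data structure concrete, here is a minimal sketch of what such a dictionary might look like for a single short line. The toy line and the names `toy_line` and `toy_chain` are made up for illustration; they are not part of the real lyrics or of the code later in this post.

```python
from collections import defaultdict

# Hypothetical, made-up line used only for illustration.
toy_line = "I been on the low I been on my grind"
words = toy_line.split()

toy_chain = defaultdict(list)
# Slide a window of three words across the line: the first two words
# form the key, and the third is recorded as a possible follower.
for first, second, follower in zip(words, words[1:], words[2:]):
    toy_chain[(first, second)].append(follower)

for bigram, followers in toy_chain.items():
    print(bigram, "->", followers)
```

Notice that the pair `("been", "on")` ends up with two different followers, which is where the randomness comes from once we start sampling. Repeated followers are kept on purpose, so a continuation that appears more often is more likely to be chosen.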

The true first step, though, is to gather all of the lyrics we’ll be analyzing. To do this, I scraped links to lyrics for each of Logic’s songs, then went through each link to gather all of the lyrics from them.

In [7]:
import requests
import time
from bs4 import BeautifulSoup

links = []
for pagenum in range(1,3):
    url = "http://www.metrolyrics.com/logic-alpage-%d.html" % pagenum
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # First table on this page contains links to songs by Logic.
    table = soup.find("table")
    for song in table.find_all('a'):
        links.append(song.get("href"))

# Enter each link and scrape all of the lyrics.
# Each element in our lyrics list will pertain to one song.
# Parsing through each link takes a while, expect long runtime.
lyrics = []
for link in links:
    time.sleep(0.1)
    bs = BeautifulSoup(requests.get(link).text, "html.parser")
    paragraphs = bs.find_all('p')
    song_text = ""
    for p in paragraphs:
        if p.get("class") is not None and "verse" in p.get("class"):
            song_text = song_text + p.text
    lyrics.append(song_text)

# Print out the lyrics to the first song.
print(lyrics[0][:227])

# `pickle` is a standard-library module that serializes Python objects to disk so that you can load them in later.
import pickle
# Use a context manager so the file is closed once the dump finishes.
with open("lyrics.pkl", "wb") as f:
    pickle.dump(lyrics, f)

I've been on the low
I been taking my time
I feel like I'm out of my mind
It feel like my life ain't mine
Who can relate?
I've been on the low
I been taking my time
I feel like I'm out of my mind
It feel like my life ain't mine


The end result is a list in which each element is a string containing all of the lyrics to one song.

Now we can finally dig into building our Markov Chain. We write a function that iterates through each word in all of Logic’s lyrics in order to generate the model we discussed before by examining each sequence of two words and creating a list of all of the words that follow each sequence. For more efficient/practical iteration, we use `"<START>"`, `"<END>"`, and `"<N>"` tags to represent a song’s beginning, its end, and its newline characters, respectively.
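As a quick illustration of the tagging step, the sketch below tokenizes a short two-line snippet. The snippet and the variable names are assumptions chosen for the example; the real training function follows right after.

```python
# Hypothetical two-line snippet standing in for one song's lyrics.
snippet = "I been on the low\nI been taking my time"

# Surround the newline tag with spaces so that split() treats it
# as its own token instead of gluing it to a neighboring word.
tokens = snippet.replace("\n", " <N> ").split()
print(tokens)
```

The `"<START>"` and `"<END>"` tags never appear in this token list. In the training function they show up only as part of the initial key and as the final follower of each song.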

In [8]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    Returns:
      A dict that maps a tuple of 2 words ("bigram") to a list of
      words that follow that bigram, representing the Markov
      chain trained on the lyrics.
    """

    # Initialize the beginning of our chain.
    chain = {
        (None, "<START>"): []
    }

    for lyric in lyrics:
        # Replace newline characters with our tag.
        lyric_newlines = lyric.replace('\n', ' <N> ')
        # Create a tuple representing the most recent (current) bigram.
        last_2 = (None, "<START>")
        for word in lyric_newlines.split():
            # Add the word as one that follows the current bigram.
            chain[last_2].append(word)
            # Shift the current bigram to account for the newly added word.
            last_2 = (last_2[1], word)
            if last_2 not in chain:
                chain[last_2] = []
        chain[last_2].append("<END>")

    return chain

# Load the pickled lyrics object that we created earlier.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Train a Markov Chain over all of Logic's lyrics.
chain = train_markov_chain(lyrics)

Once we’ve built and returned the dictionary representing our Markov Chain, we can move on to the final portion of the algorithm: generating predicted lyrics. Beginning from the `(None, "<START>")` key (the first key in our chain), we randomly sample one of the words in the list connected to that key, then shift the key we’re currently examining to account for the word we just sampled. We continue this process until the `"<END>"` tag is finally encountered.

In [9]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    Returns:
      A string representing the randomly generated song.
    """

    # a list for storing the generated words
    words = []
    # generate the first word
    word = random.choice(chain[(None, "<START>")])
    words.append(word)

    # The first word is already sampled, so the current bigram is ("<START>", word).
    last_2 = ("<START>", word)
    while words[-1] != "<END>":
        # Generate the next word.
        word = random.choice(chain[last_2])
        words.append(word)
        # Shift the current bigram to account for the newly added word.
        last_2 = (last_2[1], words[-1])

    # Join the words together into a string with line breaks.
    lyrics = " ".join(words[:-1])
    return "\n".join(lyrics.split("<N>"))

Thus, we can now `print(generate_new_lyrics(chain))` to display our predicted lyrics in the console.

In [11]:
print(generate_new_lyrics(chain))

[Hook] Living life behind the scenes 
 He one of us 
 Had my share of stealin' 
 But it doesn't matter, homie I got robbed 
 Life is moving fast it need to pretend Imma never do it for the show and come back Monday with a gun to bust 
 Police looking for talent, I hope that you can't be playin' that, that's the album 
 That's what's so crazy because I've always-like I'm on this killing spree 
 All the grass is green 
 And now I hate shots 
 That's that road I been lookin' for something 
 Everybody want me to be bumping this shit 
 Yeah I've been, yeah I've been vibing out here tryna gas it up for these insta folk 
 And I really do anything, you can or can't be off right 
 But now I needBlack people: to just fight, fight for ya life 
 Yeah, your girl on a Monday 
 Drop hits, get money 
 All the haters and what the drama bring, 
 It ain't about it 
 But I was like 15 for the music, I'm gone 
 Yes, I know 
 Probably wanna fight like this 
 Stab a motherfucker bleed, yeah theyFuckin with m

It’s imperative to note, however, that because each word is chosen by simple random sampling and generation stops as soon as the `"<END>"` tag is drawn, the amount of output I receive is also random. There were a select few instances in which I received less than one line or even just one word of output, but most of the time the algorithm printed a large number of lines. Nonetheless, after searching through the outputs I received from many runs of the algorithm, I got a handful of pretty good lyrics overall, ranging from raw punchlines to downright hilarious quips.
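One way to work around the very short outputs is to simply retry until a run is long enough. The sketch below is one possible approach, not part of my original solution; `generate_with_min_words` and its parameters are hypothetical names introduced here.

```python
def generate_with_min_words(generate, min_words=20, max_tries=50):
    """Call generate() until it yields at least min_words words.

    `generate` is any zero-argument function that returns a string.
    If no attempt is long enough, the longest attempt seen is returned.
    """
    best = ""
    for _ in range(max_tries):
        candidate = generate()
        if len(candidate.split()) >= min_words:
            return candidate
        # Remember the longest attempt so far as a fallback.
        if len(candidate.split()) > len(best.split()):
            best = candidate
    return best

# Demonstration with a stand-in generator that returns canned outputs
# of increasing length, so the example runs without the real chain.
canned = iter(["Yeah", "I been on", "I been on the low all night"])
print(generate_with_min_words(lambda: next(canned), min_words=5))
```

With the real model, the call would presumably look like `generate_with_min_words(lambda: generate_new_lyrics(chain), min_words=50)`.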

## Analyzing Our Results
Observing many, many outputs from our implementation and those from a unigram implementation allows us to reach some important conclusions:

1. **Our model’s predictions are accurate, but often recycled.** It’s important to note that many of our predicted lines turned out to be nearly identical to lines Logic has actually written, for example half of a line from one verse/song combined with half of a line from another verse/song. This is to be expected, as using bigrams yields less variability in predicted words because predictions are based on the previous two words instead of the previous one, resulting in sequences of three or more words coming from the same Logic lyric. In other words, *using bigrams instead of single words increases readability and similarity to Logic’s style, but decreases creativity.*
2. **Our model is slower and generates less output.** The unigram model runs faster because the dictionary object representing its Markov Chain has far fewer keys. Our model has so many more keys because each key is a tuple of two words, and there are far more distinct word pairs than distinct words. Furthermore, as I mentioned before, there were times when I received little or no output, and generally I received less than I did from the unigram implementation. This can be attributed to the smaller number of possibilities for the next word when we’re basing it on the previous two words.
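To compare the two models side by side, the training function can be generalized so that the number of previous words used as a key becomes a parameter. This is a sketch under my own assumptions: `train_chain`, its `order` parameter, and the toy corpus are hypothetical and were not part of the lab.

```python
def train_chain(lyrics, order):
    """Train a chain whose keys are tuples of the previous `order` tokens.

    order=1 gives the unigram model; order=2 matches the bigram
    chain built by train_markov_chain() above.
    """
    start = (None,) * (order - 1) + ("<START>",)
    chain = {start: []}
    for lyric in lyrics:
        state = start
        for word in lyric.replace("\n", " <N> ").split():
            chain[state].append(word)
            # Drop the oldest token and append the newest one.
            state = state[1:] + (word,)
            chain.setdefault(state, [])
        chain[state].append("<END>")
    return chain

# Made-up two-"song" corpus, used only to compare model sizes.
toy_lyrics = ["I been on the low\nI been on my grind", "I been taking my time"]
unigram = train_chain(toy_lyrics, order=1)
bigram = train_chain(toy_lyrics, order=2)
print(len(unigram), len(bigram))
```

Even on this tiny corpus the bigram chain ends up with more keys than the unigram chain, and that gap should only widen as the vocabulary grows.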

So where do we go from here? We’ve highlighted the strengths and weaknesses of our implementation; how do we actually mitigate those weaknesses and make our model even better? The key to discovering a superior design is to first discern the central Markov Assumption limiting the model that we built.

## Finding A Better Way
Modeling a situation with a Markov Chain requires assuming that the situation satisfies one key statement: a prediction for the next state of the situation only depends on the current state, not the rest of the situation’s history. For example, using Markov Chains to predict tomorrow’s weather involves the assumption that weather from the past two weeks or more has no effect on tomorrow’s conditions—something I think we can all agree sounds pretty far-fetched. Thus, even though using bigrams helped us decrease the magnitude of this assumption in our model, its impact was still prevalent and weakened our results. We need to find an alternative to our model that can at the very least make fewer assumptions.
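Written out in symbols, the assumption our model makes looks like the following. The notation is my own for this post: $w_t$ stands for the $t$-th word of a song.

```latex
P(w_t \mid w_1, w_2, \dots, w_{t-1}) = P(w_t \mid w_{t-2}, w_{t-1})
```

The unigram version is stricter still, since it conditions on $w_{t-1}$ alone.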

A recurrent neural network is one example of a replacement we can use. While I won’t go into much detail here about RNNs, mostly because I’m still only scratching the surface with them myself, I can provide some very brief notes on them. Two of the key characteristics of RNNs are that they don’t assume that all inputs are independent of each other and that they’re capable of keeping a history of what they’ve processed, both of which are necessary for improving our model. For more information on RNNs, how they work, and how to implement them, check out the [Wikipedia page](https://en.wikipedia.org/wiki/Recurrent_neural_network) as well as [this tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/); I’ll be learning from both to eventually create better predictions.

---

If you’ve made it this far, thanks for reading about and (hopefully) understanding my newfound interest in machine learning! Data science as a whole already has so many fascinating and creative applications. I look forward to exploring the many nuances and intricacies in further detail as I work on more projects and continue to improve. After all, as Logic once wrote (and Paul Brandt before him), how can the sky be the limit when there are footprints on the moon?