# 8A. Song Lyrics Generator

In this lab, you will scrape a website to get lyrics of songs by your favorite artist. Then, you will train a model called a Markov chain on these lyrics so that you can generate a song in the style of your favorite artist.

# Question 1. Scraping Song Lyrics

Find a web site that has lyrics for several songs by your favorite artist. Scrape the lyrics into a Python list called `lyrics`, where each element of the list represents the lyrics of one song.

**Tips:**
- Find a web page that has links to all of the songs, like [this one](http://www.azlyrics.com/n/nirvana.html). [_Note:_ It appears that `azlyrics.com` blocks web scraping, so you'll have to find a different lyrics web site.] Then, you can scrape this page, extract the hyperlinks, and issue new HTTP requests to each hyperlink to get each song. 
- Use `time.sleep()` to stagger your HTTP requests so that you do not get banned by the website for making too many requests.
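The two tips above can be sketched as follows. This is a minimal illustration rather than a recipe for any particular site: it uses only the standard library so that it runs without a network connection, and the names `LinkCollector`, `extract_song_links`, `fetch_all`, and the `"/lyrics/"` path marker are all hypothetical. With `BeautifulSoup`, the link extraction would look much the same.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_song_links(index_html, base_url, marker="/lyrics/"):
    """Return absolute URLs for links whose href contains `marker`.

    `marker` is an assumed path fragment; adjust it for your site.
    """
    parser = LinkCollector()
    parser.feed(index_html)
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        if marker in href and url not in links:  # skip duplicates
            links.append(url)
    return links


def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # stagger the requests
        pages.append(fetch(url))
    return pages
```

In a real scraper, `fetch` could be something like `lambda url: requests.get(url).text`, and `index_html` would be the text of the artist's index page.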

In [1]:
import re
import time
import requests
from bs4 import BeautifulSoup

def depaginate(url):
    # Example url: https://genius.com/api/artists/330928/songs?sort=title
    resps = []
    next_page = 1
    while next_page is not None:
        resp = requests.get(url + "&page=" + str(next_page)).json()
        resps.append(resp)
        # The loop relies on "next_page" being None on the last page,
        # so no request is ever made for a page that does not exist.
        next_page = resp["response"]["next_page"]
        time.sleep(0.25)  # pause between requests
    return resps

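def parseLyric(string):
    # Remove bracketed tags such as [Chorus].
    lyric = re.sub(r"\[([A-Za-z0-9_ -\?]+)\]", "", string)
    lyric = lyric.strip()
    # Mark line breaks with <N>, then drop parentheses and double quotes.
    return re.sub(r'[()"]', "", lyric.replace("\n", " <N> "))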

def getLyrics(songs):
    lyrics = []
    for song in songs:
        html = requests.get("https://genius.com" + song["path"])
        soup = BeautifulSoup(html.content, "html.parser")
        lyrics.append(parseLyric(soup.find("div", class_="lyrics").p.text))
    return lyrics

In [2]:
lyrics = []
for page in depaginate(
    "https://genius.com/api/artists/330928/songs?sort=title"):
    lyrics += getLyrics(page["response"]["songs"])

In [3]:
# Print out the lyrics to the first song.
print(lyrics[0])

Just forget it <N> I can make it mind over matter <N> I can make you feel something better <N> I don't wanna be apart <N> And I said it <N> Everything in life has an ending <N> Think of all the days we were spending <N> Hoping for a way to start <N>  <N>  <N> I came to leave it right <N> I hope to stay the night <N> You back away but it's already love <N>  <N>  <N> I know you think that I'm crazy <N> Cause I can't stop calling you baby <N> And I know that you'll never break me <N> Cause it's already love <N> Cause it's already love <N>  <N>  <N> I'm a soldier <N> Fight it, but I know that I need it <N> Try to look away but I see ya <N> Yeah, your body feels like home <N> Is it over? <N> We don't have to wait on forever <N> We can go a long way together <N> I just wanna let you know <N>  <N>  <N> I came to leave it right <N> I hope to stay the night <N> You back away but it's already love <N>  <N>  <N> I know you think that I'm crazy <N> Cause I can't stop calling you baby <N> And I kno

In [4]:
lyrics[0].split()[0:8]

['Just', 'forget', 'it', '<N>', 'I', 'can', 'make', 'it']

`pickle` is a Python library that serializes Python objects to disk so that you can load them in later.

In [5]:
import pickle
# Use a with-block so the file is closed (and flushed) after writing.
with open("lyrics.pkl", "wb") as f:
    pickle.dump(lyrics, f)
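To confirm that a saved list of lyrics can be loaded back unchanged, a quick round-trip check can help. The sketch below writes to a temporary directory so that it does not touch `lyrics.pkl`; the helper names `save_lyrics` and `load_lyrics` and the example data are made up for illustration.

```python
import os
import pickle
import tempfile


def save_lyrics(lyrics, path):
    """Serialize a list of lyric strings to `path`."""
    with open(path, "wb") as f:
        pickle.dump(lyrics, f)


def load_lyrics(path):
    """Load a list of lyric strings previously saved with save_lyrics()."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up example data in the same shape as `lyrics`.
example = ["line one <N> line two", "another song"]
path = os.path.join(tempfile.mkdtemp(), "lyrics_example.pkl")

save_lyrics(example, path)
restored = load_lyrics(path)
print(restored == example)
```

Only unpickle files that you created yourself, since loading a pickle can execute arbitrary code.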

# Question 2. Unigram Markov Chain Model

You will build a Markov chain for the artist whose lyrics you scraped in Question 1. Your model will process the lyrics and store the word transitions for that artist. The transitions will be stored in a dict called `chain`, which maps each word to a list of "next" words.

For example, if your song was ["The Joker" by the Steve Miller Band](https://www.youtube.com/watch?v=FgDU17xqNXo), `chain` might look as follows:

```
chain = {
    "some": ["people", "call", "people"],
    "call": ["me", "me", "me"],
    "the": ["space", "gangster", "pompitous", ...],
    "me": ["the", "the", "Maurice"],
    ...
}
```

Besides words, you should include a few additional states in your Markov chain. You should have `"<START>"` and `"<END>"` states so that we can keep track of how songs are likely to begin and end. You should also include a state called `"<N>"` to denote line breaks so that you can keep track of where lines begin and end. It is up to you whether you want to normalize case and strip punctuation.

So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    "<START>": ["Some", ...],
    "Some": ["people", ...],
    "people": ["call", ...],
    "call": ["me", ...],
    "me": ["the", ...],
    "the": ["space", ...],
    "space": ["cowboy,", ...],
    "cowboy,": ["yeah", ...],
    "yeah": ["<N>", ...],
    "<N>": ["Some", ..., "Come"],
    ...,
    "Come": ["on", ...],
    "on": ["baby", ...],
    "baby": ["and", ...],
    "and": ["I'll", ...],
    "I'll": ["show", ...],
    "show": ["you", ...],
    "you": ["a", ...],
    "a": ["good", ...],
    "good": ["time", ...],
    "time": ["<END>", ...],
}
```

Your chain will be trained not on just one song, but on all songs by your artist.
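If you decide to normalize case and strip punctuation, one possible approach is to clean each token before adding it to the chain. The sketch below is only one option, and the names `SPECIAL_TOKENS`, `normalize_token`, and `tokenize` are hypothetical. Note that the special tags have to be protected, because `<` and `>` count as punctuation.

```python
import string

# Tags that must pass through unchanged.
SPECIAL_TOKENS = {"<START>", "<END>", "<N>"}


def normalize_token(token):
    """Lowercase a token and strip punctuation from both ends."""
    if token in SPECIAL_TOKENS:
        return token
    # strip() only removes leading/trailing characters, so the
    # apostrophe inside a word like "I'll" is kept.
    return token.lower().strip(string.punctuation)


def tokenize(lyric):
    """Split a lyric on whitespace and normalize each token."""
    tokens = (normalize_token(t) for t in lyric.split())
    return [t for t in tokens if t]  # drop tokens that were only punctuation


print(tokenize("Some people call me the space cowboy, yeah <N> Come on baby and I'll show you"))
```

With this normalization, `"Some"` and `"some"` become the same state, which gives each state more observed transitions at the cost of losing capitalization in the generated lyrics.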

In [15]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a single word ("unigram") to a list of
      words that follow that word, representing the Markov
      chain trained on the lyrics.
    """
    chain = {"<START>": []}
    for lyric in lyrics:
        key = "<START>"
        for word in lyric.split():
            chain[key].append(word)
            key = word
            if key not in chain:
                chain[key] = []
        chain[key].append("<END>")
    return chain

In [16]:
# Load the pickled lyrics object that you created in Question 1.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain["<START>"])

# What words tend to begin a line (i.e., what words follow the line break tag?)
print(chain["<N>"][:20])

['Just', 'Waiting', 'I', 'I', 'Every', 'Baby', 'I', 'Do', 'Are', 'Yeah,', 'You', 'No', 'Easy', 'Who', 'I', "I've", "I've", "I've", "I've", 'Walking', 'You', 'You', 'Treat', 'Any', 'Thought', "It's", 'I', 'I', 'You', 'You', 'Jump', 'I', 'I', 'Meant', "It's", "I've", 'Ooh', 'Ooh']
['I', 'I', 'I', 'And', 'Everything', 'Think', 'Hoping', '<N>', '<N>', 'I', 'I', 'You', '<N>', '<N>', 'I', 'Cause', 'And', 'Cause', 'Cause', '<N>']


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `"<START>"` state and randomly sample a word from the list of words that follow `"<START>"`. Then, at each step, we'll randomly sample the next word from the list of words that followed the current word. We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

You may find the `random.choice()` function helpful for this question.
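Because each list in `chain` keeps duplicates, picking uniformly from the list with `random.choice()` is the same as sampling each next word in proportion to how often it followed the current word. The sketch below makes those implied probabilities explicit for a toy chain based on the "The Joker" example; `transition_probabilities` and `toy_chain` are illustrative names, not part of the lab.

```python
from collections import Counter


def transition_probabilities(chain, state):
    """Return {next_word: probability} implied by the list chain[state]."""
    counts = Counter(chain[state])
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


toy_chain = {
    "some": ["people", "call", "people"],
    "call": ["me", "me", "me"],
}

# "people" appears twice out of three entries, "call" once.
print(transition_probabilities(toy_chain, "some"))
print(transition_probabilities(toy_chain, "call"))
```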

In [17]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    key = "<START>"
    while key != "<END>":
        link = random.choice(chain[key])
        words.append(link)
        key = link
    
    # join the words together into a string with line breaks
    lyrics = " ".join(words[:-1])
    return "\n".join(lyrics.split("<N>"))

In [20]:
print(generate_new_lyrics(chain))

Walking down 
 Say it 
 I'm gonna rush 
 My side I haven't ruined all battling fear the blinding light 
 I've gotta get to be? 
 
 If I stare at the ways that castle's got a way 
 You will work it all on my heart alone 
 And I'm looking at the thing through hell and you how little I tried so hard to make you make a sound? 
 Take all of something great 
 
 Yeah I feel like you're brave enough 
 I'm looking out for sure 
 
 
 
 So don't let someone find you 
 Ya and far past you're sorry 
 You're the table 
 
 
 
 No denying 
 Are you know 
 I'm looking down 
 You're the night my heart alone 
 I'm gonna love it right now I wanna feel like trying 
 I'll show me where the thing you got room for me that 
 But you 
 
 I'm gonna love 
 I'm running my body said, no other sides? 
 See me high 
 I'm never say 
 And I won't make those things I either way 
 What do 
 We'll find that I'm my mind 
 And don't change 
 
 
 You can really hold on 
 
 I'm playing the love 
 Memories are 
 She's an easy 

# Question 3. Bigram Markov Chain Model

Now you'll build a more complex Markov chain that uses the last _two_ words (or bigram) to predict the next word. Now your dict `chain` should map a _tuple_ of words to a list of words that appear after it.

As before, you should also include tags that indicate the beginning and end of a song, as well as line breaks. That is, a tuple might contain tags like `"<START>"`, `"<END>"`, and `"<N>"`, in addition to regular words. So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    (None, "<START>"): ["Some", ...],
    ("<START>", "Some"): ["people", ...],
    ("Some", "people"): ["call", ...],
    ("people", "call"): ["me", ...],
    ("call", "me"): ["the", ...],
    ("me", "the"): ["space", ...],
    ("the", "space"): ["cowboy,", ...],
    ("space", "cowboy,"): ["yeah", ...],
    ("cowboy,", "yeah"): ["<N>", ...],
    ("yeah", "<N>"): ["Some", ...],
    ("time", "<N>"): ["Come"],
    ...,
    ("<N>", "Come"): ["on", ...],
    ("Come", "on"): ["baby", ...],
    ("on", "baby"): ["and", ...],
    ("baby", "and"): ["I'll", ...],
    ("and", "I'll"): ["show", ...],
    ("I'll", "show"): ["you", ...],
    ("show", "you"): ["a", ...],
    ("you", "a"): ["good", ...],
    ("a", "good"): ["time", ...],
    ("good", "time"): ["<END>", ...],
}
```
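One way to see where these bigram keys come from is to pad the token list and slide a window of three tokens over it: the first two tokens form the key and the third is the word that follows. The sketch below handles a single song and is only an illustration; `bigram_transitions` is a hypothetical helper, and the toy lyric is made up.

```python
from collections import defaultdict


def bigram_transitions(lyric):
    """Map each (previous, current) pair in one song to the words that follow it."""
    # Pad so that the first real word follows (None, "<START>")
    # and the last real word is followed by "<END>".
    tokens = [None, "<START>"] + lyric.split() + ["<END>"]
    chain = defaultdict(list)
    for first, second, following in zip(tokens, tokens[1:], tokens[2:]):
        chain[(first, second)].append(following)
    return dict(chain)


toy = bigram_transitions("Some people call me <N> Some call me")
for key, next_words in toy.items():
    print(key, "->", next_words)
```

In this toy song, the pair `("call", "me")` occurs twice, once before a line break and once at the end of the song, so its list contains both `"<N>"` and `"<END>"`.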

In [10]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a tuple of 2 words ("bigram") to a list of
      words that follow that bigram, representing the Markov
      chain trained on the lyrics.
    """
    chain = {(None, "<START>"): []}
    for lyric in lyrics:
        key = (None, "<START>")
        for word in lyric.split():
            chain[key].append(word)
            key = (key[1], word)
            if key not in chain:
                chain[key] = []
        chain[key].append("<END>")
    return chain

In [11]:
# Load the pickled lyrics object that you created in Question 1.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain[(None, "<START>")])

print(chain[("<START>", "Just")])

['Just', 'Waiting', 'I', 'I', 'Every', 'Baby', 'I', 'Do', 'Are', 'Yeah,', 'You', 'No', 'Easy', 'Who', 'I', "I've", "I've", "I've", "I've", 'Walking', 'You', 'You', 'Treat', 'Any', 'Thought', "It's", 'I', 'I', 'You', 'You', 'Jump', 'I', 'I', 'Meant', "It's", "I've", 'Ooh', 'Ooh']
['forget']


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `(None, "<START>")` state and randomly sample a word from the list of words that follow this bigram. Then, at each step, we'll randomly sample the next word from the list of words that followed the current bigram (i.e., the last two words). We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

In [12]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    
    # YOUR CODE HERE
    key = (None, "<START>")
    while key[1] != "<END>":
        link = random.choice(chain[key])
        words.append(link)
        key = (key[1], link)
    
    # join the words together into a string with line breaks
    lyrics = " ".join(words[:-1])
    return "\n".join(lyrics.split("<N>"))

In [14]:
print(generate_new_lyrics(chain))

Jump out of the heights 
 With sting with bite 
 I'm gonna tell you how I feel like, feel like, feel like I'll never change 
 I feel like, feel like trying 
 
 I saw you I am dying 
 To say that it's me? 
 Forget about the past you're holding 
 Move on to 
 Lie on the line 
 If you're brave enough to love it if we try 
 But you got your things to hold her 
 You'd better forget it 
 Will you come here to see it's not my time 
 Love you 
 How will I let you go? 
 You were there for me, and I was doing 
 Turn salt to pearl 
 True no duke no earl 
 But from the other way 
 I know that I'm crazy 
 Cause it's already love 
 
 I won't bruise you either way 
 I've been running my whole life 
 Say it all 
 Say it all 
 I've come to get you I'm all I find 
 Every time I ever fought it 
 Oh ooh oh oh ooh 
 It's almost too much to handle


# Analysis

Compare the quality of the lyrics generated by the unigram model (in Question 2) and the bigram model (in Question 3). Which model seems to generate more reasonable lyrics? Can you explain why? What do you see as the advantages and disadvantages of each model?

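It appears that the bigram Markov chain generates more cohesive song lyrics, both grammatically and lyrically. The grammatical improvement comes from the extra context: each next word is sampled from the words that actually followed the current two-word sequence in the lyrics. For example, "I've been running my" passes through the bigram states ("I've", "been"), ("been", "running"), and ("running", "my"), all of which occur naturally in the song lyrics. The unigram model looks only at the previous word, so it can produce a sequence like "I've given you got", where each adjacent pair of words appeared somewhere in the lyrics but the phrase as a whole does not read naturally. In short, output from the bigram model is more likely to be grammatically correct than output from the unigram model. The trade-off is that the bigram model has fewer choices at each step, so it is more likely to repeat stretches of the original lyrics, while the unigram model is more varied but less coherent.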
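One rough way to put a number on this difference is to count how many distinct next words each state has on average: fewer choices per state means the model follows the training lyrics more closely. The sketch below is an illustration on a made-up lyric, not part of the lab. The `train_chain` helper is hypothetical and, unlike the functions above, uses tuple keys for both orders so that one function covers the unigram and bigram cases.

```python
from collections import defaultdict


def train_chain(lyrics, order):
    """Build a chain keyed on the last `order` tokens (as tuples)."""
    chain = defaultdict(list)
    for lyric in lyrics:
        # Pad with None so the first key always ends in "<START>".
        tokens = [None] * (order - 1) + ["<START>"] + lyric.split() + ["<END>"]
        for i in range(len(tokens) - order):
            chain[tuple(tokens[i:i + order])].append(tokens[i + order])
    return dict(chain)


def average_distinct_successors(chain):
    """Mean number of distinct next words per state."""
    return sum(len(set(nxt)) for nxt in chain.values()) / len(chain)


toy_lyrics = ["I know you <N> I know that <N> you know me"]
unigram_avg = average_distinct_successors(train_chain(toy_lyrics, 1))
bigram_avg = average_distinct_successors(train_chain(toy_lyrics, 2))
print(unigram_avg, bigram_avg)
```

On this toy lyric the bigram chain has fewer distinct successors per state than the unigram chain, which is consistent with its output staying closer to phrases that actually occur in the lyrics.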

# Submission Instructions

Once you are finished, follow these steps:

1. Restart the kernel and re-run this notebook from beginning to end by going to `Kernel > Restart Kernel and Run All Cells`.
2. If this process stops halfway through, that means there was an error. Correct the error and repeat Step 1 until the notebook runs from beginning to end.
3. Double check that there is a number next to each code cell and that these numbers are in order.

Then, submit your lab as follows:

1. Go to `File > Export Notebook As > PDF`.
2. Double check that the entire notebook, from beginning to end, is in this PDF file. (If the notebook is cut off, try first exporting the notebook to HTML and printing to PDF.)
3. Upload the PDF [to PolyLearn](https://polylearn.calpoly.edu/AY_2018-2019/mod/assign/view.php?id=349486).