# 8A. Song Lyrics Generator

In this lab, you will scrape a website to get lyrics of songs by your favorite artist. Then, you will train a model called a Markov chain on these lyrics so that you can generate a song in the style of your favorite artist.

# Question 1. Scraping Song Lyrics

Find a web site that has lyrics for several songs by your favorite artist. Scrape the lyrics into a Python list called `lyrics`, where each element of the list represents the lyrics of one song.

**Tips:**
- Find a web page that has links to all of the songs, like [this one](http://www.azlyrics.com/n/nirvana.html). [_Note:_ It appears that `azlyrics.com` blocks web scraping, so you'll have to find a different lyrics web site.] Then, you can scrape this page, extract the hyperlinks, and issue new HTTP requests to each hyperlink to get each song. 
- Use `time.sleep()` to stagger your HTTP requests so that you do not get banned by the website for making too many requests.
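
The staggering tip above could be sketched roughly as follows. This is only an illustration: `fetch_politely`, its delay parameters, and the injectable `fetch` and `sleep` arguments are hypothetical names chosen here so the pacing logic can be tried without touching a real website.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is any callable that takes a URL and returns the page body,
    for example a small wrapper around requests.get (an assumption here).
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages


# Dry run with a fake fetcher and a recorder in place of time.sleep.
recorded_pauses = []
pages = fetch_politely(["u1", "u2", "u3"], str.upper, sleep=recorded_pauses.append)
print(pages)
print(len(recorded_pauses))
```

Randomizing the delay is a design choice rather than a requirement; a fixed `time.sleep(1)` between requests would follow the tip just as well.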

#### BROCKHAMPTON
https://genius.com/artists/Brockhampton

In [1]:
import requests
import time
import random

from bs4 import BeautifulSoup

In [2]:
resp = requests.get("https://genius.com/artists/Brockhampton")

In [3]:
soup = BeautifulSoup(resp.content,"html.parser")
song_cards = soup.find_all("div", {"class":"mini_card_grid-song"})

In [4]:
len(song_cards)

10

In [5]:
song_cards[0]

<div class="mini_card_grid-song">
<div>
<a class="mini_card" href="https://genius.com/Brockhampton-bleach-lyrics">
<div class="mini_card-thumbnail clipped_background_image--background_fill clipped_background_image" style="background-image: url('https://images.genius.com/66074c4f1e656161fe74af9b4ca9f27e.300x300x1.png');"></div>
<div class="mini_card-info">
<div class="mini_card-title_and_subtitle">
<div class="mini_card-title">BLEACH</div>
<div class="mini_card-subtitle">
          BROCKHAMPTON
        </div>
</div>
</div>
</a>
</div>
</div>

In [6]:
link = song_cards[0].find("a",href=True)
link["href"]

'https://genius.com/Brockhampton-bleach-lyrics'

In [7]:
# Collect the song URL from every card (no hard-coded count).
links = [card.find("a", href=True)["href"] for card in song_cards]

In [8]:
links

['https://genius.com/Brockhampton-bleach-lyrics',
 'https://genius.com/Brockhampton-sweet-lyrics',
 'https://genius.com/Brockhampton-gold-lyrics',
 'https://genius.com/Brockhampton-star-lyrics',
 'https://genius.com/Brockhampton-junky-lyrics',
 'https://genius.com/Brockhampton-zipper-lyrics',
 'https://genius.com/Brockhampton-gummy-lyrics',
 'https://genius.com/Brockhampton-boogie-lyrics',
 'https://genius.com/Brockhampton-rental-lyrics',
 'https://genius.com/Brockhampton-face-lyrics']

In [9]:
# test getting lyrics
resp =  requests.get('https://genius.com/Brockhampton-bleach-lyrics')
soup = BeautifulSoup(resp.content, "html.parser")
big_lyric_block = soup.find("div", class_="lyrics")
lyric_block = big_lyric_block.find("p").text
lyric_block

"[Chorus: Ryan Beatty]\nWho got the feeling?\nTell me why I cry when I feel it\nTell me why, tell me why\nWho got the feeling?\nTell me why I cry when I feel it\nTell me why, tell me why\n(Why?)\n\n[Verse 1: Matt Champion]\nPhone ringing, never outgoing, homebody\nNever outgoing, put my doubts on when these walls up\nTearing at the black tie, finish adding notches to my belt loop\nThey say help you, I can't help you\nWhy I can't speak out? Is wideout, wideout\nKeep it deep inside my mind, it's off-kilter, off-kilter\nI turn memory to fantasy, for that better pleasure, fuck\nTime machine gonna make it better, maybe better for ya\nI can't make this up, I can't take it back\nFeel like a monster, feel like a deadhead zombie\nFeelings you don't want me, I ain't giving up, you should set it off\nTell me “Time's up”, let the water run, let my body run\n\n[Chorus: Ryan Beatty]\nWho got the feeling?\nTell me why I cry when I feel it\nTell me why, tell me why\nWho got the feeling?\nTell me why I

In [10]:
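lyrics = []

for link in links:
    resp = requests.get(link)
    soup = BeautifulSoup(resp.content, "html.parser")
    # The lyrics are in the first <p> inside <div class="lyrics">.
    big_lyric_block = soup.find("div", class_="lyrics")
    lyric_block = big_lyric_block.find("p")
    lyrics.append(lyric_block.text)
    # Stagger requests so the site does not ban us for making too many.
    time.sleep(random.uniform(1, 3))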

In [11]:
# Print out the lyrics to the first song.
#print(lyrics[0])

`pickle` is a Python library that serializes Python objects to disk so that you can load them in later.

In [12]:
import pickle

# The with block closes the file once the lyrics have been written.
with open("lyrics.pkl", "wb") as f:
    pickle.dump(lyrics, f)
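
As a quick sanity check of the save-and-reload idea, here is a minimal round-trip sketch. The list `songs` and the temporary file name are placeholders invented for this example, not the scraped data.

```python
import os
import pickle
import tempfile

songs = ["first song\nsecond line", "another song"]  # placeholder data

# Write to a throwaway directory so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), "lyrics_demo.pkl")

# Serialize the list to disk; the with block closes the file afterwards.
with open(path, "wb") as f:
    pickle.dump(songs, f)

# Load it back, as a later notebook session would.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == songs)
```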

# Question 2. Unigram Markov Chain Model

You will build a Markov chain for the artist whose lyrics you scraped in Lab A. Your model will process the lyrics and store the word transitions for that artist. The transitions will be stored in a dict called `chain`, which maps each word to a list of "next" words.

For example, if your song was ["The Joker" by the Steve Miller Band](https://www.youtube.com/watch?v=FgDU17xqNXo), `chain` might look as follows:

```
chain = {
    "some": ["people", "call", "people"],
    "call": ["me", "me", "me"],
    "the": ["space", "gangster", "pompitous", ...],
    "me": ["the", "the", "Maurice"],
    ...
}
```

Besides words, you should include a few additional states in your Markov chain. You should have `"<START>"` and `"<END>"` states so that we can keep track of how songs are likely to begin and end. You should also include a state called `"<N>"` to denote line breaks so that you can keep track of where lines begin and end. It is up to you whether you want to normalize case and strip punctuation.

So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    "<START>": ["Some", ...],
    "Some": ["people", ...],
    "people": ["call", ...],
    "call": ["me", ...],
    "me": ["the", ...],
    "the": ["space", ...],
    "space": ["cowboy,", ...],
    "cowboy,": ["yeah", ...],
    "yeah": ["<N>", ...],
    "<N>": ["Some", ..., "Come"],
    ...,
    "Come": ["on", ...],
    "on": ["baby", ...],
    "baby": ["and", ...],
    "and": ["I'll", ...],
    "I'll": ["show", ...],
    "show": ["you", ...],
    "you": ["a", ...],
    "a": ["good", ...],
    "good": ["time", ...],
    "time": ["<END>", ...],
}
```

Your chain will be trained not just on one song, but on all songs by your artist.
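
One possible way to build such a chain is sketched below on two made-up toy songs. The helper names `tokenize` and `build_chain` are hypothetical, and this version keeps the original casing and punctuation, which is only one of the choices the text leaves open.

```python
from collections import defaultdict


def tokenize(lyric):
    """Split one song into tokens, adding <START>, <END>, and <N> tags."""
    tokens = ["<START>"]
    for i, line in enumerate(lyric.split("\n")):
        if i > 0:
            tokens.append("<N>")  # one tag per line break
        tokens.extend(line.split())
    tokens.append("<END>")
    return tokens


def build_chain(songs):
    """Map each token to the list of tokens that follow it, across all songs."""
    chain = defaultdict(list)
    for song in songs:
        tokens = tokenize(song)
        # zip pairs every token with its successor.
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return dict(chain)


toy_songs = ["Some people call me\nSome call me", "Come on baby"]
toy_chain = build_chain(toy_songs)
print(toy_chain["<START>"])  # ['Some', 'Come']
print(toy_chain["<N>"])      # ['Some']
print(toy_chain["me"])       # ['<N>', '<END>']
```

Keeping duplicates in each list is deliberate: a word that follows another word often appears often in the list, so uniform sampling from the list later reproduces the observed frequencies.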

In [13]:
#testing on one lyric?

lyric = lyrics[0]
lyric_list = lyric.replace("\n"," <n> ").lower().split()
lyric_list.insert(0,"<START>")
lyric_list.append("<END>")
lyric_list[:] = [x if x != '<n>' else '<N>' for x in lyric_list]

index = 1
chain = {}
for word in lyric_list[index:]: 
    key = lyric_list[index - 1]
    if key in chain:
        chain[key].append(word)
    else:
        chain[key] = [word]
    index += 1
    
#chain

In [14]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a single word ("unigram") to a list of
      words that follow that word, representing the Markov
      chain trained on the lyrics.
    """
    chain = {"<START>": []}
    for lyric in lyrics:
        
        lyric_list = lyric.replace("\n"," <n> ").lower().split()
        lyric_list.insert(0,"<START>")
        lyric_list.append("<END>")
        lyric_list[:] = [x if x != '<n>' else '<N>' for x in lyric_list]
        
        index = 1
        for word in lyric_list[index:]: 
            key = lyric_list[index - 1]
            if key in chain:
                chain[key].append(word)
            else:
                chain[key] = [word]
            index += 1
        
    return chain

In [15]:
# Load the pickled lyrics object that you created in Lab A.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain["<START>"])

# What words tend to begin a line (i.e., what words follow the line break tag?)
print(chain["<N>"][:20])

['[chorus:', '[verse', '[chorus:', '[verse', '[video', '[verse', '[verse', '[verse', '[verse', '[chorus:']
['who', 'tell', 'tell', 'who', 'tell', 'tell', '(why?)', '<N>', '[verse', 'phone', 'never', 'tearing', 'they', 'why', 'keep', 'i', 'time', 'i', 'feel', 'feelings']


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `"<START>"` state and randomly sample a word from the list of words that follow `"<START>"`. Then, at each step, we'll randomly sample the next word from the list of words that followed each current word. We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

You may find the `random.choice()` function helpful for this question.
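
To see why `random.choice()` on a list with duplicates behaves like sampling from transition probabilities, consider this small sketch. The dict `toy_chain` reuses two entries from the "The Joker" example above, and `transition_probabilities` is a hypothetical helper written for this illustration.

```python
import random
from collections import Counter

toy_chain = {
    "call": ["me", "me", "me"],
    "me": ["the", "the", "Maurice"],
}


def transition_probabilities(chain, state):
    """Turn the list of successors of `state` into relative frequencies."""
    counts = Counter(chain[state])
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


probs = transition_probabilities(toy_chain, "me")
print(probs)

# Uniform draws from the list favour "the" roughly two to one.
rng = random.Random(0)  # seeded only to make the demo repeatable
draws = Counter(rng.choice(toy_chain["me"]) for _ in range(3000))
print(draws)
```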

In [16]:
def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    # generate the first word
    word = random.choice(chain["<START>"])
    #words.append(word)
    
    while word != "<END>":
        words.append(word)
        word = random.choice(chain[word])
    
    # join the words together into a string with line breaks
    # (the loop never appends "<END>", so every word in the list is kept)
    lyrics = " ".join(words)
    return "\n".join(lyrics.split("<N>"))

In [17]:
print(generate_new_lyrics(chain))

[video outro: dom mclennon] 
 [chorus: kevin abstract] 
 kobe bryant with respect 
 ice on my name, brand new crib 
 i was you taste the world so i've been dead 
 brad pitt, start 
 head, head, head, head, head, head, head, head, head, head, head, head, head 
 [verse 3: merlyn wood] 
 let it go obama when i feel all i need a friend 
 time 
 not enough feelin' saturated 
 
 is fixed 
 come fuck the windows tinted 
 what they put me 
 twistin' me by my passport 
 
 this is fixed 
 we still be hurtin' if you make me shit 
 fly as hell, i forgot my willy 
 they don't mean shit 
 nic cage with the door 
 whatchu mean? 
 never outgoing, put my mama in high school, man, that syrup ‘til i hit, under control, i'm the fire, baby, i'm drowning 
 [verse 3: dom mclennon] 
 ain't taught 'em in the way" 
 [chorus: kevin abstract] 
 my wife-a 
 [chorus: ryan beatty] 
 floating like 
 black on when i gotcha 
 ridin' on my uncles 
 what are 
 i'm seeing through the respect? is a pop 
 i'm smashing on wh

# Question 3. Bigram Markov Chain Model

Now you'll build a more complex Markov chain that uses the last _two_ words (or bigram) to predict the next word. Your dict `chain` should now map a _tuple_ of two words to a list of words that appear after it.

As before, you should also include tags that indicate the beginning and end of a song, as well as line breaks. That is, a tuple might contain tags like `"<START>"`, `"<END>"`, and `"<N>"`, in addition to regular words. So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    (None, "<START>"): ["Some", ...],
    ("<START>", "Some"): ["people", ...],
    ("Some", "people"): ["call", ...],
    ("people", "call"): ["me", ...],
    ("call", "me"): ["the", ...],
    ("me", "the"): ["space", ...],
    ("the", "space"): ["cowboy,", ...],
    ("space", "cowboy,"): ["yeah", ...],
    ("cowboy,", "yeah"): ["<N>", ...],
    ("yeah", "<N>"): ["Some", ...],
    ("time", "<N>"): ["Come"],
    ...,
    ("<N>", "Come"): ["on", ...],
    ("Come", "on"): ["baby", ...],
    ("on", "baby"): ["and", ...],
    ("baby", "and"): ["I'll", ...],
    ("and", "I'll"): ["show", ...],
    ("I'll", "show"): ["you", ...],
    ("show", "you"): ["a", ...],
    ("you", "a"): ["good", ...],
    ("a", "good"): ["time", ...],
    ("good", "time"): ["<END>", ...],
}
```
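
The tuple-keyed structure above generalizes to any context length. The sketch below is one hedged way to write that; `train_ngram_chain` and its `order` parameter are hypothetical names, and the token list is a toy example rather than real scraped lyrics.

```python
def train_ngram_chain(token_lists, order=2):
    """Map each tuple of `order` consecutive tokens to the tokens that follow it."""
    chain = {}
    for tokens in token_lists:
        # Pad with None so the first state is (None, ..., "<START>").
        padded = [None] * (order - 1) + list(tokens)
        for i in range(len(padded) - order):
            state = tuple(padded[i:i + order])
            chain.setdefault(state, []).append(padded[i + order])
    return chain


tokens = ["<START>", "Some", "people", "call", "me", "<N>",
          "Some", "call", "me", "<END>"]
bigram_chain = train_ngram_chain([tokens], order=2)
print(bigram_chain[(None, "<START>")])  # ['Some']
print(bigram_chain[("call", "me")])     # ['<N>', '<END>']
```

With `order=1` the same function yields a unigram chain whose keys are one-element tuples such as `("Some",)`.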

In [18]:
#testing bigrams

lyric = lyrics[0]
lyric_list = lyric.replace("\n"," <n> ").lower().split()
lyric_list.insert(0,"<START>")
lyric_list.append("<END>")
lyric_list[:] = [x if x != '<n>' else '<N>' for x in lyric_list]

chain = {(None, "<START>"): []}
chain[(None, "<START>")].append(lyric_list[1])

index = 2
for word in lyric_list[index:]: 
    key = (lyric_list[index-2],lyric_list[index - 1])
    if key in chain:
        chain[key].append(word)
    else:
        chain[key] = [word]
    index += 1
    
#chain

In [19]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a tuple of 2 words ("bigram") to a list of
      words that follow that bigram, representing the Markov
      chain trained on the lyrics.
    """
    chain = {(None, "<START>"): []}
    for lyric in lyrics:

        lyric_list = lyric.replace("\n"," <n> ").lower().split()
        lyric_list.insert(0,"<START>")
        lyric_list.append("<END>")
        lyric_list[:] = [x if x != '<n>' else '<N>' for x in lyric_list]
        
        chain[(None, "<START>")].append(lyric_list[1])

        index = 2
        for word in lyric_list[index:]: 
            key = (lyric_list[index-2],lyric_list[index - 1])
            if key in chain:
                chain[key].append(word)
            else:
                chain[key] = [word]
            index += 1
            
    return chain

In [20]:
# Load the pickled lyrics object that you created in Lab A.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain[(None, "<START>")])

['[chorus:', '[verse', '[chorus:', '[verse', '[video', '[verse', '[verse', '[verse', '[verse', '[chorus:']


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `(None, "<START>")` state and randomly sample a word from the list of words that follow this bigram. Then, at each step, we'll randomly sample the next word from the list of words that followed the current bigram (i.e., the last two words). We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!
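
The sampling procedure just described might look like the sketch below. It is an illustration under assumptions: the function name `generate`, the `max_words` safety cap, and the hand-written `toy_chain` are not part of the lab, and a seeded `random.Random` is passed in only to make runs repeatable.

```python
import random


def generate(chain, rng, max_words=500):
    """Walk a bigram chain from (None, "<START>") until "<END>" is sampled."""
    state = (None, "<START>")
    words = []
    while len(words) < max_words:
        # .get() guards against a state that never appeared in training.
        word = rng.choice(chain.get(state, ["<END>"]))
        if word == "<END>":
            break
        words.append(word)
        # Slide the two-word window forward by one word.
        state = (state[1], word)
    # Turn <N> tags back into line breaks and trim stray spaces.
    lines = " ".join(words).split("<N>")
    return "\n".join(line.strip() for line in lines)


toy_chain = {
    (None, "<START>"): ["come"],
    ("<START>", "come"): ["on"],
    ("come", "on"): ["baby"],
    ("on", "baby"): ["<N>"],
    ("baby", "<N>"): ["good"],
    ("<N>", "good"): ["time"],
    ("good", "time"): ["<END>"],
}
print(generate(toy_chain, random.Random(0)))
```

The length cap is a precaution: a chain trained on very repetitive lyrics can loop for a long time before it happens to sample `"<END>"`.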

In [21]:
#practice 

words = []
# generate the first word
word = random.choice(chain[(None, "<START>")])
words.append(word)
bigram = ("<START>",words[0])
word = random.choice(chain[bigram])

while word != "<END>":
    words.append(word)
    bigram = (words[-2],words[-1])
    word = random.choice(chain[bigram])
    
#words

In [22]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain[(None, "<START>")]))
    bigram = ("<START>",words[0])
    word = random.choice(chain[bigram])

    while word != "<END>":
        words.append(word)
        bigram = (words[-2],words[-1])
        word = random.choice(chain[bigram])
    
    # join the words together into a string with line breaks
    # (the loop never appends "<END>", so every word in the list is kept)
    lyrics = " ".join(words)
    return "\n".join(lyrics.split("<N>"))

In [23]:
print(generate_new_lyrics(chain))

[verse 1: matt champion] 
 what's your motive with me baby? 
 'cause i know it's not what you're waiting for, shit 
 
 [verse 2: ameer vann & merlyn wood] 
 damn, time-travelin', honda-swervin', that's so merlyn 
 damn, time-travelin', honda-swervin', book learnin' 
 that's so merlyn, that's so merlyn, that's so merlyn 
 that's what they sayin' in private, speaking from that entitlement 
 we like wu-tang but i feel all hefty 
 i want 
 i'm on a friday 
 michael cera on a mission, every time that i remember 
 i need someone who can handle it 
 tell me what you're waiting for, hun 
 
 [bridge: kevin abstract] 
 what it's like to speak like a deadhead zombie 
 feelings you don't want nobody, but you-u-u, mm-mm 
 i need a friend (i need a honey butter, vodka in a ufo, i haven't started yet 
 still gotta figure out exactly where to park it at 
 but when i walk out the way" 
 i need an intervention, i need is a chain 
 (i said i keep a gold chain on my boys and my wrist is fixed 
 i just wan

# Analysis

Compare the quality of the lyrics generated by the unigram model (in Lab B) and the bigram model (in Lab C). Which model seems to generate more reasonable lyrics? Can you explain why? What do you see as the advantages and disadvantages of each model?


The lyrics generated by the bigram model seem more like a BROCKHAMPTON song than the lyrics generated by the unigram model. The grammar is certainly better with the bigram model. I suppose this is because choosing a word that follows a pair of words is more specific: fewer words can follow a given pair of words than a given single word.

The unigram model creates a smaller chain than the bigram model, but the results of the bigram model are more meaningful and reasonable because its chain is larger yet more specific. BROCKHAMPTON's songs have more repetition and random-sounding lyrics than many other songs, maybe because that is how hip hop goes these days. The unigram model prints too many repeated words. For example, one generated line was "head, head, head, head, head", because one line in a song repeats "head" several times in a row, so the "head" key holds multiple "head"s. The bigram model seems to handle this better because its chain also records the words before and after the repeated "head" line, which helps keep the algorithm from getting stuck in "head" and similar lyric patterns.
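
The "larger yet more specific" claim can be checked roughly by counting, for each state, how many distinct words can follow it. The sketch below does this on a short made-up token sequence (not the scraped lyrics); `average_branching` is a hypothetical helper, and results on real lyrics may differ.

```python
def average_branching(tokens, order):
    """Average number of distinct next tokens per state of `order` tokens."""
    successors = {}
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        successors.setdefault(state, set()).add(tokens[i + order])
    return sum(len(s) for s in successors.values()) / len(successors)


tokens = "i need a friend <N> i need a chain <N> i want a friend".split()

unigram_avg = average_branching(tokens, 1)  # 7 states in this toy text
bigram_avg = average_branching(tokens, 2)   # 9 states in this toy text
print(round(unigram_avg, 3))
print(round(bigram_avg, 3))
```

On this toy text the bigram chain has more states but fewer choices per state on average, which matches the intuition in the answer above.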