# Song Lyrics Generator

In this assignment, you will scrape a website to get lyrics of songs by your favorite artist. Then, you will train a model called a Markov chain on these lyrics so that you can generate a song in the style of your favorite artist.

# Question 1. Scraping Song Lyrics

Find a website that has lyrics for several songs by your favorite artist. Scrape the lyrics into a Python list called `lyrics`, where each element of the list represents the lyrics of one song.

**Tips:**
- Find a web page that has links to all of the songs, like [this one](https://www.songlyrics.com/steve-miller-band-lyrics/). Then, you can scrape this page, extract the hyperlinks, and issue new HTTP requests to each hyperlink to get each song.
- If you can't find the artist or songs you want on https://www.songlyrics.com/, you can try some of the [music-related APIs here](https://github.com/public-apis/public-apis#music). If you find a useful site, please share it with everyone on Discord.
- Use `time.sleep()` to stagger your HTTP requests so that you do not get banned by the website for making too many requests.
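
The staggering tip can be sketched as a small helper. This is only a sketch, not part of the assignment: `fetch_all` and its `fetch` and `delay` parameters are hypothetical names, and the demo uses a stand-in function instead of a real HTTP call. In the notebook, `fetch` could be something like `requests.get`.

```python
import time

def fetch_all(urls, fetch, delay=0.5):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls.

    `fetch` is any callable that takes a URL (an assumption for this
    sketch; with the requests library it could be requests.get).
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one.
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function so that no network access is needed.
pages = fetch_all(["a", "b", "c"], fetch=str.upper, delay=0.01)
print(pages)
```

Passing the fetch function in as an argument keeps the delay logic separate from the scraping logic, which makes it easy to try out without hitting the website.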

In [27]:
import requests
import time

from bs4 import BeautifulSoup

response = requests.get("https://www.songlyrics.com/jess-glynne-lyrics/")

soup = BeautifulSoup(response.content, 'html.parser')

In [28]:
song_table = soup.find_all('table', attrs={'class': 'tracklist'})[0]
len(song_table)

3

In [29]:
song_links = []

for song in song_table.find_all("a"):

    # Get the link for the song
    link = song.get('href')

    # Append this data.
    song_links.append(link)

len(song_links)

71

In [30]:
unclean_lyrics = []

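for link in song_links:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Skip pages that have no lyrics element instead of raising an IndexError.
    lyrics_element = soup.find('p', attrs={'id': 'songLyricsDiv'})
    if lyrics_element is not None:
        unclean_lyrics.append(lyrics_element.text)
    # Pause between requests so that the website does not block us.
    time.sleep(0.5)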

In [31]:
import re

lyrics = []
for lyric in unclean_lyrics:
    if "We do not have the lyrics for" not in lyric:
        lyrics.append(lyric)
        


In [43]:
# Print out the lyrics to the first song.
print(lyrics[0])

Standing in a crowded room and I can't see your face
Put your arms around me, tell me everything's OK
In my mind, I'm running round a cold and empty space
Just put your arms around me, tell me everything's OK
Break my bones but you won't see me fall, oh
The rising tide will rise against them all, oh

Darling, hold my hand
Oh, won't you hold my hand?
Cause I don't wanna walk on my own anymore
Won't you understand? Cause I don't wanna walk alone
I'm ready for this, there's no denying
I'm ready for this, you stop me falling
I'm ready for this, I need you all in
I'm ready for this, so darling, hold my hand
Soul is like a melting pot when you're not next to me
Tell me that you've got me and you're never gonna leave
Tryna find a moment where I can find release
Please tell me that you've got me and you're never gonna leave
Break my bones but you won't see me fall, oh
The rising tide will rise against them all, oh

Darling, hold my hand
Oh, won't you hold my hand?
Cause I don't wanna walk on m

`pickle` is a Python library that serializes Python objects to disk so that you can load them in later.
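
As a rough illustration of the round trip, the sketch below saves a list and loads it back. The file name `demo.pkl` and the list contents are made up for the demo, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

songs = ["first song\nsecond line", "another song"]

# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")
    # "wb" / "rb": pickle reads and writes bytes, not text.
    with open(path, "wb") as f:
        pickle.dump(songs, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an equal copy of the original list.
print(restored == songs)
```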

In [33]:
import pickle

# Use a context manager so that the file is closed after writing.
with open("lyrics.pkl", "wb") as f:
    pickle.dump(lyrics, f)

# Question 2. Unigram Markov Chain Model

You will build a Markov chain for the artist whose lyrics you scraped in Question 1. Your model will process the lyrics and store the word transitions for that artist. The transitions will be stored in a dict called `chain`, which maps each word to a list of "next" words.

For example, if your song was ["The Joker" by the Steve Miller Band](https://www.youtube.com/watch?v=dV3AziKTBUo), `chain` might look as follows:

```
chain = {
    "some": ["people", "call", "people"],
    "call": ["me", "me", "me"],
    "the": ["space", "gangster", "pompitous", ...],
    "me": ["the", "the", "Maurice"],
    ...
}
```

Besides words, you should include a few additional states in your Markov chain. You should have `"<START>"` and `"<END>"` states so that we can keep track of how songs are likely to begin and end. You should also include a state called `"<N>"` to denote line breaks so that you can keep track of where lines begin and end. It is up to you whether you want to normalize case and strip punctuation.

So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    "<START>": ["Some", ...],
    "Some": ["people", ...],
    "people": ["call", ...],
    "call": ["me", ...],
    "me": ["the", ...],
    "the": ["space", ...],
    "space": ["cowboy,", ...],
    "cowboy,": ["yeah", ...],
    "yeah": ["<N>", ...],
    "<N>": ["Some", ..., "Come"],
    ...,
    "Come": ["on", ...],
    "on": ["baby", ...],
    "baby": ["and", ...],
    "and": ["I'll", ...],
    "I'll": ["show", ...],
    "show": ["you", ...],
    "you": ["a", ...],
    "a": ["good", ...],
    "good": ["time", ...],
    "time": ["<END>", ...],
}
```
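
The state sequence behind a chain like the one above can be sketched as follows. This is a minimal illustration under assumptions: `song_to_states` is a hypothetical helper, the two-line toy song is made up from words in the example, and case and punctuation are left untouched.

```python
def song_to_states(song):
    """Turn one song (a string) into a list of Markov chain states."""
    lines = [line for line in song.split("\n") if line.strip() != ""]
    states = ["<START>"]
    for i, line in enumerate(lines):
        states.extend(line.split())
        # <N> separates lines; <END> closes the song after the last line.
        states.append("<N>" if i < len(lines) - 1 else "<END>")
    return states

toy_song = "Some people call me\nCome on baby"
states = song_to_states(toy_song)
print(states)

# Each consecutive pair (current, next) is one transition in the chain.
transitions = list(zip(states, states[1:]))
print(transitions[:3])
```

Every pair in `transitions` corresponds to appending the second element to the list stored under the first element in `chain`.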

Your chain will be trained not on just one song, but on all songs by your artist.

In [34]:
import re

def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.

    Returns:
      A dict that maps a single word ("unigram") to a list of
      words that follow that word, representing the Markov
      chain trained on the lyrics.
    """
    chain = {"<START>": [], "<N>": []}
    for lyric in lyrics:
        # YOUR CODE HERE
        # split the song into non-empty lines
        lines = [line for line in lyric.split("\n") if line != ""]
        if not lines:
            continue
        # build one state sequence per song:
        # <START>, words, <N> between lines, <END> after the last line
        # (this also gives a one-line song its <END> state)
        tokens = ["<START>"]
        for l, line in enumerate(lines):
            words = line.split(" ")
            # remove all punctuation
            words = [re.sub(r'[^\w\s]', '', word) for word in words]
            # remove \r characters
            words = [word.replace("\r", "") for word in words]
            tokens += words
            tokens.append("<N>" if l < len(lines) - 1 else "<END>")
        # record every (current state -> next state) transition
        for i in range(len(tokens) - 1):
            chain.setdefault(tokens[i], []).append(tokens[i + 1])

    return chain

In [35]:
# Load the pickled lyrics object that you created in Question 1.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain["<START>"])

['Standing', 'Standing', 'Theres', 'Standing', 'Theres', 'Téléchargez', 'From', 'Wrapped', 'I', 'Téléchargez', 'Téléchargez', 'Finally', 'Finally', 'Finally', 'Finally', 'Finally', 'Finally', 'When', 'When', 'Standing', 'Standing', 'Standing', 'Wrapped', 'in', 'I', 'I', 'I', 'Wrapped', 'Wrapped', 'Téléchargez', 'Finally', '', 'feat', 'with', 'feat', 'Verse', 'Verse', 'Standing', 'Standing', 'Standing', 'Standing', 'Sometimes', 'Theres', 'Going', 'You', 'Smoking', 'Time', 'Thinking', 'Wrapped', 'I', 'I', 'Birds', 'In', 'Dont', 'Theres', 'with']


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `"<START>"` state and randomly sample a word from the list of words that follow `"<START>"`. Then, at each step, we'll randomly sample the next word from the list of words that followed the current word. We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

You may find the `random.choice()` function helpful for this question.
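
To see why storing repeated words in each list matters, here is a small sketch. It reuses the `"some"` entry from the example chain in this question; the variable names and the fixed seed are only for the demo. Because `random.choice()` picks uniformly from the list, a word that appears twice is drawn about twice as often.

```python
import random
from collections import Counter

# Toy successor list, copied from the "some" entry in the example chain.
next_words = ["people", "call", "people"]

# The repeats in the list encode the transition probabilities.
counts = Counter(next_words)
probabilities = {word: n / len(next_words) for word, n in counts.items()}
print(probabilities)

# random.choice picks uniformly from the list, so "people" is drawn
# about twice as often as "call".
random.seed(0)  # fixed seed only to make the demo repeatable
samples = Counter(random.choice(next_words) for _ in range(10_000))
print(samples["people"] > samples["call"])
```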

In [36]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()

    Returns:
      A string representing the randomly generated song.
    """

    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain["<START>"]))

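    # YOUR CODE HERE
    # Keep sampling the next word until the <END> state is reached.
    # A word with no recorded successors also ends the song.
    while words[-1] != "<END>":
        choices = chain.get(words[-1], ["<END>"])
        words.append(random.choice(choices))

    # join the words (without <END>) into a string, turning <N> into line breaks
    song = " ".join(words[:-1])
    return "\n".join(line.strip() for line in song.split("<N>"))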

In [46]:
print(generate_new_lyrics(chain))

Birds fly we turned finally free
Patience lost I began to lose me
My advice would be take a step

PreChorus
I wasnt scared I fought this on my feet
And I aint playing with you
Day one I said Id go for me
One box ticked got a little love to share
Yeah Im gonna Im gonna come through
Youll never be alone Ill be there
Oh I swear I got enough love for two ooh ooh ooh
Youll never be alone Ill be there for you my love
To dreams that never will come true
Am I strong enough to see my life
Through someone elses eyes
Its not an easy road
But now Im caught up in a crowded room and I cant see your face
Put your arms around me tell me everythings OK
In my mind Im running round a cold and empty space
Just put your arms around me tell me that youve got me where you want me
Now Im right here right here
Oh oh oh oh
Aah ah ah ah
Ooh oh oh oh
Aah ah ah ah

You know Im contained
Oh yes Im up on that thing
Right here is where Id stay
But Im not complaining

Right here you got me where you want me
Now Im rig

# Question 3. Bigram Markov Chain Model

Now you'll build a more complex Markov chain that uses the last _two_ words (or bigram) to predict the next word. Now your dict `chain` should map a _tuple_ of words to a list of words that appear after it.

As before, you should also include tags that indicate the beginning and end of a song, as well as line breaks. That is, a tuple might contain tags like `"<START>"`, `"<END>"`, and `"<N>"`, in addition to regular words. So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    (None, "<START>"): ["Some", ...],
    ("<START>", "Some"): ["people", ...],
    ("Some", "people"): ["call", ...],
    ("people", "call"): ["me", ...],
    ("call", "me"): ["the", ...],
    ("me", "the"): ["space", ...],
    ("the", "space"): ["cowboy,", ...],
    ("space", "cowboy,"): ["yeah", ...],
    ("cowboy,", "yeah"): ["<N>", ...],
    ("yeah", "<N>"): ["Some", ...],
    ("time", "<N>"): ["Come"],
    ...,
    ("<N>", "Come"): ["on", ...],
    ("Come", "on"): ["baby", ...],
    ("on", "baby"): ["and", ...],
    ("baby", "and"): ["I'll", ...],
    ("and", "I'll"): ["show", ...],
    ("I'll", "show"): ["you", ...],
    ("show", "you"): ["a", ...],
    ("you", "a"): ["good", ...],
    ("a", "good"): ["time", ...],
    ("good", "time"): ["<END>", ...],
}
```
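
One possible way to collect such bigram keys is to slide a window of three states over the song, as sketched below. The toy state list is made up from the example, and `chain` here is just a local demo variable, not the full solution.

```python
# Toy state sequence for one short song, with None padding before <START>.
states = [None, "<START>", "Some", "people", "call", "me",
          "<N>", "Come", "on", "<END>"]

chain = {}
# zip over three staggered copies yields (previous, current, next) triples.
for prev, cur, nxt in zip(states, states[1:], states[2:]):
    chain.setdefault((prev, cur), []).append(nxt)

print(chain[(None, "<START>")])
print(chain[("me", "<N>")])
```

Tuples are used as keys because they are hashable, whereas lists are not.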

In [38]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.

    Returns:
      A dict that maps a tuple of 2 words ("bigram") to a list of
      words that follow that bigram, representing the Markov
      chain trained on the lyrics.
    """
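    chain = {(None, "<START>"): []}
    for lyric in lyrics:
        # split the song into non-empty lines so that line breaks become <N> states
        lines = [line for line in lyric.split("\n") if line != ""]
        if not lines:
            continue
        tokens = [None, "<START>"]
        for l, line in enumerate(lines):
            words = line.split(" ")
            # remove all punctuation
            words = [re.sub(r'[^\w\s]', '', word) for word in words]
            # remove \r characters
            words = [word.replace("\r", "") for word in words]
            tokens += words
            tokens.append("<N>" if l < len(lines) - 1 else "<END>")
        # record every (bigram -> next state) transition
        for i in range(len(tokens) - 2):
            bigram = (tokens[i], tokens[i + 1])
            chain.setdefault(bigram, []).append(tokens[i + 2])
    return chain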

In [39]:
# Load the pickled lyrics object that you created in Question 1.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain[(None, "<START>")])

['Standing', 'Standing', 'Theres', 'Standing', 'Theres', 'Téléchargez', 'From', 'Wrapped', 'I', 'Téléchargez', 'Téléchargez', 'Finally', 'Finally', 'Finally', 'Finally', 'Finally', 'Finally', 'When', 'When', 'Standing', 'Standing', 'Standing', 'Wrapped', 'in', 'I', 'I', 'I', 'Wrapped', 'Wrapped', 'Téléchargez', 'Finally', '', 'feat', 'with', 'feat', 'Verse', 'Verse', 'Standing', 'Standing', 'Standing', 'Standing', 'Sometimes', 'Theres', 'Going', 'You', 'Smoking', 'Time', 'Thinking', 'Wrapped', 'I', 'I', 'Birds', 'In', 'Dont', 'Theres', 'with']


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `(None, "<START>")` state and randomly sample a word from the list of words that follow this bigram. Then, at each step, we'll randomly sample the next word from the list of words that followed the current bigram (i.e., the last two words). We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!
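
The sampling loop described above can be sketched on a tiny hand-made chain. Everything here is illustrative: `toy_chain`, `walk`, and the `max_steps` safety cap are hypothetical names, and the chain is far smaller than one trained on real lyrics.

```python
import random

# A tiny hand-made bigram chain (made up for this demo).
toy_chain = {
    (None, "<START>"): ["Some"],
    ("<START>", "Some"): ["people"],
    ("Some", "people"): ["call"],
    ("people", "call"): ["me"],
    ("call", "me"): ["<N>", "<END>"],
    ("me", "<N>"): ["Some"],
    ("<N>", "Some"): ["people"],
}

def walk(chain, max_steps=1000):
    """Sample states until <END> is drawn (or max_steps is reached)."""
    state = (None, "<START>")
    path = []
    for _ in range(max_steps):
        # A bigram that was never seen in training ends the song.
        nxt = random.choice(chain.get(state, ["<END>"]))
        if nxt == "<END>":
            break
        path.append(nxt)
        # Slide the window: the new state is (last word, sampled word).
        state = (state[1], nxt)
    return path

path = walk(toy_chain)
print(path)
```

The only random decision in this toy chain is after `("call", "me")`, where the walk either starts another line or ends the song.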

In [40]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()

    Returns:
      A string representing the randomly generated song.
    """

    # a list for storing the generated words
    words = []
    # generate the first word
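    bigram = (None, "<START>")
    # use next_word rather than next, which would shadow the Python built-in
    next_word = random.choice(chain[bigram])
    words.append(next_word)
    bigram = (bigram[1], next_word)

    # YOUR CODE HERE
    # Keep sampling until <END>; a bigram that was never seen also ends the song.
    while bigram[1] != "<END>":
        choices = chain.get(bigram, ["<END>"])
        next_word = random.choice(choices)
        words.append(next_word)
        bigram = (bigram[1], next_word)

    # join the words (without <END>) into a string, turning <N> into line breaks
    song = " ".join(words[:-1])
    return "\n".join(line.strip() for line in song.split("<N>"))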

In [50]:
print(generate_new_lyrics(chain))

Thinking about those nusery rhyming days
When it was all just fun and games
Never had a care in the way you talk to me
You rationalize my darkest thoughts
Yeah you set them free

Came to you with a broken soul
Will you hold my hand
Cause I dont wanna break
And I dont already know
You told me wed win
So I took all I need
Cut me do I not bleed
No bad blood

No bad blood
Just LIVE your LIFE
You cut me deeper every day with your smile

See I aint got far to go
Cause I spent forever waiting
And its no longer a dream
And now Ive landed on my own anymore
Wont you understand Cause I dont wanna walk alone
Im ready for this theres no denying
Im ready for your calling
Now Im right here right here

Ooh oh oh oh

Cant let go and it doesnt matter how I cry
My tears of love are a waste of time
If I turn away am I to travel with no one else could see
I drew a smile on my feet
And I aint got far to go
Cause I spent forever waiting
And its no longer a dream
And now Ive landed on my shoulders
I was lost 

**Paste your randomly generated song lyrics (either unigram or bigram) into the Discord channel and we can try to guess the artist!**

# Question 4. Analysis

Compare the quality of the lyrics generated by the unigram model (in Question 2) and the bigram model (in Question 3). Which model seems to generate more reasonable lyrics? Can you explain why? What do you see as the advantages and disadvantages of each model?
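
One way to put a number on part of this comparison, offered only as a sketch, is to measure how often a state has exactly one distinct successor. At such states the model has no choice and can only reproduce the training text. `single_successor_fraction` is a hypothetical helper, and the two toy chains are made up for the demo.

```python
def single_successor_fraction(chain):
    """Fraction of states that have exactly one distinct successor."""
    if not chain:
        return 0.0
    forced = sum(1 for followers in chain.values() if len(set(followers)) == 1)
    return forced / len(chain)

# Made-up toy chains, much smaller than chains trained on real lyrics.
toy_unigram = {
    "<START>": ["I", "You"],
    "I": ["am", "was"],
    "You": ["are"],
    "am": ["<END>"],
}
toy_bigram = {
    (None, "<START>"): ["I", "You"],
    ("<START>", "I"): ["am"],
    ("<START>", "You"): ["are"],
}

print(single_successor_fraction(toy_unigram))
print(single_successor_fraction(toy_bigram))
```

Running the same function on your own unigram and bigram chains may help support the argument you make in your answer.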

**YOUR ANSWER HERE.**

## Submission Instructions

- After you have completed the notebook, select **Runtime > Run all**
- After the notebook finishes rerunning, check that there are no errors and that everything runs properly. Fix any problems and repeat this step until it works.
- Rename this notebook by clicking on "DATA 301 Assignment 04 - YOUR NAMES HERE" at the very top of this page. Replace "YOUR NAMES HERE" with the first and last names of you and your partner (if you worked with one).
- Expand all cells with View > Expand Sections
- Save a PDF version: File > Print > Save as PDF
    - Under "More Settings" make sure "Background graphics" is checked
    - Printing Colab to PDF doesn't always work well, and some of your output might get cut off. That's OK.
    - It's not necessary, but if you want a more nicely formatted PDF you can uncomment and run the code in the following cell. (Here's a [video](https://www.youtube.com/watch?v=-Ti9Mm21uVc) with other options.)
- Download the notebook: File > Download .ipynb
- Submit the notebook and PDF in Canvas. If you worked in a pair, only one person should submit in Canvas.

In [42]:
# !wget -nc https://raw.githubusercontent.com/brpy/colab-pdf/master/colab_pdf.py
# from colab_pdf import colab_pdf
# colab_pdf('DATA 301 Lab4B - YOUR NAMES HERE.ipynb')