<h1>Markov's Muse: Using N-Gram Markov Chains to Generate Emotion-Based Lyrics</h1>

<h3>Hannah Shu, Stanford University</h3>
<h3>CS 109 Extra Credit Project</h3>

In [148]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import random

<h1>Emotion Lexicon</h1>

For this project, I used the NRC Emotion Lexicon, a list of words tagged with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). Each word can be tagged with one or more of these labels, indicating an association between the word and the corresponding emotion or sentiment. I use the lexicon to influence word selection during lyric generation.
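
As a minimal sketch of the word-to-emotions mapping described above, the snippet below parses a few rows in the lexicon's tab-separated layout (word, emotion, association). The rows are copied from the printed output in this notebook; the names `sample_rows` and `build_toy_lexicon` are hypothetical and only serve the illustration.

```python
import csv
import io
from collections import defaultdict

# Toy rows in the lexicon's layout: word <TAB> emotion <TAB> association (0 or 1).
# Illustrative subset only; the real file has one row per word-emotion pair.
sample_rows = (
    "aback\tanger\t0\n"
    "abacus\ttrust\t1\n"
    "abandon\tfear\t1\n"
    "abandon\tnegative\t1\n"
    "abandon\tsadness\t1\n"
)

def build_toy_lexicon(text):
    """Map each word to the emotions it is associated with (association == 1)."""
    lexicon = defaultdict(list)
    for word, emotion, association in csv.reader(io.StringIO(text), delimiter="\t"):
        if association == "1":
            lexicon[word].append(emotion)
    return dict(lexicon)

toy_lexicon = build_toy_lexicon(sample_rows)
print(toy_lexicon)
```

Words whose rows all have association 0 (such as "aback" here) never enter the dictionary, so a plain membership test is enough to tell whether a word carries any emotion tag.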

In [149]:
#locating lexicon
lexicon_path = '/Users/hannahshu/Downloads/NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt'
lexicon_df = pd.read_csv(lexicon_path, delimiter='\t', header=None, names=['word', 'emotion', 'association'])
print(lexicon_df)

#cleaning lexicon
filtered_lexicon_df = lexicon_df[lexicon_df['association'] == 1].reset_index(drop=True).drop(columns='association')
print(filtered_lexicon_df)

emotion_counts = filtered_lexicon_df.groupby('emotion').size()
print(emotion_counts)

#convert lexicon into a dictionary mapping each word to its list of emotions
emotion_lexicon = {}
for word, emotion in zip(filtered_lexicon_df['word'], filtered_lexicon_df['emotion']):
    emotion_lexicon.setdefault(word, []).append(emotion)


emotion_lexicon

         word       emotion  association
0       aback         anger            0
1       aback  anticipation            0
2       aback       disgust            0
3       aback          fear            0
4       aback           joy            0
...       ...           ...          ...
141535   zoom      negative            0
141536   zoom      positive            0
141537   zoom       sadness            0
141538   zoom      surprise            0
141539   zoom         trust            0

[141540 rows x 3 columns]
            word       emotion
0         abacus         trust
1        abandon          fear
2        abandon      negative
3        abandon       sadness
4      abandoned         anger
...          ...           ...
13867       zest  anticipation
13868       zest           joy
13869       zest      positive
13870       zest         trust
13871        zip      negative

[13872 rows x 2 columns]
emotion
anger           1245
anticipation     837
disgust         1056
fear        

{'abacus': ['trust'],
 'abandon': ['fear', 'negative', 'sadness'],
 'abandoned': ['anger', 'fear', 'negative', 'sadness'],
 'abandonment': ['anger', 'fear', 'negative', 'sadness', 'surprise'],
 'abba': ['positive'],
 'abbot': ['trust'],
 'abduction': ['fear', 'negative', 'sadness', 'surprise'],
 'aberrant': ['negative'],
 'aberration': ['disgust', 'negative'],
 'abhor': ['anger', 'disgust', 'fear', 'negative'],
 'abhorrent': ['anger', 'disgust', 'fear', 'negative'],
 'ability': ['positive'],
 'abject': ['disgust', 'negative'],
 'abnormal': ['disgust', 'negative'],
 'abolish': ['anger', 'negative'],
 'abolition': ['negative'],
 'abominable': ['disgust', 'fear', 'negative'],
 'abomination': ['anger', 'disgust', 'fear', 'negative'],
 'abort': ['negative'],
 'abortion': ['disgust', 'fear', 'negative', 'sadness'],
 'abortive': ['negative', 'sadness'],
 'abovementioned': ['positive'],
 'abrasion': ['negative'],
 'abrogate': ['negative'],
 'abrupt': ['surprise'],
 'abscess': ['negative', 'sad

<h1>Web Scraping Song Lyrics from Different Artists</h1>

The code that scrapes the songs of each artist lives in a separate file. Here I load the resulting CSV files into dataframes for use in this project. Each dataframe contains the song titles and the corresponding lyrics.

In [150]:
#turn artists into dataframes
df_beatles = pd.read_csv('beatlessongs.csv')
df_joji = pd.read_csv("jojisongs.csv")
df_taylor = pd.read_csv("taylorsongs.csv")
df_hozier = pd.read_csv('hoziersongs.csv')
df_kendrick = pd.read_csv("kendricksongs.csv")
df_lukecombs = pd.read_csv('lukesongs.csv')
df_olivia = pd.read_csv('oliviasongs.csv')
df_ed = pd.read_csv('edsongs.csv')
df_mitski = pd.read_csv('mitskisongs.csv')
df_tyler = pd.read_csv('tylersongs.csv')
df_seal = pd.read_csv('sealsongs.csv')


df_seal

Unnamed: 0,Song Title,Lyrics
0,A Change Is Gonna Come,"[Verse 1] <br> I was born by the river, in a l..."
1,A Father's Way,I build a fence around you in a fathers way <b...
2,A Minor Groove,"I get high when I touch ya... <br> Oooooo, <br..."
3,Ain't nothing but a house party - bonus track,"They're dancing on the ceiling, they're dancin..."
4,Ain’t No Better Love,"[Verse 1] <br> Oh that reaper, he gonna win <b..."
...,...,...
219,Wishing on a Star,I'm wishing on a star <br> To follow where you...
220,With Me Part 1,[INTRO-JD] <br> Ay'thing you like is wit me <b...
221,You Are My Kind,[Verse 1: Seal] <br> Stay with me baby <br> An...
222,You Are My Kind (w/Santana),Stay with it baby <br> And that's all I ask of...


<h1>Training N-Gram as a Markov Chain</h1>

Inspired by ChatGPT and the use of n-grams in language modeling, I decided to create an n-gram Markov chain from the list of song lyrics. The n-gram Markov chain acts as a probabilistic model that predicts the next word in a sequence based on the previous n words. The function splits each lyric into words and builds a dictionary that maps each n-gram to the words that follow it.

In [151]:
def train_ngram_markov_chain(lyrics, n):
    start_key = tuple([None] + ["<START>"] * (n-1)) #start key, which marks the beginning of a lyric
    chain = {start_key: []} #initialize the markov chain

    for lyric in lyrics: #processing lyrics
        if not isinstance(lyric, str) or not lyric.strip(): #skip missing or empty lyrics
            continue

        #split into words and record the first word under the start key
        words = lyric.split()
        chain[start_key].append(words[0])

        #build n-grams: each key holds up to n words, padded with <START> at the beginning
        for i in range(1 - n, len(words)):
            if i < 0:
                start_ngram = tuple(["<START>"] * -i) + tuple(words[:i + n])
            else:
                start_ngram = tuple(words[i:i + n])

            #the word that follows the n-gram, or <END> when the lyric runs out
            if i + n < len(words):
                next_word = words[i + n]
            else:
                next_word = "<END>"

            if start_ngram not in chain:
                chain[start_ngram] = []
            chain[start_ngram].append(next_word)

    return chain
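
To make the "probabilistic model" concrete, here is a small self-contained sketch of the same idea on toy data. It is a simplified stand-in rather than the function above (it omits the special start key), and the names `build_toy_chain`, `transition_probabilities`, and the three toy lines are hypothetical. Because each observed next word is stored once per occurrence, a uniform random pick from the list is equivalent to sampling from the empirical probability P(next word | previous n words) = count / total.

```python
from collections import Counter, defaultdict

def build_toy_chain(lines, n):
    """Map each n-word context to the list of words observed right after it."""
    chain = defaultdict(list)
    for line in lines:
        words = ["<START>"] * (n - 1) + line.split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n])
            next_word = words[i + n] if i + n < len(words) else "<END>"
            chain[context].append(next_word)
    return dict(chain)

def transition_probabilities(chain, context):
    """Empirical P(next word | context): each count divided by the total."""
    counts = Counter(chain[context])
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Made-up toy lines, not taken from the scraped data.
toy_lines = ["hold me now", "hold me close", "hold me now"]
toy_chain = build_toy_chain(toy_lines, n=2)
toy_probs = transition_probabilities(toy_chain, ("hold", "me"))
print(toy_chain[("hold", "me")])
print(toy_probs)
```

In this toy corpus "now" follows ("hold", "me") twice and "close" once, so the chain assigns them probabilities of 2/3 and 1/3.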

<h1>Generating Lyrics</h1>

This function generates song lyrics from the n-gram Markov chain built by the previous function. It biases the lyrics toward a specified emotion by filtering the candidate next words: starting from a predefined start key, it repeatedly looks up the current n-gram, picks a candidate associated with the desired emotion when one exists (and any candidate otherwise), and stops when it draws the end token or reaches an n-gram that is not in the chain.

In [152]:
def generate_emotion_lyrics(chain, emotion_lexicon, emotion, n):
    start_key = tuple([None] + ["<START>"] * (n-1)) #initializing the start key

    words = [random.choice(chain[start_key])] #choosing the first word randomly from the markov chain
    current_ngram = start_key[1:] + (words[-1],) #update current n_gram

    while current_ngram in chain:
        possible_next_words = chain[current_ngram] #words observed after the current n-gram
        #keep only the candidates tagged with the target emotion
        emotion_words = [word for word in possible_next_words
                         if word in emotion_lexicon and emotion in emotion_lexicon[word]]

        if not emotion_words: #no candidate matches the emotion
            next_word = random.choice(possible_next_words) #fall back to any observed next word
        else:
            next_word = random.choice(emotion_words) #choose among the emotion-matching words
        if next_word == '<END>':
            break
        
        words.append(next_word)
        current_ngram = current_ngram[1:] + (next_word,)

    lyrics = " ".join(words)
    return "\n".join(lyrics.split("<N>"))
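
The emotion filter at the heart of this function can be sketched in isolation. This is a minimal illustration under stated assumptions: `pick_emotion_word` is a hypothetical helper, and the toy lexicon reuses two entries from the printed lexicon output plus an untagged word.

```python
import random

def pick_emotion_word(candidates, lexicon, emotion, rng=random):
    """Prefer candidates tagged with `emotion`; otherwise pick from all candidates."""
    tagged = [word for word in candidates if emotion in lexicon.get(word, [])]
    return rng.choice(tagged if tagged else candidates)

# Two entries copied from the lexicon output above; "the" has no emotion tags.
toy_lexicon = {
    "abacus": ["trust"],
    "abandon": ["fear", "negative", "sadness"],
}
candidates = ["abandon", "abacus", "the"]

fear_pick = pick_emotion_word(candidates, toy_lexicon, "fear")    # only "abandon" matches
trust_pick = pick_emotion_word(candidates, toy_lexicon, "trust")  # only "abacus" matches
joy_pick = pick_emotion_word(candidates, toy_lexicon, "joy")      # no match: any candidate
print(fear_pick, trust_pick, joy_pick)
```

One consequence of this hard preference is worth keeping in mind: since the end token is not a lexicon word, it can only be drawn at steps where none of the candidates carries the target emotion, which may make the generated lyrics run long.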


In [153]:
#parameters
n = 2
emotion = 'positive'
ngram_chain = train_ngram_markov_chain(df_seal['Lyrics'], n)

#get generated lyrics

generated_lyrics = generate_emotion_lyrics(ngram_chain, emotion_lexicon, emotion, n)
print(generated_lyrics)

My emotions <br> Drift away <br> Say it's alright <br> Everything that we can break away <br> I lost my faith <br> Long ago <br> What's become of them? <br> When you lose all thoughts, sense of time <br> You'll always be loved <br> Don't cry <br> Tonight, my baby my baby <br> Come if it felt the same mistake this time <br> You'll still be loved <br> Don't cry <br> Loving you whether, whether <br> Whether times are good or bad <br> And it's all for you" <br> Now shoot that love baby don't you lose your self-esteem <br> That's when you're in love with angel? <br> If you wreck us gracefully <br> The harm in what you do it again, yah <br> Well I love you baby <br> Take away the snake this power the sun rose high <br> Am I even know who the real ones are cause you've seen <br> There's hate when you fall by the sea <br> Fly right into the arms of peace. <br> For crying out loud when the feeling <br> When you lose it all, it won't be afraid <br> Just ruthless <br> More green than I've ever ha

<h1>Specifying Which Parts to Generate</h1>

Although generating a whole song is useful, most artists and musicians would probably appreciate a tool that generates a single part of a song, i.e., a chorus, verse, intro, or outro. We begin by training a second n-gram Markov chain. Unlike the first version, this chain keys each next word on the previous n - 1 words.

In [154]:
def preprocess_lyrics(lyrics): #suggested by ChatGPT
    """Pad <br> tags with spaces and turn newlines into <br> so every line break becomes its own token."""
    return lyrics.replace('<br>', ' <br> ').replace('\n', ' <br> ')

def train_ngram_markov_chain_2(lyrics, n):
    chain = {} #initialize markov chain

    for lyric in lyrics: #processing lyrics
        if type(lyric) is not str: 
            continue
        elif not lyric.strip():
            continue

        words = preprocess_lyrics(lyric).split() #tokenize, keeping <br> as its own token
        words = ["<START>"] * (n - 1) + words + ["<END>"] #pad with start and end tokens

        for i in range(len(words) - n + 1): #each n-gram gives a key of n-1 words plus the word that follows
            gram = tuple(words[i:i+n])
            key = gram[:-1]
            next_word = gram[-1]
            if key not in chain:
                chain[key] = []
            chain[key].append(next_word)
    
    return chain
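
The tokenizing and padding steps are easier to see on a single short lyric. The sketch below mirrors them in a self-contained form; `tokenize_lyric` and `padded_ngrams` are hypothetical names, and the sample lyric is made up for the illustration.

```python
def tokenize_lyric(lyric):
    """Pad <br> markers with spaces so they split off as their own tokens."""
    return lyric.replace("<br>", " <br> ").replace("\n", " <br> ").split()

def padded_ngrams(tokens, n):
    """Yield (context, next_word) pairs, where the context holds n - 1 words."""
    padded = ["<START>"] * (n - 1) + tokens + ["<END>"]
    for i in range(len(padded) - n + 1):
        gram = tuple(padded[i:i + n])
        yield gram[:-1], gram[-1]

# Made-up lyric with a <br> glued to its neighbours.
tokens = tokenize_lyric("[Chorus]<br>Don't cry")
pairs = list(padded_ngrams(tokens, n=2))
print(tokens)
for context, next_word in pairs:
    print(context, "->", next_word)
```

With n = 2 every context is a single word, so a section tag such as [Chorus] becomes a key of its own, which is what allows generation to start from it later.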
    

<h1>Generating Sequences Based on Emotion and Lyric Type</h1>

Like the previous generating function, this function takes an n-gram Markov chain and uses the emotion lexicon to bias word selection toward a specific emotion. It adds two features: generation begins from a chosen start phrase (e.g., [Chorus], [Intro], or [Verse]), and it stops when the output reaches a specified word limit or the end token is drawn.

In [155]:
def generate_sequence(chain, start_phrase, n, word_limit, emotion_lexicon, emotion):
    def pick_next_word(possible_next_words):
        #prefer candidates tagged with the target emotion, otherwise use any candidate
        if emotion_lexicon and emotion:
            emotion_words = [word for word in possible_next_words
                             if word in emotion_lexicon and emotion in emotion_lexicon[word]]
            if emotion_words:
                return random.choice(emotion_words)
        return random.choice(possible_next_words)

    start_words = preprocess_lyrics(start_phrase).split()

    #pad start phrases that are shorter than n-1 words (suggested by ChatGPT)
    if len(start_words) < n - 1:
        start_words = ["<START>"] * (n - 1 - len(start_words)) + start_words

    current_ngram = tuple(start_words[-(n-1):]) #the last n-1 words form the lookup key
    output = list(start_words)

    while len(output) < word_limit: #stop once the word limit is reached
        if current_ngram in chain:
            possible_next_words = chain[current_ngram]
        else: #look for a potential match for a missing key (written with help from ChatGPT)
            if len(current_ngram) < 2:
                break #a key shorter than two words has nothing to match on (always the case when n = 2)
            key_val, val_val = current_ngram[0], current_ngram[1]
            #use the first key that contains the first word and has the second word among its next words
            matching_key = None
            for key, value in chain.items():
                if key_val in key and val_val in value:
                    matching_key = key
                    break
            if matching_key is None:
                break
            possible_next_words = chain[matching_key]

        next_word = pick_next_word(possible_next_words)
        if next_word == "<END>": #end token reached
            break
        output.append(next_word)
        current_ngram = tuple(output[-(n-1):]) #slide the key forward by one word

    output = [word for word in output if word not in ["<START>", None]] #remove start tokens
    return ' '.join(output)
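
The two stopping conditions can be seen on a pair of tiny hand-built chains. This is a simplified sketch for n = 2 only, without the emotion filter or the fallback search; `generate_with_limit` and both toy chains are hypothetical.

```python
import random

def generate_with_limit(chain, start_word, word_limit, rng=random):
    """Walk a bigram chain from start_word until <END>, a dead end, or the word limit."""
    output = [start_word]
    context = (start_word,)
    while len(output) < word_limit and context in chain:
        next_word = rng.choice(chain[context])
        if next_word == "<END>":
            break
        output.append(next_word)
        context = (next_word,)
    return " ".join(output)

# Toy chain where "la" always follows itself, so only the word limit stops the walk.
looping_chain = {("[Chorus]",): ["la"], ("la",): ["la"]}
looping_result = generate_with_limit(looping_chain, "[Chorus]", word_limit=5)
print(looping_result)

# Toy chain where <END> is the only option after "bye", so the walk ends early.
ending_chain = {("[Chorus]",): ["bye"], ("bye",): ["<END>"]}
ending_result = generate_with_limit(ending_chain, "[Chorus]", word_limit=5)
print(ending_result)
```

As in the function above, the start phrase counts toward the word limit, so a limit of 5 leaves room for four generated words after the section tag.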


In [156]:
#parameters
desired_section = '[Chorus]'
word_limit = 30
emotion = 'positive'
n = 2
df = df_kendrick
df['Lyrics'] = df['Lyrics'].fillna('')

filtered_lyrics = df[df['Lyrics'].str.startswith(desired_section)]['Lyrics']

if filtered_lyrics.empty:
    print("No lyrics found for the desired section.")
else:
    ngram_chain = train_ngram_markov_chain_2(filtered_lyrics, n) #train with filtered lyrics by desired section
    start_phrase = desired_section
    generated_sequence = generate_sequence(ngram_chain, start_phrase, n, word_limit, emotion_lexicon, emotion)
    print(generated_sequence)

[Chorus] <br> Yeah, I found you a vow soon <br> God, love and job <br> [Verse 2] <br> This how to embrace y'all gon’ see that new, new sh** <br>
