# Can a language model rap like Eminem?
## Lyrics Generator using Markov Chains

Natural language processing (NLP), and language models in particular, is all around us these days. It is remarkable how well such models can generate new content, including art, without anyone hand-coding grammar or fixed rules. This is a hotly debated topic (see [Towards Data Science: AI Art Debate](https://towardsdatascience.com/the-ai-art-debate-excitement-fear-and-ethics-c04d30f338da)). Beyond the ethical discussion, language models open up a large field of use cases. For example, combined with a data source containing lyrics, they could be used to generate songs. The aim of this project is to analyze songs from one specific artist and try to generate new lyrics that could be sold as new songs. Beyond plain generation, it is also interesting to see whether rhymes can be generated using a Markov chain.

## Authentication
To get correct lyrics in a consistent format, the [Genius API](https://docs.genius.com/) is used, which requires an OAuth process: a client needs to be created in the Genius API to obtain an access token for authentication. With the Python package [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/), all endpoints concerning artists and songs can be called easily from code.

In [98]:
import lyricsgenius

In [99]:
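import os
# Read the token from the environment instead of hard-coding a secret in the notebook
# (the variable name GENIUS_ACCESS_TOKEN is an assumption; use whatever name you export)
ACCESS_TOKEN = os.environ["GENIUS_ACCESS_TOKEN"]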
ARTIST = "Eminem"

In [100]:
geniusAPI = lyricsgenius.Genius(ACCESS_TOKEN)

## Collecting data

First of all, we check whether the artist is available in the API. Because of the server's response time, we first fetch the song list and then make an individual API call for each song's lyrics.

In [101]:
artist = geniusAPI.search_artist(ARTIST, max_songs=1)
artist

Searching for songs by Eminem...

Song 1: "Rap God"

Reached user-specified song limit (1).
Done. Found 1 songs.


Artist(id, songs, ...)

In [102]:
songList = geniusAPI.artist_songs(artist.id, per_page=50)

In [105]:
songs = []
for song in songList['songs']:
    if song['lyrics_state'] == 'complete':  # only fetch songs whose lyrics are marked complete
        songs.append(geniusAPI.lyrics(song['id']))

Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Shabaam-sahdeeq-5-star-generals-instrumental-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Eminem-8-mile-instrumental-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Jay-z-8-miles-and-runnin-instrumental-lyrics


To add some randomness to the data, the song list is shuffled.

In [106]:
import random

In [107]:
random.shuffle(songs)

## Preprocessing
In this section, everything that is not lyrics is removed to get the plain lyrics string. Line breaks are usually stripped when preparing text for a language model, but they matter for rhymes, so we keep them in the text.

In [108]:
import re

In [109]:
songs[0]

'8 Mile: Lotto vs B-Rabbit Lyrics[Intro: Nashawn \'Ox\' Breedlove (as Lotto)]\nFuck this coward, dawg\nFree World in the motherfuckin\' house, what\'s goin\' on, baby?\nYo, it\'s time to get rid of this coward right here once and for all\nSick of this motherfucker\nCheck this shit out\nHuh\nHuh, yo\n\n[Verse 1: Nashawn \'Ox\' Breedlove (as Lotto)]\nYo, I\'ll spit a racial slur, honky, sue me\nThe shit is a horror flick, but the black guy doesn\'t die in this movie\nFuckin\' with Lotto? Dawg, you gotta be kiddin\'\nThat makes me believe you really don\'t have an interest in livin\'\nYou think these niggas gon\' feel the shit you say?\nI got a better chance joinin\' the KKK\nOn some real shit, though, I like you\nThat\'s why I didn\'t wanna have to be the one you commit suicide to\nFuck "Lotto", call me your leader\nI feel bad that I gotta murder that dude from "Leave It To Beaver"\nI used to like that show, now you got me in fight-back mode\nBut oh well, if you gotta go, then you gotta 

After checking the first element in the lyrics list, we can see that there is a leading section that contains title information. Important sections inside the song, such as Intro and Chorus, are marked with brackets. In addition, the text always ends with a number followed by the word "Embed". None of this is useful for the language model, so I remove it in the preprocessing step using regular expressions.

In [110]:
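def cleaning(lyrics: str) -> str:
    # title header: everything on the first line up to and including " Lyrics"
    # (non-greedy and limited to one match, so a later " Lyrics" in the song is left alone)
    lyrics = re.sub(r"^.*? Lyrics", "", lyrics, count=1)
    # song section markers such as [Intro] or [Chorus]
    lyrics = re.sub(r"\[.*?\]", "", lyrics)
    # counter plus "Embed" at the end, e.g. "5Embed", "12Embed" or "1.2KEmbed"
    return re.sub(r"(\d+(\.\d+)?K?)?Embed", "", lyrics)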
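
To make the three removals easier to follow, here is a small sketch that applies patterns in the spirit of `cleaning` to a made-up string. The helper name `strip_genius_markup` and the sample text are assumptions for illustration only; real Genius pages may contain other residue that these patterns do not cover.

```python
import re


def strip_genius_markup(raw: str) -> str:
    """Hypothetical helper: remove title header, section markers and the Embed suffix."""
    # 1. title header on the first line, up to and including " Lyrics"
    text = re.sub(r"^.*? Lyrics", "", raw, count=1)
    # 2. bracketed section markers such as [Intro] or [Chorus]
    text = re.sub(r"\[.*?\]", "", text)
    # 3. optional counter followed by the word "Embed"
    return re.sub(r"(\d+(\.\d+)?K?)?Embed", "", text)


# Made-up sample in the same layout as the raw lyrics shown above
sample = "Toy Song Lyrics[Intro]\nFirst line\nSecond line\n\n[Chorus]\nThird line12Embed"
cleaned = strip_genius_markup(sample)
print(repr(cleaned))
```

Note that the line breaks survive the cleaning, which is exactly what the rhyme structure relies on later.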

In [111]:
cleaned_lyrics = [cleaning(lyric) for lyric in songs if isinstance(lyric, str)]

In [115]:
print("Available cleaned songs:", len(cleaned_lyrics))

Available cleaned songs: 44


## Model generation
After the data cleaning is completed, the input text is ready to be processed by the model.

In [116]:
import uuid

To be able to analyze the generated text afterwards, a unique ID is used to name the saved input and output files.

In [117]:
uuid_song = str(uuid.uuid4())
uuid_song

'ec78c362-a959-466a-bacf-acc530de4e28'

By joining all song lyrics, we get the basic corpus for the Markov Chain. See the next section for a detailed description.

In [118]:
input_text = ' '.join(cleaned_lyrics)
# the input_text/ directory must already exist
with open('input_text/' + uuid_song + ".txt", "w", encoding="utf-8") as file:
    file.write(input_text)

### Markov Chains
Markov chains are mathematical models from probability theory that describe a sequence in which the next state depends only on the current one; here they are used to generate text. While training the Markov chain, a transition table is built that stores the probability of the next word or character given the preceding text. \
See: [An Introduction to Markov Chains](http://dx.doi.org/10.13140/2.1.1833.8248)
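
Written out for the character-based case, the idea can be sketched as follows. The notation is introduced here only for illustration and is not taken from the referenced paper: $c_t$ is the character at position $t$, $k$ is the context length, and $N(\cdot)$ counts how often a character sequence occurs in the corpus.

$$
P(c_t \mid c_1, \dots, c_{t-1}) \approx P(c_t \mid c_{t-k}, \dots, c_{t-1}),
\qquad
\hat{P}(c_t \mid c_{t-k}, \dots, c_{t-1}) = \frac{N(c_{t-k} \dots c_{t-1}\, c_t)}{\sum_{c'} N(c_{t-k} \dots c_{t-1}\, c')}
$$

The first relation is the order-$k$ Markov assumption; the second is the relative-frequency estimate that a count-based transition table yields.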

#### Word-based generation vs Character-based generation
There are two possible ways to build a Markov chain for text. Word-based models calculate the probability of the next word, whereas the character-based approach weights each character individually. \
One side effect of word-based generation is that the vocabulary only includes words that are already known. This can be tricky for lyrics generation, because a new word is sometimes invented, or borrowed from another language, for the sake of a rhyme. Even though I don't think that this model is able to create new words, the character-based approach is the better fit.
Another reason why I chose a character-based model is line breaks. Looking at a song text, one can see that the lines define the rhyme, and line breaks are easier to reproduce when generating characters.
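
The difference between the two approaches can be sketched on a made-up two-line lyric. The helper `contexts` and the sample text are assumptions for illustration; the point is only that a plain whitespace split loses the line break, while a character sequence keeps it as an ordinary symbol.

```python
def contexts(tokens, k):
    """Return (context, next_token) pairs for an order-k chain over `tokens`."""
    return [(tuple(tokens[i:i + k]), tokens[i + k]) for i in range(len(tokens) - k)]


text = "I write a rhyme\nevery time"  # made-up sample with one line break

word_tokens = text.split()  # whitespace split: the "\n" disappears
char_tokens = list(text)    # every character, including "\n", is a token

word_pairs = contexts(word_tokens, 1)
char_pairs = contexts(char_tokens, 4)

print(word_pairs[:3])
# character contexts that are followed by a line break
print([pair for pair in char_pairs if pair[1] == "\n"])
```

In the character-based view, the line break is just another character the model can learn to emit after a context such as `hyme`.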

In [119]:
def getTransitionTable(data, k = 4):
    # X is a context of k characters, Y is the character that follows it
    T = {}  # maps each context X to a dict {next character Y: count}

    for i in range(len(data) - k):
        X = data[i:i+k]
        Y = data[i+k]
        if T.get(X) is None:
            # first time this context is seen
            T[X] = {}
            T[X][Y] = 1
        else:
            if T[X].get(Y) is None:  # Y has not followed X before
                T[X][Y] = 1
            else:
                T[X][Y] += 1

    return T

The transition table gives an overview of every *k*-character context in the input text (corpus), together with the frequency of each character that follows it. The variable *k* defines the number of characters that are considered when selecting the next character, and therefore the length of the keys in the transition table.
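
To see what such a table looks like on a tiny input, here is a compact variant built with `collections.Counter`. It is a sketch, not the notebook's function: the name `build_table` and the toy string are made up, and `k = 2` is used only to keep the output short.

```python
from collections import Counter, defaultdict


def build_table(data, k):
    """Count, for every k-character context, which character follows it."""
    table = defaultdict(Counter)
    for i in range(len(data) - k):
        table[data[i:i + k]][data[i + k]] += 1
    return table


toy_table = build_table("banana band", k=2)
for context, counts in toy_table.items():
    print(repr(context), dict(counts))
```

The context `an` is followed twice by `a` and once by `d`, so it is the only place where this toy model has a real choice to make.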

In [135]:
lyrics_transition_table = getTransitionTable(input_text, k = 4)
lyrics_transition_table

{'\nFuc': {'k': 36},
 'Fuck': {' ': 42, 'i': 3, ',': 1, '!': 1},
 'uck ': {'t': 20,
  '"': 1,
  'y': 17,
  'i': 14,
  'L': 1,
  'a': 12,
  'd': 3,
  "'": 3,
  'w': 19,
  'o': 8,
  'f': 2,
  'u': 8,
  'm': 8,
  'D': 1,
  's': 2,
  'I': 1,
  'C': 2,
  'e': 1,
  'h': 5,
  'J': 1,
  'H': 1,
  'S': 1,
  'c': 1},
 'ck t': {'h': 52, 'o': 22, 'r': 1, 'u': 1, 'a': 1},
 'k th': {'i': 14, 'e': 36, 'a': 17},
 ' thi': {'s': 179, 'n': 49, 'r': 3, 'c': 3, 'e': 1},
 'this': {' ': 148, ',': 8, '\n': 19, ';': 1, '—': 1, '!': 1, ':': 1, ')': 1},
 'his ': {'c': 13,
  'm': 19,
  's': 27,
  'g': 14,
  'a': 18,
  'b': 36,
  'B': 1,
  'l': 8,
  'r': 10,
  'n': 5,
  'f': 15,
  'e': 6,
  'h': 8,
  'p': 7,
  't': 7,
  'o': 5,
  'w': 5,
  'i': 15,
  'u': 3,
  'v': 2,
  'E': 1,
  'd': 2,
  'y': 1,
  'M': 2,
  'F': 1,
  'k': 1,
  "'": 1,
  '8': 7,
  'I': 1},
 'is c': {'o': 6, 'h': 3, 'u': 2, 'y': 1, 'a': 2, 'r': 2, 'l': 1, 'i': 1},
 's co': {'w': 2, 'r': 3, 'u': 3, 'm': 8, 'o': 2, 'p': 1, 'l': 1},
 ' cow': {'a': 3,

In [136]:
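def convertFreqIntoProb(T):
    # Note: T is modified in place, so the counts are overwritten by probabilities
    for kx in T.keys():
        s = float(sum(T[kx].values()))  # total count for context kx
        for k in T[kx].keys():
            T[kx][k] = T[kx][k]/s

    return T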

To turn the counts into probabilities, each frequency is divided by the total count of its context.
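
As a worked example, the counts for the context `'ck t'` from the table above can be normalized by hand. The helper `to_probabilities` is a made-up name and, unlike `convertFreqIntoProb`, returns a new dictionary instead of changing its input.

```python
def to_probabilities(counts):
    """Divide every count by the total so that the values sum to 1."""
    total = sum(counts.values())
    return {char: n / total for char, n in counts.items()}


# Counts for the context 'ck t', copied from the transition table output
ck_t_counts = {"h": 52, "o": 22, "r": 1, "u": 1, "a": 1}
ck_t_probs = to_probabilities(ck_t_counts)

print(sum(ck_t_counts.values()))  # total number of observations: 77
print(ck_t_probs["h"])            # 52 / 77, roughly 0.675
```

The total is 77, so `h` gets 52/77 and `o` gets 22/77, which agrees with the values in the probability table.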

In [137]:
char_model = convertFreqIntoProb(lyrics_transition_table)
char_model

{'\nFuc': {'k': 1.0},
 'Fuck': {' ': 0.8936170212765957,
  'i': 0.06382978723404255,
  ',': 0.02127659574468085,
  '!': 0.02127659574468085},
 'uck ': {'t': 0.15151515151515152,
  '"': 0.007575757575757576,
  'y': 0.12878787878787878,
  'i': 0.10606060606060606,
  'L': 0.007575757575757576,
  'a': 0.09090909090909091,
  'd': 0.022727272727272728,
  "'": 0.022727272727272728,
  'w': 0.14393939393939395,
  'o': 0.06060606060606061,
  'f': 0.015151515151515152,
  'u': 0.06060606060606061,
  'm': 0.06060606060606061,
  'D': 0.007575757575757576,
  's': 0.015151515151515152,
  'I': 0.007575757575757576,
  'C': 0.015151515151515152,
  'e': 0.007575757575757576,
  'h': 0.03787878787878788,
  'J': 0.007575757575757576,
  'H': 0.007575757575757576,
  'S': 0.007575757575757576,
  'c': 0.007575757575757576},
 'ck t': {'h': 0.6753246753246753,
  'o': 0.2857142857142857,
  'r': 0.012987012987012988,
  'u': 0.012987012987012988,
  'a': 0.012987012987012988},
 'k th': {'i': 0.208955223880597,
  'e': 

## Lyrics generation
By using this probability table, our new lyrics can be generated.

In [138]:
import numpy as np

In [139]:
def sample_next(context, T, k = 4):
    context = context[-k:]  # the next character is predicted from the last k characters

    if T.get(context) is None:
        return ' '  # unseen context: fall back to a space

    possible_chars = list(T[context].keys())
    possible_probabs = list(T[context].values())

    # draw one character according to the stored probabilities
    return np.random.choice(possible_chars, p=possible_probabs)

In [140]:
def generateText(starting_sent,T, k = 4, max_len = 100):
    sentence = starting_sent
    
    context = sentence[-k:]
    
    for i in range(max_len):
        next_pred = sample_next(context, T, k)
        sentence += next_pred
        context = sentence[-k:]
        
    return sentence
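
The sampling loop can be tried on a tiny hand-written table. The sketch below swaps `np.random.choice` for the standard library's `random.choices` so that it runs without NumPy, and it uses a seeded `random.Random` for repeatable output; the function names and the toy table are made up for illustration.

```python
import random


def sample_char(context, table, k, rng):
    """Draw the next character for the last k characters of `context`."""
    dist = table.get(context[-k:])
    if dist is None:
        return " "  # same fallback as above: an unseen context yields a space
    chars = list(dist)
    weights = list(dist.values())
    return rng.choices(chars, weights=weights, k=1)[0]


def generate(seed_text, table, k, max_len, rng):
    """Append max_len sampled characters to seed_text."""
    text = seed_text
    for _ in range(max_len):
        text += sample_char(text, table, k, rng)
    return text


# Toy probability table with k = 2; only the context "bc" has a real choice
toy_table = {
    "ab": {"c": 1.0},
    "bc": {"a": 0.5, "\n": 0.5},
    "c\n": {"a": 1.0},
    "\na": {"b": 1.0},
    "ca": {"b": 1.0},
}

rng = random.Random(0)
generated = generate("ab", toy_table, k=2, max_len=30, rng=rng)
print(generated)
```

Every run repeats `abc`, and the only thing the random draw decides is where a line break is inserted, which mirrors how the lyrics model places its line breaks.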

By checking the mean length of the cleaned Eminem lyrics, we can choose a maximum length for our new lyrics.

In [141]:
lens = [len(lyric) for lyric in cleaned_lyrics]
np.mean(lens)

3269.0454545454545

For the text generation, we keep *k* at 4. This means the starting text needs at least 4 characters so that the first context can be looked up in the table.

In [144]:
lyrics_predict = generateText("Look ", char_model, max_len=4000, k=4)

In [145]:
# the output_text/ directory must already exist
with open('output_text/' + uuid_song + ".txt", "w", encoding="utf-8") as file:
    file.write(lyrics_predict)
lyrics_predict

'Look back withousand that having this man, he\'s fuck it, wretch with you wanna be\n\'Cause I never to the fucking stop of the flip that\'s up! Lookies\n\n\nSix miles\nI ain\'t just me\nI mean is that the impact\nSee, twenty graduated\nAin\'t remember, he room, it anyone what I ain\'t walk more them toasteros\n\'Cause two judge-packed up piss, jiggie Small does in the lunch Dre about of\n\n\nAfterman, holocause I\'m never me road dange\nI\'m off your you gon\' his left, I\'m hatcheddar, keeps scalat for gut strap that I can five a do the start shoes, wish up in Yoo" by each one\nHere\'s a block with\nI just and\nBodies layin\' to get some and him\nThe right the drawn and to way the bust me, \'bout off, we wanna behind the fuck Lottom off the pink did it\n\'Cause I see hot\nYou can\'t take a none\nVery to be brain park\nYo, that you know, I\'m just the haunt catchine the green with us, the shit out spontaneous boss wouldn\'t give a date linked it, man\'s syntax\nThat\'s wholes\nAnd tha

## Conclusion and Further work
Markov chains are a great, language-independent way to generate text. Two big advantages of this method are that it needs no large input data and that the computation time for the model is very short. I think it is quite amazing how fast a language model can learn about sentences and grammar. \
Looking at our results, the model was able to use line breaks, which structure the generated lyrics very well. Nevertheless, there is a lot of room for improvement. One can hardly find rhymes in the text, and the lyrics have no coherent context. \
One approach would be to change the hyperparameter *k* to get more related text lines. It is questionable, however, whether larger character groups would simply cause existing phrases from the songs to be reused. Another idea would be to invest more time in the preprocessing. One could analyze the rhymes of the existing songs and select only unison rhymes as input text. I think that double rhymes or cross rhymes confuse the model because the rhyming lines are too far apart. \
To sum up, I am happy with the result, and I think that with more preprocessing or hyperparameter tuning a language model could rap like Eminem, even if it takes a lot more than rhymes to create new songs.
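
One rough way to probe the concern about larger *k* is to measure how many contexts have exactly one possible successor: the higher that share, the more the model can only replay its input. The sketch below is an assumption about how such a check could look, not something that was run for this project; `deterministic_share` and the toy corpus are made up.

```python
def deterministic_share(text, k):
    """Share of k-character contexts that are always followed by the same character."""
    successors = {}
    for i in range(len(text) - k):
        successors.setdefault(text[i:i + k], set()).add(text[i + k])
    if not successors:
        return 0.0
    single = sum(1 for chars in successors.values() if len(chars) == 1)
    return single / len(successors)


# Made-up toy corpus; for the real check one would pass the joined lyrics instead
toy_corpus = "look back\nlook black\nlook at that\n"
shares = {k: deterministic_share(toy_corpus, k) for k in (1, 2, 4, 8)}
for k, share in shares.items():
    print(k, round(share, 2))
```

Running this on the real corpus for several values of *k* would give a first indication of where generation turns into copying.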

References:
- [soniajoseph/MarkovLyric](https://github.com/soniajoseph/MarkovLyric)
- [LyricsGenius documentation](https://lyricsgenius.readthedocs.io)
- [Genius API documentation](https://docs.genius.com/)
- [aryangulati/Character-Based-Language-Model](https://github.com/aryangulati/Character-Based-Language-Model)
- [An Introduction to Markov Chains](http://dx.doi.org/10.13140/2.1.1833.8248)