# Can a language model rap like Eminem?
## Lyrics Generator using Markov Chains

Natural language processing (NLP), and language models in particular, are all around us these days. It is remarkable how well they can generate new content, including art, without explicitly learning a language or relying on fixed rules. This is a hotly debated topic (see [Towards Data Science: AI Art Debate](https://towardsdatascience.com/the-ai-art-debate-excitement-fear-and-ethics-c04d30f338da)). Beyond the ethical discussion, language models open up a wide range of use cases. For example, combined with a data source containing lyrics, they could be used to generate songs. The aim of this project is to analyze the songs of one specific artist and to generate new lyrics that could be sold as new songs. Besides plain text generation, it is also interesting to see whether rhymes can be generated using a Markov chain.

## Authentication
To get correct lyrics in a consistent format, the [Genius API](https://docs.genius.com/) is used, which requires an OAuth process: a client has to be created in the Genius API in order to obtain an access token for authentication. With the Python package [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/), all endpoints concerning artists and songs can be called easily from within the code.

In [1]:
import lyricsgenius

In [2]:
ACCESS_TOKEN = ''
ARTIST = "Eminem"

In [3]:
geniusAPI = lyricsgenius.Genius(ACCESS_TOKEN)
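
The empty `ACCESS_TOKEN` above is a placeholder. One way to keep the real token out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `GENIUS_ACCESS_TOKEN` and a helper called `load_access_token`; both are illustrative choices and not part of the original project.

```python
import os


def load_access_token(var_name: str = "GENIUS_ACCESS_TOKEN") -> str:
    """Return the token stored in the environment variable `var_name`.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token
```

With this helper, `ACCESS_TOKEN = load_access_token()` could replace the hard-coded string before the `lyricsgenius.Genius(ACCESS_TOKEN)` call.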

## Collecting data

First, we check whether the artist is available in the API. To keep the server response time manageable, we first fetch the song list and then make an individual API call for each song's lyrics.

In [4]:
artist = geniusAPI.search_artist(ARTIST, max_songs=1)
artist

Searching for songs by Eminem...

Song 1: "Rap God"

Reached user-specified song limit (1).
Done. Found 1 songs.


Artist(id, songs, ...)

In [5]:
songList = geniusAPI.artist_songs(artist.id, per_page=50)

In [6]:
songs = []
for song in songList['songs']:
    if song['lyrics_state'] == 'complete': # only fetch songs whose lyrics are marked as complete
        songs.append(geniusAPI.lyrics(song['id']))

Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Shabaam-sahdeeq-5-star-generals-instrumental-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Eminem-8-mile-instrumental-lyrics
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Jay-z-8-miles-and-runnin-instrumental-lyrics


To add some randomness to the order of the songs, the list is shuffled.

In [7]:
import random

In [8]:
random.shuffle(songs)

## Preprocessing
In this section, all unnecessary information is removed to obtain the pure lyrics string. When working with language models, line breaks are usually removed. For rhymes, however, those breaks are important, which is why we leave them in the text.

In [9]:
import re

In [10]:
songs[0]

"8 Mile: B-Rabbit vs Supa Emcee Lyrics[Eminem (as B-Rabbit)]\n...\n\n[Supa Emcee]\nHey, yo, what up, trailer trash?\nYo, Future, how'd you get whitey to battle the Savior?\nThat's like Darth Vader battling Obey Taylor\nYou can't mess around with the horror\nI grab the mic and I'ma murder you, now you'll die tomorrow!\nYou can't kick with the lyrics I spit\nI blow your head off and leave you dying in your blood, you bitch!\nYou still a trick and a hoe and I'ma roll\nI fucked your mama; she's a prostitute, bro!\nYou thinking you white, but you ain't light\nI'ma murder you in the gang again, and blast you in the night!\nI'm still ill, I'm still for real with Ken Loraine\nI got a gun on your boy, I'ma blow out his brain!\nYou run with a bunch of wack faggots\nI'ma eat out your meat like I'm in your body: I'm a maggot!\nAnd yo, matter of fact, you gettin' played\nYou ain't nothing but a white boy dying of fucking AIDS!You might also like1Embed"

Looking at the first element of the lyrics list, we can see that there is a leading section containing title information. Important sections of the song, such as Intro and Chorus, are marked with brackets. In addition, the text always ends with a number followed by the word "Embed". This information is not useful for the language model, so I remove it in the preprocessing step using regular expressions.

In [11]:
def cleaning(lyrics) -> str:
    # title information in front of the lyrics
    lyrics = re.sub(r"^[^_]* Lyrics", "", lyrics)
    # song section markers such as [Intro] or [Chorus]
    lyrics = re.sub(r"(\[.*?\])", "", lyrics)
    # "Embed" information at the end
    return re.sub(r"((\d.?\dK?)?Embed)", "", lyrics)

In [12]:
cleaned_lyrics = [cleaning(lyric) for lyric in songs if isinstance(lyric, str)]

In [13]:
print("Available cleaned songs:", len(cleaned_lyrics))

Available cleaned songs: 44
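
The sample printed earlier ends with `You might also like1Embed`. The last pattern in `cleaning` needs at least two digits in front of `Embed`, so for that string only `Embed` is removed and `You might also like1` stays in the corpus. A stricter variant could look like the sketch below. The name `clean_lyrics_strict` and its patterns are assumptions derived from that single sample, not a verified rule for every Genius page.

```python
import re


def clean_lyrics_strict(lyrics: str) -> str:
    """Hypothetical stricter variant of `cleaning` (assumption, see text)."""
    # Drop the title header: everything up to the first "Lyrics" on the first line.
    lyrics = re.sub(r"\A[^\n]*?Lyrics", "", lyrics, count=1)
    # Drop section markers such as [Intro] or [Chorus].
    lyrics = re.sub(r"\[.*?\]", "", lyrics)
    # Drop the trailing residue, e.g. "You might also like1Embed" or "1.2KEmbed".
    return re.sub(r"(You might also like)?\d*(\.\d+)?K?Embed\s*\Z", "", lyrics)


sample = "Demo Song Lyrics[Intro]\nFirst line here\n\n[Chorus]\nSecond line there!You might also like1Embed"
print(repr(clean_lyrics_strict(sample)))
```

Unlike the original, this version only touches the first line for the title and only the very end of the text for the `Embed` residue, so the word "Lyrics" inside a song is left alone.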


## Model generation
After the data cleaning is completed, the input text is ready to be processed by the model.

In [14]:
import uuid

To be able to analyze the generated text afterwards, a unique ID is used to save the input and output text.

In [15]:
uuid_song = str(uuid.uuid4())
uuid_song

'96fcc2c6-1fc3-490b-9409-f578cef8da02'

By joining all song lyrics, we get the basic corpus for the Markov Chain. See the next section for a detailed description.

In [16]:
input_text = ' '.join(cleaned_lyrics)
# the context manager closes the file even if writing fails
with open('input_text/' + uuid_song + ".txt", "w") as file:
    file.write(input_text)
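
Opening a file in `"w"` mode raises a `FileNotFoundError` when the `input_text/` folder does not exist yet. A small helper based on `pathlib` can create the folder first. The name `save_text` is made up for this sketch and does not appear in the original notebook.

```python
from pathlib import Path


def save_text(directory: str, name: str, text: str) -> Path:
    """Write `text` to `<directory>/<name>.txt`, creating the directory if needed."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.txt"
    target.write_text(text, encoding="utf-8")
    return target
```

A call such as `save_text('input_text', uuid_song, input_text)` would then store the corpus under the same file name as above.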

### Markov Chains
Markov chains are mathematical models that use concepts from probability and matrix algebra, and they can be used to generate text. While training the Markov chain, a transition matrix is built that stores the probability of the next word or character given the preceding text. \
See: [An Introduction to Markov Chains](http://dx.doi.org/10.13140/2.1.1833.8248)

#### Word-based generation vs Character-based generation
There are two possible ways to create a Markov chain. In a word-based model, the probability of the next word is calculated, whereas in the character-based approach each character is weighted individually. \
One side effect of word-based generation is that the vocabulary only includes words that are already known. This can be tricky for lyrics generation, because a new word is sometimes invented, or a word from another language is used, for the sake of a rhyme. Even though I don't think that this model is able to create new words, the character-based approach is the better fit here.
Another reason why I chose a character-based model is the line breaks. Looking at a song text, one can see that the lines define the rhyme. This structure is easier to reproduce by generating characters.
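
For comparison, a word-based table could be built roughly as follows. This is only a sketch: the function name `word_transition_counts`, the choice of a single previous word as context, and the tokenisation that keeps each line break as its own token are all assumptions, not something the project uses.

```python
import re
from collections import Counter, defaultdict


def word_transition_counts(text: str) -> dict:
    """Count which token follows which, treating each line break as a token."""
    tokens = re.findall(r"\n|\S+", text)
    table = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        table[current][following] += 1
    return {word: dict(counts) for word, counts in table.items()}


counts = word_transition_counts("the cat\nthe dog")
print(counts)
# {'the': {'cat': 1, 'dog': 1}, 'cat': {'\n': 1}, '\n': {'the': 1}}
```

Every key and every follower is a word from the corpus, which illustrates the closed vocabulary mentioned above.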

In [17]:
def getTransitionTable(data, k = 4):
    # X is a context of k characters, Y is the character that follows it
    T = {} # maps each context X to a dictionary of follower counts
    
    for i in range(len(data) - k):
        X = data[i:i+k]
        Y = data[i+k]
        # create the entry for a context that has not been seen yet
        if T.get(X) is None:
            T[X] = {}
            T[X][Y] = 1
        else:
            if T[X].get(Y) is None: # first time Y follows X
                T[X][Y] = 1
            else:
                T[X][Y] +=1
    
    return T

The transition table gives an overview of every character sequence (context) in the corpus and of how often each character follows it in the input text. The variable *k* defines the number of characters taken into account when selecting the next character, so it determines the length of the keys in our transition table.
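
A toy example makes the structure easier to see. The sketch below rebuilds the same kind of table with `collections.Counter` for the word "banana" and *k* = 2. The function name `transition_counts` is chosen for this illustration only.

```python
from collections import Counter, defaultdict


def transition_counts(data: str, k: int) -> dict:
    """Map every k-character context to the counts of the character after it."""
    table = defaultdict(Counter)
    for i in range(len(data) - k):
        table[data[i:i + k]][data[i + k]] += 1
    return {context: dict(counts) for context, counts in table.items()}


toy_table = transition_counts("banana", k=2)
print(toy_table)
# {'ba': {'n': 1}, 'an': {'a': 2}, 'na': {'n': 1}}
```

The context `an` occurs twice and is followed by `a` both times, so its count is 2.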

In [18]:
lyrics_transition_table = getTransitionTable(input_text, k = 4)
# show first 10 items
list(lyrics_transition_table.items())[0:10]

[('\n...', {'\n': 1}),
 ('...\n', {'\n': 1, 'R': 1}),
 ('..\n\n', {'\n': 1}),
 ('.\n\n\n', {'H': 1}),
 ('\n\n\nH', {'e': 2, 'o': 2}),
 ('\n\nHe', {'y': 2}),
 ('\nHey', {',': 3, ' ': 4}),
 ('Hey,', {' ': 4}),
 ('ey, ', {'y': 1, 'w': 2, 't': 2, 'b': 1, "'": 1, 'T': 1, 'd': 4}),
 ('y, y', {'o': 7, 'e': 3})]

In [19]:
def convertFreqIntoProb(T):
    for kx in T.keys():
        s = float(sum(T[kx].values()))
        for k in T[kx].keys():
            T[kx][k] = T[kx][k]/s
            
    return T

To get a probability for each following character, the frequencies are normalised: every count is divided by the total count of its context.

In [20]:
char_model = convertFreqIntoProb(lyrics_transition_table)
list(char_model.items())[0:10]

[('\n...', {'\n': 1.0}),
 ('...\n', {'\n': 0.5, 'R': 0.5}),
 ('..\n\n', {'\n': 1.0}),
 ('.\n\n\n', {'H': 1.0}),
 ('\n\n\nH', {'e': 0.5, 'o': 0.5}),
 ('\n\nHe', {'y': 1.0}),
 ('\nHey', {',': 0.42857142857142855, ' ': 0.5714285714285714}),
 ('Hey,', {' ': 1.0}),
 ('ey, ',
  {'y': 0.08333333333333333,
   'w': 0.16666666666666666,
   't': 0.16666666666666666,
   'b': 0.08333333333333333,
   "'": 0.08333333333333333,
   'T': 0.08333333333333333,
   'd': 0.3333333333333333}),
 ('y, y', {'o': 0.7, 'e': 0.3})]
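
Note that `convertFreqIntoProb` overwrites the counts in place, so `lyrics_transition_table` and `char_model` refer to the same dictionary after the call. If the raw counts should stay available, a non-mutating variant is one option. The name `to_probabilities` is hypothetical.

```python
def to_probabilities(counts: dict) -> dict:
    """Return a new table with each context's counts divided by their sum."""
    probabilities = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        probabilities[context] = {ch: n / total for ch, n in followers.items()}
    return probabilities


raw = {"y, y": {"o": 7, "e": 3}}
print(to_probabilities(raw))
# {'y, y': {'o': 0.7, 'e': 0.3}}
```

The input dictionary `raw` still holds the integer counts afterwards.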

## Lyrics generation
By using this probability table, our new lyrics can be generated.

In [21]:
import numpy as np

In [22]:
def sample_next(context, T, k = 4):
    context = context[-k:] # the next character is predicted from the last k characters
    
    # unknown context: fall back to a space
    if T.get(context) is None:
        return ' '
    
    possible_chars = list(T[context].keys())
    possible_probabs = list(T[context].values())
    
    return np.random.choice(possible_chars, p=possible_probabs)

In [23]:
def generateText(starting_sent, T, k = 4, max_len = 100):
    sentence = starting_sent
    
    context = sentence[-k:]
    
    for i in range(max_len):
        next_pred = sample_next(context, T, k)
        sentence += next_pred
        context = sentence[-k:]
        
    return sentence
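
Because `np.random.choice` draws from NumPy's global random state, every run produces different lyrics. For experiments that should be repeatable, a seeded variant based on the standard library could look like this. `generate_text_seeded` and its `seed` parameter are assumptions for this sketch and not part of the notebook.

```python
import random


def generate_text_seeded(start: str, model: dict, k: int = 4,
                         max_len: int = 100, seed: int = 0) -> str:
    """Generate text like `generateText`, but reproducibly for a fixed seed."""
    rng = random.Random(seed)
    text = start
    for _ in range(max_len):
        followers = model.get(text[-k:])
        if followers is None:
            # Same fallback as `sample_next`: emit a space for unseen contexts.
            text += " "
            continue
        chars = list(followers)
        weights = list(followers.values())
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text


toy_model = {"ab": {"c": 1.0}, "bc": {"a": 1.0}, "ca": {"b": 1.0}}
print(generate_text_seeded("ab", toy_model, k=2, max_len=4))
# abcabc
```

The toy model has exactly one follower per context, so its output does not depend on the seed. With a real model, two calls with the same seed return the same text.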

By checking the mean length of the cleaned Eminem lyrics, we can choose a maximum length for our new lyrics.

In [24]:
lens = [len(lyric) for lyric in cleaned_lyrics]
np.mean(lens)

3269.0454545454545

For the text generation, we keep *k* at 4. This means that the starting text needs at least 4 characters, because its last 4 characters form the context for the first prediction.

In [29]:
lyrics_predict = generateText("Slim ", char_model, max_len=4000, k=4)

In [30]:
# the context manager closes the file even if writing fails
with open('output_text/' + uuid_song + ".txt", "w") as file:
    file.write(lyrics_predict)
print(lyrics_predict)

Slim Shady, 1999, that I got Jimmy main them like 2Pac write to yo know you win, toting my rapped up with a Labrador
It's whites off with a rap flow
Tell suckers, rank what
It prayin', good the bitch in to driven here weak, you can applaudient can get words of check it
Got againstin' smash it
Like a man, I was havior's alries and a ho
Now, so big signed, this what 45, full raw

Woo, okay. 313?
None of my lected at their find
Pull over yourse
In a cyclone
(I'm 'bout hole to a basemental
Sittin'
Put my eighborhood's
Just cracked him while I'm guaranoid
Quicks
Fuck is nothink the Jay Dee
I see that I'ma walk with? Your sque on, feel and the mask off that your housand
Bodies he?"
That shit quick with should've he's the Marley
Peace to foot a need it
Althought for Never life way I'ma be extremember
Dresden
When you hatever that shing a cyclone
All the seriod out "7 Mile I'm spittin', Shady
1999, will I need spittin' you're way to say it's clothermon, you stayed pistol have the Alabama and
I

## Conclusion and Further work
Markov chains are a great way to generate text independently of the language. Two big advantages of this method are that it does not need large amounts of input data and that the computation time for the model is very short. I think it is quite amazing how quickly a language model can pick up sentence structure and grammar. \
Looking at our results, the model was able to use line breaks, which structure the generated lyrics very well. Nevertheless, there is a lot of room for improvement. One can hardly find rhymes in the text, and the song text lacks a coherent context. \
One approach would be to change the hyperparameter *k* to get more closely related text lines. However, it is questionable whether larger character groups would simply lead to existing phrases from the songs being reused. Another idea would be to invest more time in the preprocessing. One could analyze the rhymes of the existing songs and select only unison rhymes as input text. I think that double rhymes or cross rhymes confuse the model because the connected text lines are too far apart. \
To sum up, I am happy with the result, and I think that with more preprocessing or hyperparameter tuning a language model could rap like Eminem, even though it takes a lot more than rhymes to create new songs.
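
The open question of whether a larger *k* merely replays existing phrases could be checked with a rough overlap measure. The sketch below computes the share of fixed-length windows of the generated text that occur verbatim in the corpus. The function `copied_fraction` and the default window length of 20 characters are arbitrary assumptions for illustration.

```python
def copied_fraction(generated: str, corpus: str, n: int = 20) -> float:
    """Share of length-n windows of `generated` that occur verbatim in `corpus`."""
    windows = [generated[i:i + n] for i in range(len(generated) - n + 1)]
    if not windows:
        return 0.0
    corpus_windows = {corpus[i:i + n] for i in range(len(corpus) - n + 1)}
    return sum(w in corpus_windows for w in windows) / len(windows)


print(copied_fraction("abcdexyz", "abcdefghij", n=3))
# 0.5
```

Applied as `copied_fraction(lyrics_predict, input_text)` for models trained with different values of *k*, a value close to 1.0 would suggest that the output is mostly copied from the original songs.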

Git Repo: [Lyrics Generator](https://github.com/lauragregorc/lyrics-generator) 

References: 
- [https://github.com/soniajoseph/MarkovLyric](https://github.com/soniajoseph/MarkovLyric) 
- [https://lyricsgenius.readthedocs.io](https://lyricsgenius.readthedocs.io) 
- [https://docs.genius.com/](https://docs.genius.com/) 
- [https://github.com/aryangulati/Character-Based-Language-Model](https://github.com/aryangulati/Character-Based-Language-Model) 
- [An Introduction to Markov Chains](http://dx.doi.org/10.13140/2.1.1833.8248)