# Kanye Lyric Generator

In this notebook, I take a look at using `PyTorch` for text generation and apply it to the song lyrics of one of my favorite artists, Kanye West

*Note: Much of the foundation of this project was taken from [this](https://www.analyticsvidhya.com/blog/2020/08/build-a-natural-language-generation-nlg-system-using-pytorch/) data science article*

## Loading the Data

We will start by importing our needed libraries:
- `torch`: Core PyTorch library (tensors and automatic differentiation)
- `torch.nn`: Neural network layers and loss functions
- `torch.nn.functional`: Functional forms of neural network operations, such as `softmax`
- `numpy`: Library for quick, vectorized operations
- `re`: Regex for text processing
- `string`: Library for text processing
- `random`: Randomized text generation
- `language_check`: Corrects the grammar of our generated sentences
- `better_profanity`: Censors swear words

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as Functional
import numpy as np
import re
import string
import random
import language_check as Language
from better_profanity import profanity as Profanity

Now we must open our data

The data we will be using is a collection of Kanye West verses (stored in a file called `kanye_verses.txt`), which can be found on Kaggle [here](https://www.kaggle.com/viccalexander/kanyewestverses)

We also do some preprocessing, replacing the double line breaks in the text with single line breaks

In [2]:
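# read the whole file, closing it automatically afterwards
with open("kanye_verses.txt", "r", encoding = "utf8") as file:
    text = file.read()

# collapse double line breaks into single line breaks
text = text.replace("\n\n", "\n")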

To make the text cleanup reusable, we will create a function called `clean_lyric()`, which does the following:
- Uses `Regex` to keep only lowercase letters, spaces, and apostrophes
- Removes the character `'` from the text

In [3]:
def clean_lyric(lyric):
    return re.sub("[^a-z' ]", "", lyric).replace("'", "")
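
As a quick sanity check, the sketch below runs `clean_lyric()` on two made-up lines (they are illustrations, not lines from the dataset)

It suggests that removed characters can leave double spaces behind, and that uppercase letters are stripped as well, which is why the text is lowercased before cleaning

```python
import re

def clean_lyric(lyric):
    # keep lowercase letters, apostrophes and spaces, then drop the apostrophes
    return re.sub("[^a-z' ]", "", lyric).replace("'", "")

# made-up example lines, not taken from the dataset
lowered = clean_lyric("don't stop, 2 more!")
not_lowered = clean_lyric("Hello World")

print(repr(lowered))      # 'dont stop  more' (note the double space)
print(repr(not_lowered))  # 'ello orld'
```

The double space is harmless later on, because the sequences are built with `split()`, which ignores repeated whitespace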

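Now we can create a list of unique lyrics by lowercasing the text, splitting it on the newline character, and calling `numpy.unique()`, which also sorts the lines; the `[1:]` slice then drops the first sorted entry, presumably the empty string left behind by blank lines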

Afterwards, we will call `clean_lyric()` on every element to create a list of `cleaned_lyrics`

In [4]:
lyrics = text.lower().split("\n")
lyrics = np.unique(lyrics)[1:].tolist()

cleaned_lyrics = [clean_lyric(lyric) for lyric in lyrics]
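
To see what `np.unique(...)[1:]` does, here is a small sketch on made-up lines, where the empty strings stand in for blank lines in the file

It assumes at least one blank line is present; without one, the slice would drop a real lyric instead

```python
import numpy as np

# made-up lines; the empty strings stand in for blank lines in the file
lines = ["b line", "", "a line", "b line", ""]

# np.unique removes duplicates and returns the values sorted
unique_sorted = np.unique(lines).tolist()

# the empty string sorts first, so [1:] removes it
without_first = np.unique(lines)[1:].tolist()

print(unique_sorted)  # ['', 'a line', 'b line']
print(without_first)  # ['a line', 'b line']
```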

## Data Processing

To make generative text, we need to define a sequence size which we can chunk our lyrics into

For this notebook, we will set `seq_size` equal to 5

In [5]:
seq_size = 5

In order to train our model, we need to break up our lyrics such that a group of words, our input, predicts another group of words, our output

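We'll create the function `create_sequences()` to help us accomplish this, which does the following:
- Returns the lyric on its own, as a one-element list, if it has `seq_len` words or fewer
- Otherwise, creates every possible sequence of `seq_len + 1` consecutive words, so that each sequence holds `seq_len` input words plus one extra word to predict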

In [6]:
def create_sequences(lyric, seq_len):
    # split the lyric into words once
    words = lyric.split()

    # return early if the lyric is not long enough
    if len(words) <= seq_len:
        return [lyric]

    # add every window of seq_len + 1 consecutive words
    sequences = []
    for itr in range(seq_len, len(words)):
        sequences.append(" ".join(words[itr - seq_len:itr + 1]))

    # return the sequences
    return sequences
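
A short usage sketch may help here, using a condensed restatement of the function with the same behaviour and a made-up lyric

With `seq_len` set to 2, each returned sequence contains three words

```python
def create_sequences(lyric, seq_len):
    # condensed restatement of the notebook's function
    words = lyric.split()
    if len(words) <= seq_len:
        return [lyric]
    return [" ".join(words[itr - seq_len:itr + 1])
            for itr in range(seq_len, len(words))]

long_result = create_sequences("one two three four", 2)
short_result = create_sequences("one two", 2)

print(long_result)   # ['one two three', 'two three four']
print(short_result)  # ['one two']
```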

We'll iterate through our `cleaned_lyrics` and create every possible sequence, storing it in `raw_sequences`

We will then make use of the [`numpy.unique()`](https://numpy.org/doc/stable/reference/generated/numpy.unique.html) function to get a unique set of sequences

In [7]:
# obtain every sequence
raw_sequences = [create_sequences(lyric, seq_size) for lyric in cleaned_lyrics]

# filter to get the unique sequences
sequences = np.unique(np.array(sum(raw_sequences, []))).tolist()
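
The `sum(raw_sequences, [])` call flattens the list of lists by repeated concatenation

As an alternative sketch (not what the notebook uses), `itertools.chain.from_iterable` gives the same flat list on the made-up data below, and `sorted(set(...))` gives the same unique, sorted result for these plain lowercase strings

```python
from itertools import chain

# made-up nested list standing in for raw_sequences
raw_sequences = [["a b c", "b c d"], ["a b c"], ["x y"]]

# the notebook's approach: repeated list concatenation
flat_sum = sum(raw_sequences, [])

# alternative: a single pass over the nested lists
flat_chain = list(chain.from_iterable(raw_sequences))

unique_sequences = sorted(set(flat_chain))

print(flat_sum == flat_chain)  # True
print(unique_sequences)        # ['a b c', 'b c d', 'x y']
```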

Computers can only process data as numbers, so we need to find a way to convert our words to numbers

To solve this, we will build a simple vocabulary and index each word by its position in the sorted array of unique words

We will store the two lookup dictionaries in `word_to_idx` and `idx_to_word`, as well as the total number of words in `vocab_size`

In [8]:
uniq_words = np.unique(np.array(" ".join(sequences).split(" ")))
uniq_words_idx = np.arange(uniq_words.size)

word_to_idx = dict(zip(uniq_words.tolist(), uniq_words_idx.tolist()))
idx_to_word = dict(zip(uniq_words_idx.tolist(), uniq_words.tolist()))

vocab_size = len(word_to_idx)
vocab_size

6141
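
The two dictionaries are inverses of each other, which the following pure-Python sketch illustrates on a made-up pair of sequences

It uses `sorted(set(...))` in place of `np.unique()`, which is an assumption made only to keep the example dependency-free

```python
# made-up sequences, not taken from the dataset
sequences = ["we go up", "up we go now"]

# sorted unique words, then a position index for each one
uniq_words = sorted(set(" ".join(sequences).split(" ")))
word_to_idx = {word: idx for idx, word in enumerate(uniq_words)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# encode a phrase to indices and decode it back
encoded = [word_to_idx[word] for word in "we go now".split()]
decoded = " ".join(idx_to_word[idx] for idx in encoded)

print(word_to_idx)  # {'go': 0, 'now': 1, 'up': 2, 'we': 3}
print(encoded)      # [3, 0, 1]
print(decoded)      # we go now
```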

Now we will create arrays of word sequences, called `x_word` and `y_word`

We can do this by iterating through every `sequence` and adding the ones that are the appropriate length

In [9]:
# initialize the empty lists
x_word = []
y_word = []

# iterate through every sequence
for seq in sequences:
    
    # skip the sequence if it isn't the expected length
    if (len(seq.split()) != seq_size + 1):
        continue
    
    # add the words to the sequences
    x_word.append(" ".join(seq.split()[:-1]))
    y_word.append(" ".join(seq.split()[1:]))
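
The input and the output are the same sequence shifted by one word, so at every position the target is simply the next word

A minimal sketch on a made-up six-word sequence:

```python
seq_size = 5
seq = "a b c d e f"  # made-up sequence of seq_size + 1 words

words = seq.split()
x_words = words[:-1]  # everything except the last word
y_words = words[1:]   # everything except the first word

# each input word is paired with the word that follows it
pairs = list(zip(x_words, y_words))
print(pairs)
# [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f')]
```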

As mentioned earlier, computers can only really process numerical data, so we'll create the function `get_seq_idx()` to convert our words

In [10]:
def get_seq_idx(seq):
    return [word_to_idx[word] for word in seq.split()]

Now, we can use this function to convert `x_word` and `y_word` into `x_idx` and `y_idx`

In [11]:
x_idx = np.array([get_seq_idx(word) for word in x_word])
y_idx = np.array([get_seq_idx(word) for word in y_word])

## Training the Model

First, we'll declare the following variables that we'll use to create our neural network:
- `num_hidden`: Number of hidden units in each LSTM layer
- `num_layers`: Number of stacked LSTM layers
- `embed_size`: Size of each word embedding vector
- `drop_prob`: Bernoulli probability for Dropout layer
- `lr`: Learning rate
- `num_epochs`: Number of epochs to train the network
- `batch_size`: The batch size to train with

In [12]:
num_hidden = 256
num_layers = 4
embed_size = 200
drop_prob = 0.3
lr = 0.001
num_epochs = 20
batch_size = 32

We'll need a Recurrent Neural Network to train this model, so the obvious choice is a [Long Short-Term Memory](https://en.wikipedia.org/wiki/Long_short-term_memory) model, or `LSTM`

To accomplish this, we'll create a module named `LyricLSTM` following the [standard PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.Module.html)

This module will have the following three methods:

**`__init__(self, num_hidden, num_layers, embed_size, drop_prob, lr)`**

This method initializes the variables for the `LSTM`

First, it stores the variables `drop_prob`, `num_layers`, `num_hidden`, and `lr`

Next, it stores an `Embedding` layer using `vocab_size` and `embed_size`

We then store the `LSTM` layer using the parameter variables, setting `batch_first` equal to `True`

Afterwards, we define a `Dropout` layer using our `drop_prob`

Finally, we store the fully-connected `Linear` layer using `num_hidden` and `vocab_size`

<hr />

**`forward(self, x, hidden)`**

This method forward-propagates the input, `x`, through the network

First, we embed the `x` input using the `Embedding` layer

We then use the `LSTM` layer to obtain the `lstm_output` and the updated `hidden` state

Afterwards, we pass `lstm_output` through the `Dropout` layer and reshape it so that each row holds one time step, giving the `dropout_out` tensor

Finally, we pass `dropout_out` through the fully-connected `Linear` layer, and return this value along with the `hidden` state

<hr />

**`init_hidden(self, batch_size)`**

This method initializes the `hidden` state of the model

We first create a `weight` tensor using the parameters of the model

We then create a new zero-filled `hidden` state, a tuple holding the hidden and cell tensors, with the same data type and device as `weight`, and return it

In [13]:
class LyricLSTM(nn.Module):
    
    ''' Initialize the network variables '''
    def __init__(self, num_hidden, num_layers, embed_size, drop_prob, lr):
        # call super() on the class
        super().__init__()
        
        # store the constructor variables
        self.drop_prob = drop_prob
        self.num_layers = num_layers
        self.num_hidden = num_hidden
        self.lr = lr
        
        # define the embedded layer
        self.embedded = nn.Embedding(vocab_size, embed_size)

        # define the LSTM
        self.lstm = nn.LSTM(embed_size, num_hidden, num_layers, dropout = drop_prob, batch_first = True)
        
        # define a dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        # define the fully-connected layer
        self.fc = nn.Linear(num_hidden, vocab_size)      
    
    ''' Forward propagate through the network '''
    def forward(self, x, hidden):
        
        # pass input through embedding layer
        embedded = self.embedded(x)     
        
        # Obtain the outputs and hidden layer from the LSTM layer
        lstm_output, hidden = self.lstm(embedded, hidden)
        
        # pass through a dropout layer and reshape
        dropout_out = self.dropout(lstm_output).reshape(-1, self.num_hidden) 

        # pass the dropout output through the fully-connected layer
        out = self.fc(dropout_out)

        # return the final output and the hidden state
        return out, hidden
    
    ''' Initialize the hidden state of the network '''
    def init_hidden(self, batch_size):
        
        # create a weight tensor from the parameters of the model
        weight = next(self.parameters()).data

        # initialize the hidden and cell states as zeros, matching the weight tensor's type and device
        hidden = (weight.new(self.num_layers, batch_size, self.num_hidden).zero_(),
                  weight.new(self.num_layers, batch_size, self.num_hidden).zero_())
        
        # return the hidden layer
        return hidden
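
To make the tensor shapes in `forward()` concrete, here is a minimal sketch that pushes a random batch through the same three layer types

The sizes are made up and much smaller than the notebook's, and the `Dropout` layer is left out because it does not change any shape

```python
import torch
import torch.nn as nn

# small made-up sizes so the example runs quickly
vocab_size, embed_size, num_hidden, num_layers = 50, 8, 16, 2
batch_size, seq_size = 4, 5

embedding = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, num_hidden, num_layers, batch_first=True)
fc = nn.Linear(num_hidden, vocab_size)

# random word indices and a zero (hidden, cell) state
x = torch.randint(0, vocab_size, (batch_size, seq_size))
hidden = (torch.zeros(num_layers, batch_size, num_hidden),
          torch.zeros(num_layers, batch_size, num_hidden))

embedded = embedding(x)                       # (4, 5, 8)
lstm_output, hidden = lstm(embedded, hidden)  # (4, 5, 16)
flat = lstm_output.reshape(-1, num_hidden)    # (20, 16)
out = fc(flat)                                # (20, 50)

print(embedded.shape, lstm_output.shape, flat.shape, out.shape)
```

The final output has one row per word position (`batch_size * seq_size` rows), which is why the targets also need to be flattened before they are compared with it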

Now we can initialize the variables we will use for training

First, we'll create a variable `model` using the `LyricLSTM()` constructor above

The optimizer we will select to update the weights after backpropagation is the [Adam Algorithm](https://arxiv.org/abs/1412.6980)

We will also need a loss function to measure how far the predictions are from the actual next words, so we will choose [Cross Entropy Loss](https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e)

In [14]:
# create the LSTM model
model = LyricLSTM(num_hidden, num_layers, embed_size, drop_prob, lr)

# selecting an optimizer
optimizer = torch.optim.Adam(model.parameters(), lr = lr)

# selecting a loss function
loss_func = nn.CrossEntropyLoss()

# switch to training mode (the returned model is displayed as an overview)
model.train()

LyricLSTM(
  (embedded): Embedding(6141, 200)
  (lstm): LSTM(200, 256, num_layers=4, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=6141, bias=True)
)
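
As a rough worked calculation, the number of trainable parameters can be estimated from the layer sizes printed above

This sketch assumes PyTorch's usual LSTM layout, where each layer has input weights, recurrent weights, and two bias vectors for each of its four gates; the helper `lstm_layer_params` is hypothetical

```python
vocab_size, embed_size, num_hidden, num_layers = 6141, 200, 256, 4

def lstm_layer_params(input_size, hidden_size):
    # four gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

# one embedding vector per word in the vocabulary
embedding_params = vocab_size * embed_size

# the first layer reads embeddings, the remaining layers read hidden states
lstm_params = (lstm_layer_params(embed_size, num_hidden)
               + (num_layers - 1) * lstm_layer_params(num_hidden, num_hidden))

# weights plus biases of the fully-connected layer
fc_params = num_hidden * vocab_size + vocab_size

total_params = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total_params)
# 1228200 2048000 1578237 4854437
```

The estimate can be compared against `sum(p.numel() for p in model.parameters())` on the real model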

Before we can begin training, we need a function to obtain the next batch for training

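The generator `get_next_batch()` allows us to do this by doing the following:
- Iterates from `batch_size` to the end of `x`, increasing by `batch_size`
- Slices the `batch_size` rows that end at the current position from `x` and `y` to obtain `batch_x` and `batch_y`
- Yields `batch_x` and `batch_y`

Because the loop stops before reaching the end of `x`, the final rows are never yielded: a trailing partial batch is skipped, and so is the last full batch when the row count is an exact multiple of `batch_size`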

In [15]:
def get_next_batch(x, y, batch_size):
    
    # iterate until the end of x
    for itr in range(batch_size, x.shape[0], batch_size):
        
        # obtain the indexed x and y values
        batch_x = x[itr - batch_size:itr, :]
        batch_y = y[itr - batch_size:itr, :]
        
        # yield these values
        yield batch_x, batch_y
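
The sketch below counts the batches produced for two made-up row counts, to make the stopping behaviour of the `range()` concrete

The helper `count_batches` is hypothetical and exists only for this illustration

```python
import numpy as np

def get_next_batch(x, y, batch_size):
    # same logic as the notebook's generator
    for itr in range(batch_size, x.shape[0], batch_size):
        yield x[itr - batch_size:itr, :], y[itr - batch_size:itr, :]

def count_batches(num_rows, batch_size):
    # hypothetical helper: build dummy data and count the yielded batches
    data = np.arange(num_rows * 2).reshape(num_rows, 2)
    return sum(1 for _ in get_next_batch(data, data, batch_size))

print(count_batches(10, 4))  # 2 (rows 8-9 are never yielded)
print(count_batches(8, 4))   # 1 (rows 4-7 are never yielded)
```

If the last full batch should be included, one possible change is to iterate over `range(batch_size, x.shape[0] + 1, batch_size)` instead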

Finally, we can train the model by doing the following each `epoch`:
- Initialize the `hidden_layer`
- Iterate through `x` and `y` in the next batch
- Convert the `inputs` and `act` arrays to tensors
- Detach the `hidden_layer` from its gradient history, keeping it as a tuple
- Reset the gradients accumulated in the model
- Use `forward()` to calculate the output of the model
- Use `loss_func()` to obtain the `loss` of the model
- Back-propagate the loss, clip the gradient norm, and update the weights

In [16]:
for epoch in range(num_epochs):

    # initialize hidden state
    hidden_layer = model.init_hidden(batch_size)
        
    for x, y in get_next_batch(x_idx, y_idx, batch_size):
            
        # convert numpy arrays to PyTorch tensors
        inputs = torch.from_numpy(x).type(torch.LongTensor)
        act = torch.from_numpy(y).type(torch.LongTensor)

        # detach the hidden state from its gradient history
        hidden_layer = tuple([layer.data for layer in hidden_layer])

        # reset the gradients accumulated in the model
        model.zero_grad()
            
        # get the output from the model; the returned state is not fed back,
        # so every batch starts from the zero state created above
        output, hidden = model(inputs, hidden_layer)
            
        # calculate the loss from this prediction
        loss = loss_func(output, act.view(-1))

        # back-propagate to update the model
        loss.backward()

        # prevent exploding gradient problem
        nn.utils.clip_grad_norm_(model.parameters(), 1)

        # update weights using the optimizer
        optimizer.step()           
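
The gradient clipping step rescales the gradients whenever their overall norm exceeds the limit (1 in the loop above)

The following is a simplified pure-Python sketch of that idea on a flat list of made-up gradient values; the helper `clip_by_norm` is hypothetical, and the real `clip_grad_norm_` works in place on the parameter tensors

```python
import math

def clip_by_norm(grads, max_norm):
    # total L2 norm over all gradient values
    total_norm = math.sqrt(sum(g * g for g in grads))
    # scale down only when the norm exceeds the limit
    # (the small epsilon guards against division by zero)
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_norm([3.0, 4.0], 1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(norm_before)           # 5.0
print(round(norm_after, 4))  # 1.0
```

Gradients whose norm is already below the limit pass through unchanged, so the direction of the update is always preserved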

## Generating Text

Now that our model is trained, we can start generating text!

First, we'll create a function called `predict()` which uses the network to make predictions by doing the following:
- Creates the tensor `inputs` from `x`
- Detaches the `hidden_layer` from its history, giving the tuple `hidden`
- Obtains the output of the `model` using `forward()`
- Calculates the probabilities using [`softmax`](https://en.wikipedia.org/wiki/Softmax_function)
- Uses `argsort()` to obtain the top 3 indexes
- Randomly chooses one of those indexes and returns the corresponding word

In [17]:
def predict(model, tkn, hidden_layer):
         
    # create torch inputs
    x = np.array([[word_to_idx[tkn]]])
    inputs = torch.from_numpy(x).type(torch.LongTensor)

    # detach hidden state from history
    hidden = tuple([layer.data for layer in hidden_layer])

    # get the output of the model
    out, hidden = model(inputs, hidden)

    # get the token probabilities and reshape
    prob = Functional.softmax(out, dim=1).data.numpy()
    prob = prob.reshape(prob.shape[1],)

    # get indices of top 3 values
    top_tokens = prob.argsort()[-3:][::-1]
    
    # randomly select one of the three indices
    selected_index = top_tokens[random.sample([0,1,2],1)[0]]

    # return word and the hidden state
    return idx_to_word[selected_index], hidden
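
The top-3 selection can be tried in isolation with `numpy`, using made-up scores in place of the model output

This is only a sketch of the sampling step, with the softmax written out by hand

```python
import random
import numpy as np

# made-up scores for a five-word vocabulary
logits = np.array([0.5, 2.0, -1.0, 1.5, 0.0])

# softmax: exponentiate (shifted for numerical stability) and normalise
exp = np.exp(logits - logits.max())
prob = exp / exp.sum()

# indices of the three most likely tokens, most likely first
top_tokens = prob.argsort()[-3:][::-1]

# pick one of the three uniformly at random
selected_index = int(random.choice(top_tokens))

print(top_tokens.tolist())  # [1, 3, 0]
print(selected_index)       # one of 1, 3 or 0
```

Note that the pick is uniform across the three candidates, so their relative probabilities no longer matter once they make the top 3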

Next, we need the function `generate()` to generate text, which does the following:

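- Creates an initial `hidden` state
- Feeds each word of the start text through the model to warm up the `hidden` state, keeping only the prediction that follows the last word
- Iterates once more to predict the subsequent `tokens`, each time from the most recent word
- Returns a joined string using the list of `tokens`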

In [18]:
def generate(model, num_words, start_text):
    
    # switch to evaluation mode, which disables dropout
    model.eval()
    
    # create the initial hidden layer of batch size 1
    hidden = model.init_hidden(1)
    
    # convert the starting text into tokens
    tokens = start_text.split()
    
    # feed each start token through the model; only the last prediction is kept
    for token in start_text.split():
        curr_token, hidden = predict(model, token, hidden)
    
    # add the first predicted token
    tokens.append(curr_token)
    
    # predict the subsequent tokens
    for token_num in range(num_words - 1):
        token, hidden = predict(model, tokens[-1], hidden)
        tokens.append(token)
        
    # return the formatted string
    return " ".join(tokens)
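
The two loops in `generate()` can be traced without a trained model by swapping in a toy predictor

In this sketch, `toy_predict` and `toy_generate` are hypothetical stand-ins: the predictor follows a fixed word table, and the "hidden state" is just a counter of how many predictions were made

```python
def toy_predict(token, hidden):
    # hypothetical stand-in for predict(): looks up a fixed next word
    # and counts calls in place of a real hidden state
    next_word = {"come": "to", "to": "the", "the": "show",
                 "show": "now", "now": "come"}
    return next_word[token], hidden + 1

def toy_generate(num_words, start_text):
    hidden = 0
    tokens = start_text.split()

    # warm-up: feed each start word, keep only the last prediction
    for token in start_text.split():
        curr_token, hidden = toy_predict(token, hidden)
    tokens.append(curr_token)

    # keep predicting from the most recent word
    for _ in range(num_words - 1):
        token, hidden = toy_predict(tokens[-1], hidden)
        tokens.append(token)

    return " ".join(tokens), hidden

text, calls = toy_generate(3, "come to")
print(text)   # come to the show now
print(calls)  # 4
```

The result contains the start words plus `num_words` generated words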

Some of the sentences may be grammatically incorrect, and may also contain profanity

To fix this, we'll use the [`language-check`](https://pypi.org/project/language-check/) and [`better-profanity`](https://pypi.org/project/better-profanity/) packages

First, we'll need to load in the tools needed to do this

In [19]:
# load the swear words to censor
Profanity.load_censor_words()

# create a tool for language checking
lang_tool = Language.LanguageTool('en-US')

Finally, we'll create one last method, `get_lyric()`, which combines everything we have done as follows:
- generates the text using `generate()`
- finds errors using `lang_tool`
- applies the suggested changes from `errors`, creating `corrected_text`
- censors any words if needed

In [20]:
def get_lyric(start_text, censor, num_words):
    
    # generate the text
    generated_text = generate(model, num_words, start_text.lower())
    
    # find all grammatical errors
    errors = lang_tool.check(generated_text)
    
    # create the corrected text
    corrected_text = Language.correct(generated_text, errors)
    
    # censors the word if necessary
    return Profanity.censor(corrected_text) if censor else corrected_text
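
Since `predict()` looks up every start word in `word_to_idx`, a start word that never appeared in the lyrics would raise a `KeyError`

One possible guard is sketched below; the helper `missing_words` and the tiny vocabulary are hypothetical and only illustrate the check

```python
def missing_words(start_text, word_to_idx):
    # hypothetical helper: list the start words that predict() could not look up
    return [word for word in start_text.lower().split()
            if word not in word_to_idx]

# tiny made-up vocabulary standing in for the real word_to_idx
toy_vocab = {"come": 0, "to": 1, "you": 2}

known = missing_words("Come to", toy_vocab)
unknown = missing_words("Come to Wakanda", toy_vocab)

print(known)    # []
print(unknown)  # ['wakanda']
```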

Now, we will test our program below!

In [26]:
get_lyric("Come to", False, 10)

"Come to say yo y'all ain't gotta take a career up then"

In [37]:
get_lyric("Kanye west", True, 10)

'Kanye west is a couple of shoes I picked your own star'

In [25]:
get_lyric("I hate you", False, 7)

'I hate you on your blood I was train to'