# Creating a Bigram Language Model for Kanye West Lyrics Generation

This project aims to create a simple bigram language model that can generate Kanye West lyrics. The model will be trained on a dataset of Kanye West song lyrics and will learn the patterns and structures of his lyrics to generate new lyrics that resemble his style.

### Dataset
The dataset used for training the language model is a collection of Kanye West song lyrics. The lyrics are stored in a text file named "kanye-west.txt", which was obtained from HuggingFace. The file is read and processed to extract the information needed to train the language model.

### Bigram Language Model
A bigram language model is a statistical language model that predicts the next token in a sequence based only on the previous token. A token can be a word or, as in the code in this notebook, a single character. In this project, we use a bigram language model to generate Kanye West lyrics: the model learns the probability of each token given the one before it and samples from those probabilities to produce new text.
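
Written out, the bigram assumption and the usual count-based estimate look like this. The notation is introduced here for illustration: $w_t$ is the token at position $t$, $C(w_{t-1}, w_t)$ is the number of times the pair occurs in the training text, and $C(w_{t-1})$ is the number of times $w_{t-1}$ occurs as the first element of a pair.

$$
P(w_1, \dots, w_T) \approx P(w_1) \prod_{t=2}^{T} P(w_t \mid w_{t-1}),
\qquad
\hat{P}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t)}{C(w_{t-1})}
$$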

### Implementation
The project is implemented in Python with the PyTorch library. A class called "BigramLanguageModel" computes the next-token predictions and the training loss (`forward`) and samples new lyrics (`generate`). Reading the dataset, encoding the text, and running the training loop are handled in separate notebook cells.

### Training
During the training phase, the language model learns the probability of each token given the one before it. A classical bigram model does this by counting the occurrences of token pairs (bigrams) and normalizing the counts into probabilities. The notebook below instead stores the model as learned embedding and linear-layer weights and fits them by minimizing the cross-entropy loss with the AdamW optimizer.
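
The counting approach can be sketched in a few lines. This is a minimal illustration on a made-up toy string, not the notebook's training code. The names `toy_text`, `stoi`, `counts`, and `probs` are assumptions for the example, and add-one smoothing is one common choice among several.

```python
import torch

# Toy corpus standing in for the lyrics file (hypothetical example text).
toy_text = "la la land\nla la la\n"

# Character vocabulary, mirroring the notebook's character-level setup.
toy_chars = sorted(set(toy_text))
stoi = {c: i for i, c in enumerate(toy_chars)}
V = len(toy_chars)

# counts[i, j] = number of times character j follows character i.
counts = torch.zeros((V, V), dtype=torch.float32)
for prev, nxt in zip(toy_text, toy_text[1:]):
    counts[stoi[prev], stoi[nxt]] += 1

# Add-one smoothing so unseen pairs keep a small non-zero probability,
# then normalise each row into a conditional distribution P(next | prev).
smoothed = counts + 1
probs = smoothed / smoothed.sum(dim=1, keepdim=True)

print(probs[stoi["l"]])  # distribution over the character that follows "l"
```

Each row of `probs` is the data structure the generation step reads from: one probability distribution per previous character.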

### Lyrics Generation
Once the language model is trained, it can be used to generate new lyrics. The generation process starts with an initial word, and the model predicts the next word based on the probabilities learned during training. This process is repeated to generate a sequence of words that form a new lyric.
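
The sampling loop can be sketched as follows. The three-symbol vocabulary and the hand-written `transition` table are hypothetical stand-ins for a trained model; only `torch.multinomial`, which the notebook's `generate` method also uses, is taken from the original code.

```python
import torch

torch.manual_seed(0)  # fixed seed so repeated runs give the same sample

# Hypothetical 3-symbol vocabulary and a hand-written transition table;
# row i holds P(next symbol | current symbol i).
symbols = ["a", "b", "\n"]
transition = torch.tensor([
    [0.1, 0.8, 0.1],
    [0.7, 0.1, 0.2],
    [0.5, 0.5, 0.0],
])

def sample_sequence(start: int, length: int) -> list[int]:
    """Repeatedly sample the next symbol from the row of the current one."""
    sequence = [start]
    for _ in range(length):
        row = transition[sequence[-1]]
        next_id = torch.multinomial(row, num_samples=1).item()
        sequence.append(next_id)
    return sequence

ids = sample_sequence(start=0, length=20)
print("".join(symbols[i] for i in ids))
```

Because each step looks only at the last symbol, the output has no memory beyond one token, which is why bigram samples tend to look locally plausible but globally incoherent.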

### Evaluation
The generated lyrics can be evaluated based on their coherence, relevance to Kanye West's style, and overall quality. Evaluation metrics such as perplexity and human judgment can be used to assess the performance of the language model and improve its accuracy.


### Sources 
1. [Bigram Language Model](https://pastebin.com/vxGwbqiH) 
2. [The spelled out intro to language modelling: build makemore](https://youtu.be/PaCmpygFfXo?si=o3BA_orQwaLym6pk)
3. [Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=228s&ab_channel=AndrejKarpathy) 
4. [dataset](https://huggingface.co/datasets/huggingartists/kanye-west)

In [1]:
# reading the dataset
with open('kanye-west.txt', 'r', encoding='utf-8') as file: 
    text = file.read()

print(len(text))

4721711


In [2]:
print(text[:1000])

Well, it is a weepin and a moanin and a gnashin of teeth
It is a weepin and a mournin and a gnashin of teeth
It is a—when it comes to my sound which is the champion sound
Believe, believe
O-o-o-o-o-okay, Lamborghini Mercy
Your chick, she so thirsty
I-I-I-I-Im in that two-seat Lambo
With your girl, she tryna jerk me 
O-o-o-o-o-okay, Lamborghini Mercy
Your chick, she so thirsty
I-I-I-I-Im in that two-seat Lambo
With your girl, she tryna jerk me
O-o-o-o-o-okay, Lamborghini Mercy 
Your chick, she so thirsty 
I-I-I-I-Im in that two-seat Lambo
With your girl, she tryna jerk me 
O-o-o-o-o-okay, Lamborghini Mercy
Your chick, she so thirsty 
I-I-I-I-Im in that two-seat Lambo 
With your girl, she tryna jerk me
Okay, drop it to the floor, make that ass shake 
Woah, make the ground move: thats an ass quake
Built a house up on that ass: thats an ass-state
Roll–roll–roll my weed on it: thats an ass tray
Say, Ye, say, Ye, dont we do this every day–day? 
I work them long nights, long nights to get a 


In [3]:
# calculate the vocabulary and its size 
chars = sorted(list(set(text)))
vocab_size = len(chars)

print(chars)
print('Vocabulary size:', vocab_size)

['\t', '\n', ' ', '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '}', '~', '\x8b', '\x9d', '\xa0', '¡', '´', '·', '½', 'Á', 'Ä', 'Å', 'Ç', 'É', 'Î', 'Ö', '×', 'Ø', 'Ü', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'í', 'î', 'ï', 'ñ', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'ü', 'ý', 'ā', 'ă', 'ą', 'ć', 'Č', 'č', 'ē', 'ę', 'ğ', 'İ', 'ı', 'Ł', 'ł', 'ń', 'Ő', 'ś', 'Ş', 'ş', 'š', 'ż', 'ž', 'Ș', 'ș', 'Ț', 'ț', 'Έ', 'Ό', 'Α', 'Β', 'Γ', 'Δ', 'Ε', 'Η', 'Κ', 'Λ', 'Μ', 'Ν', 'Ξ', 'Ο', 'Π', 'Σ', 'Τ', 'Φ', 'ά', 'έ', 'ή', 'ί', 'α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ', 'ι', 'κ', 'λ', 'μ', 'ν', 'ξ', 'ο

In [4]:
# create an encoding from characters to integers and vice versa
char_to_int = {c:i for i, c in enumerate(chars)}
int_to_char = {i:c for i, c in enumerate(chars)}

encode = lambda s: [char_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_char[c] for c in l])

In [5]:
# example
encoded = encode('harshit')
print(encoded)

decoded = decode(encoded)
print(decoded)

[71, 64, 81, 82, 71, 72, 83]
harshit


In [6]:
import torch
data = torch.tensor(encode(text), dtype=torch.int64)
print(data.shape, data.dtype)

torch.Size([4721711]) torch.int64


In [7]:
# train val split
val_size = int(0.1 * len(data))
train_data, val_data = data[:-val_size], data[-val_size:]

In [8]:
# hyperparameters
batch_size = 16
block_size = 32 
max_iters = 5000 
eval_interval = 100 
learning_rate = 1e-3 
eval_iters = 200
n_embd = 64 
n_head = 4
n_layer = 4
dropout = 0.0 

In [9]:
# create the dataset
def get_batch(split:str): 
    data = train_data if split == 'train' else val_data
    index = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in index])
    y = torch.stack([data[i+1:i+block_size+1] for i in index])

    return x, y 

In [10]:
x_batch, y_batch = get_batch('train')
print(x_batch.shape, y_batch.shape)

torch.Size([16, 32]) torch.Size([16, 32])
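
To see how the inputs and targets returned by `get_batch` line up, here is a small sketch with a fixed starting index instead of a random one. `toy_data`, `toy_block_size`, and `start` are hypothetical names chosen for the demonstration.

```python
import torch

toy_data = torch.arange(10)   # stand-in for the encoded corpus
toy_block_size = 4            # hypothetical context length for the demo

start = 3                     # one fixed starting index instead of a random one
x = toy_data[start:start + toy_block_size]
y = toy_data[start + 1:start + toy_block_size + 1]

# Each target is the token that follows the input at the same position.
for t in range(toy_block_size):
    print(f"input {x[t].item()} -> target {y[t].item()}")
```

The target tensor is the input tensor shifted left by one, so every position in a block contributes one next-token training example.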


In [11]:
import torch
import torch.nn as nn 
from torch.nn import functional as F

The code in the cell below attempts to implement a Transformer model.

In [12]:
class AttentionHead(nn.Module): 
    """Single attention head, allows the model to focus on different parts 
        of the input sequence while producing a single output for each part.

        Methods: 
            forward: performs the forward pass of the model, calculating the 
                     attention weights and the output of the head.

    """
    def __init__(self, head_size): 
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)  # value projection used in forward()
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout) 

    def forward(self,x) :
        b,t,c = x.shape
        k = self.key(x)
        q = self.query(x)

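        # scale by the head dimension (size of k's last axis), as in scaled dot-product attention
        weights = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)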
        weights = weights.masked_fill(self.tril[:t, :t] == 0, float('-inf'))
        weights = F.softmax(weights, dim=-1)
        weights = self.dropout(weights)

        value = self.value(x)
        out = weights @ value
        return out

class MultiAttentionHead(nn.Module): 
    """
    Multi-head attention layer. It runs several single attention heads in parallel,
    each of which can focus on different parts of the input sequence. The head
    outputs are concatenated along the feature dimension and projected back to
    n_embd, after which dropout is applied and the result is returned.
    """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([AttentionHead(head_size) for _ in range(num_heads)])
        self.projection = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x): 
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # concatenate along the feature axis
        out = self.dropout(self.projection(out)) 
        return out
    
class FeedForward(nn.Module): 
    """
    Implements a simple feed-forward neural network: a linear layer that expands
    n_embd to 4 * n_embd, a ReLU non-linearity, a linear layer that projects back
    to n_embd, and dropout. It is applied to each position independently after
    the attention layer.
    """
    def __init__(self, n_embd): 
        super().__init__()
        self.neural_net = nn.Sequential(
            nn.Linear(n_embd, n_embd*4),
            nn.ReLU(),
            nn.Linear(n_embd*4, n_embd), 
            nn.Dropout(dropout)
        ) 
    def forward(self, x): 
        return self.neural_net(x) 
    
class TransformerBlock(nn.Module): 
    """
    Combines the multi-head attention layer and the feed-forward neural network into a single
    transformer block. Layer normalization is applied to the input of each sub-layer (before
    the attention and before the feed-forward network), and a residual connection is added
    around each of the sub-layers.
    """
    def __init__(self, n_embd, n_head): 
        super().__init__()
        head_size = n_embd//n_head
        self.self_attention = MultiAttentionHead(n_head, head_size) 
        self.feed_forward = FeedForward(n_embd)
        self.layer_norm_1 = nn.LayerNorm(n_embd)
        self.layer_norm_2 = nn.LayerNorm(n_embd)
    def forward(self, x): 
        x = x + self.self_attention(self.layer_norm_1(x))
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x
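
The causal masking used in `AttentionHead.forward` can be checked in isolation. The sketch below uses all-zero placeholder scores and a hypothetical sequence length `T`, so the effect of the mask is easy to read off the result.

```python
import torch
from torch.nn import functional as F

T = 4                                   # hypothetical sequence length
scores = torch.zeros(T, T)              # placeholder attention scores
mask = torch.tril(torch.ones(T, T))     # 1 on and below the diagonal

# Positions above the diagonal (future tokens) are set to -inf, so softmax
# assigns them zero weight.
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)
print(weights)
```

With equal scores, row `t` spreads its weight evenly over positions `0..t` and gives nothing to later positions, which is what keeps a position from attending to the future.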
    

In [13]:
# bigram model
class BigramLanguageModel(nn.Module): 

    """
    Bigram language model: predicts the next token from the current token alone.
    The forward pass sums a token embedding and a position embedding, applies
    layer normalization, and maps the result to logits over the vocabulary with
    a final linear layer. The transformer blocks in self.blocks are constructed
    but never called in forward, so no information flows between positions and
    each prediction depends only on the token (and position) it is made from.
    """
    def __init__(self): 
        super(BigramLanguageModel, self).__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd) 
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[TransformerBlock(n_embd, n_head) for _ in range(n_layer)]) # transformer blocks
        self.layer_norm = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)    

    def forward(self, index, targets=None) : 
        b, t = index.shape
        token_embeddings = self.token_embedding_table(index) # of the shape (b, t, c) 
        position_embeddings = self.position_embedding_table(torch.arange(t))
        x = token_embeddings + position_embeddings
        x = self.layer_norm(x) 
        logits = self.lm_head(x) # of the shape b, t, vocab_size 

        if targets is None: 
            loss = None
        else: 
            b, t, c = logits.shape
            logits = logits.view(b*t, c)
            targets = targets.view(b*t)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, index, max_new_tokens): 
        for _ in range(max_new_tokens): 
            index_cropped = index[:, -block_size:] # crop index to the last block_size tokens
            logits, loss = self(index_cropped) # get the logits for the last block_size tokens
            logits = logits[:, -1] # remove everything but the last token
            probs = F.softmax(logits, dim=-1) # turn logits into probabilities
            index_next = torch.multinomial(probs, 1) # sample from the distribution
            index = torch.cat((index, index_next), dim=-1) # append the sampled token to the index
        return index

In [14]:
bigram_model = BigramLanguageModel()

In [15]:
@torch.no_grad()
def estimate_loss(): 
    out = {}
    bigram_model.eval() 
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters) 
        for k in range(eval_iters): 
            X, Y = get_batch(split)
            logits, loss = bigram_model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean() 
    bigram_model.train()
    return out

In [16]:
# create an optimiser for bigram model
optimizer = torch.optim.AdamW(bigram_model.parameters(), lr=learning_rate)
for i in range(max_iters): 
    if i % eval_interval == 0 or i == max_iters-1: 
        losses = estimate_loss()
        print(f'Iter {i}, train loss: {losses["train"]}, val loss: {losses["val"]}')

    x_batch, y_batch = get_batch('train')
    logits, loss = bigram_model(x_batch, y_batch)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step() 

Iter 0, train loss: 6.3781867027282715, val loss: 6.3736419677734375
Iter 100, train loss: 3.472564697265625, val loss: 3.5426058769226074
Iter 200, train loss: 2.851269006729126, val loss: 2.953464984893799
Iter 300, train loss: 2.700740337371826, val loss: 2.8123233318328857
Iter 400, train loss: 2.6430504322052, val loss: 2.764686346054077
Iter 500, train loss: 2.6097826957702637, val loss: 2.7358853816986084
Iter 600, train loss: 2.5852694511413574, val loss: 2.713991165161133
Iter 700, train loss: 2.5645785331726074, val loss: 2.68621826171875
Iter 800, train loss: 2.5536818504333496, val loss: 2.7037127017974854
Iter 900, train loss: 2.5426483154296875, val loss: 2.672560214996338
Iter 1000, train loss: 2.5405237674713135, val loss: 2.671848773956299
Iter 1100, train loss: 2.5383472442626953, val loss: 2.655033588409424
Iter 1200, train loss: 2.542438268661499, val loss: 2.6722660064697266
Iter 1300, train loss: 2.530505895614624, val loss: 2.678732395172119
Iter 1400, train loss

In [17]:
import pandas as pd

loss_df = pd.read_csv('bigram_loss.csv')

# find average train and val loss 
train_loss = loss_df['Train Loss'].mean()
val_loss = loss_df['Val Loss'].mean()

print(f'Average train loss: {train_loss}, average val loss: {val_loss}')

Average train loss: 2.6209058823529405, average val loss: 2.7638784313725493
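
These averages can be turned into perplexity, the metric mentioned in the Evaluation section, by exponentiating the cross-entropy. The sketch below assumes the logged values are mean cross-entropy in nats per character, which is what `F.cross_entropy` returns for this character-level model. The variable names are hypothetical.

```python
import math

# Average losses printed by the cell above (cross-entropy in nats per character).
avg_train_loss = 2.6209058823529405
avg_val_loss = 2.7638784313725493

# Perplexity is the exponential of the mean cross-entropy.
train_ppl = math.exp(avg_train_loss)
val_ppl = math.exp(avg_val_loss)
print(f"train perplexity: {train_ppl:.2f}, val perplexity: {val_ppl:.2f}")
```

A validation perplexity of roughly 16 means the model is, on average, about as uncertain as a uniform choice among 16 characters. Because the inputs are averages over the whole run, the numbers summarise training as a whole rather than the final model.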


In [18]:
import torch 
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(bigram_model.generate(context, max_new_tokens=2000)[0].tolist()))


	-uth mmp try, thores 
Yo knoplanonapatol, d s
Gyea mu ld whowhin garyous ah m wand byordonghobe calodowhatou I s m
The supll
Ond pad a nis t’theve find cld usitho.
P d donasex y de pu se be th indo, I aintoillicousok
Red H he nsisyyou m
I sheredyo spin I lyontige igour Monourthuru Hen Yo shed. he t
Agosene Kad sthe t adowe actly "Opaithe anghe,
Kapattespe pe
Then. dyomatatat na n o iدlohimur ngow yode
All t, fipeatou
I ct blikis byon, nd in?
Lrinite yowatonotols s
Yond thoughyoucain t yo, t
I mashigongs y5 ony t on alll
Ime 
Gu s y t I roryse ce om dein m
A, t thalisthe, m Fo ik 
Yorthzhes I way buthi lust wou ghat e g wop gqus, merus, n
An ayodss Imourin tamal Mitond
No me iplyo toco et ed me h, p of Dal 
Mrour, befthey goo l acas g I rsagiknturld con dord ghen ogowe whisthackmyou ig fioyomme—nd, d id
Tondin, ll hodeyonth selyy f mepeave lld we that g
Yoleus bonor y tighrt p chi-wibu
HE, s t I se gu ieyo aring adorole whthalher gep
Ce t’me mouggadid g her, byer s lo g
Scawan bnn t ok

In [23]:
# calculate the BLEU score

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

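def calculate_bleu(reference, candidate):
    # Both arguments are plain strings here, so NLTK treats every character as a
    # token and the result is a character-level BLEU score.
    return sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)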

reference = ' '.join(text.split()[:2000])
candidate = decode(bigram_model.generate(context, max_new_tokens=2000)[0].tolist())

bleu_score = calculate_bleu(reference, candidate)

print(f'BLEU score: {bleu_score}')

BLEU score: 0.007629888802746765


In [25]:
# calculate the ROUGE score

from rouge import Rouge

def calculate_rouge(reference, candidate):
    rouge = Rouge()
    scores = rouge.get_scores(candidate, reference)
    return scores[0]['rouge-l']['f']

rouge_score = calculate_rouge(reference, candidate)

print(f'ROUGE score: {rouge_score}')

ROUGE score: 0.042884985975856244


In [27]:
# calculate the word error rate

import jiwer

def calculate_wer(reference, candidate):
    return jiwer.wer(reference, candidate)

wer = calculate_wer(reference, candidate)
print(f'WER: {wer}')

WER: 0.989


In [None]:
!git commit -m "calculated the BLEU score, ROUGE score and the word error rate for the bigram model."
!git push

In [19]:
# save the model weights (state_dict); torch.save writes PyTorch's own format, not HDF5, despite the .h5 extension
torch.save(bigram_model.state_dict(), 'bigram_model.h5')
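
Loading the weights back follows the usual `state_dict` round trip. The sketch below uses a tiny `nn.Linear` as a stand-in so it runs on its own, and the file name `demo_model.pt` is hypothetical. In the notebook, the stand-in would be a fresh `BigramLanguageModel()` and the path would be the file saved above.

```python
import os
import tempfile

import torch
import torch.nn as nn

# A tiny stand-in module; in the notebook this would be BigramLanguageModel().
original = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pt")  # hypothetical file name
    torch.save(original.state_dict(), path)

    # Rebuild the same architecture, then copy the saved weights into it.
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path))
    restored.eval()  # switch to inference mode before generating

print(torch.equal(original.weight, restored.weight))
```

Only the parameters are stored, so the class definition and the hyperparameters that shape it (such as `n_embd` and `block_size`) must match when the model is rebuilt.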