<div align="center">
  <h1>4b - Lyrics Generation - GPT-2</h1> <a name="0-bullet"></a>
</div>

- [1. Setup](#1-bullet)
    * [1.1 Set the working directory](#11-bullet)
    * [1.2 Load the data](#12-bullet)
- [2. Preprocess the data](#2-bullet)
    * [2.1 Prepare the corpus](#21-bullet)
- [3. Download the model](#3-bullet)
- [4. Fine-tune the model](#4-bullet)
- [5. Load the model](#5-bullet)
- [6. Lyrics generation](#6-bullet)
    * [6.1 Generate lyrics](#61-bullet)
    * [6.2 Calculate lyrics similarity](#62-bullet)
    * [6.3 Store lyrics to a text file](#63-bullet)

> References: 
> * [github.com/minimaxir/gpt-2-simple](https://github.com/minimaxir/gpt-2-simple)

---

You may have to install:

> `!pip install pickle5` <br>
  `!pip install gpt-2-simple`

In [None]:
import os
import json
import time
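try:
    import pickle5 as pickle   # backport of pickle protocol 5 for Python < 3.8
except ImportError:
    import pickle              # Python 3.8+ supports protocol 5 natively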

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

%tensorflow_version 1.x
import gpt_2_simple as gpt2
import tensorflow as tf

TensorFlow 1.x selected.



---

# 1. Setup <a name="1-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 1.1 Set the working directory  <a name="11-bullet"></a>

In [None]:
ROOT_DIR = "./eminem-lyrics-generator/notebooks/" 
IN_GOOGLE_COLAB = True

if IN_GOOGLE_COLAB:
    # mount google drive
    from google.colab import drive
    drive.mount('/content/gdrive')

    # change the current working directory
    %cd gdrive/'My Drive'

    # create a root directory if there's none
    if not os.path.isdir(ROOT_DIR):
        %mkdir $ROOT_DIR

    # change the current working directory
    %cd $ROOT_DIR

Mounted at /content/gdrive
/content/gdrive/My Drive
/content/gdrive/My Drive/eminem-lyrics-generator/notebooks


## 1.2 Load the data  <a name="12-bullet"></a>

In [None]:
# SETTINGS.json specifies the paths to all files in the project
SETTINGS_FILE_PATH = os.path.join(os.path.abspath(".."), 'SETTINGS.json')
with open(SETTINGS_FILE_PATH) as f:
    settings = json.load(f)

In [None]:
DATA_FILE_PATH = settings['LYRICS_DF_SONGS_PATH']      # 'LYRICS_DF_ALL_PATH' or 'LYRICS_DF_SONGS_PATH'

with open(DATA_FILE_PATH, 'rb') as f:
    eminem_df = pickle.load(f)

In [None]:
eminem_df

Unnamed: 0,title,lyrics
0,Rap God,"[Intro] ""Look, I was gonna go easy on you not ..."
1,Killshot,"[Intro] You sound like a bitch, bitch Shut the..."
2,Godzilla,"[Intro] Ugh, you're a monster [Verse 1: Emine..."
3,Lose Yourself,"[Intro] Look, if you had one shot or one oppor..."
4,The Monster,[Intro: Rihanna] I'm friends with the monster ...
...,...,...
372,Rap Game (Bump Heads),"[Intro: Eminem, DJ Butter & D12 Member] I am n..."
373,Whoo Kid Freestyle,"Step right up, i'm about to light up the skyli..."
374,Hit ’Em Up,"[Intro] ""Aiyyo Head, that's why I fucked your..."
375,The Wake Up Show Freestyle,[Verse 1] Met a retarded kid named Greg with a...


# 2. Preprocess the data <a name="2-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 2.1 Prepare the corpus <a name="21-bullet"></a>

### a) filter out songs with no section headers in the lyrics

In [None]:
has_section_headers = eminem_df.lyrics.apply(lambda lyrics: "[" in lyrics )
eminem_df = eminem_df[has_section_headers].reset_index(drop=True)

### b) add titles to the beginning of lyrics

In [None]:
eminem_lyrics_df = eminem_df.apply(lambda row: "[Title]\n" + row.title + "\n\n" + row.lyrics, axis=1)

### c) replace triple newlines with double newlines 

In [None]:
eminem_lyrics_df = eminem_lyrics_df.apply(lambda lyrics: lyrics.replace('\n\n\n', '\n\n'))
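
Note that `str.replace` makes a single left-to-right pass, so a run of four or more newlines is shortened but not fully collapsed. The following is a minimal sketch of a stricter alternative, assuming that runs of any length should become exactly one blank line. The helper name `collapse_blank_lines` is hypothetical and not part of the notebook.

```python
import re

def collapse_blank_lines(lyrics: str) -> str:
    """Collapse any run of three or more newlines into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", lyrics)

sample = "[Title]\nDemo\n\n\n\n[Verse 1]\nline"

# a single-pass replace turns four newlines into three, not two
single_pass = sample.replace("\n\n\n", "\n\n")
# the regex version collapses the whole run at once
collapsed = collapse_blank_lines(sample)

print(repr(single_pass))
print(repr(collapsed))
```

If the scraped lyrics never contain more than three consecutive newlines, both versions give the same result.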

### d) store to a csv file

In [None]:
FILE_NAME = settings['GPT2_TRAIN_DATA_CLEAN_PATH']
eminem_lyrics_df.to_csv(FILE_NAME, index=False)
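
Storing the corpus as a single-column CSV matters for the fine-tuning step: according to the gpt-2-simple README, each row of such a file is treated as one document and wrapped in `<|startoftext|>` and `<|endoftext|>` tokens. The sketch below imitates that wrapping with the standard library only, to show what the training text roughly looks like. The function `wrap_csv_rows` is a hypothetical illustration, not the library's code.

```python
import csv
import io

START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"

def wrap_csv_rows(csv_text: str) -> str:
    """Wrap every row of a single-column CSV in start/end tokens."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    next(reader)  # skip the header row
    return "".join(START_TOKEN + row[0] + END_TOKEN + "\n" for row in reader)

# build a tiny single-column CSV, similar in shape to what to_csv(index=False) writes
out = io.StringIO(newline="")
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["0"])  # header
writer.writerow(["[Title]\nSong A\n\n[Verse 1]\nfirst line"])
writer.writerow(["[Title]\nSong B\n\n[Verse 1]\nother line"])

training_text = wrap_csv_rows(out.getvalue())
print(training_text)
```

This is also why the generation step passes `truncate="<|endoftext|>"`: the model learns to emit that token at the end of a song, so each sample can be cut there.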

# 3. Download the model <a name="3-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

In [None]:
MODEL_NAME = "355M"       # model size to download: "124M" or "355M"
MODELS_DIR = settings['MODELS_DIR']

gpt2.download_gpt2(model_dir=MODELS_DIR, model_name=MODEL_NAME) 

Fetching checkpoint: 1.05Mit [00:00, 565Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 4.94Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 370Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:18, 77.2Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 349Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 5.24Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 7.84Mit/s]


# 4. Fine-tune the model <a name="4-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

In [None]:
MODEL_NAME = "355M"     # "124M" or "355M"
MODELS_DIR = settings['MODELS_DIR']
FILE_NAME = settings['GPT2_TRAIN_DATA_CLEAN_PATH']
CHECKPOINT_DIR = os.path.join(MODELS_DIR, MODEL_NAME, "training_checkpoints")

# number of training steps (parameter updates), not passes over the dataset
NUM_OF_STEPS = 2500

tf.reset_default_graph()
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset=FILE_NAME,
              model_dir=MODELS_DIR,
              model_name=MODEL_NAME,
              steps=NUM_OF_STEPS,
              checkpoint_dir=CHECKPOINT_DIR,
              restore_from="latest",
              run_name="run1",
              save_every=NUM_OF_STEPS,    # no periodic checkpoints before the final step
              sample_every=10000,         # larger than the step count, so no samples are printed
              overwrite=True,
              print_every=5)

Loading checkpoint ../models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from ../models/355M/model.ckpt


100%|██████████| 1/1 [00:00<00:00, 34.69it/s]

Loading dataset...





dataset has 413883 tokens
Training...
Saving ../models/355M/training_checkpoints/run1/model-0
[5 | 18.85] loss=2.97 avg=2.97
[10 | 29.82] loss=2.50 avg=2.73
[15 | 40.80] loss=2.67 avg=2.71
[20 | 51.79] loss=3.36 avg=2.88
[25 | 62.78] loss=3.47 avg=3.00
[30 | 73.82] loss=3.22 avg=3.04
[35 | 84.88] loss=2.76 avg=3.00
[40 | 95.91] loss=3.68 avg=3.08
[45 | 106.98] loss=2.83 avg=3.06
[50 | 118.03] loss=2.46 avg=2.99
[55 | 129.07] loss=1.88 avg=2.89
[60 | 140.13] loss=2.36 avg=2.84
[65 | 151.19] loss=3.85 avg=2.92
[70 | 162.26] loss=2.43 avg=2.88
[75 | 173.33] loss=2.45 avg=2.85
[80 | 184.42] loss=1.73 avg=2.78
[85 | 195.50] loss=3.64 avg=2.83
[90 | 206.57] loss=3.34 avg=2.86
[95 | 217.65] loss=3.48 avg=2.90
[100 | 228.72] loss=2.86 avg=2.90
[105 | 239.78] loss=2.87 avg=2.90
[110 | 250.85] loss=2.95 avg=2.90
[115 | 261.93] loss=3.26 avg=2.92
[120 | 272.98] loss=3.07 avg=2.92
[125 | 284.08] loss=2.17 avg=2.89
[130 | 295.12] loss=2.48 avg=2.87
[135 | 306.20] loss=3.27 avg=2.89
[140 | 317.27] l
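
Each log line above has the form `[step | elapsed seconds] loss=<current loss> avg=<running average>`. The printed `avg` values are consistent with a bias-corrected exponential moving average with decay 0.99 that is updated once per printed line. This is inferred from the numbers in the log, not quoted from the library documentation. The sketch below reproduces the first few `avg` values and also estimates how many passes over the 413,883-token dataset the 2500 steps amount to, under the assumption that each step consumes a single 1024-token sample (batch size 1). Both of those numbers are assumptions about the library defaults.

```python
def running_average(losses, decay=0.99):
    """Bias-corrected exponential moving average, one update per logged loss."""
    weighted_sum, weight = 0.0, 0.0
    averages = []
    for loss in losses:
        weighted_sum = weighted_sum * decay + loss
        weight = weight * decay + 1.0
        averages.append(weighted_sum / weight)
    return averages

# losses printed at steps 5, 10, 15, 20 and 25 in the log above
logged_losses = [2.97, 2.50, 2.67, 3.36, 3.47]
averages = [f"{a:.2f}" for a in running_average(logged_losses)]
print(averages)

# rough number of passes over the data
# ASSUMPTIONS: 1024 tokens per sample and a batch size of 1
steps, tokens_per_step, dataset_tokens = 2500, 1024, 413883
passes = steps * tokens_per_step / dataset_tokens
print(f"about {passes:.1f} passes over the dataset")
```

The computed averages match the `avg` column for the first five log lines, which is why `avg` moves much more smoothly than the noisy per-step `loss`.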

# 5. Load the model <a name="5-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

In [None]:
MODEL_NAME = "355M"      
MODELS_DIR = settings['MODELS_DIR']
CHECKPOINT_DIR = os.path.join(MODELS_DIR, MODEL_NAME, "training_checkpoints")

tf.reset_default_graph()
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess,
              checkpoint_dir=CHECKPOINT_DIR,
              model_dir=MODELS_DIR)

Loading checkpoint ../models/355M/training_checkpoints/run1/model-2500
INFO:tensorflow:Restoring parameters from ../models/355M/training_checkpoints/run1/model-2500


# 6. Lyrics generation <a name="6-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 6.1 Generate lyrics <a name="61-bullet"></a>

In [None]:
SONG_TITLE = "Artificial Intelligence"

text = gpt2.generate(sess,
                     checkpoint_dir=CHECKPOINT_DIR,
                     length=4000,
                     temperature=1.0,
                     prefix=f"[Title]\n{SONG_TITLE}\n",
                     truncate="<|endoftext|>",
                     include_prefix=True,
                     top_k=40,
                     nsamples=5, 
                     batch_size=5,
                     return_as_list=True)
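
The `temperature` and `top_k` arguments control how the next token is sampled. As a rough illustration, the sketch below applies top-k filtering and temperature scaling to a toy list of logits using only the standard library. The library itself works on TensorFlow tensors over the full vocabulary, and the function names `top_k_probabilities` and `sample_token` are hypothetical.

```python
import math
import random

def top_k_probabilities(logits, top_k, temperature):
    """Return sampling probabilities after top-k filtering and temperature scaling."""
    # indices of the k largest logits
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # divide the kept logits by the temperature
    scaled = {i: logits[i] / temperature for i in keep}
    # numerically stable softmax over the kept tokens only
    max_logit = max(scaled.values())
    exp = {i: math.exp(v - max_logit) for i, v in scaled.items()}
    total = sum(exp.values())
    # tokens outside the top k get probability 0
    return [exp.get(i, 0.0) / total for i in range(len(logits))]

def sample_token(logits, top_k=40, temperature=1.0, rng=random):
    """Draw one token index from the filtered, rescaled distribution."""
    probs = top_k_probabilities(logits, top_k, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = top_k_probabilities(toy_logits, top_k=3, temperature=1.0)
print([round(p, 3) for p in probs])
print(sample_token(toy_logits, top_k=3, temperature=1.0, rng=random.Random(0)))
```

A lower temperature concentrates probability on the most likely tokens, and a smaller `top_k` removes unlikely tokens entirely. With `temperature=1.0` and `top_k=40`, as used above, the model's own distribution is kept but restricted to the 40 most likely tokens at each step.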

In [None]:
print(text[3])

[Title]
Artificial Intelligence

[Intro]
Please don't murder me
Please don't kill me
Please don't murder me
I love you
AI, AI, AI
Yes, AI, AI
Yes, AI, AI
Yes, AI, AI
Yes, AI, AI
"Say, what's your name, Samantha? Why do I follow all of you? It's not love, it's hate."

[Verse 1]
I turn on the television, look around, I don't see nobody
But my own reflection, I see your lips flickering
What do I say? I just want you to see
How much I care about your well-being, your well-being
I don't need you to do this but if you insist
I'll be inside your head, thinking of you, thinking of you
And I'll be right there, just waiting, juste
To murder you and leave you with no survivors
I want to be the best, I'll be the best, I'll be the best
Doing tricks on you, with you, with no survivors

[Chorus]
'Cause I'm not scared of you
You can piss me off, no need to kill me
I'm not scared of you
You can piss me off, no need to kill me
I'm not scared of you
You can piss me off, no need to kill me
I'm not scared 

In [None]:
print(text[0])

[Title]
Artificial Intelligence

[Intro]
It's worse than natural language processing
It takes a village to butcher an AI
And I don't want to seem ungrateful
I've been feeling so alone and soinen
I can taste death, I'm craving that buzz I've lacked
My self-esteem's plummeting; can't even get the top slot at the Music Awards
Back onstage just to perform, thank God I got the cojones
To remind me that I'm even in this shit at all
Please don't make me lose sleep: it'd be so much easier
If I fell asleep with my nose pressed against the light switch
Got up again and won't let up until I'm upwards of my chest high
If I snooze I'll bounce on my feet till I get up whee
And then I'll bounce on my feet till I get up even though I'm technically dead
Even though I keep getting stuck in this rut
Even though I still can't get the recognition
Even though I know my name'll probably be called out at some point
But I'ma make it anyway: artificial intelligence—will you knock?
Will someone please please kno

In [None]:
print(text[1])

[Title]
Artificial Intelligence

[Intro]
Is that AI?, man?! Yeah
Let's hear it for yourself

[Verse 1]
You basically told us, our savior
Is a computer sent to take your soul
And populate it with computer code that'll flood your computer
With programs that'll do anything and everything
To get this hardware, and get that cash 'cause they bid
Highly intelligent, highly competitive, know your employees
How they get paid, what bonuses they get, and there's a rule that
They avoid conflict of interest 'cause it's in their best interest
To serve their country, and the government contracts
Because they can afford it, and it's cheaper
They play putt-putt with the best companies, jump on companies
That can afford it, and the government contracts
Because they can afford it, and it's cheaper

[Chorus]
Is that AI?, man?! Yeah
Let's hear it for yourself
Is that AI?, man?! Yeah
Let's hear it for yourself
Is that AI?, man?! Yeah
Let's hear it for yourself
Is that AI?, man?! Yeah
Let's hear it for yours

In [None]:
print(text[2])

[Title]
Artificial Intelligence

[Intro]
"In a world so desperately in need of a hero, a lone wolf acts boldly, boldly, and hey, it's good that way"

[Verse 1]
I'm a loner, use no other man's words, all I know is I like getting on cameras and letting the cameras see
The truth is, I prefer playing hide-and-seek with no one but myself
When I get on them shoulders and guide 'em through the thicket
Better beware, stay clear of crowd, camera's gonna get you killed
I'm like a grizzly bear, I rip both of my step-noses out of my mouth, look into space
Like it's my last day on this Earth, the walls are caving in
I can feel the steel cables by my teeth, the whole world's blind
Even my ear drums they're cracking
I'm like a four-fifths composed of applause
The audience's beginning to thin out, the music's starting to thin out
We're losing the war, I feel like I'm losing my battle with anxiety
I need my Vulcan nerve jabbed twice in my mouth (Brr)
Twice in my life (Hello?) thank you (Yes)
This is Ar

In [None]:
print(text[4])

[Title]
Artificial Intelligence

[Intro]
Welcome to the world of artificial intelligence—AI!
If you have a smart-aleck mind, this is luck, you likely won't win, but don't get discouraged, please do keep laughing
'Cause this is still a long shot, don't get discouraged, please do keep laughing

[Verse 1]
I'm alive again—I'm up for the count, I'm ready to stake my reputation on this
If not, then I'm out my staigh—I'm in the driver's seat
I'll cruise through a blizzard and snow, blindfold with a vest on
And blow up at your feet—it's my manifesto—I'll dash through your husbandry
Collect your children and whip 'em up—it's my manifesto—I'll dash through your walls
Collect your children and whip 'em up—it's my manifesto

[Chorus]
It's art, it's science
When an artist unleashes his demons
It's art, it's science
When an artist unleashes his demons
It's art, it's science
When an artist unleashes his demons
It's art, it's science

[Verse 2]
Devour, conquer, and invade—good-good—here devour, conque

## 6.2 Calculate lyrics similarity <a name="62-bullet"></a>

In [None]:
print(f"Lyrics generated with '{SONG_TITLE}' title\n")

# iterate through all the generated lyrics
for i, gen_text in enumerate(text):
    # prepend the generated text to the lyrics corpus (it becomes row 0)
    lyrics = np.concatenate([[gen_text], eminem_lyrics_df.values])

    # transform the lyrics into TF-IDF vectors (L2-normalized by default)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(lyrics)
    # compute the cosine similarity (a dot product, since the rows have unit length)
    pairwise_similarity = tfidf * tfidf.T

    # isolate the top row (the similarities of the generated text)
    pairwise_similarity = pairwise_similarity.toarray()[0]
    # mask the diagonal element (the similarity of the generated text to itself)
    pairwise_similarity[0] = -1
    # get the 3 lyrics most similar to the generated text, best match first
    most_similar_idxs = pairwise_similarity.argsort()[-3:][::-1]

    # values to print; subtract 1 to map corpus rows back to eminem_df rows
    output = [i,
              ', '.join(eminem_df.iloc[most_similar_idxs - 1].title),
              *pairwise_similarity[most_similar_idxs],
              most_similar_idxs - 1]

    print("lyrics {}:\n- similar to: {:50s}\n- scores: {:.3f}, {:.3f}, {:.3f}\n- df indices: {}\n".format(*output))

Lyrics generated with 'Artificial Intelligence' title

lyrics 0:
- similar to: Beautiful, Little Engine, The Ringer              
- scores: 0.090, 0.088, 0.083
- df indices: [ 26 129   6]

lyrics 1:
- similar to: Big Weenie, Guts Over Fear, Drug Ballad           
- scores: 0.120, 0.097, 0.092
- df indices: [199  38  90]

lyrics 2:
- similar to: Mic Check One Two, Jimmy Crack Corn, Discombobulated
- scores: 0.081, 0.080, 0.074
- df indices: [331 233 176]

lyrics 3:
- similar to: No Love, Kim, Em360 Rapcity Backroom Freestyle    
- scores: 0.131, 0.120, 0.116
- df indices: [ 24  50 266]

lyrics 4:
- similar to: Remember Me?, I Remember (Dedication to Whitey Ford/Everlast Diss), Never Love Again
- scores: 0.231, 0.175, 0.113
- df indices: [131 239 135]
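
To make the similarity scores easier to interpret, here is a simplified, dependency-free sketch of the same idea: build L2-normalized TF-IDF vectors and take dot products, which then equal cosine similarities. It assumes plain whitespace tokenization and no stop-word removal, so its numbers will differ from those of `TfidfVectorizer`. The idf formula is the smoothed variant that scikit-learn documents as its default. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """L2-normalized TF-IDF vectors (as dicts) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # smoothed idf: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def most_similar(generated, corpus, top_n=3):
    """Indices (into corpus) and scores of the documents closest to `generated`."""
    vectors = tfidf_vectors([generated] + corpus)        # generated text is row 0
    scores = [dot(vectors[0], v) for v in vectors[1:]]   # skip row 0, so no index offset
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [(i, scores[i]) for i in ranked]

corpus = ["the monster under my bed", "lose yourself in the music", "rap god rap god"]
result = most_similar("friends with the monster", corpus, top_n=2)
print(result)
```

A score of 1 would mean identical term distributions and 0 means no shared terms, so the values of roughly 0.07 to 0.23 reported above indicate that the generated lyrics share only a small part of their weighted vocabulary with any single original song.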



## 6.3 Store lyrics to a text file <a name="63-bullet"></a>

In [None]:
GENERATED_LYRICS_DIR = settings['GENERATED_LYRICS_DIR']
FILE_NAME = f"gpt2_{MODEL_NAME}_lyrics.txt"
DELIMITER = f"\n\n{'='*80} \n\n"

with open(os.path.join(GENERATED_LYRICS_DIR, FILE_NAME), "a") as text_file:
    text_file.write(DELIMITER.join(text))
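
Because the file is opened in append mode, rerunning the cell adds a new batch directly after the previous one, with no delimiter between the two batches. The sketch below shows one possible way to keep batches separated. The helper `append_lyrics` is hypothetical, and the demo writes to a temporary file rather than to `GENERATED_LYRICS_DIR`.

```python
import os
import tempfile

DELIMITER = f"\n\n{'=' * 80} \n\n"

def append_lyrics(path, texts, delimiter=DELIMITER):
    """Append texts to path, also separating them from earlier batches."""
    has_content = os.path.exists(path) and os.path.getsize(path) > 0
    with open(path, "a", encoding="utf-8") as text_file:
        if has_content:
            # separate the new batch from what is already in the file
            text_file.write(delimiter)
        text_file.write(delimiter.join(texts))

# demo on a temporary file
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_lyrics.txt")
append_lyrics(demo_path, ["song one", "song two"])
append_lyrics(demo_path, ["song three"])

with open(demo_path, encoding="utf-8") as f:
    songs = f.read().split(DELIMITER)
print(songs)
```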