# Trigram Lyrics generation

## Sample sentence: 
Carlos needs a haircut

## Bigrams: 

- (Carlos, needs)
- (needs, a)
- (a, haircut)

## Trigrams

- (Carlos, needs, a)
- (needs, a, haircut)
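
The two lists above can be reproduced with a few lines of plain Python. This is only a minimal sketch: `ngrams_of` is a hypothetical helper name (not part of NLTK) and it does no padding.

```python
def ngrams_of(tokens, n):
    """Return every run of n consecutive tokens as a tuple (no padding)."""
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = "Carlos needs a haircut".split()
sample_bigrams = ngrams_of(sentence, 2)
sample_trigrams = ngrams_of(sentence, 3)

print(sample_bigrams)
print(sample_trigrams)
```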

## Acknowledgments
Based on the Poetry-generator talk given by Carlos Castro, Software Engineer at Microsoft.

# Setup

In [80]:
# Imports

import nltk
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

In [81]:
# Download data

nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\lirosale\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lirosale\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Exploration

In [82]:
# Inspect first sentence

first_sentence = reuters.sents()[0]
print(first_sentence) # [u'ASIAN', u'EXPORTERS', u'FEAR', u'DAMAGE', u'FROM' ...

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']


In [83]:
# Show trigrams for first sentence

print(list(trigrams(first_sentence, pad_left=True, pad_right=True))) 

[(None, None, 'ASIAN'), (None, 'ASIAN', 'EXPORTERS'), ('ASIAN', 'EXPORTERS', 'FEAR'), ('EXPORTERS', 'FEAR', 'DAMAGE'), ('FEAR', 'DAMAGE', 'FROM'), ('DAMAGE', 'FROM', 'U'), ('FROM', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.-'), ('S', '.-', 'JAPAN'), ('.-', 'JAPAN', 'RIFT'), ('JAPAN', 'RIFT', 'Mounting'), ('RIFT', 'Mounting', 'trade'), ('Mounting', 'trade', 'friction'), ('trade', 'friction', 'between'), ('friction', 'between', 'the'), ('between', 'the', 'U'), ('the', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.'), ('S', '.', 'And'), ('.', 'And', 'Japan'), ('And', 'Japan', 'has'), ('Japan', 'has', 'raised'), ('has', 'raised', 'fears'), ('raised', 'fears', 'among'), ('fears', 'among', 'many'), ('among', 'many', 'of'), ('many', 'of', 'Asia'), ('of', 'Asia', "'"), ('Asia', "'", 's'), ("'", 's', 'exporting'), ('s', 'exporting', 'nations'), ('exporting', 'nations', 'that'), ('nations', 'that', 'the'), ('that', 'the', 'row'), ('the', 'row', 'could'), ('row', 'could', 'inflict'), ('could', 'inflict

In [84]:
# How many trigrams in n word sentence?

In [85]:
# Word count
print(len(first_sentence))

49


In [86]:
# Trigram count
print(len(list(trigrams(first_sentence, pad_left=True, pad_right=True))))

51
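
The two counts answer the question above: with padding on both sides, a sentence of n words yields n + 2 trigrams (here 49 words and 51 trigrams). Two `None` markers are added at each end, giving n + 4 items, and a window of three fits into that n + 2 times. A sketch in plain Python, where `padded_trigrams` is a hypothetical stand-in meant to mimic `trigrams(..., pad_left=True, pad_right=True)`:

```python
def padded_trigrams(tokens):
    """Pad with two None markers on each side, then slide a window of 3."""
    padded = [None, None] + list(tokens) + [None, None]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

words = "Carlos needs a haircut".split()
result = padded_trigrams(words)

print(len(words), len(result))   # word count, trigram count
print(result[0], result[-1])     # first and last padded trigram
```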


# Reuters trigram model

In [87]:
# Our model is a nested dictionary: (w1, w2) -> {w3: number of occurrences in the Reuters data}
# Unseen trigrams default to zero (this is why we use defaultdict and not dict)
model = defaultdict(lambda: defaultdict(lambda: 0))
 
# Iterate through all sentences in the dataset
for sentence in reuters.sents():
    # For each trigram in the sentence
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        # Increase occurrence count
        model[(w1, w2)][w3] += 1

In [88]:
# How many times does "economists" follow "what the"?
print(model["what", "the"]["economists"]) # "economists" follows "what the" 2 times

In [89]:
# How many times does "nonexistingword" follow "what the"?
print(model["what", "the"]["nonexistingword"]) # 0 times

In [90]:
# How many sentences start with "The"?
print(model[None, None]["The"]) # 8839 sentences start with "The"

## Intuition for probabilities in trigram model

Consider the sentences 

- _Carlos needs a haircut_
- _Carlos needs a new pair of shoes_

The trigrams are

- (Carlos, needs, a) (2)
- (needs, a, haircut) (1)
- (needs, a, new) (1)
- (a, new, pair) (1)
- (new, pair, of) (1)
- (pair, of, shoes) (1)

Our language model can now predict conditional probabilities for the next word given the two-word context:

Carlos, needs -> ?

- a: 1.0
- Carlos: 0.0
- needs: 0.0
- new: 0.0
- haircut: 0.0
- shoes: 0.0

needs, a -> ?

- a: 0.0
- Carlos: 0.0
- needs: 0.0
- new: 0.5
- haircut: 0.5
- shoes: 0.0

Mathematically, this is the conditional probability P(w[i] | w[i-2], w[i-1]), estimated as the count of the trigram (w[i-2], w[i-1], w[i]) divided by the count of all trigrams that start with (w[i-2], w[i-1]).
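
The probabilities above can be reproduced from the two example sentences. This is a minimal sketch without padding; the names `counts` and `probability` are illustrative only and do not appear elsewhere in this notebook.

```python
from collections import Counter, defaultdict

sentences = [
    "Carlos needs a haircut".split(),
    "Carlos needs a new pair of shoes".split(),
]

# counts[(w1, w2)][w3] = how often w3 follows the context (w1, w2)
counts = defaultdict(Counter)
for tokens in sentences:
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1

def probability(w1, w2, w3):
    """Estimate P(w3 | w1, w2) as count(w1, w2, w3) / count(w1, w2, anything)."""
    context = counts.get((w1, w2))
    if not context:
        return 0.0  # unseen context
    return context[w3] / sum(context.values())

print(probability("Carlos", "needs", "a"))
print(probability("needs", "a", "new"))
print(probability("needs", "a", "shoes"))
```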

## Reuters trigram text generation

Let's get a word prediction given 2 words from our trigram model


In [91]:
print(len(model))

398630


In [92]:
# Compute probabilities from counts
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    #print(w1_w2, total_count)
    for w3 in model[w1_w2]:
        # Divide each trigram count by the total count for its (w1, w2) context
        model[w1_w2][w3] /= total_count

In [93]:
# View language model output for "what the"
prediction = model["what", "the"]

prediction_sorted = sorted(prediction.items(), key=lambda kv: -kv[1])

for word, conditional_probability in prediction_sorted:
    print(word, conditional_probability)
    

In [94]:
import numpy as np
from itertools import compress

text = [None, None]
end_sentence = False

while not end_sentence:
    
    # Obtain predictions for next word based on the latest 2 words
    predictions = model[tuple(text[-2:])]
 
    prediction_items = predictions.items()
    probs = [p[1] for p in prediction_items]
    words = [p[0] for p in prediction_items]
    
    # Randomly select next word considering conditional probabilities
    s = np.random.multinomial(1, probs) # [0 0 0 1 0 0 0]
    candidate = list(compress(words, s))[0]
    text.append(candidate)
    
    if candidate is None:
        end_sentence = True

print(text)
    

[None, None, 'On', 'the', 'other', 'three', 'trading', 'banks', '.', None]
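
The sampling step draws one word at random, weighted by its conditional probability. The same idea can be sketched with the standard library's `random.choices` in place of `np.random.multinomial` plus `compress`. Note the assumptions: `toy_model` is made-up data standing in for the Reuters model, and `generate` is a hypothetical helper.

```python
import random

# Toy stand-in for the normalized model: context -> {next word: probability}
toy_model = {
    (None, None): {"Carlos": 1.0},
    (None, "Carlos"): {"needs": 1.0},
    ("Carlos", "needs"): {"a": 1.0},
    ("needs", "a"): {"haircut": 0.5, "new": 0.5},
    ("a", "haircut"): {None: 1.0},
    ("a", "new"): {"pair": 1.0},
    ("new", "pair"): {"of": 1.0},
    ("pair", "of"): {"shoes": 1.0},
    ("of", "shoes"): {None: 1.0},
}

def generate(model, rng):
    """Sample words until the end-of-sentence marker (None) is drawn."""
    text = [None, None]
    while True:
        predictions = model[tuple(text[-2:])]
        words = list(predictions.keys())
        weights = list(predictions.values())
        candidate = rng.choices(words, weights=weights, k=1)[0]
        if candidate is None:
            break
        text.append(candidate)
    return text[2:]  # drop the two start markers

generated = generate(toy_model, random.Random(0))
print(" ".join(generated))
```

Passing a seeded `random.Random` keeps a run repeatable, which helps when comparing generated text across changes to the model.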


# Lyrics

In [95]:
import os

In [96]:
# Load poetry
new_line_sentinel = " _NL_ "

# In poetry, as in music, new lines are very important.
# Workaround: replace each new line with a sentinel token that the tokenizer keeps
with open("DonOmar/Cuentale.txt", "r") as file:
    song = file.read().replace("\n", new_line_sentinel)

In [97]:
print(song)

 Eliel (Eliel!) _NL_ Tu sabes _NL_ Quien mas? (quien mas?) _NL_ Tu sabes! _NL_ Si es el nene de titi _NL_ El nene de titi _NL_ King of Kings _NL_ Tu sabes! _NL_ Pa la tigerada _NL_ Mira si ya to' el mundo lo sabe _NL_ Tu sabes! _NL_ La calle sabe! (la calle lo sabe) _NL_ Lo saben ya! _NL_ Tu sabes! _NL_  _NL_ Eliel (Eliel!) _NL_ Tu sabes _NL_ Quien mas? (quien mas?) _NL_ Tu sabes! _NL_ Si es el nene de titi _NL_ El nene de titi _NL_ King of Kings _NL_ Tu sabes! _NL_ Pa la tigerada _NL_ Mira si ya to' el mundo lo sabe _NL_ Tu sabes! _NL_ La calle sabe! (la calle lo sabe) _NL_ Lo saben ya! _NL_ Tu sabes! _NL_ Tu sabes _NL_ Quien mas? (quien mas?) _NL_ Tu sabes! _NL_ Si es el nene de titi _NL_ El nene de titi _NL_ King of Kings _NL_ Tu sabes! _NL_ Pa la tigerada _NL_ Mira si ya to' el mundo lo sabe _NL_ Tu sabes! _NL_ La calle sabe! (la calle lo sabe) _NL_ Lo saben ya! _NL_ Tu sabes!Anda vamos… _NL_ Cuentale de tu y yo _NL_ Y de una vez confiesale _NL_ Que te perdio _NL_ Anda vamos… _NL_ 
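
The sentinel trick works because the tokenizer treats `_NL_` as an ordinary word, so line breaks survive tokenization and can be restored after generation. A sketch of the round trip on two lines from the lyrics above, using `str.split` as a simplified stand-in for `nltk.word_tokenize` (an assumption made to keep the example dependency-free):

```python
new_line_sentinel = " _NL_ "

raw = "Tu sabes\nQuien mas?"

# Replace each line break with a token the tokenizer will keep
marked = raw.replace("\n", new_line_sentinel)
tokens = marked.split()

# Turn the sentinel back into a real line break after generation
restored = " ".join(tokens).replace(new_line_sentinel.strip(), "\n")

print(tokens)
print(restored)
```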

In [98]:
all_files = os.listdir("DonOmar/")

In [99]:
# Load lyrics
all_songs = []
for name_file in all_files:
    with open(os.path.join("DonOmar", name_file), "r") as file:
        all_songs.append(file.read().replace("\n", new_line_sentinel))

In [100]:
# How many songs?
print(len(all_songs))

In [101]:
# Tokenization
poetry_tokenized = []
for song in all_songs:
    poetry_tokenized += nltk.word_tokenize(song)

In [102]:
# First 100 tokens
print(poetry_tokenized[0:100])

['Natty', 'natasha', '_NL_', 'Ahora', 'entiendo', 'es', 'tan', 'difícil', 'de', 'aceptar', '_NL_', 'Que', 'tuvimos', 'tanto', 'para', 'dar', '_NL_', 'Y', 'con', 'el', 'tiempo', 'se', 'acabo', 'nuestro', 'camino', '_NL_', '_NL_', 'Natty', 'natasha', '_NL_', 'Ahora', 'entiendo', 'es', 'tan', 'difícil', 'de', 'aceptar', '_NL_', 'Que', 'tuvimos', 'tanto', 'para', 'dar', '_NL_', 'Y', 'con', 'el', 'tiempo', 'se', 'acabo', 'nuestro', 'camino', '_NL_', 'Ahora', 'entiendo', 'es', 'tan', 'difícil', 'de', 'aceptar', '_NL_', 'Que', 'tuvimos', 'tanto', 'para', 'dar', '_NL_', 'Y', 'con', 'el', 'tiempo', 'se', 'acabo', 'nuestro', 'caminoDon', 'omar', '_NL_', 'Que', 'paraíso', 'construimos', 'si', 'al', 'final', '_NL_', 'Termino', 'por', 'derribarnos', 'la', 'ansiedad', '_NL_', 'No', 'supimos', 'rescatarnos', 'y', 'cambio', 'nuestro', 'destino', '_NL_', 'Don', 'omar']


In [103]:
# Generate some poetry: build language model and generate!

# Our model is a nested dictionary: (w1, w2) -> {w3: number of occurrences in the lyrics}
# Unseen trigrams default to zero (this is why we use defaultdict and not dict)
poetry_model = defaultdict(lambda: defaultdict(lambda: 0))
 

# For each trigram in the token stream (all songs concatenated)
for w1, w2, w3 in trigrams(poetry_tokenized, pad_right=True, pad_left=True):
    # Increase occurrence count
    poetry_model[(w1, w2)][w3] += 1

In [104]:
# Compute probabilities from counts
for w1_w2 in poetry_model:
    total_count = float(sum(poetry_model[w1_w2].values()))
    for w3 in poetry_model[w1_w2]:
        # Divide each trigram count by the total count for its (w1, w2) context
        poetry_model[w1_w2][w3] /= total_count

In [105]:
# What are the common words that follow "el tiempo"?
poetry_model["el", "tiempo"]

In [107]:
new_line_sentinel = "_NL_"
poetry = ["tu", "sabe"]

for i in range(1, 50):

    # Obtain predictions for next word based on the latest 2 words
    predictions = poetry_model[tuple(poetry[-2:])]

    prediction_items = predictions.items()
    probs = [p[1] for p in prediction_items]
    words = [p[0] for p in prediction_items]
    
    # Randomly select next word considering conditional probabilities
    s = np.random.multinomial(1, probs)
    candidate = list(compress(words, s))[0]

    # Stop at the end-of-text padding marker; None cannot be joined into a string
    if candidate is None:
        break
    poetry.append(candidate)


print(" ".join(poetry).replace(new_line_sentinel, "\n"))


tu sabe 
 Don danni ... Danni fornari 
 Luny , la batidora 
 Batidora , La muerte llego pronto 
 wooo jaa 
 Oye , Don , Dale ! ) Aguanta tu donqueo 
 ( que te choque 
 No juegues con tu gorío , sandungueando , esto es un


## Want more tasks?
Want to try another text representation?
- Tf–idf
- What about taking stop words into account?
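
As a starting point for the first task, here is a rough sketch of tf-idf in plain Python. It uses one common variant (raw term frequency times log(N / document frequency)); scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ. The three short documents are made up, loosely based on the lyrics, and `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter

docs = [
    "tu sabes la calle sabe".split(),
    "el nene de titi".split(),
    "la calle lo sabe".split(),
]

def tf_idf(documents):
    """Return one {term: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

scores = tf_idf(docs)
for doc_scores in scores:
    # Show the three highest-scoring terms of each document
    print(sorted(doc_scores.items(), key=lambda kv: -kv[1])[:3])
```

Terms that appear in only one document (such as "nene") score higher than terms shared across documents (such as "sabe"), which is the effect tf-idf is designed to produce.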

Resources: https://scikit-learn.org/stable/modules/feature_extraction.html?highlight=nlp#text-feature-extraction