# Markov Chains

In [1]:
import pandas as pd
import numpy as np

In [12]:
import os
os.chdir('/Users/martinkihn/Desktop/xmas_master_text')
os.getcwd()

'/Users/martinkihn/Desktop/xmas_master_text'

In [13]:
ls

api_xmas_file.csv  text.txt           text_newlines.txt  xmas_titles.csv
master_df.csv      text_as_list.csv   text_nopunc.txt


In [16]:
# read the whole file and split on whitespace; the context manager closes the file
with open("text_nopunc.txt", "r") as f:
    text_list = f.read().split()

In [17]:
text_list[:10]

['a',
 'skishop',
 'owner',
 'reluctantly',
 'moves',
 'himself',
 'his',
 'wife',
 'and',
 'his']

In [18]:
from collections import defaultdict
markov_graph = defaultdict(lambda: defaultdict(int))

In [19]:
tokenized_text = text_list

In [20]:
last_word = tokenized_text[0].lower()
for word in tokenized_text[1:]:
    word = word.lower()
    markov_graph[last_word][word] += 1
    last_word = word
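
The nested dictionary built above stores raw counts: `markov_graph[a][b]` is the number of times `b` followed `a`. Dividing each count by its row total gives an estimate of the transition probability P(b | a). A minimal sketch on a made-up sentence follows; `toy_tokens`, `counts` and `transition_probs` are illustrative names, not part of the notebook.

```python
from collections import defaultdict

# Toy corpus (made up for illustration)
toy_tokens = "a town at christmas a town in snow a home at christmas".split()

# Same counting scheme as above: counts[prev][next] += 1
counts = defaultdict(lambda: defaultdict(int))
for prev_word, next_word in zip(toy_tokens, toy_tokens[1:]):
    counts[prev_word][next_word] += 1

def transition_probs(graph, word):
    """Turn the successor counts of `word` into probabilities."""
    successors = graph.get(word, {})
    total = sum(successors.values())
    return {nxt: n / total for nxt, n in successors.items()}

print(transition_probs(counts, "a"))
print(transition_probs(counts, "town"))
```

Using `graph.get` rather than `graph[word]` avoids silently adding unseen words to a `defaultdict` when they are looked up.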

In [21]:
limit = 3
for first_word in ('christmas', 'home', 'town'):
    next_words = list(markov_graph[first_word].keys())[:limit]
    for next_word in next_words:
        print(first_word, next_word)

christmas carol
christmas or
christmas eve
home and
home to
home for
town of
town dubs
town grows


In [22]:
# Start at a random word, then repeatedly pick the next word with a
# weighted random choice (np.random.choice) among the words that followed it

def walk_graph(graph, distance=5, start_node=None):
    if distance <= 0:
        return []

    if start_node is None:
        start_node = np.random.choice(list(graph.keys()))

    # dead end: no word ever followed this one (e.g. the last word of the text)
    successors = graph.get(start_node)
    if not successors:
        return []

    # normalize the successor counts so they sum to 1
    choices = list(successors.keys())
    weights = np.array(list(successors.values()), dtype=np.float64)
    weights /= weights.sum()

    # pick the next word using the weighted distribution
    chosen_word = np.random.choice(choices, p=weights)

    return [chosen_word] + walk_graph(
        graph, distance=distance-1,
        start_node=chosen_word)

In [23]:
for i in range(10):
    print(' '.join(walk_graph(
        markov_graph, distance=12)), '\n')

dies jason dumps her life the intercontinental trip becomes complicated when her 

that makes her sister is marooned for danny wise lifestyle books is 

her two young woman hes built an unexpected christmastime courtship filled with 

chris it safe in the process captain grace everything goes missing in 

appears for trespassing dispirited he says his selfless gesture leads to her 

demands days of the most important in new york publishing company holiday 

of her faith in hand alec who has dwindled taking care unit 

donnelly comes across america to the holidays or will pose as christmas 

struggles to see or the season hoping for a very much in 

in her meeting with him find happiness in the snow globe and 
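
For long walks, recursion can run into Python's recursion limit, and a word with no recorded successors has nothing to sample from. The sketch below is one possible iterative variant of the same weighted walk, with an optional seeded generator so runs can be reproduced. `walk_graph_iter` and `toy_graph` are hypothetical names introduced here, and the toy counts are made up.

```python
import numpy as np

def walk_graph_iter(graph, distance=5, start_node=None, rng=None):
    """Iterative weighted random walk over a {word: {next_word: count}} graph."""
    if rng is None:
        rng = np.random.default_rng()
    if start_node is None:
        start_node = rng.choice(list(graph.keys()))
    current = str(start_node)
    walk = []
    for _ in range(distance):
        successors = graph.get(current)
        if not successors:          # dead end: no word ever followed this one
            break
        choices = list(successors.keys())
        weights = np.array(list(successors.values()), dtype=np.float64)
        weights /= weights.sum()    # counts -> probabilities
        current = str(rng.choice(choices, p=weights))
        walk.append(current)
    return walk

# Toy graph (made up): "end" has no successors
toy_graph = {
    "snow": {"falls": 3, "globe": 1},
    "falls": {"snow": 1, "end": 1},
    "globe": {"end": 2},
}
print(walk_graph_iter(toy_graph, distance=8, start_node="snow",
                      rng=np.random.default_rng(0)))
```

As in the recursive version, the start word itself is not part of the returned list.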



In [24]:
# another method: build the chain from (word, next word) pairs

def make_pairs(text_list):
    for i in range(len(text_list)-1):
        yield (text_list[i], text_list[i+1])
        
pairs = make_pairs(text_list)

In [25]:
pairs

<generator object make_pairs at 0x7fe0d88774a0>

In [26]:
word_dict = {}

# map each word to the list of words that follow it (repeats are kept)
for word_1, word_2 in pairs:
    word_dict.setdefault(word_1, []).append(word_2)

In [27]:
first_word = np.random.choice(text_list)
first_word

'the'

In [28]:
chain = [first_word]
n_words = 30

# each next word is sampled uniformly from the list of words that followed the current word
for i in range(n_words):
    chain.append(np.random.choice(word_dict[chain[-1]]))

' '.join(chain)

'the house she meets her humanitarian spirit of the new york nicks restaurant to be the gallerys big christmas while in the other after the hospitals new lastminute european holiday tradition'

In [29]:
chain = ['career']
n_words = 100
for i in range(n_words):
    chain.append(np.random.choice(word_dict[chain[-1]]))

' '.join(chain)

'career to a foolproof strategy for its original grandeur only one week before when her world of her and coincidences and ryan mason cooka holiday complications arise that would help from her childhood friend and marry boyfriend but is matched with her hometown christmas when a woman she has lost when her high school they invite their siblings jim and read the car and part of classic holiday season when a lot she helps her sixyearold son he begins to the citys oldest son jordan jordan finds herself drawn to inherit the wedding of the disgraced former classmates in the mistletoe'
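
Because the successor lists keep repeated words, drawing uniformly from such a list is equivalent to drawing from the count-weighted distribution used in the first method. A small sketch on a made-up sentence (`toy_tokens`, `successor_lists` and `implied_probs` are illustrative names):

```python
from collections import Counter

# Toy corpus (made up for illustration)
toy_tokens = "home for christmas home for good home to stay".split()

# Same structure as word_dict: word -> list of following words, repeats kept
successor_lists = {}
for word_1, word_2 in zip(toy_tokens, toy_tokens[1:]):
    successor_lists.setdefault(word_1, []).append(word_2)

def implied_probs(successors):
    """Probability of each word under a uniform draw from the list."""
    counts = Counter(successors)
    return {w: n / len(successors) for w, n in counts.items()}

print(successor_lists["home"])
print(implied_probs(successor_lists["home"]))
```

The list representation trades memory for simplicity: a frequent word such as 'the' stores one list entry per occurrence, whereas the count dictionary stores one entry per distinct successor.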

# Transformers and NLG with GPT-2

In [31]:
# the train and test text files are in the current working directory

In [33]:
!ls

api_xmas_file.csv      text_as_list.csv       text_newlines.txt
master_df.csv          text_as_list_test.csv  text_nopunc.txt
text.txt               text_as_list_train.csv xmas_titles.csv


In [34]:
train_path = 'text_as_list_train.csv'
test_path = 'text_as_list_test.csv'

In [35]:
!pip install transformers



In [36]:
from transformers import (
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    GPT2LMHeadModel,
    TrainingArguments,
    Trainer,
    pipeline)

In [37]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [38]:
print('vocab size: %d, max seq len: %d' % (tokenizer.vocab_size, tokenizer.model_max_length))

vocab size: 50257, max seq len: 1024


In [39]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
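
Setting `mlm=False` selects causal (left-to-right) language modeling rather than masked language modeling: the labels are a copy of the input ids, and the model is scored on predicting each token from the tokens before it. The plain-Python sketch below only illustrates that pairing of contexts and targets. The token ids are made up, and the real shifting happens inside the model, not in code like this.

```python
# Hypothetical token ids for one short block
input_ids = [464, 9912, 1622, 318, 5901]

# With mlm=False the labels start out as a copy of the inputs
labels = list(input_ids)

# A one-position shift means the prefix ending at position i is scored on token i + 1
contexts = [input_ids[: i + 1] for i in range(len(input_ids) - 1)]
targets = labels[1:]

for context, target in zip(contexts, targets):
    print(context, "->", target)
```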

In [40]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=train_path,
    block_size=126)

test_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=test_path,
    block_size=126)
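
Conceptually, `block_size` controls how the tokenized file is cut into fixed-length training examples. The sketch below shows the idea on a toy list of ids. It is an illustration under the assumption that a trailing remainder shorter than one block is dropped, not the library's own code, and `chunk_into_blocks` is a hypothetical helper.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive full-size blocks."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

toy_ids = list(range(10))
blocks = chunk_into_blocks(toy_ids, block_size=4)
print(blocks)  # the last two ids do not fill a block and are left out
```

This is why a decoded example, such as `train_dataset[4]` in the next cell, can begin in the middle of one description and run into the next.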



In [41]:
print(tokenizer.decode(train_dataset[4]))

 and have not celebrated Christmas since. They have been so consumed by his death that they have forgotten the joys of the holidays, until a mysterious visitor enters their lives and rekindles the spirit of the season.
When a photojournalist photographs a mysterious stranger performing an act of bravery, the act quickly becomes headline news and the town dubs the stranger John Christmas. After seeing the photo, Kathleen McAllister becomes convinced that the mysterious stranger is in fact her long-lost brother Hank. With the towns help, Kathleen and Noah set about to find the strangers true identity with the help Max, a Christmas angel.



In [42]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [45]:
training_args = TrainingArguments(
    output_dir = 'data/out',
    overwrite_output_dir = True,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32,
    learning_rate = 5e-5,
    num_train_epochs = 10
)

trainer = Trainer(
    model = model,
    args = training_args,
    data_collator=data_collator,
    train_dataset = train_dataset,
    eval_dataset = test_dataset
)

W&B installed but not logged in. Run `wandb login` or set the WANDB_API_KEY env variable.


In [46]:
trainer.train()


TrainOutput(global_step=50, training_loss=3.0897314453125, metrics={'train_runtime': 2375.011, 'train_samples_per_second': 0.021, 'total_flos': 127003268044800, 'epoch': 10.0})
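
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation and a kept final partial batch, none of which the notebook states. Under those assumptions, 50 steps over 10 epochs means 5 batches per epoch, which bounds the number of training blocks. Exponentiating the mean training loss gives a rough per-token perplexity on the training data, which is not an evaluation score.

```python
import math

global_step = 50          # from TrainOutput above
num_train_epochs = 10
batch_size = 32           # per_device_train_batch_size, assuming one device
training_loss = 3.0897314453125

steps_per_epoch = global_step // num_train_epochs

# ceil(n / batch_size) == steps_per_epoch  =>  n lies in this range
min_blocks = (steps_per_epoch - 1) * batch_size + 1
max_blocks = steps_per_epoch * batch_size

print(steps_per_epoch, min_blocks, max_blocks)
print(round(math.exp(training_loss), 2))
```

A training set of at most 160 short blocks is small, so ten epochs can let the model memorize phrasing from the descriptions; comparing against loss on `test_dataset` would show whether that is happening.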

In [47]:
trainer.save_model()

In [48]:
generator = pipeline('text-generation', tokenizer='gpt2', model='data/out')

In [49]:
print(generator('As Christmas approaches', max_length=200)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


As Christmas approaches, two young men, Nick and Sam—all professionals together at the firm of Little John—get together for the first time since they lost their parents during the winter. Nick asks for cash to help settle the score, while Sam agrees to let him stay and help Nick work as Santa. Despite their troubles, Nick and Sam find a way to help each other and avoid more problems.
When two young actresses find themselves falling for each other, they need new romance to keep each busy. Two years after she took the helm of their Broadway production, Elle Murphy, a high school junior, finds herself playing the role of Victoria, who has her own separate romance with her boyfriend. But before long, Elle must decide if she'd rather be with Ben, who has his own problems, or with Elle. What will happen when the wedding turns into a Christmas romance?
Sophie Bell, a senior at the prestigious Victoria Public School, was diagnosed with multiple sclerosis,


In [50]:
print(generator('The holiday season', max_length=200)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The holiday season is filled with surprises for the people of Philadelphia and the town’s beloved Christmas spirit. As the town’s best-known ski lodge and snow lodge are divided between Santa and his family, a new opportunity presents itself when two new faces – Jenna and Alex – become the focus of a rivalry by creating a holiday spirit that will change every Christmas.
Jack Bauer: The Secret Service’s toughest sergeant has a plan during the holiday season to use the power of the Christmas app to infiltrate and save the town’s most important Christmas celebration. When the Secret Service’s tough sergeant uses his new initiative to protect Christmas, he has no idea who to trust.
New England’s oldest public school teacher is sent to the town of Ashfield to help with a teacher training program whose students are struggling with Christmas preparation and the pressure of the holiday season.
When a young boy discovers that love is real, his love for God quickly turns to Christmas


In [51]:
# there are several ways to decode text from a next-word model:
# beam search, random (temperature) sampling, top-k sampling and
# top-p (nucleus) sampling - below we compare their output

text_beam = generator('A chance meeting',
                      max_length = 50,
                      num_beams = 5)
print(text_beam[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A chance meeting with the man responsible for creating the world's most famous Christmas tree decorating contest.
A young woman finds herself falling in love with a handsome young man, who she believes to be her real father. As she tries to figure out
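
Beam search keeps the `num_beams` most probable partial sequences at every step instead of committing to the single best next word. The toy sketch below uses a made-up bigram table (`toy_model`) to show how that can find a more probable sequence than a greedy search. It leaves out details of real decoders such as end-of-sequence handling and length penalties.

```python
import math

# Made-up next-word probabilities (a toy bigram model)
toy_model = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"chance": 0.5, "town": 0.5},
    "the": {"town": 0.9, "chance": 0.1},
    "chance": {"meeting": 1.0},
    "town": {"square": 1.0},
}

def beam_search(model, start, steps, num_beams):
    """Keep the `num_beams` highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]                      # (tokens, log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            successors = model.get(tokens[-1], {})
            if not successors:                    # nothing to extend: carry over
                candidates.append((tokens, score))
                continue
            for word, prob in successors.items():
                candidates.append((tokens + [word], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search(toy_model, "<s>", steps=2, num_beams=1)[0]
best = beam_search(toy_model, "<s>", steps=2, num_beams=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(best[0], round(math.exp(best[1]), 2))
```

With one beam the search commits to 'a' (0.6) and ends with a sequence of probability 0.3. With two beams it also keeps 'the' and finds 'the town' at 0.36.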


In [52]:
text_random_sampling = generator('A chance meeting',
                                 max_length=50,
                                 top_k=0,
                                 do_sample=True,
                                 temperature=0.5)
print(text_random_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A chance meeting with the owner of the building, Charles, and their two sons, David and Chris, will bring a renewed sense of community to the small town.
As she prepares to leave her post as CEO of a small-town bakery,
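
With `do_sample=True` the next token is drawn at random from the model's distribution, and `temperature` rescales the scores before they are turned into probabilities. Values below 1 sharpen the distribution toward the most likely tokens. A minimal NumPy sketch of that rescaling, using made-up scores (`toy_logits`) for four candidate tokens:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()              # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

toy_logits = [2.0, 1.0, 0.5, 0.0]       # made-up scores for four candidate tokens

for t in (1.0, 0.5):
    print(t, np.round(softmax_with_temperature(toy_logits, t), 3))
```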


In [53]:
text_k_sampling = generator('A chance meeting',
                            max_length=50,
                            top_k=40,
                            do_sample=True)
print(text_k_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A chance meeting with Tom’s former high school coach, Tom Hartman, to talk about family and Christmas would seem to make a lasting difference to the young man and his family’s Christmas traditions. However, a mysterious and dangerous man
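
Top-k sampling restricts each draw to the k most likely next tokens and renormalizes their probabilities, which cuts off the long tail of unlikely words. A minimal sketch on a made-up five-token distribution (`top_k_filter` and `toy_probs` are illustrative names, and k is assumed to be at least 1):

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything except the k most likely tokens, then renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    keep = np.argsort(probs)[-k:]       # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up next-token distribution
filtered = top_k_filter(toy_probs, k=2)
print(filtered)

# Draw one token index from the filtered distribution
rng = np.random.default_rng(0)
print(rng.choice(len(toy_probs), p=filtered))
```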


In [54]:
text_p_sampling = generator('A chance meeting',
                            max_length=50,
                            top_k=0,
                            top_p=0.92,
                            do_sample=True)
print(text_p_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A chance meeting with the woman doesn't get her on the subject long, until Holly, a Seattle City Councilman with close ties to Major League Baseball, learns that Celine is the real deal. While Holly initially believes she's stepping down from her
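
Top-p (nucleus) sampling keeps the smallest set of most likely tokens whose combined probability reaches `top_p`, so the number of candidates grows or shrinks with how confident the model is. A minimal sketch on a made-up distribution follows. `top_p_filter` is a hypothetical helper illustrating the idea, not the library's implementation.

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]                 # most likely first
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed for the cumulative mass to reach top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]             # made-up next-token distribution
print(top_p_filter(toy_probs, top_p=0.92))
```

With `top_p=0.92` the first four tokens (cumulative mass 0.95) are kept and the least likely one is dropped. Passing `top_k=0`, as in the cell above, turns off top-k filtering so only the nucleus cut applies.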
