<a href="https://colab.research.google.com/github/kwaldenphd/poemBot/blob/master/allison_parrish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup & Environment



## Install

In [None]:
!pip install pronouncing # https://pronouncing.readthedocs.io/en/latest/
!pip install markovify # https://pypi.org/project/markovify/
!pip install numpy # https://pypi.org/project/numpy/
!pip install scipy # https://pypi.org/project/scipy/

`Pincelate` (https://pincelate.readthedocs.io/en/latest/) has dependency issues with `tensorflow` & `keras`, so its install commands are left commented out:

In [None]:
# !pip install tensorflow==1.15.0 # https://pypi.org/project/tensorflow/
# !pip install keras==2.2.5 "h5py<3.0.0" # https://pypi.org/project/keras/
# !pip install pincelate # https://pypi.org/project/pincelate/

## Import

In [None]:
# import stuff: standard library
import codecs, gzip, json, random, re, sys, textwrap
from collections import Counter, defaultdict
# third-party
import markovify, numpy as np, pandas as pd
import pronouncing as pr # used in "Rhymes & Phones" below
# from pincelate import Pincelate

# All The Allison Parrish Things

## Overview

### Project Gutenberg
- [Gutenberg, dammit](https://github.com/aparrish/gutenberg-dammit/) (full corpus)
- [Gutenberg corpus](https://github.com/aparrish/gutenberg-poetry-corpus) (poetry corpus)
  - ["Quick Experiments" Jupyter Notebook](https://github.com/aparrish/gutenberg-poetry-corpus/blob/master/quick-experiments.ipynb)
  - ["Plot to Poem" 2017 NoPaGenMo Jupyter Notebook](https://github.com/aparrish/plot-to-poem/blob/master/plot-to-poem.ipynb)
- [Gutenberg Poetry Autocomplete](http://gutenberg-poetry.decontextualize.com/)

### `pronouncing`, *interface for CMU Pronouncing Dictionary*
- [documentation](https://pronouncing.readthedocs.io/en/latest/index.html)
- [Updated Jupyter Notebook cookbook](https://github.com/aparrish/nonsense-verse-pycon-2020/blob/master/pronouncing-tutorial.ipynb)

### `Pincelate`, *ML model for spelling & sounding out English words*
- [documentation](https://pincelate.readthedocs.io/en/latest/)
- [Jupyter Notebook tutorial and cookbook](https://github.com/aparrish/nonsense-verse-pycon-2020/blob/master/pincelate-tutorial-and-cookbook.ipynb)
- [PyCon 2020 workshop](https://github.com/aparrish/nonsense-verse-pycon-2020)
- [Bobey Dig 2019 NaNoGenMo notebook](https://github.com/aparrish/bobey-dig/blob/master/headcoldify.ipynb)
  * NOTE: the `TensorFlow` & `Keras` versions that `Pincelate` depends on require Python <3.8, so it cannot run on Google Colab and needs a local session

### Magic & divinatory language
- [Workshop materials](https://github.com/aparrish/comexmadivla)
- [Speculative magic words Jupyter Notebook (string manipulation, `Pincelate`)](https://github.com/aparrish/comexmadivla/blob/master/magic-words-workbook.ipynb)
- [Cartomancy and semantic space Jupyter Notebook](https://github.com/aparrish/comexmadivla/blob/master/cartomancy-semantic-space.ipynb)

### `Tracery` *Python tracery port*
- [documentation](https://github.com/aparrish/pytracery)
- [original Tracery, Kate Compton](http://tracery.io/)
- [Tracery tutorial](http://www.crystalcodepalace.com/traceryTut.html)

### "Reading and Writing Electronic Text" NYU ITP course, Spring 2023
- [Class notebooks and code, GitHub](https://github.com/aparrish/rwet)
- [Class website](https://rwet.decontextualize.com/)
- Jupyter Notebooks
  - "[Natural Language Processing Concepts With Spacy](https://github.com/aparrish/rwet/blob/master/nlp-concepts-with-spacy.ipynb)"
  - "[Playing With Transformers](https://github.com/aparrish/rwet/blob/master/transformers-playground.ipynb)"

### `pycorpora` *Python interface for Darius Kazemi's [Corpora Project](https://github.com/dariusk/corpora)*
- [GitHub](https://github.com/aparrish/pycorpora)
- [Original Darius Kazemi's Corpora Project](https://github.com/dariusk/corpora)
  * *Seems deprecated and doesn't work based on current documentation*

## Shallow Dives

### Project Gutenberg Poetry Corpus

- [GitHub](https://github.com/aparrish/gutenberg-poetry-corpus)
- [Jupyter Notebook](https://github.com/aparrish/gutenberg-poetry-corpus/blob/master/quick-experiments.ipynb)

#### Build & Load

In [None]:
# download the gzipped corpus file
!curl -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz

In [None]:
# load data
# import gzip, json
all_lines = []
for line in gzip.open("gutenberg-poetry-v001.ndjson.gz"):
    all_lines.append(json.loads(line.strip()))
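
The loading cell above relies on the corpus being newline-delimited JSON: one JSON object per line, each with the line of verse under the `'s'` key that the later cells read. As a minimal sketch of that format, the example below writes two hand-written sample records to an in-memory gzip file and reads them back the same way. The sample records and variable names are assumptions for illustration, not part of the corpus.

```python
import gzip
import io
import json

# Hand-written sample records in the same shape the notebook relies on:
# one JSON object per line, with the verse text under the "s" key.
sample_records = [
    {"s": "The flower that smiles to-day"},
    {"s": "To-morrow dies;"},
]

# Write them to an in-memory gzip file instead of downloading anything.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as fh:
    for record in sample_records:
        fh.write(json.dumps(record) + "\n")

# Read the records back line by line, skipping any blank lines.
buffer.seek(0)
with gzip.open(buffer, "rt", encoding="utf-8") as fh:
    parsed = [json.loads(line) for line in fh if line.strip()]

print(parsed[0]["s"])
```

Because each line parses independently, a file in this format can be filtered or sampled while streaming, without loading everything first.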

In [None]:
# show random sample
# import random
random.sample(all_lines, 8)

#### Concordances & Counts

In [None]:
# create concordance for "flower"
# import re
flower_lines = [line['s'] for line in all_lines if re.search(r'\bflower\b', line['s'], re.I)]
random.sample(flower_lines, 8)

In [None]:
# longest lines, align on "flower"
longest = max([len(x) for x in flower_lines]) # find the length of the longest line
center = longest - len("flower") # and use it to create a "center" offset that will work for all lines

sorted_flower_lines = sorted(
    [line for line in flower_lines if re.search(r"\bflower\b\s\w", line)], # only lines with word following
    key=lambda line: line[re.search(r"\bflower\b\s", line).end():]) # sort on the substring following the match

for line in sorted_flower_lines[350:400]: # change these numbers to see a different slice
    offset = center - re.search(r'\bflower\b', line, re.I).start()
    print((" "*offset)+line) # left-pad the string with spaces to align on "flower"
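
The alignment trick in the cell above is easier to see on a handful of lines. The sketch below is a simplified variant, not the notebook's exact method: it pads each line by the distance between its own match and the furthest-right match, so every occurrence of the keyword starts in the same column. The sample lines and the names `sample_lines`, `hits`, and `aligned` are assumptions for illustration.

```python
import re

sample_lines = [
    "A flower was offered to me,",
    "Such a flower as May never bore;",
    "And the little flower fell at my feet",
]
keyword = "flower"
pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.I)

# Record where the keyword starts in each line that contains it.
hits = []
for line in sample_lines:
    m = pattern.search(line)
    if m:
        hits.append((m.start(), line))

# Left-pad each line so every match begins in the same column.
column = max(start for start, _ in hits)
aligned = [" " * (column - start) + line for start, line in hits]
print("\n".join(aligned))
```

The notebook cell uses the length of the longest line as its offset instead, which is a looser bound but avoids a second pass over the match positions.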

In [None]:
# adjective concordance 
found_adj = []
for line in flower_lines:
    matches = re.findall(r"\b(the|a)\s(\w+)\s(flower)\b", line, re.I) # \b keeps "a" from matching the end of words like "sea"
    for match in matches: 
        found_adj.append(match[1])
random.sample(found_adj, 12)

In [None]:
# counting most common adjectives
# from collections import Counter
Counter(found_adj).most_common(12)

#### Rhymes & Phones

In [None]:
# rhymes
import pronouncing as pr
source_word = "flowering"
source_word_rhymes = set(pr.rhymes(source_word)) # set for fast membership tests
for line in all_lines:
    text = line['s']
    match = re.search(r'(\b\w+\b)\W*$', text)
    if match:
        last_word = match.group(1).lower() # group(1) is the word itself, without trailing punctuation
        if last_word in source_word_rhymes:
            print(text)

In [None]:
# create list of phones
phones = pr.phones_for_word(source_word)[0] # words may have multiple pronunciations, so this returns a list
pr.rhyming_part(phones)
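
As a rough illustration of what a "rhyming part" is, the sketch below takes a CMUdict-style phone string and keeps everything from the last stressed vowel to the end of the word. It is a simplified stand-in, not the `pronouncing` library's code; the function name `rhyming_part_sketch` is hypothetical, and the two phone strings are typed by hand on the assumption that they match the dictionary entries.

```python
def rhyming_part_sketch(phones):
    """Return the phones from the last stressed vowel to the end of the word."""
    phone_list = phones.split()
    # Walk backwards looking for a vowel marked with primary (1) or secondary (2) stress.
    for i in range(len(phone_list) - 1, -1, -1):
        if phone_list[i][-1] in "12":
            return " ".join(phone_list[i:])
    # No stressed vowel found: fall back to the whole pronunciation.
    return phones

# Hand-typed, CMUdict-style pronunciations (assumed for illustration).
flowering = "F L AW1 ER0 IH0 NG"
towering = "T AW1 ER0 IH0 NG"

print(rhyming_part_sketch(flowering))  # → AW1 ER0 IH0 NG
print(rhyming_part_sketch(flowering) == rhyming_part_sketch(towering))  # → True
```

Two words count as rhymes under this definition when their rhyming parts are identical, which is why the later cells group lines by that string.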

In [None]:
# random rhymes
# from collections import defaultdict
by_rhyming_part = defaultdict(lambda: defaultdict(list))
for line in all_lines:
    text = line['s']
    if not(32 < len(text) < 48): # only use lines of uniform lengths
        continue
    match = re.search(r'(\b\w+\b)\W*$', text)
    if match:
        last_word = match.group(1) # group(1) is the word itself, without trailing punctuation
        pronunciations = pr.phones_for_word(last_word)
        if len(pronunciations) > 0:
            rhyming_part = pr.rhyming_part(pronunciations[0])
            # group by rhyming phones (for rhymes) and words (to avoid duplicate words)
            by_rhyming_part[rhyming_part][last_word.lower()].append(text)

random_rhyming_part = random.choice(list(by_rhyming_part.keys()))
random_rhyming_part, by_rhyming_part[random_rhyming_part]

In [None]:
# rhyming groups
rhyme_groups = [group for group in by_rhyming_part.values() if len(group) >= 2]
for i in range(7):
    group = random.choice(rhyme_groups)
    words = random.sample(list(group.keys()), 2)
    print(random.choice(group[words[0]]))
    print(random.choice(group[words[1]]))

#### Markov Text Chains

In [None]:
# markov text chains
# import markovify
big_poem = "\n".join([line['s'] for line in random.sample(all_lines, 250000)])
model = markovify.NewlineText(big_poem)
for i in range(14):
    print(model.make_sentence())

In [None]:
# another sentence
model.make_short_sentence(60)
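
For intuition about what a Markov text chain does, here is a minimal pure-Python sketch. It is not `markovify`'s implementation: `build_chain`, `generate`, the `<s>`/`</s>` markers, and the toy lines are all assumptions for illustration. Each state is the previous two words, and the next word is drawn at random from the words that followed that state in the source.

```python
import random
from collections import defaultdict

def build_chain(lines, state_size=2):
    """Map each run of `state_size` words to the words seen right after it."""
    chain = defaultdict(list)
    for line in lines:
        padded = ["<s>"] * state_size + line.split() + ["</s>"]
        for i in range(len(padded) - state_size):
            state = tuple(padded[i:i + state_size])
            chain[state].append(padded[i + state_size])
    return chain

def generate(chain, state_size=2, max_words=50, rng=random):
    """Walk the chain from the start state until an end marker or the word limit."""
    state = ("<s>",) * state_size
    out = []
    while len(out) < max_words:
        nxt = rng.choice(chain[state])
        if nxt == "</s>":
            break
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)

toy_lines = [
    "the rose is red",
    "the rose is sweet and red",
    "the violet is blue",
]
chain = build_chain(toy_lines)
print(generate(chain, rng=random.Random(3)))
```

With so little source text the output mostly reproduces the input lines. A large sample such as the 250,000 lines used above gives the chain many more continuations to choose from at each step.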

In [None]:
# randomly-generated poem
for i in range(6):
    print()
    for j in range(random.randrange(1, 5)):
        print(model.make_short_sentence(40))
    # ensure last line has a period at the end, for closure
    last_line = model.make_short_sentence(40) or "" # make_short_sentence can return None
    print(re.sub(r"(\w)[^\w.]?$", r"\1.", last_line))
    print()
    print("～ ❀ ～")

### Magic Words

"Words have power that arises not just from their meaning but from their material. Some words, those "somewhere between the 'legible' and 'illegible,' between the 'spirit world' and the 'human world'," as scholar James Robson writes, "express or illustrate ineffable meanings and powers that defy... traditional modalities of communication." Some words, that is, are magic. In this workshop, we will use techniques in computational text analysis and text generation to better understand how magic words work, and coin new magic words of our own. In the first part of the session, we consider magic words as islands in a largely unexplored infinite space of potential linguistic expression—a space that can be explored computationally in order to uncover new magic words with new affordances. In the second part of the session, we consider systems of divination (in particular, Tarot) as ad-hoc ontologies for dividing the world into comprehensible categories. We then analyze the "semantic space" of these divinatory ontologies, and endeavor to create new divinatory systems with new ontologies that reflect our own worldviews. Technologies covered include cryptography, phoneme-to-grapheme models, generative adversarial networks, text clustering, predictive language models and variational autoencoders. No previous programming experience required." ([GitHub](https://github.com/aparrish/comexmadivla), SFPC Code Societies, 2020)

Original Jupyter Notebooks
- [Speculative magic words (string manipulation, `Pincelate`)](https://github.com/aparrish/comexmadivla/blob/master/magic-words-workbook.ipynb)
- [Cartomancy and semantic space](https://github.com/aparrish/comexmadivla/blob/master/cartomancy-semantic-space.ipynb)

In [None]:
# get list of English nouns from Darius Kazemi's corpus project
!curl -L -O https://raw.githubusercontent.com/dariusk/corpora/master/data/words/nouns.json

In [None]:
# import random, json
nouns = [item.lower() for item in json.load(open("nouns.json"))['nouns']]
random.choice(nouns)

#### Concatenation

In [None]:
# concatenation
def smoosh(a, b):
    return a[:int(len(a)/2)] + b[int(len(b)/2):]
smoosh(random.choice(nouns), random.choice(nouns))

#### Dislocation

In [None]:
# dislocation
def dislocate(s, prob=0.1):
    out = ""
    for ch in s:
        if random.random() < prob:
            out += " "
        out += ch
    return out
dislocate("abracadabra")

'ab r acadabra'

#### Character Ciphers

In [None]:
# character cipher
def replace_by_char(s, ch_map):
    out = ""
    for ch in s:
        if ch in ch_map:
            out += ch_map[ch]
        else:
            out += ch
    return out


nextch_map = {
    'a': 'b', 'b': 'c', 'c': 'd', 'd': 'e',
    'e': 'f', 'f': 'g', 'g': 'h', 'h': 'i',
    'i': 'j', 'j': 'k', 'k': 'l', 'l': 'm',
    'm': 'n', 'n': 'o', 'o': 'p', 'p': 'q',
    'q': 'r', 'r': 's', 's': 't', 't': 'u',
    'u': 'v', 'v': 'w', 'w': 'x', 'x': 'y',
    'y': 'z', 'z': 'a'
}


replace_by_char("here come the irish", nextch_map)

In [None]:
# rot13 cipher
# import codecs
codecs.encode("here come the irish", 'rot13')
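
The hand-typed `nextch_map` and the rot13 codec are both letter-shift ciphers, differing only in the shift amount. As a sketch under that reading, the example below builds the substitution table programmatically and shows that shifting by 13 twice returns the original text. The helper names `shift_map` and `substitute` are hypothetical, and only lowercase ASCII letters are handled.

```python
import codecs
import string

def shift_map(n):
    """Build a lowercase substitution table that shifts each letter by n places."""
    letters = string.ascii_lowercase
    return {ch: letters[(i + n) % 26] for i, ch in enumerate(letters)}

def substitute(s, ch_map):
    # Characters without an entry (spaces, punctuation) pass through unchanged.
    return "".join(ch_map.get(ch, ch) for ch in s)

message = "here come the irish"
once = substitute(message, shift_map(13))
twice = substitute(once, shift_map(13))

print(once)
print(twice)
print(once == codecs.encode(message, "rot13"))
```

`shift_map(1)` gives the same table as `nextch_map` above, so `substitute(message, shift_map(1))` should match the earlier `replace_by_char` call.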

#### Mirror Writing

In [None]:
# mirror writing
# from https://github.com/combatwombat/Lunicode.js/blob/master/lunicode.js
mirror_replacements = {
    'a': 'ɒ', 'b': 'd', 'c': 'ɔ', 'd': 'b', 'e': 'ɘ', 
    'f': 'Ꮈ', 'g': 'ǫ', 'h': 'ʜ', 'i': 'i', 'j': 'ꞁ',
    'k': 'ʞ', 'l': 'l', 'm': 'm', 'n': 'ᴎ', 'o': 'o',
    'p': 'q', 'q': 'p', 'r': 'ɿ', 's': 'ꙅ', 't': 'ƚ',
    'u': 'u', 'v': 'v', 'w': 'w', 'x': 'x', 'y': 'ʏ', 'z': 'ƹ',
    'A': 'A', 'B': 'ᙠ', 'C': 'Ɔ', 'D': 'ᗡ', 'E': 'Ǝ',
    'F': 'ꟻ', 'G': 'Ꭾ', 'H': 'H', 'I': 'I', 'J': 'Ⴑ',
    'K': '⋊', 'L': '⅃', 'M': 'M', 'N': 'Ͷ', 'O': 'O',
    'P': 'ꟼ', 'Q': 'Ọ', 'R': 'Я', 'S': 'Ꙅ', 'T': 'T',
    'U': 'U', 'V': 'V', 'W': 'W', 'X': 'X', 'Y': 'Y', 'Z': 'Ƹ'}
text = "in the beginning was the notebook"
print(text + " " + replace_by_char(text, mirror_replacements))

#### Handwriting

In [None]:
# handwriting errors
# import re, random

# suggested in Lecouteux, p. xxi
replacements = {
    'u': ['o', 'n'],
    'st': ['h'],
    'p': ['f'],
    'ni': ['m'],
    'rn': ['m'],
    'in': ['m'],
    'iu': ['m', 'in'],
    'r': ['t', 'z', 'c'],
    'l': ['t'],
    'c': ['t'],
    'd': ['ol']
}


text = "in the beginning was the notebook"
out = text
for patt, repl in replacements.items():
    out = re.sub(patt,
                 lambda m: random.choice(repl) if random.random() < 0.5 else m.group(),
                 out)
print(text)
print(out)


#### Abbreviations

In [None]:
# abbreviations
def abbrev(s, take=1):
    words = s.split()
    return [w[:take] for w in words]
abbrev("hello there how are you?")

In [None]:
# another abbreviation function call
abbrev(text, 2)

In [None]:
# abbrev function call
print(''.join(abbrev(text, 2)))

In [None]:
# abbrev function call
init_cap = [item.capitalize() for item in abbrev(text, 2)]
print('. '.join(init_cap))

#### Mandorlas 

In [None]:
# formatting
def triangle(s):
    out = []
    for i in range(len(s)):
        snippet = s[:i+1]
        out.append(snippet)
    return out
print("\n".join(triangle("abracadabra")))
print("\n".join(reversed(triangle("abracadabra"))))

In [None]:
# streamlined mandorla formatting
def mandorla(s):
    return triangle(s)[:-1] + list(reversed(triangle(s)))
print("\n".join(mandorla("abracadabra")))

In [None]:
# for centering in a Jupyter Notebook
from IPython.display import display, HTML
html_src = "<div style='text-align: center'>"
html_src += "<br>".join(mandorla("abracadabra"))
html_src += "</div>"
display(HTML(html_src))

In [None]:
# mandorla with mirror replacement
html_src = "<div style='text-align: center'>"
html_src += "<br>".join(mandorla("abracadabra" + replace_by_char("abracadabra", mirror_replacements)))
html_src += "</div>"
display(HTML(html_src))

#### Word Squares

In [None]:
# word squares
def gen_str(n, alphabet):
    return ''.join([random.choice(alphabet) for i in range(n)])
gen_str(5, alphabet="abcdefghijklmnopqrstuvwxyz")

In [None]:
# second square function with random letters
def gen_square(n, alphabet='abcdefghijklmnopqrstuvwxyz', start=None):
    if start is None:
        rows = [gen_str(n, alphabet)]
    else:
        assert len(start) == n
        rows = [start]
    for i in range(int(n/2)):
        beg = ""
        end = ""
        for j in range(i+1):
            beg += rows[j][i+1]
            end += rows[j][-i-2]
        row = beg + gen_str(n - ((i+1)*2), alphabet) + ''.join(reversed(end))
        rows.append(row)
    return rows + list(reversed([''.join(reversed(s)) for s in rows[:int(n/2)]]))
print("\n".join(gen_square(5)))

In [None]:
# alternate generative square function call
print("\n".join(gen_square(5, alphabet="satorarepotenet", start="sator")))
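
The classic Sator square reads the same across and down, and also when rotated by 180 degrees. The checker below is a small sketch for testing whether a list of rows has both symmetries; `is_symmetric_square` is a hypothetical helper, not part of the original workshop notebook. It can be pointed at the output of `gen_square` to see which generated squares share those properties.

```python
def is_symmetric_square(rows):
    """True if rows form an n-by-n grid that reads the same across, down, and rotated 180 degrees."""
    n = len(rows)
    if any(len(row) != n for row in rows):
        return False
    for i in range(n):
        for j in range(n):
            # Reading down a column must give the same text as reading across a row.
            if rows[i][j] != rows[j][i]:
                return False
            # Rotating the square by 180 degrees must leave it unchanged.
            if rows[i][j] != rows[n - 1 - i][n - 1 - j]:
                return False
    return True

sator = ["sator", "arepo", "tenet", "opera", "rotas"]
print(is_symmetric_square(sator))                       # → True
print(is_symmetric_square(["abc", "def", "ghi"]))       # → False
```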

In [None]:
# function call with emojis
print()
print("\n".join(gen_square(5, alphabet="😀😄😁😆😅😂🤣😊😙😗😘🥰😍😌😉🙃🙂😇😚😋😛😝😜🤨🧐🤓😎")))

#### Numerology

In [None]:
# numerology
def letter_value(ch):
    if not(ch.isalpha()):
        return 0
    return ord(ch.lower()) - 96
letter_value('a')

In [None]:
# word sum
def gematriesque(s):
    return sum([letter_value(ch) for ch in s])
gematriesque('notre dame')
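
To make the arithmetic behind `gematriesque` visible, the sketch below lists each letter with its position in the alphabet before summing. `letter_values` is a hypothetical helper, and it assumes plain a-z input; accented or non-Latin letters would produce values outside 1 to 26.

```python
def letter_values(s):
    """Pair each alphabetic character with its 1-based position in the alphabet."""
    return [(ch, ord(ch.lower()) - ord("a") + 1) for ch in s if ch.isalpha()]

pairs = letter_values("notre dame")
total = sum(value for _, value in pairs)

# Show the sum term by term, e.g. n=14 + o=15 + ...
print(" + ".join(f"{ch}={value}" for ch, value in pairs), "=", total)
```

Here "notre" contributes 14 + 15 + 20 + 18 + 5 = 72 and "dame" contributes 4 + 1 + 13 + 5 = 23, for a total of 95.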

In [None]:
# look up words that match sum
# from collections import defaultdict
by_sum = defaultdict(list)
word_to_sum = {}

for item in nouns:
    letter_sum = gematriesque(item)
    word_to_sum[item] = letter_sum
    by_sum[letter_sum].append(item)

by_sum[72]

In [None]:
# show words with same sum as input string
print("\n".join(by_sum[gematriesque('notre dame')]))

### NLP & `spaCy`

"“Natural Language Processing” is a field at the intersection of computer science, linguistics and artificial intelligence which aims to make the underlying structure of language available to computer programs for analysis and manipulation. It’s a vast and vibrant field with a long history! New research and techniques are being developed constantly.

"The aim of this notebook is to introduce a few simple concepts and techniques from NLP—just the stuff that’ll help you do creative things quickly, and maybe open the door for you to understand more sophisticated NLP concepts that you might encounter elsewhere. We'll start with simple extraction tasks: isolating words, sentences, and parts of speech. By the end, we'll have a few working systems for creating sophisticated text generators that function by remixing texts based on their constituent linguistic units." ([Jupyter Notebook](https://github.com/aparrish/rwet/blob/master/nlp-concepts-with-spacy.ipynb), Spring 2023 "Reading and Writing Electronic Text" NYU ITP course)

#### Setup

In [None]:
# install
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_md

In [None]:
# import
import spacy
nlp = spacy.load('en_core_web_md')

In [None]:
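# loading a sample txt file
# (assumes pg84.txt, Project Gutenberg's "Frankenstein", is already in the working directory)
with open("pg84.txt", encoding="utf-8") as fh:
    text = fh.read()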

In [None]:
# parse using spacy
doc = nlp(text)
sentences = list(doc.sents)
words = [w for w in list(doc) if w.is_alpha]
noun_chunks = list(doc.noun_chunks)
entities = list(doc.ents)

#### Counting & Sampling

In [None]:
# number of sentences
len(sentences)

3327

In [None]:
# random sample of sentences
# import random
for item in random.sample(sentences, 5):
    print(item.text.strip().replace("\n", " "))
    print()

In [None]:
# ten random words
for item in random.sample(words, 10):
    print(item.text)

In [None]:
# ten random noun chunks
for item in random.sample(noun_chunks, 10):
    print(item.text)

my sweet Elizabeth
indignation
absolution
the fishermen
an instant
Heaven
the murderer
the gloom
promotion
he


In [None]:
# ten random entities
for item in random.sample(entities, 10):
    print(item.text)

#### Parts of Speech

In [None]:
# sentences as strings
sentence_strs = [item.text for item in doc.sents]
random.sample(sentence_strs, 10)

['I hardly know whether I shall have the power to detail\nit; yet the tale which I have recorded would be incomplete without this\nfinal and wonderful catastrophe.',
 'Section 2.',
 'Entreating him, therefore, to remain a few minutes at the bottom of the\nstairs, I darted up towards my own room.',
 'Sometimes,\nwhen nature, overcome by hunger, sank under the exhaustion, a repast\nwas prepared for me in the desert that restored and inspirited me.',
 'I found myself similar yet at the same time strangely\nunlike to the beings concerning whom I read and to whose conversation I\nwas a listener.',
 'Sometimes I wished to express my sensations in my own mode, but the\nuncouth and inarticulate sounds which broke from me frightened me into\nsilence again.\n\n',
 'Thus I might proclaim myself a madman,\nbut not revoke the sentence passed upon my wretched victim.',
 '[Coleridge’s “Ancient Mariner.”',
 'Is this to prognosticate peace, or to mock at my unhappiness?”\n\nI fear, my friend, that I sh

In [None]:
# parts of speech
nouns = [w for w in words if w.pos_ == "NOUN"]
verbs = [w for w in words if w.pos_ == "VERB"]
adjs = [w for w in words if w.pos_ == "ADJ"]
advs = [w for w in words if w.pos_ == "ADV"]

In [None]:
# random sampling of nouns
for item in random.sample(nouns, 20): # change "nouns" to "verbs" or "adjs" or "advs" to sample from those lists!
    print(item.text)

#### Entities

In [None]:
# getting entities
people = [e for e in entities if e.label_ == "PERSON"]
locations = [e for e in entities if e.label_ == "LOC"]
times = [e for e in entities if e.label_ == "TIME"]

In [None]:
# random sample with times
for item in random.sample(times, 20): # change "times" to "people" or "locations" to sample those lists
    print(item.text.strip())

#### Frequency

In [None]:
# get term frequency
# from collections import Counter
word_count = Counter([w.text for w in words])

In [None]:
# show single term
word_count['heaven']

15

In [None]:
# ten most common terms
word_count.most_common(10)

[('the', 4080),
 ('and', 3003),
 ('I', 2847),
 ('of', 2750),
 ('to', 2157),
 ('my', 1635),
 ('a', 1402),
 ('in', 1138),
 ('was', 1020),
 ('that', 1018)]

In [None]:
# twenty most common terms in a for loop
for word, count in word_count.most_common(20):
    print(word, count)

the 4080
and 3003
I 2847
of 2750
to 2157
my 1635
a 1402
in 1138
was 1020
that 1018
me 867
with 705
had 684
not 583
which 565
you 555
but 551
his 502
for 495
as 492


#### Saving Output

In [None]:
# list of words to file
with open("words.txt", "w") as fh:
    fh.write("\n".join([w.text for w in words]))

In [None]:
# that workflow as a function
def save_spacy_list(filename, t):
    with open(filename, "w") as fh:
        fh.write("\n".join([item.text for item in t]))

# function call
save_spacy_list("words.txt", words)

In [None]:
# saving counter objects
def save_counter_tsv(filename, counter, limit=1000):
    with open(filename, "w") as outfile:
        outfile.write("key\tvalue\n")
        for item, count in counter.most_common():
            outfile.write(item.strip() + "\t" + str(count) + "\n")    

save_counter_tsv("100_common_words.tsv", word_count, 100)

In [None]:
# time entities
time_counter = Counter([e.text.lower().strip() for e in times])
save_counter_tsv("time_count.tsv", time_counter, 100)

In [None]:
# people entities
people_counter = Counter([e.text.lower() for e in people])
save_counter_tsv("people_count.tsv", people_counter, 100)

#### Data Structures

##### Words

In [None]:
# lemmatizing
for word in random.sample(words, 12):
    print(word.text, "→", word.lemma_)

which → which
with → with
a → a
my → my
the → the
the → the
being → being
it → it
playing → play
innocence → innocence
the → the
the → the


In [None]:
# words in sentence
sentence = random.choice(sentences)
for word in sentence:
    print(word.text)

Hear
him
not
;
call
on
the
names
of
William
,


Justine
,
Clerval
,
Elizabeth
,
my
father
,
and
of
the
wretched
Victor
,
and


thrust
your
sword
into
his
heart
.


In [None]:
# parts of speech
for item in random.sample(words, 24):
    print(item.text, "/", item.pos_, "/", item.tag_)

Accursed / VERB / VBN
you / PRON / PRP
my / PRON / PRP$
that / SCONJ / IN
removal / NOUN / NN
of / ADP / IN
consumption / NOUN / NN
her / PRON / PRP$
to / PART / TO
is / AUX / VBZ
would / AUX / MD
they / PRON / PRP
to / ADP / IN
to / ADP / IN
word / NOUN / NN
Servox / PROPN / NNP
happiness / NOUN / NN
to / ADP / IN
attended / VERB / VBN
his / PRON / PRP$
of / ADP / IN
awoke / VERB / VBD
rose / VERB / VBD
caused / VERB / VBD


In [None]:
# term deep dive
spacy.explain('VBP')

'verb, non-3rd person singular present'

In [None]:
# verb forms: past participles (tag VBN)
only_past = [item.text for item in doc if item.tag_ == 'VBN']
random.sample(only_past, 12)

['been',
 'watched',
 'been',
 'paid',
 'acquainted',
 'supplied',
 'transported',
 'sent',
 'dated',
 'lost',
 'surrounded',
 'opened']

In [None]:
# plural nouns
only_plural = [item.text for item in doc if item.tag_ == 'NNS']
random.sample(only_plural, 12)

##### Sentences

In [None]:
# sentence breakdown
sent = random.choice(sentences)
print("Original sentence:", sent.text.replace("\n", " "))
for word in sent:
    print()
    print("Word:", word.text)
    print("Tag:", word.tag_)
    print("Head:", word.head.text)
    print("Dependency relation:", word.dep_)
    print("Children:", list(word.children))

##### Semantic Units

In [None]:
# flattening the structure
def flatten_subtree(st):
    return ''.join([w.text_with_ws for w in list(st)]).strip()

sent = random.choice(sentences)
print("Original sentence:", sent.text.replace("\n", " "))
for word in sent:
    print()
    print("Word:", word.text.replace("\n", " "))
    print("Flattened subtree: ", flatten_subtree(word.subtree).replace("\n", " "))

In [None]:
# get subjects
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))

random.sample(subjects, 12)

In [None]:
# prepositional phrases
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree).replace("\n", " "))
 
random.sample(prep_phrases, 12)                           

#### Generating Text With `spaCy` and Tracery

In [None]:
# break down sentence structure
subjects = [flatten_subtree(word.subtree).replace("\n", " ")
            for word in doc if word.dep_ in ('nsubj', 'nsubjpass')]
past_tense_verbs = [word.text for word in words if word.tag_ == 'VBD' and word.lemma_ != 'be']
adjectives = [word.text for word in words if word.tag_.startswith('JJ')]
nouns = [word.text for word in words if word.tag_.startswith('NN')]
prep_phrases = [flatten_subtree(word.subtree).replace("\n", " ")
                for word in doc if word.dep_ == 'prep']

In [None]:
# tracery setup
!{sys.executable} -m pip install tracery


In [None]:
# import
import tracery
from tracery.modifiers import base_english

In [None]:
# define a grammar
rules = {
    "origin": [
        "#subject.capitalize# #predicate#.",
        "#subject.capitalize# #predicate#.",
        "#prepphrase.capitalize#, #subject# #predicate#."
    ],
    "predicate": [
        "#verb#",
        "#verb# #nounphrase#",
        "#verb# #prepphrase#"
    ],
    "nounphrase": [
        "the #noun#",
        "the #adj# #noun#",
        "the #noun# #prepphrase#",
        "the #noun# and the #noun#",
        "#noun.a#",
        "#adj.a# #noun#",
        "the #noun# that #predicate#"
    ],
    "subject": subjects,
    "verb": past_tense_verbs,
    "noun": nouns,
    "adj": adjectives,
    "prepphrase": prep_phrases
}
grammar = tracery.Grammar(rules)
grammar.add_modifiers(base_english)
grammar.flatten("#origin#")

'From thine eyes, it saw of a bride.'

In [None]:
# generate text
from textwrap import fill
output = " ".join([grammar.flatten("#origin#") for i in range(12)])
print(fill(output, 60))

At Geneva, He had. I retired. With which, I met. I lighted.
I admired. On earth, I said of provisions. The spirits of
the dead hovered round and dated the pleasures. In her eyes,
Our little voyages of discovery appeared the things and the
fever. I induced in the bloom of health. The whirlwind
passions of my soul called. I gave the selfishness of the
cottagers. I exclaimed of my parents’ house.
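
The grammar above works by repeatedly replacing `#symbol#` placeholders with a randomly chosen rule for that symbol. The sketch below shows that expansion step in plain Python. It is a simplified stand-in rather than Tracery's implementation: `expand` and `toy_rules` are hypothetical, and modifiers such as `.capitalize` or `.a` are not handled.

```python
import random
import re

def expand(symbol, rules, rng=random):
    """Pick a random rule for `symbol`, then recursively expand every #name# inside it."""
    template = rng.choice(rules[symbol])
    return re.sub(r"#(\w+)#", lambda m: expand(m.group(1), rules, rng), template)

toy_rules = {
    "origin": ["#subject# #verb# the #noun#."],
    "subject": ["I", "the wretch"],
    "verb": ["saw", "feared"],
    "noun": ["storm", "mountain"],
}

for seed in range(3):
    print(expand("origin", toy_rules, random.Random(seed)))
```

Rules that refer back to themselves, like `nounphrase` and `predicate` in the grammar above, work the same way, because each nested placeholder is expanded with a fresh random choice.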


### Playing With Transformers

"Transformers is a Python library released by Hugging Face to make it easy to use pre-trained transformer language models. This notebook takes you through the basics of how to generate text with this library, and demonstrates a few simple techniques you can use to assert finer-grained control over the text generation procedure, like logit warping and fine-tuning." ([Jupyter Notebook](https://github.com/aparrish/rwet/blob/master/transformers-playground.ipynb), Spring 2023 "Reading and Writing Electronic Text" NYU ITP course)

#### Setup

In [None]:
# install
!pip install transformers datasets
!pip install "tensorflow>=2.11" # quoted so the shell does not treat ">" as output redirection

In [None]:
# check tf version
import tensorflow as tf
print(tf.__version__)

2.11.0



#### Overview

In [None]:
# import statements
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

In [None]:
# load the pretrained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

##### Generators

In [None]:
# generator
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
generator("Two roads diverged in a yellow wood, and")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Two roads diverged in a yellow wood, and a black sedan crashed into trees off the side of the street.\n\n\n\nThe black sedan hit a roadblock and an SUV plunged into a tree on the west side of Lake Michigan.\n'}]

In [None]:
# another generator
generator("Two roads diverged in a yellow wood, and")[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Two roads diverged in a yellow wood, and the green areas of the city were rumbled in the afternoon.\n\n\n\n\nThe black areas were the biggest in North York with 9.2 million people.\n\nPolice and Fire Ministry'

##### Tokenizers

In [None]:
# show vocabulary 
vocab = tokenizer.get_vocab()
len(vocab)

50257

In [None]:
# show random set of words
# import random
random.sample(list(vocab.items()), 10) # random.sample needs a sequence, not a dict view

In [None]:
# encode a string
src = "Behold! An alabaster anemone. Zzzzap!"
tokenizer.encode(src)

In [None]:
# decode token id
tokenizer.decode(1603)

'aster'

In [None]:
# units in original string
for token_id in tokenizer.encode(src):
    print(token_id, "→", "'" + tokenizer.decode(token_id) + "'")

3856 → 'Be'
2946 → 'hold'
0 → '!'
1052 → ' An'
435 → ' al'
397 → 'ab'
1603 → 'aster'
281 → ' an'
368 → 'em'
505 → 'one'
13 → '.'
1168 → ' Z'
3019 → 'zz'
89 → 'z'
499 → 'ap'
0 → '!'


In [None]:
# decode entire list of token ids
token_ids = tokenizer.encode(src)
tokenizer.decode(token_ids)

'Behold! An alabaster anemone. Zzzzap!'

In [None]:
# decode a random list of token ids
tokenizer.decode(random.sample(list(vocab.values()), 12))

'ruciating Inferno processor1995 rational newborn disproportion Ulster Carth recipient common situation'

In [None]:
# call the tokenizer as a function
tokenizer(["this is a test", "this is another test"], return_tensors="pt")

{'input_ids': tensor([[5661,  318,  257, 1332],
        [5661,  318, 1194, 1332]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1]])}

#### Generation Deep Dive

In [None]:
# encode prompt
prompt = "Two roads diverged in a yellow wood, and"
prompt_encoded = tokenizer([prompt], return_tensors="pt")
prompt_encoded

{'input_ids': tensor([[ 7571,  9725, 12312,  2004,   287,   257,  7872,  4898,    11,   290]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
# show model result
result = model(**prompt_encoded)

In [None]:
# scores for the next token (raw logits, not yet normalized into probabilities)
next_token_probs = result.logits[0,-1]
next_token_probs

tensor([-75.0658, -73.7585, -76.3895,  ..., -75.8365, -73.6258, -73.6459],
       grad_fn=<SelectBackward0>)

In [None]:
# shape of the prediction
next_token_probs.shape

torch.Size([50257])

In [None]:
# show an individual token's logit (unnormalized score)
next_token_probs[tokenizer.encode(' the')].item()

-63.117576599121094

In [None]:
# another token's logit
next_token_probs[tokenizer.encode(' x')].item()

-73.5635986328125
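
The values above are logits: unnormalized scores, which is why they are large negative numbers rather than values between 0 and 1. A softmax turns them into probabilities. The sketch below does this in plain Python on a few toy scores chosen to resemble the ones printed above; `softmax` and `toy_logits` are illustrative names, not part of the notebook.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum first keeps the exponentials numerically stable.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [-63.1, -73.6, -75.1]
probs = softmax(toy_logits)

for logit, prob in zip(toy_logits, probs):
    print(logit, "->", prob)
```

A gap of about ten between two logits corresponds to a probability ratio of roughly e^10, so the higher-scoring token dominates. On the real tensor, `torch.softmax(next_token_probs, dim=-1)` should give the full distribution over the vocabulary.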

##### PyTorch Foundations

In [None]:
# show the 12 highest-scoring next tokens, using pytorch to sort
import torch
for idx in reversed(torch.argsort(next_token_probs)[-12:]):
    print("'" + tokenizer.decode(idx) + "'")

In [None]:
# generate ten tokens, picking at random among the 12 highest-scoring candidates each time
prompt = "Two roads diverged in a yellow wood, and"
for i in range(10):
    # encode the prompt
    prompt_encoded = tokenizer([prompt], return_tensors="pt")
    # run a forward pass on the network
    result = model(**prompt_encoded)
    # get the scores (logits) for the next token
    next_token_probs = result.logits[0,-1]
    # sort by value, get the top 12 (you can change this number! try 1, or 1000)
    nexts = torch.argsort(next_token_probs)[-12:]
    # append the decoded ID to the current prompt
    prompt += tokenizer.decode(random.choice(nexts))
    print(prompt)

Two roads diverged in a yellow wood, and another
Two roads diverged in a yellow wood, and another green
Two roads diverged in a yellow wood, and another green.
Two roads diverged in a yellow wood, and another green. One
Two roads diverged in a yellow wood, and another green. One of
Two roads diverged in a yellow wood, and another green. One of their
Two roads diverged in a yellow wood, and another green. One of their vehicles
Two roads diverged in a yellow wood, and another green. One of their vehicles crashed
Two roads diverged in a yellow wood, and another green. One of their vehicles crashed.
Two roads diverged in a yellow wood, and another green. One of their vehicles crashed. Three
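
The selection step inside the loop above (sort the scores, keep the top few, pick one uniformly at random) can be sketched without PyTorch. The vocabulary and scores below are made up for illustration, and `top_n_ids` is a hypothetical helper, not a library function.

```python
import heapq
import random

def top_n_ids(scores, n):
    """Return the indices of the n highest scores, best first."""
    return heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)

toy_vocab = [" another", " the", " I", " green", " x"]   # made-up vocabulary
toy_scores = [-61.0, -63.1, -64.2, -66.8, -73.6]         # made-up logits

candidates = top_n_ids(toy_scores, 3)
choice = random.choice(candidates)      # uniform pick, ignoring how far apart the scores are
print([toy_vocab[i] for i in candidates])
print(toy_vocab[choice])
```

Because the pick is uniform, the third-best candidate is as likely as the best one. That is one reason the generated text above drifts so quickly.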


##### `.generate()`

In [None]:
# transformers generate method
prompt = "Two roads diverged in a yellow wood, and"
prompt_encoded = tokenizer(prompt, return_tensors="pt") # the "return_tensors" thing is important!
result = model.generate(**prompt_encoded)[0]
tokenizer.decode(result, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Two roads diverged in a yellow wood, and another road diverged in a red brick.\n\n\n\nA white saw, white sheet of wood, and a white wall were added to the roadway at 1 p.m.\n\nL'

In [None]:
# set max length for tokens
prompt = "Two roads diverged in a yellow wood, and"
prompt_encoded = tokenizer(prompt, return_tensors="pt")
result = model.generate(**prompt_encoded, max_length=250)[0]
tokenizer.decode(result, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Two roads diverged in a yellow wood, and a red tarp over the highway just north of the crash site.\n\n\n\n\n\nThe highway reopened Thursday afternoon at 6:10 a.m. when a tractor truck was heading northeast, authorities said.\n\n\n\nA second tractor was spotted at a stop just west of the crash site.\n\nA third truck that crashed on the road near the crash site was southbound on the westbound highway just north of the crash site.\nCrews continued to monitor the scene, officials said.\nThere was no damage to the wreck, as well as some repairs to the bridge and tunnel.\nThe wreck site is located at the intersection of 3rd Avenue and 13th at an intersection of 12th Avenue and 13th and the wreck site is located at the intersection of 4th Avenue and 13th and the crash site is located at the intersection of 4th Avenue and 13th and the crash site is located at the intersection of 4th Avenue and 13th and the crash site is located at the intersection of 4th Avenue and 13th and the crash site is 

#### Pipeline Approaches

In [None]:
# set the model/tokenizer
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# call generator
generator("Two roads diverged in a yellow",
          max_length=100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Two roads diverged in a yellow spot\n\n\n\n\nThe road, west of Sydney, is being investigated as a violent one, police have confirmed.\n\n\nMural injuries occurred after the incident on March 27.\n\n\nPolice said traffic, including cars and vehicles were also in control.\nAn ambulance arrived and took the crash victim to hospital.\nThe South East Coast Highway on Adelaide Bridge was closed and the road closed down.'}]

In [None]:
# another generator example
generator("Two roads diverged in a yellow",
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Two roads diverged in a yellow van. He jumped out of the vehicle and into the black van. "He was on the edge of a hill, very big and in pain," said Lt. Jim Toles, an officer with the Division of Public Safety.\n\n\n\n\nHis two-day suspension was lifted Monday for first-time felony trespass, which he says took place by accident.\nPolice in the area reported seeing an elderly man with a long chain of teeth, and he'

##### Fine-Tuning the Prompt

In [None]:
# film review style
print(generator("My review of The Road Not Taken, the Movie:", max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


My review of The Road Not Taken, the Movie: I Am Losing Your Mind in a Dark World. By J.C. Green. A great review.
You read so much. There are so many reasons why you like to watch these movies. My opinion of the film is the same — that this movie is the most expensive movie in the world is a success and I think what people do from a good source can be very positive. To the extent that I was a filmmaker, it


In [None]:
# dialogue/interview style
print(generator("Allison: I took the road less traveled by.\nRobert Frost:",
                max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Allison: I took the road less traveled by.
Robert Frost: I took off without much time from the road, but when everything else was going to turn out, I thought that I would walk at the pace of a horse, so I followed my foot up the road as fast as possible.
Paul: On the same as, if I hadn't gone on a lot of the journey, I wouldn't have done that if I had not already done so. Paul wrote an account of


In [None]:
# favorite facts
print(generator("My favorite facts about poetry:\n\n1.",
                max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


My favorite facts about poetry:

1. Your favourite poet has always been one who has not lived with any form of formal intellectual structure (although in many cases, the idea that he has had an interest in a particular one—that he has never had a literary passion—is really just an outgrowth of a lot of "hippie" writing and poetry for that same reason."
2. The original idea of a literary writer was not a literary philosophy of literary thought and poetry.
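
The three prompts above differ only in how they frame the text that should follow: a review header, a dialogue turn, and a numbered list. A minimal sketch of that framing as reusable templates; the helper names are hypothetical and not part of `transformers`.

```python
# Hypothetical helpers that rebuild the three prompt styles used above.
def review_prompt(title):
    return f"My review of {title}:"

def dialogue_prompt(speaker, line, respondent):
    # End on the respondent's name so the model continues in that voice.
    return f"{speaker}: {line}\n{respondent}:"

def list_prompt(topic):
    # End on "1." so the model continues with a numbered list.
    return f"My favorite facts about {topic}:\n\n1."

prompts = [
    review_prompt("The Road Not Taken, the Movie"),
    dialogue_prompt("Allison", "I took the road less traveled by.", "Robert Frost"),
    list_prompt("poetry"),
]
for p in prompts:
    print(repr(p))
```

Each string can be passed to `generator(...)` exactly as in the cells above.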


##### Fine-Tuning Probabilities

In [None]:
# set manual probability
tokens = ['know', 'knew', 'smell', 'see', 'am']
probs = [0.5, 0.2, 0.15, 0.1, 0.05]

index = torch.multinomial(torch.tensor(probs), 1).item()
print(tokens[index])

know


In [None]:
# draw ten samples from the same distribution
for i in range(10):
    index = torch.multinomial(torch.tensor(probs), 1).item()
    print(tokens[index])

know
knew
know
am
smell
know
knew
know
know
know


In [None]:
# rescale the distribution at several temperatures
for temperature in [0.1, 0.35, 1.0, 2.0, 50.0]:
    modified = torch.softmax(
        torch.log(torch.tensor(probs)) / temperature, dim=-1)
    print(f"temperature {temperature:.2f}")
    for tok, prob in zip(tokens, modified):
        print(tok.ljust(6), "→", f"{prob:.2f}")
    print()

temperature 0.10
know   → 1.00
knew   → 0.00
smell  → 0.00
see    → 0.00
am     → 0.00

temperature 0.35
know   → 0.90
knew   → 0.07
smell  → 0.03
see    → 0.01
am     → 0.00

temperature 1.00
know   → 0.50
knew   → 0.20
smell  → 0.15
see    → 0.10
am     → 0.05

temperature 2.00
know   → 0.34
knew   → 0.21
smell  → 0.19
see    → 0.15
am     → 0.11

temperature 50.00
know   → 0.20
knew   → 0.20
smell  → 0.20
see    → 0.20
am     → 0.20
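
The table above can be reproduced without PyTorch. This sketch divides the log-probabilities by the temperature and renormalizes with a softmax, which is the same calculation as the cell above, written in plain Python. `apply_temperature` is a hypothetical helper name.

```python
import math

def apply_temperature(probs, temperature):
    """Divide log-probabilities by the temperature, then renormalize with softmax."""
    scaled = [math.log(p) / temperature for p in probs]
    top = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
for temperature in [0.35, 1.0, 2.0]:
    print(temperature, [round(p, 2) for p in apply_temperature(probs, temperature)])
```

A temperature of 1.0 leaves the distribution unchanged. Values below 1 sharpen it toward the most likely token, and values above 1 flatten it toward uniform.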



###### Examples

In [None]:
# low temperature (0.1): sampling sticks to the most likely tokens
generator("Two roads diverged in a yellow",
          temperature=0.1,
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Two roads diverged in a yellow light, and the police said they were investigating the incident.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

In [None]:
# high temperature (4.0): unlikely tokens are picked far more often
generator("Two roads diverged in a yellow",
          temperature=4.0,
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Two roads diverged in a yellow Toyota and red Nissan Camibu as U of S's was attacked on several highway near Westfield Avenue just in time for routine updates and security cameras that were part time of U+ and the traffic on westward when emergency crews responded Thursday at Waverts Ferry crossing.The reports come into view of just minutes from traffic lanes south-west over a mile from Kline Road south through Kneelington County.Fire workers were at Wervtencourt with three"

##### Top-K Sampling

In [None]:
# top_k set to the full vocabulary size, so no candidates are cut
generator("Two roads diverged in a yellow",
          top_k=tokenizer.vocab_size,
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Two roads diverged in a yellow that alternated between disengaging intersection from Mills Ave., Union Street and Atman Road in which pedestrian officers were trying to reach one in both directions on the Southbound and Southbound lanes.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

In [None]:
# full vocabulary again, with the temperature raised slightly to 1.2
generator("Two roads diverged in a yellow",
          top_k=tokenizer.vocab_size,
          temperature=1.2,
          max_length=100)[0]['generated_text']

In [None]:
# top_k=1: only the single most likely token is ever a candidate
generator("Two roads diverged in a yellow",
          top_k=1,
          max_length=100)[0]['generated_text']
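
Top-k sampling can be illustrated on the five-token toy distribution from the previous section: keep the `k` most probable tokens, renormalize, then sample from what is left. This is a sketch of the idea under that assumption, not the library's implementation. `top_k_filter` is a hypothetical helper.

```python
import random

def top_k_filter(tokens, probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return [(tok, p / total) for tok, p in ranked]

tokens = ['know', 'knew', 'smell', 'see', 'am']
probs = [0.5, 0.2, 0.15, 0.1, 0.05]

kept = top_k_filter(tokens, probs, k=2)
print(kept)

# Sample one token from the survivors, weighted by the renormalized probabilities.
choice = random.choices([t for t, _ in kept], weights=[p for _, p in kept], k=1)[0]
print(choice)
```

With `k=1` only `'know'` survives, so the output is deterministic. With `k` equal to the vocabulary size nothing is removed.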

##### Messing With Filters

In [None]:
# filtering out day/night
generator("It was a dark and stormy",
          bad_words_ids=tokenizer([" night", " day"]).input_ids)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'It was a dark and stormy time in the world of gaming and television. The events of the last few years have been incredibly difficult to describe since the start of the development, the success, and the dedication of the people of this country. For'

In [None]:
# filtering out forms of the verb "to be"
generator("Once upon a time,",
          bad_words_ids=tokenizer(
              ["be", " be",
               "am", " am",
               "are", " are",
               "is", " is",
               "was", " was",
               "were", " were"]).input_ids,
          max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Once upon a time, he would think he would feel good being here.'

In [None]:
# filter out tokens for all words that include "e"
forbidden_ids = []
for key, val in tokenizer.get_vocab().items():
    if 'e' in key:
        forbidden_ids.append([val]) # needs to be a list of lists
print(generator("Last month, I",
          bad_words_ids=forbidden_ids,
          max_length=100)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Last month, I had at a loss to admit that I was wrong.

This post has drawn a lot of scorn.
If you’ll join in, you’ should join in. I will try to show our support towards any of you.
First of all, I can’t say that I don’t know any of you about you, but I am just saying—oh god, I was thinking—that this is just a bit of an out
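
One way to picture what banning single tokens does: the score of each banned token is pushed to negative infinity before the scores become probabilities, so those tokens can never be sampled. The sketch below shows that idea on a made-up four-token vocabulary. It is an assumption about the general mechanism for single-token bans, not a copy of what `bad_words_ids` does internally, and it does not cover multi-token sequences.

```python
import math

def ban_tokens(logits, banned_ids):
    """Return a copy of logits with banned positions set to negative infinity."""
    return [-math.inf if i in banned_ids else score for i, score in enumerate(logits)]

def softmax(scores):
    """Convert raw scores to probabilities; exp(-inf) evaluates to 0.0."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

toy_vocab = [" night", " day", " time", " sea"]   # made-up vocabulary
toy_logits = [3.0, 2.0, 1.0, 0.5]                 # made-up scores
banned = {toy_vocab.index(" night"), toy_vocab.index(" day")}

filtered_probs = softmax(ban_tokens(toy_logits, banned))
for tok, p in zip(toy_vocab, filtered_probs):
    print(repr(tok), round(p, 3))
```

The probability mass that `' night'` and `' day'` would have received is redistributed over the remaining tokens.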


##### Fine-Tuning the Model

In [None]:
# install datasets
import sys
!{sys.executable} -m pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# import
import datasets

In [None]:
# download Frankenstein from Project Gutenberg
!curl -L -O https://www.gutenberg.org/cache/epub/84/pg84.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  438k  100  438k    0     0  1564k      0 --:--:-- --:--:-- --:--:-- 1564k


In [None]:
# keep only the first 20,000 characters
with open("pg84.txt") as src_fh, open("84-0-20k.txt", "w") as fh:
    fh.write(src_fh.read()[:20000])

In [None]:
# load dataset
training_data = datasets.load_dataset('text', data_files="84-0-20k.txt")

Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-bc2c885b88915351/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...



Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-bc2c885b88915351/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.



In [None]:
# tokenize
tokenizer.pad_token = tokenizer.eos_token
tokenized_training_data = training_data.map(
    lambda x: tokenizer(x['text']),
    remove_columns=["text"]
)

Map:   0%|          | 0/419 [00:00<?, ? examples/s]

In [None]:
# batch tokens
block_size = 64
# magic from https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_training_data = tokenized_training_data.map(
    group_texts,
    batched=True,
    batch_size=200
)

Map:   0%|          | 0/419 [00:00<?, ? examples/s]
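
The batching step above is easier to follow on a tiny input. This sketch applies the same logic (concatenate every tokenized line, cut the result into equal blocks, drop the leftover tail) to a made-up batch with a block size of 4. `group_into_blocks` is a hypothetical stand-in for `group_texts` that takes the block size as an argument and omits the `labels` copy.

```python
def group_into_blocks(examples, block_size):
    """Concatenate each column's lists, split into equal blocks, drop the remainder."""
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    total_length = (total_length // block_size) * block_size   # round down to a full block
    return {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Three made-up "lines" of token ids with different lengths (10 ids in total).
toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks)
```

Ten ids give two full blocks of four, and the last two ids are discarded. Line boundaries from the original text do not survive the grouping.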

In [None]:
# fine-tune with the Trainer API for one epoch, then save the result
from transformers import Trainer, TrainingArguments
trainer = Trainer(model=model,
                  train_dataset=lm_training_data['train'],
                  args=TrainingArguments(
                      output_dir='distilgpt2-finetune-frankenstein20k',
                      num_train_epochs=1,
                      do_train=True,
                      do_eval=False
                  ),
                  tokenizer=tokenizer)
trainer.train()
trainer.save_model()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 66
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 9
  Number of trainable parameters = 81912576
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.






Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to distilgpt2-finetune-frankenstein20k
Configuration saved in distilgpt2-finetune-frankenstein20k/config.json
Configuration saved in distilgpt2-finetune-frankenstein20k/generation_config.json
Model weights saved in distilgpt2-finetune-frankenstein20k/pytorch_model.bin
tokenizer config file saved in distilgpt2-finetune-frankenstein20k/tokenizer_config.json
Special tokens file saved in distilgpt2-finetune-frankenstein20k/special_tokens_map.json
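
The numbers in the training log above are consistent with simple arithmetic: 66 examples in batches of 8 need 9 optimization steps for one epoch. The sketch assumes the final partial batch is kept and that gradient accumulation is 1, as the log reports.

```python
import math

num_examples = 66    # "Num examples" in the log above
batch_size = 8       # "Total train batch size" in the log above
num_epochs = 1       # num_train_epochs in TrainingArguments
block_size = 64      # block_size from the batching cell

steps_per_epoch = math.ceil(num_examples / batch_size)   # last batch holds only 2 examples
total_steps = steps_per_epoch * num_epochs
tokens_per_epoch = num_examples * block_size

print(total_steps, tokens_per_epoch)
```

At this scale the model sees only a few thousand tokens, so the fine-tuned output is unlikely to differ much from the base model's.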


In [None]:
# `generator` wraps the same `model` object, so it now uses the fine-tuned weights
generator("Two roads diverged in a yellow", max_length=100)[0]['generated_text']

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Two roads diverged in a yellow pickup truck, apparently as heavy rain had been making off from the town.\n\n\n\nThe last of the roads was made by the southern bank of Humberland. The road in its rear was not named for its size and was used as a way to bring tourists on a short journey to town. Another way of a route is to meet up. The journey takes two hours, and ends up a daylong journey, and by the time the road was'

In [None]:
# to use the new model in another project, load it from the saved directory
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

my_tokenizer = AutoTokenizer.from_pretrained('distilgpt2-finetune-frankenstein20k')
my_model = AutoModelForCausalLM.from_pretrained('distilgpt2-finetune-frankenstein20k')
my_generator = pipeline("text-generation", model=my_model, tokenizer=my_tokenizer)

my_generator("Two roads diverged in a yellow")[0]['generated_text']