# Reading and Writing Electronic Text
Spring 2024

Myrah Sarwar

## Final Assignment
#### [Link to documentation blog post](https://myrahsarwar.com/_rwet_final)

***

#### Installing things I need to install!

In [None]:
import sys
!conda install --prefix {sys.prefix} -y -c pytorch pytorch

In [None]:
import sys 
!{sys.executable} -m pip install accelerate -U

In [None]:
import sys
!{sys.executable} -m pip install transformers

In [37]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

In [38]:
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

In [39]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

#### Combining all of my text files by category

This will give me the flexibility to work with specific areas only if needed later on. Right now, my notes are separated by the main folder they were stored in within my notes application.

**all** of my files

In [5]:
filenames = ["final_txt/itp.txt", "final_txt/misc_tags.txt","final_txt/notes_icloud.txt","final_txt/notes_local.txt","final_txt/school_apps.txt","final_txt/umg.txt","final_txt/usc.txt"]

with open("all-text.txt", "w") as new_file:
    for name in filenames:
        with open(name) as f:
            for line in f:
                new_file.write(line)
            new_file.write("\n")

print("Concatenation complete. Check 'all-text.txt' for the result.")

Concatenation complete. Check 'all-text.txt' for the result.


**personal** files only

In [7]:
filenames = ["final_txt/notes_icloud.txt","final_txt/notes_local.txt"]

with open("personal-text.txt", "w") as new_file:
    for name in filenames:
        with open(name) as f:
            for line in f:
                new_file.write(line)
            new_file.write("\n")

print("Concatenation complete. Check 'personal-text.txt' for the result.")

Concatenation complete. Check 'personal-text.txt' for the result.


**academic + work** files only

In [8]:
filenames = ["final_txt/itp.txt","final_txt/umg.txt","final_txt/usc.txt"]

with open("acad-text.txt", "w") as new_file:
    for name in filenames:
        with open(name) as f:
            for line in f:
                new_file.write(line)
            new_file.write("\n")

print("Concatenation complete. Check 'acad-text.txt' for the result.")

Concatenation complete. Check 'acad-text.txt' for the result.


I realized afterward that I probably should have done this with one loop instead of three separate ones, but it's okay.
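For reference, the single-loop version could look roughly like this. It is a sketch, not what was actually run for this assignment: `categories`, `concatenate`, and `build_all` are names made up here, and the paths are the same ones used in the three cells above.

```python
# Sketch only: `categories`, `concatenate`, and `build_all` are hypothetical names.
# The file paths are copied from the three cells above.
categories = {
    "all-text.txt": [
        "final_txt/itp.txt", "final_txt/misc_tags.txt", "final_txt/notes_icloud.txt",
        "final_txt/notes_local.txt", "final_txt/school_apps.txt",
        "final_txt/umg.txt", "final_txt/usc.txt",
    ],
    "personal-text.txt": ["final_txt/notes_icloud.txt", "final_txt/notes_local.txt"],
    "acad-text.txt": ["final_txt/itp.txt", "final_txt/umg.txt", "final_txt/usc.txt"],
}

def concatenate(filenames, out_path):
    """Write every file in `filenames` into `out_path`, one after another."""
    with open(out_path, "w") as new_file:
        for name in filenames:
            with open(name) as f:
                for line in f:
                    new_file.write(line)
            # Newline after each file so its last line doesn't run into the next file
            new_file.write("\n")

def build_all(categories):
    """Run the concatenation once per output file."""
    for out_path, filenames in categories.items():
        concatenate(filenames, out_path)
        print(f"Concatenation complete. Check '{out_path}' for the result.")
```

Calling `build_all(categories)` from the notebook folder would then write all three files in one go.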

#### Now to train the model on my text:

In [None]:
import sys
!{sys.executable} -m pip install datasets

In [41]:
import datasets

In [42]:
training_data = datasets.load_dataset('text', data_files="all-text.txt")

In [43]:
tokenizer.pad_token = tokenizer.eos_token
tokenized_training_data = training_data.map(
    lambda x: tokenizer(x['text']),
    remove_columns=["text"]
)

In [44]:
block_size = 64

def group_texts(examples):
    # Flatten each column (a list of token lists) into one long list
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the leftover tokens that don't fill a whole block
    total_length = (total_length // block_size) * block_size
    # Split each column into chunks of block_size tokens
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # The labels are a copy of the inputs
    result["labels"] = result["input_ids"].copy()
    return result

lm_training_data = tokenized_training_data.map(
    group_texts,
    batched=True,
    batch_size=200
)
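To make the grouping step concrete, here is a toy version that runs on plain Python lists instead of a tokenized dataset. The token ids and the block size of 4 are invented for illustration, and `group_toy` is a hypothetical name; the logic mirrors `group_texts` above.

```python
# Toy illustration of the grouping step; the ids below are made up.
toy_block_size = 4

def group_toy(examples, block_size=toy_block_size):
    # Flatten each column (a list of lists) into one long list
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that doesn't fill a whole block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Labels are a copy of the inputs
    result["labels"] = result["input_ids"].copy()
    return result

# Three "notes" of different lengths, 10 tokens in total
toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
grouped = group_toy(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that the blocks ignore where one note ends and the next begins, and the last two tokens are dropped because they don't fill a block.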

In [45]:
from transformers import Trainer, TrainingArguments

In [46]:
trainer = Trainer(model=model,
                  train_dataset=lm_training_data['train'],
                  args=TrainingArguments(
                      output_dir='distilgpt2-finetune-erotic',
                      num_train_epochs=40,
                      do_train=True,
                      do_eval=False,
                  ),
                  tokenizer=tokenizer)

In [None]:
trainer.train()

In [None]:
model.to('cpu')

***

#### Testing outputs for "note titles" in three ways:
* text with a line break
* text in all caps (since I do this often in my notes)
* text with colon after (to make it clear it's a title)

In [None]:
generated_text = generator("today\n", max_length=100, truncation=True)[0]['generated_text']

generated_text_corrected = generated_text.replace("-", "\n-").replace(":", ":\n").replace("*", "\n*")

print(generated_text_corrected)

In [None]:
generated_text = generator("THINGS TO DO", max_length=50, truncation=True)[0]['generated_text']

generated_text_corrected = generated_text.replace("-", "\n-").replace(":", ":\n").replace("*", "\n*")

print(generated_text_corrected)

In [None]:
generated_text = generator("food I have:", max_length=40, truncation=True)[0]['generated_text']

generated_text_corrected = generated_text.replace("-", "\n-").replace(":", ":\n").replace("*", "\n*")

print(generated_text_corrected)
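The same `.replace()` chain is repeated in each of the cells above, so it could be pulled into a small helper. `format_note` is a hypothetical name, not something from the notebook, and the sample string is invented; the three replacements are the same ones used above.

```python
def format_note(text):
    """Put dashes and asterisks on their own lines and break after colons."""
    return text.replace("-", "\n-").replace(":", ":\n").replace("*", "\n*")

# Made-up sample standing in for generated text
sample = "food I have:- instant coffee- rice"
print(format_note(sample))
```

With the generator from earlier, this would be called on `generator(...)[0]['generated_text']`. One caveat of the plain replace is that hyphenated words get split onto two lines as well.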

**Some results (@ 20 epochs):**

~~~
my feelings today:

 are these emotions justified? are these unspoken few feelings really justified or do they just make me feel worthless and worthless instead of understanding and being humanlikeI feel strongly about the existence of the universe because we are all created equal, as individuals, but the term has become too misnomer and misconstrued and misconstruedMy comfort level is low because I don’t know why I’m so misinformed or why I think I have such a big interest in things like the things I care aboutI am so afraid to let too much into things like these other times. My life is too filled with me but this is why I feel so angry a lot for the dayIt’s always hard to change. life has to be joyful and feel so empty and feel empty every time so I try to feel like I need to be myself anymore I realize that everything is fine When I feel like I feel sad, I feel like it goes away

----

to do list

- check out list of tasks    
- find tasks through list of tasks, you want to describe the tasks
- tasks    
- you can create new ones
- remember to put the output of each task in a separate file so it is easier to find and find things that you are looking for    
- you can create a list of tasks of any given task (e.g., go up to 2 different managers, right?)

----

list

- [x] remi mac n cheese
- [x] salad tin (vegetables)
- [x] sour cream
- [x] cheese
- [x] paprika
- [x] saltine crackers
~~~


**@ 40 epochs:**

~~~
spotify playlist:
    
- mix of ambient / rock 'n stuff or whatever

---

places:

- alison mcdonalds pizza place(??)
- bovard, the tailor (undiscovered)wilshire vintage (thanksgiving)things to consider things to keep in mind
- vegetarian options good coffee places if

---

STUDY PLAN:

- slow down, take steps
- slowly take a few steps
- slowly get used to confidence again and be mindful
- Don’t take pr too long 
- you will only feel certain things during practice and you will feel happy even just a few moments. Today is one of many times so try to focus on your mundane tasks. Today, you might not find things that are important so try to feel good/peaceful enough so you can focus on your task list

---

list of my accomplishments:

What did I do? What did I enjoy most? Are accomplishments I failed to accomplish?

---

list of my accomplishments:
    
- 25k followers    
- 478 followers

---

list of my accomplishments:
 
- completed homework in May of this year

---

stuff to do

- don’t act too full of pretentious shit

---

food I have

- instant coffee
- instant coffee 
- great value of energy (which I feel california consumes)

~~~

***
Using "I" and "It" only (which I manually chose) as the starting point.

In [None]:
generated_text = generator("I am")[0]['generated_text']

generated_text_corrected = generated_text.replace("-", "\n-").replace(":", ":\n").replace("*", "\n*")

print(generated_text_corrected)

**Example outputs:**

~~~
I’m sorry for you. I was just getting close enough to escaping or something. I wanted to die.Then

----

It’s a little hard to keep track lol   <3

----

It is a company. I worked at supermarkets for more than a million years

----

I am convinced that there is life i have, and that is the way life can be

----

I’ve got a plan for my life that is completely different from what was happening there and I think that’s the direction i was supposed to go in but it turned out that i

---

I am the most important person on this planet. and this is because i feel this way because of all the things that are causing me just as much pain as the pain i feel myself living. I would personally rather live without me now

---

I am the only one who is manipulating the world.  I have the power to change things for the betterment of the your life, so i can’t change either. hope this will be my last class

~~~


***
Looking at the above results together makes me think it could be interesting to generate multiple lines to make actual poetry (which is what I should have done anyway for the class since that is the point), so let's try it.

4 lines, each starting with the word "I", with a smaller `max_length` (counted in tokens, not words) on each line:

In [None]:
length = 15
for i in range(4):
    generated_text = generator("I", max_length=length, truncation=True)[0]['generated_text']
    print(generated_text + "\n")
    # Shorten the limit for the next line
    length -= 3

Some outputs:

~~~
I’ve spent time in similar communities as well as places like reddit

I’d like to talk to her again, but

I’d rather stay in the shadows

I will always have a story

---

I’ve tried so hard to keep quiet I can’

I just need more motivation to be authentic as a personI

I - I feel like I alone in life

I
~~~
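Since `max_length` is measured in tokens rather than words, that may be why some of the lines above stop mid-word. If a strict word count per line is wanted, one option is to generate a bit extra and trim afterwards. `first_n_words` is a hypothetical helper, and the sample line is taken from the outputs above.

```python
def first_n_words(text, n):
    """Keep only the first `n` whitespace-separated words of `text`."""
    return " ".join(text.split()[:n])

line = "I’d like to talk to her again, but"
# Trim the same line to a decreasing number of words
for n in (8, 6, 4, 2):
    print(first_n_words(line, n))
```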

Again, but starting each line with a random preposition or pronoun. I'm dropping the decreasing line length because it wasn't contributing much.

In [344]:
prepositions = ["about", "above", "across", "after", "against", "along", "among", "around", "at", "before", 
                "behind", "below", "beneath", "beside", "between", "beyond", "by", "down", "during", "except", 
                "for", "from", "in", "inside", "into", "like", "near", "of", "off", "on", "onto", "out", "outside", 
                "over", "past", "since", "through", "throughout", "to", "toward", "under", "underneath", "until", 
                "up", "upon", "with", "within", "without"]

pronouns = ["I", "you", "he", "she", "it", "we", "they", "me", "him", "her", "us", "them", "myself", "yourself", 
            "himself", "herself", "itself", "ourselves", "yourselves", "themselves", "mine", "yours", "his", "hers", 
            "its", "ours", "theirs", "this", "that", "these", "those", "who", "whom", "whose", "which", "what", "whatever", 
            "whoever", "whichever", "whomever", "my", "your", "our", "their"]

prep_pro = prepositions + pronouns

In [None]:
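import random

length = 10
for i in range(4):
    # Pick from the combined list so pronouns can be chosen too
    starting_word = random.choice(prep_pro)
    generated_text = generator(starting_word, max_length=length)[0]['generated_text']
    # Note: capitalize() uppercases the first character and lowercases the rest
    generated_text = generated_text.capitalize()
    print(generated_text + "\n")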

Most of those outputs didn't make much sense.

***
#### Markov chains!

I'm also trying Markov chains in case those results somehow end up being better. (They were not.)

In [None]:
import sys
!{sys.executable} -m pip install markovify

import markovify

Once again separating by categories:

In [466]:
with open("all-text.txt") as f:
    all_text = f.read()
all_text_generator = markovify.Text(all_text, state_size=1)

In [467]:
with open("personal-text.txt") as f:
    personal_text = f.read()
personal_text_generator = markovify.Text(personal_text, state_size=1)

In [468]:
with open("acad-text.txt") as f:
    acad_text = f.read()
acad_text_generator = markovify.Text(acad_text, state_size=1)

These are generating sentences at random without me giving any direction. That's probably fine otherwise, but it's not what I want for this. I've also separated based on content to see if the outputs might make more sense.
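For intuition about what `state_size=1` means: each next word is chosen by looking only at the single word before it. The toy chain below is a rough sketch of that idea in plain Python, not how markovify is implemented. `build_chain`, `walk`, and the sample corpus are invented for illustration.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in `text`."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def walk(chain, start, max_words=10, rng=random):
    """Follow the chain from `start`, picking a random follower at each step."""
    out = [start]
    # Stop at max_words or when the last word has no known follower
    while len(out) < max_words and chain.get(out[-1]):
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

# Made-up corpus in the style of a short note
corpus = "i have coffee i have rice i need more coffee"
chain = build_chain(corpus)
print(walk(chain, "i"))
```

Because only one word of context is kept, the output tends to wander quickly, which fits with the results here not improving on the fine-tuned model.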

In [None]:
print(all_text_generator.make_sentence())

In [None]:
print(personal_text_generator.make_short_sentence(100))

In [None]:
print(acad_text_generator.make_sentence())

In [None]:
length = 90

for i in range(3):
    print(all_text_generator.make_short_sentence(length))
    print("")
    length -= 15

This is not really working how I need it to because I want to be able to specify the first words. I think the closest I can get here is `make_sentence_with_start`, but as far as I can tell it only draws on sentences in my notes that start with those exact words (rather than sentences where they occur anywhere).

In [None]:
generated_text = all_text_generator.make_sentence_with_start(beginning="it")
print(generated_text)

I thought maybe the language would be more consistent if I chained the sentences, but that doesn't work either because the last word of the previous sentence (usually) is too specific to occur at the start anywhere else. It would always throw me an error, like below. I also don't like that it's giving me full sentences when a good chunk of my notes are in bullet point format.

In [1317]:
sentences = 3
paragraph = ""

paragraph += all_text_generator.make_sentence() + " "

for i in range(sentences - 1):
    sentence = all_text_generator.make_sentence_with_start(beginning=paragraph.split()[-1])
    paragraph += sentence + " "
print(paragraph)

ParamError: `make_sentence_with_start` can't find sentence beginning with cry.
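One possible contributor to this error (an assumption, not verified against my notes) is that the seed word keeps its trailing punctuation, so the lookup is for `cry.` rather than `cry`. A hedged workaround is to strip punctuation from the seed and fall back to an unseeded sentence whenever a seeded one can't be made. In the sketch below, `clean_seed` and `chain_sentences` are hypothetical helpers, and `model` is assumed to be any object with markovify's `make_sentence` and `make_sentence_with_start` methods, such as `all_text_generator`.

```python
import string

def clean_seed(paragraph):
    """Return the last word of `paragraph` without surrounding ASCII punctuation."""
    words = paragraph.split()
    return words[-1].strip(string.punctuation) if words else ""

def chain_sentences(model, sentences=3):
    """Build a paragraph where each sentence tries to start with the previous last word."""
    paragraph = ""
    for _ in range(sentences):
        sentence = None
        seed = clean_seed(paragraph)
        if seed:
            try:
                sentence = model.make_sentence_with_start(beginning=seed)
            except Exception:
                # An error is raised when no sentence can be started from `seed`
                sentence = None
        if sentence is None:
            # Fall back to an unseeded sentence (which may itself be None)
            sentence = model.make_sentence()
        if sentence:
            paragraph += sentence + " "
    return paragraph.strip()
```

In the notebook this would be called as `print(chain_sentences(all_text_generator, sentences=3))`. The fallback means the chain can break between sentences, so it trades consistency for not crashing.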