# Final Project - _Writing with Algorithms_ 
### Jenifer Gaitan

This is my final project. It uses Max Woolf's Interactive textgenrnn Demo w/ GPU lab as the basis for the recurrent neural network in my work.


## Intentions

For this project, I'd like to draw upon my background as someone who has studied History and Women, Gender, and Sexuality Studies. Women authors remain underrepresented in literature. Additionally, in many well-known literary texts women are portrayed in the traditional roles of mothers, wives, and daughters. In essence, their relationships with men are central to their identity. With this in mind, I chose Jane Eyre as my text for close analysis because it is one of the top 25 most downloaded texts from the Gutenberg corpus. Furthermore, it was written by a female author in the mid-1800s. Charlotte Brontë, like all of her sisters, used a masculine pen name when publishing her book. This illustrates the conditions under which early women novelists wrote and shared their literature worldwide. Having previously studied the book, I know it depicts many relationships between male and female characters. Jane Eyre is known for centering a woman’s experience in a journey of self-discovery that goes beyond seeking fulfillment and an identity tied to a man.

I am inspired by the work of Allison Parrish in her creation of Our Arrival. In reading her computer-generated poetry, I felt that the absence of people in the text was unique: it made me ponder the poem more closely, searching between the lines for where people tie in. It also made me question the relationship between humans and nature, including how I had understood this relationship in texts I had read previously. I will use natural language processing techniques to extract sentences which refer to women from Jane Eyre: An Autobiography by Charlotte Brontë. Literature is largely up to the interpretation of every reader. I’d like to explore how these interpretations can change when computer analysis and algorithms present literature in a new light.

As mentioned above, I will use Max Woolf's lab to train a model that generates text based on Jane Eyre. I will use this model to generate copious amounts of text, from which I will extract the sentences about women for further analysis. I will use POS tagging and frequency distributions to find the most common words, such as adjectives and verbs, in order to understand the language that is used in text about women. My hope is that this project will further highlight the impact of early women authors and allow the reader to explore the role of gender in this 19th-century novel and in any text they read going forward. I am a firm believer that a creator’s identity and lived experiences affect their writing and other creative endeavors. This is also why I believe that it is important to study how women write about women, as it often contrasts with how men write about women. Using a recurrent neural network is a new tool for studying literature, but also one way to see how even modern technologies are complicit in gender biases. I also hope that projects such as this one can help to create a bridge between the humanities and technology, which in my undergraduate experience have had a deep divide between them.


The cells below set up the code dependencies for the model.

In [None]:
%tensorflow_version 1.x

In [None]:
!pip install -q textgenrnn
from google.colab import files
from textgenrnn import textgenrnn
from datetime import datetime
import os

from nltk import tokenize, word_tokenize
from nltk import pos_tag
from nltk.book import *
from nltk.corpus import stopwords

Instructions from Woolf:

Set the textgenrnn model configuration here: the default parameters here give good results for most workflows. (see the [demo notebook](https://github.com/minimaxir/textgenrnn/blob/master/docs/textgenrnn-demo.ipynb) for more information about these parameters)

If you are using an input file where documents are line-delimited, make sure to set `line_delimited` to `True`.

In [None]:
model_cfg = {
    'word_level': False,   # set to True if want to train a word-level model (requires more data and smaller max_length)
    'rnn_size': 128,   # number of LSTM cells of each layer (128/256 recommended)
    'rnn_layers': 3,   # number of LSTM layers (>=2 recommended)
    'rnn_bidirectional': False,   # consider text both forwards and backward, can give a training boost
    'max_length': 30,   # number of tokens to consider before predicting the next (20-40 for characters, 5-10 for words recommended)
    'max_words': 10000,   # maximum number of words to model; the rest will be ignored (word-level model only)
}

train_cfg = {
    'line_delimited': False,   # set to True if each text has its own line in the source file
    'num_epochs': 20,   # set higher to train the model for longer
    'gen_epochs': 5,   # generates sample text from model after given number of epochs
    'train_size': 0.8,   # proportion of input data to train on: setting < 1.0 limits model from learning perfectly
    'dropout': 0.0,   # ignore a random proportion of source tokens each epoch, allowing model to generalize better
    'validation': False,   # If train_size < 1.0, test on holdout dataset; will make overall training slower
    'is_csv': False   # set to True if file is a CSV exported from Excel/BigQuery/pandas
}


Upload **any text file** and update the file name in the cell below, then run the cell.

In [None]:
file_name = "Jane_Eyre.txt"
model_name = 'colaboratory'   # change to set file name of resulting trained models/texts

The cells below begin the training.

In [None]:
textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file

train_function(
    file_path=file_name,
    new_model=True,
    num_epochs=train_cfg['num_epochs'],
    gen_epochs=train_cfg['gen_epochs'],
    batch_size=1024,
    train_size=train_cfg['train_size'],
    dropout=train_cfg['dropout'],
    validation=train_cfg['validation'],
    is_csv=train_cfg['is_csv'],
    rnn_layers=model_cfg['rnn_layers'],
    rnn_size=model_cfg['rnn_size'],
    rnn_bidirectional=model_cfg['rnn_bidirectional'],
    max_length=model_cfg['max_length'],
    dim_embeddings=100,
    word_level=model_cfg['word_level'])

In [None]:
# this temperature schedule cycles between 1 very unexpected token, 1 unexpected token, 2 expected tokens, repeat.
# changing the temperature schedule can result in wildly different output!
temperature = [1.0, 0.5, 0.2, 0.2]   
prefix = None   # if you want each generated text to start with a given seed text

if train_cfg['line_delimited']:
  n = 1000
  max_gen_length = 60 if model_cfg['word_level'] else 300
else:
  n = 1
  max_gen_length = 2000 if model_cfg['word_level'] else 10000
  
timestring = datetime.now().strftime('%Y%m%d_%H%M%S')
gen_file = '{}_gentext_{}.txt'.format(model_name, timestring)

textgen.generate_to_file(gen_file,
                         temperature=temperature,
                         prefix=prefix,
                         n=n,
                         max_gen_length=max_gen_length)
files.download(gen_file)

The code above downloads a file containing all of the generated text. The analysis below reads that text from generated_poems.txt.
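To give a sense of what the values in the temperature schedule do, here is a minimal sketch of the general idea behind temperature sampling. It is not textgenrnn's own code: the helper `apply_temperature` and the probabilities in `next_char_probs` are hypothetical and exist only for illustration. Lower temperatures push the probability toward the most likely character, which is why 0.2 gives "expected" tokens and 1.0 gives more surprising ones.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature.

    Illustrative helper (an assumption, not textgenrnn's API): each
    probability is raised to the power 1/temperature, then the values
    are renormalized so they sum to 1. All probabilities must be > 0.
    """
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical next-character probabilities from a trained model.
next_char_probs = [0.6, 0.3, 0.1]

# The three distinct values used in the temperature schedule.
for t in [1.0, 0.5, 0.2]:
    rescaled = apply_temperature(next_char_probs, t)
    print(t, [round(p, 3) for p in rescaled])
```

At a temperature of 1.0 the distribution is unchanged, while at 0.2 nearly all of the probability lands on the most likely character.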



In [None]:
# Read the generated text line by line; the with-block closes the file afterwards
with open("generated_poems.txt") as f:
    text_sents = f.readlines()


In [None]:
# Here I am lowercasing each line and removing the line breaks (\n); empty lines are skipped

text_sents = [sent.lower().replace("\n", "") for sent in text_sents if sent != "\n"]  


Below I'll be turning the list above into a string in order to tokenize then POS tag it.

In [None]:
female_sents_string = ' '.join(text_sents)

In [None]:
words = female_sents_string.split()
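# The code below removes stop words ("i", "me", "its", etc.), which are not useful when analyzing the text.
# Building the set once avoids reloading the stop word list for every word.
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]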

fdist_words = FreqDist(words)
fdist_words.most_common(15)

[('strange', 109),
 ('still', 89),
 ('see', 87),
 ('“i', 77),
 ('stranger', 75),
 ('time', 70),
 ('one', 67),
 ('little', 65),
 ('stairs', 60),
 ('house', 60),
 ('could', 58),
 ('would', 51),
 ('sure', 50),
 ('seemed', 49),
 ('said', 47)]
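The entry `“i` in the list above appears because `split()` leaves punctuation, including curly quotation marks, attached to words, so it is not recognized as the stop word "i". One possible refinement, sketched here with only the standard library, is to strip punctuation from the edges of each word before counting. The function `count_content_words`, the tiny `STOP_WORDS` set, and the sample sentence are hypothetical stand-ins; the notebook itself uses NLTK's English stop word list and `FreqDist`.

```python
import string
from collections import Counter

# Hypothetical, very small stop list; the notebook uses NLTK's English list.
STOP_WORDS = {"i", "the", "a", "and", "was", "to"}

# Curly quotes are not part of string.punctuation, so add them explicitly.
STRIP_CHARS = string.punctuation + "“”‘’"

def count_content_words(text):
    """Count words after stripping edge punctuation and removing stop words."""
    cleaned = (word.strip(STRIP_CHARS) for word in text.lower().split())
    return Counter(w for w in cleaned if w and w not in STOP_WORDS)

# Made-up sample in the style of the generated text.
sample = "“I was still strange,” said the stranger. “I see the stairs.”"
counts = count_content_words(sample)
print(counts.most_common(3))
```

Applying this kind of cleaning to the generated text would change the counts shown above, since words such as `“i` would be merged with their unquoted forms and then filtered out.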

In [None]:
female_tokenized_string = tokenize.word_tokenize(female_sents_string)

In [None]:
pos_tagged_female = pos_tag(female_tokenized_string) 

In [None]:
import nltk, re, pprint
from nltk import corpus, sent_tokenize, pos_tag

In [None]:
# This is the list of names categorized as female in the NLTK names corpus, which will be made lowercase

female_names = corpus.names.words('female.txt')
female_names = [name.lower() for name in female_names]

In [None]:
# This is how I will determine if a sentence is about a woman

woman_identifiers = ['she', 'her', 'hers', 'mrs', 'miss','missus'] + female_names

In [None]:
# This checks each sentence for any of the identifiers and collects the sentences that mention women.

female_sents = []

for sent in text_sents:
    if any(ident in sent.split() for ident in woman_identifiers):
        female_sents.append(sent)

# Below all the words in the sentences about women are being POS tagged, with the first example being printed

female_text = " ".join(female_sents)
female_text_tags = pos_tag(tokenize.word_tokenize(female_text))


[('district', 'NN'), ('and', 'CC'), ('derevative', 'JJ'), ('and', 'CC'), ('standing', 'VBG'), ('that', 'IN'), ('so', 'RB'), ('much', 'JJ'), ('on', 'IN'), ('the', 'DT'), ('stairs', 'NN'), (';', ':'), ('the', 'DT'), ('strange', 'NN'), ('was', 'VBD'), ('still', 'RB'), ('distress', 'JJ'), ('.', '.'), ('i', 'NN'), ('resolved', 'VBD'), (',', ','), ('and', 'CC'), ('this', 'DT'), ('specieless', 'NN'), ('were', 'VBD'), ('all', 'PDT'), ('the', 'DT'), ('sweetest', 'JJS'), ('that', 'IN'), ('she', 'PRP'), ('would', 'MD'), ('not', 'RB'), ('be', 'VB'), ('attentive', 'JJ'), ('to', 'TO'), ('make', 'VB'), ('for', 'IN'), ('me', 'PRP'), (',', ','), ('and', 'CC'), ('i', 'RB'), ('wanted', 'VBD'), ('to', 'TO'), ('inform', 'VB'), ('you', 'PRP'), ('as', 'IN'), ('a', 'DT'), ('more', 'JJR'), ('than', 'IN'), ('suffer', 'NN'), (',', ','), ('together', 'RB'), ('the', 'DT'), ('sound', 'NN'), ('of', 'IN'), ('proposing', 'VBG'), ('where', 'WRB'), ('we', 'PRP'), ('had', 'VBD'), ('now', 'RB'), ('been', 'VBN'), ('discove
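Because the filter compares identifiers against `sent.split()`, a word followed directly by punctuation (for example "she," or "mrs.") is not matched. The sketch below shows one possible variation that extracts whole words with a regular expression and uses a set for the lookup. The function `mentions_woman`, the shortened `IDENTIFIERS` set, and the sample sentences are hypothetical; this is not the method used to produce the results in this notebook, and it would select a somewhat different set of sentences.

```python
import re

# Small hypothetical identifier set; the notebook also adds NLTK's female names.
IDENTIFIERS = {"she", "her", "hers", "mrs", "miss", "missus", "jane"}

def mentions_woman(sentence, identifiers=IDENTIFIERS):
    """Return True if any identifier appears as a whole word in the sentence."""
    # Keep only runs of letters, so "she," and "mrs." become "she" and "mrs".
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return not tokens.isdisjoint(identifiers)

# Made-up sentences in the style of the generated text.
sentences = [
    "i resolved that she, at least, would stay.",
    "mrs. fairfax was on the stairs.",
    "the house was still.",
]
print([s for s in sentences if mentions_woman(s)])
```

Matching on whole words also means that "there" is not mistaken for "her".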

In [None]:
# Here I'll be looking for the eight most common tags in female sentences

tag_female = nltk.FreqDist(tag for (word, tag) in female_text_tags)
tag_female.most_common()[:8]

[('NN', 2472),
 ('DT', 1892),
 ('IN', 1581),
 ('VBD', 847),
 ('JJ', 810),
 ('CC', 764),
 ('PRP', 604),
 ('RB', 591)]

In [None]:
# This is a list of the most common comparative adjectives (JJR) and superlative adjectives (JJS) in the female sentences

female_pos_tagged_jj = [token_tag_pair[0] for token_tag_pair in female_text_tags if token_tag_pair[1].startswith("JJR") or token_tag_pair[1].startswith("JJS") ]

fdist_jj = FreqDist(female_pos_tagged_jj)
print(fdist_jj)
fdist_jj.most_common(10)

<FreqDist with 10 samples and 36 outcomes>


[('best', 9),
 ('more', 8),
 ('least', 6),
 ('longer', 4),
 ('sweetest', 2),
 ('better', 2),
 ('most', 2),
 ('shoulder', 1),
 ('strangest', 1),
 ('worst', 1)]

In [None]:
# This is a list of the most common present participle verbs (VBG)

female_pos_tagged_vb = [token_tag_pair[0] for token_tag_pair in female_text_tags if token_tag_pair[1].startswith("VBG")]

fdist_vb = FreqDist(female_pos_tagged_vb)
print(fdist_vb)
fdist_vb.most_common(10)

<FreqDist with 73 samples and 104 outcomes>


[('standing', 5),
 ('going', 5),
 ('looking', 4),
 ('talking', 4),
 ('being', 4),
 ('sitting', 3),
 ('writing', 3),
 ('parting', 3),
 ('covering', 2),
 ('passing', 2)]

In looking at the adjectives and verbs used in the sentences about women, we can draw conclusions about the context in which women are written about. Women are described as "sweetest" but also "worst", and the common verbs give insight into the lives of the characters of Jane Eyre: their travel ("going", "parting") and their activities ("looking", "talking", "sitting", "writing"). This approach can be replicated to extract many additional types of information from the sentences.
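As one way of replicating the approach for other kinds of words, the adjective and verb cells could be folded into a single helper that takes the tag prefixes as an argument. This is only a sketch: `most_common_for_tags` is a hypothetical name, the `tagged` sample is made up in the same `(word, tag)` shape that `pos_tag` returns, and `collections.Counter` stands in for NLTK's `FreqDist`, which is a subclass of it.

```python
from collections import Counter

def most_common_for_tags(tagged_tokens, tag_prefixes, n=10):
    """Return the n most common words whose POS tag starts with any given prefix."""
    prefixes = tuple(tag_prefixes)  # str.startswith accepts a tuple of prefixes
    words = [word for word, tag in tagged_tokens if tag.startswith(prefixes)]
    return Counter(words).most_common(n)

# Hypothetical sample of (word, tag) pairs.
tagged = [("the", "DT"), ("sweetest", "JJS"), ("standing", "VBG"),
          ("strange", "NN"), ("was", "VBD"), ("still", "RB"),
          ("standing", "VBG")]

print(most_common_for_tags(tagged, ["VBG"]))          # present participles
print(most_common_for_tags(tagged, ["JJR", "JJS"]))   # comparatives and superlatives
```

Passing `["NN"]` or `["RB"]` instead would give the most common nouns or adverbs from the same tagged list.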