
11-785_project

Authors:

Date: 11-30-2020

Description:

  • 11-785 Group Project: YouShen Poetry generation

Goals:

  • The goal of this project was to create a novel deep learning model for poetry generation.

Constraints:

  • We decided to constrain the problem to the limerick poetic form.
  • Limericks are five-line rhyming poems with the rhyme scheme AABBA

Model Architecture:

  • We fine-tuned the 117M-parameter GPT-2 model on a corpus of limericks (see the Models section below)

Hardware:

  • We trained the model on an NVIDIA Tesla V100

Database:

Preprocessing:

  • We removed all punctuation
  • We converted all numbers to text
  • We removed all poems that did not conform to the structure above
  • We appended an <|endoftext|> token to the end of each poem (see the sketch below)
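
The following is a minimal sketch of these cleanup steps, assuming the num2words package for spelling out digits; the helper name clean_poem is hypothetical, and the project's actual preprocesser.ipynb may differ in detail.

    import re
    import string

    from num2words import num2words  # digit-to-word conversion, e.g. 3 -> "three"

    END_TOKEN = "<|endoftext|>"
    # Keep apostrophes: the processed example later in this README retains "cap'n".
    PUNCT_TO_DROP = string.punctuation.replace("'", "")

    def clean_poem(poem):
        """Spell out digits, strip punctuation, and append the end-of-text token."""
        poem = poem.lower()
        poem = re.sub(r"\d+", lambda m: num2words(int(m.group())), poem)
        poem = poem.translate(str.maketrans("", "", PUNCT_TO_DROP))
        return poem.strip() + "\n" + END_TOKEN

    print(clean_poem("Cap'n Jack was washed over the side."))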

Training Time:

  • The model was trained for 24 GPU-hours
  • The final loss was ~0.90

Evaluation Metrics:

  • We implemented a Rhyming evaluation (see the sketch below)
  • We implemented a Coreference evaluation
  • We implemented a Nonsense word evaluation
  • We also set up a website HERE where we had humans evaluate poems generated by our model against poems from the training dataset
    • This is the best way we can evaluate the success of our system
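
The following is a minimal sketch of the rhyme check, assuming the pronouncing package (a CMU Pronouncing Dictionary wrapper); the helper names rhyme_key and follows_aabba are hypothetical, and the actual rhyming_evaluation.ipynb may take a different approach.

    import pronouncing  # CMU Pronouncing Dictionary lookups

    def rhyme_key(word):
        """Phonemes from the last stressed vowel onward, or None if out of vocabulary."""
        phones = pronouncing.phones_for_word(word.lower())
        return pronouncing.rhyming_part(phones[0]) if phones else None

    def follows_aabba(lines):
        """True if the five line-final words follow the AABBA rhyme scheme."""
        if len(lines) != 5:
            return False
        keys = [rhyme_key(line.split()[-1]) for line in lines]
        if any(key is None for key in keys):
            return False  # cannot verify an out-of-vocabulary word
        a1, a2, b1, b2, a3 = keys
        return a1 == a2 == a3 and b1 == b2 and a1 != b1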

Downsampling:

  • Of 8000 unconditionally generated poems, 1000 scored well enough to pass the three metrics described above and were used for user testing (see the sketch below)
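
A minimal sketch of this filtering step, assuming each metric is wrapped in a boolean check; the check names in the usage comment are hypothetical, and follows_aabba from the rhyme sketch above could serve as the rhyming check.

    def downsample(poems, checks):
        """Keep only the poems that pass every automatic metric in checks."""
        return [poem for poem in poems if all(check(poem) for check in checks)]

    # Hypothetical usage:
    # kept = downsample(generated_poems, [rhyme_check, coreference_check, nonsense_check])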

Deliverables

Key Resources

Data

Models

  • 117M-Clean (Gwern Model): https://mega.nz/#!2PhghaZD!_IJPpErXIRIDwRI0ktq2UKUZClDEoY7z8UpF28_qme8

  • 117M-Clean-Lym (Note: model is too large to store on GitHub; contact Mitch to share)

    • Train time: 21 hrs
    • Loss: 0.09
  • 117M-AA (Note: model is too large to store on GitHub; contact Mitch to share)

    • Train time: 40 hrs
    • Loss: 0.11
  • 117M-AABB (Note: model is too large to store on GitHub; contact Mitch to share)

    • Train time: 40 hrs
    • Loss: 0.1
  • 117M-limerick (Note: model is too large to store on GitHub; contact Mitch to share)

    • Train time: 40 hrs
    • Loss: 0.26

Samples (Experiment1 Model)

1:

caboyola's a genus of weeds

that grows near the shore and seeds seeds

or these shrubs found beside

are quite furry each side

<|endoftext|>

2:

a person who's often so rude

takes a tack of a beach that's subdued

in a business the lad

is more childish than bad

<|endoftext|>

3:

an episcopal practice i'm told

is quite certain to fight for our gold

to get gold from the king

to be saved from the thing

<|endoftext|>

4:

this is all about grandma who's proud

of her years in society's crowd

she has got a big raise

in those fungal-type ways

<|endoftext|>

Experiment 1

Create corpus of limericks:

[A |$| A]

[B |$| B]

[END]

Limericks Definition

  • 5 Line Rhyming Poem
  • Rhyming Structure: A A B B A

Raw Data Example:

cap'n jack was washed over the side.

his crew searched but found not hair nor hide.

no longer the helm,

but the deep benthic realm,

is where jack will forever reside.

Processed Data Output:

["cap'n jack was washed over the side|$|his crew searched but found not hair nor hide"]

['no longer the helm|$|but the deep benthic realm']

['<|endoftext|>']
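
Below is a minimal sketch of this formatting step, assuming a cleaned five-line limerick as input; the helper name format_limerick is hypothetical. The fifth line does not appear in the processed output above, so the sketch drops it.

    SEP = "|$|"
    END_TOKEN = "<|endoftext|>"

    def format_limerick(lines):
        """Pair the A lines and the B lines with |$|, then emit the end token."""
        a1, a2, b1, b2, _a3 = [line.strip().rstrip(".,") for line in lines]
        return [a1 + SEP + a2, b1 + SEP + b2, END_TOKEN]

    raw = [
        "cap'n jack was washed over the side.",
        "his crew searched but found not hair nor hide.",
        "no longer the helm,",
        "but the deep benthic realm,",
        "is where jack will forever reside.",
    ]
    print(format_limerick(raw))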

Files Included

  • preprocesser.ipynb -> Jupyter Notebook for preprocessing raw data

  • rhyming_evaluation.ipynb -> Jupyter Notebook to evaluate the rhyming success of output samples

Setting up

Downloading the data files required to run the app/notebook experiments

Step 1: Create a .env file with the required variables. See the .env.sample template for pointers; the AWS credentials will be on our Slack. We will add a script for public download later on.

Step 2: Run the setup script:

bash setup.sh

How to move forward?

  1. Better data preprocessing, now that we know how GPT-2 matches the structure of its input - Xinkai

  2. Finding ways to evaluate outputs quantitatively

    • Rhyming - Chris
    • Non-sense words - Mitch
    • Pronoun reference - Tony
    • Action reference
  3. Change GPT2

    • Loss Function
    • NOTE: This cannot be done until automatic (non-human) quantitative evaluation methods are in place

PS: Setting up HTTP Access on EC2 instances:

Update:

Alright, I just figured out that some steps were not needed at all. Should be really simple.

Some details:

  1. Launch instances

    • Amazon Linux 2 AMI (HVM), SSD Volume Type - ami-03657b56516ab7912 (64-bit x86) / ami-023b120e01f4779c1 (64-bit Arm)

    (The first one in the free tier group)

    (Note that the username is ec2-user instead of ubuntu)

    (Not recommended because many libraries (including pip, flask) need manual installation)

    (Install pip:)

    $ curl -O https://bootstrap.pypa.io/get-pip.py
    $ python get-pip.py --user
    

    (We will probably need to install Python 3 later for our project.)

  2. On the Configure Security Group page

    • Add Rule "Custom TCP Rule", where the Port Range must cover the port number used by our web app
    • The source IP can be set to "0.0.0.0/0, ::/0" just for now
  3. In app.py

    • The listening host should be set to "0.0.0.0" (all interfaces) so the app is reachable from outside the instance (see the sketch below)
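
    (A minimal sketch of such an app.py, assuming Flask and a hypothetical port 8080; use whatever port the security-group rule above allows. The project's real app.py will differ.)

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "poem evaluation placeholder"

    if __name__ == "__main__":
        # Bind to 0.0.0.0 (all interfaces) so the EC2 public IP can reach the app.
        app.run(host="0.0.0.0", port=8080)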
