GitHub

An Analysis of Contemporary American Poetry Using K means Clustering & NLTK

####General Assembly Data Science Course Final Project November 2015

Specific Aim

For this project I applied Natural Language Processing methods to break down the corpus of Contemporary American Poems as scraped from the Academy of American Poets website into numerical representations of the text in terms of style.

Then I used an unsupervised learning method, K-means clustering, to see if these poems would naturally cluster into groups that may or may not represent meaningful stylistic differences.

###Design please note that this is a work in progress and I may make a ton of changes

please check out the Final Project powerpoint presentation that I created for the class for a quick overview of process

####Libraries Used

Scrapy
Pymongo
Pandas
Numpy
SciKit Learn
NLTK
Seaborn

####Data Storage

Mongodb

###Steps to Reproduce

1 Extract From Academy of American Poets Website using Scrapy project

Currently the project is set to pipe to Mongo database however, to output to json or other format, please change the configurations in settings.py
Go to Extract/poetryorg directory and run from terminal $ scrapy crawl poetryspider

2 Optional: Store in Mongo Database with two collections: Poets & Poems

3 Go to Transform directory and run Ipython notebooks PoetDataProcessing then PoemDataProcessing (PoemDataProcessing can take up to 3 hours)

4 Make sure you have ExtractHelper.py in the directory as it contains all the helper methods (docstrings available) needed to extract feature variables

5 Go to Analysis directory and run ipython notebook Feature Selection

6 Run ipython notebook Kmeans

7 Run ipython notebook Kmeans_with_scaled

###About the Raw Data

1,467 poems were scraped from twenty 21st century poetic movements and represent the works of 349 poets published between 1900 and 2015
There were many different ways of obtaining this data using Scrapy but I chose to scrape only the works from poets that were tagged to a specific movement
I also scraped the biographical and geographical (if any) information from the poets because why not? Perhaps something could be done with that information if not now, then later
Some poets and their works had additional movement tags

###Feature Variables Defined Below

Average line per stanza - the average number of lines that a stanza has within a poem
Enjambent Score – this is the total number of line breaks within each sentence divided by the total number of sentences in the poem
Type Token Ratio – number of distinct words over the total number of word
Abstraction Score - a score representing the abstractness of the poem. It is calculated by 1) tokenizing each sentence into words, 2) using nltk’s parts of speech tagger to label each word, 3) taking each noun tagged word and looking up the noun synsets available in wordnet (nltk) 4) because each noun has multiple senses, I take the total number of abstract senses and divide it against the total number of concrete senses in order calculate how abstract the noun is 5) all the abstraction scores for all the nouns are averaged.
Lesk Abstraction Score – a score representing the abstractness of the poem in the same way as the abstraction_score is calculated with the exception being that I am using nltk’s lesk algorithm to calculate the best sense of the noun in its sentence then using that sense to look at how abstract that is, and the score is for all the nouns averaged
Title Lesk Abstraction Score – same as above except for just the title
Pronoun Score - this is the distinct number of pronoun occurrences referring to the subject or reader divided by the total number of sentences in the poem
Conjunction Score - this is the number of conjunctions and subordinate conjunctions found in the poem divided by the total number of sentences
Noun, Verb, and Adjective Phrase Ratio - this is ratio of the noun, verb, or adjective phrases over the total number of noun, verb, and adjective phrases found in the text
Average Noun, Verb, and Adjective Phrase Complexity Score - I define a phrase as complex when it contains more than one word. This is divided by the total number of phrases within each poem.
1 word, 2 word, and 3 word noun phrase frequency ratio - these variables stand for the frequency of 1, 2, or 3 word noun phrases within each poem

###Ousted Variables:

wordcount - the total number of words that a poem has
wordcount_d - the distinct number of words that a poem has
numlines - the number of lines that a poem text has
numstanzas - the number of stanzas that a poem text has
yrpub - the year the poem was published, two poems were missing the yr published so I just got rid of them
num_sentences - the number of sentences that a poem has

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
Analysis		Analysis
Data		Data
Extract		Extract
Resources		Resources
Transform		Transform
.gitignore		.gitignore
FinalProjectPresentation.pptx		FinalProjectPresentation.pptx
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

An Analysis of Contemporary American Poetry Using K means Clustering & NLTK

Specific Aim

About

Uh oh!

Releases

Packages

Languages

huiruru/FinalProject

Folders and files

Latest commit

History

Repository files navigation

An Analysis of Contemporary American Poetry Using K means Clustering & NLTK

Specific Aim

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages