## Categorization of Poetry

In this notebook, we will split the poetry data into different schemas. We will create a genre dataframe, a text dataframe, and an author dataframe, and preserve the original dataframe for reference.

In [1]:
### importing libraries ###

import pandas as pd

In [2]:
### importing poetry data ###

df = pd.read_csv('data/poem_clean.csv', index_col = 0)

df.head()

Unnamed: 0,title,author,genre,text
0,1-800-FEAR,Jody Gladding,Living SocialCommentaries PopularCulture Popul...,We'd like to talk with you about fear they sai...
1,1 January 1965,Joseph Brodsky,Living Death GrowingOld Time&Brevity Nature Wi...,The Wise Men will unlearn your name. Above you...
2,"10-Year-Old Shot Three Times, but She’s Fine",Patricia Smith,Living Youth SocialCommentaries Crime&Punishme...,"Dumbfounded in hospital whites, you are pictur..."
3,103 Korean Martyrs,Monica Youn,Religion Arts&Sciences Photography&Film Photog...,Where was it that we went that night? That lon...
4,#104 from The Poems of Gaius Valerius Catullus,Brandon Brown,Living LifeChoices TheMind Relationships Frien...,with Dana Ward I have so little want of activi...


Now that we have imported the data, we will look for and drop any missing values in case our earlier cleaning missed them.

In [3]:
### checking the shape ###

df.shape

(11091, 4)

In [4]:
### checking for missing values ###

df.isna().sum()

title     0
author    0
genre     0
text      1
dtype: int64

In [5]:
### dropping missing values ###

df = df.dropna(axis = 0).reset_index(drop = True)

### Genre

First we will create a genre dataframe. We will do the equivalent of one-hot encoding: count-vectorize the genre strings, then reset every value greater than 1 to 1.

In [6]:
### count vectorization ###

# importing library
from sklearn.feature_extraction.text import CountVectorizer

# count vectorization
cv = CountVectorizer()
genre_cv = cv.fit_transform(df['genre']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
genre_cv = pd.DataFrame(genre_cv, columns = cv.get_feature_names_out(), index = df.index)

genre_cv.head()

Unnamed: 0,10,13,14,17,activities,aging,alliteration,allusion,anaphora,ancestors,...,war,weather,weddings,winter,women,workandplay,working,worry,yomkippur,youth
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,3
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


We then manually went through the genres and dropped the "stylistic" ones, such as allusion and haiku. We also dropped numeric tokens such as 10 and 14, which did not make sense to us as genres.

In [7]:
### dropping genres ###

genre_cv = genre_cv.drop(['10', '13', '14', '17',
                         'alliteration', 'allusion', 'anaphora', 'aphorism',
                         'arspoetica', 'assonance', 'blankverse', 'books',
                         'brevity', 'classics', 'concreteorpatternpoetry', 'couplet',
                         'ekphrasis', 'epic', 'epigram', 'epigraph',
                         'epistle', 'epithalamion', 'ghazal', 'haiku',
                         'imagery', 'imagist', 'language', 'limerick',
                         'linguistics', 'metaphor', 'meters', 'mixed',
                         'modes', 'pantoum', 'ottavarima', 'patrick',
                         'poetry', 'poets', 'prosepoem', 'quatrain',
                         'reading', 'rhymedstanza', 'satire', 'sday',
                         'sequence', 'series', 'simile', 'sonnet',
                         'st', 'stanzaforms', 'syllabic', 'symbolist',
                         'tales', 'techniques', 'terzarima', 'tercet',
                         'verseforms', 'visualpoetry', 'villanelle', 'sestina',
                         'types', 'ballad', 'folklore', 'freeverse'], axis = 1)
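
Some entries in the drop list, such as `st`, `patrick`, and `sday`, look like fragments rather than genres. A likely explanation is tokenization: by default `CountVectorizer` lowercases its input and keeps runs of two or more word characters (token pattern `(?u)\b\w\w+\b`), so punctuation inside a tag splits it. The sketch below mimics that default with the standard `re` module instead of calling scikit-learn. The first input is the visible start of a genre string from the preview above; the second tag, `St.Patrick'sDay`, is an assumption for illustration and is not confirmed from the dataset.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer:
# runs of two or more word characters; anything else acts as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(genre_string):
    """Approximate CountVectorizer's default analyzer: lowercase, then match tokens."""
    return TOKEN_PATTERN.findall(genre_string.lower())

# Visible start of the genre string for row 1 in the preview above.
tokens = tokenize("Living Death GrowingOld Time&Brevity")
print(tokens)

# Hypothetical tag (an assumption, not taken from the data) showing how
# punctuation could produce fragments like 'st', 'patrick', and 'sday'.
fragments = tokenize("St.Patrick'sDay")
print(fragments)
```

This is also why a compound tag such as `Time&Brevity` contributes two separate columns, `time` and `brevity`, to the count matrix.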

Since some genres overlap heavily with each other, such as grief and grieving, we have created aggregated genres. The groupings are arbitrary and reflect our own judgment.

In [8]:
### aggregated genres ###

children = [['children_agg'], ['beginningreaders', 'comingofage', 'earlychildhood', 'earlyreaders', 'nurseryrhymes', 
            'teens', 'tweens', 'youth', 'forallages', 'school']]

religion = [['religion_agg'], ['buddhism', 'christianity', 'hanukkah', 'islam', 'kwanzaa', 'judaism', 'yomkippur', 
            'thedivine', 'roshhashanah', 'religion', 'ramadan', 'passover', 'otherreligions', 'god', 'faith']]

love = [['love_agg'], ['anniversary', 'aubade', 'classiclove', 'commitment', 'companionship', 'crushes', 'desire', 
        'firstlove', 'friends', 'engagement', 'marriage', 'vexedlove', 'valentine', 'unrequitedlove', 
        'romanticlove', 'relationships']]

family = [['family_agg'], ['children', 'infancy', 'ancestors', 'aging', 'birth', 'birthdays', 'family', 'father', 'growingold', 
          'love', 'mother', 'parenthood', 'homelife']]

celebration = [['celebration_agg'], ['christmas', 'cincodemayo', 'celebrations', 'easter', 'graduation', 'halloween', 'fun', 
               'laborday', 'weddings', 'thanksgiving', 'newyear', 'independenceday', 'holidays', 'toasts']]

sad = [['sad_agg'], ['elegy', 'anger', 'apologies', 'blame', 'boredom', 'complicated', 'confessional', 'disappointment', 
       'failure', 'farewells', 'funerals', 'frustration', 'heartache', 'grieving', 'grief', 'doubt', 'loss', 'worry', 
       'sorrow', 'memorialday']]

conflict = [['conflict_agg'], ['conflict', 'divorce', 'crime', 'death', 'enemies', 'war', 'september11th', 'socialcommentaries']]

identity = [['identity_agg'], ['gay', 'gender', 'ethnicity', 'lesbian', 'men', 'queer', 'sexuality', 'women', 'race']]

supernatural = [['supernatural_agg'], ['ghosts', 'thesupernatural', 'thespiritual', 'spirituality', 'mythology', 'fairy']]

# list of aggregated genres
concise = [children, religion, love, family, celebration, sad, conflict, identity, supernatural]

Now we aggregate them all and drop the used columns.

In [9]:
### genre aggregation ###

# aggregating
for category in concise:
    genre_cv[category[0][0]] = 0
    for name in category[1]:
        genre_cv[category[0][0]] += genre_cv[name]

# dropping used genres
used_col = []

for category in concise:
    used_col += category[1]

genre_cv = genre_cv.drop(used_col, axis = 1)

# resetting all nonzero counts to 1 (values greater than 1 become 1)
genre_cv[genre_cv > 0] = 1

genre_cv.head()

Unnamed: 0,activities,animals,architecture,arts,artsandsciences,beingoneself,break,cities,class,commonmeasure,...,working,children_agg,religion_agg,love_agg,family_agg,celebration_agg,sad_agg,conflict_agg,identity_agg,supernatural_agg
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,1,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,1,1,0,0,0,0,0,...,0,0,0,1,0,0,1,1,0,0
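
The aggregate, drop, and clip sequence can be easier to follow on a small example. The sketch below is a minimal version under stated assumptions: the column names are borrowed from the lists above, but the counts are invented, and the `groups` dictionary is a hypothetical stand-in for the nested lists in `concise`.

```python
import pandas as pd

# Toy genre counts; the values are invented for illustration.
genre = pd.DataFrame({
    "grief": [1, 0, 0],
    "grieving": [1, 0, 1],
    "war": [0, 2, 0],
    "animals": [0, 0, 3],
})

# Hypothetical mapping from aggregated name to member columns.
groups = {
    "sad_agg": ["grief", "grieving"],
    "conflict_agg": ["war"],
}

# Sum each group's member columns row by row.
for agg_name, members in groups.items():
    genre[agg_name] = genre[members].sum(axis=1)

# Drop the member columns, then cap every count at 1.
used = [name for members in groups.values() for name in members]
genre = genre.drop(columns=used).clip(upper=1)
print(genre)
```

The first poem carries both `grief` and `grieving`, so its summed `sad_agg` count is 2 before clipping. Clipping after aggregation is what keeps the final matrix binary.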


Since we needed to aggregate genres first, we were unable to use a min_df in our count vectorization. Now that aggregation is complete, we will set a min_df = 50 manually so that underrepresented genres are not used in our recommender.

In [10]:
### setting minimum occurrence of genres to 50 ###

# setting mask
mask = genre_cv.sum() < 50

# applying mask
filt_genre = genre_cv.sum()[mask]

# filtering with min_df
genre_cv = genre_cv.drop(columns = filt_genre.index)

# giving columns with genres a unique identifier so that if it is joined with other dataframes, they can be told apart
genre_cv = genre_cv.add_prefix('g_')
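
The manual threshold works as a stand-in for `min_df` because, on a 0/1 matrix, a column's sum is the number of poems that carry that genre. The sketch below illustrates this on invented data, with a threshold of 2 in place of 50; the frame and the name `min_count` are assumptions for the example only.

```python
import pandas as pd

# Toy 0/1 genre matrix with invented values; one row per poem.
genre = pd.DataFrame({
    "animals": [1, 1, 0, 1],
    "cities": [0, 0, 1, 0],
    "sad_agg": [1, 0, 1, 0],
})

# On a binary matrix the column sum is the number of poems carrying the genre.
doc_freq = genre.sum()

# Keep genres that appear in at least `min_count` poems (50 in the notebook, 2 here).
min_count = 2
kept = genre.loc[:, doc_freq >= min_count]
print(kept.columns.tolist())
```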

In [11]:
genre_cv.head()

Unnamed: 0,g_activities,g_animals,g_architecture,g_arts,g_artsandsciences,g_beingoneself,g_break,g_cities,g_class,g_commonmeasure,...,g_working,g_children_agg,g_religion_agg,g_love_agg,g_family_agg,g_celebration_agg,g_sad_agg,g_conflict_agg,g_identity_agg,g_supernatural_agg
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,1,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,1,1,0,0,0,0,0,...,0,0,0,1,0,0,1,1,0,0


### Text

Now we will apply tf-idf vectorization to the text, removing digits and English stop words and keeping only words that appear in at least 50 poems.

In [12]:
### tf-idf vectorization ###

# importing library
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf vectorization
tv = TfidfVectorizer(stop_words = 'english', min_df = 50)
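# stripping digits; regex = True is needed because pandas 2.0+ treats the pattern as a literal string by default
text_tv = tv.fit_transform(df['text'].str.replace('[0-9]+', '', regex = True)).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
text_tv = pd.DataFrame(text_tv, columns = tv.get_feature_names_out(), index = df.index)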

text_tv.head()

Unnamed: 0,abandoned,able,abroad,absence,absent,absolute,abstract,abyss,accent,accept,...,yield,yon,yonder,york,young,younger,youth,youthful,zero,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
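
To make the weights in this matrix less opaque, here is a hand-rolled sketch of the computation. It assumes the vectorizer's remaining settings are left at their scikit-learn defaults, as in the cell above: raw term counts, smoothed idf of the form `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The corpus is invented and pre-tokenized, and no stop-word or `min_df` filtering is applied.

```python
import math
from collections import Counter

# Tiny invented corpus, already tokenized.
docs = [
    ["rose", "rose", "thorn"],
    ["rose", "moon"],
    ["moon", "tide"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency, following scikit-learn's default formula."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term counts times idf, scaled so the vector has unit L2 norm."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(docs[0])
print(first)
```

Rarer terms receive a larger idf, while repeated terms receive a larger count, so the final weight balances the two. The mostly zero rows in the preview simply mean those poems contain none of the alphabetically first and last vocabulary words shown.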


### Author

Now we will one hot encode authors.

In [13]:
### dummy variables ###

# one hot encoding
author_dv = pd.get_dummies(df['author'])

# giving columns with authors a unique identifier so that if it is joined with other dataframes, they can be told apart
author_dv = author_dv.add_prefix('a_')

author_dv.head()

Unnamed: 0,a_'Annah Sobelman,a_A. B. Spellman,a_A. E. Housman,a_A. F. Moritz,a_A. Poulin Jr.,a_A. R. Ammons,a_A. Van Jordan,a_Aaron Shurin,a_Aase Berg,a_Aazhidegiizhig,...,a_Zozan Hawez,a_bell hooks,a_bill bissett,a_dg nanouk okpik,a_elena minor,a_erica lewis,a_joanne burns,a_kari edwards,a_t'ai freedom ford,a_Æmilia Lanyer
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Exporting

Now that we have our dataframes, we will export them for model testing.

In [14]:
### exporting data ###

df.to_csv('data/poem.csv')
text_tv.to_csv('data/text.csv')
genre_cv.to_csv('data/genre.csv')
author_dv.to_csv('data/author.csv')
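
When these files are loaded again, it helps to remember that `to_csv` writes the index as an unnamed first column by default, which is why the import at the top of this notebook passes `index_col = 0`. The sketch below shows the round trip on a toy frame, using an in-memory buffer in place of a file; the frame and its values are invented for illustration.

```python
import io
import pandas as pd

# Toy frame standing in for one of the exported dataframes.
frame = pd.DataFrame({"g_animals": [0, 1], "g_sad_agg": [1, 0]})

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
frame.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading without index_col turns that column into 'Unnamed: 0'.
naive = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores the original columns.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(naive.columns.tolist())
print(restored.columns.tolist())
```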