# Poems for Children and Adults Assignment

## Part 1: Word Length

In the first section of this research project, I look at differences in word length, and in the proportional use of long words, between adult poems (Barbauld) and children's poems (Smith).

I hypothesize that adult poems will show a proportionally higher use of long words than children's poems. I do not, however, think that the difference will be very large. I think that both adult poems and children's poems will use long words, but that the types of long words they use will be different and that this difference in type will lead to differences in their proportional usage.

This is compatible with the idea that children's poems tend to make up words, which could produce longer word lengths. I think that this can be true, but that it will not necessarily lead to a higher proportion of long words. I think that made-up words in children's poems would occur somewhat infrequently, naming a new thing or sound that doesn't already exist in the world. I believe this limits the number of made-up words that can appear in a children's poem, so even if these words are unusually long they will be infrequent. I also think that children's poems will contain other long words, but that their use is restricted by the limited vocabulary that children have compared to adults.

This is where I think adult poems will gain their higher proportion of long words: they can use more complex words that capture more complex types of description. We would probably see more adjectives, adverbs, and words portraying immaterial or mental states. Children's poems, on the other hand, are more restricted to nouns and verbs with simple adjectives.

I will attempt to test these hypotheses by conducting some computational analyses on these texts, returning the average word length of each text, as well as the proportions of words longer than four (4) and ten (10) letters respectively. By comparing these figures, and conducting a brief qualitative analysis of the words that are longer than four and ten letters, I hope to see whether my hypothesis holds.

### Barbauld's Adult Poems

#### Opening the Data

Here, I first open the data, check what type of data this file is, and print the first 100 characters of the file to see what state the data is in.

In [81]:
barbauld_string = open('../data/barbauld_poems.txt', encoding = 'utf-8').read()
print(type(barbauld_string))
barbauld_string[:100]

<class 'str'>


"\n\n\n\n\nCorsica.\n\n\n\n\n―― A manly race\nOf unſubmitting ſpirit, wiſe and brave;\nWho ſtill thro' bleeding a"

#### Cleaning and Formatting the Data for Analysis

So we know from the output provided above that we have string type data, and we can see it is strewn with lots of punctuation and formatting. 

Let's start cleaning up our data, so we can get it into a useable format for counting the lengths of words. 

I am going to start by producing a punctuation list (punct_list). We are going to use this list to remove all of the punctuation in the string data so we are only left with words, the unit we are attempting to analyze and count. 

When looking at the poems in a text editor, I realized that numerical digits are used to separate stanzas or sections of the text from one another. While this could be useful if we were analyzing differences between stanzas, we are only looking at word length and pronouns at this point. Thus the numerical digits are not really a part of the text for the purposes of this analysis.

To remove these formatting digits and only get words for our analysis, I created a list of numerical digits to exclude from the list (digit_list).

I then create a new list of characters from this string data that keeps only the characters in the Barbauld string that do not appear in the punctuation list, and then keeps only the characters that do not appear in the digit list.

In [82]:
punct_list = ['―','!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
digit_list = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']

char_list = ([char for char in barbauld_string if char not in punct_list])
char_list = ([char for char in char_list if char not in digit_list])
char_list

['\n',
 '\n',
 '\n',
 '\n',
 '\n',
 'C',
 'o',
 'r',
 's',
 'i',
 'c',
 'a',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 ' ',
 'A',
 ' ',
 'm',
 'a',
 'n',
 'l',
 'y',
 ' ',
 'r',
 'a',
 'c',
 'e',
 '\n',
 'O',
 'f',
 ' ',
 'u',
 'n',
 'ſ',
 'u',
 'b',
 'm',
 'i',
 't',
 't',
 'i',
 'n',
 'g',
 ' ',
 'ſ',
 'p',
 'i',
 'r',
 'i',
 't',
 ' ',
 'w',
 'i',
 'ſ',
 'e',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 'b',
 'r',
 'a',
 'v',
 'e',
 '\n',
 'W',
 'h',
 'o',
 ' ',
 'ſ',
 't',
 'i',
 'l',
 'l',
 ' ',
 't',
 'h',
 'r',
 'o',
 ' ',
 'b',
 'l',
 'e',
 'e',
 'd',
 'i',
 'n',
 'g',
 ' ',
 'a',
 'g',
 'e',
 's',
 ' ',
 'ſ',
 't',
 'r',
 'u',
 'g',
 'g',
 'l',
 'e',
 'd',
 ' ',
 'h',
 'a',
 'r',
 'd',
 '\n',
 'T',
 'o',
 ' ',
 'h',
 'o',
 'l',
 'd',
 ' ',
 'a',
 ' ',
 'g',
 'e',
 'n',
 'e',
 'r',
 'o',
 'u',
 's',
 ' ',
 'u',
 'n',
 'd',
 'i',
 'm',
 'i',
 'n',
 'i',
 'ſ',
 'h',
 'd',
 ' ',
 'ſ',
 't',
 'a',
 't',
 'e',
 '\n',
 'T',
 'o',
 'o',
 ' ',
 'm',
 'u',
 'c',
 'h',
 ' ',
 'i',
 'n',
 ' ',
 'v',
 'a',
 '

I used the join function to join the elements of the list of characters back into a string, this time without punctuation or numerical digits.

In [84]:
barbauld_string_clean = ''.join(char_list)
barbauld_string_clean

'\n\n\n\n\nCorsica\n\n\n\n\n A manly race\nOf unſubmitting ſpirit wiſe and brave\nWho ſtill thro bleeding ages ſtruggled hard\nTo hold a generous undiminiſhd ſtate\nToo much in vain\n\n\nThomson\n\n\n\nHail generous Corsica unconquerd iſle\nThe fort of freedom that amidſt the waves\nStands like a rock of adamant and dares\nThe wildeſt fury of the beating ſtorm\n\nB\nAnd\n\n\n\n\nAnd are there yet in this late ſickly age\nUnkindly to the towring growths of virtue\nSuch bold exalted ſpirits Men whoſe deeds\nTo the bright annals of old Greece opposd\nWould throw in ſhades her yet unrivald name\nAnd dim the luſtre of her faireſt page\nAnd glows the flame of Liberty ſo ſtrong\nIn this lone ſpeck of earth this ſpot obſcure\nShaggy with woods and cruſted oer with rock\nBy ſlaves ſurrounded and by ſlaves oppreſsd\nWhat then ſhould Britons feel ſhould they not catch\nThe warm contagion of heroic ardour\nAnd kindle at a fire ſo like their own\n\n\nSuch were the working thoughts which ſwelld the 
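The same cleaning step can be sketched more compactly with `str.translate`. This is only an alternative illustration, not the method used in this notebook: `DROP_CHARS`, `DROP_TABLE` and `strip_punct_and_digits` are hypothetical names, and the sketch relies on the fact that `string.punctuation` holds the same ASCII punctuation marks typed out in `punct_list`, with the bar character from the start of that list added by hand.

```python
import string

# Characters to drop: ASCII punctuation, ASCII digits, and the bar character
# that appears first in the notebook's punct_list (copied from that cell).
DROP_CHARS = string.punctuation + string.digits + "―"
DROP_TABLE = str.maketrans("", "", DROP_CHARS)


def strip_punct_and_digits(text):
    """Remove every character in DROP_CHARS, leaving letters and whitespace."""
    return text.translate(DROP_TABLE)


# Sample adapted from the opening lines shown above, with digits added.
sample = "―― A manly race\nOf unſubmitting ſpirit, wiſe and brave; 12"
print(repr(strip_punct_and_digits(sample)))
```

Because the translation table is built once, the same function could be reused unchanged on the second text.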

#### Counting and Averaging the Length of Words in the Text

With this done, now it is time to split the new string into a list of words. It's only once we have the data in the format of a list of words that we can begin to count the lengths of these words.

With this list in hand, I wrote a command that will total the sum of the length of each word in the list as well as the total number of words in the list/text. With this information, we can get the average word length in the text as a whole.  

In [122]:
#split string into a list
barbauld_list = barbauld_string_clean.split()

total = 0
count = 0
for word in barbauld_list:
    total = total + len(word)
    count = count + 1
    
average = total/count
  
print("Number of Words in the Text: ", count)
print("Average Word Length in the Text:", average)
print(average)

Number of Words in the Text:  14057
Average Word Length in the Text: 4.519954471082023
4.519954471082023
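The running totals in the loop above can also be written as a single expression. The sketch below is a hedged equivalent, where `average_word_length` is a hypothetical helper name and the toy word list is taken from the opening line of the cleaned text.

```python
def average_word_length(words):
    """Mean number of characters per word; returns 0.0 for an empty list."""
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


# Toy input taken from the opening line of the cleaned text above.
toy_words = "A manly race Of unſubmitting ſpirit wiſe and brave".split()
print(len(toy_words), average_word_length(toy_words))
```

Guarding against an empty list avoids a division-by-zero error if a file turns out to contain no words after cleaning.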


I then chose to make different sublists of the complete word list we created above.

I made a list of all the words in the text that were longer than four letters. I then divided the number of words on this list by the number of words in the whole text. The list itself is printed below for qualitative analysis later.

In [90]:
#create a new list, keeping only elements that have a character length greater than 4
barbauld_list_fourlw = [e for e in barbauld_list if len(e)>4]

#divide the length of the longer-than-four-letter word list by the length of the full word list
print("Proportion of Words in the Text longer than Four Letters: ", len(barbauld_list_fourlw) / len(barbauld_list))

Proportion of Words in the Text longer than Four Letters:  0.4427687273244647


In [87]:
#The Four Letter list for Qualitative Analysis
barbauld_list_fourlw

['Corsica',
 'manly',
 'unſubmitting',
 'ſpirit',
 'brave',
 'ſtill',
 'bleeding',
 'ſtruggled',
 'generous',
 'undiminiſhd',
 'ſtate',
 'Thomson',
 'generous',
 'Corsica',
 'unconquerd',
 'freedom',
 'amidſt',
 'waves',
 'Stands',
 'adamant',
 'dares',
 'wildeſt',
 'beating',
 'ſtorm',
 'there',
 'ſickly',
 'Unkindly',
 'towring',
 'growths',
 'virtue',
 'exalted',
 'ſpirits',
 'whoſe',
 'deeds',
 'bright',
 'annals',
 'Greece',
 'opposd',
 'Would',
 'throw',
 'ſhades',
 'unrivald',
 'luſtre',
 'faireſt',
 'glows',
 'flame',
 'Liberty',
 'ſtrong',
 'ſpeck',
 'earth',
 'obſcure',
 'Shaggy',
 'woods',
 'cruſted',
 'ſlaves',
 'ſurrounded',
 'ſlaves',
 'oppreſsd',
 'ſhould',
 'Britons',
 'ſhould',
 'catch',
 'contagion',
 'heroic',
 'ardour',
 'kindle',
 'their',
 'working',
 'thoughts',
 'which',
 'ſwelld',
 'breaſt',
 'generous',
 'Boswell',
 'nobler',
 'views',
 'beyond',
 'narrow',
 'beaten',
 'track',
 'trivial',
 'fancy',
 'turnd',
 'courſe',
 'poliſhd',
 'Gallias',
 'delicious',
 '

I then made a list of all the words in the text that were longer than 10 letters long. I then divided the number of words on this list by the number of the words in the whole text. I then provide the list of words longer than ten letters long for qualitative analysis later.

In [88]:
#create a new list, keeping only elements that have a character length greater than 10
barbauld_list_tenlw = [e for e in barbauld_list if len(e)>10]

#divide the length of the longer-than-ten-letter word list by the length of the full word list
print("Proportion of Words in the Text longer than Ten Letters: ", len(barbauld_list_tenlw) / len(barbauld_list))

Proportion of Words in the Text longer than Ten Letters:  0.003628085651276944
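Since the four-letter and ten-letter calculations differ only in the threshold, they could share one function. This is a minimal sketch under that assumption; `proportion_longer_than` is a hypothetical name, and "longer than" is read strictly, matching the `len(e)>4` and `len(e)>10` tests in the cells above.

```python
def proportion_longer_than(words, threshold):
    """Share of words whose length is strictly greater than threshold."""
    if not words:
        return 0.0
    long_words = [word for word in words if len(word) > threshold]
    return len(long_words) / len(words)


# Toy input taken from the opening line of the cleaned text.
toy_words = "A manly race Of unſubmitting ſpirit wiſe and brave".split()
for threshold in (4, 10):
    print(threshold, proportion_longer_than(toy_words, threshold))
```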


In [89]:
#The Ten letter list for qualitative analysis
barbauld_list_tenlw

['unſubmitting',
 'undiminiſhd',
 'forunefortune',
 'preſumptuous',
 'friendſhips',
 'approaching',
 'transforming',
 'enthuſiastic',
 'inſtinctive',
 'perſpective',
 'nouriſhment',
 'ſubſtantial',
 'Preſbyterians',
 'inſpiration',
 'Backwardneſs',
 'increpitans',
 'zephyroſque',
 'thinchanted',
 'approaching',
 'unrelenting',
 'philoſophic',
 'deſtruction',
 'premeditates',
 'congregated',
 'ſubſequiturque',
 'SongWriting',
 'unſubmitting',
 'unconquerable',
 'Sachariſſas',
 'diſpleaſing',
 'unrelenting',
 'diſtinguiſhd',
 'hardlyhardy',
 'cotemporary',
 'inhoſpitable',
 'inhoſpitable',
 'Remorſeleſs',
 'unharmonious',
 'friendſhips',
 'acclamation',
 'EasterSunday',
 'chariotwheels',
 'ſupplicating',
 'terreſtrial',
 'everlaſting',
 'heart—Beware',
 'Contemplation',
 'hieroglyphics',
 'ſelfcollected',
 'recollected',
 'terreſtrial']

### Smith's Children's Poems

Now it's time to run all the same operations we conducted on the Barbauld text above on Smith's Children's Poems.

#### Opening the Data

Here, I first open the data, check what type of data this file is, and print the first 100 characters of the file to see what state the data is in.

In [91]:
smith_string = open('../data/smith_conversations.txt', encoding = 'utf-8').read()
print(type(smith_string))
smith_string[:100]

<class 'str'>


'\n\n\n\nConversation the First.\n\nPoems.\n\n\nTo a Green-chafer, on a white Rose.\n\n\nTo a Lady-bird.\n\n\nThe Sn'
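Because the same open-and-preview steps have now been run on two files, they could be wrapped in a small helper. The sketch below is one possible version, assuming the same relative path as the cell above; `read_text` and `preview` are hypothetical names that do not appear elsewhere in this notebook.

```python
from pathlib import Path


def read_text(path, encoding="utf-8"):
    """Return the full contents of a text file as one string."""
    # The with-block closes the file even if reading raises an error.
    with open(path, encoding=encoding) as handle:
        return handle.read()


def preview(text, n=100):
    """Return the type name and the first n characters, for a quick look."""
    return type(text).__name__, text[:n]


# Assumed location, copied from the cell above; skipped if the file is absent.
smith_path = Path("../data/smith_conversations.txt")
if smith_path.exists():
    print(preview(read_text(smith_path)))
```

Using `with` is a small robustness gain over a bare `open(...).read()`, since the file handle is released as soon as the text is in memory.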

#### Cleaning and Formatting the Data for Analysis

So we know from the output provided above that we have string type data, and we can see it is strewn with lots of punctuation and formatting. 

Let's start cleaning up our data, so we can get it into a useable format for counting the lengths of words. 

I am going to start by producing a punctuation list (punct_list). We are going to use this list to remove all of the punctuation in the string data so we are only left with words, the unit we are attempting to analyze and count. 

When looking at the poems in a text editor, I realized that numerical digits are used to separate stanzas or sections of the text from one another. While this could be useful if we were analyzing differences between stanzas, we are only looking at word length and pronouns at this point. Thus the numerical digits are not really a part of the text for the purposes of this analysis.

To remove these formatting digits and only get words for our analysis, I created a list of numerical digits to exclude from the list (digit_list).

I then create a new list of characters from this string data that keeps only the characters in the Smith string that do not appear in the punctuation list, and then keeps only the characters that do not appear in the digit list.

In [92]:
punct_list = ['―','!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
digit_list = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']

char_list = ([char for char in smith_string if char not in punct_list])
char_list = ([char for char in char_list if char not in digit_list])
char_list

['\n',
 '\n',
 '\n',
 '\n',
 'C',
 'o',
 'n',
 'v',
 'e',
 'r',
 's',
 'a',
 't',
 'i',
 'o',
 'n',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'F',
 'i',
 'r',
 's',
 't',
 '\n',
 '\n',
 'P',
 'o',
 'e',
 'm',
 's',
 '\n',
 '\n',
 '\n',
 'T',
 'o',
 ' ',
 'a',
 ' ',
 'G',
 'r',
 'e',
 'e',
 'n',
 'c',
 'h',
 'a',
 'f',
 'e',
 'r',
 ' ',
 'o',
 'n',
 ' ',
 'a',
 ' ',
 'w',
 'h',
 'i',
 't',
 'e',
 ' ',
 'R',
 'o',
 's',
 'e',
 '\n',
 '\n',
 '\n',
 'T',
 'o',
 ' ',
 'a',
 ' ',
 'L',
 'a',
 'd',
 'y',
 'b',
 'i',
 'r',
 'd',
 '\n',
 '\n',
 '\n',
 'T',
 'h',
 'e',
 ' ',
 'S',
 'n',
 'a',
 'i',
 'l',
 '\n',
 '\n',
 '\n',
 'A',
 ' ',
 'W',
 'a',
 'l',
 'k',
 ' ',
 'b',
 'y',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'W',
 'a',
 't',
 'e',
 'r',
 '\n',
 '\n',
 '\n',
 'I',
 'n',
 'v',
 'i',
 't',
 'a',
 't',
 'i',
 'o',
 'n',
 ' ',
 't',
 'o',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'B',
 'e',
 'e',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 'C',
 'o',
 'n',
 'v',
 'e',
 'r',
 's',
 'a',
 't',
 'i',
 'o'

I used the join function to join the elements of the list of characters back into a string, this time without punctuation or numerical digits.

In [111]:
smith_string_clean = ''.join(char_list)
#smith_string_clean

#### Counting and Averaging the Length of Words in the Text

With this done, now it is time to split the new string into a list of words. It's only once we have the data in the format of a list of words that we can begin to count the lengths of these words.

With this list in hand, I wrote a command that will total the sum of the length of each word in the list as well as the total number of words in the list/text. With this information, we can get the average word length in the text as a whole.

In [105]:
#split string into a list
smith_list = smith_string_clean.split()

total = 0
count = 0
for word in smith_list:
    total = total + len(word)
    count = count + 1
    
average = total/count
    
print("Number of Words in the Text: ", count)
print("Average Word Length in the Text:", average)

Number of Words in the Text:  28814
Average Word Length in the Text: 4.341639480807941


I then chose to make different sublists of the complete word list we created above.

I made a list of all the words in the text that were longer than four letters. I then divided the number of words on this list by the number of words in the whole text. The list itself is printed below for qualitative analysis later.

In [106]:
#create a new list, keeping only elements that have a character length greater than 4
smith_list_fourlw = [e for e in smith_list if len(e)>4]

#divide the length of the longer-than-four-letter word list by the length of the full word list
print("Proportion of Words in the Text longer than Four Letters: ", len(smith_list_fourlw) / len(smith_list))

Proportion of Words in the Text longer than Four Letters:  0.38401471506906365


In [107]:
#The Four Letter list for Qualitative Analysis
smith_list_fourlw

['Conversation',
 'First',
 'Poems',
 'Greenchafer',
 'white',
 'Ladybird',
 'Snail',
 'Water',
 'Invitation',
 'Conversation',
 'First',
 'George—Emily',
 'little',
 'Garden',
 'called',
 'their',
 'George',
 'Emily',
 'beauti\xad',
 'shining',
 'insect',
 'which',
 'almost',
 'itself',
 'white',
 'favourite',
 'shaped',
 'those',
 'brownish',
 'chafers',
 'which',
 'desired',
 'gardeners',
 'children',
 'yesterday',
 'because',
 'thought',
 'going',
 'torment',
 'pret\xad',
 'little',
 'tassels',
 'horns',
 'wings',
 'shine',
 'peacocks',
 'feathers',
 'Emily',
 'pretty—but',
 'indeed',
 'George',
 'afraid',
 'should',
 'Mamma',
 'cruel',
 'deprive',
 'insect',
 'liberty',
 'perhaps',
 'would',
 'fined',
 'George',
 'Mamma',
 'could',
 'would',
 'whether',
 'without',
 'hurting',
 'might',
 'little',
 'paper',
 'which',
 'could',
 'strong',
 'paper',
 'holes',
 'could',
 'carry',
 'gently',
 'which',
 'crept',
 'snugly',
 'gather',
 'finest',
 'flower',
 'blown',
 'Emily',
 'suppose'
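The list above contains fragments such as 'beauti\xad' and 'pret\xad', which end in a soft hyphen (U+00AD), and tokens in which two words are joined by a dash that is not in `punct_list`. The sketch below shows one possible pre-cleaning step. It rests on two assumptions that would need checking against the source file: that a soft hyphen marks a word broken across a line, and that the joining dash is an em dash (U+2014). `repair_breaks` is a hypothetical name. The step is not applied in this notebook, and applying it would change the figures reported here.

```python
import re


def repair_breaks(text):
    """Rejoin soft-hyphenated words and split words joined by an em dash."""
    # A soft hyphen (U+00AD) plus any following whitespace is assumed to mark
    # a word broken across a line, so both are removed to rejoin the halves.
    text = re.sub("\u00ad\\s*", "", text)
    # An em dash (U+2014) is assumed to separate two words, so it becomes a space.
    return text.replace("\u2014", " ")


# Hand-built sample imitating the patterns seen in the Smith output.
sample = "look at this beauti\u00ad\nful shining insect\nit is pretty\u2014but"
print(repair_breaks(sample).split())
```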

I then made a list of all the words in the text that were longer than ten letters. I then divided the number of words on this list by the number of words in the whole text. The list itself is printed below for qualitative analysis later.

In [108]:
#create a new list, keeping only elements that have a character length greater than 10
smith_list_tenlw = [e for e in smith_list if len(e)>10]

#divide the length of the longer-than-ten-letter word list by the length of the full word list
print("Proportion of Words in the Text longer than Ten Letters: ", len(smith_list_tenlw) / len(smith_list))

Proportion of Words in the Text longer than Ten Letters:  0.015547997501214687


In [109]:
#The Ten Letter list for Qualitative Analysis
smith_list_tenlw

['Conversation',
 'Greenchafer',
 'Conversation',
 'George—Emily',
 'greenchafer',
 'satisfaction',
 'description',
 'neighbouring',
 'disagreeable',
 'unintelligible',
 'nightingale',
 'nightingale',
 'GreenChafer',
 'ingratitude',
 'distinguished',
 'troublesome',
 'extraordinary',
 'misplacdWhen',
 'Ladybird—fly',
 'selfcollecting',
 'Displeasure',
 'independant',
 'Switzerland',
 'consumptions',
 'mischievous',
 'conversation',
 'acquaintance',
 'Talbot—Emily',
 'acquaintance',
 'Scamperville',
 'fashionably',
 'WaterEmilyLet',
 'willowsDart',
 'waterfliesMidst',
 'waterlillies',
 'glidingShun',
 'hookWanderers',
 'Scamperville',
 'Scamperville',
 'inhabitants',
 'dwelling—the',
 'defenceless',
 'manmilliner',
 'Scamperville',
 'shopkeeping',
 'Scamperville',
 'contrivances',
 'Scamperville',
 'frightening',
 'Scamperville',
 'Scamperville',
 'Scamperville',
 'speculations',
 'Translation',
 'interesting',
 'speculations',
 'congratulate',
 'consequence',
 'illustrious',
 'consolat

### Analysis

#### I. Computational Analysis

##### Barbauld's Adult Poems
    
    Number of Words in the Text:  14057
    Average Word Length in the Text: 4.519954471082023
    Proportion of Words in the Text longer than Four Letters:  0.4427687273244647
    Proportion of Words in the Text longer than Ten Letters:  0.003628085651276944
    
##### Smith's Children's Poems
    Number of Words in the Text:  28814
    Average Word Length in the Text: 4.341639480807941
    Proportion of Words in the Text longer than Four Letters:  0.38401471506906365
    Proportion of Words in the Text longer than Ten Letters:  0.015547997501214687
    
    
Adult poems appear to use longer words on two measures: their average word length is about 0.18 letters higher (4.52 versus 4.34), and their proportion of words longer than four letters is about 6 percentage points higher (44.3% versus 38.4%).

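However, for words over ten letters long the pattern reverses: children's poems have the higher proportion, by about 1.2 percentage points (1.55% versus 0.36%). Some caution is needed here, because the Smith ten-letter list includes tokens that the cleaning step fused together (for example 'WaterEmilyLet' and 'waterfliesMidst') and frequently repeated proper names such as 'Scamperville', both of which raise this figure.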
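Because these proportions count every occurrence of a word, a single long word that recurs often can weigh heavily on the result. One way to check this is to compare occurrences (tokens) with distinct words (types). The sketch below does so on a small hand-copied sample from the Smith ten-letter list; it illustrates the idea only and is not a result for the full text.

```python
from collections import Counter

# A hand-copied sample of tokens from the Smith ten-letter list above.
sample_long_words = [
    "Conversation", "Greenchafer", "Conversation", "nightingale",
    "nightingale", "Scamperville", "Scamperville", "Scamperville",
    "acquaintance", "speculations",
]

counts = Counter(sample_long_words)
tokens = len(sample_long_words)   # every occurrence
types = len(counts)               # distinct words only

print("tokens:", tokens, "types:", types)
print(counts.most_common(3))
```

Running the same comparison on the full `smith_list_tenlw` and `barbauld_list_tenlw` lists would show how much of the ten-letter gap comes from repetition rather than from a wider long-word vocabulary.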

Overall, looking purely at the computational outcomes above, my hypothesis that adult poems would contain a higher proportion of long words than children's poems, but only by a small margin, holds for average word length and for words over four letters, though not for words over ten letters.

#### II. Qualitative Analysis

The question remains as to why this is, and whether my hypothesis is true that it is because adults have a larger vocabulary, and therefore a wider range of words that can perform more varied functions, than children. To answer this, we need to take a look at the lists created above and conduct some qualitative analysis.

###### Barbauld's Four Letter Word List

In [87]:
#The Four Letter list for Qualitative Analysis
barbauld_list_fourlw

['Corsica',
 'manly',
 'unſubmitting',
 'ſpirit',
 'brave',
 'ſtill',
 'bleeding',
 'ſtruggled',
 'generous',
 'undiminiſhd',
 'ſtate',
 'Thomson',
 'generous',
 'Corsica',
 'unconquerd',
 'freedom',
 'amidſt',
 'waves',
 'Stands',
 'adamant',
 'dares',
 'wildeſt',
 'beating',
 'ſtorm',
 'there',
 'ſickly',
 'Unkindly',
 'towring',
 'growths',
 'virtue',
 'exalted',
 'ſpirits',
 'whoſe',
 'deeds',
 'bright',
 'annals',
 'Greece',
 'opposd',
 'Would',
 'throw',
 'ſhades',
 'unrivald',
 'luſtre',
 'faireſt',
 'glows',
 'flame',
 'Liberty',
 'ſtrong',
 'ſpeck',
 'earth',
 'obſcure',
 'Shaggy',
 'woods',
 'cruſted',
 'ſlaves',
 'ſurrounded',
 'ſlaves',
 'oppreſsd',
 'ſhould',
 'Britons',
 'ſhould',
 'catch',
 'contagion',
 'heroic',
 'ardour',
 'kindle',
 'their',
 'working',
 'thoughts',
 'which',
 'ſwelld',
 'breaſt',
 'generous',
 'Boswell',
 'nobler',
 'views',
 'beyond',
 'narrow',
 'beaten',
 'track',
 'trivial',
 'fancy',
 'turnd',
 'courſe',
 'poliſhd',
 'Gallias',
 'delicious',
 '

Barbauld's adult poems seem to have a wide variety of forms and subjects, with many nouns and verbs, but also adjectives, adverbs, and prepositions. The words themselves are quite complex and particular. They reach many levels of description, which would require more complicated sentence structures to use, and they are distinctive enough to capture fine-grained distinctions. When description becomes more particular, it seems to require longer words.

###### Smith's Four Letter Word List

In [107]:
#The Four Letter list for Qualitative Analysis
smith_list_fourlw

['Conversation',
 'First',
 'Poems',
 'Greenchafer',
 'white',
 'Ladybird',
 'Snail',
 'Water',
 'Invitation',
 'Conversation',
 'First',
 'George—Emily',
 'little',
 'Garden',
 'called',
 'their',
 'George',
 'Emily',
 'beauti\xad',
 'shining',
 'insect',
 'which',
 'almost',
 'itself',
 'white',
 'favourite',
 'shaped',
 'those',
 'brownish',
 'chafers',
 'which',
 'desired',
 'gardeners',
 'children',
 'yesterday',
 'because',
 'thought',
 'going',
 'torment',
 'pret\xad',
 'little',
 'tassels',
 'horns',
 'wings',
 'shine',
 'peacocks',
 'feathers',
 'Emily',
 'pretty—but',
 'indeed',
 'George',
 'afraid',
 'should',
 'Mamma',
 'cruel',
 'deprive',
 'insect',
 'liberty',
 'perhaps',
 'would',
 'fined',
 'George',
 'Mamma',
 'could',
 'would',
 'whether',
 'without',
 'hurting',
 'might',
 'little',
 'paper',
 'which',
 'could',
 'strong',
 'paper',
 'holes',
 'could',
 'carry',
 'gently',
 'which',
 'crept',
 'snugly',
 'gather',
 'finest',
 'flower',
 'blown',
 'Emily',
 'suppose'

In this list of words over four letters long in Smith's children's poems, the variety of word types, as well as the depth of description they achieve, does not seem as large as in the adult poems. Here we see names, simple nouns, and basic activities, with adjectives like "beautiful", "shaped" and "brownish" rounding out the type of description that seems possible in this text: single modifying adjectives placed before a noun.

This would seem to agree with my hypothesis that the difference in complexity and variety of vocabulary for each group shapes what types of words we see and, as a result, the word lengths that are produced.

In order to really test this hypothesis, however, I would need to transform the data I have into type lists that recode each word according to its type (noun, adjective, adverb, etc.). I will not attempt that here, but this study opens the door to that type of analysis.
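To make the idea of a type list concrete, the sketch below recodes a handful of words using a tiny hand-made lexicon. Everything in it is hypothetical: `TOY_LEXICON` and `recode_by_type` are invented for illustration, the tags were assigned by hand, and a real study would need a part-of-speech tagger or a full dictionary rather than ten entries.

```python
from collections import Counter

# Hypothetical hand-made lexicon: a real study would need a full tagger or dictionary.
TOY_LEXICON = {
    "insect": "noun", "garden": "noun", "feathers": "noun",
    "shining": "adjective", "brownish": "adjective", "little": "adjective",
    "gently": "adverb", "snugly": "adverb",
    "carry": "verb", "gather": "verb",
}


def recode_by_type(words, lexicon):
    """Replace each word with its word type, or 'unknown' if it is not listed."""
    return [lexicon.get(word.lower(), "unknown") for word in words]


# Words picked by hand from the Smith list shown above.
toy_words = ["shining", "insect", "gently", "carry", "Garden", "Scamperville"]
type_list = recode_by_type(toy_words, TOY_LEXICON)
print(type_list)
print(Counter(type_list))
```

Counting the resulting type list for each text would give the proportions of nouns, adjectives and so on that the hypothesis refers to.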

## Part 2: Pronouns

I now move on to the second part of this research project, examining how the use of pronouns differs between children's and adult poetry.

I hypothesize that adult poetry will have more possessive pronouns, on the assumption that adult poetry will capture more of the subjective nature of experience in narrative, as well as more understanding of intersubjectivity, than children's poetry. I think that children's poetry will have a simpler structure and will stay more at the level of objective description, describing events as opposed to revealing internalization and belonging, since children have not yet been enculturated to understand their place in a social grouping, or the relations of possession between individuals that possessive pronouns reveal. I do think that, as a result of operating at this more objective, descriptive level, children's poetry will have more personal pronouns.

### Barbauld's Adult Poems

To begin, I want to make sure that my string is all lower case. This didn't matter for the length analysis conducted above because any character was counted regardless of case. For these analyses, we are looking for specific subsets of words, in this case pronouns. If we were to search for "He", we would only come up with the instances of "He" that started with an upper case "H" and miss all those that started with a lower case "h". As a result, we need to lowercase everything so that we can observe all instances of each pronoun.

In [112]:
barbauld_string_low = barbauld_string_clean.lower()
barbauld_list_low = barbauld_string_low.split()

#testing that it worked
barbauld_list_low

['corsica',
 'a',
 'manly',
 'race',
 'of',
 'unſubmitting',
 'ſpirit',
 'wiſe',
 'and',
 'brave',
 'who',
 'ſtill',
 'thro',
 'bleeding',
 'ages',
 'ſtruggled',
 'hard',
 'to',
 'hold',
 'a',
 'generous',
 'undiminiſhd',
 'ſtate',
 'too',
 'much',
 'in',
 'vain',
 'thomson',
 'hail',
 'generous',
 'corsica',
 'unconquerd',
 'iſle',
 'the',
 'fort',
 'of',
 'freedom',
 'that',
 'amidſt',
 'the',
 'waves',
 'stands',
 'like',
 'a',
 'rock',
 'of',
 'adamant',
 'and',
 'dares',
 'the',
 'wildeſt',
 'fury',
 'of',
 'the',
 'beating',
 'ſtorm',
 'b',
 'and',
 'and',
 'are',
 'there',
 'yet',
 'in',
 'this',
 'late',
 'ſickly',
 'age',
 'unkindly',
 'to',
 'the',
 'towring',
 'growths',
 'of',
 'virtue',
 'such',
 'bold',
 'exalted',
 'ſpirits',
 'men',
 'whoſe',
 'deeds',
 'to',
 'the',
 'bright',
 'annals',
 'of',
 'old',
 'greece',
 'opposd',
 'would',
 'throw',
 'in',
 'ſhades',
 'her',
 'yet',
 'unrivald',
 'name',
 'and',
 'dim',
 'the',
 'luſtre',
 'of',
 'her',
 'faireſt',
 'page',
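The effect of lowercasing on a search can be seen on a small example. The word list below is made up for illustration and is not taken from either text.

```python
# Made-up tokens mixing capitalized and lowercase forms of the same words.
words = ["He", "said", "he", "would", "ask", "Her", "and", "her", "friend"]

# Case-sensitive search: only the exact spelling "he" is matched.
case_sensitive = [w for w in words if w == "he"]

# Lowercasing first lets one lowercase target match every capitalization.
case_insensitive = [w for w in words if w.lower() == "he"]

print(len(case_sensitive), len(case_insensitive))
```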


With all the data lowercased, we now create a sublist of the lowercased list that contains only personal pronouns.

To do this, we define a list of personal pronouns and keep only those words from the text that appear in it.

I then divide the length of this sublist by the total number of words in the text, giving us the proportion of personal pronouns in the text.

In [114]:
per_list = ["i", "you", "he", "she", "it", "we", "they", "what", "who", "me", "him", "her", "us", "them"]
    
# create a new list, keeping only words that are personal pronouns
barbauld_list_per = [e for e in barbauld_list_low if e in per_list]
# Could run this line of code if we wanted to check if this worked, to see the list
#barbauld_list_per

#divide the number of personal pronouns in the text by the number of words in the full text
print("Proportion of personal pronouns in the Text: ", len(barbauld_list_per) / len(barbauld_list_low))

Proportion of personal pronouns in the Text:  0.03407554954826777
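The filter-and-divide step could be wrapped in a function so that it can be reused for any pronoun list. The sketch below is one way to do that; `pronoun_proportion` is a hypothetical name, and the toy tokens are copied from the lowercased Barbauld output above.

```python
def pronoun_proportion(words, pronouns):
    """Share of words that appear in the given pronoun collection."""
    if not words:
        return 0.0
    pronoun_set = set(pronouns)  # set membership is faster than scanning a list
    hits = sum(1 for word in words if word in pronoun_set)
    return hits / len(words)


per_list = ["i", "you", "he", "she", "it", "we", "they", "what", "who",
            "me", "him", "her", "us", "them"]

# Tokens copied from the lowercased Barbauld list shown above.
toy_words = ("would throw in ſhades her yet unrivald name "
             "and dim the luſtre of her faireſt page").split()
print(pronoun_proportion(toy_words, per_list))
```

Both matches in this toy passage are "her" used possessively ("her ... name", "her faireſt page"). A plain word-list match cannot tell that use apart from the object pronoun, which may be worth keeping in mind when reading the proportions.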


Next, we create a sublist of the lowercased list that contains only possessive pronouns.

To do this, we define a list of possessive pronouns and keep only those words from the text that appear in it.

I then divide the length of this sublist by the total number of words in the text, giving us the proportion of possessive pronouns in the text.

In [115]:
pos_list = ["mine", "yours", "his", "hers", "ours", "theirs"]

# create a new list, keeping only words that are possessive pronouns
barbauld_list_pos = [e for e in barbauld_list_low if e in pos_list]
# Could run this line of code if we wanted to check if this worked, to see the list
#barbauld_list_pos

#divide the number of possessive pronouns in the text by the number of words in the full text
print("Proportion of possessive pronouns in the Text: ", len(barbauld_list_pos) / len(barbauld_list_low))

Proportion of possessive pronouns in the Text:  0.008323255317635342


### Smith's Children's Poems

To begin, I want to make sure that my string is all lower case. This didn't matter for the length analysis conducted above because any character was counted regardless of case. For these analyses, we are looking for specific subsets of words, in this case pronouns. If we were to search for "He", we would only come up with the instances of "He" that started with an upper case "H" and miss all those that started with a lower case "h". As a result, we need to lowercase everything so that we can observe all instances of each pronoun.

In [69]:
smith_string_low = smith_string_clean.lower()
smith_list_low = smith_string_low.split()

smith_list_low

['conversation',
 'the',
 'first',
 'poems',
 'to',
 'a',
 'greenchafer',
 'on',
 'a',
 'white',
 'rose',
 'to',
 'a',
 'ladybird',
 'the',
 'snail',
 'a',
 'walk',
 'by',
 'the',
 'water',
 'invitation',
 'to',
 'the',
 'bee',
 'conversation',
 'the',
 'first',
 'george—emily',
 'in',
 'a',
 'little',
 'garden',
 'called',
 'their',
 'own',
 'george',
 'look',
 'emily',
 'look',
 'at',
 'this',
 'beauti\xad',
 'ful',
 'shining',
 'insect',
 'which',
 'has',
 'almost',
 'hid',
 'itself',
 'in',
 'this',
 'white',
 'rose',
 'on',
 'your',
 'favourite',
 'tree',
 'it',
 'is',
 'shaped',
 'very',
 'like',
 'those',
 'brownish',
 'chafers',
 'which',
 'you',
 'desired',
 'me',
 'to',
 'take',
 'away',
 'from',
 'the',
 'gardeners',
 'children',
 'yesterday',
 'because',
 'you',
 'thought',
 'they',
 'were',
 'going',
 'to',
 'torment',
 'and',
 'hurt',
 'them',
 'but',
 'this',
 'is',
 'not',
 'so',
 'big',
 'and',
 'is',
 'much',
 'pret\xad',
 'tier',
 'see',
 'what',
 'little',
 'tassels

With all the data lowercased, we now create a sublist of the lowercased list that contains only personal pronouns.

To do this, we define a list of personal pronouns and keep only those words from the text that appear in it.

I then divide the length of this sublist by the total number of words in the text, giving us the proportion of personal pronouns in the text.

In [79]:
per_list = ["i", "you", "he", "she", "it", "we", "they", "what", "who", "me", "him", "her", "us", "them"]
    
# create a new list, keeping only words that are personal pronouns
smith_list_per = [e for e in smith_list_low if e in per_list]
#smith_list_per

#divide the number of personal pronouns in the text by the number of words in the full text
print("Proportion of personal pronouns in the Text: ", len(smith_list_per) / len(smith_list_low))

Proportion of personal pronouns in the Text:  0.07996113000624697


Next, we create a sublist of the lowercased list that contains only possessive pronouns.

To do this, we define a list of possessive pronouns and keep only those words from the text that appear in it.

I then divide the length of this sublist by the total number of words in the text, giving us the proportion of possessive pronouns in the text.

In [116]:
pos_list = ["mine", "yours", "his", "hers", "ours", "theirs"]

# create a new list, keeping only words that are possessive pronouns
smith_list_pos = [e for e in smith_list_low if e in pos_list]
#smith_list_pos

# divide the number of possessive pronouns by the total number of words in the text
print("Proportion of possessive pronouns in the Text: ", len(smith_list_pos) / len(smith_list_low))

Proportion of possessive pronouns in the Text:  0.004824043867564378


### Analysis

#### I. Computational Analysis

##### Barbauld's Adult Poems
    Proportion of personal pronouns in the Text:  0.03407554954826777
    Proportion of possessive pronouns in the Text:  0.008323255317635342
    
##### Smith's Children's Poems
    Proportion of personal pronouns in the Text:  0.07996113000624697
    Proportion of possessive pronouns in the Text:  0.004824043867564378
    
From the values produced by the computational analyses, it appears that my hypothesis was correct: Adult Poems have a higher proportion of possessive pronouns than Children's Poems, by about 0.35 percentage points, while Children's Poems have a higher proportion of personal pronouns than Adult Poems, by about 4.6 percentage points. Both Adult and Children's Poems also use more personal pronouns than possessive pronouns.
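As a quick check on the size of these gaps, the differences can be recomputed directly from the four proportions reported above. This is only a sketch: the dictionary names and the helper `point_gap` are my own labels for illustration, not names used elsewhere in the notebook.

```python
# Proportions copied from the outputs reported above.
barbauld = {"personal": 0.03407554954826777, "possessive": 0.008323255317635342}
smith = {"personal": 0.07996113000624697, "possessive": 0.004824043867564378}


def point_gap(a, b):
    """Absolute difference between two proportions, in percentage points."""
    return abs(a - b) * 100


personal_gap = point_gap(smith["personal"], barbauld["personal"])
possessive_gap = point_gap(barbauld["possessive"], smith["possessive"])

print(f"Personal pronouns: Smith higher by {personal_gap:.2f} percentage points")
print(f"Possessive pronouns: Barbauld higher by {possessive_gap:.2f} percentage points")
```

Reporting the gap in percentage points keeps it distinct from a relative difference, which would be much larger for the personal pronouns.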

#### II. Qualitative Analysis

The question remains as to why this is. Is my hypothesis true that adults have a greater understanding of their social world and group belonging, which is construed through possessive pronouns? Do children have a more objective and isolated understanding of how things are related, resulting in more descriptive poetry that carries no sense of relational ties and instead focuses on objective description through personal pronouns? To answer this, we need to look at the lists created above and conduct some qualitative analysis.

##### Barbauld's Lists of Personal and Possessive Pronouns

In [118]:
barbauld_list_per

['who',
 'her',
 'her',
 'what',
 'they',
 'he',
 'her',
 'her',
 'her',
 'i',
 'i',
 'her',
 'her',
 'it',
 'her',
 'her',
 'her',
 'they',
 'it',
 'it',
 'him',
 'him',
 'he',
 'he',
 'he',
 'her',
 'she',
 'her',
 'her',
 'her',
 'him',
 'her',
 'her',
 'her',
 'who',
 'her',
 'them',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'she',
 'her',
 'her',
 'her',
 'her',
 'her',
 'she',
 'you',
 'her',
 'her',
 'us',
 'they',
 'him',
 'they',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'who',
 'you',
 'what',
 'her',
 'her',
 'her',
 'they',
 'her',
 'her',
 'her',
 'her',
 'her',
 'what',
 'her',
 'her',
 'her',
 'her',
 'i',
 'what',
 'we',
 'it',
 'what',
 'i',
 'i',
 'who',
 'i',
 'who',
 'her',
 'i',
 'me',
 'me',
 'i',
 'i',
 'i',
 'i',
 'me',
 'me',
 'who',
 'me',
 'me',
 'her',
 'her',
 'it',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'her',
 'they',
 'her',
 'them',
 'her',
 'who',
 'who',
 'her',
 'me',
 'you',
 'he',
 'i'

In [119]:
barbauld_list_pos

['his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'yours',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'mine',
 'mine',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'yours',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'mine',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'hers',
 'mine',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his']

It looks like the majority of Barbauld's personal pronoun choices are "her", alongside a variety of other frequently used pronouns, though "he" is noticeably infrequent. By contrast, the majority of Barbauld's possessive pronoun choices are "his". I don't think I can draw any conclusions from these lists alone, but they do make me wonder whether the poems in Barbauld's collection are framed around men possessing things or people, or whether males are antagonists in these poems whose possession of things is being contested. This is a question about this particular set of poems rather than a comparison between Adult and Children's Poems, and it concerns the gender relationships portrayed in the texts. While this was not our initial area of study, it does represent an opening for further study in another research project coming out of this one.
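Reading a long printed list by eye is error-prone, so a frequency tally may be a more reliable way to see which pronouns dominate. The sketch below uses `collections.Counter` on a short made-up sample; `sample_pronouns` is a hypothetical stand-in, and in the notebook the same lines could be run on `barbauld_list_per` or `barbauld_list_pos` instead.

```python
from collections import Counter

# Hypothetical sample for illustration; swap in barbauld_list_per in the notebook.
sample_pronouns = ["who", "her", "her", "what", "they", "he", "her", "i", "i", "it"]

# Count how often each pronoun occurs.
counts = Counter(sample_pronouns)
total = sum(counts.values())

# Print each pronoun with its count and its share of the list, most frequent first.
for word, n in counts.most_common():
    print(f"{word:>5}  {n:>3}  {n / total:.1%}")
```

The same tally applied to the Smith lists would make the two authors' pronoun preferences directly comparable.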


##### Smith's Lists of Personal and Possessive Pronouns

In [120]:
smith_list_per

['it',
 'you',
 'me',
 'you',
 'they',
 'them',
 'what',
 'it',
 'it',
 'i',
 'it',
 'you',
 'it',
 'i',
 'it',
 'you',
 'it',
 'it',
 'it',
 'i',
 'it',
 'she',
 'us',
 'it',
 'it',
 'you',
 'it',
 'you',
 'i',
 'you',
 'it',
 'i',
 'it',
 'it',
 'i',
 'them',
 'i',
 'it',
 'you',
 'she',
 'it',
 'you',
 'i',
 'them',
 'they',
 'we',
 'it',
 'he',
 'we',
 'it',
 'it',
 'i',
 'you',
 'it',
 'they',
 'you',
 'them',
 'you',
 'you',
 'them',
 'you',
 'i',
 'you',
 'it',
 'i',
 'she',
 'it',
 'it',
 'you',
 'us',
 'it',
 'it',
 'we',
 'you',
 'me',
 'we',
 'you',
 'i',
 'you',
 'you',
 'them',
 'you',
 'him',
 'who',
 'you',
 'it',
 'you',
 'me',
 'you',
 'it',
 'i',
 'what',
 'them',
 'they',
 'i',
 'i',
 'it',
 'you',
 'us',
 'you',
 'us',
 'them',
 'you',
 'i',
 'we',
 'you',
 'you',
 'you',
 'them',
 'they',
 'them',
 'they',
 'i',
 'i',
 'them',
 'you',
 'i',
 'i',
 'it',
 'them',
 'i',
 'them',
 'you',
 'we',
 'he',
 'we',
 'you',
 'i',
 'i',
 'i',
 'who',
 'who',
 'you',
 'you',
 '

In [121]:
smith_list_pos

['his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'mine',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'ours',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'hers',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his',
 'his

It seems that Smith's preferred personal pronouns were "I", "It", and "You", and to a lesser extent "We" and "Them". This makes me think that children's poems have a smaller, more constrained scope, or at least a more constrained orientation to storytelling. With most personal pronouns being "I", "It", and "You", it seems the narrative and social world of the poems might be confined to the space of the encounter. What I mean by this is that there do not seem to be forces outside of the encounter that impinge upon it, with each actor in the encounter relatively self-constrained. I would have to do a deep reading of the text to back this interpretation up, but just looking at the list of words suggested it as an idea to pursue.

As for the possessive pronouns, it seems Smith, like Barbauld, used mainly "His", with two "ours" and one "mine". I have to wonder again, as I did above, what this says about the nature of this set of poems and the gender relationships represented in these pages, since men seem to be the only figures who can possess something. Again, this alone does not shed light on the hypothesis I presented above, but it does push me beyond merely trying to explain the differences in frequency of personal versus possessive pronouns in adult and children's poems, and toward thinking about which pronouns of each type are used. That gives me a new way to read the text and new research questions to answer.

## Conclusion

In the end, it seems that I can only definitively say that Adult Poems tend, on average, to have longer words and a greater proportion of words longer than four letters than children's poems do. Children's poems do have a larger proportion of words over ten letters long, though for both types of poems these proportions are very small and the difference between them is even smaller.

It also seems that children's poems use proportionally more personal pronouns than adult poems, while adult poems use proportionally more possessive pronouns than children's poems.

The reasons for these findings are less concrete, but I have been able to shed light on the possibility that adult poetry, compared to children's poetry, has a larger set of words at its disposal: a larger vocabulary that can tolerate more particularity and abstraction, whose words tend to be longer, and which therefore gives adult poetry more opportunity to use long words.

As for personal and possessive pronoun use, I was unable to provide an account of why these differences between the genres occur, but I was able to ask why some pronouns in these sets were used more than others. This opened up a new potential line of investigation that can be explored in the future.

## In Class Check-In

### Differences in Programming Approaches

We differed on whether to remove numerical digits from the strings or include them in the analysis. I took them out because I saw them as formatting rather than actual words in the text, and leaving them in would throw off our count of word lengths. My partner hadn't considered that a problem, but acknowledged that removing them might yield more accurate results.
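One way to take the digits out before counting words is a regular-expression substitution. This is a minimal sketch and not necessarily the code either of us ran; the function name `strip_digits` and the sample string are made up for illustration.

```python
import re


def strip_digits(text):
    """Remove every run of digits from the raw text before tokenizing."""
    return re.sub(r"\d+", "", text)


# Hypothetical raw string with stray numbering mixed into the words.
raw = "12 The rose 3 is white"

# Lowercase and split on whitespace after the digits are gone.
tokens = strip_digits(raw).lower().split()
print(tokens)
```

Because `split()` with no arguments collapses repeated whitespace, the gaps left behind by the removed digits do not turn into empty tokens.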

We got slightly different results, but the differences were very small (on the order of thousandths), which we attributed to rounding differences in our hardware.

We used different code to compute the average, though both versions performed the same function. We tested this by running each other's code and arriving at the same answer each of us had obtained with our own.
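To illustrate how two different pieces of code can compute the same average word length, here is a small sketch comparing a manual sum-and-divide with `statistics.mean` from the standard library. These are not necessarily the two versions we actually wrote, and the word list is a short sample chosen for illustration.

```python
from statistics import mean

# Short sample of tokens for illustration.
words = ["a", "little", "garden", "called", "their", "own"]

# Version 1: total number of letters divided by the number of words.
avg_manual = sum(len(w) for w in words) / len(words)

# Version 2: let the standard library average the word lengths.
avg_stats = mean(len(w) for w in words)

print(avg_manual, avg_stats)
```

Swapping implementations on the same input, as we did in class, is a simple way to confirm that a difference in results comes from the data preparation rather than from the averaging step.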

My partner also standardized some of the code by creating functions, which makes for cleaner code and reduces overall work, so that is probably the better approach.
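As a rough idea of what that standardization might look like for the pronoun analysis, the repeated "filter, then divide" step can be wrapped in one function. This is a sketch of the approach rather than my partner's actual code; `proportion_in` and `toy_words` are hypothetical names, while `pos_list` matches the list defined earlier.

```python
def proportion_in(words, targets):
    """Share of tokens in `words` that appear in `targets` (0.0 for empty input)."""
    target_set = set(targets)  # set membership is faster than scanning a list
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in target_set)
    return hits / len(words)


pos_list = ["mine", "yours", "his", "hers", "ours", "theirs"]

# Hypothetical tokens; in the notebook this would be smith_list_low or the Barbauld equivalent.
toy_words = ["his", "garden", "was", "not", "mine", "but", "ours", "too"]

print(proportion_in(toy_words, pos_list))
```

With a helper like this, the personal and possessive analyses for both poets become four calls to the same function instead of four copies of the same lines.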

We also arranged our code differently. I separated my code by the poem collection I was analyzing and worked through it in successive chunks, one analysis at a time. My partner did all of the length analyses at once on both sets of texts, and did the same for the pronoun analyses. I think both work; it depends on how you want to organize the code and when you want to make comments. I also wrote my comments and documentation in Markdown, while she used # comments.

### Three Further Analytic Techniques

1. We could use statistical tests (t-tests, ANOVA, etc.) to determine whether the differences between the adult and children's poems are actually significant. This would let us establish whether these computational text analyses show any real difference between the genres.

2. We could do topic modelling to see what thematic differences and similarities exist between the genres. This would also establish further grounds for theoretical hypotheses and inferences, giving us more information with which to continue our comparisons.

3. We could add a further qualitative component, using deep reading and hermeneutics to read for the findings from the computational analysis and to generate new hypotheses we wouldn't have had beforehand. An example would be reading for male possessiveness in the texts, which the possessive pronoun analysis raised as a possibility. It would also complement and confirm the topic modelling laid out above.
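For the first technique, one possible sketch is a two-proportion z-test, which is a different test from the t-tests and ANOVA named above but fits a comparison of two proportions. The counts below are invented purely for illustration, since the raw pronoun and word counts are not reported here, and the function name `two_proportion_z` is my own. The test also treats every word as an independent draw, which is a simplification for running text.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    # Pooled proportion under the null hypothesis that both texts share one rate.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 80 pronouns in 1000 words versus 34 pronouns in 1000 words.
z, p_value = two_proportion_z(80, 1000, 34, 1000)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

Running this on the real counts from `len(smith_list_per)`, `len(smith_list_low)`, and the Barbauld equivalents would indicate whether the observed gaps are larger than chance variation alone would suggest.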

### Substantive Questions

1. We could analyze whether children have a more egocentric orientation to the world and a more constricted understanding of the social world, in the sense of not seeing exterior or social forces impinging on them or shaping their available actions and context. How much awareness do they have of larger-scale phenomena and the way these impact their lives? Does awareness of the social world and of larger-scale forces and relations come with age, in a way we would see in the poems and word choices of adults?

2. Given the finding that possessive pronouns in both the adult and children's poems were predominantly "His", is there a greater understanding or perception of male domination among children and adults? Since both the adult and children's poems were written by women, would we find the same thing in children's and adult poems written by men? Would we find that women perceive male domination and possession, and that children are influenced by the women who write children's books? Or would it also show the role of patriarchy in the household, with the father dominant in relation to both the mother and the child?

3. We could also examine word length along a developmental continuum by including texts for adolescents, teens, young adults, etc. Would we see progressively higher use of long words across the age scale? What types of words appear when? What themes would we see for each age bracket? What does this tell us about each stage of the life cycle? What does it tell us about the available vocabulary at each stage of development?