# Parsing gendered text

Implement the `parse_gender` function as described on pp. 10-12 of the textbook. Run the function over the three texts indicated below and comment (briefly) on the results.

Starter code is included below. When finished, commit your code and issue a pull request to me.

In [32]:
# Imports
import nltk
import os
from collections import Counter

# Variables
text_dir = os.path.join('..', 'data', 'texts') # Where are the texts?
texts = [
    'A-Alcott-Little_Women-1868-F.txt', # _Little Women_
    'A-Twain-Huck_Finn-1885-M.txt',     # _Huck Finn_
    'B-Eliot-Middlemarch-1869-F.txt'    # _Middlemarch_
]

In [33]:
# Word lists
MALE = 'male'
FEMALE = 'female'
UNKNOWN = 'unknown'
BOTH = 'both'

MALE_WORDS = set([
    'guy','spokesman','chairman',"men's",'men','him',"he's",'his',
    'boy','boyfriend','boyfriends','boys','brother','brothers','dad',
    'dads','dude','father','fathers','fiance','gentleman','gentlemen',
    'god','grandfather','grandpa','grandson','groom','he','himself',
    'husband','husbands','king','male','man','mr','nephew','nephews',
    'priest','prince','son','sons','uncle','uncles','waiter','widower',
    'widowers'
])

FEMALE_WORDS = set([
    'heroine','spokeswoman','chairwoman',"women's",'actress','women',
    "she's",'her','aunt','aunts','bride','daughter','daughters','female',
    'fiancee','girl','girlfriend','girlfriends','girls','goddess',
    'granddaughter','grandma','grandmother','herself','ladies','lady',
    'mom','moms','mother','mothers','mrs','ms','niece','nieces',
    'priestess','princess','queens','she','sister','sisters','waitress',
    'widow','widows','wife','wives','woman'
])

In [34]:
def genderize(words):
    """Classify a tokenized sentence by the gendered words it contains."""

    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))
    
    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN
    

In [35]:
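def count_gender(sentences):
    """Tally sentences and words for each gender category."""

    sents = Counter()  # Number of sentences per category
    words = Counter()  # Number of words (tokens) per category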
    
    for sentence in sentences:
        gender = genderize(sentence)
        sents[gender] += 1
        words[gender] += len(sentence)
        
    return sents, words

In [58]:
def parse_gender(text):
    """Print each category's share of words and its sentence count."""
    
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]
    
    sents, words = count_gender(sentences)
    total = sum(words.values())
    
    for gender, count in words.items():
        pcent = (count / total) * 100
        nsents = sents[gender]
        
        print(
            "{0:0.3f}% {1:} ({2:} sentences)".format(pcent, gender, nsents)
        )
        

In [59]:
# Run and examine the output
for text in texts: # Loop over texts in corpus directory
    with open(os.path.join(text_dir, text), 'r') as f: # Open each text in turn
        parse_gender(f.read()) # Run the gender-parsing function

32.575% unknown (4539 sentences)
33.212% female (2504 sentences)
17.909% both (1010 sentences)
16.305% male (1393 sentences)
47.037% unknown (3576 sentences)
36.356% male (1650 sentences)
9.556% female (415 sentences)
7.051% both (185 sentences)
36.999% male (4558 sentences)
19.872% both (1880 sentences)
28.951% unknown (6528 sentences)
14.177% female (1917 sentences)
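
The output above is unlabeled: each block of four lines corresponds to one entry in `texts`, in list order. A minimal sketch of one way to label each block is to derive a title from the filename. The helper `title_from_filename` is hypothetical (it is not part of the starter code), and it assumes every filename follows the five-field pattern of the three files listed in `texts`.

```python
def title_from_filename(filename):
    """Return a readable title from a corpus filename.

    Assumes the pattern seen in `texts`: five hyphen-separated fields with
    the title third and underscores standing in for spaces.
    """
    stem = filename.rsplit('.', 1)[0]   # Drop the '.txt' extension
    fields = stem.split('-')            # e.g. ['A', 'Alcott', 'Little_Women', '1868', 'F']
    return fields[2].replace('_', ' ')


texts = [
    'A-Alcott-Little_Women-1868-F.txt',
    'A-Twain-Huck_Finn-1885-M.txt',
    'B-Eliot-Middlemarch-1869-F.txt',
]

for text in texts:
    print(title_from_filename(text))
```

Calling `print(title_from_filename(text))` just before `parse_gender(f.read())` in the loop would put a heading over each block. A title that itself contains a hyphen would break the field split, so this only suits filenames like the three above.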


## Discussion

The results of the `parse_gender` function did not surprise me, given the thematic material of the texts. Compared with Huck Finn, Little Women has more female characters, and its plot centers on the relationships among the March sisters, so it follows that 33.212% of the text is coded as 'female'. Huck Finn, by contrast, has proportionally fewer female characters and deals with masculine themes, so it likewise follows that 36.356% of the text is coded as 'male'. Finally, Middlemarch is a much more balanced text with an intersecting cast of female and male characters, so the large proportion of the text (28.951%) coded as 'unknown' makes sense given the makeup of the cast.
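
The percentages quoted above are shares of words, because `parse_gender` divides each category's word count by the total word count; the sentence counts are printed but not converted to percentages. As a rough cross-check, the sketch below computes each category's share of sentences from the counts in the output. The name `sentence_shares` is hypothetical, and the mapping of output blocks to titles assumes the blocks appear in the order of `texts`.

```python
# Sentence counts copied from the output above, grouped in the order of `texts`
# (assumption: the first block is Little Women, the second Huck Finn, the third Middlemarch).
sentence_counts = {
    'Little Women': {'unknown': 4539, 'female': 2504, 'both': 1010, 'male': 1393},
    'Huck Finn':    {'unknown': 3576, 'male': 1650, 'female': 415, 'both': 185},
    'Middlemarch':  {'male': 4558, 'both': 1880, 'unknown': 6528, 'female': 1917},
}


def sentence_shares(counts):
    """Return each category's share of sentences as a percentage."""
    total = sum(counts.values())
    return {gender: 100 * n / total for gender, n in counts.items()}


for title, counts in sentence_counts.items():
    print(title)
    for gender, share in sentence_shares(counts).items():
        print("  {0:0.3f}% {1} ({2} sentences)".format(share, gender, counts[gender]))
```

Comparing the two views shows whether a category looks large because it has many sentences or because its sentences are long.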