# New York Times Example

### Consider a simple model that reveals gender roles in a piece of text.

#### Any sentence can be classified as being about men, about women, about both men and women, or as unknown. The word sets below are not exhaustive.

In [1]:
MALE = 'male'
FEMALE = 'female'
UNKNOWN = 'unknown'
BOTH = 'both'

MALE_WORDS = set(['guy','spokesman','chairman',"men's",'men','him',"he's",'his',
    'boy','boyfriend','boyfriends','boys','brother','brothers','dad',
    'dads','dude','father','fathers','fiance','gentleman','gentlemen',
    'god','grandfather','grandpa','grandson','groom','he','himself',
    'husband','husbands','king','male','man','mr','nephew','nephews',
    'priest','prince','son','sons','uncle','uncles','waiter','widower',
    'widowers'])

FEMALE_WORDS = set(['heroine','spokeswoman','chairwoman',"women's",'actress','women',
    "she's",'her','aunt','aunts','bride','daughter','daughters','female',
    'fiancee','girl','girlfriend','girlfriends','girls','goddess',
    'granddaughter','grandma','grandmother','herself','ladies','lady',
    'mom','moms','mother','mothers','mrs','ms','niece','nieces',
    'priestess','princess','queens','she','sister','sisters','waitress',
    'widow','widows','wife','wives','woman'])

#### Let's look at how to determine the gender class of a sentence. The function below counts how many distinct words from MALE_WORDS and FEMALE_WORDS appear in a sentence and assigns a class based on those two counts.

In [2]:
def genderize(words):
    
    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))
    
    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN
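
As a quick sanity check, the classifier can be run on a few hand-tokenized sentences. The sketch below redefines a small, illustrative subset of the word sets so that it runs on its own; the sample sentences are made up.

```python
# Trimmed-down word sets (illustrative subset of MALE_WORDS / FEMALE_WORDS)
MALE_WORDS = {'he', 'his', 'him', 'father'}
FEMALE_WORDS = {'she', 'her', 'mother'}

def genderize(words):
    # Set intersection counts distinct matching words, not occurrences
    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))
    if mwlen > 0 and fwlen == 0:
        return 'male'
    elif mwlen == 0 and fwlen > 0:
        return 'female'
    elif mwlen > 0 and fwlen > 0:
        return 'both'
    return 'unknown'

# Hypothetical, already lowercased token lists
examples = [
    ['he', 'called', 'his', 'father'],
    ['she', 'thanked', 'her', 'mother'],
    ['she', 'called', 'him'],
    ['the', 'weather', 'was', 'cold'],
]
labels = [genderize(tokens) for tokens in examples]
print(labels)
```

The lookup is case-sensitive, so tokens need to be lowercased before they are passed in; a capitalized `'He'` would not match.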

#### Next, we need to count the number of sentences and words that fall into each gender class throughout the text. For this task, we can use Python's built-in collections.Counter class.

In [3]:
from collections import Counter

def count_gender(sentences):
    
    sents = Counter()
    words = Counter()
    
    for sentence in sentences:
        gender = genderize(sentence)
        sents[gender] += 1
        words[gender] += len(sentence)
        
    return sents, words
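
The reason `Counter` fits here is that missing keys read as zero, so each label can be incremented without being initialized first. A minimal sketch with made-up labels:

```python
from collections import Counter

counts = Counter()
# No need to pre-populate the labels: a missing key behaves like 0
for label in ['female', 'unknown', 'female', 'male']:
    counts[label] += 1

print(counts['female'])        # count for a label that was seen
print(counts['both'])          # a label that was never seen reads as 0
print(counts.most_common(1))   # most frequent label with its count
```

Reading an unseen label such as `'both'` does not add it to the counter, so only classes that actually occurred appear when iterating over it.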

#### We use the NLTK library to split the text into sentences and each sentence into words. Then we apply the count_gender function to the list of tokenized sentences and calculate the final statistics.

In [4]:
import nltk
nltk.download('punkt')

def parse_gender(text):
    
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]

    
    sents, words = count_gender(sentences)
    total = sum(words.values())
    
    for gender, count in words.items():
        pcent = (count/total) * 100
        nsents = sents[gender]
        
        print(
            "{:0.3f}% {} ({} sentences)".format(pcent, gender, nsents)
        )

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\79771\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
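
To make the shape of `sentences` concrete without downloading any NLTK data, the sketch below uses a crude regular-expression splitter as a stand-in for `nltk.sent_tokenize` and `nltk.word_tokenize`. The helper `simple_tokenize` is hypothetical and not part of NLTK; it mishandles abbreviations such as "Mr." and drops punctuation, which NLTK keeps as separate tokens. It only illustrates the nested list-of-lists structure that `count_gender` expects.

```python
import re

def simple_tokenize(text):
    # Rough sentence boundary: whitespace that follows '.', '!' or '?'
    raw_sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Keep word-like runs (letters, digits, apostrophes), lowercased
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in raw_sentences if s]

sentences = simple_tokenize("She met her brother. The weather was cold!")
print(sentences)
```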


### Displaying results

In [19]:
# Use a context manager so the file is closed after reading
with open('Chapter_1_data.txt', 'r', encoding='utf-8') as f:
    text = f.read()
parse_gender(text)

39.269% unknown (48 sentences)
52.994% female (38 sentences)
4.393% both (2 sentences)
3.344% male (3 sentences)
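
Note that the percentages are shares of tokens, not of sentences: `parse_gender` divides the number of tokens in each class by the total number of tokens, which is why the 38 female sentences can account for a larger share than the 48 unknown ones. The sketch below isolates that calculation in a hypothetical helper, `gender_shares`, which returns the values instead of printing them and guards against an empty text. The token counts are made up for illustration.

```python
from collections import Counter

def gender_shares(words):
    # words maps a gender label to the number of tokens
    # in the sentences that received that label
    total = sum(words.values())
    if total == 0:
        return {}  # empty text: avoid dividing by zero
    return {gender: 100 * count / total for gender, count in words.items()}

# Made-up token counts, not the results from Chapter_1_data.txt
words = Counter({'female': 60, 'unknown': 30, 'male': 10})
shares = gender_shares(words)
print(shares)
```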
