# Gendered Language

This notebook is based on Neal Caren's ["Using Python to see how the Times writes about men and women"][1], as simplified by Bengfort, Bilbro, and Ojeda in _Applied Text Analysis with Python_. The book has a [GitHub repo][2] that allows comparison both with Caren's original and with the current implementation.


[1]: https://nbviewer.jupyter.org/gist/nealcaren/5105037
[2]: https://github.com/foxbook/atap

In [1]:
import nltk, re, pandas as pd
from collections import Counter

In [2]:
df = pd.read_csv('../output/TEDall_speakers.csv')
print(list(df))

['Set', 'Talk_ID', 'public_url', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text', 'speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile']


In [3]:
texts = df.text.tolist()

We create a list of texts here so we can easily grab a sample text, and perhaps a subset as well, with which to test our code. In the long run it makes more sense to work with the dataset directly, walking through the rows with `df.itertuples()` and accessing each field as `row.column_name`:

```python
for row in df.itertuples():
    print(row.public_url)

https://www.ted.com/talks/al_gore_on_averting_climate_crisis
https://www.ted.com/talks/david_pogue_says_simplicity_sells
https://www.ted.com/talks/majora_carter_s_tale_of_urban_renewal
...
```

As of this writing, however, I don't know how to grab just a couple of rows from a column this way.
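One possible way to limit the loop to a few rows is to slice the DataFrame before iterating, for example with `head()` or `iloc`. The sketch below uses a small stand-in DataFrame named `df_demo`: its values are made up for illustration, and only the column names `Talk_ID` and `text` come from the dataset above.

```python
import pandas as pd

# Hypothetical stand-in for the TED DataFrame; the values are invented
# for illustration, only the column names match the dataset above.
df_demo = pd.DataFrame({
    "Talk_ID": [1, 2, 3, 4],
    "text": ["first talk", "second talk", "third talk", "fourth talk"],
})

# head(n) keeps the first n rows, so itertuples() only walks those rows
first_two = [(row.Talk_ID, row.text) for row in df_demo.head(2).itertuples()]
print(first_two)

# iloc slices by position and works on a single column as well
sample_texts = df_demo["text"].iloc[1:3].tolist()
print(sample_texts)
```

The same pattern should carry over to the real data, e.g. `df.head(3).itertuples()` for the first three talks.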

With the data loaded, we establish the control vocabularies for gender. 

**TODO**: we will need to tune this list for the corpus.

In [4]:
# Gender Word Lists

FEMININE = 'feminine'
MASCULINE = 'masculine'
UNKNOWN = 'unknown'
BOTH = 'both'

MASCULINE_WORDS = set([
    'guy','spokesman','chairman',"men's",'men','him',"he's",'his',
    'boy','boyfriend','boyfriends','boys','brother','brothers','dad',
    'dads','dude','father','fathers','fiance','gentleman','gentlemen',
    'god','grandfather','grandpa','grandson','groom','he','himself',
    'husband','husbands','king','male','man','mr','nephew','nephews',
    'priest','prince','son','sons','uncle','uncles','waiter','widower',
    'widowers'
])

FEMININE_WORDS = set([
    'heroine','spokeswoman','chairwoman',"women's",'actress','women',
    "she's",'her','aunt','aunts','bride','daughter','daughters','female',
    'fiancee','girl','girlfriend','girlfriends','girls','goddess',
    'granddaughter','grandma','grandmother','herself','ladies','lady',
    'mom','moms','mother','mothers','mrs','ms','niece','nieces',
    'priestess','princess','queens','she','sister','sisters','waitress',
    'widow','widows','wife','wives','woman'
])
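As a possible starting point for the tuning mentioned in the TODO, the sketch below counts how often each vocabulary word actually occurs in a set of texts and lists the words that never occur. Everything in it is an assumption for illustration: `VOCAB` is a tiny stand-in for the lists above, `demo_texts` are invented sentences, and a simple regex tokenizer stands in for `nltk`.

```python
import re
from collections import Counter

# Stand-in vocabulary and texts, invented for illustration only
VOCAB = {"he", "she", "king", "queens", "god"}
demo_texts = [
    "She said the king was late.",
    "Oh my god, he said, the queen is here.",
]

# Simple regex tokenizer (an assumption; the notebook itself uses nltk)
tokens = Counter(
    tok for text in demo_texts for tok in re.findall(r"[a-z']+", text.lower())
)

# How often each vocabulary word occurs, and which never occur
hits = {w: tokens[w] for w in sorted(VOCAB) if tokens[w] > 0}
unused = sorted(w for w in VOCAB if tokens[w] == 0)
print(hits)
print(unused)
```

In this toy run `queens` goes unused even though "queen" appears, which mirrors the lists above: `FEMININE_WORDS` contains `'queens'` but not `'queen'`. Likewise, an exclamation such as "oh my god" would count toward `MASCULINE_WORDS`.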

In [5]:
def genderize(words):
    
    fwlen = len(FEMININE_WORDS.intersection(words))
    mwlen = len(MASCULINE_WORDS.intersection(words))
    
    if fwlen > 0 and mwlen == 0:
        return FEMININE
    elif fwlen == 0 and mwlen > 0:
        return MASCULINE
    elif fwlen > 0 and mwlen > 0:
        return BOTH
    else:
        return UNKNOWN

def count_gender(sentences):
    """Tally sentences and words per gender label.

    REQUIRES: from collections import Counter
    """
    sents = Counter()
    words = Counter()

    for sentence in sentences:
        gender = genderize(sentence)
        sents[gender] += 1
        words[gender] += len(sentence)

    return sents, words

def parse_gender(text):
    """Print the share of words and the sentence count per gender label.

    REQUIRES: import nltk
    """
    # Split into sentences, then into lowercased word tokens
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]

    sents, words = count_gender(sentences)
    total = sum(words.values())

    for gender, count in words.items():
        pcent = (count / total) * 100
        nsents = sents[gender]

        print(f"{pcent:.2f} {gender} ( {nsents} ) ")
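To make the four labels concrete, the sketch below runs the same decision logic on four hand-made token lists. It is self-contained, so it uses deliberately tiny stand-in word sets: `F_WORDS`, `M_WORDS`, and `genderize_demo` are hypothetical names, not the full lists or the function defined above.

```python
# Tiny stand-in vocabularies, assumed for illustration only
F_WORDS = {"she", "her", "woman"}
M_WORDS = {"he", "his", "man"}

def genderize_demo(words):
    """Same decision logic as genderize(), using the stand-in sets."""
    has_f = bool(F_WORDS.intersection(words))
    has_m = bool(M_WORDS.intersection(words))
    if has_f and has_m:
        return "both"
    if has_f:
        return "feminine"
    if has_m:
        return "masculine"
    return "unknown"

examples = [
    ["she", "wrote", "the", "book"],
    ["he", "read", "it"],
    ["she", "gave", "his", "copy", "back"],
    ["the", "book", "was", "long"],
]
labels = [genderize_demo(tokens) for tokens in examples]
print(labels)
# ['feminine', 'masculine', 'both', 'unknown']
```

Note that the intersection counts distinct matching words rather than occurrences, and that matching is exact, which is why the tokens are lowercased before being passed in.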

The example below simply establishes that gender is not straightforward: there is no guarantee that someone using gendered words, as constructed above, is actually talking about women.

In [6]:
text = "My dog Molly is a good girl."
words = re.sub(r"[^\w+]", " ", text.lower())  # raw string avoids an invalid "\w" escape
print(words)

my dog molly is a good girl 


In [7]:
parse_gender(words)

100.00 feminine ( 1 ) 


Al Gore's talk is our bellwether because it sits at index `0` of the list. To mix it up, here we use Majora Carter's talk on urban renewal (index `2`):

In [8]:
parse_gender(texts[2])

86.61 unknown ( 170 ) 
5.18 feminine ( 5 ) 
7.06 masculine ( 13 ) 
1.15 both ( 1 ) 


In [9]:
print(texts[2][0:100])

  If you're here today — and I'm very happy that you are — you've all heard about how sustainable de


Carter's talk shares a lot with Al Gore's: both are predominantly `unknown`, and Gore's talk is even less "gendered":

    90.45 unknown ( 137 ) 
    4.54 feminine ( 3 ) 
    0.78 both ( 1 ) 
    4.23 masculine ( 4 ) 

Is this potentially a function of the talks being in the first person and about a "neutral" topic like urban renewal or climate change? (Both documents return `unknown` for most of their words, roughly 87% and 90%.) Do the gendered word lists need to be examined and revised?

What if we did the same thing for first, second, third persons — and maybe consider singular versus plural as well?
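As a rough sketch of that idea, the same set-intersection approach could be pointed at pronouns instead of gendered nouns. The word lists and the names `PERSON_WORDS` and `personize` below are assumptions made up for illustration, not part of the notebook. The sentences are invented, and a sentence is labeled `mixed` when more than one person appears.

```python
from collections import Counter

# Hypothetical pronoun lists; they would need tuning just like the gender lists
PERSON_WORDS = {
    "first_singular": {"i", "me", "my", "mine", "myself"},
    "first_plural": {"we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third_singular": {"he", "him", "his", "she", "her", "hers", "it", "its"},
    "third_plural": {"they", "them", "their", "theirs", "themselves"},
}

def personize(words):
    """Label a tokenized sentence by the grammatical person(s) it contains."""
    found = [label for label, vocab in PERSON_WORDS.items()
             if vocab.intersection(words)]
    if not found:
        return "unknown"
    if len(found) > 1:
        return "mixed"
    return found[0]

# Invented, pre-tokenized sentences for illustration
sentences = [
    ["i", "like", "parks"],
    ["we", "can", "do", "better"],
    ["they", "built", "a", "park"],
    ["you", "know", "what", "i", "mean"],
    ["parks", "matter"],
]
person_counts = Counter(personize(s) for s in sentences)
print(person_counts)
```

Whether a single `mixed` label is useful, or whether each combination should be tracked separately, is an open design choice.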

## Refining Results to Get a List/Tuple

One thing that comes up when running the code above is that the order of the four outcomes (`feminine`, `masculine`, `both`, `unknown`) varies from text to text: a `Counter` keeps its keys in the order they were first seen, so the order depends on which kind of sentence occurs first. It would be nice to have the parser return a four-value list or tuple that is always in the same sequence.

In [10]:
def pargen(text):
    """Print the word-share percentage for each gender label.

    REQUIRES: import nltk
    """
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]

    sents, words = count_gender(sentences)
    total = sum(words.values())

    for gender, count in words.items():
        pcent = (count / total) * 100
        print(f'{pcent:.2f},')

In [11]:
pargen(texts[2])

86.61,
5.18,
7.06,
1.15,


In [None]:
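for row in df.itertuples():
    # NOTE: pargen() prints its percentages and returns None, so
    # `gendered` is None until pargen is changed to return a value.
    gendered = pargen(row.text)
    print(row.Talk_ID, gendered)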
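One way to get a stable order, sketched here under assumptions, is to look the labels up in a fixed sequence instead of iterating over the `Counter`, and to return the values rather than print them. `GENDER_ORDER` and `gender_percentages` are hypothetical names, and the counts in `demo_words` are made up. The function expects a `words` counter shaped like the one `count_gender` returns.

```python
from collections import Counter

# Fixed reporting order (an assumption; any order works if it never changes)
GENDER_ORDER = ("feminine", "masculine", "both", "unknown")

def gender_percentages(words):
    """Return word-share percentages as a tuple ordered like GENDER_ORDER."""
    total = sum(words.values())
    if total == 0:
        # Empty text: avoid dividing by zero
        return tuple(0.0 for _ in GENDER_ORDER)
    # A Counter returns 0 for missing keys, so absent labels become 0.0
    return tuple(round(words[g] / total * 100, 2) for g in GENDER_ORDER)

# Made-up counts, inserted in a different order than GENDER_ORDER
demo_words = Counter({"unknown": 80, "masculine": 12, "feminine": 8})
result = gender_percentages(demo_words)
print(result)
# (8.0, 12.0, 0.0, 80.0)
```

Inside `pargen`, the final loop could then be replaced with `return gender_percentages(words)`, which would let the `df.itertuples()` loop collect one tuple per talk.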