# Gendered Vocabulary

This notebook continues the work on gender and vocabulary in the TED talks corpus. I've had a few weeks to think about the previous work and about the various factors that may or may not affect our understanding. This notebook began as a copy of the previous notebook, `Tt-08-vocab-gender`, and reuses a small portion of its code. I began work here with the assumption that what we need to build is a new dataframe in which each row is a word. Here is my tentative sketch of the columns needed:

    WORD, OVERALL FREQ, OVERALL RF, F FREQ, F RF, M FREQ, M RF

    FREQ =  frequency (number of times a word occurs)
    RF   =  relative frequency (number of times a word occurs 
            divided by the total number of words)
    F    =  speaker is a woman
    M    =  speaker is a man

As soon as I finished this, I filtered the dataframe for uses of "PMS" and saw that it came down to two talks, one by a woman and one by a man. I realized we probably also need columns for the number of talks in which a word occurs: overall, by women, and by men.
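A minimal sketch of how the table sketched above might be assembled, using only the standard library. The helper names (`tokenize`, `word_table`) and the three toy talks are made up for illustration, and the `\w+` pattern is an assumption about how words should be split. Each word maps to a row holding its frequency and relative frequency overall and per gender.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on runs of word characters
    return re.findall(r"\w+", text.lower())

def word_table(talks):
    """talks: list of (gender, text) pairs; returns {word: row dict}."""
    counts = {"all": Counter(), "female": Counter(), "male": Counter()}
    for gender, text in talks:
        tokens = tokenize(text)
        counts["all"].update(tokens)
        if gender in ("female", "male"):
            counts[gender].update(tokens)
    # Total number of tokens per group, used as the RF denominator
    totals = {group: sum(c.values()) for group, c in counts.items()}
    table = {}
    for word in counts["all"]:
        row = {}
        for group, label in (("all", "overall"), ("female", "f"), ("male", "m")):
            freq = counts[group][word]
            row[label + "_freq"] = freq
            # Guard against a group with no words at all
            row[label + "_rf"] = freq / totals[group] if totals[group] else 0.0
        table[word] = row
    return table

# Toy data, invented for illustration only
toy_talks = [
    ("female", "We build and we share"),
    ("male", "We build machines"),
    ("male", "Machines share data"),
]
table = word_table(toy_talks)
print(table["we"])
```

If the result looks right, `pandas.DataFrame.from_dict(table, orient='index')` should turn it into a dataframe with one row per word.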

In [1]:
# get the data

# Let python create the column names list:
with open('../data/talks_6d.csv') as f:
    colnames = f.readline().strip().split(",")
print(colnames)

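# Now import the csv as a dataframe.
# Note: passing names= without header=0 makes pandas treat the file's
# header line as data, which is why row 0 below repeats the column names.
import pandas
TEDtalks = pandas.read_csv('../data/talks_6d.csv', names=colnames)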

TEDtalks.head()

['', 'citation', 'author', 'gender', 'title', 'date', 'length', 'text', 'occupation', 'numDate']


Unnamed: 0,Unnamed: 1,citation,author,gender,title,date,length,text,occupation,numDate
0,,citation,author,gender,title,date,length,text,occupation,numDate
1,1.0,Al Gore 2006,Al Gore,male,Averting the climate crisis,Jun 2006,957,Thank you so much Chris. And it's truly a gre...,Climate advocate,200606
2,2.0,David Pogue 2006,David Pogue,male,Simplicity sells,Jun 2006,1271,Hello voice mail my old friend. I've called f...,Technology columnist,200606
3,3.0,Cameron Sinclair 2006,Cameron Sinclair,male,My wish: A call for open-source architecture,Jul 2006,1398,I'm going to take you on a journey very quickl...,"Co-founder, Architecture for Humanity",200607
4,4.0,Sergey Brin + Larry Page 2007,Sergey Brin + Larry Page,male,The genesis of Google,May 2007,1205,Sergey Brin I want to discuss a question I kn...,,200705


In [3]:
# Create a list of just the texts
texts = TEDtalks.text.tolist()

## All Talks

In [2]:
import nltk
import operator
from nltk.tokenize import RegexpTokenizer

def relafreq(list_of_texts):
    # Take the list and turn it into one long, lowercased string
    all_texts = " ".join(list_of_texts).lower()
    # Tokenize on runs of word characters (this drops punctuation)
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(all_texts)
    # Invoke the NLTK god: count every token in one pass
    freqdist = nltk.FreqDist(words)
    # Get the total number of words so we can establish relative frequencies
    total_words = sum(freqdist.values())
    # Create a new dictionary of relative frequencies
    relative_frequency = {k: v / total_words for k, v in freqdist.items()}
    # Convert the dictionary to a rankable list of tuples
    # & rank it with the most frequent word first
    ranked = sorted(relative_frequency.items(), key=operator.itemgetter(1), reverse=True)
    # Return the ranked list of tuples
    return ranked

In [None]:
all_talks = relafreq(texts)
all_talks[0:10]

In [None]:
# This is the number of distinct words (the vocabulary size), not the total word count.
print(len(all_talks))

In [6]:
TEDtalks[TEDtalks['text'].str.contains("PMS")]

Unnamed: 0,Unnamed: 1,citation,author,gender,title,date,length,text,occupation,numDate
557,557.0,VS Ramachandran 2007,VS Ramachandran,male,3 clues to understanding your brain,Oct 2007,1356,Well as Chris pointed out I study the human ...,,200710
931,931.0,Robyn Stein DeLuca 2015,Robyn Stein DeLuca,female,The good news about PMS,Mar 2015,880,How many people here have heard of PMS Everyb...,Psychologist,201503


In [11]:
pms_talks = TEDtalks[TEDtalks['text'].str.contains("PMS")].text.tolist()

In [14]:
print(pms_talks[0])

Well  as Chris pointed out  I study the human brain  the functions and structure of the human brain. And I just want you to think for a minute about what this entails. Here is this mass of jelly  three pound mass of jelly you can hold in the palm of your hand  and it can contemplate the vastness of interstellar space. It can contemplate the meaning of infinity and it can contemplate itself contemplating on the meaning of infinity. And this peculiar recursive quality that we call self awareness  which I think is the holy grail of neuroscience  of neurology  and hopefully  someday  we'll understand how that happens. OK  so how do you study this mysterious organ  I mean  you have     billion nerve cells  little wisps of protoplasm  interacting with each other  and from this activity emerges the whole spectrum of abilities that we call human nature and human consciousness. How does this happen  Well  there are many ways of approaching the functions of the human brain. One approach  the one

Okay, this is weird. Filtering for use of "PMS" turns up 2 talks: "PMS" is used 49 times in one and only once in the other. The two talks run 2076 and 4405 words. (The shorter talk, with the higher incidence of "PMS", is by a woman; the longer talk, with only one occurrence, is by a man.)

I would argue that the number of talks in which a word occurs does have some effect on how we think of a word "gendering" a speaker, or, conversely, on how a speaker's gender affects our understanding of a word's usage. So, do we also need to keep count of the number of texts in which a word occurs?
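One way to keep both counts side by side, sketched here under assumptions: the helper names (`tokenize`, `talk_counts`) and the toy texts are invented for illustration, and matching is case-insensitive, which differs from the case-sensitive `str.contains("PMS")` filter used above.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on runs of word characters
    return re.findall(r"\w+", text.lower())

def talk_counts(texts):
    """Count, for each word, the number of talks in which it appears."""
    in_talks = Counter()
    for text in texts:
        # set() so a word counts once per talk, however often it is repeated
        in_talks.update(set(tokenize(text)))
    return in_talks

# Toy texts, invented for illustration only
toy_texts = [
    "PMS is real and PMS is studied",
    "Nobody mentioned PMS here",
    "A talk about brains",
]

# Occurrences of one word within each talk
per_talk = [tokenize(t).count("pms") for t in toy_texts]
# Number of talks containing each word
n_talks = talk_counts(toy_texts)
print(per_talk, n_talks["pms"])
```

The per-talk list exposes the kind of imbalance noted above (many uses in one talk, a single use in another), while the talk count says how widely a word is spread.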

## Gendered Talks

The next step is to create two additional collections filtered by the `gender` column of the dataframe. Two things happen: we filter the dataframe, and then we pull the text column out as a list. Originally, I had this as one line:

    f_talks = TEDtalks[TEDtalks.gender == 'female'].text.tolist()

But that produced a string and not a list object. I don't know why the one line would not work.

**2018-03-22**: Now it's working. Did I have `texts.tolist()`? I'm not clear on what happened.

In [None]:
# Filter by gender

m_talks = TEDtalks[TEDtalks.gender == 'male'].text.tolist()
f_talks = TEDtalks[TEDtalks.gender == 'female'].text.tolist()

# A quick check of numbers:

print("Of the {} TED talks given, {} were given by women and {} by men.".format(
    len(texts), len(f_talks), len(m_talks)))

In [None]:
1437 / 607

TODO: One thing to do here is to compare the talks to see which words are used **only** by women or **only** by men.
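A possible starting point for this TODO, assuming the relative frequencies are held in dictionaries keyed by word. The two toy dictionaries and their values are invented; in the notebook they would come from `dict(f_relative)` and `dict(m_relative)`.

```python
# Toy relative-frequency dictionaries, invented for illustration only
f_toy = {"the": 0.05, "hour": 0.0001, "lantern": 0.00001}
m_toy = {"the": 0.06, "hour": 0.0001, "harbor": 0.00002}

# Dictionary key views support set operations directly
only_f = f_toy.keys() - m_toy.keys()   # words used only by women
only_m = m_toy.keys() - f_toy.keys()   # words used only by men
shared = f_toy.keys() & m_toy.keys()   # words used by both

print(sorted(only_f), sorted(only_m), sorted(shared))
```

Words that occur only once in the whole corpus will dominate both "only" sets, so a minimum-count filter is probably worth considering before reading much into them.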

In [None]:
f_relative = relafreq(f_talks)
m_relative = relafreq(m_talks)

In [None]:
# Just a way to check against rank
import random
random_num = random.randint(1,500)
print(random_num, f_relative[random_num], m_relative[random_num])

## The Other Talks

Out of curiosity, how many talks are given by more than one speaker, or are otherwise not labeled with a single gender?

In [None]:
# How many talks does that leave:

len(texts) - (len(f_talks) + len(m_talks))

In [None]:
o_talks = TEDtalks[(TEDtalks.gender != 'male')&(TEDtalks.gender != 'female')]

# This will show you all 25 rows:
o_talks.head(25)

## Comparing M/F Word Usage

I'm going to start by converting the list of tuples to a dictionary, both because I think matching keys is going to be easier (at least based on my limited coding ability) and because, according to what I read, it appears to be faster. Since we don't really need a ranked listing for this work, it would probably be wise to rewrite the `relafreq` function so that it produces a dictionary. There is no reason to go back and forth like this.

In [None]:
f_rf = dict(f_relative)

m_rf = dict(m_relative)
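A sketch of what a dictionary-returning version might look like. The name `relafreq_dict` is hypothetical, and this version swaps NLTK's `FreqDist` and `RegexpTokenizer` for the standard library's `Counter` and `re.findall`, which should behave much the same for a `\w+` pattern but is not the function used elsewhere in this notebook.

```python
import re
from collections import Counter

def relafreq_dict(list_of_texts):
    # Join, lowercase, and tokenize on runs of word characters
    words = re.findall(r"\w+", " ".join(list_of_texts).lower())
    # Count each distinct word
    counts = Counter(words)
    # Total number of tokens, the denominator for relative frequency
    total_words = sum(counts.values())
    return {word: count / total_words for word, count in counts.items()}

# Toy texts, invented for illustration only
rf = relafreq_dict(["The cat sat", "the dog sat down"])
print(rf["the"])
```

A ranked list is still one call away when needed: `sorted(rf.items(), key=lambda item: item[1], reverse=True)`.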

Some notes on how to compare -- as I write this I am trying to find a way to limit my for loop through the two dictionaries only to N results just so I can see if it's working. 

[Python - Return first N key:value pairs from dict](https://stackoverflow.com/questions/7971618/python-return-first-n-keyvalue-pairs-from-dict)
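One approach along the lines of that answer is `itertools.islice`, which stops a loop after N items without building the full list first. The toy dictionaries and the value of `N` below are invented for illustration.

```python
from itertools import islice

# Toy dictionaries, invented for illustration only
f_toy = {"hour": 8.6e-05, "debased": 7.8e-07, "perimeter": 1.6e-06, "lantern": 2.0e-06}
m_toy = {"hour": 1.2e-04, "debased": 2.9e-07, "perimeter": 3.5e-06}

N = 2
# Generator of keys found in both dictionaries; nothing is computed yet
shared = (key for key in f_toy if key in m_toy)
# Take only the first N matches
first_n = list(islice(shared, N))

for key in first_n:
    print(key, f_toy[key], m_toy[key])
```

Because the generator is lazy, the comparison stops as soon as N matches are found, which avoids having to interrupt the cell by hand.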


In [None]:
# a way to match words in the f/m dictionaries

for key in f_rf:
    if key in m_rf:
        print(key, f_rf[key], m_rf[key])

What I decided to do was run the cell and then stop it. The results from above look like this:

    hour 8.613999628814925e-05 0.00012193460211598934
    debased 7.830908753468114e-07 2.924091177841471e-07
    deteriorate 3.1323635013872456e-06 1.7544547067048825e-06
    perimeter 1.5661817506936228e-06 3.508909413409765e-06

What we need now is to compare one list against the other for differences in usage. I am starting with "twice as often" to see what that turns up. Note to self: I need to re-read the literature to see what comparison thresholds have been used.

**2018-03-26**: I decided that the easiest way to approach this is to create a dictionary comprehension that divides the female relative frequency by the male. We can then look at values > 1 for words preferred by women and values < 1 for words preferred by men.

In [None]:
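# We are just going to do math:

# Ratio of female to male relative frequency, for words both groups use
weighted = {key: f_rf[key] / m_rf[key] for key in f_rf if key in m_rf}

# Print the words that women use relatively more often than men
for key in f_rf:
    if key in m_rf and f_rf[key] > m_rf[key]:
        print(key, f_rf[key], m_rf[key])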
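To apply the "twice as often" threshold mentioned earlier, the ratios can be filtered in both directions. This is only a sketch: the toy dictionaries are invented, and `THRESHOLD = 2` is the tentative starting value from the notes above, not a figure drawn from the literature.

```python
# Toy relative-frequency dictionaries, invented for illustration only
f_toy = {"hour": 0.0001, "garden": 0.0004, "engine": 0.0001, "lantern": 0.0002}
m_toy = {"hour": 0.0001, "garden": 0.0001, "engine": 0.0003}

# Female-to-male ratio for words that both groups use
ratio = {w: f_toy[w] / m_toy[w] for w in f_toy if w in m_toy}

THRESHOLD = 2
# Words women use at least THRESHOLD times as often as men
more_f = {w: r for w, r in ratio.items() if r >= THRESHOLD}
# Words men use at least THRESHOLD times as often as women
more_m = {w: r for w, r in ratio.items() if r <= 1 / THRESHOLD}

print(sorted(more_f), sorted(more_m))
```

Words that appear in only one group never enter `ratio` here, so they would need to be handled separately.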

In [None]:
# Create a dataframe from a list of dictionaries
comp = pandas.DataFrame([f_rf, m_rf]).T

# Rename the columns to something human readable
comp.columns = ['women', 'men']

# Check our results
comp.head()

In [None]:
# I'm doing the division manually below instead of using the built-in functionality: 
# DataFrame.divide(other, axis='columns', level=None, fill_value=None)
ratios = comp.assign(ratio = comp.women / comp.men)

In [None]:
ratios.head()

In [None]:
ratios.index.rename('word', inplace=True)

In [None]:
ratios.head()

In [None]:
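import matplotlib.pyplot as plt

# 'word' is the index rather than a column, so plot each ratio against its row position
ratios.assign(position=range(len(ratios))).plot.scatter(x='position', y='ratio')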

In [None]:
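# NOTE: 4493178 is applied to both columns here. To recover raw counts, each
# relative frequency would need to be multiplied by its own group's word total.
occurs = ratios.assign(woccurs=comp.women * 4493178, moccurs=comp.men * 4493178)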

In [None]:
# And now to sort:

occurs.sort_values(by='ratio', ascending=False)