# Gendered Vocabulary

In this notebook we want to examine whether male and female speakers prefer certain words. We will not begin by filtering out stop words, both because they account for a relatively small share of the distinct words in the corpus and because there may be something interesting there.

To start, we will need to:

* import the csv
* grab all the texts and calculate word frequencies
* then grab all the texts by gender and calculate word frequencies
* compare frequencies
* BONUS: create a useful visualization

Double word-score bonus might be to look at trends: all and gendered. (I'm not sure what, if anything, that might reveal.)

Okay, onto the code.

Initial work: 2018-03-22.

In [1]:
# get the data

# Let python create the column names list:
with open('../data/talks_6d.csv') as f:
    colnames = f.readline().strip().split(",")
print(colnames)

# Now import the csv as a dataframe.
# NOTE: with `names=` and no `header=0`, the header line is also read in as data (row 0 below).
import pandas
TEDtalks = pandas.read_csv('../data/talks_6d.csv', names=colnames)

TEDtalks.head()

['', 'citation', 'author', 'gender', 'title', 'date', 'length', 'text', 'occupation', 'numDate']


Unnamed: 0,Unnamed: 1,citation,author,gender,title,date,length,text,occupation,numDate
0,,citation,author,gender,title,date,length,text,occupation,numDate
1,1.0,Al Gore 2006,Al Gore,male,Averting the climate crisis,Jun 2006,957,Thank you so much Chris. And it's truly a gre...,Climate advocate,200606
2,2.0,David Pogue 2006,David Pogue,male,Simplicity sells,Jun 2006,1271,Hello voice mail my old friend. I've called f...,Technology columnist,200606
3,3.0,Cameron Sinclair 2006,Cameron Sinclair,male,My wish: A call for open-source architecture,Jul 2006,1398,I'm going to take you on a journey very quickl...,"Co-founder, Architecture for Humanity",200607
4,4.0,Sergey Brin + Larry Page 2007,Sergey Brin + Larry Page,male,The genesis of Google,May 2007,1205,Sergey Brin I want to discuss a question I kn...,,200705


## All Talks

I'm importing all of `nltk` below because I'm not sure which parts of the library, if any, might be useful here. Otherwise I would simply use `from nltk.tokenize import WhitespaceTokenizer`.

In [3]:
import nltk

# Create a list of just the texts
texts = TEDtalks.text.tolist()

# Mash all the talks together & then tokenize
alltexts = " ".join(texts).lower()
tokens = nltk.tokenize.WhitespaceTokenizer().tokenize(alltexts)

# Remove the name of the column which is the first item in the list:
tokens.pop(0)

print(len(tokens), alltexts[0:200])

4373823 text thank you so much  chris. and it's truly a great honor to have the opportunity to come to this stage twice  i'm extremely grateful. i have been blown away by this conference  and i want to thank 


In the past, I've simply used dictionaries to count word frequencies (see `Tt-02a-words`), but here we want to use already available functionality, and we want more than raw word frequencies: we want to normalize them as a percentage of the overall corpus so that we can distinguish words that are relatively more frequent in one collection or the other.

**TBH (2018-03-22)**: About the only thing that `nltk.FreqDist`, from what I can tell, does is deliver a containerized dictionary. I'll keep the code as is for now, in the belief that sticking with the NLTK is good for interoperability (or something).

In [None]:
# Below it has to be `fd.update([word])` and not `fd.update(word)`:
# the latter counts each letter of the word separately.

fd = nltk.FreqDist()
for sentence in nltk.sent_tokenize(alltexts):
    for word in nltk.word_tokenize(sentence):
        fd.update([word])

fd.most_common(10)
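
The comment in the cell above is worth a quick illustration. `nltk.FreqDist` subclasses `collections.Counter`, so this minimal sketch uses a plain `Counter` as a stand-in (an assumption made only to keep the example free of NLTK) to show what `update` does with a bare string versus a one-item list:

```python
from collections import Counter

# Counter stands in for nltk.FreqDist here (FreqDist inherits from Counter)
by_letter = Counter()
by_letter.update("talk")      # a string is iterated character by character

by_word = Counter()
by_word.update(["talk"])      # a one-item list counts the whole word once

print(by_letter)
print(by_word)
```

Passing the whole token list in one go, e.g. `Counter(tokens)`, should give the same counts as updating one word at a time.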

In [None]:
print(fd)

This means there are 54,269 distinct words (samples) with a total usage (outcomes) of 4,746,253. The gap between that total and the raw token count above of 4,373,823 is not explained by adding back in the 251,860 periods. If you subtract the two combined, you are still left with a difference of `4,746,253 - (4,373,823 + 251,860) = 120,570`.

An NLTK `FreqDist` maps each word to its count, and `most_common()` returns those pairs as a list of tuples, e.g. `('and', 110130)`. I need to iterate through these 54,269 pairs and normalize:

    total_words = sum(fd.values())
    percentages = []
    for word, count in fd.items():
        percentages.append((word, count / total_words))

Even my pseudo-code is kind of ugly, I'm afraid.

In [None]:
# To double-check that the "outcomes" listed above is also the
# total number of words: FreqDist is a Python Counter and
# inherits its methods:

total_words = sum(fd.values()) 
print(total_words)

In [None]:
# Now to calculate relative frequencies

freq_dist = dict(fd)

# MODEL: d2 = dict((k, f(v)) for k, v in d1.items())
rel_freq = {k: v/total_words for k, v in freq_dist.items()}

In [2]:
# It dawned on me we could have a function that does everything we need:

import nltk
import operator

def RelaFreq (list_of_texts):
    # Take the list and turn it into one long string
    all_texts = " ".join(list_of_texts).lower()
    # Invoke the NLTK god
    freqdist = nltk.FreqDist()
    # We're getting sentence data here, but I'm not sure it's needed
    # and I don't know how much it might be slowing the process.
    for sentence in nltk.sent_tokenize(all_texts):
        for word in nltk.word_tokenize(sentence):
            freqdist.update([word])
    # Get the total number of words so we can establish relative frequencies
    total_words = sum(freqdist.values())
    # Convert the FreqDist container to a dictionary
    freqdist_dict = dict(freqdist)
    # Create a new dictionary of relative frequencies
    relative_frequency = {k: v/total_words for k, v in freqdist_dict.items()}
    # Convert the dictionary to a rankable list of tuples
    # & rank it with the most frequent word first
    ranked = sorted(relative_frequency.items(), key=operator.itemgetter(1), reverse = True)
    # return the ranked list of tuples
    return(ranked)

**2018-03-27**: The code above was the first function I wrote to generate a list of words and then rank them. However, the NLTK tokenizer keeps apostrophes at the beginning and end of words and splits contractions on the apostrophe, so the function below was written using the `string` module to see if we can get better results.

The first version of the new function below results in a list of 56,043 words, some of which are apparent oddities. The original function produced a list of 54,269 words.

What if we filter using the NLTK sentence tokenizer? It returns a list of **more** words than the original function: 56,039.

New chain: **sentence tokenizer > string methods > word tokenizer**. This is, no doubt, very inefficient, but we just need a clean list for now.

    justwords = [word.strip(string.punctuation) for word in all_texts.split(" ")]
    for word in justwords:
        freqdist.update([word])
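
To see why the choice of tokenizer changes the size of the word list, here is a small sketch comparing the `string.punctuation` approach above with a `\w+` regular expression like the one used in the next cell. The sample sentence is made up for illustration, and `re.findall` is used as a stand-in for a regex-based tokenizer so the example runs without NLTK:

```python
import re
import string

sample = "don't 'quote' it's well-known."   # made-up example text

# Approach 1: split on spaces, then strip punctuation from both ends
stripped = [word.strip(string.punctuation) for word in sample.split(" ")]

# Approach 2: keep only runs of word characters
regex_words = re.findall(r"\w+", sample)

print(stripped)
print(regex_words)
```

The regex version still splits contractions on the apostrophe, so fragments such as `t` and `s` end up in the word list. Note too that `split(" ")` yields empty strings wherever two spaces occur in a row, which may account for some of the oddities.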

In [49]:
import nltk
import operator
import string
from nltk.tokenize import RegexpTokenizer

def relafreq (list_of_texts):
    # Take the list and turn it into one long string
    all_texts = " ".join(list_of_texts).lower()
    # Invoke the NLTK god
    freqdist = nltk.FreqDist()
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(all_texts)
    for word in words:
        freqdist.update([word])
    # Get the total number of words so we can establish relative frequencies
    total_words = sum(freqdist.values())
    # Convert the FreqDist container to a dictionary
    freqdist_dict = dict(freqdist)
    # Create a new dictionary of relative frequencies
    relative_frequency = {k: v/total_words for k, v in freqdist_dict.items()}
    # Convert the dictionary to a rankable list of tuples
    # & rank it with the most frequent word first
    ranked = sorted(relative_frequency.items(), key=operator.itemgetter(1), reverse = True)
    # return the ranked list of tuples
    return(ranked)

In [50]:
all_talks = relafreq(texts)
all_talks[0:10]

[('the', 0.04672706044585814),
 ('and', 0.03357979585941176),
 ('to', 0.02819363043262475),
 ('of', 0.025808681516734926),
 ('a', 0.023719291779671314),
 ('that', 0.021429153263013395),
 ('i', 0.01863113368755923),
 ('in', 0.017549716481296756),
 ('it', 0.016799913112723334),
 ('you', 0.015928369630582185)]

In [51]:
print(len(all_talks))

53705


## Total Words

This next bit of code just gives us the total token count for the whole corpus: a number by which to multiply the very small relative frequencies in order to get back a word's raw count.

In [68]:
alltexts = " ".join(texts).lower()
freqdist = nltk.FreqDist()
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(alltexts)
for word in words:
    freqdist.update([word])
# Get the total number of words so we can establish relative frequencies
total_words = sum(freqdist.values())

In [71]:
print(total_words)

4493178
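
A caveat on multiplying back, sketched with toy data: a relative frequency only converts back to an exact count when it is multiplied by the total of the same collection it was computed from. That matters once relative frequencies are computed separately for subsets of the talks. The two tiny word lists below are invented for illustration, and `Counter` stands in for `nltk.FreqDist`:

```python
from collections import Counter

# Toy sub-corpora (hypothetical data, not from the TED talks)
group_a = "we we can can do this".split()
group_b = "we do this".split()

counts_a = Counter(group_a)
total_a = sum(counts_a.values())        # 6 tokens in group A
total_all = total_a + len(group_b)      # 9 tokens in both groups together

rel_we = counts_a["we"] / total_a       # relative frequency within group A

# Multiplying by the group's own total recovers the count
recovered = round(rel_we * total_a)
# Multiplying by the combined total overstates it
overstated = rel_we * total_all

print(recovered, overstated)
```

So a corpus-wide total is best treated as a rough scaling factor for per-group frequencies; exact counts need each group's own total.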


## Gendered Talks

The next step is to create two additional collections filtered by the `gender` column of the dataframe. The idea is to filter the dataframe first and then pull the `text` column out as a list. Originally, I had this as one line:

    f_talks = TEDtalks[TEDtalks.gender == 'female'].text.tolist()

But that produced a string and not a list object. I don't know why the one line would not work.

**2018-03-22**: Now it's working. Did I have `texts.tolist()` before? I'm not clear what happened.

In [5]:
# Filter by gender

m_talks = TEDtalks[TEDtalks.gender == 'male'].text.tolist()
f_talks = TEDtalks[TEDtalks.gender == 'female'].text.tolist()

# A quick check of numbers:

print("Of the {} TED talks given, {} were given by women and {} by men."
      .format(len(texts), len(f_talks), len(m_talks)))

Of the 2069 TED talks given, 607 were given by women and 1437 by men.


TODO: One thing to do here is to compare the talks to see which words are used **only** by women or **only** by men.
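
One possible way to approach this TODO, assuming the relative frequencies end up in dictionaries keyed by word: dictionary key views support set operations, so the exclusive vocabularies fall out of a set difference. The two small dictionaries here are placeholders with made-up words and values:

```python
# Placeholder dictionaries (invented words and values) standing in for
# per-gender word -> relative frequency mappings
f_demo = {"alpha": 0.046, "beta": 0.00009, "gamma": 0.00002}
m_demo = {"alpha": 0.047, "beta": 0.00012, "delta": 0.000004}

only_f = f_demo.keys() - m_demo.keys()   # words in the first dict only
only_m = m_demo.keys() - f_demo.keys()   # words in the second dict only
shared = f_demo.keys() & m_demo.keys()   # words in both

print(sorted(only_f), sorted(only_m), sorted(shared))
```

The results are plain sets, so they can be sorted, counted with `len()`, or used to filter either dictionary.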

In [52]:
f_relative = relafreq(f_talks)
m_relative = relafreq(m_talks)

In [53]:
# Just a way to check against rank
import random
random_num = random.randint(1,500)
print(random_num, f_relative[random_num], m_relative[random_num])

19 ('but', 0.00630225218420012) ('have', 0.0061547439624038865)


## The Other Talks

Out of curiosity, how many talks are given by more than one speaker, or by a speaker whose `gender` value is neither `male` nor `female`?

In [None]:
# How many talks does that leave:

len(texts) - (len(f_talks) + len(m_talks))

In [None]:
o_talks = TEDtalks[(TEDtalks.gender != 'male')&(TEDtalks.gender != 'female')]

# This will show you all 25 rows:
o_talks.head(25)

## Comparing M/F Word Usage

I'm going to start by converting the list of tuples to a dictionary, both because I think matching keys is going to be easier (at least based on my limited coding ability) and because, according to what I read, it appears to be faster. Since we don't really need a ranked listing for this work, it would probably be wise to rewrite the `relafreq` function so that it produces a dictionary. No reason to go back and forth like this.
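
A minimal sketch of what a dictionary-returning version of the function might look like. The name `relafreq_dict` is hypothetical, and `collections.Counter` with `re.findall` are used as stand-ins for `nltk.FreqDist` and `RegexpTokenizer(r'\w+')` so that the example runs without NLTK:

```python
import re
from collections import Counter

def relafreq_dict(list_of_texts):
    """Return a {word: relative frequency} dictionary (hypothetical helper)."""
    # Take the list and turn it into one long lowercase string
    all_texts = " ".join(list_of_texts).lower()
    # Count runs of word characters
    counts = Counter(re.findall(r"\w+", all_texts))
    # Total number of tokens, used to normalize the counts
    total_words = sum(counts.values())
    return {word: count / total_words for word, count in counts.items()}

demo = relafreq_dict(["The cat sat.", "The dog sat down."])
print(demo["the"], demo["cat"])
```

A ranked list can still be had when needed with `sorted(demo.items(), key=lambda kv: kv[1], reverse=True)`.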

In [54]:
f_rf = dict(f_relative)

m_rf = dict(m_relative)

Some notes on how to compare: as I write this, I am trying to find a way to limit my for loop through the two dictionaries to the first N results, just so I can see if it's working.

[Python - Return first N key:value pairs from dict](https://stackoverflow.com/questions/7971618/python-return-first-n-keyvalue-pairs-from-dict)
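
One way to cap the loop, along the lines of the linked answer, is `itertools.islice`, which stops after N items without building the full list first. The dictionaries below are small placeholders with invented values:

```python
from itertools import islice

# Placeholder dictionaries (invented values)
f_demo = {"hour": 0.00009, "debased": 0.0000008, "perimeter": 0.0000016, "zebra": 0.000001}
m_demo = {"hour": 0.00012, "debased": 0.0000003, "perimeter": 0.0000035}

# Lazily pick out the words present in both dictionaries...
shared = (key for key in f_demo if key in m_demo)

# ...and look at only the first N of them
N = 2
first_n = [(key, f_demo[key], m_demo[key]) for key in islice(shared, N)]
print(first_n)
```

Because the generator is lazy, the loop touches only as many keys as it needs to find N matches.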


In [None]:
# a way to match words in the f/m dictionaries

for key in f_rf:
    if key in m_rf:
        print(key, f_rf[key], m_rf[key])

What I decided to do was run the cell and then stop it. The results from above look like this:

    hour 8.613999628814925e-05 0.00012193460211598934
    debased 7.830908753468114e-07 2.924091177841471e-07
    deteriorate 3.1323635013872456e-06 1.7544547067048825e-06
    perimeter 1.5661817506936228e-06 3.508909413409765e-06

What we need now is to compare one list against the other for differences in usage. I am starting with a threshold of twice as often to see what that turns up. TODO: I need to re-read the literature here to see what comparison thresholds have been used.

**2018-03-26**: I decided that the easiest way to approach this is to create a dictionary comprehension that divides the female relative frequency by the male. We can then look at numbers >1 for the words preferred by women and numbers < 1 for words preferred by men.

In [None]:
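# We are just going to do math: divide the female relative frequency
# by the male one, for the words that appear in both dictionaries
weighted = {key: f_rf[key] / m_rf[key] for key in f_rf if key in m_rf}

# Words relatively more frequent in talks by women have a ratio above 1
for key, ratio in weighted.items():
    if ratio > 1:
        print(key, f_rf[key], m_rf[key])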

In [59]:
# Create a dataframe from a list of dictionaries
comp = pandas.DataFrame([f_rf, m_rf]).T

# Rename the columns to something human readable
comp.columns = ['women', 'men']

# Check our results
comp.head()

Unnamed: 0,women,men
a,0.022903,0.02403695
aa,3e-06,2.472778e-06
aaa,2e-06,1.854583e-06
aaaa,,3.090972e-07
aaaaa,2e-06,


In [87]:
# I'm doing the division manually below instead of using the built-in functionality:
# DataFrame.divide(other, axis='columns', level=None, fill_value=None)
ratios = comp.assign(ratio = comp.women / comp.men)

In [88]:
ratios.head()

Unnamed: 0,women,men,ratio
a,0.022903,0.02403695,0.952835
aa,3e-06,2.472778e-06,1.335422
aaa,2e-06,1.854583e-06,0.890281
aaaa,,3.090972e-07,
aaaaa,2e-06,,


In [90]:
ratios.index.rename('word', inplace=True)

In [91]:
ratios.head()

word,women,men,ratio
a,0.022903,0.02403695,0.952835
aa,3e-06,2.472778e-06,1.335422
aaa,2e-06,1.854583e-06,0.890281
aaaa,,3.090972e-07,
aaaaa,2e-06,,


In [93]:
import matplotlib.pyplot as plt

ratios.plot.scatter(x='word', y='ratio')

KeyError: 'word'
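
The `KeyError` most likely arises because `word` is the name of the index rather than a column, and `plot.scatter` looks `x` up among the columns. A hedged sketch of one workaround, using a tiny made-up frame in place of `ratios`: move the index into a column with `reset_index()` and add a numeric position to plot against. The `position` column name is an assumption for illustration.

```python
import pandas

# Tiny stand-in for `ratios` (values invented), with the index named 'word'
demo = pandas.DataFrame(
    {"women": [0.02, 0.00003], "men": [0.03, 0.00001]},
    index=pandas.Index(["a", "aa"], name="word"),
)
demo = demo.assign(ratio=demo.women / demo.men)

print("word" in demo.columns)        # the index name is not a column

# Move the index into a regular column and add a numeric x value
flat = demo.reset_index()
flat = flat.assign(position=range(len(flat)))

print("word" in flat.columns)
```

With a numeric column in place, a call such as `flat.plot.scatter(x='position', y='ratio')` should have what it needs, though with tens of thousands of words the plot will probably need filtering to be readable.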

In [77]:
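# NOTE: each column was normalized by its own group's total, so scaling by the corpus-wide total gives estimates, not exact counts
occurs = ratios.assign(woccurs=comp.women * total_words, moccurs=comp.men * total_words)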

In [78]:
# And now to sort:

occurs.sort_values(by='ratio', ascending=False)

Unnamed: 0,women,men,ratio,moccurs,woccurs
pms,4.045197e-05,3.090972e-07,130.871354,1.388829,181.757912
vagina,5.943963e-05,6.181945e-07,96.150382,2.777658,267.072851
tapirs,2.146431e-05,3.090972e-07,69.441943,1.388829,96.442974
sw,3.797532e-05,6.181945e-07,61.429411,2.777658,170.629877
mantis,1.568546e-05,3.090972e-07,50.746035,1.388829,70.477558
jf,2.971982e-05,6.181945e-07,48.075191,2.777658,133.536425
glamour,4.457972e-05,9.272917e-07,48.075191,4.166487,200.304638
feminism,2.889427e-05,6.181945e-07,46.739769,2.777658,129.827080
brigades,1.403436e-05,3.090972e-07,45.404347,1.388829,63.058868
replicator,1.320881e-05,3.090972e-07,42.733503,1.388829,59.349522
