# Gendered Vocabulary

In this notebook we examine whether male and female speakers prefer certain words. We will not start by filtering out stop words, both because they account for a relatively small percentage of words and because there may be something interesting there.

To start, we will need to:

* import the csv
* grab all the texts and calculate word frequencies
* then grab all the texts by gender and calculate word frequencies
* compare frequencies
* BONUS: create a useful visualization
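
As a rough sketch of the comparison step, assuming two toy word lists in place of the real talks (the names `female_tokens`, `male_tokens`, and `relative_frequencies` are made up for illustration), one could compute each word's share of its group's tokens and then look at the difference between the two groups:

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy data standing in for the tokenized talks (not from the corpus).
female_tokens = "we think we know what we love".split()
male_tokens = "i think i know what i build".split()

f_rel = relative_frequencies(female_tokens)
m_rel = relative_frequencies(male_tokens)

# Positive values lean female, negative values lean male, in this toy sample.
vocabulary = set(f_rel) | set(m_rel)
difference = {w: f_rel.get(w, 0) - m_rel.get(w, 0) for w in vocabulary}

for word, diff in sorted(difference.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:>6} {diff:+.3f}")
```

A plain difference is only one possible measure; a ratio or a log-ratio would be alternatives, but words missing from one group then need special handling.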

Double word-score bonus might be to look at trends: all and gendered. (I'm not sure what, if anything, that might reveal.)

Okay, onto the code.

In [5]:
# get the data

# Let python create the column names list:
with open('../data/talks_6d.csv') as f:
    colnames = f.readline().strip().split(",")
print(colnames)

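# Now import the CSV as a dataframe. Because names= is given, pandas
# treats the first line of the file as data, so the header reappears as row 0.
import pandas
TEDtalks = pandas.read_csv('../data/talks_6d.csv', names=colnames)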

TEDtalks.head()

['', 'citation', 'author', 'gender', 'title', 'date', 'length', 'text', 'occupation', 'numDate']


Unnamed: 0,Unnamed: 1,citation,author,gender,title,date,length,text,occupation,numDate
0,,citation,author,gender,title,date,length,text,occupation,numDate
1,1.0,Al Gore 2006,Al Gore,male,Averting the climate crisis,Jun 2006,957,Thank you so much Chris. And it's truly a gre...,Climate advocate,200606
2,2.0,David Pogue 2006,David Pogue,male,Simplicity sells,Jun 2006,1271,Hello voice mail my old friend. I've called f...,Technology columnist,200606
3,3.0,Cameron Sinclair 2006,Cameron Sinclair,male,My wish: A call for open-source architecture,Jul 2006,1398,I'm going to take you on a journey very quickl...,"Co-founder, Architecture for Humanity",200607
4,4.0,Sergey Brin + Larry Page 2007,Sergey Brin + Larry Page,male,The genesis of Google,May 2007,1205,Sergey Brin I want to discuss a question I kn...,,200705
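
The header shows up again as row 0 because `names=colnames` tells pandas that the file has no header line, so the first line is read as data. A minimal sketch of one alternative, assuming a small in-memory CSV in place of `talks_6d.csv` (the sample rows are invented): let `read_csv` take the column names from the first line and use the unnamed first column as the index.

```python
import io
import pandas

# A tiny stand-in for the real file; the leading comma mimics its unnamed index column.
sample_csv = io.StringIO(
    ",author,gender,text\n"
    "1,Speaker A,male,Thank you so much\n"
    "2,Speaker B,female,What you just heard\n"
)

# With no names= argument, the first line becomes the header,
# and index_col=0 turns the unnamed first column into the row index.
talks = pandas.read_csv(sample_csv, index_col=0)

print(talks.columns.tolist())
print(len(talks))
```

Read this way, the word `text` would not end up among the tokens, so a later `tokens.pop(0)` would no longer be needed.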


## All Talks

I'm importing all of `nltk` below because I'm not sure which parts of the library, if any, might be useful here. Otherwise I would simply use `from nltk.tokenize import WhitespaceTokenizer`.

In [6]:
import nltk

# Create a list of just the texts
texts = TEDtalks.text.tolist()

# Mash all the talks together & then tokenize
alltexts = " ".join(texts).lower()
tokens = nltk.tokenize.WhitespaceTokenizer().tokenize(alltexts)

# Remove the name of the column which is the first item in the list:
tokens.pop(0)

'text'

In [7]:
len(tokens)

4373823

In [8]:
print(alltexts[0:200])

text thank you so much  chris. and it's truly a great honor to have the opportunity to come to this stage twice  i'm extremely grateful. i have been blown away by this conference  and i want to thank 


In the past, I've simply used dictionaries to count word frequencies (see `Tt-02a-words`), but here we want to use functionality that is already available. We also want more than raw word frequencies: we want to normalize them as a percentage of the overall corpus so that we can identify words that are more frequent in one group than in the other.

In [33]:
fd = nltk.FreqDist()
for sentence in nltk.sent_tokenize(alltexts):
    for word in nltk.word_tokenize(sentence):
        fd.update([word])

fd.most_common(10)

[('.', 251860),
 ('the', 209937),
 ('and', 150879),
 ('to', 126676),
 ('of', 115963),
 ('a', 106332),
 ('that', 96280),
 ('i', 83494),
 ('in', 78852),
 ('it', 75480)]

In the code above, it has to be `fd.update([word])` and not `fd.update(word)`: given a bare string, `update` iterates over it and counts its individual letters instead of the word.
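
A small illustration of the difference, using `collections.Counter` (which `FreqDist` subclasses) so that it runs without NLTK:

```python
from collections import Counter

as_list = Counter()
as_list.update(["thank"])   # one sample: the whole word

as_string = Counter()
as_string.update("thank")   # five samples: one per character

print(as_list)
print(as_string)
```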

In [10]:
print(fd)

<FreqDist with 54269 samples and 4746253 outcomes>


This means there are 54,269 distinct tokens (words and punctuation marks) with a total of 4,746,253 occurrences.

The gap between this total and the raw whitespace-token count of 4,373,823 above is not explained by the 251,860 periods alone. Even after subtracting the periods from the `FreqDist` total, the whitespace count still falls short by 120,570:

In [12]:
len(tokens) - (4746253 - 251860)

-120570
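
The leftover 120,570 most likely comes from how the two tokenizers treat punctuation and contractions: splitting on whitespace keeps `it's` and `chris.` as single tokens, while `word_tokenize` separates punctuation marks and clitics into tokens of their own. The sketch below imitates that effect with a regular expression; the pattern is only a rough stand-in and is not what NLTK does internally.

```python
import re

sample = "and it's truly a great honor. i'm extremely grateful."

# Whitespace splitting leaves punctuation attached to words.
whitespace_tokens = sample.split()

# Rough imitation: split words, clitics such as 's / 'm, and punctuation apart.
pattern = r"'\w+|\w+|[^\w\s]"
split_tokens = re.findall(pattern, sample)

print(len(whitespace_tokens), whitespace_tokens)
print(len(split_tokens), split_tokens)
```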

An NLTK `FreqDist` behaves like a dictionary that maps each word to its count; its `most_common()` method returns those pairs as a list of tuples, e.g. `('and', 150879)`. I need to iterate through these 54,269 entries and normalize:

    percentages = []
    total_words = sum(fd.values())
    for word, count in fd.items():
        # Share of all tokens accounted for by this word
        percentage = count / total_words
        percentages.append((word, percentage))

It's a little clunky, I'm afraid.

In [34]:
# To doublecheck the "outcomes" listed above is also the 
# total number of words: FreqDist is a Python counter and 
# inherits those methods:

total_words = sum(fd.values()) 
print(total_words)

4746253


In [35]:
# Now to calculate relative frequencies

freq_dist = dict(fd)

# MODEL: d2 = dict((k, f(v)) for k, v in d1.items())
rel_freq = {k: v/total_words for k, v in freq_dist.items()}

In [36]:
# Sort by value 

import operator

rf = sorted(rel_freq.items(), key=operator.itemgetter(1), reverse=True)
rf[0:10]

[('.', 0.05306501781510594),
 ('the', 0.04423215534443697),
 ('and', 0.031789076562079605),
 ('to', 0.026689685526667038),
 ('of', 0.024432536571480704),
 ('a', 0.022403356921765444),
 ('that', 0.020285475721585008),
 ('i', 0.017591561174678215),
 ('in', 0.016613526501853146),
 ('it', 0.01590307132805605)]

## Gendered Talks

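The next step is to create two additional collections filtered by the `gender` column of the dataframe. What happens in the first line below is that we filter the dataframe down to the rows whose `gender` is `'male'`, select the `text` column, and convert it to a list. The second line does the same for `'female'`.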

In [17]:
m_talks = TEDtalks[TEDtalks.gender == 'male'].text.tolist()

In [18]:
f_talks = TEDtalks[TEDtalks.gender == 'female'].text.tolist()

In [21]:
# A quick check of numbers:

print("Of the {} TED talks given, {} were given by women and {} by men.".format(
    len(texts), len(f_talks), len(m_talks)))

Of the 2069 TED talks given, 607 were given by women and 1437 by men.


In [32]:
f_alltalks[0:20]

'what you just heard '

In [39]:
import operator

f_fd = nltk.FreqDist()
f_alltalks = " ".join(f_talks).lower()
for sentence in nltk.sent_tokenize(f_alltalks):
    for word in nltk.word_tokenize(sentence):
        f_fd.update([word])
        
print(f_fd)

<FreqDist with 31063 samples and 1276991 outcomes>


In [37]:
f_totals = sum(f_fd.values()) 
print(f_totals)

f_dist = dict(f_fd)
f_rf = {k: v/f_totals for k, v in f_dist.items()}

f_rf_sorted = sorted(f_rf.items(), key=operator.itemgetter(1))

In [38]:
# f_rf.reverse()
f_rf[0:10]

TypeError: unhashable type: 'slice'
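
The `TypeError` arises because `f_rf` is a dictionary, and dictionaries cannot be sliced; the sorted list of pairs is `f_rf_sorted`. A minimal sketch of the intended step, with invented numbers in place of the real frequencies:

```python
import operator

# Invented relative frequencies standing in for f_rf.
f_rf = {"the": 0.04, "and": 0.03, "we": 0.01, ".": 0.05}

# sorted() returns a list of (word, frequency) pairs, which can be sliced.
f_rf_sorted = sorted(f_rf.items(), key=operator.itemgetter(1), reverse=True)

print(f_rf_sorted[0:3])
```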

In [None]:
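m_fd = nltk.FreqDist()
# sent_tokenize expects a single string, so join the talks first
m_alltalks = " ".join(m_talks).lower()
for sentence in nltk.sent_tokenize(m_alltalks):
    for word in nltk.word_tokenize(sentence):
        m_fd.update([word])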
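
Since the same counting steps are repeated for each group, one option is to wrap them in a function so that each group gets its own distribution. This is only a sketch: `count_words` is a hypothetical helper, and `str.split` stands in for the NLTK tokenizers so that the example runs on its own.

```python
from collections import Counter

def count_words(talks, tokenize=str.split):
    """Join a list of talks, lowercase them, and count the resulting tokens."""
    joined = " ".join(talks).lower()
    return Counter(tokenize(joined))

# Toy talks, not taken from the corpus.
m_talks = ["Hello voice mail", "Hello old friend"]
f_talks = ["What you just heard"]

m_counts = count_words(m_talks)
f_counts = count_words(f_talks)

print(m_counts.most_common(1))
print(sum(f_counts.values()))
```

With NLTK available, passing `tokenize=nltk.word_tokenize` should give counts closer to those reported in this notebook.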

## The Other Talks

In [22]:
# How many talks does that leave:

len(texts) - (len(f_talks) + len(m_talks))

25

In [26]:
o_talks = TEDtalks[(TEDtalks.gender != 'male')&(TEDtalks.gender != 'female')]

# This will show you all 25 rows:
o_talks.head(25)
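
To see what the remaining rows contain without paging through them, one possibility is to tabulate the `gender` column, missing values included. A sketch on an invented frame (the values below, including `'group'`, are hypothetical labels and not the dataset's):

```python
import pandas

# Invented stand-in for the TEDtalks dataframe.
talks = pandas.DataFrame(
    {"gender": ["male", "female", None, "male", "group"]}
)

# dropna=False keeps missing values as their own category.
gender_counts = talks.gender.value_counts(dropna=False)
print(gender_counts)

# Same selection as the two != tests, written with isin():
others = talks[~talks.gender.isin(["male", "female"])]
print(len(others))
```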