In [2]:
import pandas

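# Let Python create the column names list from the header row:
with open('../data/talks_6d.csv') as f:
    colnames = f.readline().strip().split(",")

# header=0 marks the first row as the header, so it is replaced by colnames
# instead of being read in as a row of data
TEDtalks = pandas.read_csv('../data/talks_6d.csv', names=colnames, header=0)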

ftexts = TEDtalks[TEDtalks.gender == 'female'].text.tolist()
print(len(ftexts))

607


In [9]:
import nltk 
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

my_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize
vectorizer = CountVectorizer(tokenizer = my_tokenizer)
   
X = vectorizer.fit_transform(ftexts)

In [10]:
X.shape

(607, 30852)

In [15]:
freq = np.ravel(X.sum(axis=0)) # sum each column to get the total count for each word

In [16]:
# freq is ordered by column index, i.e. by the values in the dictionary vectorizer.vocabulary_

import operator
# get vocabulary keys, sorted by value
vocab = [v[0] for v in sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))]
fdist = dict(zip(vocab, freq)) # return same format as nltk
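The alignment between `freq` and the vocabulary depends on `vocabulary_` mapping each word to its column index. A minimal sketch of that step, using a toy stand-in for `vectorizer.vocabulary_` (the names `toy_vocabulary` and `toy_freq`, and all the words and counts, are invented for illustration):

```python
import operator

# Hypothetical stand-in for vectorizer.vocabulary_: word -> column index
toy_vocabulary = {"talk": 2, "idea": 0, "story": 1}
# Hypothetical column sums, listed in column order (index 0, 1, 2)
toy_freq = [5, 3, 9]

# Sort the words by column index so they line up with the column sums
toy_vocab = [word for word, _ in sorted(toy_vocabulary.items(), key=operator.itemgetter(1))]
toy_fdist = dict(zip(toy_vocab, toy_freq))
print(toy_fdist)
```

If the words were zipped in dictionary order instead, each count would be attached to the wrong word.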

In [None]:
def relafreq(list_of_texts):
    '''Determines relative frequencies of words by first getting frequencies.'''
    # Take the list and turn it into one long string
    all_texts = " ".join(list_of_texts).lower()
    # Invoke the NLTK god (nltk and operator are imported in the cells above)
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(all_texts)
    # Count every word in one pass
    freqdist = nltk.FreqDist(words)
    # Get the total number of words so we can establish relative frequencies
    total_words = sum(freqdist.values())
    # Create a new dictionary of relative frequencies
    relative_frequency = {k: v/total_words for k, v in freqdist.items()}
    # Convert the dictionary to a rankable list of tuples
    # & rank it with the most frequent word first
    ranked = sorted(relative_frequency.items(), key=operator.itemgetter(1), reverse=True)
    # Return the ranked list of tuples
    return ranked

## All Talks

Okay, we don't need all talks, because that will include talks besides individual men and women speakers -- and the total number of those talks is small enough that they can be examined by hand at some other time. What we need are all the talks by women and all the talks by men, which, I think, will also make it easier to construct the dataframe described at the top of this notebook.

Here is what I deleted:

```python
all_talks = relafreq(texts)
all_talks[0:10]

print(len(all_talks))
>>> 53705
```

I also want to keep track of how to filter by string search:

```python
TEDtalks[TEDtalks['text'].str.contains("PMS")]
```

This will show a filtered version of the dataframe. To get just the two texts involved, I used the following code:

```python
pms_talks = TEDtalks[TEDtalks['text'].str.contains("PMS")].text.tolist()
print(len(pms_talks))
```
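A minimal sketch of the same filter on a toy dataframe (the rows and the names `toy_talks` and `toy_pms_talks` are invented). By default `str.contains` treats the pattern as a regular expression and is case-sensitive; passing `regex=False` matches the literal string, and `na=False` treats rows with missing text as non-matches rather than leaving them as missing values in the mask.

```python
import pandas

# Toy dataframe standing in for TEDtalks; the rows are made up
toy_talks = pandas.DataFrame({
    "gender": ["female", "male", "female"],
    "text": ["A talk about PMS research.", None, "Nothing relevant here."],
})

# Literal, case-sensitive match; rows with missing text count as non-matches
mask = toy_talks["text"].str.contains("PMS", regex=False, na=False)
toy_pms_talks = toy_talks[mask].text.tolist()
print(len(toy_pms_talks))
```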


## Gendered Talks

The next step is to create two additional collections filtered by the `gender` column of the dataframe. In the first line below we filter the dataframe; in the line that follows we pull the text column out. Originally, I had this as one line:

    f_talks = TEDtalks[TEDtalks.gender == 'female'].text.tolist()

But that produced a string and not a list object. I don't know why the one line would not work.

**2018-03-22**: Now it's working. Did I have `texts.tolist()`? I'm not clear what happened.

In [None]:
# Filter by gender

m_talks = TEDtalks[TEDtalks.gender == 'male'].text.tolist()
f_talks = TEDtalks[TEDtalks.gender == 'female'].text.tolist()

# A quick check of numbers:

# `texts` is not defined in the cells above, so count the dataframe's rows instead
print("Of the {} TED talks given, {} were given by women and {} by men.".format(
      len(TEDtalks), len(f_talks), len(m_talks)))

In [None]:
1437 / 607

TODO: One thing to do here is to compare the talks to see which words are used **only** by women or **only** by men.
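One way to approach this TODO, once the relative frequencies are in dictionaries, is to treat the dictionary keys as sets. This is only a sketch on toy data: the dictionaries `toy_f` and `toy_m` and their contents are invented for illustration.

```python
# Toy relative-frequency dictionaries; the words and values are invented
toy_f = {"the": 0.05, "women": 0.002, "midwife": 0.0001}
toy_m = {"the": 0.05, "women": 0.001, "carburetor": 0.0001}

# Dictionary key views support set operations directly
only_f = toy_f.keys() - toy_m.keys()   # words used only by women
only_m = toy_m.keys() - toy_f.keys()   # words used only by men
shared = toy_f.keys() & toy_m.keys()   # words used by both
print(sorted(only_f), sorted(only_m), sorted(shared))
```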

In [None]:
f_relative = relafreq(f_talks)
m_relative = relafreq(m_talks)

In [None]:
# Just a way to check against rank
import random
random_num = random.randint(1,500)
print(random_num, f_relative[random_num], m_relative[random_num])

## Comparing M/F Word Usage

I'm going to start by converting the list of tuples to a dictionary, both because I think matching keys is going to be easier (at least given my limited coding ability) and because, according to what I read, it appears to be faster. Since we don't really need a ranked listing for this work, it would probably be wise to rewrite the `relafreq` function so that it produces a dictionary. No reason to go back and forth like this.
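A possible shape for that rewrite, as a sketch only. The name `relafreq_dict` is hypothetical, and the standard library's `re.findall` and `collections.Counter` are swapped in for NLTK's `RegexpTokenizer` and `FreqDist` so that the example runs on its own:

```python
import re
from collections import Counter

def relafreq_dict(list_of_texts):
    '''Return a {word: relative frequency} dictionary for a list of texts.'''
    all_texts = " ".join(list_of_texts).lower()
    # re.findall(r'\w+') is a stand-in for NLTK's RegexpTokenizer(r'\w+')
    words = re.findall(r'\w+', all_texts)
    counts = Counter(words)
    total_words = sum(counts.values())
    return {word: count / total_words for word, count in counts.items()}

# Toy input; the sentences are invented
toy_rf = relafreq_dict(["The cat sat.", "The dog sat down."])
print(toy_rf["the"])
```

A ranked list is still one `sorted(toy_rf.items(), ...)` call away when it is needed.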

In [None]:
f_rf = dict(f_relative)

m_rf = dict(m_relative)

Some notes on how to compare: as I write this, I am trying to find a way to limit the for loop through the two dictionaries to the first N results, just so I can see if it's working.

[Python - Return first N key:value pairs from dict](https://stackoverflow.com/questions/7971618/python-return-first-n-keyvalue-pairs-from-dict)
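One way to stop after N matches, rather than interrupting the cell, is `itertools.islice`. A minimal sketch on toy data (the dictionaries `toy_f_rf` and `toy_m_rf` and their values are invented):

```python
from itertools import islice

# Toy dictionaries standing in for f_rf and m_rf; the values are invented
toy_f_rf = {"hour": 0.00008, "debased": 0.0000008, "perimeter": 0.0000016, "she": 0.004}
toy_m_rf = {"hour": 0.00012, "perimeter": 0.0000035, "she": 0.002}

N = 2
# Build the matches lazily and stop after the first N of them
matches = ((key, toy_f_rf[key], toy_m_rf[key]) for key in toy_f_rf if key in toy_m_rf)
first_n = list(islice(matches, N))
for row in first_n:
    print(*row)
```

Because the generator is lazy, the loop never visits the rest of the dictionary.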


In [None]:
# a way to match words in the f/m dictionaries

for key in f_rf:
    if key in m_rf:
        print(key, f_rf[key], m_rf[key])

What I decided to do was run the cell and then stop it. The results from above look like this:

    hour 8.613999628814925e-05 0.00012193460211598934
    debased 7.830908753468114e-07 2.924091177841471e-07
    deteriorate 3.1323635013872456e-06 1.7544547067048825e-06
    perimeter 1.5661817506936228e-06 3.508909413409765e-06

What we need now is to compare one list against the other for differences in usage. I am starting with "twice as often" as the threshold to see what that turns up. I need to re-read the literature here to see what comparison thresholds have been used.

**2018-03-26**: I decided that the easiest way to approach this is to create a dictionary comprehension that divides the female relative frequency by the male. We can then look at numbers >1 for the words preferred by women and numbers < 1 for words preferred by men.
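A sketch of how the ratio and the "twice as often" starting point could fit together, on toy data. The dictionaries and the `THRESHOLD` constant are assumptions made for illustration, not a cutoff taken from the literature:

```python
# Toy relative frequencies; the words and values are invented
toy_f_rf = {"she": 0.006, "hour": 0.0001, "war": 0.0002}
toy_m_rf = {"she": 0.002, "hour": 0.0001, "war": 0.0008}

# Ratio of female to male relative frequency, for shared words only
toy_ratios = {w: toy_f_rf[w] / toy_m_rf[w] for w in toy_f_rf if w in toy_m_rf}

THRESHOLD = 2  # "twice as often"; an assumed starting point
preferred_by_women = {w for w, r in toy_ratios.items() if r >= THRESHOLD}
preferred_by_men = {w for w, r in toy_ratios.items() if r <= 1 / THRESHOLD}
print(preferred_by_women, preferred_by_men)
```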

In [None]:
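# We are just going to do math:

# Divide only where a word appears in both dictionaries; otherwise
# m_rf[key] raises a KeyError
weighted = {key: f_rf[key]/m_rf[key] for key in f_rf if key in m_rf}

for key in weighted:
    if f_rf[key] > m_rf[key]:
        print(key, f_rf[key], m_rf[key])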

In [None]:
# Create a dataframe from a list of dictionaries
comp = pandas.DataFrame([f_rf, m_rf]).T

# Rename the columns to something human readable
comp.columns = ['women', 'men']

# Check our results
comp.head()

In [None]:
# I'm doing the division manually below instead of using the built-in functionality:
# DataFrame.divide(other, axis='columns', level=None, fill_value=None)
ratios = comp.assign(ratio = comp.women / comp.men)
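For comparison, here is a sketch of the method form of the division on a toy dataframe (all names prefixed `toy_` and their values are invented). It also shows what happens to a word that appears in only one dictionary: its ratio comes out as `NaN`, so such words can be listed separately.

```python
import pandas

# Toy relative frequencies; "midwife" is missing from the men's dictionary
toy_f_rf = {"she": 0.004, "hour": 0.0001, "midwife": 0.0001}
toy_m_rf = {"she": 0.002, "hour": 0.0001}

toy_comp = pandas.DataFrame([toy_f_rf, toy_m_rf]).T
toy_comp.columns = ["women", "men"]

# Series.div is the method form of the / operator
toy_ratios = toy_comp.assign(ratio=toy_comp["women"].div(toy_comp["men"]))

# Words used by only one group have no ratio (NaN); list them separately
unmatched = toy_ratios[toy_ratios["ratio"].isna()].index.tolist()
print(unmatched)
```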

In [None]:
ratios.head()

In [None]:
ratios.index.rename('word', inplace=True)

In [None]:
ratios.head()

In [None]:
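import matplotlib.pyplot as plt

# 'word' is the index, not a column, so plot each ratio against its row position
ratios.assign(position=range(len(ratios))).plot.scatter(x='position', y='ratio')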

In [None]:
occurs = ratios.assign(woccurs = comp.women * 4493178, moccurs = comp.men*4493178)

In [None]:
# And now to sort:

occurs.sort_values(by='ratio', ascending=False)