# Gendered Words

A number of recent studies have examined the ways that gender can affect discourse: women speakers, for example, have been reported to use *hedge* words and phrases more often. What can TED talks add to these explorations?

## Imports and Data

In [None]:
# Imports
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
# Load the Data
dfAll = pd.read_csv('../output/TEDall.csv')

# Filter the dataframe to just the TED main talks:
main = dfAll[dfAll['Set']=='only']
# main.shape
main['presented'].value_counts().sort_index(ascending=True)

In [None]:
list(dfAll)

In [None]:
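# Concatenate all the texts of the talks into one big pseudo-document for each gender.
# Joining on a space keeps the last word of one talk from fusing with the first word
# of the next once punctuation is stripped.
all_years = main.groupby(['gender'])['text'].apply(lambda x: ' '.join(x))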
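
The grouping-and-joining step can be tried on toy data first. This is only a minimal sketch: the frame `toy`, its `gender` values, and its texts are invented for illustration and are not taken from `TEDall.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the talks dataframe (values are made up)
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "text": ["first talk", "second talk", "third talk"],
})

# One pseudo-document per group: every text in a group joined with a space
docs = toy.groupby("gender")["text"].apply(" ".join)
print(docs)
```

The result is a series indexed by the grouping value, which is why the later cells can recover the labels from the series index.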

In [None]:
# Raw strings keep the backslashes intact as regex escapes for the literal parentheses
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)",
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)",
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)",
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)",
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)",
                  r"\(drum sounds\)" ]
len(parentheticals)

In [None]:
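# Collect every non-missing speaker name across the four speaker columns
speaker_cols = ['speaker_1', 'speaker_2', 'speaker_3', 'speaker_4']
speakers = pd.concat([dfAll[col] for col in speaker_cols]).dropna().unique().tolist()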
print(speakers[0:10])

In [None]:
def remove_parens(text):
    # Lowercase once, then replace each parenthetical marker with a space
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text)
    return new_text

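def remove_speaker_names(text):
    temp_text = text
    for name in speakers:
        # Skip missing values; escape names so characters like '.' match literally
        if isinstance(name, str):
            temp_text = re.sub(re.escape(name), ' ', temp_text)
    return temp_text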

def clean_text(text):
    the_text = text
    cleaned = remove_parens(remove_speaker_names(the_text))
    return cleaned
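
To see what this kind of cleaning does to a transcript, here is a small self-contained sketch. The helper `strip_patterns`, the shortened pattern list, and the sample sentence are hypothetical stand-ins, not the notebook's own functions or data.

```python
import re

# Hypothetical, shortened pattern list for illustration only
sample_patterns = [r"\(laughter\)", r"\(applause\)"]

def strip_patterns(text, patterns):
    """Lowercase the text, then replace each pattern with a single space."""
    cleaned = text.lower()
    for pattern in patterns:
        cleaned = re.sub(pattern, " ", cleaned)
    return cleaned

sample = "Thank you. (Applause) It was fun. (Laughter)"
result = strip_patterns(sample, sample_patterns)
print(result.split())
```

Replacing with a space rather than an empty string leaves some extra whitespace behind, but it guarantees that the words on either side of a marker never get glued together.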

In [None]:
# Convert our series to a dataframe to make it easier to work in place:
dfYears = all_years.to_frame()

# Lowercase our texts
dfYears = dfYears.apply(lambda x: x.astype(str).str.lower())

# Remove everything that isn't a word character, whitespace, or a plus sign
dfYears = dfYears.replace(r'[^\w\s+]', '', regex = True)

# Split on any run of whitespace and then count the length of the resulting list
dfYears['word_count'] = dfYears.text.apply(lambda x: len(str(x).split()))

# See the results
dfYears.head(16)
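
The lowercase, strip, and count sequence can be checked on a couple of made-up strings. This is a sketch under stated assumptions: the frame `toy` and its two texts are invented, and the pandas string accessor is used here in place of `DataFrame.replace`.

```python
import pandas as pd

# Hypothetical texts, one with punctuation and one with repeated spaces
toy = pd.Series(
    {"a": "Hello, world! (Laughter)", "b": "One  two   three."},
    name="text",
).to_frame()

# Lowercase, then drop anything that is not a word character, whitespace, or '+'
toy["text"] = toy["text"].str.lower().str.replace(r"[^\w\s+]", "", regex=True)

# split() with no argument splits on runs of whitespace, so repeated
# spaces do not inflate the count with empty strings
toy["word_count"] = toy["text"].str.split().str.len()
print(toy)
```

Note that stripping punctuation this way keeps the word inside a parenthetical such as `(Laughter)`, so it still counts as a word unless the markers are removed beforehand.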

In [None]:
# CountVectorizer expects an iterable of strings, so we create a list
texts = [ value for index, value in all_years.items() ]

# We are going to bring our labels back to the resulting term matrix below,
# so while we are creating lists from our series, let's grab the series index
# (And yes, you can create two lists from one list comprehension, but don't.)
year_labels = [ index for index, value in all_years.items() ]

# This just checks our results
print(len(texts), texts[0][0:50], year_labels[0:5])

In [None]:
# The usual incantation (minus the desired speaker removal for now):
vec = CountVectorizer(preprocessor = remove_parens, min_df = 2)
word_count_vector = vec.fit_transform(texts)
word_count_vector.shape

In [None]:
# Create a dataframe from the resulting array
X = vec.fit_transform(texts)
term_matrix = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
term_matrix.shape

In [None]:
term_matrix['year'] = year_labels
term_matrix.set_index('year', inplace = True)

In [None]:
word_df = term_matrix.transpose()
word_df.reset_index(inplace=True)
word_df = word_df.rename(columns={'index': 'term'})
word_df.head()

In [None]:
# Let's save this dataframe 
# ==> Commented out so re-running notebook doesn't result in new file
# word_df.to_csv('../output/term_year_matrix.csv')

## Analysis

With all of the above done, we now have a matrix with every word in a row and every year in a column, so that we can read a word's usage from left to right, moving forward in time.

In [None]:
# Load the Data
df = pd.read_csv('../output/term_year_matrix.csv') # , index_col = 'term'

# The 'Unnamed: 0' column is a vestigial index, let's drop it:
# df.drop(columns = ['Unnamed: 0'], inplace=True)
# df.set_index('term')

# Check shape and list columns:
print(df.shape, list(df))

In [None]:
# One term (build the mask from df itself, not from word_df):
df[df['term']=='nuclear']

In [None]:
# Multiple terms:
terms = ['nuclear', 'global', 'climate']

# And this is the pandas way
df[df.term.isin(terms)]

### Normalizing by Year

In the next series of cells, we first get the total number of words for each year. We then collect the year columns so that we can sum each column and divide each term's count for a given year by that year's total.
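
As a rough illustration of this normalization, the sketch below works on a toy frame. The counts and the column labels `2010` and `2011` are invented; only the shape (a `term` column followed by count columns) mirrors the dataframe used here.

```python
import pandas as pd

# Hypothetical term counts: one row per term, one column per year
toy = pd.DataFrame({
    "term": ["climate", "global", "nuclear"],
    "2010": [2, 3, 5],
    "2011": [1, 1, 2],
})

# Every column except the first holds counts
count_cols = list(toy)[1:]

# Column totals: a series indexed by column name
totals = toy[count_cols].sum()

# Dividing a dataframe by a series aligns on column names,
# so each column is divided by its own total
toy[count_cols] = toy[count_cols] / totals
print(toy)
```

After this step each count column sums to 1, so values are comparable across columns whose totals differ.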

In [None]:
# a quick check of the sums involved (numeric columns only, so the term strings are skipped)
df.sum(axis = 0, skipna = True, numeric_only = True)

In [None]:
# a list of our columns minus the first one which is where our terms are located:
years = list(df)[1:]

In [None]:
# divide each cell in a column by the total for each column
df[years] = df[years] / df[years].sum()

In [None]:
# and here are our three sample terms, now with normalized frequency for a year
df[df.term.isin(terms)]

In [None]:
# Save the normalized dataframe (df, not word_df, holds the normalized values)
# df.to_csv('../output/term_year_matrix_normalized.csv')