## Imports & Data Load

In [1]:
# IMPORTS
import spacy, textacy
import pandas as pd
from nltk import sent_tokenize

# if needed, run the following in terminal: python3 -m spacy download en_core_web_sm
# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_sm')

In [21]:
# Load the data, partitioned by speaker gender:
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('../output/talks_nog.csv', index_col='Talk_ID')
talks_all = pd.concat([talks_m, talks_f, talks_nog])

# TODO: make this work with the single-dataframe approach
# print(f"From our {talks_all.shape[0]}x{talks_all.shape[1]} CSV, \
# we have a list of {len(texts_all)} talks: {len(texts_women)} by women and \
# {len(texts_men)} by men.")

In [25]:
# Get a list of all the columns
# talks_m.columns.tolist()
drop = ['public_url', 'headline', 'duration', 'published', 'views',
        'description', 'tags','event', 'speaker_1', 'speaker_2', 'speaker_3', 'speaker_4']
df_origin = talks_all.drop(columns=drop)
df_origin

Talk_ID,text,talk_gender
1,"Thank you so much, Chris. And it's truly a g...",male
7,"(Music: ""The Sound of Silence,"" Simon & Garf...",male
66,Good morning. How are you? (Laughter) ...,male
92,"About 10 years ago, I took on the task to te...",male
96,Thank you. I have to tell you I'm both chall...,male
...,...,...
1972,"Pat Mitchell: That day, January 8, 2011, began...",No one gender
2300,"Alec Soth: So about 10 years ago, I got a ca...",No one gender
2611,(Music) I went down to St. James Infirmar...,No one gender
2481,(Music) Amanda Palmer (singing): Ground Con...,No one gender


In [30]:
# This leaves us only with: Talk_ID, text, talk_gender
# Talk_ID is the dataframe's index!
print(f"Index => {df_origin.index.name}.")
df_origin['text_id'] = df_origin.index
print(df_origin.head(3))

Index => Talk_ID.
                                                      text talk_gender   
Talk_ID                                                                  
1          Thank you so much, Chris. And it's truly a g...        male  \
7          (Music: "The Sound of Silence," Simon & Garf...        male   
66         Good morning. How are you?    (Laughter)    ...        male   

         text_id  
Talk_ID           
1              1  
7              7  
66            66  


## From Texts to Sentences to SVOs

### Texts to Sentences

In [41]:
df = df_origin.copy()

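# Note: str.lower() returns a new Series. The result is not assigned here,
# so the texts keep their original casing (as the output below shows).
df['text'].str.lower()

# Break each text into a list of sentences
df['text'] = df['text'].apply(sent_tokenize)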

# Break each sentence into its own row
df = df.explode('text').reset_index().rename(columns={'text_id' : 'row_id'})

# Number the sentences within each talk (0-based)
df['row_id'] = df.groupby('Talk_ID').cumcount()

# Create a unique sentence identifier
df['sentence_id'] = df['Talk_ID'].astype('str') + "-" + df["row_id"].astype('str')

# Drop the unneeded column
df = df.drop(columns=['row_id'])

# Check the results
df.head()

,Talk_ID,text,talk_gender,sentence_id
0,1,"Thank you so much, Chris.",male,1-0
1,1,And it's truly a great honor to have the oppor...,male,1-1
2,1,"I have been blown away by this conference, and...",male,1-2
3,1,"And I say that sincerely, partly because (Mock...",male,1-3
4,1,(Laughter) Put yourselves in my position.,male,1-4


In [42]:
df.shape

(129256, 4)

### Sentences to SVOs

In [57]:
df.iloc[1,1]
s, v = textacy.extract.triples.subject_verb_object_triples(nlp(df.iloc[1,1]))
print(s,v)

ValueError: not enough values to unpack (expected 2, got 0)

In [50]:
df.iloc[2,1]

'I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.'

In [48]:
s, v = textacy.extract.triples.subject_verb_object_triples(nlp(df.iloc[2,1]))

In [49]:
print(s, v)

SVOTriple(subject=[I], verb=[have, been, blown], object=[conference]) SVOTriple(subject=[I], verb=[want], object=[to, thank, all, of, you, for, the, many, nice, comments, about, what, I, had, to, say, the, other, night])
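The `ValueError` in the earlier cell and the success here have the same cause: `subject_verb_object_triples` yields zero or more triples per sentence, and `s, v = ...` only works when there are exactly two. The sketch below illustrates this with a hypothetical stand-in generator (`fake_triples`, not part of textacy) so that it runs without a spaCy model; collecting the results into a list handles any count.

```python
from collections import namedtuple

# Hypothetical stand-in for textacy's SVOTriple, used only for illustration.
SVOTriple = namedtuple("SVOTriple", ["subject", "verb", "object"])

def fake_triples(n):
    """Yield n placeholder triples, mimicking a generator of SVO triples."""
    for i in range(n):
        yield SVOTriple(subject=[f"s{i}"], verb=[f"v{i}"], object=[f"o{i}"])

# Tuple unpacking demands an exact count, so a sentence with no triples fails.
unpack_error = None
try:
    s, v = fake_triples(0)
except ValueError as err:
    unpack_error = str(err)

# Collecting into a list works for any number of triples, including zero.
counts = [len(list(fake_triples(n))) for n in (0, 1, 2, 3)]

print(unpack_error)
print(counts)
```

In the notebook itself, the equivalent would be wrapping the textacy call in `list(...)` and iterating over the result, which is what the `actions` function does.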


In [None]:
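# Build the text lists from the gendered dataframes loaded above
# (assumes the 'text' column holds each talk's transcript)
texts_m = talks_m['text'].tolist()
texts_w = talks_f['text'].tolist()

# Parse every talk with spaCy
docs_m = list(nlp.pipe(texts_m))
docs_w = list(nlp.pipe(texts_w))

def actions(doc_id, doc, svo_list):
    """Append one record per SVO triple found in doc to svo_list."""
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    for item in svotriples:
        svo_list.append(
            {
                'doc': doc_id,
                # Keep only the last token of the subject and of the verb phrase
                'subject': str(item[0][-1]), 
                'verb': str(item[1][-1]), 
                # The object is stored as the string form of its token list
                'object': str(item[2])
            }
        )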

In [None]:
# Create the two lists
all_svos_m = []
all_svos_w = []
doc_id = 0

# Populate the lists with SVO triples
for doc in docs_m:
    actions(doc_id, doc, all_svos_m)
    doc_id += 1

for doc in docs_w:
    actions(doc_id, doc, all_svos_w)
    doc_id += 1

# Convert the lists to dataframes
svos_w = pd.DataFrame(all_svos_w)
svos_m = pd.DataFrame(all_svos_m)

print(svos_m.shape[0], svos_w.shape[0])

In [None]:
# Save to CSV files 
# >>> Commented out once run
#svos_w.to_csv("../output/svos_w.csv")
#svos_m.to_csv("../output/svos_m.csv")

## 2: Counts of Sentences vs SVOs <a id='sentences'></a>

The code above suggests that 70% of the SVOs in TED talks have one of `'i', 'we', 'she', 'he', 'they', 'it', 'you'` as their subject. It is not clear, however, how well the SVO pattern represents all of the sentences in the talks. In this section we count sentences with both NLTK and spaCy, and compare those counts against a hand count of a few sample texts, to see how well our code reflects the underlying texts.
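As a minimal sketch of how such a share could be computed, the function below takes records shaped like the ones `actions` builds and returns the fraction whose subject is one of the listed pronouns. The names `PRONOUNS`, `pronoun_subject_share`, and `toy_svos` are hypothetical, and the toy records are invented for illustration rather than drawn from the corpus. Subjects are lowercased before comparison on the assumption that the extracted subjects keep their original casing.

```python
PRONOUNS = {'i', 'we', 'she', 'he', 'they', 'it', 'you'}

def pronoun_subject_share(svo_records):
    """Return the fraction of SVO records whose subject is in PRONOUNS."""
    if not svo_records:
        return 0.0
    hits = sum(1 for rec in svo_records if rec['subject'].lower() in PRONOUNS)
    return hits / len(svo_records)

# Toy records, invented for illustration only.
toy_svos = [
    {'doc': 0, 'subject': 'I', 'verb': 'want', 'object': '[to, thank]'},
    {'doc': 0, 'subject': 'We', 'verb': 'built', 'object': '[model]'},
    {'doc': 1, 'subject': 'scientists', 'verb': 'found', 'object': '[evidence]'},
    {'doc': 1, 'subject': 'they', 'verb': 'made', 'object': '[mistakes]'},
]

share = pronoun_subject_share(toy_svos)
print(f"{share:.0%} of the toy SVOs have a pronoun subject")
```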

### NLTK

In [None]:
sents_w = [ sent_tokenize(text) for text in texts_w ]    
sents_m = [ sent_tokenize(text) for text in texts_m ]

print(len(sents_w[0]))

In [None]:
sent_count_m = 0
for text in texts_m:
    sent_count_m += len(sent_tokenize(text))

sent_count_w = 0
for text in texts_w:
    sent_count_w += len(sent_tokenize(text))

print(f"Female corpus sentence count: {sent_count_w}\nMale corpus sentence count: {sent_count_m}")

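That gives the following ratios of SVO triples to sentences. Because a single sentence can yield more than one triple (as in the two-triple example above), each value is a count of triples per sentence, not the share of sentences that contain an SVO: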

In [None]:
print(f"Female subcorpus: {svos_w.shape[0] / sent_count_w}")
print(f"Male subcorpus: {svos_m.shape[0] / sent_count_m}")

### spaCy

Our spaCy documents already exist, so we just need to iterate over each document's `.sents` property and count the sentences.

In [None]:
snt_cnt_w = 0
for doc in docs_w:
    snt_cnt_w += len(list(doc.sents))

snt_cnt_m = 0
for doc in docs_m:
    snt_cnt_m += len(list(doc.sents))

print(f"F: {snt_cnt_w}, M: {snt_cnt_m}.")

In [None]:
print(f"F: {svos_w.shape[0] / snt_cnt_w}")
print(f"M: {svos_m.shape[0] / snt_cnt_m}")

The total sentence counts are:
```
Women - NLTK : 30,799 with SVO ratio of 86%
        spaCy: 31,673 with SVO ratio of 84%
Men -   NLTK : 96,342 with SVO ratio of 83%
        spaCy: 99,039 with SVO ratio of 80%
```
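The ratios above divide the number of SVO triples by the number of sentences, which is not the same as the share of sentences that produce at least one triple. The sketch below shows how the two measures can diverge, using a hypothetical helper `svo_metrics` and an invented list of per-sentence triple counts; none of these numbers come from the corpus.

```python
def svo_metrics(triples_per_sentence):
    """Return (triples per sentence, share of sentences with >= 1 triple)."""
    n = len(triples_per_sentence)
    if n == 0:
        return 0.0, 0.0
    ratio = sum(triples_per_sentence) / n
    coverage = sum(1 for count in triples_per_sentence if count > 0) / n
    return ratio, coverage

# Invented counts: five sentences, two of which yield no triple at all.
toy_counts = [0, 2, 1, 0, 1]

ratio, coverage = svo_metrics(toy_counts)
print(f"Triples per sentence: {ratio:.0%}")
print(f"Sentences with an SVO: {coverage:.0%}")
```

If the second measure is the one of interest, it could be computed in the notebook by counting triples sentence by sentence rather than per talk.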

### Post-SVO Lemmatizing

Two possible approaches to lemmatizing verbs in a dataframe:
* [How to lemmatise a dataframe column Python - Stack Overflow](https://stackoverflow.com/questions/61987040/how-to-lemmatise-a-dataframe-column-python)
* [dataframe - lemmatizing a verb list in a data frame in Python - Stack Overflow](https://stackoverflow.com/questions/72394840/lemmatizing-a-verb-list-in-a-data-frame-in-python)

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
# https://www.nltk.org/_modules/nltk/stem/wordnet.html
wnl = WordNetLemmatizer()
svos_w.verb = svos_w.verb.map(lambda word: wnl.lemmatize(word, pos="v"))

In [None]:
svos_w.shape

In [None]:
svos_m.verb = svos_m.verb.map(lambda word: wnl.lemmatize(word, pos="v"))

In [None]:
# Save to CSV files 
# >>> Commented out once run
# svos_w.to_csv("../output/svos_w_lem.csv")
# svos_m.to_csv("../output/svos_m_lem.csv")