## Imports & Data Load

In [1]:
# IMPORTS
import spacy, textacy
import pandas as pd
from nltk import sent_tokenize

# if needed, run the following in terminal: python3 -m spacy download en_core_web_sm
# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_sm')

2023-08-03 11:41:23.021616: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Load from the gendered corpora
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('../output/talks_nog.csv', index_col='Talk_ID')

# Create one dataframe
df = pd.concat([talks_m, talks_f, talks_nog])
print(df.shape)
df.head(3)

(992, 14)


Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker_3,speaker_4,talk_gender
1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male
7,https://www.ted.com/talks/david_pogue_says_sim...,Simplicity sells,New York Times columnist David Pogue takes aim...,TED2006,0:21:26,6/27/06,"simplicity,entertainment,interface design,soft...",1702201,"(Music: ""The Sound of Silence,"" Simon & Garf...",David Pogue,,,,male
66,https://www.ted.com/talks/ken_robinson_says_sc...,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,TED2006,0:19:24,6/27/06,"children,teaching,creativity,parenting,culture...",51614087,Good morning. How are you? (Laughter) ...,Ken Robinson,,,,male


## From Texts to Sentences

In [3]:
# Talk_ID is the dataframe's index!
print(f"Index => {df.index.name}.")

Index => Talk_ID.


In [4]:
# Lowercase the texts
# NOTE: the result is not assigned, so df['text'] keeps its original case
df['text'].str.lower()

# Break each text into a list of sentences
df['sentence'] = df['text'].apply(sent_tokenize)

# Copy the index to use in our sentence-ID column
df['text_id'] = df.index

# Break each sentence into its own row
df = df.explode('sentence').reset_index().rename(columns={'text_id' : 'row_id'})

# Count the rows
df['row_id'] = df.groupby('Talk_ID').cumcount()

# Create a unique sentence identifier
df['sentence_id'] = df['Talk_ID'].astype('str') + "-" + df["row_id"].astype('str')

# Drop the unneeded column
df = df.drop(columns=['row_id'])

# Check the results
df.head(3)

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker_3,speaker_4,talk_gender,sentence,sentence_id
0,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male,"Thank you so much, Chris.",1-0
1,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male,And it's truly a great honor to have the oppor...,1-1
2,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male,"I have been blown away by this conference, and...",1-2


In [5]:
df.shape

(129256, 17)
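
The explode-and-count pattern used in cell 4 is easier to follow on a toy frame. The sketch below is illustrative only: the two "talks" and their sentences are made up, and the intermediate name `position` is an assumption standing in for the notebook's temporary `row_id` column.

```python
import pandas as pd

# Toy frame standing in for the talks dataframe (values are made up)
toy = pd.DataFrame(
    {"sentence": [["A b.", "C d.", "E f."], ["G h."]]},
    index=pd.Index([1, 7], name="Talk_ID"),
)

# One row per sentence; reset_index turns Talk_ID back into a regular column
toy = toy.explode("sentence").reset_index()

# Number the sentences within each talk, starting from 0
position = toy.groupby("Talk_ID").cumcount()

# Combine talk ID and position into a unique sentence identifier
toy["sentence_id"] = toy["Talk_ID"].astype(str) + "-" + position.astype(str)

sentence_ids = toy["sentence_id"].tolist()
```

Because the counter restarts for every talk, the identifier stays unique only in combination with `Talk_ID`, which is why the two are joined into one string.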

As a reminder, the previous sentence counts have been:

| Gender |   NLTK  |  spaCy  |
| ------ |  ------ |  -----  |
| Women  |  30,799 |  31,673 |
| Men    |  96,342 |  99,039 |
| *Total*| 127,141 | 130,712 |

## Sentences to SVOs

For each sentence in the new **sentence** column, we need to populate subject, verb, and object columns. The problem is that some sentences generate two or more SVOs, and some sentences produce no SVO at all. For the time being, the code leaves the columns empty for sentences that have no SVOs, and it captures only the first two SVOs in compound and complex sentences.
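
One way to fit a variable number of triples into a fixed set of columns is to truncate and pad. The sketch below is a minimal illustration, not the notebook's own function: `pad_triples` is a hypothetical helper, and plain strings stand in for the spaCy tokens that textacy returns.

```python
from itertools import islice


def pad_triples(triples, max_triples=2, fill=""):
    """Flatten up to `max_triples` (subject, verb, object) triples into a
    fixed-width tuple, padding any missing slots with `fill`."""
    flat = []
    for subject, verb, obj in islice(triples, max_triples):
        flat.extend([subject, verb, obj])
    # Pad so every sentence yields the same number of values
    flat.extend([fill] * (3 * max_triples - len(flat)))
    return tuple(flat)


# No triple, one triple, and more triples than there are slots
none_found = pad_triples([])
one_found = pad_triples([("I", "need", "that")])
three_found = pad_triples(
    [("I", "blown", "conference"), ("I", "want", "to thank"), ("we", "see", "it")]
)
```

Every call returns six values, so the result can be spread across the `s1` through `o2` columns regardless of how many triples a sentence produced. Any triples beyond the limit are silently dropped.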

In [6]:
df.set_index('sentence_id')

sentence_id,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,speaker_1,speaker_2,speaker_3,speaker_4,talk_gender,sentence
1-0,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male,"Thank you so much, Chris."
1-1,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male,And it's truly a great honor to have the oppor...
1-2,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male,"I have been blown away by this conference, and..."
1-3,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male,"And I say that sincerely, partly because (Mock..."
1-4,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",Al Gore,,,,male,(Laughter) Put yourselves in my position.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2481-16,2481,https://www.ted.com/talks/amanda_palmer_jherek...,"""Space Oddity""",Singer Amanda Palmer pays tribute to the inimi...,TED2016,0:06:09,11/4/16,"live music,vocals,TED Fellows,music,performanc...",756190,(Music) Amanda Palmer (singing): Ground Con...,Amanda Palmer,Jherek Bischoff,,,No one gender,Planet Earth is blue and there's nothing I can...
2481-17,2481,https://www.ted.com/talks/amanda_palmer_jherek...,"""Space Oddity""",Singer Amanda Palmer pays tribute to the inimi...,TED2016,0:06:09,11/4/16,"live music,vocals,TED Fellows,music,performanc...",756190,(Music) Amanda Palmer (singing): Ground Con...,Amanda Palmer,Jherek Bischoff,,,No one gender,"(Music) [""I'm not a prophet or a stone-age ..."
2481-18,2481,https://www.ted.com/talks/amanda_palmer_jherek...,"""Space Oddity""",Singer Amanda Palmer pays tribute to the inimi...,TED2016,0:06:09,11/4/16,"live music,vocals,TED Fellows,music,performanc...",756190,(Music) Amanda Palmer (singing): Ground Con...,Amanda Palmer,Jherek Bischoff,,,No one gender,"I'm living on."""
2481-19,2481,https://www.ted.com/talks/amanda_palmer_jherek...,"""Space Oddity""",Singer Amanda Palmer pays tribute to the inimi...,TED2016,0:06:09,11/4/16,"live music,vocals,TED Fellows,music,performanc...",756190,(Music) Amanda Palmer (singing): Ground Con...,Amanda Palmer,Jherek Bischoff,,,No one gender,"David Bowie, 1947-2016] (Applause)"


In [7]:
def svo(text):
    # Extract every subject-verb-object triple that textacy finds in the sentence
    triples = list(textacy.extract.triples.subject_verb_object_triples(nlp(text)))
    parts = []
    # Keep at most the first two triples
    for triple in triples[:2]:
        parts.append(str(triple[0]).strip("[]"))      # subject tokens
        parts.append(str(triple[1][-1]).strip("[]"))  # last token of the verb phrase
        parts.append(str(triple[2]).strip("[]"))      # object tokens
    # Pad with empty strings when fewer than two triples were found
    parts += [''] * (6 - len(parts))
    return tuple(parts)

In [8]:
df[['s1', 'v1', 'o1', 's2', 'v2', 'o2']] = df['sentence'].apply(svo).apply(pd.Series)
df.head()

Unnamed: 0,Talk_ID,public_url,headline,description,event,duration,published,tags,views,text,...,speaker_4,talk_gender,sentence,sentence_id,s1,v1,o1,s2,v2,o2
0,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",...,,male,"Thank you so much, Chris.",1-0,,,,,,
1,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",...,,male,And it's truly a great honor to have the oppor...,1-1,,,,,,
2,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",...,,male,"I have been blown away by this conference, and...",1-2,I,blown,conference,I,want,"to, thank, all, of, you, for, the, many, nice,..."
3,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",...,,male,"And I say that sincerely, partly because (Mock...",1-3,I,need,that,,,
4,1,https://www.ted.com/talks/al_gore_on_averting_...,Averting the climate crisis,With the same humor and humanity he exuded in ...,TED2006,0:16:17,6/27/06,"alternative energy,cars,global issues,climate ...",3266733,"Thank you so much, Chris. And it's truly a g...",...,,male,(Laughter) Put yourselves in my position.,1-4,Laughter,Put,yourselves,,,


In [9]:
df.memory_usage(index=True).sum()

23783232

In [10]:
df_textless = df.drop(columns=['text'])
df_textless.memory_usage(index=True).sum()

22749184

In [13]:
# Save to CSV file 
# >>> Commented out once run
# df.to_csv("../output/svos-sentences.csv")

### Post-SVO Lemmatizing

Two possible approaches to lemmatizing verbs in a dataframe:
* [How to lemmatise a dataframe column Python - Stack Overflow](https://stackoverflow.com/questions/61987040/how-to-lemmatise-a-dataframe-column-python)
* [dataframe - lemmatizing a verb list in a data frame in Python - Stack Overflow](https://stackoverflow.com/questions/72394840/lemmatizing-a-verb-list-in-a-data-frame-in-python)

In [11]:
from nltk.stem import WordNetLemmatizer

# https://www.nltk.org/_modules/nltk/stem/wordnet.html
wnl = WordNetLemmatizer()

In [12]:
df.v1 = df.v1.map(lambda word: wnl.lemmatize(word, pos="v"))

In [14]:
df.v2 = df.v2.map(lambda word: wnl.lemmatize(word, pos="v"))
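
The two `map` calls above lemmatize every row, even though many rows likely share the same verb and many hold an empty string. A possible alternative is to lemmatize each distinct value once and map the results back. The sketch below assumes nothing about the notebook's data: `lemmatize_unique` is a hypothetical helper, and `toy_lemmatize` is a stand-in for `wnl.lemmatize(word, pos="v")` so the example runs without the WordNet data.

```python
def lemmatize_unique(values, lemmatize):
    """Lemmatize each distinct non-empty value once and map the results back."""
    lookup = {value: lemmatize(value) for value in set(values) if value}
    # Empty strings are not in the lookup and pass through unchanged
    return [lookup.get(value, value) for value in values]


# Stand-in lemmatizer for illustration only (not WordNet)
toy_lemmas = {"blown": "blow", "wants": "want"}
calls = []


def toy_lemmatize(word):
    calls.append(word)  # record each call to show how often the lemmatizer runs
    return toy_lemmas.get(word, word)


verbs = ["blown", "", "wants", "blown", ""]
lemmas = lemmatize_unique(verbs, toy_lemmatize)
```

Here the lemmatizer runs twice for five input values. In the dataframe, the same idea could be applied by building the lookup from the column's unique values and passing it to `map`.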

In [15]:
# Save to CSV file
# >>> Comment out once run
df.to_csv("../output/contexts-upto2svos.csv")