## Imports & Data Load

In [1]:
# IMPORTS
import spacy, textacy
import pandas as pd
from nltk import sent_tokenize

# If needed, run the following in a terminal: python3 -m spacy download en_core_web_sm
# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_sm')



In [2]:
# Load the data, which is partitioned by speaker gender:
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('../output/talks_nog.csv', index_col='Talk_ID')
talks_all = pd.concat([talks_m, talks_f, talks_nog])

# Make this work with the one dataframe approach
# print(f"From our {talks_all.shape[0]}x{talks_all.shape[1]} CSV, \
# we have a list of {len(texts_all)} talks: {len(texts_women)} by women and \
# {len(texts_men)} by men.")
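With the single-dataframe approach, the summary that the commented-out `print` was aiming for can be derived from the `talk_gender` column directly. The following is a minimal sketch on a toy frame: the three rows are made up for illustration, and the `'female'` label is an assumption (only `'male'` and `'No one gender'` appear in the outputs of this notebook).

```python
import pandas as pd

# Toy stand-in for talks_all (hypothetical rows); the real frame is indexed by Talk_ID.
toy_talks = pd.DataFrame(
    {"text": ["a", "b", "c"], "talk_gender": ["male", "female", "male"]},
    index=pd.Index([1, 2, 3], name="Talk_ID"),
)

# Count talks per gender label instead of keeping separate lists of texts.
counts = toy_talks["talk_gender"].value_counts()
n_rows, n_cols = toy_talks.shape
summary = (
    f"From our {n_rows}x{n_cols} CSV, we have {len(toy_talks)} talks: "
    f"{counts.get('female', 0)} by women and {counts.get('male', 0)} by men."
)
print(summary)
```

Using `counts.get(label, 0)` keeps the summary from failing if a label is absent from the data.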

In [3]:
# Get a list of all the columns
# talks_m.columns.tolist()
drop = ['public_url', 'headline', 'duration', 'published', 'views',
        'description', 'tags','event', 'speaker_1', 'speaker_2', 'speaker_3', 'speaker_4']
df_origin = talks_all.drop(columns=drop)
df_origin

Talk_ID,text,talk_gender
1,"Thank you so much, Chris. And it's truly a g...",male
7,"(Music: ""The Sound of Silence,"" Simon & Garf...",male
66,Good morning. How are you? (Laughter) ...,male
92,"About 10 years ago, I took on the task to te...",male
96,Thank you. I have to tell you I'm both chall...,male
...,...,...
1972,"Pat Mitchell: That day, January 8, 2011, began...",No one gender
2300,"Alec Soth: So about 10 years ago, I got a ca...",No one gender
2611,(Music) I went down to St. James Infirmar...,No one gender
2481,(Music) Amanda Palmer (singing): Ground Con...,No one gender


In [4]:
# This leaves us only with: Talk_ID, text, talk_gender
# Talk_ID is the dataframe's index!
print(f"Index => {df_origin.index.name}.")
df_origin['text_id'] = df_origin.index
print(df_origin.head(3))

Index => Talk_ID.
                                                      text talk_gender   
Talk_ID                                                                  
1          Thank you so much, Chris. And it's truly a g...        male  \
7          (Music: "The Sound of Silence," Simon & Garf...        male   
66         Good morning. How are you?    (Laughter)    ...        male   

         text_id  
Talk_ID           
1              1  
7              7  
66            66  


## From Texts to Sentences to SVOs

### Texts to Sentences

In [5]:
df = df_origin.copy()

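# Lowercase the texts
# NOTE: .str.lower() returns a new Series, so without assignment the call below
# had no effect; the texts keep their original case, as the outputs show.
# To actually lowercase, assign the result: df['text'] = df['text'].str.lower()
# df['text'].str.lower()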

# Break each text into a list of sentences
df['text'] = df['text'].apply(sent_tokenize)

# Break each sentence into its own row
df = df.explode('text').reset_index().rename(columns={'text_id' : 'row_id'})

# Number the sentences within each talk (0-based)
df['row_id'] = df.groupby('Talk_ID').cumcount()

# Create a unique sentence identifier
df['sentence_id'] = df['Talk_ID'].astype('str') + "-" + df["row_id"].astype('str')

# Drop the unneeded column
df = df.drop(columns=['row_id'])

# Check the results
df.head()

Unnamed: 0,Talk_ID,text,talk_gender,sentence_id
0,1,"Thank you so much, Chris.",male,1-0
1,1,And it's truly a great honor to have the oppor...,male,1-1
2,1,"I have been blown away by this conference, and...",male,1-2
3,1,"And I say that sincerely, partly because (Mock...",male,1-3
4,1,(Laughter) Put yourselves in my position.,male,1-4


In [6]:
df.shape

(129256, 4)

As a reminder, the sentence counts are:
```
Women - NLTK : 30,799 with SVO ratio of 86%
        spaCy: 31,673 with SVO ratio of 84%
Men -   NLTK : 96,342 with SVO ratio of 83%
        spaCy: 99,039 with SVO ratio of 80%
```
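The notebook does not spell out how the SVO ratio is defined. Assuming it is the share of sentences that yield at least one triple, it could be computed per gender once the subject column built in the next section exists. A sketch on a toy frame (the rows and the `'female'` label are illustrative, not taken from the data):

```python
import pandas as pd

# Toy stand-in for the sentence-level frame after SVO extraction (hypothetical rows).
toy = pd.DataFrame({
    "talk_gender": ["male", "male", "female", "female"],
    "subject": ["I", "", "we", "she"],
})

# Assumption: a sentence "has an SVO" when its subject column is non-empty.
has_svo = toy["subject"].ne("")

# The mean of a boolean Series is the fraction of True values.
svo_ratio = has_svo.groupby(toy["talk_gender"]).mean()
print(svo_ratio)
```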

### Sentences to SVOs

For each sentence in the **text** column, we need to populate subject, verb, and object columns. (The object column is actually less necessary since we will now have the sentence right next to the subject and verb pair.)

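The problem is that some sentences, like Row 2 above, generate two SVOs, while others, like Rows 0, 1, and 8, produce none. The `svo` function below keeps only the first triple of each sentence and returns empty strings when there is none.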
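If every triple should be kept rather than just one per sentence, one option is to collect a list of triples per sentence and `explode` it into one row per triple. The sketch below uses a hypothetical `toy_extract` function and made-up sentences in place of the spaCy/textacy pipeline so that it runs on its own; it is not the approach this notebook takes.

```python
import pandas as pd

def toy_extract(text):
    """Hypothetical stand-in for the textacy extractor: returns a list of
    (subject, verb, object) tuples, possibly empty."""
    lookup = {
        "I saw it and she heard me.": [("I", "saw", "it"), ("she", "heard", "me")],
        "Thank you.": [],
    }
    return lookup.get(text, [])

toy = pd.DataFrame({
    "sentence_id": ["1-0", "1-1"],
    "text": ["Thank you.", "I saw it and she heard me."],
})

# One list of triples per sentence; explode turns an empty list into one NaN row.
toy["triple"] = toy["text"].apply(toy_extract)
svo_long = toy.explode("triple").reset_index(drop=True)

# Split the tuples into columns, using '' where a sentence produced no triple.
parts = [t if isinstance(t, tuple) else ("", "", "") for t in svo_long["triple"]]
svo_cols = pd.DataFrame(parts, columns=["subject", "verb", "object"], index=svo_long.index)
svo_long = pd.concat([svo_long.drop(columns="triple"), svo_cols], axis=1)
print(svo_long)
```

Because `sentence_id` is repeated on each exploded row, every triple stays traceable to its sentence.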

In [16]:
df.set_index('sentence_id')

sentence_id,Talk_ID,text,talk_gender
1-0,1,"Thank you so much, Chris.",male
1-1,1,And it's truly a great honor to have the oppor...,male
1-2,1,"I have been blown away by this conference, and...",male
1-3,1,"And I say that sincerely, partly because (Mock...",male
1-4,1,(Laughter) Put yourselves in my position.,male
...,...,...,...
2481-16,2481,Planet Earth is blue and there's nothing I can...,No one gender
2481-17,2481,"(Music) [""I'm not a prophet or a stone-age ...",No one gender
2481-18,2481,"I'm living on.""",No one gender
2481-19,2481,"David Bowie, 1947-2016] (Applause)",No one gender


In [17]:
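def svo(text):
    """Return the first (subject, verb, object) triple found in `text` as strings.

    Sentences with no triple yield three empty strings; any triples after
    the first are discarded.
    """
    triples = list(textacy.extract.triples.subject_verb_object_triples(nlp(text)))
    if not triples:
        return '', '', ''
    first = triples[0]
    s = str(first[0]).strip("[]")      # subject tokens
    v = str(first[1][-1]).strip("[]")  # last verb token only (drops auxiliaries)
    o = str(first[2]).strip("[]")      # object tokens
    return s, v, o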

In [18]:
df[['subject', 'verb', 'object']] = df['text'].apply(svo).apply(pd.Series)
df.head()

Unnamed: 0,Talk_ID,text,talk_gender,sentence_id,subject,verb,object
0,1,"Thank you so much, Chris.",male,1-0,,,
1,1,And it's truly a great honor to have the oppor...,male,1-1,,,
2,1,"I have been blown away by this conference, and...",male,1-2,I,blown,conference
3,1,"And I say that sincerely, partly because (Mock...",male,1-3,I,need,that
4,1,(Laughter) Put yourselves in my position.,male,1-4,Laughter,Put,yourselves


In [19]:
# Save to CSV files 
# >>> Commented out once run
# df.to_csv("../output/svos-sentences.csv")

### Post-SVO Lemmatizing

Two possible approaches to lemmatizing verbs in a dataframe:
* [How to lemmatise a dataframe column Python - Stack Overflow](https://stackoverflow.com/questions/61987040/how-to-lemmatise-a-dataframe-column-python)
* [dataframe - lemmatizing a verb list in a data frame in Python - Stack Overflow](https://stackoverflow.com/questions/72394840/lemmatizing-a-verb-list-in-a-data-frame-in-python)
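Whichever lemmatizer is chosen, the verb column contains many repeated values, so one option is to lemmatize each distinct verb once and map the results back onto the column. The sketch below uses a hypothetical `toy_lemmatize` function and made-up verbs in place of a real lemmatizer, so it illustrates only the mapping pattern.

```python
import pandas as pd

def toy_lemmatize(word):
    """Hypothetical stand-in for a real verb lemmatizer."""
    table = {"blown": "blow", "said": "say", "needs": "need"}
    return table.get(word, word)

verbs = pd.Series(["blown", "said", "blown", "", "needs", "said"])

# Lemmatize each distinct, non-empty verb once.
lemma_of = {v: toy_lemmatize(v) for v in verbs.unique() if v}

# Map back onto the column; empty strings are not in the dict, so restore them.
lemmas = verbs.map(lemma_of).fillna("")
print(lemmas.tolist())
```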

In [20]:
from nltk.stem import WordNetLemmatizer

In [22]:
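# https://www.nltk.org/_modules/nltk/stem/wordnet.html
# If needed, download the WordNet data first: import nltk; nltk.download('wordnet')
wnl = WordNetLemmatizer()
# Lemmatize every entry as a verb (pos="v"); empty strings stay empty
df.verb = df.verb.map(lambda word: wnl.lemmatize(word, pos="v"))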

In [23]:
df.shape

(129256, 7)

In [24]:
# Save to CSV files 
# >>> Commented out once run
# df.to_csv("../output/sents-svos-lemmas.csv")