# SVO Verbs Lemmatized

Initial note on lemmatizing: https://stackoverflow.com/questions/51658153/lemmatize-a-doc-with-spacy

In [1]:
# IMPORTS
import re, spacy, textacy
import pandas as pd

# DATA LOAD
# Load the data, partitioned by speaker gender:
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('../output/talks_nog.csv', index_col='Talk_ID')
talks_all = pd.concat([talks_m, talks_f, talks_nog])

# GETTING JUST THE TEXTS
texts_all = talks_all.text.tolist()
texts_women = talks_f.text.tolist()
texts_men = talks_m.text.tolist()

# Lowercase everything before we create spaCy docs and textacy SVO triples
texts_w = [text.lower() for text in texts_women]
texts_m = [text.lower() for text in texts_men]

In [2]:
print(len(texts_women), len(texts_men))

260 720


## From Texts to SVOs

In [3]:
# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_sm')
# nlp.add_pipe('sentencizer')
# nlp.remove_pipe("lemmatizer")
# nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()

In [4]:
# Use the pipe method to feed documents 
docs_w = list(nlp.pipe(texts_w))
docs_m = list(nlp.pipe(texts_m))

# A quick check of our work:
docs_m[0]._.preview

'Doc(2690 tokens: "  thank you so much, chris. and it\'s truly a gr...")'

According to the [IPython documentation](https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html), we can `%store` variables so that, if the notebook crashes, we need not run the code above again; `%store -r` restores them in a later session.

In [5]:
%store docs_w docs_m

Stored 'docs_w' (list)
Stored 'docs_m' (list)


In [6]:
docs_m[0][0:100]

  thank you so much, chris. and it's truly a great honor to have the opportunity to come to this stage twice; i'm extremely grateful. i have been blown away by this conference, and i want to thank all of you for the many nice comments about what i had to say the other night. and i say that sincerely, partly because (mock sob) i need that.    (laughter)    put yourselves in my position.    (laughter)    i flew on air force

In [7]:
lemmatas = [token.lemma_ for token in docs_m[0][0:100]]
print(lemmatas)

['  ', 'thank', 'you', 'so', 'much', ',', 'chris', '.', 'and', 'it', 'be', 'truly', 'a', 'great', 'honor', 'to', 'have', 'the', 'opportunity', 'to', 'come', 'to', 'this', 'stage', 'twice', ';', 'I', 'be', 'extremely', 'grateful', '.', 'I', 'have', 'be', 'blow', 'away', 'by', 'this', 'conference', ',', 'and', 'I', 'want', 'to', 'thank', 'all', 'of', 'you', 'for', 'the', 'many', 'nice', 'comment', 'about', 'what', 'I', 'have', 'to', 'say', 'the', 'other', 'night', '.', 'and', 'I', 'say', 'that', 'sincerely', ',', 'partly', 'because', '(', 'mock', 'sob', ')', 'I', 'need', 'that', '.', '   ', '(', 'laughter', ')', '   ', 'put', 'yourself', 'in', 'my', 'position', '.', '   ', '(', 'laughter', ')', '   ', 'I', 'fly', 'on', 'air', 'force']


In [8]:
lemmas = [token.lemma_ for token in docs_w[0]]

In [9]:
lemmas_w = []
for doc in docs_w:
    lemmas = [token.lemma_ for token in doc]
    lemmas_w.append(lemmas)

In [10]:
lemmas_m = []
for doc in docs_m:
    lemmas = [token.lemma_ for token in doc]
    lemmas_m.append(lemmas)

In [11]:
' '.join(lemmas_m[0][0:100])

'   thank you so much , chris . and it be truly a great honor to have the opportunity to come to this stage twice ; I be extremely grateful . I have be blow away by this conference , and I want to thank all of you for the many nice comment about what I have to say the other night . and I say that sincerely , partly because ( mock sob ) I need that .     ( laughter )     put yourself in my position .     ( laughter )     I fly on air force'

In [12]:
joined = ' '.join(lemmas_m[0])

In [13]:
# This fails: subject_verb_object_triples expects a spaCy Doc or Span, not a str
z_svos = list(textacy.extract.triples.subject_verb_object_triples(joined))

AttributeError: 'str' object has no attribute 'sents'

In [14]:
# This fails: iterating over a str yields single characters, not spaCy Tokens
lemmas = [token.lemma_ for token in texts_m[0]]

AttributeError: 'str' object has no attribute 'lemma_'

## SVOs to Dataframe

Since we create SVOs for every sentence in the two subcorpora, why not save each subcorpus's SVOs to its own dataframe?

In [15]:
def createSVOs(doc, svo_list):
    # Create the list of tuples for the document
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    # Convert to list of dictionaries
    for item in svotriples:
        svo_list.append(
            {
                'subject': str(item[0][-1]), 
                'verb': str(item[1][-1]), 
                'object': str(item[2])
            }
        )

### A Small Experiment to get to Lemmas

With Z standing in for *docs_m[0]* above, which is Al Gore's TED talk, we are trying to get the lemma of the verbs in the SVOs.

In [16]:
Zsvos = list(textacy.extract.triples.subject_verb_object_triples(docs_m[0]))

In [17]:
print(Zsvos[0:5])

[SVOTriple(subject=[i], verb=[have, been, blown], object=[conference]), SVOTriple(subject=[i], verb=[want], object=[to, thank, all, of, you, for, the, many, nice, comments, about, what, i, had, to, say, the, other, night]), SVOTriple(subject=[i], verb=[need], object=[that]), SVOTriple(subject=[laughter], verb=[put], object=[yourselves]), SVOTriple(subject=[i], verb=[flew], object=[two])]


In [18]:
# %store Zsvos >Zsvos.txt

Writing 'Zsvos' (list) to file 'Zsvos.txt'.


In [19]:
Z = pd.DataFrame(Zsvos)
Z.shape

(117, 3)

In [20]:
def lemmatize(item):
    token = str(item[1][-1])
    lemma = token.lemma_
    return lemma

In [21]:
type(Zsvos[0])

textacy.extract.triples.SVOTriple

In [22]:
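# This fails: each cell in Z['verb'] is already the verb token list,
# so item[1] inside lemmatize() is a single Token and cannot be indexed
Z['verb'] = Z['verb'].apply(lemmatize)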

TypeError: 'spacy.tokens.token.Token' object is not subscriptable
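
The error above suggests that each cell in the `verb` column already holds the list of verb tokens, so the helper only needs to take the last token and read its `lemma_` attribute, without calling `str()` first. Below is a minimal sketch of that idea. `FakeToken`, `FakeTriple`, and `last_verb_lemma` are hypothetical names used only for illustration so the snippet runs without a spaCy model; they are stand-ins for spaCy's `Token` and textacy's `SVOTriple`, not part of either library.

```python
from collections import namedtuple

# Hypothetical stand-ins, for illustration only: a real spaCy Token exposes
# `.lemma_`, and the SVOTriple objects printed above carry `subject`, `verb`,
# and `object` fields that hold lists of tokens.
FakeToken = namedtuple("FakeToken", ["text", "lemma_"])
FakeTriple = namedtuple("FakeTriple", ["subject", "verb", "object"])


def last_verb_lemma(verb_tokens):
    """Return the lemma of the final token in a list of verb tokens."""
    # Keep the token object: converting it to str first would discard .lemma_
    return verb_tokens[-1].lemma_


triples = [
    FakeTriple(
        subject=[FakeToken("i", "I")],
        verb=[
            FakeToken("have", "have"),
            FakeToken("been", "be"),
            FakeToken("blown", "blow"),
        ],
        object=[FakeToken("conference", "conference")],
    ),
    FakeTriple(
        subject=[FakeToken("i", "I")],
        verb=[FakeToken("flew", "fly")],
        object=[FakeToken("two", "two")],
    ),
]

verb_lemmas = [last_verb_lemma(triple.verb) for triple in triples]
print(verb_lemmas)  # ['blow', 'fly']
```

With the notebook's own objects, the equivalent call would presumably be `Z['verb'].apply(last_verb_lemma)`, assuming the cells still hold spaCy tokens rather than strings.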

### Now at Scale

In [23]:
# Create the two lists
svos_m = []
svos_w = []

# Populate the lists with SVO triples
for doc in docs_m:
    createSVOs(doc, svos_m)

for doc in docs_w:
    createSVOs(doc, svos_w)

In [24]:
# Convert the lists to dataframes
svos_w = pd.DataFrame(svos_w)
svos_m = pd.DataFrame(svos_m)

print(svos_m.shape[0], svos_w.shape[0])

80460 26610


### Post-SVO Lemmatizing

Two possible approaches to lemmatizing verbs in a dataframe:
* [How to lemmatise a dataframe column Python - Stack Overflow](https://stackoverflow.com/questions/61987040/how-to-lemmatise-a-dataframe-column-python)
* [dataframe - lemmatizing a verb list in a data frame in Python - Stack Overflow](https://stackoverflow.com/questions/72394840/lemmatizing-a-verb-list-in-a-data-frame-in-python)

In [25]:
from nltk.stem import WordNetLemmatizer

In [26]:
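# https://www.nltk.org/_modules/nltk/stem/wordnet.html
# Lemmatize each verb string as a verb (pos="v"); requires the WordNet corpus (nltk.download('wordnet'))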
wnl = WordNetLemmatizer()
svos_w.verb = svos_w.verb.map(lambda word: wnl.lemmatize(word, pos="v"))

In [27]:
svos_w.shape

(26610, 3)

In [28]:
svos_m.verb = svos_m.verb.map(lambda word: wnl.lemmatize(word, pos="v"))

In [29]:
# Save to CSV files 
# >>> Commented out once run
# svos_w.to_csv("../output/svos_w_lem.csv")
# svos_m.to_csv("../output/svos_m_lem.csv")