# SVO Verbs Lemmatized

Initial note on lemmatizing: https://stackoverflow.com/questions/51658153/lemmatize-a-doc-with-spacy

In [1]:
# IMPORTS
import re, spacy, textacy
import pandas as pd

# DATA LOAD
# Load the data, partitioned by speaker gender:
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('../output/talks_nog.csv', index_col='Talk_ID')
talks_all = pd.concat([talks_m, talks_f, talks_nog])

# GETTING JUST THE TEXTS
texts_all = talks_all.text.tolist()
texts_women = talks_f.text.tolist()
texts_men = talks_m.text.tolist()

# Lowercase everything before creating the spaCy docs and textacy SVO triples
texts_w = [text.lower() for text in texts_women]
texts_m = [text.lower() for text in texts_men]

## From Texts to SVOs

In [3]:
# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_sm')
# nlp.add_pipe('sentencizer')
# nlp.remove_pipe("lemmatizer")
# nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()

In [8]:
# Use the pipe method to feed documents 
docs_w = list(nlp.pipe(texts_w))
docs_m = list(nlp.pipe(texts_m))

# A quick check of our work:
docs_m[0]._.preview

'Doc(2690 tokens: "  thank you so much, chris. and it\'s truly a gr...")'

According to the [IPython documentation](https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html), `%store` persists variables between sessions, so if the notebook crashes we need not rerun the code above.

In [5]:
#%store docs_w docs_m

Stored 'docs_w' (list)
Stored 'docs_m' (list)


In [50]:
lemmatas = [token.lemma_ for token in docs_m[0]]
print(lemmatas)

['  ', 'thank', 'you', 'so', 'much', ',', 'chris', '.', 'and', 'it', 'be', 'truly', 'a', 'great', 'honor', 'to', 'have', 'the', 'opportunity', 'to', 'come', 'to', 'this', 'stage', 'twice', ';', 'I', 'be', 'extremely', 'grateful', '.', 'I', 'have', 'be', 'blow', 'away', 'by', 'this', 'conference', ',', 'and', 'I', 'want', 'to', 'thank', 'all', 'of', 'you', 'for', 'the', 'many', 'nice', 'comment', 'about', 'what', 'I', 'have', 'to', 'say', 'the', 'other', 'night', '.', 'and', 'I', 'say', 'that', 'sincerely', ',', 'partly', 'because', '(', 'mock', 'sob', ')', 'I', 'need', 'that', '.', '   ', '(', 'laughter', ')', '   ', 'put', 'yourselves', 'in', 'my', 'position', '.', '   ', '(', 'laughter', ')', '   ', 'I', 'fly', 'on', 'air', 'force', 'two', 'for', 'eight', 'year', '.', '   ', '(', 'laughter', ')', '   ', 'now', 'I', 'have', 'to', 'take', 'off', 'my', 'shoe', 'or', 'boot', 'to', 'get', 'on', 'an', 'airplane', '!', '   ', '(', 'laughter', ')', '   ', '(', 'applause', ')', '   ', 'I', 'w

In [51]:
lemmas = [token.lemma_ for token in docs_w[0]]

In [52]:
lemmas_w = []
for doc in docs_w:
    lemmas = [token.lemma_ for token in doc]
    lemmas_w.append(lemmas)

In [53]:
len(lemmas_w)

260

In [54]:
lemmas_m = []
for doc in docs_m:
    lemmas = [token.lemma_ for token in doc]
    lemmas_m.append(lemmas)

In [55]:
' '.join(lemmas_m[0][0:100])

'   thank you so much , chris . and it be truly a great honor to have the opportunity to come to this stage twice ; I be extremely grateful . I have be blow away by this conference , and I want to thank all of you for the many nice comment about what I have to say the other night . and I say that sincerely , partly because ( mock sob ) I need that .     ( laughter )     put yourselves in my position .     ( laughter )     I fly on air force'

In [56]:
joined = ' '.join(lemmas_m[0])

In [57]:
z_svos = list(textacy.extract.triples.subject_verb_object_triples(joined))

AttributeError: 'str' object has no attribute 'sents'

In [58]:
lemmas = [token.lemma_ for token in texts_m[0]]

AttributeError: 'str' object has no attribute 'lemma_'

Both errors have the same cause: lemmas and sentence boundaries belong to parsed spaCy objects, not to strings. `subject_verb_object_triples` needs a `Doc` (it iterates over `doc.sents`), and iterating over `texts_m[0]` yields single characters, which have no `lemma_`.

## SVOs to Dataframe

Since we extract SVOs from every sentence in both subcorpora, we save each subcorpus to its own dataframe.

In [17]:
def createSVOs(doc, svo_list):
    # Create the list of tuples for the document
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    # Convert to list of dictionaries
    for item in svotriples:
        svo_list.append(
            {
                'subject': str(item[0][-1]), 
                'verb': str(item[1][-1]), 
                'object': str(item[2])
            }
        )

In [19]:
def createSVOs_by_talk(doc, all_svo_list):
    # Create the list of tuples for the document
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    # Convert to list of dictionaries
    talk_list = []
    for item in svotriples:
        talk_list.append(
            {
                'subject': str(item[0][-1]), 
                'verb': str(item[1][-1]), 
                'object': str(item[2])
            }
        )
    all_svo_list.append(talk_list)

In [33]:
def createSVOs_tID(doc, svo_list, talkID=None):
    # Create the list of tuples for the document
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    # Convert to list of dictionaries
    for item in svotriples:
        svo_list.append(
            {
                'subject': str(item[0][-1]), 
                'verb': str(item[1][-1]), 
                'object': str(item[2]),
                'TalkID':talkID
            }
        )
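
The three helpers above differ only in whether they attach a talk ID or group rows by talk, so they could be folded into one. The following is a minimal sketch, assuming the triples have already been extracted; `svo_rows` and `Triple` are hypothetical names, and `Triple` only stands in for textacy's `SVOTriple` so that the example runs without spaCy.

```python
from collections import namedtuple

# Hypothetical stand-in for textacy's SVOTriple, used only for this demo.
Triple = namedtuple("Triple", ["subject", "verb", "object"])


def svo_rows(triples, talk_id=None):
    """Turn SVO triples into a list of flat dicts, one per triple.

    Mirrors the helpers above: keep the last token of the subject and
    of the verb, and the whole object. A 'TalkID' key is added only
    when talk_id is given.
    """
    rows = []
    for subject, verb, obj in triples:
        row = {
            "subject": str(subject[-1]),
            "verb": str(verb[-1]),
            "object": str(obj),
        }
        if talk_id is not None:
            row["TalkID"] = talk_id
        rows.append(row)
    return rows


# Plain strings stand in for spaCy tokens here.
demo = [Triple(["i"], ["have", "been", "blown"], ["conference"])]
print(svo_rows(demo, talk_id=1))
```

With real documents, the `triples` argument would be the result of `textacy.extract.triples.subject_verb_object_triples(doc)`, and extending a single list with the returned rows reproduces what `createSVOs_tID` does.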

### A Small Experiment to get to Lemmas

With `Z` standing in for `docs_m[0]` above (Al Gore's TED talk), we try to get the lemma of the verb in each SVO.

In [25]:
Zsvos = list(textacy.extract.triples.subject_verb_object_triples(docs_m[0]))

In [26]:
print(Zsvos[0:5])

[SVOTriple(subject=[i], verb=[have, been, blown], object=[conference]), SVOTriple(subject=[i], verb=[want], object=[to, thank, all, of, you, for, the, many, nice, comments, about, what, i, had, to, say, the, other, night]), SVOTriple(subject=[i], verb=[need], object=[that]), SVOTriple(subject=[laughter], verb=[put], object=[yourselves]), SVOTriple(subject=[i], verb=[flew], object=[two])]


In [27]:
%store Zsvos >Zsvos.txt

Writing 'Zsvos' (list) to file 'Zsvos.txt'.


In [29]:
Z = pd.DataFrame(Zsvos)
Z.shape

(117, 3)

In [30]:
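def lemmatize(verb_tokens):
    # Each cell in Z['verb'] is a list of spaCy tokens; take the lemma
    # of the last token before anything is converted to a string
    return verb_tokens[-1].lemma_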

In [37]:
type(Zsvos[0])

textacy.extract.triples.SVOTriple

In [None]:
Z['verb'] = Z['verb'].apply(lemmatize)
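
The lemma lives on the spaCy token rather than on its string form, so the lookup has to happen before any `str()` conversion. Here is a small sketch of that idea, runnable without spaCy: `FakeToken` is a hypothetical stand-in for a spaCy `Token` (only `lemma_` and the string form are modeled), and `verb_lemma` is an illustrative name, not part of textacy.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token (only what is used here)."""
    text: str
    lemma_: str

    def __str__(self):
        return self.text


def verb_lemma(verb_tokens):
    """Return the lemma of the last token in a triple's verb list.

    In chains like [have, been, blown], shown in the output above,
    the last token is the main verb and the earlier ones are auxiliaries.
    """
    return verb_tokens[-1].lemma_


verb = [
    FakeToken("have", "have"),
    FakeToken("been", "be"),
    FakeToken("blown", "blow"),
]
print(verb_lemma(verb))  # blow
```

Applied to the dataframe, this could be written to a separate column, for example `Z['verb_lemma'] = Z['verb'].apply(verb_lemma)`, which keeps the surface form available for comparison.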

### Now at Scale

In [26]:
# Create the two lists
svos_m = []
svos_w = []

# Populate the lists with SVO triples
for doc in docs_m:
    createSVOs(doc, svos_m)

for doc in docs_w:
    createSVOs(doc, svos_w)

In [27]:
# Convert the lists to dataframes
svos_w = pd.DataFrame(svos_w)
svos_m = pd.DataFrame(svos_m)

print(svos_m.shape[0], svos_w.shape[0])

80550 26610


In [29]:
svos_m

| | subject | verb | object |
|---|---|---|---|
| 0 | i | blown | [conference] |
| 1 | i | want | [to, thank, all, of, you, for, the, many, nice... |
| 2 | i | need | [that] |
| 3 | laughter | put | [yourselves] |
| 4 | i | have | [to, take, off, my, shoes, or, boots, to, get,... |
| ... | ... | ... | ... |
| 80545 | you | imagine | [him] |
| 80546 | you | see | [credit, rating] |
| 80547 | we | do | [what] |
| 80548 | it | expose | [humanity] |


In [37]:
# Get talkIDs in order
m_talkIDs = talks_m.index.values.tolist()
w_talkIDs = talks_f.index.values.tolist()

720

In [46]:
# Create the two lists
svos_m_by_talk = []
svos_w_by_talk = []

# Populate the lists with SVO triples
for m in range(len(m_talkIDs)):
    doc = docs_m[m]
    talkID = m_talkIDs[m]
    createSVOs_tID(doc, svos_m_by_talk, talkID)

for w in range(len(w_talkIDs)):
    doc = docs_w[w]
    talkID = w_talkIDs[w]
    createSVOs_tID(doc, svos_w_by_talk, talkID)
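
The loops above rely on `docs_m` and `docs_w` being in the same order as the talk IDs, which holds here because both are derived from the same dataframes. As a sketch of making that pairing explicit, the hypothetical helper below (`pair_ids_with_docs` is an illustrative name) zips the two sequences and refuses to continue if their lengths differ.

```python
def pair_ids_with_docs(talk_ids, docs):
    """Pair each talk ID with its parsed document, in order.

    Raises ValueError if the two sequences differ in length, which
    would mean the IDs and documents are out of step.
    """
    if len(talk_ids) != len(docs):
        raise ValueError(
            f"{len(talk_ids)} talk IDs but {len(docs)} documents"
        )
    return list(zip(talk_ids, docs))


# Plain strings stand in for spaCy Doc objects in this demo.
pairs = pair_ids_with_docs([1, 7], ["first talk", "second talk"])
for talk_id, doc in pairs:
    print(talk_id, doc)
```

With the notebook's own names, the loop body would become `createSVOs_tID(doc, svos_m_by_talk, talk_id)` for each pair from `pair_ids_with_docs(m_talkIDs, docs_m)`.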

In [43]:
svos_m_by_talk

[{'subject': 'i', 'verb': 'blown', 'object': '[conference]', 'TalkID': 1},
 {'subject': 'i',
  'verb': 'want',
  'object': '[to, thank, all, of, you, for, the, many, nice, comments, about, what, i, had, to, say, the, other, night]',
  'TalkID': 1},
 {'subject': 'i', 'verb': 'need', 'object': '[that]', 'TalkID': 1},
 {'subject': 'laughter', 'verb': 'put', 'object': '[yourselves]', 'TalkID': 1},
 {'subject': 'i',
  'verb': 'have',
  'object': '[to, take, off, my, shoes, or, boots, to, get, on, an, airplane]',
  'TalkID': 1},
 {'subject': 'i', 'verb': 'tell', 'object': '[story]', 'TalkID': 1},
 {'subject': 'i', 'verb': 'left', 'object': '[white, house]', 'TalkID': 1},
 {'subject': 'i', 'verb': 'looked', 'object': '[me]', 'TalkID': 1},
 {'subject': 'it', 'verb': 'hit', 'object': '[me]', 'TalkID': 1},
 {'subject': 'we',
  'verb': 'started',
  'object': '[looking, for, a, place, to, eat]',
  'TalkID': 1},
 {'subject': 'we',
  'verb': 'got',
  'object': '[to, exit, 238, ,, lebanon, ,, tenness

In [47]:
svos_mbt = pd.DataFrame(svos_m_by_talk)
svos_wbt = pd.DataFrame(svos_w_by_talk)

In [45]:
svos_mbt

| | subject | verb | object | TalkID |
|---|---|---|---|---|
| 0 | i | blown | [conference] | 1 |
| 1 | i | want | [to, thank, all, of, you, for, the, many, nice... | 1 |
| 2 | i | need | [that] | 1 |
| 3 | laughter | put | [yourselves] | 1 |
| 4 | i | have | [to, take, off, my, shoes, or, boots, to, get,... | 1 |
| ... | ... | ... | ... | ... |
| 80545 | you | imagine | [him] | 10807 |
| 80546 | you | see | [credit, rating] | 10807 |
| 80547 | we | do | [what] | 10807 |
| 80548 | it | expose | [humanity] | 10807 |


### Post-SVO Lemmatizing

Two possible approaches to lemmatizing verbs in a dataframe:
* [How to lemmatise a dataframe column Python - Stack Overflow](https://stackoverflow.com/questions/61987040/how-to-lemmatise-a-dataframe-column-python)
* [dataframe - lemmatizing a verb list in a data frame in Python - Stack Overflow](https://stackoverflow.com/questions/72394840/lemmatizing-a-verb-list-in-a-data-frame-in-python)

In [48]:
from nltk.stem import WordNetLemmatizer

In [49]:
# https://www.nltk.org/_modules/nltk/stem/wordnet.html
wnl = WordNetLemmatizer()
svos_wbt.verb.map(lambda word: wnl.lemmatize(word, pos="v"))

LookupError: 
**********************************************************************
  Resource omw-1.4 not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('omw-1.4')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/omw-1.4

  Searched in:
    - '/Users/katiek/nltk_data'
    - '/opt/anaconda3/envs/tedtalks/nltk_data'
    - '/opt/anaconda3/envs/tedtalks/share/nltk_data'
    - '/opt/anaconda3/envs/tedtalks/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


In [62]:
svos_wbt.shape

(26610, 3)

In [63]:
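# Lemmatize the verb column of both by-talk dataframes in place
svos_wbt.verb = svos_wbt.verb.map(lambda word: wnl.lemmatize(word, pos="v"))
svos_mbt.verb = svos_mbt.verb.map(lambda word: wnl.lemmatize(word, pos="v"))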

In [64]:
# Save to CSV files (comment these lines out once run)
svos_wbt.to_csv("../output/svos_wbt_lem.csv")
svos_mbt.to_csv("../output/svos_mbt_lem.csv")