*To work with the already-extracted SVOs, skip to **Dataframe Ops** and load the dataframe from the CSV file.*

## Load Libraries & Data

spaCy documentation: https://spacy.io/

In [1]:
# IMPORTS
import re, spacy, textacy
import numpy as np, pandas as pd

# If needed
# parentheticals = [ "\(laughter\)", "\(applause\)", "\(music\)",  
#                   "\(video\)", "\(laughs\)", "\(applause ends\)", 
#                   "\(audio\)", "\(singing\)", "\(music ends\)", 
#                   "\(cheers\)", "\(cheering\)", "\(recording\)", 
#                   "\(beatboxing\)", "\(audience\)", "\(guitar strum\)", 
#                   "\(clicks metronome\)", "\(sighs\)", "\(guitar\)", 
#                   "\(marimba sounds\)", "\(drum sounds\)" ]

# def remove_parentheticals(text):
#     global parentheticals
#     new_text = text
#     for rgx_match in parentheticals:
#         new_text = re.sub(rgx_match, ' ', new_text.lower(), 
#                           flags=re.IGNORECASE)
#     return new_text

# Load the data, partitioned by speaker gender:
talks_m = pd.read_csv('talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')
talks_all = pd.concat([talks_m, talks_f, talks_nog])

# And then grab only the texts of the talks:
texts = talks_all.text.tolist()
texts_f = talks_f.text.tolist()
texts_m = talks_m.text.tolist()

print(f"From our {talks_all.shape[0]}x{talks_all.shape[1]} CSV, \
we have a list of {len(texts)} talks: {len(texts_f)} by women and \
{len(texts_m)} by men.")

From our 992x14 CSV, we have a list of 992 talks: 260 by women and 720 by men.


## spaCy / Textacy

Textacy is fussy about the size of the texts fed to it: overly long texts raise a `ValueError` that points to `nlp.max_length`. The workaround here is to create `docs`, a list of spaCy `Doc` objects. The preview below demonstrates that each item in the list has the characteristics of a spaCy doc.

Textacy does have a `Corpus` object, but it is not straightforward to implement.

```python
corpus = textacy.Corpus("en_core_web_sm", data=docs)
```
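If a transcript does exceed `nlp.max_length` (1,000,000 characters by default in spaCy), one option is to split it before parsing. The sketch below is an assumption, not part of this notebook's workflow: `chunk_text` and `max_chars` are hypothetical names, and because the split happens on whitespace it can cut a sentence in two, which would lose any SVO triple that spans the cut.

```python
def chunk_text(text, max_chars=1_000_000):
    """Split text into whitespace-delimited chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole, so that one chunk
    can still exceed the limit.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # Account for the joining space when the chunk already has words
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


# Tiny demonstration with an artificially small limit
print(chunk_text("a bb ccc dddd", max_chars=6))
```

Each chunk could then be passed to `nlp.pipe` like any other text. Raising `nlp.max_length` is the other route, at the cost of more memory.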

In [2]:
# Testing to see if we can lowercase everything 
# before we create a spaCy doc and then a Textacy SVO triple:

texts_f_l = [text.lower() for text in texts_f]

In [None]:
# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_lg')

# Use the pipe method to feed documents 
docs = list(nlp.pipe(texts_f_l))

In [82]:
docs[0]._.preview

'Doc(3743 tokens: "  if you\'re here today — and i\'m very happy tha...")'

In [83]:
# Quick example to show spaCy's PoS tagging
for token in docs[0][0:5]:
    print(token, token.tag_, token.pos_)  # spacy.explain(token.tag_)

   _SP SPACE
if IN SCONJ
you PRP PRON
're VBP AUX
here RB ADV


In [5]:
# Now to test the textacy SVO functionality.
# Note we are only extracting triples from the first document:
SVOs = list(textacy.extract.triples.subject_verb_object_triples(docs[0]))

# How many triples did we get?
print(len(SVOs))
print("---")

# What do they look like?
for item in SVOs[0:5]:
    print(item)

146
---
SVOTriple(subject=[development], verb=[will, save], object=[us])
SVOTriple(subject=[she], verb=[turned], object=[to, be, a, much, bigger, dog, than, i, 'd, anticipated])
SVOTriple(subject=[part], verb=[handled], object=[percent])
SVOTriple(subject=[that], verb=[bring], object=[truck, trips])
SVOTriple(subject=[area], verb=[has], object=[one])


In [6]:
# If we want to see all the nouns used 
# as subjects in the test document:
subjects = [str(item[0]) for item in SVOs]
subjects_set = set(subjects)

print(f"There are {len(subjects_set)} unique subjects out of {len(subjects)}.")
print(subjects_set)

There are 59 unique subjects out of 146.
{'[air, pollution]', '[agenda]', '[seed]', '[folks]', '[mr, ., gore]', '[robert, moses]', '[she]', '[improvements]', '[they]', "['s]", '[link]', '[justice]', '[things]', '[him]', '[degradation]', '[community]', '[this]', '[what]', '[use, decisions]', '[it]', '[we]', '[developers]', '[regulations]', '[development]', '[riverside, park]', '[projects]', '[dad]', '[he]', '[income, citizens]', '[parade]', '[someone]', '[sections]', '[others]', '[area]', '[part]', '[chris]', '[justice, activists]', '[nothing]', '[who]', '[roofs]', '[presentation]', '[that]', '[i]', '[abundance]', '[you]', '[which]', '[residents]', '[ruth]', '[administration]', '[example]', '[both]', '[sports, team]', '[one]', '[people]', '[lining]', '[south, bronx]', '[percent]', '[none]', '[disinvestment]'}


In [8]:
# Get out just the first person singular triples:
for item in SVOs:
    if str(item[0]) == '[i]':
        print(item)

SVOTriple(subject=[i], verb=[was, contacted], object=[parks])
SVOTriple(subject=[i], verb=[mentioned], object=[that])
SVOTriple(subject=[i], verb=[wo, n't, mention], object=[that])
SVOTriple(subject=[i], verb=['m, going], object=[to, exchange, marriage, vows, with, my, beloved])
SVOTriple(subject=[i], verb=[do], object=[which])
SVOTriple(subject=[i], verb=[watched], object=[half])
SVOTriple(subject=[i], verb=[told], object=[you])
SVOTriple(subject=[i], verb=[wrote], object=[dollar, transportation, grant])
SVOTriple(subject=[i], verb=[like], object=[that])
SVOTriple(subject=[i], verb=[have], object=[all])
SVOTriple(subject=[i], verb=[do, not, expect], object=[individuals, corporations, government])
SVOTriple(subject=[i], verb=['ll, tell], object=[you])
SVOTriple(subject=[i], verb=[like], object=[what])
SVOTriple(subject=[i], verb=[told], object=[you])
SVOTriple(subject=[i], verb=['ve, embraced], object=[capitalist])
SVOTriple(subject=[i], verb=[do, n't, have], object=[problem])
SVOTripl

It looks like the verb "contents" (the verb phrase) contain more material than we want. If all we want is the verb itself, we will need to target the last item in the verb list.

In [9]:
for item in SVOs:
    if str(item[0]) == '[i]':
        print(item[1][-1])

contacted
mentioned
mention
going
do
watched
told
wrote
like
have
expect
tell
like
told
embraced
have
have
trying
have
asked


### KK's Quick Experiment

In [None]:
test_doc = []
for item in SVOs:
    if str(item[0]).lower() == '[i]':  # the texts were lowercased, so '[I]' never matches
        test_doc.append(item[1][-1])

In [None]:
test_doc = ""
for item in SVOs:
    if str(item[0]).lower() == '[i]':  # the texts were lowercased, so '[I]' never matches
        test_doc = test_doc + " " + str(item[1][-1])
test_doc = test_doc[1:]

In [None]:
test_doc

In [None]:
for item in SVOs:
    if str(item[0]).lower() == '[i]':  # the texts were lowercased, so '[I]' never matches
        print(item[1][-1], item[2])

In [None]:
for item in SVOs:
    if str(item[0]) == '[She]':
        print(item[1][-1:], item[2])
    if str(item[0]) == '[she]':
        print(item[1][-1:], item[2])

### Useful Code

**Next steps:**

- Rewrite code to return appended lists for I, He, She.
- Rewrite code to produce a pandas dataframe and then use `groupby`.
- Work on adaptation for objective cases. 
- Work on code to compile / visualize this as a network graph (?). So count up repeated verbs, etc.

- *Do we need NLTK code to compare results?*

- Possibly create a document per term set and run `CountVectorizer`
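The "count up repeated verbs" step can be sketched in plain Python. This is only an illustration under assumptions: `sample_svos` is made-up data shaped like the list of dictionaries built by the extraction function, and `verb_edges` is a hypothetical helper name. Each (subject, verb) count could later serve as an edge weight in a network graph.

```python
from collections import Counter

# Hypothetical stand-in for the list of dicts produced by the SVO extraction
sample_svos = [
    {"subject": "i", "verb": "told", "object": "you"},
    {"subject": "i", "verb": "like", "object": "that"},
    {"subject": "i", "verb": "told", "object": "you"},
    {"subject": "she", "verb": "turned", "object": "dog"},
]


def verb_edges(svos):
    """Count how often each (subject, verb) pair occurs."""
    return Counter((row["subject"], row["verb"]) for row in svos)


edges = verb_edges(sample_svos)
for (subject, verb), weight in edges.most_common():
    print(subject, verb, weight)
```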

In [None]:
# Module-level list that collects the matching triples as dictionaries,
# so that the cells below can read it
svos = []

def actions(terms, doc):
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    for term in terms:
        for item in svotriples:
            if str(item[0]) == term:
                svos.append(
                    {
                        'subject': item[0][-1], 
                        'verb': item[1][-1], 
                        'object': item[2]
                    }
                )

In [None]:
# Next Step: load these into a list and then let the code iterate through that list
first_s = ['[I]', '[i]']
first_p = ['[We]', '[we]']
third_f = ['[She]', '[she]']
third_m = ['[He]', '[he]']

terms = first_s + first_p + third_f + third_m
print(terms)

In [None]:
# A test of the function above:
actions(terms, docs[0])
print(type(svos), len(svos))

In [None]:
df = pd.DataFrame(svos)
df.head()

If we are interested only in the pronouns above, we can use the function as written to create a dataframe:

In [None]:
for doc in docs:
    actions(terms, doc)

In [None]:
df = pd.DataFrame(svos)
df.shape

In [None]:
df.head()

In [None]:
# df.to_csv('../output/talks_f_svos-pn.csv')

### All SVOs

In [None]:
svos = []

def allactions(doc):
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    for item in svotriples:
        svos.append(
            {
                'subject': item[0][-1], 
                'verb': item[1][-1], 
                'object': item[2]
            }
        )

In [None]:
for doc in docs:
    allactions(doc)
    
df2 = pd.DataFrame(svos)
df2.shape

In [None]:
# df2.to_csv('../output/talks_f_svos-all.csv', index=False)

## Dataframe Ops

In [None]:
import pandas as pd

In [None]:
# Read the SVO dataframe
df = pd.read_csv('../output/talks_f_svos-all.csv', index_col=False)

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.iloc[4]

In [None]:
she = df.loc[df["subject"] == "she"]

In [None]:
she.shape

In [None]:
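# Assumption: `by_subject` is not defined anywhere in this notebook;
# sorting the dataframe by subject is one plausible reading.
by_subject = df.sort_values("subject")
by_subject.head()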

In [None]:
by_subject.iloc[1]

## Gendered SVOs Dataframe

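By lowercasing everything in the texts going into the spaCy doc, we have reduced the number of pronoun forms we have to match by not quite half: `[We]`/`[we]`, `[She]`/`[she]`, and `[He]`/`[he]` each collapse into a single lowercase form.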
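An alternative to lowercasing the texts before parsing, sketched here as an assumption rather than as what this notebook does, is to leave the texts untouched and fold case only when comparing subjects. That may matter if the parser relies on capitalization. `matches_pronoun` is a hypothetical helper, and plain strings stand in for spaCy tokens.

```python
def matches_pronoun(subject_text, pronouns):
    """Return True if the subject matches one of the pronouns, ignoring case."""
    return subject_text.casefold() in pronouns


# Lowercase forms only; capitalized variants are handled by casefold()
pronoun_set = {"i", "we", "she", "he"}

print(matches_pronoun("She", pronoun_set))
print(matches_pronoun("Ruth", pronoun_set))
```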

In [84]:
# Create the lists of gendered pronouns
# pronouns = ['[i]', '[we]', '[she]', '[he]', '[they]', '[it]', '[you]']
pronouns = ['i', 'we', 'she', 'he', 'they', 'it', 'you']

Our function will remain much the same, though I would like to find a way to get the brackets out of the objects.
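One possible way to get the brackets out of the objects is to join the tokens into a single string before storing them. This is a sketch under assumptions: `object_to_text` and `strip_brackets` are hypothetical helpers, and plain strings stand in for spaCy tokens. The second helper targets objects that have already been written to CSV as strings; it is lossy if a token is itself a comma.

```python
def object_to_text(tokens):
    """Join the tokens of an SVO object into one plain string."""
    return " ".join(str(token) for token in tokens)


def strip_brackets(cell):
    """Clean an object already saved as a string such as '[phone, toys]'."""
    return cell.strip("[]").replace(", ", " ")


print(object_to_text(["phone", "toys"]))
print(strip_brackets("[phone, toys]"))
```

In the extraction function, this would mean storing `object_to_text(item[2])` in place of `item[2]`.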

In [95]:
# Define the function which will get the SVOs
def actions(terms, doc, svo_list):
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    for term in terms:
        for item in svotriples:
            if str(item[0][-1]) == term:
                svo_list.append(
                    {
                        'subject': str(item[0][-1]), 
                        'verb': str(item[1][-1]), 
                        'object': item[2]
                    }
                )

In [96]:
svos_ = []

for doc in docs:
    actions(pronouns, doc, svos_)

In [116]:
df_ = pd.DataFrame(svos_)
df_.shape

(18602, 3)

This single dataframe replaces the individual dataframes, one for each of the pronouns available to speakers of English:

- First person: I / we
- Second person: you
- Third person: she, he, they
- Neuter: it

FTR: *you* added 3,101 SVO triples and *it* added 1,342, for a total of 18,602.

In [None]:
# svos_he = []
# for doc in docs:
#     actions(third_m, doc, svos_he)    
# dfm = pd.DataFrame(svos_he)

# svos_I = []
# for doc in docs:
#     actions(first_s, doc, svos_I)
# dfi = pd.DataFrame(svos_I)

In [20]:
# Save this to a CSV so that we can quickly come back to working on this.
# df_.to_csv('../output/svos-pronouns.csv')

In [117]:
df_

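|       | subject | verb      | object                                            |
|-------|---------|-----------|---------------------------------------------------|
| 0     | i       | contacted | [parks]                                           |
| 1     | i       | mentioned | [that]                                            |
| 2     | i       | mention   | [that]                                            |
| 3     | i       | going     | [to, exchange, marriage, vows, with, my, beloved] |
| 4     | i       | do        | [which]                                           |
| ...   | ...     | ...       | ...                                               |
| 18597 | you     | have      | [spoon]                                           |
| 18598 | you     | have      | [pen]                                             |
| 18599 | you     | have      | [shoes]                                           |
| 18600 | you     | have      | [phone, toys]                                     |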


The first thing we want to do is simply survey the pronouns: make sure they are all present and then count the number of verbs associated with each one. The total should match the length of the dataframe, 18,602.

In [130]:
df_.groupby(["subject"]).count()

| subject | verb | object |
|---------|------|--------|
| he      | 739  | 739    |
| i       | 6220 | 6220   |
| it      | 1342 | 1342   |
| she     | 636  | 636    |
| they    | 1919 | 1919   |
| we      | 4645 | 4645   |
| you     | 3101 | 3101   |


The square brackets around the column name are optional. If you run `df_.groupby("subject").count()`, you still get:

| subject | verb | object |
|---------|------|--------|
| he      | 739  | 739    |
| i       | 6220 | 6220   |
| it      | 1342 | 1342   |
| she     | 636  | 636    |
| they    | 1919 | 1919   |
| we      | 4645 | 4645   |
| you     | 3101 | 3101   |
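The per-pronoun counts can be cross-checked against the length of the dataframe with a few lines of plain Python. The figures below are copied from the output above; the variable names are only illustrative.

```python
# Verb counts per subject, copied from the groupby output
verb_counts = {
    "he": 739,
    "i": 6220,
    "it": 1342,
    "she": 636,
    "they": 1919,
    "we": 4645,
    "you": 3101,
}

# Number of rows reported by df_.shape
expected_rows = 18602

total = sum(verb_counts.values())
print(total, total == expected_rows)
```

In the notebook itself, comparing `df_.groupby("subject").size().sum()` with `len(df_)` should perform the same check directly on the dataframe.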

In [128]:
df_.groupby("subject").groups

{'he': [48, 49, 141, 142, 143, 144, 145, 146, 147, 148, 305, 306, 391, 433, 493, 494, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 866, 974, 975, 1280, 1281, 1282, 1283, 1284, 1285, 1286, 1287, 1288, 1289, 1290, 1291, 1292, 1293, 1294, 1295, 1296, 1297, 1298, 1299, 1300, 1301, 1466, 1467, 1468, 1469, 1470, 1571, 1572, 1573, 1574, 1575, 1645, 1646, 1647, 1648, 1649, 1650, 1651, 1652, 1736, 1737, 1738, 1739, 1878, 1879, 1880, 1881, 1927, 2119, 2296, 2297, 2365, 2366, 2367, 2368, 2369, 2492, 2493, 2494, 2495, 2496, 2497, 2733, 2734, 2735, 2736, 2737, 2738, ...], 'i': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136

In [129]:
df_.groupby("subject").get_group('he')

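|       | subject | verb    | object         |
|-------|---------|---------|----------------|
| 48    | he      | does    | [which, time]  |
| 49    | he      | married | [mom]          |
| 141   | he      | led     | [them]         |
| 142   | he      | stopped | [nephites]     |
| 143   | he      | visited | [nephites]     |
| ...   | ...     | ...     | ...            |
| 18244 | he      | having  | [conversation] |
| 18245 | he      | doing   | [it]           |
| 18246 | he      | goes    | [you]          |
| 18247 | he      | turns   | [you]          |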


In [137]:
# This gives a DataFrameGroupBy object restricted to the verb column,
# not a dataframe; call an aggregation such as .count() on it to get one
df2 = df_.groupby(['subject'])[['verb']] 

In [147]:
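# Count each (subject, verb) pair, then keep the five most frequent verbs per subject
sizes = df_.groupby(['subject', 'verb']).size()
top5 = sizes.groupby(level=0).nlargest(5)
df3 = top5.reset_index(level=0, drop=True).reset_index(name='Count')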

In [148]:
df3.head()

|   | subject | verb  | Count |
|---|---------|-------|-------|
| 0 | he      | had   | 44    |
| 1 | he      | said  | 25    |
| 2 | he      | going | 24    |
| 3 | he      | got   | 17    |
| 4 | he      | has   | 17    |
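For readers who want to check the top-five logic without pandas, here is a rough plain-Python equivalent. It is a sketch under assumptions: `top_verbs` is a hypothetical helper and `sample_rows` is made-up data shaped like the rows of the dataframe. Tied counts may come out in a different order than with `nlargest`.

```python
from collections import Counter, defaultdict


def top_verbs(rows, n=5):
    """Return the n most frequent verbs for each subject."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["subject"]][row["verb"]] += 1
    return {subject: verbs.most_common(n) for subject, verbs in counts.items()}


# Hypothetical rows shaped like the SVO dataframe
sample_rows = [
    {"subject": "he", "verb": "had"},
    {"subject": "he", "verb": "had"},
    {"subject": "he", "verb": "said"},
    {"subject": "i", "verb": "like"},
]

print(top_verbs(sample_rows, n=1))
```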
