## Subject-Verb-Objects

In this notebook, we conduct a series of experiments in order: 

- **First**, we isolate the subject-verb-object (SVO) triples in the texts of speakers we have gendered male or female. (Using a pandas dataframe, we save the results to CSV files for later re-use.)
- **Second**, we compare the SVO count against the overall sentence count to determine how much of the texts have been included for analysis. See [Counts of Sentences vs SVOs](#sentences).
- **Third**, we explore usage of male and female pronouns and nouns as subjects in both subcorpora: first by raw count, and then by the actions (verbs) associated with those nouns and pronouns. See [Gendered Subjects](#genderedsubjects).

<div class="alert alert-block alert-info">It might be useful to find a way to combine verbs via WordNet.</div>

- **Fourth**, we map the objects associated with those actions. 

<div class="alert alert-block alert-info">The same wish applies here: to group a variety of words under some form of hypernym.</div>

The aim is then to explore the *character spaces* that speakers establish for gendered entities within their speech, as well as the nature of the *character space* they create for themselves as speakers. 

**Note**: We are not excluding parentheticals in this notebook.

**Next Steps**: Work on code to compile / visualize this as a network graph (?).

In [1]:
# IMPORTS
import re, spacy, textacy
import pandas as pd
from nltk import sent_tokenize

# Load the data, partitioned by speaker gender:
talks_m = pd.read_csv('talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')
talks_all = pd.concat([talks_m, talks_f, talks_nog])

# Then grab only the texts of the talks:
texts_all = talks_all.text.tolist()
texts_women = talks_f.text.tolist()
texts_men = talks_m.text.tolist()

print(f"From our {talks_all.shape[0]}x{talks_all.shape[1]} CSV, \
we have a list of {len(texts_all)} talks: {len(texts_women)} by women and \
{len(texts_men)} by men.")

From our 992x14 CSV, we have a list of 992 talks: 260 by women and 720 by men.


We lowercase everything up front because we don't care whether it is *She* or *she*. 

In [2]:
# Lowercase everything before we create the spaCy docs and textacy SVO triples
texts_w = [text.lower() for text in texts_women]
texts_m = [text.lower() for text in texts_men]

### 1a. Create the SVOs

spaCy offers several English pipelines, including small, medium, and large models. We use the large model here because our corpus is small and the syntax may be a bit more involved. 

<div class="alert alert-block alert-warning"> We need to make sure we understand the difference between the models.</div>

After telling spaCy which model to use, we follow its convention of feeding it a set of texts as a list of strings. 

The preview simply checks that everything went as planned: it gives us a token count and the first 50 characters of the text. That may look odd, since the string has been converted into a sequence of spaCy token objects, but a `Doc` keeps the original text alongside them. 

In [3]:
# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_lg')

# Use the pipe method to feed documents 
docs_w = list(nlp.pipe(texts_w))
docs_m = list(nlp.pipe(texts_m))

# A quick check of our work:
docs_m[0]._.preview

'Doc(2690 tokens: "  thank you so much, chris. and it\'s truly a gr...")'

### 1b. SVOs to Dataframe

Since we create SVOs for every sentence in the two subcorpora, why not save both to two dataframes?

In [4]:
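def actions(doc, svo_list):
    """Append each SVO triple found in `doc` to `svo_list` as a dict."""
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    for item in svotriples:
        svo_list.append(
            {
                # Keep only the last token of the subject and of the verb
                'subject': str(item[0][-1]), 
                'verb': str(item[1][-1]), 
                # Keep every token of the object, stringified as a list
                'object': str(item[2])
            }
        )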

In [5]:
# Create the two lists
all_svos_m = []
all_svos_w = []

# Populate the lists with SVO triples
for doc in docs_m:
    actions(doc, all_svos_m)

for doc in docs_w:
    actions(doc, all_svos_w)

# Convert the lists to dataframes
svos_w = pd.DataFrame(all_svos_w)
svos_m = pd.DataFrame(all_svos_m)

print(svos_m.shape, svos_w.shape)

(80331, 3) (26527, 3)


In [15]:
# Save to CSV files
svos_w.to_csv("../output/svos_w.csv")
svos_m.to_csv("../output/svos_m.csv")

### Comment

In a prior code run, we had fed the function a list of pronouns and asked it to output only those SVOs: `pronouns = ['i', 'we', 'she', 'he', 'they', 'it', 'you']`.

Comparing the two outputs: of the 80,331 SVOs in the male speaker subcorpus, 56,781 begin with one of the pronouns listed above; of the 26,527 SVOs in the female speaker subcorpus, 18,602 do. The preponderance of SVOs in TED talks thus begin with a rather small set of pronouns:

```
male:   56,781 / 80,331 = .707
female: 18,602 / 26,527 = .701
```
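
These shares can also be recomputed from a saved SVO dataframe, without re-running the extraction with a pronoun filter. The following is a minimal sketch, assuming a dataframe with a `subject` column like `svos_m`; the helper name `pronoun_share` and the toy rows are made up for illustration and are not corpus data.

```python
import pandas as pd

PRONOUNS = ['i', 'we', 'she', 'he', 'they', 'it', 'you']

def pronoun_share(svos, pronouns=PRONOUNS):
    """Return the fraction of rows whose subject is one of `pronouns`."""
    # isin() yields a boolean Series; its mean is the share of True values
    return svos["subject"].isin(pronouns).mean()

# Toy stand-in for svos_m / svos_w (hypothetical rows)
toy_svos = pd.DataFrame({
    "subject": ["i", "we", "teacher", "she"],
    "verb": ["had", "made", "said", "took"],
    "object": ["[idea]", "[plan]", "[thing]", "[photo]"],
})

toy_share = pronoun_share(toy_svos)
print(f"{toy_share:.3f}")
```

On the real data, `pronoun_share(svos_m)` and `pronoun_share(svos_w)` would be the calls to compare against the figures above.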

## 2: Counts of Sentences vs SVOs <a id='sentences'></a>

The code above suggests that roughly 70% of the SVOs in TED talks have `'i', 'we', 'she', 'he', 'they', 'it', 'you'` as their subject. It is not clear, however, how well the SVO pattern represents all the sentences in the talks. In this section we explore counting sentences, both through NLTK and spaCy, as well as through a hand count of a few sample texts, to see how well our code reflects underlying realities.

### NLTK

In [6]:
sents_w = [ sent_tokenize(text) for text in texts_w ]    
sents_m = [ sent_tokenize(text) for text in texts_m ]

print(len(sents_w[0]))

187


In [17]:
sent_count_m = 0
for text in texts_m:
    sent_count_m += len(sent_tokenize(text))

sent_count_w = 0
for text in texts_w:
    sent_count_w += len(sent_tokenize(text))

print(f" Female corp sent count: {sent_count_w}\n Male corp sent count: {sent_count_m}")

 Female corp sent count: 30799
 Male corp sent count: 96342


That results in the following ratios of SVOs to the total number of sentences:

In [18]:
print(f"Female subcorpus: {svos_w.shape[0] / sent_count_w}")
print(f"Male subcorpus: {svos_m.shape[0] / sent_count_m}")

Female subcorpus: 0.861294197863567
Male subcorpus: 0.8338107990284611


### spaCy

Our spaCy documents already exist, so we just need to iterate over each document's `.sents` attribute to get the sentences and count them.

In [23]:
snt_cnt_w = 0
for doc in docs_w:
    snt_cnt_w += len(list(doc.sents))

snt_cnt_m = 0
for doc in docs_m:
    snt_cnt_m += len(list(doc.sents))

print(f"F: {snt_cnt_w}, M: {snt_cnt_m}.")

F: 31673, M: 99039.


In [24]:
print(f"F: {svos_w.shape[0] / snt_cnt_w}")
print(f"M: {svos_m.shape[0] / snt_cnt_m}")

F: 0.8375272313958261
M: 0.8111047163238724


The total sentence counts are:
```
Women - NLTK : 30,799 with SVO ratio of 86%
        spaCy: 31,673 with SVO ratio of 84%
Men -   NLTK : 96,342 with SVO ratio of 83%
        spaCy: 99,039 with SVO ratio of 81%
```
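
The four ratios in this summary can be reproduced from the counts alone. The sketch below hard-codes the SVO and sentence counts reported in the cells above; the variable names are illustrative. Note that each figure is a ratio of two counts rather than the share of sentences that contain an SVO, since a single sentence can in principle yield more than one triple.

```python
# SVO and sentence counts as reported in the cells above
svo_counts = {"Women": 26527, "Men": 80331}
sentence_counts = {
    "NLTK": {"Women": 30799, "Men": 96342},
    "spaCy": {"Women": 31673, "Men": 99039},
}

# Ratio of SVO triples to sentences for each group and tokenizer
ratios = {
    (group, tokenizer): svo_counts[group] / counts[group]
    for tokenizer, counts in sentence_counts.items()
    for group in svo_counts
}

for (group, tokenizer), ratio in ratios.items():
    print(f"{group:<6} {tokenizer:<6} {ratio:.0%}")
```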

## 3: Gendered Subjects <a id='genderedsubjects'></a>

### Gendered Subject Counts

In [32]:
m_he = svos_m[svos_m["subject"] == "he"]
m_he.shape

(2529, 3)

In [108]:
m_she = svos_m[svos_m["subject"] == "she"]
m_she.shape

(842, 3)

In [111]:
w_he = svos_w[svos_w["subject"] == "he"]
w_she = svos_w[svos_w["subject"] == "she"]

print(f"""
Women said "he" {w_he.shape[0]} times, and
women said "she" {w_she.shape[0]} times.
"""
)


Women said "he" 739 times, and
women said "she" 636 times.



In [116]:
m_he = svos_m[svos_m["subject"] == "he"]
w_he = svos_w[svos_w["subject"] == "he"]

print(f"""
Men said "he" {m_he.shape[0]} times, or {m_he.shape[0]/svos_m.shape[0]:.1%} of SVOs.
Women said "he" {w_he.shape[0]} times, or {w_he.shape[0]/svos_w.shape[0]:.1%} of SVOs.
"""
)


Men said "he" 2529 times, or 3.1% of SVOs.
Women said "he" 739 times, or 2.8% of SVOs.



In [None]:
def compare(subject):
    # Create name:
    pass  # unfinished stub; `pass` keeps the cell runnable

In [114]:
m_i = svos_m[svos_m["subject"] == "i"]
w_i = svos_w[svos_w["subject"] == "i"]

print(f"""
Men said "i" {m_i.shape[0]} times, or {m_i.shape[0]/svos_m.shape[0]:.1%} of SVOs.
Women said "i" {w_i.shape[0]} times, or {w_i.shape[0]/svos_w.shape[0]:.1%} of SVOs.
"""
)


Men said "i" 15502 times, or 19.3% of SVOs.
Women said "i" 6220 times, or 23.4% of SVOs.



In [115]:
m_we = svos_m[svos_m["subject"] == "we"]
w_we = svos_w[svos_w["subject"] == "we"]

print(f"""
Men said "we" {m_we.shape[0]} times, or {m_we.shape[0]/svos_m.shape[0]:.1%} of SVOs.
Women said "we" {w_we.shape[0]} times, or {w_we.shape[0]/svos_w.shape[0]:.1%} of SVOs.
"""
)


Men said "we" 15517 times, or 19.3% of SVOs.
Women said "we" 4645 times, or 17.5% of SVOs.
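
The cells above repeat the same filter-and-print pattern and differ only in the subject being counted. One way to fold them into a single helper is sketched below. The function name `compare_subject` and the toy dataframes are hypothetical; with the real data, the dictionary passed in would be `{"Men": svos_m, "Women": svos_w}`.

```python
import pandas as pd

def compare_subject(subject, svos_by_group):
    """Return {group: (count, share)} for SVOs whose subject is `subject`."""
    results = {}
    for group, svos in svos_by_group.items():
        # Summing a boolean mask counts the matching rows
        count = int((svos["subject"] == subject).sum())
        share = count / len(svos) if len(svos) else 0.0
        results[group] = (count, share)
    return results

# Toy stand-ins for svos_m and svos_w (hypothetical rows)
toy_m = pd.DataFrame({"subject": ["i", "we", "he", "i"],
                      "verb": ["had", "made", "said", "took"]})
toy_w = pd.DataFrame({"subject": ["i", "she", "we"],
                      "verb": ["saw", "told", "put"]})
groups = {"Men": toy_m, "Women": toy_w}

for pronoun in ["i", "we"]:
    for group, (count, share) in compare_subject(pronoun, groups).items():
        print(f'{group} said "{pronoun}" {count} times, or {share:.1%} of SVOs.')
```

Returning the counts rather than printing inside the function keeps the numbers available for later tabulation.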



In [107]:
m_he = m_he.groupby(["verb"]).size().reset_index(
    name='obs').sort_values(['obs'], ascending=False).iloc[:20]
print(m_he)

        verb  obs
297      had  144
588     said   85
288      got   57
285    going   51
301      has   50
718     took   46
522      put   44
764   wanted   43
168      did   37
717     told   36
405     made   36
790    wrote   35
303     have   31
657  started   30
26     asked   29
179       do   27
182    doing   24
765    wants   23
73    called   22
596      saw   21
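
Several rows in this table are inflected forms of the same verb (*had*, *has*, *have*; *did*, *do*, *doing*), which relates to the wish above to combine verbs. As a first step short of WordNet, the counts can be merged by base form. The sketch below copies the counts from the table and uses a hand-written mapping, `LEMMAS`, which is an assumption made for illustration; in practice spaCy's `token.lemma_` could supply the base forms instead.

```python
from collections import Counter

# Counts copied from the top-20 table above
verb_counts = {
    "had": 144, "said": 85, "got": 57, "going": 51, "has": 50,
    "took": 46, "put": 44, "wanted": 43, "did": 37, "told": 36,
    "made": 36, "wrote": 35, "have": 31, "started": 30, "asked": 29,
    "do": 27, "doing": 24, "wants": 23, "called": 22, "saw": 21,
}

# Hand-written, hypothetical mapping from inflected form to base form
LEMMAS = {
    "had": "have", "has": "have", "said": "say", "got": "get",
    "going": "go", "took": "take", "wanted": "want", "wants": "want",
    "did": "do", "doing": "do", "told": "tell", "made": "make",
    "wrote": "write", "started": "start", "asked": "ask",
    "called": "call", "saw": "see",
}

# Add each count to its base form; unmapped verbs keep their own entry
merged = Counter()
for verb, obs in verb_counts.items():
    merged[LEMMAS.get(verb, verb)] += obs

for lemma, obs in merged.most_common(5):
    print(lemma, obs)
```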


Below is my attempt to create a function that would return an appropriately named dataframe, 20 rows long, containing the top 20 verbs for a given subject. It does not work as intended in the `for` loop in the cell below: the function returns the dataframe, but the name of the dataframe does not come along for the ride, so no variable such as `svos_m_she` is ever created.

In [100]:
def verbCount (dataframe, subject):
    # Build the intended name, e.g. "svos_m_she". This relies on a `.name`
    # attribute having been set on the dataframe; pandas does not add one.
    name = str(dataframe.name+'_'+subject)
    # This assignment rebinds the local variable `name` to the sub-dataframe,
    # so the string built above is discarded
    name = dataframe[dataframe["subject"] == subject].groupby(["verb"]).size().reset_index(
        name='obs').sort_values(['obs'], ascending=False).iloc[:20]
    # Output the dataframe (no variable called e.g. svos_m_she is created)
    return name

In [102]:
genderedSubjects = ['she', 'he', 'man', 'men', 'woman', 'women']
for i in genderedSubjects:
    verbCount(svos_m, i)

In [105]:
print(svos_m_she)

NameError: name 'svos_m_she' is not defined
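
The `NameError` arises because assigning to `name` inside the function only rebinds a local variable; it never creates a global called `svos_m_she`, and the loop discards each returned dataframe. A common alternative is to keep the results in a dictionary keyed by the intended name. The sketch below assumes a dataframe with `subject` and `verb` columns; the helper `top_verbs` and the toy rows are hypothetical.

```python
import pandas as pd

def top_verbs(svos, subject, n=20):
    """Return the n most frequent verbs for `subject` as a dataframe."""
    subset = svos[svos["subject"] == subject]
    return (subset.groupby("verb").size()
                  .reset_index(name="obs")
                  .sort_values("obs", ascending=False)
                  .head(n))

# Toy stand-in for svos_m (hypothetical rows)
toy_svos = pd.DataFrame({
    "subject": ["she", "she", "she", "he", "he"],
    "verb": ["said", "said", "made", "took", "took"],
    "object": ["[thing]", "[word]", "[plan]", "[photo]", "[step]"],
})

# The dictionary key plays the role of the variable name
gendered_subjects = ["she", "he"]
verbs_by_subject = {
    f"svos_m_{subject}": top_verbs(toy_svos, subject)
    for subject in gendered_subjects
}

print(verbs_by_subject["svos_m_she"])
```

Looking up `verbs_by_subject["svos_m_she"]` then replaces the failed `print(svos_m_she)`.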

In [101]:
m_she = verbCount(svos_m, 'she')

In [85]:
m_man = verbCount(svos_m, 'man')

In [95]:
m_woman = verbCount(svos_m, 'woman')

In [99]:
pd.concat([m_he, m_she], axis = 1)

|  | subject | verb | object | verb.1 | obs |
|---|---|---|---|---|---|
| 29 | he | waving | [piece] |  |  |
| 30 | he | yelling | [call, washington] |  |  |
| 74 |  |  |  | did | 18.0 |
| 79 |  |  |  | does | 8.0 |
| 80 |  |  |  | doing | 9.0 |
| ... | ... | ... | ... | ... | ... |
| 80229 | he | accompanied | [sharon] |  |  |
| 80231 | he | lost | [life] |  |  |
| 80232 | he | lost | [identity] |  |  |
| 80234 | he | failed | [what, to, recognize] |  |  |
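
The output above interleaves raw SVO rows with verb counts. Two things appear to be going on: `m_he` held the raw "he" subset rather than its top-20 verb table when this cell ran, and `pd.concat(..., axis=1)` aligns rows on their index labels rather than by position. A minimal sketch of a side-by-side comparison follows, assuming two top-verb tables with `verb` and `obs` columns; the toy values are hypothetical.

```python
import pandas as pd

# Toy top-verb tables with non-matching index labels (hypothetical values)
top_he = pd.DataFrame({"verb": ["had", "said"], "obs": [5, 3]},
                      index=[297, 588])
top_she = pd.DataFrame({"verb": ["said", "was"], "obs": [4, 2]},
                       index=[74, 79])

# Dropping the old index gives both tables positions 0..n-1, so concat
# lines them up row by row; `keys` labels each block of columns
side_by_side = pd.concat(
    [top_he.reset_index(drop=True), top_she.reset_index(drop=True)],
    axis=1,
    keys=["he", "she"],
)
print(side_by_side)
```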
