## spaCy / Textacy

Textacy is fussy about the size of the texts fed to it, responding with a `ValueError` when a text exceeds `nlp.max_length`. The workaround here is to create a `docs` object, which is a list of spaCy `Doc`s. The preview below demonstrates that each item in the list has the characteristics of a spaCy `Doc`.
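
Before piping texts through spaCy, it can help to check which ones would trip the length limit. The sketch below assumes spaCy's default `nlp.max_length` of 1,000,000 characters; the helper name `find_oversized` is hypothetical and not part of spaCy or Textacy.

```python
# Assumption: spaCy's default nlp.max_length, counted in characters
SPACY_DEFAULT_MAX_LENGTH = 1_000_000

def find_oversized(texts, max_chars=SPACY_DEFAULT_MAX_LENGTH):
    """Return (index, length) pairs for texts longer than max_chars."""
    return [(i, len(t)) for i, t in enumerate(texts) if len(t) > max_chars]

# Toy data with a small limit so the check is easy to see
texts = ["a short talk", "x" * 25, "another short talk"]
print(find_oversized(texts, max_chars=20))
```

With one `Doc` per talk, this list would be expected to come back empty for transcripts of ordinary length.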

Textacy does have a `Corpus` object, but it is not straightforward to implement.

```python
corpus = textacy.Corpus("en_core_web_sm", data=docs)
```

[spaCy documentation](https://spacy.io/)

spaCy has built-in PoS tagging. Accessing it looks like this:

```python
for token in docs[0][0:5]:
    print(token, token.tag_, token.pos_)  # spacy.explain(token.tag_)
```

[Textacy documentation]()

- We are not excluding parentheticals in this notebook.

**Next steps:**

- Rewrite code to return appended lists for I, He, She.
- Rewrite code to produce a pandas dataframe and then use `groupby`.
- Work on adaptation for objective cases. 
- Work on code to compile / visualize this as a network graph (?). So count up repeated verbs, etc.

- *Do we need NLTK code to compare results?*

- Possibly create a document per term set and run `CountVectorizer`

## Load Libraries & Data

In [2]:
# IMPORTS
import re, spacy, textacy
import pandas as pd

# Load the data, partitioned by speaker gender:
talks_m = pd.read_csv('talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')
talks_all = pd.concat([talks_m, talks_f, talks_nog])

# Then grab only the texts of the talks:
texts_all = talks_all.text.tolist()
texts_women = talks_f.text.tolist()
texts_men = talks_m.text.tolist()

print(f"From our {talks_all.shape[0]}x{talks_all.shape[1]} CSV, \
we have a list of {len(texts_all)} talks: {len(texts_women)} by women and \
{len(texts_men)} by men.")

From our 992x14 CSV, we have a list of 992 talks: 260 by women and 720 by men.


In [3]:
# Lowercase everything before we create the spaCy docs and Textacy SVO triples
# (lowercasing merges capitalized and lowercase forms, which reduces the number
# of distinct pronoun forms by not quite half)

texts_w = [text.lower() for text in texts_women]
texts_m = [text.lower() for text in texts_men]
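
To illustrate the effect of lowercasing on the pronoun inventory, here is a minimal sketch using only the standard library. The sample sentence and the helper `pronoun_forms` are made up for illustration, and the regular expression is a rough tokenizer, not the spaCy one.

```python
import re
from collections import Counter

PRONOUNS = {"i", "we", "she", "he", "they", "it", "you"}

def pronoun_forms(text):
    """Count each distinct surface form of the pronouns found in text."""
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(w for w in words if w.lower() in PRONOUNS)

sample = "He said it. It was what he wanted. We agreed, and we left."
before = pronoun_forms(sample)          # keeps "He" and "he" apart
after = pronoun_forms(sample.lower())   # merges them

print(len(before), len(after))
```

In real transcripts "I" is always capitalized, so it has only one form to begin with, which is one reason the reduction falls short of exactly half.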

In [4]:
# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_lg')

# Use the pipe method to feed documents 
docs_w = list(nlp.pipe(texts_w))
docs_m = list(nlp.pipe(texts_m))

In [5]:
docs_m[0]._.preview

'Doc(2690 tokens: "  thank you so much, chris. and it\'s truly a gr...")'

### Working through the Textacy SVO Triple

In [None]:
# Now to test the textacy SVO functionality.
# Note we are only extracting triples from the first document:
SVOs = list(textacy.extract.triples.subject_verb_object_triples(docs_m[0]))

# How many triples did we get?
print(len(SVOs))
print("---")

# What do they look like?
for item in SVOs[0:5]:
    print(item)

In [None]:
# If we want to see all the nouns used 
# as subjects in the test document:
subjects = [str(item[0]) for item in SVOs]
subjects_set = set(subjects)

print(f"There are {len(subjects_set)} unique subjects out of {len(subjects)}.")
print(subjects_set)

In [None]:
# Get out just the first person singular triples:
for item in SVOs:
    if str(item[0]) == '[i]':
        print(item)

It looks like the verb "contents" (the verb phrase) contain more material than we want. If all we want is the verb itself, we will need to target the last item in the verb list.

In [None]:
for item in SVOs:
    if str(item[0]) == '[i]':
        print(item[1][-1])
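
The indexing can be tried out without loading a model. This sketch uses a hypothetical stand-in for the triples, with plain strings in place of spaCy tokens, laid out in the same subject, verb, object order that the cells above index as `item[0]`, `item[1]`, `item[2]`.

```python
from collections import namedtuple

# Hypothetical stand-in for a Textacy triple; strings replace spaCy tokens
Triple = namedtuple("Triple", ["subject", "verb", "object"])

triples = [
    Triple(["i"], ["have", "been", "reading"], ["the", "book"]),
    Triple(["we"], ["need"], ["help"]),
    Triple(["i"], ["want"], ["answers"]),
]

# Keep only first person singular subjects and take the last verb token,
# which drops auxiliaries such as "have" and "been"
main_verbs = [t.verb[-1] for t in triples if t.subject[-1] == "i"]
print(main_verbs)
```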

## Gendered SVOs Dataframes

In [6]:
# Create the list of subject pronouns (gendered and ungendered)
pronouns = ['i', 'we', 'she', 'he', 'they', 'it', 'you']

Our function will remain much the same, though I would like to find a way to get the brackets out of the objects.

In [7]:
# Define the function which will get the SVOs
def actions(terms, doc, svo_list):
    svotriples = list(textacy.extract.triples.subject_verb_object_triples(doc))
    for term in terms:
        for item in svotriples:
            if str(item[0][-1]) == term:
                svo_list.append(
                    {
                        'subject': str(item[0][-1]), 
                        'verb': str(item[1][-1]), 
                        'object': item[2]
                    }
                )
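
One possible way to get the brackets out of the objects is to join the tokens into a single string before storing them. This is only a sketch: `tokens_to_text` is a hypothetical helper, and it relies on `str()` of a spaCy token returning the token's text.

```python
def tokens_to_text(tokens):
    """Join a list of token-like objects into one plain string."""
    return " ".join(str(tok) for tok in tokens)

# Plain strings stand in for spaCy tokens here
print(tokens_to_text(["the", "whole", "story"]))
```

Inside `actions`, the dictionary entry could then read `'object': tokens_to_text(item[2])` instead of `'object': item[2]`.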

In [8]:
# Create the two lists
svos_m, svos_w = [], []

# Populate the lists with SVO triples
for doc in docs_m:
    actions(pronouns, doc, svos_m)

for doc in docs_w:
    actions(pronouns, doc, svos_w)

# Convert the lists to dataframes
df_w = pd.DataFrame(svos_w)
df_m = pd.DataFrame(svos_m)

print(df_m.shape, df_w.shape)

The first thing we want to do is simply survey the pronouns: make sure they are present and then count the number of verbs associated with each one. The counts for each subcorpus should sum to the length of its dataframe; for the women's dataframe that total is 18,602.
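
That sanity check can be sketched on toy data as follows. The small dataframe is invented for illustration, and `value_counts` is used here as a compact alternative to `groupby(...).count()`.

```python
import pandas as pd

# Toy stand-in for df_w / df_m
df = pd.DataFrame({
    "subject": ["i", "i", "we", "she", "i", "we"],
    "verb": ["think", "know", "need", "said", "want", "have"],
})
expected = ["i", "we", "she", "he", "they", "it", "you"]

# Rows per pronoun
counts = df["subject"].value_counts()

# Pronouns that never appear as a subject
missing = [p for p in expected if p not in counts.index]

# The per-pronoun counts should add up to the number of rows
assert counts.sum() == len(df)
print(counts.to_dict(), missing)
```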

In [40]:
# pf = pronoun frequency

# Count the rows with each pronoun as the subject:
pf_m = df_m.groupby(["subject"]).count()
pf_w = df_w.groupby(["subject"]).count()

# Drop the OBJECT column
pf_w.drop('object', axis=1, inplace=True)
# Create PERCENTAGE column
pf_w['percentage'] = pf_w['verb'] /  pf_w['verb'].sum()

# Repeat above for men speakers
pf_m.drop('object', axis=1, inplace=True)
pf_m['percentage'] = pf_m['verb'] /  pf_m['verb'].sum()

# Merge the two dataframes
pf_compare = pf_w.merge(pf_m, 
                        left_on='subject', 
                        right_on='subject',
                        suffixes=('_w', '_m'))

# See the results
pf_compare

| subject | verb_w | percentage_w | verb_m | percentage_m |
|---|---|---|---|---|
| he | 739 | 0.039727 | 2529 | 0.04454 |
| i | 6220 | 0.334373 | 15502 | 0.273014 |
| it | 1342 | 0.072143 | 4646 | 0.081823 |
| she | 636 | 0.03419 | 842 | 0.014829 |
| they | 1919 | 0.103161 | 5780 | 0.101795 |
| we | 4645 | 0.249704 | 15517 | 0.273278 |
| you | 3101 | 0.166703 | 11965 | 0.210722 |


<div class="alert alert-block alert-warning"> The code below works, but it gives raw counts. These probably need to be converted to percentages so that one can compare the men's and women's subcorpora. </div>

In [45]:
# Grab the top 20 verbs for each pronoun
pv_w = df_w.groupby(["subject", "verb"]).size().groupby(level=0).nlargest(20).reset_index(level=0, drop=True).reset_index(name='Count')

# Save to CSV for easier viewing
pv_w.to_csv('../output/pv_w.csv')

# Repeat for the men
pv_m = df_m.groupby(["subject", "verb"]).size().groupby(level=0).nlargest(20).reset_index(level=0, drop=True).reset_index(name='Count')
pv_m.to_csv('../output/pv_m.csv')
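
One way to make the raw counts comparable across the two subcorpora is to divide each verb count by the total for its pronoun. The following is a minimal sketch on invented data; the column name `share` and the choice to normalize within each pronoun (rather than across the whole subcorpus) are assumptions.

```python
import pandas as pd

# Toy stand-in for df_w / df_m
df = pd.DataFrame({
    "subject": ["i", "i", "i", "i", "she", "she"],
    "verb": ["think", "think", "know", "want", "said", "said"],
})

# Count each (subject, verb) pair
counts = df.groupby(["subject", "verb"]).size().reset_index(name="count")

# Divide by the total number of verbs for that subject
totals = counts.groupby("subject")["count"].transform("sum")
counts["share"] = counts["count"] / totals

# Keep the top 20 verbs per pronoun, most frequent first
top = (counts.sort_values(["subject", "count"], ascending=[True, False])
             .groupby("subject")
             .head(20))
print(top)
```

Because the shares sum to 1 within each pronoun, a verb's share for "she" in one subcorpus can be set directly against its share for "she" in the other.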

In [None]:
df_.groupby("subject").groups

In [None]:
df_.groupby("subject").get_group('he')

In [None]:
# This gives you a grouped (DataFrameGroupBy) object restricted
# to the verb column, not a dataframe
df2 = df_.groupby(['subject'])[['verb']]

In [None]:
df3 = df_.groupby(
    ['subject', 'verb']).size().groupby(level=0).nlargest(5).reset_index(level=0, drop=True).reset_index(name='Count')

In [None]:
df3.head()