# Subjective Verbs

In the previous notebook we looked at large-scale trends in the use of subjects and verbs in the two subcorpora. In this notebook, we turn to possible relationships between the subjects and the verbs that follow them. That is, what kinds of actions are available to the subjects *she* and *he*? How do those actions compare between male speakers and female speakers? Finally, is there a continuity between the actions projected onto a gendered subject and the speaking subject (herself or himself) as reflected in the verbs paired with *I*?

In [1]:
# IMPORTS
import pandas as pd
import numpy as np

# LOAD DATAFRAMES
# the `lem` suffix indicates the verbs have been lemmatized
svos_m = pd.read_csv("../output/svos_m_lem.csv", index_col=0)
svos_w = pd.read_csv("../output/svos_w_lem.csv", index_col=0)

print(svos_m.shape[0], svos_w.shape[0])

80460 26610


## Possible Approaches

What we want to explore is both the usual ways that speakers (men or women) pair the pronouns *he*, *she*, and *i* with verbs and also, perhaps, the significant pairings.

The usual pairings can be approached via **counts** (see below), which can be readily visualized with Sankey plots. (Those plots are now in a separate notebook.)

Relative frequencies would let us compare across the two subcorpora ... or would this be better approached by some form of TF-IDF? (And would we need to set some lower threshold on the number of sentences in which a verb must occur? We're not interested in verbs used only in a single sentence but in verbs used often in a pairing in one subcorpus and not in the other.)
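
As a rough sketch of the relative-frequency option, the cell below counts subject-verb pairings in two toy dataframes, converts the counts to shares of each subcorpus, and lines the two up side by side. The toy data, the helper `pair_freqs`, and the column names `n_*`, `freq_*`, and `freq_diff` are all assumptions made for illustration; nothing here is run against the real `svos_m` and `svos_w`.

```python
import pandas as pd

# Toy stand-ins for svos_m and svos_w (hypothetical data, not the real corpora)
toy_m = pd.DataFrame({
    "subject": ["he", "he", "he", "she", "i", "i"],
    "verb":    ["have", "have", "say", "say", "want", "have"],
})
toy_w = pd.DataFrame({
    "subject": ["he", "she", "she", "she", "i", "i"],
    "verb":    ["have", "say", "say", "go", "want", "want"],
})

def pair_freqs(df, name):
    """Count subject-verb pairings and add each pairing's share of the subcorpus."""
    counts = df.groupby(["subject", "verb"]).size().reset_index(name=f"n_{name}")
    counts[f"freq_{name}"] = counts[f"n_{name}"] / counts[f"n_{name}"].sum()
    return counts

# The outer merge keeps pairings that appear in only one subcorpus;
# their missing counts and frequencies are filled with 0
compared = pair_freqs(toy_m, "m").merge(
    pair_freqs(toy_w, "w"), on=["subject", "verb"], how="outer").fillna(0)

# Positive values: relatively more frequent in the first (men's) toy set;
# negative values: relatively more frequent in the second (women's) toy set
compared["freq_diff"] = compared["freq_m"] - compared["freq_w"]
print(compared.sort_values("freq_diff", ascending=False))
```

Because the two subcorpora differ so much in size, comparing shares rather than raw counts is what makes the two columns commensurable; whether a simple difference is the right measure is left open here.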

### Previously-Used Code

Elsewhere we have used the following code to give us a verb count associated with a particular subject. 

In [2]:
# Get the top N verbs associated with "he"
# (.iloc[:N] can be removed to keep every verb)

N = 20
m_he = svos_m[svos_m["subject"] == "he"].groupby(["verb"]).size().reset_index(
    name='obs').sort_values(['obs'], ascending=False).iloc[:N] 

# Get a sense of our results
m_he.head()

Unnamed: 0,verb,obs
200,have,232
387,say,132
125,do,111
188,get,90
190,go,78


### Subset of the Dataframe

This gives us all the subject-verb pairings, with a count for each.

Could we combine this with some lower cutoff?

In [3]:
# Subset
m_pp_subset = svos_m.value_counts(subset=['subject', 'verb'])
m_pp_subset.shape

(19993,)

In [4]:
m_pp_subset.head()

subject  verb
we       have    2259
you      have    1505
i        have    1318
         want    1210
         go      1076
dtype: int64
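
One possible answer to the cutoff question is a boolean mask on the counted Series. The sketch below uses a toy dataframe in place of `svos_m`, and `MIN_OBS` is an assumed name and value; where the cutoff should sit for the real data remains undecided.

```python
import pandas as pd

# Toy stand-in for svos_m (hypothetical data)
toy = pd.DataFrame({
    "subject": ["we", "we", "we", "i", "i", "he"],
    "verb":    ["have", "have", "have", "want", "want", "blow"],
})

# Same call as above: a Series of counts indexed by (subject, verb)
pair_counts = toy.value_counts(subset=["subject", "verb"])

# MIN_OBS is an assumed cutoff; pairings seen fewer times are dropped
MIN_OBS = 2
frequent_pairs = pair_counts[pair_counts >= MIN_OBS]
print(frequent_pairs)
```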

### Filtered Dataframe

One approach might look like this, where we simply filter the dataframe for the subjects in which we are interested, here the pronouns *he*, *she*, and *i*.

Another approach would take advantage of the `subset` parameter of pandas' `value_counts()`, as in the cells above. Used on the entire men's subcorpus, it suggests that **the most common subject-verb pairing is *we have***, by a pretty large margin, so let's mark that as something worth exploring further.

In [5]:
# Filter
m_pp_filter = svos_m.loc[(
    svos_m['subject'] == 'he') | (
    svos_m['subject'] == 'she') | (
    svos_m["subject"] == "i")
]
m_pp_filter.shape

(18836, 3)

In [6]:
m_pp_filter.head()

Unnamed: 0,subject,verb,object
0,i,blow,[conference]
1,i,want,"[to, thank, all, of, you, for, the, many, nice..."
2,i,need,[that]
4,i,fly,[two]
5,i,have,"[to, take, off, my, shoes, or, boots, to, get,..."


If we want a summary of just the pronouns and the verbs, it looks like we need to use `value_counts()`, but we can combine it with `isin()` and a list:

In [7]:
# Create a list of the pronouns we want to see
pronouns = ["he", "she", "i"]

# And then count the number of times 
# those pronouns are paired with particular verbs
m_pp_list = svos_m[svos_m["subject"].isin(
    pronouns)].value_counts(
    subset=['subject', 'verb']).reset_index()
m_pp_list.shape

(1723, 3)

In [8]:
# Re-label the new count column as something human-readable
# (pandas names it 0 before version 2.0 and "count" from 2.0 onward)
m_pp_list.rename(columns={0: 'v_count', 'count': 'v_count'}, inplace=True)

# See the results
m_pp_list.head()

Unnamed: 0,subject,verb,v_count
0,i,have,1318
1,i,want,1210
2,i,go,1076
3,i,do,650
4,i,get,586


In [9]:
# Add a column with relative frequency
m_pp_list['v_freq'] = m_pp_list['v_count'] / m_pp_list['v_count'].sum()
m_pp_list.head()

Unnamed: 0,subject,verb,v_count,v_freq
0,i,have,1318,0.069972
1,i,want,1210,0.064239
2,i,go,1076,0.057125
3,i,do,650,0.034508
4,i,get,586,0.031111
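
Note that `v_freq` divides by the total across all three pronouns, so pairings with *i* dominate simply because *i* is the most frequent subject. If we want to compare what *he*, *she*, and *i* do on an equal footing, one option is to normalize within each subject instead. The sketch below does that on toy counts shaped like `m_pp_list`; the numbers and the column name `v_freq_subj` are hypothetical.

```python
import pandas as pd

# Toy counts in the same shape as m_pp_list (hypothetical numbers)
toy_counts = pd.DataFrame({
    "subject": ["i", "i", "he", "he", "she"],
    "verb":    ["have", "want", "have", "say", "say"],
    "v_count": [6, 4, 3, 1, 2],
})

# Divide each count by the total for its own subject,
# so the frequencies for each pronoun sum to 1
toy_counts["v_freq_subj"] = (
    toy_counts["v_count"]
    / toy_counts.groupby("subject")["v_count"].transform("sum")
)
print(toy_counts)
```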


In [10]:
# m_pp_list.to_csv("../output/m_pp_list.csv")

---
IGNORE BELOW THIS LINE

---

## Character Spaces as Verb-Feature Spaces

The goal in this section is to:

1. Collect all the verbs associated with the specified subjects
2. Weight the verbs (by normalization)
3. Compare the verbs manually
4. Visualize a comparison using PCA or t-SNE
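
Steps 1, 2, and 4 might be sketched as follows, on toy data rather than the real subcorpora. The cross-tabulation, the row normalization as the weighting scheme, and the use of a plain NumPy SVD in place of a dedicated PCA routine are all assumptions chosen to keep the example within the libraries already imported; the names `verb_space`, `weighted`, and `coords` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy subject-verb pairs (hypothetical data)
toy = pd.DataFrame({
    "subject": ["he", "he", "he", "she", "she", "i", "i", "i", "i"],
    "verb":    ["have", "say", "go", "say", "say", "have", "want", "want", "go"],
})

# 1. Collect the verbs: subjects as rows, verbs as columns, counts in the cells
verb_space = pd.crosstab(toy["subject"], toy["verb"])

# 2. Weight by normalizing each row so that it sums to 1
weighted = verb_space.div(verb_space.sum(axis=1), axis=0)

# 4. Project onto two principal components via SVD of the column-centered matrix
centered = weighted - weighted.mean(axis=0)
U, S, Vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
coords = pd.DataFrame(
    U[:, :2] * S[:2], index=weighted.index, columns=["pc1", "pc2"])
print(coords)
```

Each row of `coords` places one subject in a two-dimensional space that could then be plotted; with only three subjects the projection is trivial, so this is meant only to show the shape of the pipeline.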

First we explore the total number of verbs involved:

In [11]:
# # Collect all the verbs from the women's subcorpus
# verbs_w = svos_w.groupby(["verb"]).size().reset_index(name='obs').sort_values(
#         ['obs'], ascending=False)

# # Select only the verbs that occur more than once
# verbs_gt_w = verbs_w[verbs_w.obs > 1]

# # What are our counts?
# print(f"♀︎: {verbs_w.shape[0]} unique verbs; {verbs_gt_w.shape[0]} occur more than once")

In [12]:
# # Repeat for the men's subcorpus
# verbs_m = svos_m.groupby(["verb"]).size().reset_index(name='obs').sort_values(
#         ['obs'], ascending=False)
# verbs_gt_m = verbs_m[verbs_m.obs > 1]

# print(f"♂︎: {verbs_m.shape[0]} unique verbs; {verbs_gt_m.shape[0]} occur more than once")

Now we need to grab the verbs associated with the subjects: