# SVO Gendered Subjects

In this notebook, we explore pronouns as subjects in sentences across the two subcorpora. We start with gendered third-person pronouns and then examine the use of "I".

To be added: the total percentage of SVOs in each subcorpus whose subject is a pronoun. Early calculations indicated that pronouns (`['i', 'we', 'she', 'he', 'they', 'it', 'you']`) are the subjects of 70% of SVOs in both subcorpora. Is this comparable to other forms of discourse?

In [1]:
# IMPORTS
import pandas as pd
import networkx as nx
import numpy as np
import plotly.graph_objects as go

In [2]:
# LOAD DATAFRAMES
# the `lem` suffix indicates the verbs have been lemmatized
svos_m = pd.read_csv("../output/svos_m_lem.csv", index_col=0)
svos_w = pd.read_csv("../output/svos_w_lem.csv", index_col=0)

print(svos_m.shape[0], svos_w.shape[0])

80460 26610


## Subjects

The function below allows us to compare the usage of subjects across the two subcorpora, returning both a raw count and a percentage of the SVO count of the subcorpus.

In [3]:
def compare(subject):
    # Select the SVOs in each subcorpus that have this subject
    m_tmp = svos_m[svos_m["subject"] == subject]
    w_tmp = svos_w[svos_w["subject"] == subject]
    # Print the raw count and its share of the subcorpus's SVOs
    print(f'''
    | ♂︎ | "{subject}" | {m_tmp.shape[0]} | {m_tmp.shape[0]/svos_m.shape[0]:.3f} |
    | ♀︎ | "{subject}" | {w_tmp.shape[0]} | {w_tmp.shape[0]/svos_w.shape[0]:.3f} |''')
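A variant that returns a dataframe instead of printing may make the results easier to sort or plot. The following is only a sketch under assumptions: `compare_table` and the toy dataframes are hypothetical names and data invented for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for svos_m and svos_w (hypothetical data, for illustration only)
toy_m = pd.DataFrame({"subject": ["he", "he", "she", "i"],
                      "verb": ["go", "say", "say", "have"]})
toy_w = pd.DataFrame({"subject": ["she", "she", "i", "he", "we"],
                      "verb": ["tell", "know", "love", "say", "have"]})


def compare_table(subjects, svos_by_speaker):
    """Return raw counts and shares of SVOs for each subject and subcorpus."""
    rows = []
    for speaker, svos in svos_by_speaker.items():
        # Count every subject once, then look up the ones we care about
        counts = svos["subject"].value_counts()
        for subject in subjects:
            n = int(counts.get(subject, 0))
            rows.append({"speaker": speaker, "subject": subject,
                         "count": n, "share": n / len(svos)})
    return pd.DataFrame(rows)


table = compare_table(["he", "she"], {"m": toy_m, "w": toy_w})
print(table)
```

Because the result is an ordinary dataframe, it can be filtered or sorted by `share` like any other table.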

### Third Person Perspective

In [4]:
thirdPerson = ['he', 'she', 'man', 'woman', 'men', 'women', 'actor', 'actress']

for i in thirdPerson:
    compare(i)


    | ♂︎ | "he" | 2548 | 0.032 |
    | ♀︎ | "he" | 757 | 0.028 |

    | ♂︎ | "she" | 848 | 0.011 |
    | ♀︎ | "she" | 643 | 0.024 |

    | ♂︎ | "man" | 80 | 0.001 |
    | ♀︎ | "man" | 13 | 0.000 |

    | ♂︎ | "woman" | 24 | 0.000 |
    | ♀︎ | "woman" | 35 | 0.001 |

    | ♂︎ | "men" | 40 | 0.000 |
    | ♀︎ | "men" | 23 | 0.001 |

    | ♂︎ | "women" | 37 | 0.000 |
    | ♀︎ | "women" | 62 | 0.002 |

    | ♂︎ | "actor" | 3 | 0.000 |
    | ♀︎ | "actor" | 1 | 0.000 |

    | ♂︎ | "actress" | 0 | 0.000 |
    | ♀︎ | "actress" | 0 | 0.000 |


### First & Second Person

In [5]:
firstSecond = ["i", "we", "you"]

for i in firstSecond:
    compare(i)


    | ♂︎ | "i" | 15440 | 0.192 |
    | ♀︎ | "i" | 6185 | 0.232 |

    | ♂︎ | "we" | 15458 | 0.192 |
    | ♀︎ | "we" | 4652 | 0.175 |

    | ♂︎ | "you" | 11949 | 0.149 |
    | ♀︎ | "you" | 3117 | 0.117 |
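The overall pronoun share noted as "to be added" at the top of this notebook could be computed along the following lines. This is a minimal sketch, assuming subjects are stored in lowercase in a `subject` column; `toy_svos` is hypothetical data standing in for either subcorpus.

```python
import pandas as pd

# Pronoun list taken from the note at the top of the notebook
pronouns = ["i", "we", "she", "he", "they", "it", "you"]

# Toy stand-in for one subcorpus of SVOs (hypothetical data)
toy_svos = pd.DataFrame({
    "subject": ["i", "we", "scientist", "you", "people", "it", "she", "he"],
    "verb": ["have", "go", "build", "see", "want", "take", "say", "know"],
})

# isin() yields a boolean Series; its mean is the share of True values
pronoun_share = toy_svos["subject"].isin(pronouns).mean()
print(f"{pronoun_share:.1%} of SVOs have a pronoun subject")
```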


### Other Subjects

In [None]:
# This is just a placeholder for more interesting words
otherSubjects =  ['subject1', 'subject2']

for i in otherSubjects:
    compare(i)

## Verbs

We need either a collection of dataframes or a single dataframe that pairs the subjects above with their most common verbs. This will give us a sense of the actions associated with particular subjects, the active spaces characters occupy.

The first thing we need to do is get a count of the verbs available in each of the subcorpora:

In [7]:
# Verbs in the men's subcorpus
verbs_m = svos_m.groupby(
    ["verb"]).size().reset_index(name="count").sort_values(["count"], ascending=False)
verbs_m.head()

| | verb | count |
|---|---|---|
| 1078 | have | 8897 |
| 1024 | go | 3969 |
| 710 | do | 3815 |
| 1013 | get | 3065 |
| 2551 | want | 2830 |


The code below was sort of a ***doh!*** moment: we want a total number of verbs used, but then when the total SVO count comes back, you realize you already know that number. *Sigh.*

In [31]:
verbs_m_total = verbs_m['count'].sum()
verbs_w_total = verbs_w['count'].sum()
print(verbs_m_total, verbs_w_total)

80460 26610


In the cells that follow, we explore the space of verbs for the two subcorpora. 

In the case of the male speakers, there are a total of 2631 verbs used, with 959 of those verbs occurring only once. That leaves us with 1672 verbs that occur at least twice. (This step can be repeated for any count of verbs we are interested in.)

For female speakers, there are 1687 verbs in total, with 692 used only once, leaving 995 that occur at least twice.

Once we normalize by dividing each verb's count by the total number of verb occurrences in its subcorpus and keep only the verbs with a relative frequency above 0.001, each subcorpus has only a little over 200 verbs left: 216 for the male speakers and 225 for the female speakers.
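The counting steps described above can be sketched as follows. The dataframe `toy_verbs` is hypothetical data standing in for `verbs_m` or `verbs_w`, and the variable names `n_once` and `n_frequent` are introduced here only for illustration.

```python
import pandas as pd

# Toy verb counts standing in for verbs_m (hypothetical data)
toy_verbs = pd.DataFrame({"verb": ["have", "go", "do", "invent", "ponder"],
                          "count": [1200, 600, 195, 4, 1]})

# Relative frequency: each verb's count over the total number of verb tokens
toy_verbs["frequency"] = toy_verbs["count"] / toy_verbs["count"].sum()

# Count the one-off verbs and the verbs above a frequency threshold
threshold = 0.001
n_once = int((toy_verbs["count"] == 1).sum())
n_frequent = int((toy_verbs["frequency"] > threshold).sum())
print(n_once, n_frequent)
```

Changing `threshold`, or the comparison against `1`, repeats the step for any cutoff of interest.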

In [9]:
# Commented out so we don't create files unless it's desired.
# verbs_m.to_csv("~/Desktop/verbs_m.csv")

verbs_m.shape

(2631, 2)

In [11]:
# Unique verbs in the women's subcorpus
verbs_w = svos_w.groupby(
    ["verb"]).size().reset_index(name="count").sort_values(["count"], ascending=False)

# verbs_w.to_csv("~/Desktop/verbs_w.csv")

verbs_w.shape

(1687, 2)

In [34]:
# Relative frequency of each verb within its own subcorpus
verbs_w['frequency'] = verbs_w['count'] / verbs_w_total
verbs_m['frequency'] = verbs_m['count'] / verbs_m_total

In [22]:
def match_and_subtract(list1, list2):
    # Each list holds (verb, frequency) pairs.
    # Build a lookup for list2 so each verb is matched in one step.
    lookup = {item[0]: item[1] for item in list2}
    # Keep only verbs present in both lists; value is list1 minus list2
    return {item[0]: item[1] - lookup[item[0]]
            for item in list1 if item[0] in lookup}

In [39]:
m_verbs = [tuple(r) for r in verbs_m.drop(['count'], axis=1).to_numpy().tolist()]
w_verbs = [tuple(r) for r in verbs_w.drop(['count'], axis=1).to_numpy().tolist()]

compare_verbs = match_and_subtract(m_verbs, w_verbs)
verb_list = list(compare_verbs.items())
verbs_sorted = sorted(verb_list, key=lambda x: x[1])
print("Used more by women:")
print(verbs_sorted[0:10])
print("Used more by men:")
print(verbs_sorted[-10:])

Used more by women:
[('tell', -0.004080497119017734), ('love', -0.003386390711133641), ('find', -0.0026590761520356044), ('know', -0.0020973679807846715), ('meet', -0.0016205717911187672), ('need', -0.001607961100784358), ('say', -0.0015899278136061509), ('hear', -0.0015586533015768127), ('share', -0.001465838620715553), ('play', -0.0014015661356445086)]
Used more by men:
[('give', 0.0018606513113296405), ('have', 0.0019708967686086903), ('try', 0.0019772161256540396), ('build', 0.0022252123570192925), ('see', 0.0023376156435333355), ('put', 0.0035064444831172277), ('take', 0.003737556401312521), ('do', 0.006640719470709711), ('get', 0.008217476118855479), ('go', 0.009268731288888214)]


In [41]:
print([x[0] for x in verbs_sorted[0:20]])

['tell', 'love', 'find', 'know', 'meet', 'need', 'say', 'hear', 'share', 'play', 'feel', 'ask', 'marry', 'spend', 'help', 'come', 'choose', 'think', 'stand', 'kill']


In [49]:
# The list is sorted in ascending order, so the largest
# differences in favor of men sit at the end.
# Reversed here so it compares more easily with the list above.
print(list(reversed([x[0] for x in verbs_sorted[-20:]])))

['go', 'get', 'do', 'take', 'put', 'see', 'build', 'try', 'have', 'give', 'want', 'show', 'turn', 'develop', 'start', 'invent', 'design', 'solve', 'move', 'cost']
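The same comparison can also be expressed with pandas index alignment rather than nested lists. This is a sketch of an alternative, not the method used above: the toy tables are hypothetical data, and the column name `frequency` is an assumption about how the normalized column is labeled.

```python
import pandas as pd

# Toy frequency tables standing in for verbs_m and verbs_w (hypothetical data)
toy_m = pd.DataFrame({"verb": ["go", "get", "tell", "build"],
                      "frequency": [0.05, 0.04, 0.01, 0.02]})
toy_w = pd.DataFrame({"verb": ["go", "get", "tell", "love"],
                      "frequency": [0.03, 0.03, 0.02, 0.03]})

# Index both tables by verb; subtraction then aligns on the verb labels
freq_m = toy_m.set_index("verb")["frequency"]
freq_w = toy_w.set_index("verb")["frequency"]

# Verbs missing from either subcorpus become NaN and are dropped,
# which mirrors the matching step in match_and_subtract
diff = (freq_m - freq_w).dropna().sort_values()

print("Used more by women:", diff.head(1).index.tolist())
print("Used more by men:", diff.tail(1).index.tolist())
```

As in the cells above, negative values mark verbs used relatively more by women and positive values mark verbs used relatively more by men.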


Code for various kinds of counts. 

In [None]:
# Verbs that occur only once
# verbs_m_once = verbs_m.loc[verbs_m["count"] == 1]
# verbs_w_once = verbs_w.loc[verbs_w["count"] == 1]
# print(f"One-off verbs for ♂︎ - {verbs_m_once.shape[0]}; ♀︎ - {verbs_w_once.shape[0]}.")

# Verbs that occur at least twice
# verbs_m_ge2 = verbs_m.loc[verbs_m["count"] >= 2]
# verbs_w_ge2 = verbs_w.loc[verbs_w["count"] >= 2]
# print(f"Verbs that occur at least twice: for ♂︎ - {verbs_m_ge2.shape[0]}; ♀︎ - {verbs_w_ge2.shape[0]}.")

## Subjects with Verbs

In [21]:
# This gives us the top 20 verbs associated with "he" in the men's subcorpus
m_he = svos_m[svos_m["subject"] == "he"].groupby(["verb"]).size().reset_index(
    name='obs').sort_values(['obs'], ascending=False).iloc[:20]
m_he

| | verb | obs |
|---|---|---|
| 200 | have | 232 |
| 387 | say | 132 |
| 125 | do | 111 |
| 188 | get | 90 |
| 190 | go | 78 |
| 450 | take | 78 |
| 494 | want | 71 |
| 259 | make | 67 |
| 455 | tell | 58 |
| 336 | put | 54 |


I tried to write a function, `verbCount`, that would create an appropriately named dataframe containing the top verbs for a given subject. It was intended to be called in a `for` loop:
```python
genderedSubjects = ['she', 'he', 'man', 'men', 'woman', 'women']
for i in genderedSubjects:
    verbCount(svos_w, "w", i, 30)
```
But it doesn't work as intended: the named dataframes never appear in the notebook's namespace, as a print statement reveals:
```python
print(w_man)
```
```
NameError: name 'w_man' is not defined
```
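One common way around this kind of `NameError` is to store the results in a dictionary keyed by subject, rather than trying to create one variable per subject. The sketch below assumes that approach; `top_verbs`, `top_w`, and the toy dataframe are hypothetical and do not reproduce the original `verbCount`.

```python
import pandas as pd

# Toy stand-in for svos_w (hypothetical data, for illustration only)
toy_svos = pd.DataFrame({
    "subject": ["she", "she", "she", "he", "he", "man"],
    "verb": ["say", "say", "know", "go", "go", "build"],
})


def top_verbs(svos, subject, n=20):
    """Return the n most frequent verbs for one subject as a dataframe."""
    counts = svos.loc[svos["subject"] == subject, "verb"].value_counts()
    # rename_axis + reset_index(name=...) gives stable column names
    return counts.head(n).rename_axis("verb").reset_index(name="obs")


# Dictionary keys play the role that names such as w_man were meant to play
gendered_subjects = ["she", "he", "man"]
top_w = {subject: top_verbs(toy_svos, subject) for subject in gendered_subjects}

print(top_w["she"])
```

Each dataframe is then available as `top_w["man"]`, `top_w["he"]`, and so on.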

We can either run this code a dataframe at a time or go with something more pandas-y.

One approach might look like this:

In [None]:
svos_m_pro = svos_m.loc[(
    svos_m['subject'] == 'he') | (
    svos_m['subject'] == 'she') | (
    svos_m["subject"] == "i")
]
svos_m_pro.shape

Another approach takes advantage of the `subset` parameter of pandas' `value_counts`. Used on the entire men's subcorpus, it suggests that the most common subject-verb pairing is *we have*, by a pretty large margin, so let's mark that as something worth exploring further.

<div class="alert alert-block alert-warning"> <b>TO DO</b>: Take a look at "we have" in the men's subcorpus. </div>

In [None]:
svos_m.value_counts(subset=['subject', 'verb'])

In [None]:
# Create a list of the pronouns we want to see
pronouns = ["he", "she", "i"]

# Here's the code all in one block
m_pronouns = svos_m[svos_m["subject"].isin(
    pronouns)].value_counts(
    subset=['subject', 'verb']).reset_index()
m_pronouns.head()

In [None]:
# m_pronouns.to_csv("../output/m_iheshe.csv")

In [None]:
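# Older pandas names the counts column 0; pandas 2.x names it 'count'.
# rename() ignores keys that are absent, so this handles both cases.
m_pronouns.rename(columns={0: 'v_count', 'count': 'v_count'}, inplace=True)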
m_pronouns.head()

In [None]:
m_pronouns.shape

In [None]:
m_pronouns['v_freq'] = m_pronouns['v_count'] / m_pronouns['v_count'].sum()

In [None]:
m_pronouns.head()

## Character Spaces as Verb-Feature Spaces

The goal in this section is to:

1. Collect all the verbs associated with the specified subjects
2. Weight the verbs (by normalization)
3. Compare the verbs manually
4. Visualize a comparison using PCA or t-SNE

First we explore the total number of verbs involved:

In [None]:
# Collect all the verbs from the women's subcorpus
verbs_w = svos_w.groupby(["verb"]).size().reset_index(name='obs').sort_values(
        ['obs'], ascending=False)

# Select only the verbs that occur more than once
verbs_gt_w = verbs_w[verbs_w.obs > 1]

# What are our counts?
print(f"♀︎: {verbs_w.shape[0]} unique verbs; {verbs_gt_w.shape[0]} occur more than once")

In [None]:
# Repeat for the men's subcorpus
verbs_m = svos_m.groupby(["verb"]).size().reset_index(name='obs').sort_values(
        ['obs'], ascending=False)
verbs_gt_m = verbs_m[verbs_m.obs > 1]

print(f"♂︎: {verbs_m.shape[0]} unique verbs; {verbs_gt_m.shape[0]} occur more than once")

Now we need to grab the verbs associated with the subjects:

In [None]:
# Create a list of the subjects for which we want SVOs
subjects = ['she', 'he', 'i']

# Filter the dataframe
subjects_w = svos_w[svos_w['subject'].isin(subjects)]

# We don't want the objects for this
subjects_w = subjects_w.drop('object', axis=1)

# Count the unique combinations of two columns
subj_w_ct = subjects_w[['subject', 'verb']].value_counts().reset_index(name='count')

# Check our work
subj_w_ct.head()

In [None]:
subj_w_ct.value_counts(subset=['subject', 'verb']).sort_index(ascending=False)

In [None]:
# Repeat for the men's subcorpus
subjects_m = svos_m[svos_m['subject'].isin(subjects)]
subjects_m = subjects_m.drop('object', axis=1)
subj_m_ct = subjects_m[['subject', 'verb']].value_counts().reset_index(name='count')
subj_m_ct.shape

In [None]:
subj_m_ct.head(10)

In [None]:
# See the total number of verbs above
# This could have been done with verbs_w.shape[0]
subj_m_ct['weight'] = subj_m_ct['count']/5307
subj_w_ct['weight'] = subj_w_ct['count']/3161
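The denominators above are hard-coded. A sketch of deriving a denominator from the data instead is given below, with two candidate choices shown side by side. Both are assumptions offered for comparison: neither is claimed to reproduce the values 5307 and 3161, and `toy_ct` is hypothetical data standing in for `subj_w_ct`.

```python
import pandas as pd

# Toy subject-verb counts standing in for subj_w_ct (hypothetical data)
toy_ct = pd.DataFrame({
    "subject": ["she", "she", "he", "he", "i"],
    "verb": ["say", "know", "go", "say", "have"],
    "count": [30, 10, 15, 5, 40],
})

# Option 1: weight by the total of all counted subject-verb pairs
toy_ct["weight_total"] = toy_ct["count"] / toy_ct["count"].sum()

# Option 2: weight within each subject, so each subject's weights sum to 1
subject_totals = toy_ct.groupby("subject")["count"].transform("sum")
toy_ct["weight_subject"] = toy_ct["count"] / subject_totals

print(toy_ct)
```

Which option is appropriate depends on the comparison intended: the first keeps frequent subjects such as *i* dominant, while the second puts every subject on the same scale.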

In [None]:
subj_w_ct.head()

Now we have two dataframes, each with three subjects: *she*, *he*, and *i*. Each subject has hundreds of verbs associated with it, and each verb has a weight normalized to its subcorpus so that it *should* be comparable to verbs in the other subcorpus. The goal is to see how close or far apart the six subject-speaker combinations are.

In [None]:
# Add a column that attributes each subject-verb pair to a speaker gender
subj_w_ct['speaker'] = "female"
subj_m_ct['speaker'] = "male"
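With the speaker column in place, one possible next step is to stack the two dataframes and pivot them into a verb-feature matrix with one row per speaker-subject pair. The sketch below is an assumption about how that could look, not the notebook's settled method: the toy data is hypothetical, and cosine similarity is used here as a simple stand-in for the PCA or t-SNE comparison listed in the goals.

```python
import numpy as np
import pandas as pd

# Toy weighted counts standing in for subj_w_ct and subj_m_ct (hypothetical data)
toy_w = pd.DataFrame({"subject": ["she", "she", "he"],
                      "verb": ["say", "know", "say"],
                      "weight": [0.6, 0.4, 1.0],
                      "speaker": "female"})
toy_m = pd.DataFrame({"subject": ["she", "he", "he"],
                      "verb": ["say", "go", "say"],
                      "weight": [1.0, 0.5, 0.5],
                      "speaker": "male"})

# Stack the subcorpora, then build one row per (speaker, subject) pair
combined = pd.concat([toy_w, toy_m], ignore_index=True)
matrix = combined.pivot_table(index=["speaker", "subject"], columns="verb",
                              values="weight", aggfunc="sum", fill_value=0)

# Cosine similarity between the rows of the verb-feature matrix
values = matrix.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T, index=matrix.index, columns=matrix.index)

print(similarity.round(2))
```

The `matrix` dataframe is also the kind of input that PCA or t-SNE would take, with verbs as features and speaker-subject pairs as observations.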