# SVO 2: Gendered Subjects

In this notebook, we explore usage of male and female pronouns and nouns as subjects in both subcorpora: first by raw count, and then by actions (verbs) associated with those nouns and pronouns.

**Next Steps**: Work on code to compile / visualize this as a network graph (?).

We begin by loading the SVOs saved to CSVs.

---

**Possibly worth looking into**: in a prior code run, we fed the function a list of pronouns and asked it to output only the SVOs whose subject was one of them: `pronouns = ['i', 'we', 'she', 'he', 'they', 'it', 'you']`. Comparing the two outputs: the male speaker subcorpus has 80,331 SVOs in total, of which 56,781 begin with one of the pronouns listed above, and the female speaker subcorpus has 26,527 SVOs in total, of which 18,602 begin with one of those pronouns. If these counts hold, then the preponderance of sentences in TED talks begin with a rather small set of pronouns:

```
male:   56,781 / 80,331 = .707
female: 18,602 / 26,527 = .701
```
*The counts are not precise, but they represent a possible trend worth investigating.*
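
The proportions above can be recomputed directly from an SVO dataframe. What follows is a minimal sketch, assuming a `subject` column as in the CSVs loaded below. `pronoun_share` is a hypothetical helper name, and the toy dataframe only stands in for the real data.

```python
import pandas as pd

PRONOUNS = ["i", "we", "she", "he", "they", "it", "you"]

def pronoun_share(svos, pronouns=PRONOUNS):
    """Return (pronoun-subject count, total count, share) for an SVO dataframe."""
    # Lowercase the subjects so that "She" and "she" are counted together
    is_pronoun = svos["subject"].astype(str).str.lower().isin(pronouns)
    count = int(is_pronoun.sum())
    total = len(svos)
    share = count / total if total else 0.0
    return count, total, share

# Toy data standing in for svos_m / svos_w (assumption, not the real corpus)
toy = pd.DataFrame({"subject": ["i", "She", "woman", "we"],
                    "verb": ["see", "said", "knows", "have"]})
print(pronoun_share(toy))
```

Running the same helper on both subcorpora would show whether the roughly 70% share holds once the counts are made precise.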

In [1]:
# IMPORTS
import pandas as pd
import networkx as nx
import numpy as np

# import plotly.graph_objects as go
# import plotly.express as pex

import plotly.graph_objects as go

# from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# init_notebook_mode(connected=True)

# pd.options.plotting.backend = "plotly"

In [3]:
# LOAD DATAFRAMES
svos_m = pd.read_csv("../output/svos_m.csv", index_col=0)
svos_w = pd.read_csv("../output/svos_w.csv", index_col=0)

print(svos_m.shape[0], svos_w.shape[0])

80460 26610


## Subjects

The function below allows us to compare the usage of subjects across the two subcorpora, returning both a raw count and a percentage of the SVO count of the subcorpus.

In [None]:
def compare(subject):
    # Select the SVOs in each subcorpus whose subject matches
    m_tmp = svos_m[svos_m["subject"] == subject]
    w_tmp = svos_w[svos_w["subject"] == subject]
    print(f'''
    | ♂︎ "{subject}" | {m_tmp.shape[0]} | {m_tmp.shape[0]/svos_m.shape[0]:.3f} |
    | ♀︎ "{subject}" | {w_tmp.shape[0]} | {w_tmp.shape[0]/svos_w.shape[0]:.3f} |''')
    
compare("he")

### Third Person Perspective

In [None]:
thirdPerson = ['he', 'she', 'man', 'woman', 'men', 'women', 'actor', 'actress']

for i in thirdPerson:
    compare(i)

### First & Second Person

In [None]:
firstSecond = ["i", "we", "you"]

for i in firstSecond:
    compare(i)

### Other Subjects

In [None]:
# This is just a placeholder for more interesting words
otherSubjects =  ['subject', 'object']

for i in otherSubjects:
    compare(i)
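
The loops above print one small table per subject. If the numbers need to be sorted or plotted later, it may be more convenient to collect them in a single dataframe. A sketch under that assumption follows. `compare_table` is a hypothetical helper, and the toy frames stand in for `svos_m` and `svos_w`.

```python
import pandas as pd

def compare_table(subjects, frames):
    """Build one row per (subcorpus, subject) with raw count and share of SVOs."""
    rows = []
    for label, svos in frames.items():
        # Count every subject once, then look up the ones we care about
        counts = svos["subject"].value_counts()
        for subject in subjects:
            n = int(counts.get(subject, 0))
            rows.append({"subcorpus": label, "subject": subject,
                         "count": n, "share": n / len(svos)})
    return pd.DataFrame(rows)

# Toy data (assumption): replace with {"m": svos_m, "w": svos_w}
toy_m = pd.DataFrame({"subject": ["he", "he", "she", "we"]})
toy_w = pd.DataFrame({"subject": ["she", "she", "he", "i", "we"]})
table = compare_table(["he", "she"], {"m": toy_m, "w": toy_w})
print(table)
```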

## Verbs

We need either a collection of dataframes or one dataframe which has just the subjects above along with the most common verbs: this will give us a sense of the actions associated with particular subjects, the active spaces characters occupy. 

The code below is a sample based on your earlier work. The question is how to do this *at scale*: feeding a list of subjects and then getting the top verbs for each.

In [None]:
m_he = svos_m[svos_m["subject"] == "he"].groupby(["verb"]).size().reset_index(
    name='obs').sort_values(['obs'], ascending=False).iloc[:20]
m_he

Below is my attempt to create a function that would return an appropriately named dataframe which was 20 rows long and contained the top 20 verbs for a given subject. It does not work in the `for` loop in the cell below. It returns the dataframe, but the name of the dataframe does not come along for the ride.

In [None]:
def verbCount(dataframe, prefix, subject, num_top_verbs):
    # Create a unique name for the dataframe
    name = (prefix+'_'+subject)
    # Create the [temp] dataframe
    name = dataframe[dataframe["subject"] == subject].groupby(
        ["verb"]).size().reset_index(name='obs').sort_values(
        ['obs'], ascending=False).iloc[:num_top_verbs]
    return name

In [None]:
m_she = verbCount(svos_m, "m", "she", 20)
m_she.head()

In [None]:
genderedSubjects = ['she', 'he', 'man', 'men', 'woman', 'women']
for i in genderedSubjects:
    verbCount(svos_w, "w", i, 20)

The code above is not working: it is not creating a bunch of smaller dataframes, as the following print statement reveals:
```python
print(w_man)
```
```
NameError: name 'w_man' is not defined
```
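
Python does not turn a string into a variable name, so the `name` string built inside `verbCount` is simply overwritten by the dataframe, and the loop discards each returned value. One common way to keep a name attached to each result is a dictionary keyed by the would-be variable name. A minimal sketch follows, with a toy dataframe in place of `svos_w`. `top_verbs` and `frames` are hypothetical names.

```python
import pandas as pd

def top_verbs(dataframe, subject, num_top_verbs):
    """Count verbs for one subject, most frequent first."""
    return (dataframe[dataframe["subject"] == subject]
            .groupby("verb").size()
            .reset_index(name="obs")
            .sort_values("obs", ascending=False)
            .iloc[:num_top_verbs])

# Toy data standing in for svos_w (assumption)
toy = pd.DataFrame({
    "subject": ["she", "she", "she", "man", "man"],
    "verb":    ["said", "said", "went", "was", "was"],
})

# Store each result under the name we wanted the variable to have
frames = {}
for subject in ["she", "man"]:
    frames[f"w_{subject}"] = top_verbs(toy, subject, 20)

print(frames["w_man"])
```

With this pattern, `frames["w_man"]` plays the role that the undefined `w_man` variable was meant to play.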

But I also realized that this is not necessary. One could use the larger dataframe and filter things there, or move the SVOs into a network and manipulate things there. (See section below.)
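
Filtering the larger dataframe might look like the following sketch, which counts verbs for several subjects in one pass and keeps the top few for each. It assumes `subject` and `verb` columns. `top_verbs_by_subject` is a hypothetical name, and the toy dataframe is illustrative only.

```python
import pandas as pd

def top_verbs_by_subject(svos, subjects, num_top_verbs):
    """Return subject/verb/obs rows holding each subject's most frequent verbs."""
    counts = (svos[svos["subject"].isin(subjects)]
              .groupby(["subject", "verb"]).size()
              .reset_index(name="obs")
              .sort_values(["subject", "obs"], ascending=[True, False]))
    # head() on a groupby keeps the first n rows of each subject
    return counts.groupby("subject").head(num_top_verbs).reset_index(drop=True)

# Toy data (assumption): replace with svos_m or svos_w
toy = pd.DataFrame({
    "subject": ["she", "she", "she", "she", "he", "he", "he", "he", "it"],
    "verb":    ["said", "said", "went", "saw", "was", "was", "was", "had", "is"],
})
print(top_verbs_by_subject(toy, ["she", "he"], 1))
```

The result already has the `subject`, `verb`, `obs` layout, so it can be passed on to a network or a plotting function without re-inserting the subject column.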

## SVO Networks

### A Small Test Network

While we might eventually like to have a network of `subject > verb > object`, for now let's work with the `m_she` dataframe and build a network with sources, targets, and edge attributes.

In [None]:
# Re-insert a column for "she"
m_she["subject"] = "she"

# Re-arrange columns so that they are in a more obvious order
m_she = m_she[["subject", "verb", "obs"]]

m_she.shape

In [None]:
# Create the graph
# Reference: https://stackoverflow.com/questions/53937259/converting-a-pandas-dataframe-to-a-networkx-graph
G = nx.from_pandas_edgelist(m_she, source='subject', target='verb', edge_attr=True)

In [None]:
nx.draw_networkx(G)

This is not a very clear graph, and, honestly, I think we would rather be able to choose the number of nodes at the network level rather than at the dataframe level: being able to adjust the network visualization is a real boon.

In [None]:
G.edges.data()
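
Choosing what to show at the network level, rather than at the dataframe level, could be done by pruning edges on their `obs` attribute. The following is a sketch, not a settled design: `prune` and its `min_obs` threshold are hypothetical, and the toy edge list stands in for a real subject/verb count table.

```python
import networkx as nx
import pandas as pd

# Toy subject/verb counts (assumption)
edges = pd.DataFrame({
    "subject": ["she", "she", "she", "he", "he"],
    "verb":    ["said", "went", "saw", "said", "was"],
    "obs":     [12, 5, 1, 9, 7],
})

G = nx.from_pandas_edgelist(edges, source="subject", target="verb", edge_attr="obs")

def prune(graph, min_obs):
    """Keep only edges with at least min_obs observations, and the nodes they touch."""
    keep = [(u, v) for u, v, obs in graph.edges(data="obs") if obs >= min_obs]
    # copy() returns an independent graph instead of a read-only view
    return graph.edge_subgraph(keep).copy()

H = prune(G, 5)
print(sorted(H.nodes()))
```

Because the full graph `G` stays intact, the threshold can be adjusted and the drawing redone without rebuilding any dataframes.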

### Sankey Diagrams

### Plotly

Ken Lok has a terrific [Sankey diagram generator](https://medium.com/kenlok/how-to-create-sankey-diagrams-from-dataframes-in-python-e221c1b4d6b0). The solution arrived at here is based on a [Stack Overflow answer](https://stackoverflow.com/questions/70335771/how-do-i-make-a-sankey-diagram-with-plotly-with-one-layer-that-goes-only-one-lev) by [Rob Raymond](https://stackoverflow.com/users/9441404/rob-raymond), which answered two questions I had. The first was how to compile the subjects and verbs into lists that could be fed into Plotly's Sankey function. Raymond offers the following (the `m_she` below is left over from an earlier experiment):
```
nodes = np.unique(m_she[["subject", "verb"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))
```
And the second was a quick solution to creating a figure:
```
go.Figure(
    go.Sankey(
        node={"label": nodes.index},
        link={
            "source": nodes.loc[m_she["subject"]],
            "target": nodes.loc[m_she["verb"]],
            "value": m_she["obs"],
        },
    )
)
```
Combining those two pieces with the `verbCount` function above results in `sankify`:

In [6]:
def sankify(dataframe, subject, num_top_verbs):
    
    # Create the subject-focused dataframe
    df = dataframe[dataframe["subject"] == subject].groupby(
        ["verb"]).size().reset_index(name='obs').sort_values(
        ['obs'], ascending=False).iloc[:num_top_verbs]
    
    # Re-insert a column for the subject
    df["subject"] = subject

    # Re-arrange columns so that they are in a more obvious order
    df = df[["subject", "verb", "obs"]]

    nodes = np.unique(df[["subject", "verb"]], axis=None)
    nodes = pd.Series(index=nodes, data=range(len(nodes)))

    fig = go.Figure(
        go.Sankey(
            node={"label": nodes.index},
            link={
                "source": nodes.loc[df["subject"]],
                "target": nodes.loc[df["verb"]],
                "value": df["obs"],
            },
        )
    )
    # A Sankey trace has no legend, so set the figure title instead
    fig.update_layout(title_text=f'Top {num_top_verbs} verbs for "{subject}"')
    fig.show()

In [7]:
sankify(svos_w, "she", 20)

That functionality gives us one subject and the top X verbs associated with it. Since the visualization code works well enough, we break it out from the part that counts the verbs and then re-creates a dataframe with the subject.

In [8]:
def SVOverbs(dataframe, subject, num_top_verbs):
    
    # Create the subject-focused dataframe
    df = dataframe[dataframe["subject"] == subject].groupby(
        ["verb"]).size().reset_index(name='obs').sort_values(
        ['obs'], ascending=False).iloc[:num_top_verbs]
    
    # Re-insert a column for the subject
    df["subject"] = subject

    # Re-arrange columns so that they are in a more obvious order
    df = df[["subject", "verb", "obs"]]
    return df

def sankay(df):
    nodes = np.unique(df[["subject", "verb"]], axis=None)
    nodes = pd.Series(index=nodes, data=range(len(nodes)))

    fig = go.Figure(
        go.Sankey(
            node={"label": nodes.index},
            link={
                "source": nodes.loc[df["subject"]],
                "target": nodes.loc[df["verb"]],
                "value": df["obs"],
            },
        )
    )
    # A Sankey trace has no legend, so set the figure title instead
    fig.update_layout(title_text="Top verbs for: " + ", ".join(df["subject"].unique()))
    fig.show()

In [9]:
w_she = SVOverbs(svos_w, "she", 30)
w_he = SVOverbs(svos_w, "he", 30)

w_she_he = pd.concat([w_she, w_he], ignore_index=True)
w_she_he.shape

sankay(w_she_he)

In [10]:
w_i = SVOverbs(svos_w, "i", 30)

w_she_he_i = pd.concat([w_she_he, w_i], ignore_index=True)
w_she_he_i.shape

(90, 3)

In [11]:
sankay(w_she_he_i)
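
To see what `sankay` hands to Plotly, it can help to look at the node lookup on its own. The sketch below repeats the two `nodes` lines on a toy dataframe (an assumption, not real counts) and prints the integer `source` and `target` lists that the Sankey links are built from.

```python
import numpy as np
import pandas as pd

# Toy subject/verb counts (assumption)
df = pd.DataFrame({
    "subject": ["she", "she", "he"],
    "verb":    ["said", "went", "said"],
    "obs":     [12, 5, 9],
})

# Every distinct label (subjects and verbs together), sorted alphabetically
labels = np.unique(df[["subject", "verb"]], axis=None)
# Map each label to its integer position in the label list
nodes = pd.Series(index=labels, data=range(len(labels)))

# Translate each row's subject and verb into node positions
source = nodes.loc[df["subject"]].tolist()
target = nodes.loc[df["verb"]].tolist()
print(list(labels), source, target)
```

Note that subjects and verbs share one label list, so a word that occurs as both a subject and a verb would be drawn as a single node. A shared verb such as "said" is what joins the flows from different subjects.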