# SVO 2: Gendered Subjects

In this notebook, we explore usage of male and female pronouns and nouns as subjects in both subcorpora: first by raw count, and then by actions (verbs) associated with those nouns and pronouns.

**Next Steps**: Work on code to compile / visualize this as a network graph (?).

We begin by loading the SVOs saved to CSVs.

---

**Possibly worth looking into**: in a prior code run, we fed the function a list of pronouns and asked it to output only those SVOs: `pronouns = ['i', 'we', 'she', 'he', 'they', 'it', 'you']`. Comparing the two outputs: the male speaker subcorpus has 80,331 SVOs in total, of which 56,781 begin with one of the pronouns listed above; the female speaker subcorpus has 26,527 SVOs in total, of which 18,602 begin with one of those pronouns. If these counts hold, the preponderance of sentences in TED talks begin with a rather small set of pronouns:

```
male:   56,781 / 80,331 = .707
female: 18,602 / 26,527 = .701
```
*The counts are not precise, but they represent a possible trend worth investigating.*
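
The ratio above could be recomputed directly from an SVO dataframe rather than from a separate filtered run. The sketch below is only an illustration: `toy_svos` is a made-up stand-in for `svos_m` or `svos_w`, and `pronoun_share` is a hypothetical helper, not part of the earlier code.

```python
import pandas as pd

# Hypothetical toy data standing in for svos_m / svos_w (one row per SVO)
toy_svos = pd.DataFrame({
    "subject": ["i", "we", "she", "teacher", "you",
                "students", "he", "it", "i", "data"],
    "verb":    ["think", "need", "said", "asked", "know",
                "learn", "had", "works", "want", "show"],
})

pronouns = ['i', 'we', 'she', 'he', 'they', 'it', 'you']

def pronoun_share(svos, pronouns):
    """Return (pronoun SVO count, total SVO count, share) for one dataframe."""
    # isin() marks every row whose subject is one of the listed pronouns
    n_pronoun = int(svos["subject"].isin(pronouns).sum())
    n_total = len(svos)
    return n_pronoun, n_total, n_pronoun / n_total

n_pronoun, n_total, share = pronoun_share(toy_svos, pronouns)
print(f"{n_pronoun:,} / {n_total:,} = {share:.3f}")
```

Running the same helper on both subcorpora would give the two ratios from one code path, which should make the comparison easier to repeat.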

In [1]:
# IMPORTS
import pandas as pd
import networkx as nx

In [2]:
# LOAD DATAFRAMES
svos_m = pd.read_csv("../output/svos_m.csv", index_col=0)
svos_w = pd.read_csv("../output/svos_w.csv", index_col=0)

## Subjects

The function below allows us to compare the usage of subjects across the two subcorpora, returning both a raw count and a percentage of the SVO count of the subcorpus.

In [3]:
def compare(subject):
    # Filter each subcorpus down to the rows with this subject
    m_tmp = svos_m[svos_m["subject"] == subject]
    w_tmp = svos_w[svos_w["subject"] == subject]
    print(f'''
    | ♂︎ "{subject}" | {m_tmp.shape[0]} | {m_tmp.shape[0]/svos_m.shape[0]:.3f} |
    | ♀︎ "{subject}" | {w_tmp.shape[0]} | {w_tmp.shape[0]/svos_w.shape[0]:.3f} |''')
    
compare("he")


    | ♂︎ "he" | 2548 | 0.032 |
    | ♀︎ "he" | 757 | 0.028 |


### Third Person Perspective

In [4]:
thirdPerson = ['he', 'she', 'man', 'woman', 'men', 'women', 'actor', 'actress']

for i in thirdPerson:
    compare(i)


    | ♂︎ "he" | 2548 | 0.032 |
    | ♀︎ "he" | 757 | 0.028 |

    | ♂︎ "she" | 848 | 0.011 |
    | ♀︎ "she" | 643 | 0.024 |

    | ♂︎ "man" | 80 | 0.001 |
    | ♀︎ "man" | 13 | 0.000 |

    | ♂︎ "woman" | 24 | 0.000 |
    | ♀︎ "woman" | 35 | 0.001 |

    | ♂︎ "men" | 40 | 0.000 |
    | ♀︎ "men" | 23 | 0.001 |

    | ♂︎ "women" | 37 | 0.000 |
    | ♀︎ "women" | 62 | 0.002 |

    | ♂︎ "actor" | 3 | 0.000 |
    | ♀︎ "actor" | 1 | 0.000 |

    | ♂︎ "actress" | 0 | 0.000 |
    | ♀︎ "actress" | 0 | 0.000 |


### First & Second Person

In [5]:
firstSecond = ["i", "we", "you"]

for i in firstSecond:
    compare(i)


    | ♂︎ "i" | 15440 | 0.192 |
    | ♀︎ "i" | 6185 | 0.232 |

    | ♂︎ "we" | 15458 | 0.192 |
    | ♀︎ "we" | 4652 | 0.175 |

    | ♂︎ "you" | 11949 | 0.149 |
    | ♀︎ "you" | 3117 | 0.117 |


### Other Subjects

In [6]:
# This is just a placeholder for more interesting words
otherSubjects =  ['subject', 'object']

for i in otherSubjects:
    compare(i)


    | ♂︎ "subject" | 9 | 0.000 |
    | ♀︎ "subject" | 2 | 0.000 |

    | ♂︎ "object" | 7 | 0.000 |
    | ♀︎ "object" | 5 | 0.000 |
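
The printed rows above are easy to read but hard to sort or plot. One possible alternative, sketched here with made-up data, is to collect the same count and share into a dataframe. `compare_table`, `toy_m`, and `toy_w` are hypothetical names introduced for this illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for svos_m and svos_w
toy_m = pd.DataFrame({"subject": ["he", "he", "she", "we", "i"]})
toy_w = pd.DataFrame({"subject": ["she", "she", "he", "i"]})

def compare_table(subjects, svos_by_group):
    """Return one row per (subject, group) with raw count and share of SVOs."""
    rows = []
    for subject in subjects:
        for group, svos in svos_by_group.items():
            # Same logic as compare(): matching rows over all rows in the group
            count = int((svos["subject"] == subject).sum())
            rows.append({"subject": subject, "group": group,
                         "count": count, "share": count / len(svos)})
    return pd.DataFrame(rows)

table = compare_table(["he", "she"], {"m": toy_m, "w": toy_w})
print(table)
```

With the real data, the call would presumably look like `compare_table(thirdPerson, {"m": svos_m, "w": svos_w})`.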


## Verbs

We need either a collection of dataframes or one dataframe which has just the subjects above along with the most common verbs: this will give us a sense of the actions associated with particular subjects, the active spaces characters occupy. 

The code below is a sample based on your earlier work. The question is how to do this *at scale*: feeding a list of subjects and then getting back the most common verbs for each one.

In [7]:
m_he = svos_m[svos_m["subject"] == "he"].groupby(["verb"]).size().reset_index(
    name='obs').sort_values(['obs'], ascending=False).iloc[:20]
m_he

Unnamed: 0,verb,obs
293,had,146
581,said,96
284,got,60
282,going,49
297,has,48
711,took,46
519,put,44
758,wanted,44
402,made,37
165,did,36
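
One way the same count might be done for several subjects at once is to group on both `subject` and `verb` and then keep the first rows of each subject group. This is a sketch under assumptions: `toy_svos` is made-up data and `top_verbs` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical toy data standing in for svos_m or svos_w
toy_svos = pd.DataFrame({
    "subject": ["he", "he", "he", "he", "she", "she", "she", "we"],
    "verb":    ["had", "had", "said", "took", "said", "said", "had", "need"],
})

def top_verbs(svos, subjects, n):
    """Return the n most frequent verbs for each subject in one dataframe."""
    counts = (
        svos[svos["subject"].isin(subjects)]   # keep only the listed subjects
        .groupby(["subject", "verb"])
        .size()
        .reset_index(name="obs")
        # Sort by count within each subject; verb breaks ties alphabetically
        .sort_values(["subject", "obs", "verb"], ascending=[True, False, True])
    )
    # head(n) on a groupby keeps the first n rows of every subject group
    return counts.groupby("subject").head(n).reset_index(drop=True)

top = top_verbs(toy_svos, ["he", "she"], 2)
print(top)
```

The result is a single long dataframe with a `subject` column, so no per-subject variable names are needed.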


Below is my attempt to create a function that would return an appropriately named dataframe which was 20 rows long and contained the top 20 verbs for a given subject. It does not work in the `for` loop in the cell below. It returns the dataframe, but the name of the dataframe does not come along for the ride.

In [8]:
def verbCount(dataframe, prefix, subject, num_top_verbs):
    # Build the intended name as a string, e.g. "m_she"
    name = (prefix+'_'+subject)
    # NOTE: this rebinds `name` to the dataframe and discards the string above;
    # a string built here does not become a variable name in the notebook
    name = dataframe[dataframe["subject"] == subject].groupby(
        ["verb"]).size().reset_index(name='obs').sort_values(
        ['obs'], ascending=False).iloc[:num_top_verbs]
    return name

In [9]:
verbCount(svos_m, "m", "she", 20)

Unnamed: 0,verb,obs
126,had,57
243,said,32
129,has,25
74,did,19
119,going,19
315,told,17
338,wanted,16
286,started,15
121,got,14
316,took,13


In [10]:
genderedSubjects = ['she', 'he', 'man', 'men', 'woman', 'women']
for i in genderedSubjects:
    verbCount(svos_w, "w", i, 20)

The code above is not working: it is not creating a bunch of smaller dataframes (see the `print` statement below), because the loop discards each returned dataframe and a string built inside a function never becomes a variable name. I also realized that this is not necessary. One could use the larger dataframe and filter things there, or move the SVOs into a network and manipulate things there. (See the section below.)

In [11]:
print(w_man)

NameError: name 'w_man' is not defined
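
If named per-subject dataframes are still wanted, one common pattern is to keep them in a dictionary keyed by the intended name instead of trying to create variables. The sketch below uses made-up data; `toy_w`, `verb_count`, and `w_verbs` are hypothetical names, and `verb_count` is a simplified variant of `verbCount` without the prefix argument.

```python
import pandas as pd

# Hypothetical toy data standing in for svos_w
toy_w = pd.DataFrame({
    "subject": ["she", "she", "she", "man", "man", "women"],
    "verb":    ["said", "said", "had", "took", "took", "want"],
})

def verb_count(dataframe, subject, num_top_verbs):
    """Return the most frequent verbs for one subject, most common first."""
    return (
        dataframe[dataframe["subject"] == subject]
        .groupby("verb").size().reset_index(name="obs")
        .sort_values("obs", ascending=False)
        .iloc[:num_top_verbs]
    )

gendered_subjects = ["she", "man", "women"]

# The dictionary key plays the role of the dataframe's name
w_verbs = {f"w_{s}": verb_count(toy_w, s, 20) for s in gendered_subjects}

print(w_verbs["w_man"])
```

Here `w_verbs["w_man"]` stands in for the `w_man` variable that the loop above was expected to create.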

## SVO Networks

### A Small Test Network

In [None]:
# Filtering the dataframe
w_she = svos_w[svos_w["subject"] == "she"]
w_she.shape

In [None]:
# Create the graph
# Reference: https://stackoverflow.com/questions/53937259/converting-a-pandas-dataframe-to-a-networkx-graph
G = nx.from_pandas_edgelist(w_she, source='subject', target='verb')

In [None]:
nx.draw_networkx(G)

That is not a pretty graph. Can we filter out the minor nodes?

In [None]:
edge_list = G.edges()
len(edge_list)

# The third position in a slice is the step: this would take every third edge
# list(G.edges)[0:323:3]

The few lines of code above reveal that NetworkX has collapsed repeated subject-verb pairs into single edges: the number of entries has dropped from 643 rows in the dataframe to 323 edges. (A default `Graph` holds at most one edge between any two nodes, so the repeat counts are not kept.) The question is whether we can filter on what would be the weight of the edge, which would be the number of times a verb occurs, e.g., `("she", "has", 5)`.

The code below is only exploratory.

In [None]:
# An EdgeView cannot be sliced directly, so convert it to a list first
list(edge_list)[0:10]

In [None]:
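# Colors for the edges we keep (one entry per surviving edge)
edge_colors = []

threshold = 10

edges = list(G.edges(data=True))

for e in edges:
    # ASSUMPTION: each edge carries a 'weight' attribute holding the verb count.
    # G as built above has no such attribute, so a missing weight counts as 0
    # and every edge is removed until weights are added.
    if e[2].get('weight', 0) > threshold:
        edge_colors.append('tab:blue')
    else:
        G.remove_edge(*e[:2])

# Draw the network
nx.draw_networkx(G, with_labels=True, edge_color=edge_colors)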
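
The weight question raised earlier could be approached by counting each subject-verb pair in pandas first and handing that count to NetworkX as an edge attribute. This is a minimal sketch with made-up data, not a tested solution for the full corpus: `toy_w_she`, `G_weighted`, and the attribute name `weight` are assumptions made for the illustration.

```python
import pandas as pd
import networkx as nx

# Hypothetical toy data standing in for w_she
toy_w_she = pd.DataFrame({
    "subject": ["she"] * 6,
    "verb": ["had", "had", "had", "said", "said", "took"],
})

# Count how often each subject-verb pair occurs; the count becomes the weight
edges = (
    toy_w_she.groupby(["subject", "verb"]).size().reset_index(name="weight")
)

# edge_attr copies the 'weight' column onto each edge
G_weighted = nx.from_pandas_edgelist(
    edges, source="subject", target="verb", edge_attr="weight"
)

# Remove edges at or below the threshold, then any nodes left without edges
threshold = 1
light_edges = [
    (u, v) for u, v, w in G_weighted.edges(data="weight") if w <= threshold
]
G_weighted.remove_edges_from(light_edges)
G_weighted.remove_nodes_from(list(nx.isolates(G_weighted)))

print(list(G_weighted.edges(data="weight")))
```

On the real data, the threshold of 10 from the cell above would presumably replace the toy value of 1, and the filtered graph could then be passed to `nx.draw_networkx`.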