# SVO Contexts

**If using this notebook for KWiC, proceed to *Load NLTK Texts*.**

This notebook is intended primarily to "hand" inspect the contexts for results in other explorations. We are going to try two approaches:

- using the SVOs (This turns out to be rather underwhelming.)
- using the NLTK's concordance method

In *SVOs as Context* below, we end up loading only the men's subcorpus to explore how useful the complete SVOs would be. We don't take this exploration any further.

Afterwards, we turn to the texts themselves as context, loading the two gendered dataframes and then collecting the texts from each. We process the two lists of texts with spaCy's NLP pipeline to produce feature-rich documents, which make it easy to extract the lemmas. Using spaCy's lemmatization parallels how we built the SVOs, so the lemmas here should match those in the SVOs, which lets us explore the two subcorpora much more effectively.

Finally, we take the texts as sequences of lemmas and create a single NLTK Text, which allows us to use NLTK's concordance functionality to see words in context. To do this, we create a function that takes a list of spaCy docs, builds a list of lemmas for each doc, compiles those lists into a list for the subcorpus, and flattens that list into a single, very long list of tokens, from which an NLTK Text is created. We create two NLTK Texts: one for female speakers, `women`, and one for male speakers, `men`.

**TO DO**: Find a way to save either the list of spaCy docs or the NLTK Text. The spaCy NLP pipeline takes time to run, and we shouldn't need to create spaCy docs or NLTK Texts every time we want to use this notebook.
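
One possible approach to this TO DO, sketched under assumptions: since an NLTK Text is built from a plain list of tokens, it may be enough to store the flattened lemma list as JSON and rebuild the Text on load. The helper names `save_tokens` and `load_tokens` and the file paths are hypothetical and are not part of this notebook.

```python
import json
import tempfile
from pathlib import Path


def save_tokens(tokens, path):
    """Write a flat list of token strings to a JSON file."""
    Path(path).write_text(json.dumps(tokens), encoding="utf-8")


def load_tokens(path):
    """Read a flat list of token strings back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Round trip on toy data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "lemmas_demo.json"
    save_tokens(["we", "be", "often", "tell"], demo_path)
    restored = load_tokens(demo_path)
print(restored)

# Hypothetical usage with the notebook's objects (paths are assumptions):
# save_tokens(list(women.tokens), "../output/lemmas_w.json")
# women = nltk.Text(load_tokens("../output/lemmas_w.json"))
```

This avoids re-running the spaCy pipeline, at the cost of keeping only the lemmas rather than the full spaCy docs.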

## SVOs as Context

This approach was appealing at first for its simplicity and accuracy, since we would be loading the SVOs themselves. The resulting contexts, however, were too impoverished to be of much use for hand inspection. Only the men's subcorpus is loaded here.

In [None]:
import csv, re
with open('../output/svos_m_lem.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # Drop the first row
    next(reader)
    # Skip the first column
    contexts_m = [row[1:4] for row in reader]

Each line in `contexts_m` is a list of three strings. The list comprehension below joins the items in each line into a single string and then removes any parentheses, square brackets, or curly braces, which sometimes occur in the third item. It does this for every line in the list.

In [None]:
# Raw string avoids invalid-escape warnings; the class matches ( ) [ ] { }
sentences = [re.sub(r"[()\[\]{}]", "", ' '.join(item)) for item in contexts_m]
print(sentences[0:10])
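
To see what the join-and-strip step does, here is a small sketch on invented rows. The rows are hypothetical stand-ins for the three columns read from the CSV, not actual data from the corpus.

```python
import re

# Hypothetical rows standing in for the three columns read from the CSV
rows = [
    ["i", "have", "[an idea]"],
    ["we", "build", "(a model)"],
]

# Join each row into one string, then delete every bracket character
bracket_chars = re.compile(r"[()\[\]{}]")
cleaned = [bracket_chars.sub("", " ".join(row)) for row in rows]
print(cleaned)
```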

## Lemmatized Contexts

In [None]:
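# IMPORTS
import nltk, pandas as pd, spacy

# Load the two gendered dataframes and collect the texts from each
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')
texts_w = talks_f.text.tolist()
texts_m = talks_m.text.tolist()

# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_sm')

# Use the pipe method to feed documents through the pipeline
docs_w = list(nlp.pipe(texts_w))
docs_m = list(nlp.pipe(texts_m))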

In [None]:
def contextualize(spacy_docs):
    '''contextualize takes a list of spaCy docs
       and converts them to a single NLTK text'''
    all_lemmas = []
    # Grab the lemmas from each of the documents
    # and append to a list of all the lemmas
    for doc in spacy_docs:
        lemmas = [token.lemma_ for token in doc]
        all_lemmas.append(lemmas)
    # all_lemmas is a list of lists that needs to be flattened
    flattened = [item for sublist in all_lemmas for item in sublist]
    # all our texts are now one long list of words to be fed into NLTK Text
    contextualized = nltk.Text(flattened)
    return contextualized

In [None]:
women = contextualize(docs_w)
men = contextualize(docs_m)
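
The flattening step inside `contextualize` can be seen on toy data. This sketch uses `itertools.chain.from_iterable` as an alternative to the nested comprehension; the lemma lists are invented for illustration.

```python
from itertools import chain

# Hypothetical per-document lemma lists, standing in for spaCy output
lemmas_per_doc = [["we", "be", "here"], ["she", "turn", "out"]]

# chain.from_iterable concatenates the sublists in order
flat = list(chain.from_iterable(lemmas_per_doc))
print(flat)
```

Because the result is one continuous list, document boundaries are lost, so a concordance window can span the end of one talk and the start of the next.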

## Fuller Contexts

In [2]:
# IMPORTS
import nltk, pandas as pd
from nltk.tokenize import word_tokenize

# from nltk.stem import WordNetLemmatizer
# wnl = WordNetLemmatizer()

# Load the data, partitioned by speaker gender:
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')

In [3]:
# Create NLTK texts for concordances
words_w = word_tokenize(" ".join(talks_f.text.tolist()))
women = nltk.Text(words_w)

words_m = word_tokenize(" ".join(talks_m.text.tolist()))
men = nltk.Text(words_m)

In [7]:
women_tagged = nltk.pos_tag(words_w)

In [53]:
women_tagged[40:80]

[('often', 'RB'),
 ('told', 'VBN'),
 ('that', 'IN'),
 ('a', 'DT'),
 ('real', 'JJ'),
 ('sustainability', 'NN'),
 ('policy', 'NN'),
 ('agenda', 'NN'),
 ('is', 'VBZ'),
 ('just', 'RB'),
 ('not', 'RB'),
 ('feasible', 'JJ'),
 (',', ','),
 ('especially', 'RB'),
 ('in', 'IN'),
 ('large', 'JJ'),
 ('urban', 'JJ'),
 ('areas', 'NNS'),
 ('like', 'IN'),
 ('New', 'NNP'),
 ('York', 'NNP'),
 ('City', 'NNP'),
 ('.', '.'),
 ('And', 'CC'),
 ('that', 'DT'),
 ("'s", 'VBZ'),
 ('because', 'IN'),
 ('most', 'JJS'),
 ('people', 'NNS'),
 ('with', 'IN'),
 ('decision-making', 'JJ'),
 ('powers', 'NNS'),
 (',', ','),
 ('in', 'IN'),
 ('both', 'DT'),
 ('the', 'DT'),
 ('public', 'NN'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('private', 'JJ')]

In [11]:
sents_w = nltk.tokenize.sent_tokenize(" ".join(talks_f.text.tolist()))

In [20]:
sents_w[0:5]

["  If you're here today — and I'm very happy that you are — you've all heard about how sustainable development will save us from ourselves.",
 "However, when we're not at TED, we are often told that a real sustainability policy agenda is just not feasible, especially in large urban areas like New York City.",
 "And that's because most people with decision-making powers, in both the public and the private sector, really don't feel as though they're in danger.",
 "The reason why I'm here today, in part, is because of a dog — an abandoned puppy I found back in the rain, back in 1998.",
 "She turned out to be a much bigger dog than I'd anticipated."]

In [62]:
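def find_sentences_with_noun(subject_noun, sentences):
    '''Return the sentences whose first token is subject_noun tagged as a noun.'''
    noun_subjects = []
    noun_sentences = []
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged_words = nltk.tag.pos_tag(words)
        for word, tag in tagged_words:
            # index(...) == 0 holds only when the noun is the first token
            # of the sentence, so nouns in any other position are skipped
            if "NN" in tag and word == subject_noun and tagged_words.index((word, tag)) == 0:
                noun_subjects.append(word)
                noun_sentences.append(sentence)
    return noun_sentences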

In [63]:
find_sentences_with_noun("man", sents_w)

[]
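
The position test `tagged_words.index((word, tag)) == 0` is satisfied only when the noun is the sentence's first token, which would explain the empty result. Below is a minimal sketch of a position-independent check, assuming the sentences have already been tagged. The function name `sentences_with_noun` is hypothetical, and the tagged fragments are taken from the tagger output shown earlier. Note that it finds the noun anywhere in the sentence, which is looser than finding grammatical subjects.

```python
def sentences_with_noun(noun, tagged_sentences):
    """Return the sentences in which `noun` occurs with a noun tag (NN*).

    `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs.
    """
    matches = []
    for tagged in tagged_sentences:
        if any(word == noun and tag.startswith("NN") for word, tag in tagged):
            # Rebuild a readable string from the tokens of the matching sentence
            matches.append(" ".join(word for word, _ in tagged))
    return matches


tagged_sentences = [
    [("a", "DT"), ("real", "JJ"), ("sustainability", "NN"),
     ("policy", "NN"), ("agenda", "NN"), ("is", "VBZ"),
     ("just", "RB"), ("not", "RB"), ("feasible", "JJ")],
    [("And", "CC"), ("that", "DT"), ("'s", "VBZ"), ("because", "IN"),
     ("most", "JJS"), ("people", "NNS")],
]
print(sentences_with_noun("policy", tagged_sentences))
```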

In [44]:
subject_noun = "policy"
noun_subjects = []

for sentence in sents_w[0:2]:
    words = word_tokenize(sentence)
    tagged_words = nltk.tag.pos_tag(words)
    for word, tag in tagged_words:
        if tag == "NN" and word == subject_noun and tagged_words.index((word, tag)) == 0:
            noun_subjects.append(word)
            noun_sentences.append(sentence)
    
print(noun_sentences)

["However, when we're not at TED, we are often told that a real sustainability policy agenda is just not feasible, especially in large urban areas like New York City."]


In [18]:
noun = "men"
noun_sentences = find_sentences_with_noun(noun, sents_w)
print(noun_sentences)

[]


In [19]:
import spacy
from spacy.lang.en import English

In [21]:
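# English() builds a blank pipeline (tokenizer only, no dependency parser),
# so token.dep_ is never "nsubj" here and the function returns [].
# A trained pipeline such as spacy.load('en_core_web_sm') supplies dep_ labels.
nlp = English()
def find_sentences_with_noun(subject_noun, sentences):
    noun_sentences = []
    for sentence in sentences:
        doc = nlp(sentence)
        for token in doc:
            if token.dep_ == "nsubj" and token.text == subject_noun:
                noun_sentences.append(sentence)
    return noun_sentences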

In [24]:
noun = "woman"
noun_sentences = find_sentences_with_noun(noun, sents_w)
print(noun_sentences)

[]


### Save NLTK Texts

In [4]:
import dill
# Context managers make sure the files are flushed and closed
with open("../output/contexts_w.pickle", "wb") as f:
    dill.dump(women, f)
with open("../output/contexts_m.pickle", "wb") as f:
    dill.dump(men, f)

### Load NLTK Texts

In [None]:
# To pick up where we left off without having to re-run everything:
import dill, nltk
with open("../output/contexts_w.pickle", "rb") as f:
    women = dill.load(f)
with open("../output/contexts_m.pickle", "rb") as f:
    men = dill.load(f)

## Words in Context

In [5]:
women.concordance('man', lines=50)

Displaying 50 of 195 matches:
ting any and all sins against God and man . '' ( Laughter ) Now , I had heard t
ting any and all sins against God and man . '' And Bill said , `` So ? '' And I
and evil . Every single one of them : man , woman , child , infant , fetus . An
uicide bombers . ] [ Paradise Now ] [ Man : As long as there is injustice , som
nce between victim and occupier . ] [ Man : If we had airplanes , we would n't 
aeli military is still stronger . ] [ Man : Then let us be equal in death . ] [
e ! It only exists in your head ! ] [ Man : God forbid ! ] [ May God forgive yo
ive Israel an alibi to carry on ? ] [ Man : So with no alibi , Israel will stop
ave to turn it into a moral war . ] [ Man : How , if Israel has no morals ? ] [
sic ] [ Amandla ] ( Music ) ( Video ) Man : Song is something that we communica
roject that 's done by an Argentinean man and his wife . And he 's basically ta
 '' is a woman , or a vagina-friendly man , who has witnessed incredible violen
 it 's an 

In [6]:
men.concordance("man", lines=50)

Displaying 50 of 635 matches:
re and his wife , Tipper . '' And the man said , `` He 's come down a long way 
st family restaurant chain , what the man said — they laughed . I gave my speec
 air , and I looked , and there was a man running across the runway . And he wa
' I call it `` software rage . '' And man , let me tell you , whoever figures o
shirt recently , which said , `` If a man speaks his mind in a forest , and no 
n her hands for 20 minutes while this man talked to her mother about the proble
blocked people 's ability to get this man 's intellect and capacity . And the w
ou value most , or uncertainty ? This man could n't be a certainty freak if he 
er stop you . '' She finishes , and a man stands up , and he says , `` I 'm fro
ll I can tell you is , I brought this man on stage with a man from New York who
 , I brought this man on stage with a man from New York who worked in the World
 did an indirect negotiation . Jewish man with family in the occupied territory
if he was 

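To change the display, note that the concordance method is called as `concordance("word", width=79, lines=25)`. In current NLTK releases, `width` is an integer giving the total width of each displayed line in characters, with the keyword centered, rather than the number of characters on each side. The default for `lines` is 25, and it can be set to any integer; to see every match, pass a number at least as large as the match count reported in the `Displaying ... of N matches` header.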
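
To make the idea of a concordance line concrete, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name `kwic` and its defaults are assumptions chosen for illustration, and it treats `width` as the approximate total width of each line.

```python
def kwic(tokens, keyword, width=79, lines=25):
    """Return up to `lines` keyword-in-context strings about `width` characters wide."""
    # Characters available on each side of the keyword
    half = max((width - len(keyword) - 2) // 2, 0)
    results = []
    for i, token in enumerate(tokens):
        if len(results) >= lines:
            break
        if token.lower() != keyword.lower():
            continue
        # `half` tokens per side is enough to fill `half` characters
        left = " ".join(tokens[max(0, i - half):i])[-half:]
        right = " ".join(tokens[i + 1:i + 1 + half])[:half]
        # Right-align the left context so the keywords line up in a column
        results.append(f"{left:>{half}} {token} {right}")
    return results


tokens = "the man said that a man runs".split()
for line in kwic(tokens, "man", width=21):
    print(line)
```

Passing a small `width` narrows the window around each match, and `lines` caps how many matches are returned.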