# Contexts

This notebook allows us to examine the contexts in which words occur in sentences throughout the two subcorpora. Contexts vary according to the method used:

- Using spaCy we can explore contexts by specifying both a word and its part-of-speech.
- Using NLTK's `concordance` functionality we can explore all of a word's contexts in the conventional KWiC format. There is also code here for those interested in lemmatizing words before generating an NLTK `text`.
- Using the already-generated SVOs, we can quickly glimpse the related subjects, verbs, and objects for a particular word, though it has to appear somewhere as an S, V, or O.

## spaCy

The code below uses spaCy's `children` attribute on the root token of each sentence to find the sentence's subject (the child with the `nsubj` dependency) and then returns the sentences in which a particular word appears as that subject. It could be adapted to a wide variety of uses.

Development of this code was based on insights from this Stack Overflow thread: [How to get the dependency tree with spaCy?](https://stackoverflow.com/questions/36610179/how-to-get-the-dependency-tree-with-spacy)

In [4]:
# IMPORTS
import pandas as pd, spacy
# from spacy.lang.en import English

# Load the data, partitioned by speaker gender:
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')

# And then grab only the texts of the talks:
texts_women = talks_f.text.tolist()
texts_men = talks_m.text.tolist()

# Lowercase everything before we create spaCy docs
texts_w = ' '.join([text.lower() for text in texts_women])
texts_m = ' '.join([text.lower() for text in texts_men])

In [None]:
nlp = spacy.load('en_core_web_sm')

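# spaCy is fussy about memory allocation, so use the pipe method to feed
# it the talks one at a time. pipe expects an iterable of texts: the joined
# string `texts_w` would be processed one character at a time.
docs_w = list(nlp.pipe(text.lower() for text in texts_women))
# docs_m = list(nlp.pipe(text.lower() for text in texts_men))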

In [None]:
# This function allows us to specify the subject of the sentence
# and to see all the sentences in which it appears as the subject.
def find_subject(subject, doc):
    subject_sents = []
    sentences = doc.sents
    for sentence in sentences:
        root_token = sentence.root
        for child in root_token.children:
            if child.dep_ == 'nsubj':
                subj = child
                if subj.text == subject:
                    subject_sents.append(sentence)
    return subject_sents

In [None]:
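# Print the matches: a bare function call inside a loop displays nothing
for doc in docs_w:
    for sentence in find_subject("father", doc):
        print(sentence)

The single-document call **below** works because `nlp()` is given one complete talk. The loop **above** only finds matches when `docs_w` was built from a list of talks: `nlp.pipe` expects an iterable of texts, so if it is handed one joined string it parses that string character by character, and no one-character document has `father` as its subject.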
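
A quick way to see what a consumer like `nlp.pipe` receives from a string versus a list is to collect the items it is handed. This is a minimal sketch in plain Python: `collect_items` is a hypothetical stand-in, not a spaCy function, and the two talks are toy data rather than corpus text.

```python
def collect_items(texts):
    """Hypothetical stand-in for a pipe-style consumer:
    treat every item yielded by `texts` as one 'document'."""
    return [item for item in texts]

talks = ["My father spoke.", "We listened."]  # toy data, not from the corpus

joined = " ".join(talk.lower() for talk in talks)  # one long string
as_list = [talk.lower() for talk in talks]         # one string per talk

docs_from_string = collect_items(joined)   # iterating a str yields characters
docs_from_list = collect_items(as_list)    # iterating a list yields talks

print(len(docs_from_string), docs_from_string[:2])
print(len(docs_from_list), docs_from_list)
```

The string version produces as many "documents" as there are characters, while the list version produces one per talk.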

In [None]:
doc_one = nlp(texts_women[1])

In [None]:
find_subject('father', doc_one)

In [None]:
texts_women[1]

In [None]:
# Variant of find_subject: return the sentences whose root token
# (usually the main verb) matches the given word.
def find_root(root, doc):
    root_sents = []
    for sentence in doc.sents:
        if sentence.root.text == root:
            root_sents.append(sentence)
    return root_sents

In [None]:
# Inspect how spaCy has split the single talk into sentences
for sentence in doc_one.sents:
    print(sentence)

In [None]:
find_subject("i", docs_w[0])

## NLTK

To change the display of results, use the arguments of the `concordance` method: `concordance("word", width=79, lines=25)`. The `width` is an integer giving the total number of characters in each displayed line, with the keyword in the middle and the remaining space split between the left and right contexts. The `lines` argument is an integer that caps the number of matches shown. The default is 25, so pass a larger number to see more.
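
To make the effect of these two arguments concrete, here is a toy keyword-in-context function in plain Python. It is a sketch for illustration only, not NLTK's implementation: `kwic` is a hypothetical name, `width` is taken to be the total width of each row, and `lines` caps the number of rows returned.

```python
def kwic(tokens, keyword, width=79, lines=25):
    """Return up to `lines` keyword-in-context rows, each at most `width` characters."""
    # Characters available on each side of the keyword (two spaces are separators)
    half = (width - len(keyword) - 2) // 2
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() != keyword.lower():
            continue
        left = " ".join(tokens[:i])[-half:]       # tail of the left context
        right = " ".join(tokens[i + 1:])[:half]   # head of the right context
        rows.append(f"{left:>{half}} {token} {right}")
        if len(rows) == lines:
            break
    return rows

tokens = "he would kill me and she would not kill him".split()
rows = kwic(tokens, "kill", width=30, lines=1)
for row in rows:
    print(row)
```

Raising `lines` returns more matches, and raising `width` shows more context around each one.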

In [None]:
import nltk
from nltk import word_tokenize

# Create NLTK texts for concordances
words_w = word_tokenize(" ".join(talks_f.text.tolist()))
women = nltk.Text(words_w)

words_m = word_tokenize(" ".join(talks_m.text.tolist()))
men = nltk.Text(words_m)

# Test
women.concordance('kill', lines=50)

In [None]:
import nltk
# nltk.download('punkt')  # Download the necessary tokenizer data

def find_sentences(text, noun, verb):
    sentences = nltk.sent_tokenize(text)
    matching_sentences = []
    for sentence in sentences:
        if (noun in sentence.split()) and (verb in sentence.split()):
            matching_sentences.append(sentence)
    return matching_sentences

In [None]:
find_sentences(texts_w, "he", "kill")
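
One limitation of matching against `sentence.split()` is that punctuation stays attached to words, so a sentence ending in `kill.` is not counted as containing `kill`. The sketch below shows one possible workaround, assuming a simple regular-expression tokenizer is acceptable. `contains_both` is a hypothetical helper and the sentence is a made-up example.

```python
import re

def contains_both(sentence, noun, verb):
    """True if both words occur as whole tokens, ignoring attached punctuation."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return noun in tokens and verb in tokens

sentence = "he said he would never kill."    # made-up example
print("kill" in sentence.split())             # False: the token is 'kill.'
print(contains_both(sentence, "he", "kill"))  # True
```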

In [6]:
import nltk
# nltk.download('punkt')  # Download the necessary tokenizer data
# nltk.download('averaged_perceptron_tagger')  # Download the necessary POS tagger data

In [5]:
def find_sents(text, noun, verb):
    sentences = nltk.sent_tokenize(text)
    matching_sentences = []
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        pos_tags = nltk.pos_tag(words)
        verb_in_sentence = False
        for word, tag in pos_tags:
            if word == verb and tag.startswith('VB'):
                verb_in_sentence = True
                break
        if noun in sentence.split() and verb_in_sentence:
            matching_sentences.append(sentence)
    return matching_sentences

In [7]:
find_sents(texts_w, "he", "kill")

['(laughter)    and if he knew i was showing this right now — i put this in today — he would kill me.',
 "this took place in egypt in january 2011, and as president hosni mubarak attempted a desperate move to quash the rising revolution on the streets of cairo, he sent his personal troops down to egypt's internet service providers and had them physically kill the switch on the country's connection to the world overnight.",
 'now, my dad is my biggest fan, so in that crushing moment where he wanted to kill my new little life form, i realized that actually i had failed him, both as a daughter and a scientist.']

## SVO

This approach was at first appealing for its simplicity and accuracy, since we would be loading the SVOs themselves, but the resulting contexts were too impoverished to be of much use for hand inspection.

In [None]:
# IMPORTS
import pandas as pd

# LOAD DATAFRAMES
# the `lem` suffix indicates the verbs have been lemmatized
svos_m = pd.read_csv("../output/svos_m_lem.csv", index_col=0)
svos_w = pd.read_csv("../output/svos_w_lem.csv", index_col=0)

In [None]:
svos_w.query(' subject=="he" & verb=="kill" ')
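
The query above fixes the role each word plays. To find a word wherever it appears as an S, V, or O, one option is to test all three columns at once. This is a sketch on toy rows rather than the real SVO files: `find_in_svos` is a hypothetical helper, and the `object` column name is an assumption, since only `subject` and `verb` appear in the query above.

```python
import pandas as pd

# Toy rows standing in for the real SVO file; the `object` column name is assumed
svos_toy = pd.DataFrame({
    "subject": ["he", "she", "father"],
    "verb": ["kill", "love", "tell"],
    "object": ["switch", "father", "story"],
})

def find_in_svos(svos, word, columns=("subject", "verb", "object")):
    """Return the rows in which `word` fills any of the given roles."""
    mask = svos[list(columns)].eq(word).any(axis=1)
    return svos[mask]

result = find_in_svos(svos_toy, "father")
print(result)
```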