# Contexts

---
*Re-write this header note.*

---

**If using this notebook for KWiC, proceed to *Load NLTK Texts*.**

This notebook is intended primarily to "hand" inspect the contexts for results in other explorations. We are going to try two approaches:

- using the SVOs (This turns out to be rather underwhelming.)
- using NLTK's concordance method

In *SVOs as Context* below, we load only the men's subcorpus to explore how useful the complete SVOs would be. We don't take this exploration any further.

Afterwards, we turn to the texts themselves as context, loading the two gendered dataframes and then collecting the texts from each. We process the two lists of texts with spaCy's NLP pipeline to produce feature-rich documents that make it easy to get the lemmas out. Using spaCy's lemmatization parallels our usage in building the SVOs, so the lemmas here should match those in the SVOs, which lets us explore the two subcorpora much more effectively.

Finally, we take the texts as lists of lemmas and create a single NLTK Text, which allows us to use NLTK's concordance functionality to see words in context. To do this, we create a function that takes a list of spaCy docs, builds a list of lemmas for each doc, compiles those lists into a list for the subcorpus, and flattens that list into a single, very long list of tokens, from which an NLTK Text is created. We create two NLTK Texts: one for female speakers, `women`, and one for male speakers, `men`.
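The flattening step can be sketched without spaCy or NLTK. The token lists here are made-up stand-ins for the per-document lemma lists; in the notebook, the flat list is what gets passed to `nltk.Text`.

```python
from itertools import chain

# Hypothetical per-document lemma lists standing in for spaCy output
doc_lemmas = [
    ["i", "be", "a", "teacher"],
    ["we", "build", "thing"],
]

# Flatten the list of lists into one long list of tokens
flat_lemmas = list(chain.from_iterable(doc_lemmas))

print(flat_lemmas)
```

`chain.from_iterable` does the same job as a nested list comprehension; either works for this purpose.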

**TO DO**: Find a way to save either the list of spaCy docs or the NLTK Text. The spaCy NLP pipeline takes time to run, and we shouldn't need to create spaCy docs or NLTK Texts every time we want to use this notebook.
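One possible approach to this TO DO, which the notebook does not currently use, is to save the per-document lemma lists as JSON and rebuild the NLTK Text from them later. The sketch below uses only the standard library; the helper names `save_lemmas` and `load_lemmas` are hypothetical, as is any file name passed to them.

```python
import json
from pathlib import Path


def save_lemmas(lemma_lists, path):
    """Write a list of per-document lemma lists to a JSON file."""
    Path(path).write_text(json.dumps(lemma_lists), encoding="utf-8")


def load_lemmas(path):
    """Read the per-document lemma lists back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

After loading, the lists could be flattened and passed to `nltk.Text` without re-running the spaCy pipeline. spaCy also provides `DocBin` for serializing whole docs, which may be the better fit if more than the lemmas is needed.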

## Full Contexts

### spaCy Parts-of-Speech Focused Contexts

The code below uses spaCy's `children` attribute to find the subject of each sentence and then returns the sentences in which a particular subject appears. It could be adapted to a wide variety of uses.

Development of this code was based on insights from this Stack Overflow thread: [How to get the dependency tree with spaCy?](https://stackoverflow.com/questions/36610179/how-to-get-the-dependency-tree-with-spacy)

In [None]:
# IMPORTS
import pandas as pd, spacy
# from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')

# Loading the Data in a gendered partitioned fashion: 
talks_m = pd.read_csv('../output/talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('../output/talks_female.csv', index_col='Talk_ID')

# And then grabbing only the texts of the talks:
texts_women = talks_f.text.tolist()
texts_men = talks_m.text.tolist()
# talks_all is never loaded in this notebook, so combine the two subcorpora
texts_all = texts_women + texts_men

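# Lowercase everything before we create spaCy docs
texts_w = [text.lower() for text in texts_women]
texts_m = [text.lower() for text in texts_men]

# nlp() expects a single string, so use nlp.pipe to process each list of texts
docs_w = list(nlp.pipe(texts_w))
docs_m = list(nlp.pipe(texts_m))

def find_subject(subject, doc):
    '''find_subject returns the sentences in a spaCy doc whose
       root has the given word as its nominal subject'''
    subject_sents = []
    for sentence in doc.sents:
        # The children of the root token include the sentence's subject
        for child in sentence.root.children:
            if child.dep_ == 'nsubj' and child.text == subject:
                subject_sents.append(sentence)
    return subject_sents

In [None]:
# The texts were lowercased above, so search for "i" rather than "I".
# docs_m[0] is the first talk in the men's subcorpus.
find_subject("i", docs_m[0])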

### NLTK Key Word in Context

---
*Need to re-build this section.*

---

In [None]:
women.concordance('man', lines=50)

In [None]:
men.concordance("man", lines=50)

To change the display, the concordance method takes the arguments `("word", width, lines=#)`. `width` is an integer giving the total width of each displayed line in characters, keyword included; the default is 79. The default for `lines` is 25. It must be an integer, so to see every match, pass a number larger than the match count.
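To make the idea of a keyword-in-context line concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the `kwic` function and its `window` parameter (counted in tokens, not characters) are illustrative assumptions, and the sample sentence is made up.

```python
def kwic(tokens, word, window=3, lines=25):
    """Return up to `lines` contexts for `word`,
    each with `window` tokens on either side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{token}] {right}".strip())
            # Stop once we have collected the requested number of lines
            if len(hits) == lines:
                break
    return hits


# A made-up list of lemmas standing in for a subcorpus
tokens = "the old man see a young man walk by".split()
for line in kwic(tokens, "man", window=2):
    print(line)
```

NLTK's `concordance` additionally pads each line to a fixed character width so that the keyword lines up in a column, which makes long result lists easier to scan.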

## SVOs as Context

This approach was appealing at first because of its simplicity and accuracy, since we would be loading the SVOs themselves, but the resulting contexts were too impoverished to be of much use for hand inspection. Only the men's subcorpus is loaded here.

In [None]:
import csv, re
with open('../output/svos_m_lem.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # Drop the first row
    next(reader)
    # Skip the first column
    contexts_m = [row[1:4] for row in reader]

Each line in `contexts_m` is a list of three strings. The list comprehension below first joins the items in each line into a single string. It then removes any parentheses, square brackets, or curly braces, which sometimes occur in the third item on a line. It does this for all the lines in the list.

In [None]:
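# A raw string avoids invalid-escape warnings; the character class
# matches parentheses, square brackets, and curly braces
sentences = [re.sub(r"[\[\](){}]", "", ' '.join(item)) for item in contexts_m]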
print(sentences[0:10])
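
As a small worked example of the cleaning step, assuming rows shaped like those in `contexts_m` (the sample rows below are made up, not taken from the SVO file):

```python
import re

# Hypothetical rows in the same three-string shape as contexts_m
sample_rows = [
    ["i", "see", "[the (old) man]"],
    ["we", "build", "{thing}"],
]

# Join each row into one string, then delete any brackets,
# parentheses, or braces
cleaned = [re.sub(r"[\[\](){}]", "", " ".join(row)) for row in sample_rows]

print(cleaned)
```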

## Lemmatized Contexts

In [None]:
# IMPORTS
import nltk, pandas as pd, spacy

# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_sm')

# Use the pipe method to feed the lists of texts through the pipeline
docs_w = list(nlp.pipe(texts_w))
docs_m = list(nlp.pipe(texts_m))

In [None]:
def contextualize(spacy_docs):
    '''contextualize takes a list of spaCy docs
       and converts them to a single NLTK text'''
    all_lemmas = []
    # Grab the lemmas from each of the documents
    # and append to a list of all the lemmas
    for doc in spacy_docs:
        lemmas = [token.lemma_ for token in doc]
        all_lemmas.append(lemmas)
    # all_lemmas is a list of lists that needs to be flattened
    flattened = [item for sublist in all_lemmas for item in sublist]
    # all our texts are now one long list of words to be fed into NLTK Text
    contextualized = nltk.Text(flattened)
    return contextualized

In [None]:
women = contextualize(docs_w)
men = contextualize(docs_m)