<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [J.D. Porter](https://) for the 2023 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email porterjd@upenn.sas.edu<br />
____

# `Finding Word Meaning Through Context` `3`

This is lesson `3` of 3 in the educational series on `finding word meaning in context`. This notebook is intended `to show how to find the collocates of any given target term (that is, the words that tend to occur near a word that interests you) and then how to find the distinctive collocates in particular`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` 

**Difficulty:** `Beginner`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, functions, lists, dictionaries)
* A basic familiarity with the material from the first two sessions:
    * How to extract the words from a .txt file and count them
    * How to find the Most Distinctive Words in one text, measured against a broader corpus

```

**Knowledge Recommended:**
```
* n/a
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Find the key words in context (or KWIC) for any given text and target term
2. Identify the most common collocates of a target term
3. Find the distinctive collocates of a target term
4. Apply these procedures to multiple target terms across multiple texts
```
**Research Pipeline:**
```
1. Gather a file in the .txt format and save it somewhere on your machine
2. Use whatever steps you're interested in from this notebook
3. If you have written out some of your data, explore it in a program like Excel or Google Sheets
4. Interpret!
```
___

# Required Python Libraries

* To keep things simple, we will try to work with very few libraries in this notebook. We'll use `os`, as well as one function (`fisher_exact`) from `scipy.stats`.

## Install Required Libraries

In [1]:
### Import Libraries ###
import os

from scipy.stats import fisher_exact

# Required Data

**Data Format:** 
* plain text (.txt)

**Data Source:**
* included files (though you may supplement these whenever you feel comfortable doing so!)

**Data Quality/Bias:**

Included texts are from freely available sources like Project Gutenberg and Wikipedia. They have not been vetted for textual accuracy relative to, say, a scholarly edition. They are subject to the usual biases of those sites, and (especially in the case of Wikipedia) may not reflect the current state of the material online. The F. Scott Fitzgerald novels also do not reflect his entire corpus, since two novels are subject to copyright law (1934's _Tender is the Night_, as well as the posthumously published _The Last Tycoon_).

**Data Description:**

This lesson uses textual data in .txt format (utf-8 encoding) from various sources.

# Introduction...

So far, we've been working primarily with individual words and their counts and frequencies, without much attention to word _order_. Text miners sometimes call this a "bag-of-words" approach: Each text is treated as a collection of words, many of which occur more than once, but features like syntax, dependency, deixis, narratology, and context are ignored. There's nothing wrong with analyzing a text this way—for instance, we learned a lot about the words that distinguished our Beatles articles—but clearly there is plenty more to learn if we treat it less like a bag of words and more like, well, a text! 

In today's lesson, we take a few relatively simple steps to put word order back into the mix. We have already begun thinking about words in their contexts—for instance, in an article about Paul McCartney vs one about John Lennon—but now we will narrow our focus considerably, often to the level of a sentence or a phrase. As a result, we will likewise narrow our object of comparison. By finding collocates, the words that appear near any given target term, we will be able to identify the words that distinguish _words_. We know "bass" is distinctive of Paul and "peace" of John—but what distinguishes "bass" from "peace", or from everything else? We probably won't arrive at a final theory of word meaning 90 minutes from now, but hopefully we'll have some tools to help us think about it in new ways!

# Getting a list of words from a file

Since we've already covered this material, we can reuse a few familiar functions.

In [2]:

# Takes a word and returns a "cleaned up" version of the word
def alphaclean(someword):
    chars = [i for i in someword if i.isalpha()]
    cleanword = ''.join(chars)
    cleanword = cleanword.lower()
    return cleanword

# Takes a filename and returns a list of words (optionally cleaned)
def file2words(somefile,clean=True):
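    # The lesson's files are utf-8, so we say so explicitly rather than rely on the system default
    with open(somefile,encoding='utf-8') as source: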
        text = source.read()
    words = text.split()
    if clean:
        words = [alphaclean(i) for i in words]
    return words

# Takes an item and adds it to a specified count dictionary
def addtocountdict(something,somedict,weight=1):
    if something not in somedict:
        somedict[something] = 0
    somedict[something] += weight
    return

In [22]:
# Create a variable for your filename, using a full path if need be
fn = 'Austen_PrideAndPrejudice.txt'

# Turn the file into words
words = file2words(fn)

# Getting KWIC

Because lists in Python preserve order, we can use them to extract the context that surrounds any given target word. Let's say we want to find the contexts in which Austen writes the word "estate" in _Pride and Prejudice_. The basic method is simple: 
 * Decide how big our "window" should be. Different sizes capture different kinds of relationships among words. A window of about 10 words before and after the target term is fairly standard, since it's a nice round number near the scale of a sentence.[$^{1}$](#1) 
 * Find all occurrences of "estate" in the text.
 * Grab the whole window, from 10 words before "estate" to 10 words after.

If we simply use these results in this form, we'll basically have Key Words in Context, or KWIC. People use the term/acronym KWIC in different ways, sometimes referring to a way of making a concordance, but for our purposes it just means grabbing the context that surrounds target terms!

<hr></hr>

__Footnote__

1. <a id="1"></a> Baroni, Bernardi and Zamparelli say that if the context is very large—e.g., entire documents—then analyses tend to capture "topical" relationships (like that between "war" and "Afghanistan"), whereas small contexts capture "taxonomical" relationships (like that between "dog" and "hyena") (251). They're discussing more complex methods of distance analysis than we're getting into here, but I still think that's an interesting example of the ways context size can affect our results, or even our sense of what "meaning" means. Baroni, Marco, Raffaella Bernardi, and Roberto Zamparelli. "Frege in space: A program for compositional distributional semantics." _Linguistic Issues in Language Technology_ 9 (2014): 241-346.

In [30]:
# Pick a target term
target_term = 'prejudice'

# Pick a window
window = 10

# Create an object that can store your results
kwics = []

# Go through a list of words and find the target term
# Grab the context surrounding the target term
for n,w in enumerate(words):
    if w == target_term:
        start = max(0,n-window)
        end = min(n+window+1,len(words))
        k = words[start:end]
        kwics.append(k)

# Let's turn these steps into a function!
def get_kwic(sometargetterm,somelistofwords,window=10,excl_target = True):
    kwics = []
    for n,w in enumerate(somelistofwords):
        if w == sometargetterm:
            start = max(0,n-window)
            end = min(n+window+1,len(somelistofwords))
            if excl_target:
                k = somelistofwords[start:n] + somelistofwords[n+1:end]
            else:
                k = somelistofwords[start:end]
            kwics.append(k)
    return kwics

In [20]:
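# The function works on any list of words, so we can try it on another novel
# Separate variable names keep the Austen word list intact for the cells below
chopin_fn = 'Chopin_Awakening.txt'
chopin_words = file2words(chopin_fn)

chopin_kwics = get_kwic(target_term,chopin_words)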

In [33]:
kwics = get_kwic(target_term,words,excl_target=False)

for k in kwics:
    print(len(k),k)

13 ['pride', 'and', 'prejudice', 'by', 'jane', 'austen', 'chapter', '', 'it', 'is', 'a', 'truth', 'universally']
21 ['firm', 'voice', 'and', 'never', 'allow', 'yourself', 'to', 'be', 'blinded', 'by', 'prejudice', 'i', 'hope', 'not', 'it', 'is', 'particularly', 'incumbent', 'on', 'those', 'who']
21 ['just', 'sense', 'of', 'shame', 'would', 'not', 'conceal', 'with', 'a', 'strong', 'prejudice', 'against', 'everything', 'he', 'might', 'say', 'she', 'began', 'his', 'account', 'of']
21 ['rest', 'of', 'his', 'conduct', 'who', 'will', 'believe', 'me', 'the', 'general', 'prejudice', 'against', 'mr', 'darcy', 'is', 'so', 'violent', 'that', 'it', 'would', 'be']
21 ['vain', 'mr', 'gardiner', 'highly', 'amused', 'by', 'the', 'kind', 'of', 'family', 'prejudice', 'to', 'which', 'he', 'attributed', 'her', 'excessive', 'commendation', 'of', 'her', 'master']
21 ['the', 'world', 'she', 'knew', 'it', 'was', 'a', 'circumstance', 'which', 'must', 'prejudice', 'her', 'against', 'him', 'i', 'am', 'certainly',

In [16]:
animals = ['ant','bug','cat']

for n,i in enumerate(animals):
    print(n,i)
    
animals[0:3]

0 ant
1 bug
2 cat


['ant', 'bug', 'cat']
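One detail in `get_kwic` is worth a quick check: the `max(0, n-window)` guard. Here is a minimal sketch, using a made-up word list, of what can go wrong without it. A negative slice start counts from the end of the list, so the slice quietly returns the wrong words. An `end` that runs past the end of the list is harmless, by contrast, because Python clamps it.

```python
# Toy list (hypothetical data) with the target term near the start
animals = ['ant', 'bug', 'cat', 'dog', 'eel']
n = 1          # position of 'bug'
window = 3

# Without the guard, the start index is -2, which Python reads as "two from the end"
unguarded = animals[n - window:n + window + 1]

# With the guard, the slice starts at the beginning of the list
guarded = animals[max(0, n - window):n + window + 1]

print(unguarded)
print(guarded)
```

The unguarded slice misses everything before the target term, which is exactly the context we wanted to keep.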

# From KWIC to collocates

For our purposes, a "collocate" just means any word that appears within some context window for a target term. The term "collocate" is sometimes used in more specialized ways. For instance, some researchers will only consider a word to be a collocate if it occurs _significantly_ more often near the target term. But for now, we'll keep it simple!

In [34]:
# Pick a target term
target_term = 'estate'

# Create a count dictionary for your collocates
counts = {}

# Get some kwics
kwics = get_kwic(target_term,words)

# Count the collocates
for k in kwics:
    for w in k:
        addtocountdict(w,counts)

In [None]:
# Example of a sentence to show why we can't just delete all occurrences of the target term!
"Rose is a rose is a rose is a rose"

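To see the point of the "rose" sentence, here is a small sketch that runs it through `get_kwic` (restated here so the cell runs on its own; the variable names `rose_words` and `rose_kwics` are just for illustration). With `excl_target=True`, only the central occurrence is dropped from each window, so the other occurrences of "rose" still count as collocates of "rose".

```python
# Restating get_kwic from above so this cell runs on its own
def get_kwic(sometargetterm, somelistofwords, window=10, excl_target=True):
    kwics = []
    for n, w in enumerate(somelistofwords):
        if w == sometargetterm:
            start = max(0, n - window)
            end = min(n + window + 1, len(somelistofwords))
            if excl_target:
                k = somelistofwords[start:n] + somelistofwords[n + 1:end]
            else:
                k = somelistofwords[start:end]
            kwics.append(k)
    return kwics

rose_words = "rose is a rose is a rose is a rose".split()

# Window of 3: each context drops only the central "rose"
rose_kwics = get_kwic('rose', rose_words, window=3)
for k in rose_kwics:
    print(k)

# Count how often "rose" shows up as a collocate of itself
rose_as_collocate = sum(k.count('rose') for k in rose_kwics)
print(rose_as_collocate)
```

If we had deleted every "rose" from each window instead, we would have lost the fact that the word keeps company with itself in this sentence.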
In [None]:
# This isn't important to learn today, but for convenience we can use it to sort a count dictionary

for i in sorted(counts,key = lambda i:counts[i],reverse=True):
    print(i,counts[i])
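If the `lambda` in the sort above looks unfamiliar, one alternative from the standard library is `collections.Counter`. This is only a sketch with made-up counts, not something the rest of the notebook depends on.

```python
from collections import Counter

# Hypothetical counts standing in for the `counts` dictionary built above
counts = {'the': 12, 'family': 3, 'of': 9, 'entailed': 2}

# most_common() returns (item, count) pairs sorted from highest to lowest count
top_three = Counter(counts).most_common(3)
for word, count in top_three:
    print(word, count)
```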

# Getting interesting frequencies for target terms

In [38]:
# Filter stopwords
stops = file2words('stopwords.txt')

int_counts = {}

# Consider converting to frequency

total_wc = sum(counts.values())

for word,count in counts.items():
    if word in stops:
        continue
    int_counts[word] = count/total_wc

In [None]:
# Here's the sorting function again

for i in sorted(int_counts,key = lambda i:int_counts[i],reverse=True):
    print(i,"\t",int_counts[i])

# Testing a collocate for distinctiveness

Here we cash in on the methods we learned in session 2. The trick is to adapt our Fisher's Exact Test so that it can handle a comparison between collocates and the larger text, rather than between one text and some others. Let's work through what we need to build the table that we'll use in the test. We can start with the general principles we covered last time:

* a = the number of times the word appeared in the corpus we're interested in
* b = the number of times any other word appeared in that corpus
* c = the number of times we would have expected the word to appear if it was evenly distributed across all our corpora
* d = the number of times any other word would have appeared if the word we're examining had been appearing at its expected rate

For collocates, this will translate to:

* a = the number of times the word appeared _within the window surrounding our target term_
* b = the total number of collocates minus a
* c = the number of times we would have expected the word to appear if it appeared _within the window surrounding our target term_ at the same rate as _within the text as a whole_
* d = the total number of collocates minus c
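To make the four cells concrete, here is a minimal sketch with made-up numbers. The helper names `collocate_table` and `fisher_greater` are hypothetical. The second function is a plain-Python stand-in for a one-sided Fisher's Exact Test, written out with `math.comb` so the arithmetic is visible. It is meant to illustrate the idea and is not a replacement for `scipy.stats.fisher_exact`.

```python
from math import comb

def collocate_table(observed, total_collocates, corpus_count, corpus_total):
    """Build a, b, c, d for one collocate, rounding c to a whole number."""
    r = corpus_count / corpus_total          # expected rate from the whole text
    a = observed
    b = total_collocates - a
    c = round(r * total_collocates)
    d = total_collocates - c
    return a, b, c, d

def fisher_greater(a, b, c, d):
    """One-sided p-value: chance of seeing a or more, given the table's margins."""
    row1, col1, total = a + b, a + c, a + b + c + d
    top = min(row1, col1)
    tail = sum(comb(col1, x) * comb(total - col1, row1 - x)
               for x in range(a, top + 1))
    return tail / comb(total, row1)

# Hypothetical numbers: a word seen 12 times among 400 collocates,
# and 600 times in a 120,000-word text
table = collocate_table(12, 400, 600, 120000)
p = fisher_greater(*table)
print(table, p)
```

Here the word is expected about 2 times in the window but shows up 12 times, so the p-value comes out small.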

As with our Beatles example in session 2, the tricky part is figuring out __c__. We need to know the expected rate of occurrence for the word we're testing (let's call it __r__ again). If we want to know the significant collocates of "estate" within _Pride and Prejudice_, we can set that rate using the full novel. The important thing to remember is that we need the rate not for "estate", but for whatever collocate we want to test. 

For example, let's say we want to see if "derbyshire" is a significant collocate for "estate" in _Pride and Prejudice_, even though it only appears 3 times in our KWIC. The word "derbyshire" appears 24 times in the novel, which is 121,567 words long. So our __r__ will be:

```Python
r = 24 / 121567
```

That comes to about 0.0002. Meanwhile the total number of collocates for "estate" is 400 (one easy way to get this is to sum the values of our collocate count dictionary). So that gives us:

* a = 3
* b = 400 - 3
* c = __r__ * 400
* d = 400 - c

Since __r__ * 400 is only about 0.08, c becomes 0 when we round it to the nearest whole number. That comes to:

* a = 3
* b = 397
* c = 0
* d = 400

If we run a one-sided Fisher's Exact Test on this (asking whether "derbyshire" appears _more_ often than expected), we get a p-value of about .12. So, even though Darcy's estate is in Derbyshire, the word "derbyshire" doesn't seem to be a significant collocate of "estate"!

That's kind of a boring result, so let's pick a different target term, and build a method for finding all of the distinctive collocates!

# Testing _all_ of the collocates for distinctiveness

In [40]:
# Here's the function we used last time for getting the p value via a Fisher's Exact Test

def get_fishers(someword,somecountdict,someratedict,alternative='greater'):
    r = someratedict[someword]
    wc = sum(somecountdict.values())
    a = somecountdict[someword]
    b = wc - a
    c = round(r*wc)
    d = wc-c
    p = fisher_exact([[a,b],[c,d]],alternative=alternative)[1]
    return p

In [68]:
# In this cell we'll set up everything we need to find the p value

# Find the counts of all words for the text
corpus_counts = {}

words = file2words('Austen_PrideAndPrejudice.txt')

for w in words:
    addtocountdict(w,corpus_counts)

# Make a rate dictionary for the corpus

total_wc = sum(corpus_counts.values())

rates = {}

for word,count in corpus_counts.items():
    rates[word] = count/total_wc

# Pick the target term
target_term = 'love'

# Get the KWIC of the target term
kwics = get_kwic(target_term,words,window=10)

# Count the collocates
coll_counts = {}

for k in kwics:
    for w in k:
        addtocountdict(w,coll_counts)


In [69]:
# Get MDC (most distinctive collocates) output

output_table = [['token_','count','p-value','obs/exp']]

alpha = .05

for word,count in coll_counts.items():
    p = get_fishers(word,coll_counts,rates)
    if p < alpha:
        exp = rates[word] * sum(coll_counts.values())
        new_row = [word,count,p,count/exp]
        output_table.append(new_row)

In [70]:
output_table

[['token_', 'count', 'p-value', 'obs/exp'],
 ['in', 62, 0.00018574988485609898, 2.2253054307326203],
 ['you', 33, 0.04796215281579177, 1.6351905142335408],
 ['him', 22, 0.039390892465470895, 1.931000274364269],
 ['i', 51, 0.016603229196204907, 1.6609204301305716],
 ['much', 14, 0.03144303770857081, 2.8597271230298755],
 ['love', 14, 0.0004777899180742989, 10.276162299239221]]

# Checking collocates in a larger corpus

You may have noticed that we're not getting a _ton_ of results for any given target term. That's because we're dealing with a fairly small amount of text—we only collect text when we find our target term, and even then we might not collect _much_ text. This means that collocate analysis often gets more interesting as we scale up the amount of text we examine. In the context of this session, we don't want to go _too_ far with that, since large corpora require more labor (and processing time) to store, share, analyze, and interpret. But we can get a little bit of a sense by zooming out to Austen's entire corpus.

In [71]:
# Name the directory containing the novels
sdir = 'AustenNovels'

# Get the filenames out of it
files = os.listdir(sdir)
files = [i for i in files if i.endswith('.txt')]
files = [os.path.join(sdir,i) for i in files]

files

['AustenNovels/Austen_PrideAndPrejudice.txt',
 'AustenNovels/Austen_Emma.txt',
 'AustenNovels/Austen_LadySusan.txt',
 'AustenNovels/Austen_SenseAndSensibility.txt',
 'AustenNovels/Austen_Persuasion.txt',
 'AustenNovels/Austen_MansfieldPark.txt',
 'AustenNovels/Austen_NorthangerAbbey.txt']

In [82]:
# In this cell we'll set up everything we need to run our Fisher's Exact Test in a bit

# Find the counts of all words for the text across all the novels

corpus_counts = {}

for f in files:
    words = file2words(f)
    for w in words:
        addtocountdict(w,corpus_counts)
    
# Make a rate dictionary for the corpus
total_wc = sum(corpus_counts.values())

rates = {}

for word,count in corpus_counts.items():
    rates[word] = count/total_wc

# Pick the target term
target_term = 'marriage'

# Get the KWIC of the target term across the whole corpus
kwics = []

for f in files:
    words = file2words(f)
    k = get_kwic(target_term,words,window=10)
    kwics.extend(k)

# Count the collocates
coll_counts = {}

for k in kwics:
    for w in k:
        addtocountdict(w,coll_counts)

In [83]:
# Get MDC (most distinctive collocates) output

output_table = [['token_','count','p-value','obs/exp']]

alpha = .05
stops = file2words('stopwords.txt')
cutoff = 5

for word,count in coll_counts.items():
    if count < cutoff:
        continue
    if word in stops:
        continue
    p = get_fishers(word,coll_counts,rates)
    if p < alpha:
        exp = rates[word] * sum(coll_counts.values())
        new_row = [word,count,p,count/exp]
        output_table.append(new_row)

In [84]:
output_table

[['token_', 'count', 'p-value', 'obs/exp'],
 ['happiness', 10, 0.019225294891341806, 4.354745778655754],
 ['offer', 9, 0.010708476050916701, 13.791397715988083],
 ['place', 11, 0.028592756055140594, 3.6096003898635476],
 ['imprudent', 5, 0.031215736529766322, 27.07200292397661],
 ['early', 7, 0.03509628893358559, 5.8913166984819565],
 ['since', 11, 0.011184564536754883, 5.076000548245614],
 ['fortune', 7, 0.03509628893358559, 5.860949086634112]]

# Checking several target terms

In [85]:
target_terms = ['father','mother','sister','brother','cousin','uncle','aunt','nephew','niece','grandfather','grandmother',
               'son','daughter','husband','wife']

In [86]:
# Some things won't change!

# Find the counts of all words for the text across all the novels
corpus_counts = {}

for f in files:
    words = file2words(f)
    for w in words:
        addtocountdict(w,corpus_counts)

# Make a rate dictionary for the corpus
total_wc = sum(corpus_counts.values())

rates = {}

for word,count in corpus_counts.items():
    rates[word] = count/total_wc

In [87]:
# For the KWIC, we'll now need to associate the results with specific target terms
kwic_d = {}

for f in files:
    words = file2words(f)
    for term in target_terms:
        if term not in kwic_d:
            kwic_d[term] = []
        k = get_kwic(term,words)
        kwic_d[term].extend(k)

In [94]:
# Same goes for our collocate counts!

all_coll_counts = {}

for term,kwic_data in kwic_d.items():
    if term not in all_coll_counts:
        all_coll_counts[term] = {}
    for k in kwic_data:
        for w in k:
            addtocountdict(w,all_coll_counts[term])

In [98]:
# Get MDC (most distinctive collocates) output

output_table = [['target_term','token_','count','p-value','obs/exp']]

alpha = .05
stops = file2words('stopwords.txt')
cutoff = 5

for term,data in all_coll_counts.items():
    for word,count in data.items():
        if count < cutoff:
            continue
        if word in stops:
            continue
        p = get_fishers(word,data,rates)
        if p < alpha:
            exp = rates[word] * sum(data.values())
            new_row = [term,word,count,p,count/exp]
            output_table.append(new_row)

In [101]:
# Writing out the results for easier perusal

with open('austen_coll_data.csv','w') as output:
    for row in output_table:
        str_row = [str(i) for i in row]
        output_str = ",".join(str_row) + "\n"
        output.write(output_str)
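Joining with commas works here because our cleaned tokens never contain commas. If your rows might contain commas or quotation marks, the standard library's `csv` module handles the escaping for you. A minimal sketch, using a made-up row and a temporary directory (both are assumptions for illustration):

```python
import csv
import os
import tempfile

# Hypothetical rows in the same shape as output_table above
output_table = [['target_term', 'token_', 'count', 'p-value', 'obs/exp'],
                ['father', 'daughter', 12, 0.003, 4.2]]

# Writing to a temporary directory here so the sketch leaves no files behind
out_path = os.path.join(tempfile.mkdtemp(), 'austen_coll_data.csv')

# newline='' is what the csv module's documentation recommends when opening files
with open(out_path, 'w', newline='', encoding='utf-8') as output:
    writer = csv.writer(output)
    writer.writerows(output_table)

# Read the file back to confirm the round trip (everything comes back as text)
with open(out_path, newline='', encoding='utf-8') as source:
    rows = list(csv.reader(source))
print(rows)
```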

# Comparing terms across corpora

Comparing terms across different corpora raises some complicated, even profound questions. If, for instance, we try to compare Austen’s language to that of F. Scott Fitzgerald, we quickly run into some insurmountable difficulties. When it comes time to analyze “distinctiveness” for any given target term, on what should we base our expectations?  Let’s consider two possibilities. First, we can set the base rate by considering all of the novels in both corpora. (For simplicity's sake, we'll go back to thinking about one target term at a time.)

In [None]:
target_term = 'love'

In [None]:
# First, let's grab all of the file names from both directories
# Let's also keep track of which corpus contains which files


# Then we can proceed as before

# Find the counts of all words for the text across all the novels

# Make a rate dictionary for the corpus

# For the KWIC, we'll now need to associate the results with specific corpora
    
# For the collocate counts, we'll likewise separate by corpus


In [None]:
alpha = .05
cutoff = 5

output_table_1 = [['target_corpus','collocate','count','p-value','obs/exp']]

These results are interesting, but it’s difficult to say what they’re showing us about the target term, vs what they’re telling us about the difference between Austen and Fitzgerald. After all, one is a British novelist from the turn of the nineteenth century, and the other is an American writing in the Jazz Age (even the name of the era involves a word Austen didn’t have). Do the differences we find reflect different understandings of the target term, or differences between the two authors and their language contexts? Just consider the difference between "cannot" and "can't"!

To resolve this, we could try to base our expectations on each author’s corpus. 

In [None]:
# First, we get counts and rates for each corpus

# Since all we needed to change was our expectations, or rates, we can then proceed exactly as before
# For the KWIC, we'll still need to associate the results with specific corpora
    
# For the collocate counts, we'll likewise separate by corpus


In [None]:
alpha = .05
cutoff = 5

output_table_2 = [['target_corpus','collocate','count','p-value','obs/exp']]
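The only thing that changes between the two approaches in this section is where the expected rates come from. Here is a minimal sketch of that difference, using two tiny made-up word lists in place of the novels. The names `corpora`, `make_rates`, `pooled_rates`, and `corpus_rates` are hypothetical, and the toy words are not run through `alphaclean`.

```python
# Toy word lists (hypothetical data) standing in for each author's novels
corpora = {
    'austen': "i cannot love him and i cannot say why".split(),
    'fitzgerald': "i can't love her and i can't stop".split(),
}

def make_rates(list_of_words):
    """Turn a list of words into a {word: rate} dictionary."""
    counts = {}
    for w in list_of_words:
        counts[w] = counts.get(w, 0) + 1
    total = len(list_of_words)
    return {w: c / total for w, c in counts.items()}

# Option 1: one pooled rate dictionary built from both corpora together
pooled_words = [w for words in corpora.values() for w in words]
pooled_rates = make_rates(pooled_words)

# Option 2: a separate rate dictionary for each corpus
corpus_rates = {name: make_rates(words) for name, words in corpora.items()}

print(pooled_rates['cannot'], corpus_rates['austen']['cannot'])
```

With pooled rates, "cannot" looks rarer than it really is for the toy Austen list, because the other list never uses it. With per-corpus rates, each author's collocates are measured against that author's own habits.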


In [None]:
# You don't need to learn this stuff, but here we make sets showing the unique (corpus,collocate) pairs from each output table
# This enables us to see the difference between the two methods more easily

table_1_pairs = set()
table_2_pairs = set()

for row in output_table_1:
    table_1_pairs.add((row[0],row[1]))
    
for row in output_table_2:
    table_2_pairs.add((row[0],row[1]))

In [None]:
# Here we see (corpus,collocate) pairs that appear in the first output_table, but not the second
table_1_pairs - table_2_pairs

In [None]:
# Here we see (corpus,collocate) pairs that appear in the second output_table, but not the first
table_2_pairs - table_1_pairs
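If set subtraction is new to you, here is a small sketch with placeholder pairs (these are not real results). The header row of each output table also ends up in both sets, but since it is identical in the two tables it drops out of both differences.

```python
# Placeholder (corpus, collocate) pairs, not real results
table_1_pairs = {('target_corpus', 'collocate'),
                 ('corpus_a', 'word1'), ('corpus_a', 'word2'), ('corpus_b', 'word3')}
table_2_pairs = {('target_corpus', 'collocate'),
                 ('corpus_a', 'word2'), ('corpus_b', 'word3'), ('corpus_b', 'word4')}

only_in_1 = table_1_pairs - table_2_pairs   # found by the first method only
only_in_2 = table_2_pairs - table_1_pairs   # found by the second method only
in_both = table_1_pairs & table_2_pairs     # found by both methods

print(only_in_1)
print(only_in_2)
print(in_both)
```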

Now we can see how each author uses the target term relative to the rest of that author’s corpus. Fitzgerald isn’t saying "beauty" because it’s a new word, but because he uses "beauty" near the target term more than he uses it in the rest of his corpus. But what does that tell us? Does "love" mean something different for Austen and Fitzgerald? Does "beauty"? If the language has changed, does any shared collocate mean the same thing? Does any word?

With Most Distinctive Words (MDW), the method from session 2, we could skip a lot of these questions by comparing the novels in their entirety, rather than zooming in on target terms. But that wouldn’t solve the problem; it would only ignore it. 

Now we've arrived, more or less, back at the beginning. We can find the words that go with any target term that interests us. We can even understand that target term with respect to these surrounding words. Have we then arrived at the meaning of our target terms? I think we're at least in range of it. But when you consider the decisions we have to make about expectations, comparisons, corpora, word cleaning, counting... and much more — when you consider all of this, you see how many different ways there are to _find_ meaning. In other words (no pun intended), it's tough to say what meaning means. When we look for it, we will always have to consider context.