<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [J.D. Porter](https://) for the 2023 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email porterjd@upenn.sas.edu<br />
____

# `Finding Word Meaning Through Context` `2`

This is lesson `2` of 3 in the educational series on `finding word meaning in context`. This notebook is intended `to teach the basic steps involved in working with text files in Python, and introduce the concepts of programmatically reading a text file, analyzing the words it contains, and writing out some results`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` 

**Difficulty:** `Beginner`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, functions, lists, dictionaries)
* The content from Session 1:
    * Open a text file
    * Turn its text into a list of words
    * Clean up the words a little (remove punctuation, change the case, etc.)
    * Count the occurrences of each word

```

**Knowledge Recommended:**
```
* n/a
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand the basic ideas of distinctive differences between texts
2. Choose parameters for running a "most distinctive words" function
3. Analyze the results of a "most distinctive words" check

```
**Research Pipeline:**
```
1. Gather some files, as well as a metadata table
2. Compare the word frequencies, and the statistical significance of their variance, in those files
3. If you have written out some of your data, explore it in a program like Excel or Google Sheets
4. Interpret!
```
___

# Required Python Libraries

* To keep things simple, we will try to work with very few libraries in this notebook. This time we'll be using:
    * os (to work with files and directories)
    * scipy.stats (to measure statistical significance for some of our results)

## Install Required Libraries

In [64]:
### Import Libraries ###
import os
from scipy.stats import fisher_exact

# Required Data

**Data Format:** 
* plain text (.txt)
* maybe basic spreadsheet (.tsv)

**Data Source:**

Included files (though you may supplement these whenever you feel comfortable doing so!)

**Data Quality/Bias:**

Today's texts come from Wikipedia. They are subject to the usual biases associated with Wikipedia editors. I downloaded them on July 13, 2023, so they probably won't look exactly the same if you visit the site.

**Data Description:**

This lesson uses textual data in .txt format (utf-8 encoding) from Wikipedia.

# Introduction to Session 2

Today's session might feel like a bit of a detour, but I want to cover this material today for a few reasons. 

First, we're going to need it at some point. The skills we learned in Session 1 have put us pretty close to being able to extract collocates—the context words that, we suspect, may help us interpret the meanings of words that interest us (let's call those 'target words'). But when we've found our collocates, how will we know which ones matter? After all, for any given target word, there will surely be many collocates, some of them the workaday words that tend to show up everywhere ("the","and","of", etc.), and, on the other end of the spectrum, some of them so infrequent that we might hesitate to ascribe much importance to them. To make sense of our data, we're going to want to be able to find words that are showing up _significantly_ more often near our target words.

Second, this material is more complicated than most of what we're doing. On one hand, it might be a little bit of a challenge to jump right into this stuff, especially for those of you who are newer to programming or to Python. On the other hand, if we need to slow down, we can pick up the slack in Session 3!

The last reason is more rhetorical. This material takes a little time to cover. By doing it now, we ensure that we get to collocates in the last session. It just feels more fun to _arrive_ at collocates on the last day, right?

# Review: Getting word counts from texts

In [71]:
# Start with a filename


In [None]:
# Open it and get the text


In [None]:
# Turn the text into words


In [None]:
# Clean the words

# We can use this function for today
def alphaclean(someword):
    chars = [i for i in someword if i.isalpha()]
    cleanword = ''.join(chars)
    cleanword = cleanword.lower()
    return cleanword

In [None]:
# Count the words!


# Comparing files

In [None]:
# Getting the filenames from a directory


In [None]:
# Seeing how long each file is


In [None]:
# Counting a target word


In [None]:
# Moving from counts to frequencies


# Figuring out which counts _count_...

Raw counts have pretty limited use when it comes to comparing a word across texts. They're not _nothing_: If someone uses a word a lot, they really did use the word a lot! Even if the text is long, you as the reader really did get exposed to all of those uses! But they don't tell us much about the _importance_ of a word in a text, especially if we want to compare across works of different lengths.

Frequencies are a big improvement, since they scale for length. With frequencies, we can say that the Wikipedia article for Paul McCartney uses the word "Liverpool" more often _per word_; that is, "Liverpool" takes up more of the text for McCartney than it does for the other lads. Yet even frequencies can be somewhat difficult to interpret. Just how big a difference is it to say "Liverpool" 14 times per 10,000 words compared to 10.3? Or to 8.7?

There are several ways to approach this problem, and to my knowledge, none is exactly standard yet, at least in my corner of text analysis (Digital Humanities with a literary critical focus). You can find a nice summary of several well-established approaches on the [Zeta Project website](https://zeta-project.eu/en/keyness-measures/). You may have heard of some of these: chi-squared tests, TF-IDF scores, Burrows's Zeta... the list goes on.

Although the approaches differ in important ways, they have in common the goal of finding _distinctive_ words by measuring which words show up in which places more often than we'd expect. More often? Expect? What do these terms mean? What exactly is our expectation based on? Well, it depends! We have so many methods precisely because of the ambiguity of this question, along with various practical considerations (e.g., some methods are better at finding "content" words, some are better at handling corpora of radically different sizes, and so on). 

The method I prefer was developed at the Stanford Literary Lab, and involves a modified form of a Fisher's Exact Test. For today, we'll focus on this method, for a couple reasons:
* It's the method I know best, so I'm better positioned to discuss it.
* To me the results are usually quite legible—this is one of those "the proof is in the pudding" situations where I've continued using this approach because I've liked the output.



# Fisher's Exact Test

Fisher's Exact Test was developed by Ronald Fisher when a colleague of his, Muriel Bristol, claimed that she could tell whether milk was added to a tea cup before or after the tea was poured. Fisher tested her with eight cups and she got them all right. In a roundabout way, this led Fisher to develop a test that could measure the statistical significance of results for experiments like this. (Everyone who teaches the Fisher's Exact Test is required to share this story.)


The math behind this test is complicated, so we won't go into much detail here. If you're interested, I think [Wolfram MathWorld](https://mathworld.wolfram.com/FishersExactTest.html) has a pretty approachable explanation. They give the example of trying to figure out whether math and biology papers are appearing significantly more often in _Mathematics Magazine_ and _Science_ (respectively). The upshot is that running a Fisher's Exact Test requires creating a 2x2 grid, which in Python winds up being a list of lists (this is the same principle behind our approach to making an output table in Session 1). It looks roughly like this:
```Python
grid = [[a,b],[c,d]]
```
which, if you printed each row, would be:
```Python
[a,b]
[c,d]
```

Figuring out what we need in order to fill in our variables (a,b,c, & d) will give us a decent frame for understanding what sort of information the test measures, and what its results mean for us. In the example above, from Wolfram MathWorld:

* a = number of math papers in _Math Magazine_
* b = number of math papers in _Science_
* c = number of biology papers in _Math Magazine_
* d = number of biology papers in _Science_
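
To make the layout concrete, here is a small sketch of that grid as a list of lists. The counts are invented purely for illustration; they are not the figures from the MathWorld example.

```python
# Made-up counts for illustration only
a = 6   # math papers in Math Magazine
b = 1   # math papers in Science
c = 2   # biology papers in Math Magazine
d = 7   # biology papers in Science

grid = [[a, b], [c, d]]

# Printing each row shows the 2x2 layout
for row in grid:
    print(row)

# Row totals: all math papers, all biology papers
row_totals = [sum(row) for row in grid]
# Column totals: all Math Magazine papers, all Science papers
col_totals = [grid[0][j] + grid[1][j] for j in range(2)]
print(row_totals, col_totals)
```

The row and column totals are what the test holds fixed when it asks how surprising the particular split inside the grid is.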

For our purposes in text analysis, I'll be following an approach laid out in several Stanford Literary Lab pamphlets, perhaps most notably ["Style at the Scale of the Sentence"](https://litlab.stanford.edu/LiteraryLabPamphlet5.pdf).[$^{1}$](#1) It's a little unusual compared to the example above, but essentially it's trying to get at the question: Does the target word appear more often in one corpus than another? So to that end:

* a = the number of times the word appeared in the corpus we're interested in
* b = the number of times any other word appeared in that corpus
* c = the number of times we would have expected the word to appear if it were evenly distributed across all our corpora
* d = the number of times any other word would have appeared if the target word had been appearing at its expected rate

Three of these values are pretty easy to find; only **c** is a little difficult. To get **c**, we need some kind of expected rate of appearance for our target word. My usual approach is to take the total appearances of the target word across all corpora and divide by the total word count of all the corpora. That's basically getting us the general frequency of the word. The idea is that if the target word were evenly distributed, then we'd expect it to appear at the same rate _per-word_ in every corpus. Let's call the rate **r**. So, for our Beatles example above, "liverpool" appears 46 times, and the total wordcount of all four articles is 47,688. To get **r**, we would just do:

```Python
r = 46 / 47688
```
Or in other words, **r** is about **.00096**. Once we have our **r**, it becomes easy to get **c**, which means all of our variables are now pretty easy. Let's see if Paul's article really does mention Liverpool _significantly_ more often than the other articles. And just as a reminder, Paul's article has a total wordcount of 14,254 (let's call that **wc**):
* a = The number of times Paul's article says "liverpool"
* b = wc - a
* c = r * wc (rounded to a whole number, since the test works on counts)
* d = wc - c

Thus:

```Python
a = 20
b = 14234
c = 14
d = 14240
```
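
As a check on the arithmetic, the four values can be derived from the figures quoted above. This is only a sketch of the calculation as described in the text; the variable names `total_target` and `total_words` are chosen here for readability.

```python
# Figures quoted in the text above
total_target = 46      # "liverpool" across all four articles
total_words = 47688    # combined word count of the four articles
wc = 14254             # word count of Paul's article
a = 20                 # "liverpool" in Paul's article

r = total_target / total_words   # expected rate per word
b = wc - a                       # every other word in Paul's article
c = round(r * wc)                # expected count, rounded to a whole number
d = wc - c                       # every other word at the expected rate

print(round(r, 5))
print([[a, b], [c, d]])
```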

At this point, the complex calculation happens. But fortunately for us, Python makes this much easier. This brings us to the end of this wall of text, because now we're going to run some actual code!




##### *Footnotes* #####

1. <a id="1"></a> Allison, Sarah, Marissa Gemma, Ryan Heuser, Franco Moretti, Amir Tevel, and Irena Yamboliev. "Style at the Scale of the Sentence." Stanford Literary Lab pamphlet series, no. 5.


In [None]:
# Running the scipy.stats fisher_exact test on our "liverpool" data


In this case the resulting p-value is about .19. 
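
A minimal sketch of the call is below. It assumes a one-sided test (`alternative='greater'`), which asks whether the word appears *more* often than expected; the p-value quoted above is consistent with that choice, whereas scipy's default two-sided test would give a noticeably larger number.

```python
from scipy.stats import fisher_exact

# The grid from the "liverpool" example: [[a, b], [c, d]]
grid = [[20, 14234], [14, 14240]]

# fisher_exact returns the odds ratio first and the p-value second
oddsratio, p = fisher_exact(grid, alternative='greater')
print(p)
```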

We won't go into p-values very much in this session, but the general idea (at least as we're using it) is that a p-value captures how likely it is that a result at least this lopsided would turn up by chance alone, that is, if the word really were spread evenly. A lower p-value indicates that the result probably didn't occur randomly, or in other words that our hypothesized explanation is more apt to be a _good_ explanation.

Generally speaking, people set their p-value cutoff (often called _alpha_) at .05. Anything below that is considered "statistically significant". There is no hard and fast reason why alpha needs to be .05, and some have argued that in some cases it should be much lower. Still, as far as I'm aware, .05 remains a fairly standard alpha.

The nuances of proper alpha thresholds don't even matter in this case, though: Our p-value for "liverpool" in Paul's article is _well_ above alpha! That means we don't have good evidence that Paul's article says "liverpool" meaningfully more often than those for the other lads; a gap of this size could easily arise by chance.

Let's try another word for practice: "bass"

In [None]:
# Let's copy some of our "liverpool" code from above and switch the target word


# Finding all of the distinctive words

In [None]:
# First, let's go through this function that will get the p-value for a target word and figure out what we need

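def get_fishers(someword,somecountdict,someratedict,alternative='greater'):
    # r: the word's overall rate (total count / total words across all corpora)
    r = someratedict[someword]
    # wc: total word count of the corpus (here, document) we're testing
    wc = sum(somecountdict.values())
    # a: observed count of the target word; .get() gives 0 if the word never appears here
    a = somecountdict.get(someword, 0)
    # b: every other word in this corpus
    b = wc - a
    # c: expected count at the overall rate, rounded because the test needs whole numbers
    c = round(r*wc)
    # d: every other word, had the target appeared at its expected rate
    d = wc-c
    # fisher_exact returns (odds ratio, p-value); we keep the p-value
    p = fisher_exact([[a,b],[c,d]],alternative=alternative)[1]
    return p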
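
To see what the function needs, here is a sketch that feeds it the "liverpool" numbers from earlier. The count dictionary is a stand-in: rather than listing every word in Paul's article, it lumps all the other words under a single placeholder key, which is enough for the function to compute the article's total word count. The function itself is repeated so the block runs by itself.

```python
from scipy.stats import fisher_exact

def get_fishers(someword, somecountdict, someratedict, alternative='greater'):
    r = someratedict[someword]
    wc = sum(somecountdict.values())
    a = somecountdict[someword]
    b = wc - a
    c = round(r * wc)
    d = wc - c
    p = fisher_exact([[a, b], [c, d]], alternative=alternative)[1]
    return p

# Stand-in count dictionary for Paul's article (placeholder key for all other words)
paul_counts = {"liverpool": 20, "<everything else>": 14234}

# Rate dictionary: the word's overall rate across all four articles
rates = {"liverpool": 46 / 47688}

p = get_fishers("liverpool", paul_counts, rates)
print(p)
```

So the function needs two things per target word: a count dictionary for the document being tested, and a rate dictionary built from the whole collection.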

In [63]:
# To make life easier later, we may want to make a nested dictionary with results for every subcorpus (here, just documents)

# As we go, if we record the overall count of each word, it will be easier to make a rate dictionary later

        
# Now we can make a rate dictionary using the info we've already collected


### Make some decisions

* Pick a cutoff (words must appear x times or we ignore them)
    * A higher cutoff makes the process much faster
    * You may also find infrequent words less interesting to analyze (although maybe not!)
* Choose an alpha (.05 is standard)
* Decide whether you want to exclude stopwords

In [None]:
cutoff = 5
exclude_stops = True
alpha = .05
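
Here is a rough sketch of how those three decisions might be applied. Everything in the data is invented for illustration: the two count dictionaries, the three-word stopword list, and the choice to apply the cutoff to a word's count within each document (you could just as reasonably apply it to the overall count).

```python
from scipy.stats import fisher_exact

cutoff = 5
exclude_stops = True
alpha = .05

stops = {"the", "and", "of"}   # tiny hypothetical stopword list

# Made-up word counts for two documents of 1,000 words each
countdicts = {
    "paul": {"the": 600, "bass": 40, "drums": 2, "song": 358},
    "ringo": {"the": 600, "bass": 2, "drums": 40, "song": 358},
}

# Overall counts and rates across both documents
overall = {}
for counts in countdicts.values():
    for w, n in counts.items():
        overall[w] = overall.get(w, 0) + n
total_words = sum(overall.values())
ratedict = {w: n / total_words for w, n in overall.items()}

distinctive = {}
for name, counts in countdicts.items():
    wc = sum(counts.values())
    found = []
    for w, a in counts.items():
        if a < cutoff:
            continue                      # too rare in this document
        if exclude_stops and w in stops:
            continue                      # skip stopwords if we chose to
        c = round(ratedict[w] * wc)       # expected count at the overall rate
        p = fisher_exact([[a, wc - a], [c, wc - c]], alternative='greater')[1]
        if p < alpha:
            found.append((w, p))
    # Most distinctive (lowest p-value) first
    distinctive[name] = sorted(found, key=lambda pair: pair[1])

for name, found in distinctive.items():
    print(name, [w for w, p in found])
```

Raising the cutoff or lowering alpha will shrink these lists; trying a few settings is a good way to get a feel for how sensitive the results are.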

# Working with the results