<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by J.D. Porter for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).
<br />
____

# `Small Language Models` `2`

This is lesson `2` of 3 in the educational series on `small language models`. This notebook is intended `to teach methods for finding key words in context (KWIC), moving from there to counting collocates, and restructuring collocate data as an array`. 

**Skills:** 
* Text analysis
* Python

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial`

**Difficulty:** `Beginner`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, functions, lists, dictionaries)
```

**Knowledge Recommended:**
```
n/a
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Find the context for a target term in a text
2. Count the collocates for a target term in a text
3. Structure collocate information as an array
```
**Research Pipeline:**
```
If you want to try some of this on your own, you can:
1. Gather some texts and save them on your machine as .txt files
2. Use whatever steps you're interested in from this notebook
3. Interpret!
```
___

# Required Python Libraries
 * To keep things simple, we will try to work with very few libraries in this notebook (just 'os', which you do not need to install, though you do need to import it)

In [None]:
### Import Libraries ###
import os

# Required Data

**Data Format:** 
* plain text (.txt)

**Data Source:**
* included files (though you may supplement these whenever you feel comfortable doing so!)

**Data Quality/Bias:**

Included texts are from freely available sources like Project Gutenberg and Wikipedia. They have not been vetted for textual accuracy relative to, say, a scholarly edition.

**Data Description:**

This lesson uses textual data in .txt format (utf-8 encoding) from various sources.

# Introduction...

Here are a few ideas mentioned in Monday's session:
 * "we can capture information about the meanings of words by measuring **how they go together in texts**"
 * "if you know **how words are distributed across a big corpus** of language, you can determine their semantic properties (and possibly other components of meaning as well)"
 * "we look up **how often the words** "penguin", "duck", and "eagle" **appear near the word** "flies""

Today we'll get more specific about what it means for words to "go together", to "appear near" each other, or to be "distributed across a corpus". On a philosophical level, these ideas are rich and complex. On a practical level, though, exploring them is surprisingly simple.

We'll begin by thinking about Key Words in Context (KWIC), move on to collocates, and then end by rearranging our data into an array (if this term is unfamiliar, you can think of it as a table, and you'll have the idea). The first two steps are useful in their own right; you can do a lot of productive analysis even if you stop at either. The last is really more of a preparatory step for Friday. But all three help us see how we can quantify the theoretical ideas expressed above.

# KWIC


Key Words in Context, or KWIC, is a term for finding the words that surround some target term. A list of KWICs is similar to a concordance. The steps for producing KWICs are pretty straightforward:

1. Decide on a window size, *n*
2. Find every instance of a target term
3. Grab all the words *n* to the left and *n* to the right of the target term

Let's say our target term is "truth", and we have the following passage:

"It is a **truth** universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this **truth** is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters."

If our window size is 3, we get the following KWICs:

    it is a truth universally acknowledged that
    a neighbourhood this truth is so well

Or with a window size of 10, we'd have:

    it is a truth universally acknowledged that a single man in possession of a
    man may be on his first entering a neighbourhood this truth is so well fixed in the minds of the surrounding

Not too complicated! There's a lot to decide (how big should the window be? Should we include the target term? How do we handle beginnings and endings?). But the gist is simple, and the payoff is pretty clear too: This helps us isolate a term and see something about how it's used. So let's dig in!

We'll begin by turning a text file into a list of words.

In [None]:
# Here are some useful functions we built on Monday

# Takes a txt filename and returns a list of its words
def file2words(somefilename,clean = True):
    # The lesson's files are utf-8, so we say so explicitly when opening them
    with open(somefilename, encoding='utf-8') as source:
        text = source.read()
    words = text.split()
    if clean:
        words = [cleanword(w) for w in words]
    return words

# Takes a word and returns a cleaned up version of it
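# This works a little differently than the version from Monday:
# it strips non-letters from both ends of the word only
# (the remove_apostrophe parameter is not used in this version)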
def cleanword(someword,remove_apostrophe=True):
    word = someword.casefold()
    while word and not word[0].isalpha():
        word = word[1:]
    while word and not word[-1].isalpha():
        word = word[:-1]
    return word

# Takes a count dictionary and returns a sorted list of its key,value pairs
def sortdict(somecountdict):
    s = sorted(somecountdict,key = lambda i:somecountdict[i],reverse=True)
    sorted_counts = [(word,somecountdict[word]) for word in s]
    return sorted_counts

In [8]:
# We'll work with a small passage for now

passage = "It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters."

In [None]:
# You can use 'enumerate' to keep track of where you are as you iterate through something

a = 'apple'
l = ['ant','bug','cat','dog','elk']
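
Here is one way the `enumerate` demo might go. This is just a small sketch using the two variables defined above; the names `position`, `letter`, `animal`, and `pairs` are made up for illustration.

```python
a = 'apple'
l = ['ant', 'bug', 'cat', 'dog', 'elk']

# enumerate yields (position, item) pairs, counting from 0
for position, letter in enumerate(a):
    print(position, letter)

# The same thing works for a list
for position, animal in enumerate(l):
    print(position, animal)

# Collecting the pairs in a list makes the structure explicit
pairs = list(enumerate(l))
print(pairs[2])  # (2, 'cat')
```

Knowing the position of each word is what will let us reach a few steps to its left and right.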

In [None]:
# Let's build a KWIC function together
# Remember that in Python, lists retain order!
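
One possible version of the KWIC function is sketched below. It follows the three steps listed earlier (pick a window, find each instance, grab the neighbours). The name `get_kwics` and its parameters are assumptions made for this sketch, and `cleanword` is repeated here only so the example runs on its own.

```python
def cleanword(someword):
    # Same idea as the lesson's cleanword: lowercase, trim non-letters at the edges
    word = someword.casefold()
    while word and not word[0].isalpha():
        word = word[1:]
    while word and not word[-1].isalpha():
        word = word[:-1]
    return word

def get_kwics(words, target, window=3):
    # Returns one list of words per occurrence of target
    kwics = []
    for i, word in enumerate(words):
        if word == target:
            # max() keeps us from slicing with a negative start near the beginning
            start = max(0, i - window)
            kwics.append(words[start:i + window + 1])
    return kwics

passage = "It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters."

words = [cleanword(w) for w in passage.split()]
kwics = get_kwics(words, 'truth', window=3)
for kwic in kwics:
    print(' '.join(kwic))
# it is a truth universally acknowledged that
# a neighbourhood this truth is so well
```

Slicing past the end of a list is safe in Python, so only the left edge needs the `max(0, ...)` guard. Near the start or end of a text, a KWIC simply comes out shorter.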



In [None]:
# Seeing KWICs in action in a file

# Getting KWICs for multiple terms
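
For several target terms, one option is to make a single pass through the words and file each KWIC under its term. The function name `get_kwics_multi` is hypothetical, and the short word list stands in for a real file.

```python
def get_kwics_multi(words, targets, window=3):
    # Maps each target term to the list of its KWICs
    targets = set(targets)
    kwics = {term: [] for term in targets}
    for i, word in enumerate(words):
        if word in targets:
            start = max(0, i - window)
            kwics[word].append(words[start:i + window + 1])
    return kwics

words = ("it is a truth universally acknowledged that a single man "
         "in possession of a good fortune must be in want of a wife").split()

results = get_kwics_multi(words, ['truth', 'man'], window=2)
for term in ['truth', 'man']:
    for kwic in results[term]:
        print(term, '|', ' '.join(kwic))
# truth | is a truth universally acknowledged
# man | a single man in possession
```

Going through the text once, rather than once per term, starts to matter when the list of targets gets long.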


In [None]:
# Seeing KWICs in action for a larger corpus
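
For a corpus, a rough sketch is to loop over the .txt files in a folder and collect KWICs per file. So that it runs anywhere, the example below builds a tiny throwaway corpus in a temporary directory; the file names and the function `corpus_kwics` are assumptions for illustration, and you would pass in your own folder instead.

```python
import os
import tempfile

def cleanword(someword):
    word = someword.casefold()
    while word and not word[0].isalpha():
        word = word[1:]
    while word and not word[-1].isalpha():
        word = word[:-1]
    return word

def get_kwics(words, target, window=3):
    kwics = []
    for i, word in enumerate(words):
        if word == target:
            kwics.append(words[max(0, i - window):i + window + 1])
    return kwics

def corpus_kwics(directory, target, window=3):
    # Maps each .txt filename in the directory to that file's KWICs
    results = {}
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.txt'):
            path = os.path.join(directory, filename)
            with open(path, encoding='utf-8') as source:
                words = [cleanword(w) for w in source.read().split()]
            results[filename] = get_kwics(words, target, window)
    return results

# Build a two-file demo corpus (stand-in for a real folder of texts)
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {'one.txt': 'The truth is rarely pure and never simple.',
               'two.txt': 'Tell all the truth but tell it slant.'}
    for name, text in samples.items():
        with open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as out:
            out.write(text)
    found = corpus_kwics(demo_dir, 'truth', window=2)

for filename, kwics in found.items():
    for kwic in kwics:
        print(filename, '|', ' '.join(kwic))
# one.txt | the truth is rarely
# two.txt | all the truth but tell
```

Keeping the filename with each KWIC lets you trace any interesting context back to its source text.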


# Collocates

The **collocates** of a term are the words that appear near it in some corpus. That is, they're the words that are "co-located" with the term. Yet another way to put it is that for any target term, its collocates are all of the words in its KWICs. For instance, looking at the KWICs for "truth" in our passage with window size 3:

    it is a [truth] universally acknowledged that
    a neighbourhood this [truth] is so well

We'd get the following collocates (leaving out the target term itself):

collocate | count
----|----
it | 1
is | 2
a | 2
universally | 1
acknowledged | 1
that | 1
neighbourhood | 1
this | 1
so | 1
well | 1

As you can see, what counts as a collocate depends on the window size. We'd get a very different table with a window of 10 or 100 words, and in some applications people even zoom out to the document level. In that case, "rightful" would be a collocate of "uniting" in *Pride and Prejudice*, even though the former never appears again after the first two sentences, and the latter never appears until the last sentence!

Since we already have a way to find KWICs, we'll use it as a step in finding collocates. But it's worth noting that you don't *have* to go through KWICs to get to collocates. You're always collecting collocates from the window surrounding a word, but you don't necessarily need to record all of the relevant contexts—it's fine to skip straight to the counts, if that's all you want.

In [None]:
# Let's build a collocate function together, using our existing KWICs function

# Take a list of KWICs and return a dictionary of word counts
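
A minimal sketch of the counting step, assuming each KWIC is a list of words like the two shown for "truth" above. The name `count_collocates` is made up for this example. Note that this simple version drops every copy of the target term, which is a simplification.

```python
def count_collocates(kwics, target):
    # Tally every word in every KWIC, skipping the target term itself
    counts = {}
    for kwic in kwics:
        for word in kwic:
            if word == target:
                continue
            counts[word] = counts.get(word, 0) + 1
    return counts

# The two window-3 KWICs for "truth" from the passage
kwics = [
    ['it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that'],
    ['a', 'neighbourhood', 'this', 'truth', 'is', 'so', 'well'],
]

counts = count_collocates(kwics, 'truth')

# Sort from most to least frequent, as sortdict does
for word, n in sorted(counts.items(), key=lambda pair: pair[1], reverse=True):
    print(word, n)
```

The result matches the table above: "is" and "a" each appear twice, and the other eight words once.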

In [None]:
# What about the target term itself?
sentence = "Rose is a rose is a rose is a rose."
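
The sentence above shows why the target term needs care: other copies of "rose" fall inside each window. One possible approach, sketched here, is to skip only the centre position rather than every word equal to the target. The function name `collocates_by_position` is an assumption, and the cleaning is reduced to what this one sentence needs.

```python
def collocates_by_position(words, target, window=3):
    # Count neighbours of each occurrence of target, leaving out only
    # the occurrence at the centre of the window
    counts = {}
    for i, word in enumerate(words):
        if word != target:
            continue
        start = max(0, i - window)
        neighbours = words[start:i] + words[i + 1:i + window + 1]
        for neighbour in neighbours:
            counts[neighbour] = counts.get(neighbour, 0) + 1
    return counts

sentence = "Rose is a rose is a rose is a rose."
words = [w.strip('.,').casefold() for w in sentence.split()]

wide = collocates_by_position(words, 'rose', window=3)
narrow = collocates_by_position(words, 'rose', window=2)
print(wide)    # {'is': 6, 'a': 6, 'rose': 6}
print(narrow)  # {'is': 6, 'a': 6}
```

With a window of 3, "rose" is a collocate of itself six times; with a window of 2, the neighbouring roses are just out of reach. Whether a term should count as its own collocate is a choice you make, not a fact about the text.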

In [None]:
# Running the collocate procedure on a file


In [None]:
# Working with multiple target_terms


In [None]:
# Looking at some of the results from Gatsby


In [None]:
# Run the procedure on a corpus in a directory
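
To run collocate counting over a folder, one option is to count within each file and add the results into a running total. This sketch again builds a small temporary corpus so it can run as written; `corpus_collocates` and the sample file names are hypothetical, and counting stops at file boundaries by design.

```python
import os
import tempfile

def cleanword(someword):
    word = someword.casefold()
    while word and not word[0].isalpha():
        word = word[1:]
    while word and not word[-1].isalpha():
        word = word[:-1]
    return word

def count_collocates(words, target, window=3):
    counts = {}
    for i, word in enumerate(words):
        if word == target:
            start = max(0, i - window)
            for neighbour in words[start:i] + words[i + 1:i + window + 1]:
                counts[neighbour] = counts.get(neighbour, 0) + 1
    return counts

def corpus_collocates(directory, target, window=3):
    # Add up the collocate counts from every .txt file in the directory
    totals = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.txt'):
            continue
        with open(os.path.join(directory, filename), encoding='utf-8') as source:
            words = [cleanword(w) for w in source.read().split()]
        words = [w for w in words if w]  # drop tokens that cleaned down to nothing
        for word, n in count_collocates(words, target, window).items():
            totals[word] = totals.get(word, 0) + n
    return totals

with tempfile.TemporaryDirectory() as demo_dir:
    samples = {'one.txt': 'The truth is rarely pure and never simple.',
               'two.txt': 'Tell all the truth but tell it slant.'}
    for name, text in samples.items():
        with open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as out:
            out.write(text)
    totals = corpus_collocates(demo_dir, 'truth', window=2)

print(totals)
```

Because each file is processed separately, a window never reaches from the end of one text into the start of the next.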


In [None]:
# Looking at some of the results from the jazz corpus


# Toward an array

Let's return for a second to an example from our first session. When we were discussing the distance between various bird words, we modeled their data with a little table:

| term | "flies" | "swims" | "hockey" |
| ---- | ---- | ---- | ---- |
| penguin | 1 | 9 | 9 |
| duck | 8 | 7 | 8 |
| eagle | 10 | 1 | 0 |

This could also be depicted as a series of lists:

```Python
    penguin = [1,9,9]
    duck = [8,7,8]
    eagle = [10,1,0]
```

Now, all we're really showing here is data about a few collocates for each of the bird terms. But we've arranged that data a little differently than in our work in the collocates section, above. The arrangement consists of a series of **vectors**, which, as we learned on Monday, are just lists of numbers. Calculating relationships between words is much easier when we use vectors. We'll get into the details about distance on Friday, but producing the table or vector structure is an important first step.

In [None]:
# Before we dig in, an important concept
# A list of lists works like a table (and kind of looks like one, too)

# Each row starts with its term, so every row lines up with the header
bird_array = [['term','flies','swims','hockey'],['penguin',1,9,9],['duck',8,7,8],['eagle',10,1,0]]

for row in bird_array:
    print(row)
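
To read values back out of a list of lists, you index twice: first the row, then the column. The sketch below assumes each row begins with its term label, so that rows line up with the header; the variable names are just for illustration.

```python
bird_array = [['term', 'flies', 'swims', 'hockey'],
              ['penguin', 1, 9, 9],
              ['duck', 8, 7, 8],
              ['eagle', 10, 1, 0]]

header = bird_array[0]
swims_col = header.index('swims')   # position of the "swims" column

# One cell: row 2 is duck, then pick the "swims" column
duck_swims = bird_array[2][swims_col]
print(duck_swims)  # 7

# One column: take the same position from every data row
swims_column = [row[swims_col] for row in bird_array[1:]]
print(swims_column)  # [9, 7, 1]

# One vector: a data row without its label
eagle_vector = bird_array[3][1:]
print(eagle_vector)  # [10, 1, 0]
```

Looking up the column by name in the header is a little slower than hard-coding a number, but it is much harder to get wrong.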

In [None]:
# First, let's produce a table that captures the data for every word in our passage


In [None]:
# Next, let's try a table that shows how a few words relate to each other in The Great Gatsby
