<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [J.D. Porter](https://) for the 2023 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email porterjd@upenn.sas.edu<br />
____

# `Finding Word Meaning Through Context` `1`

This is lesson `1` of 3 in the educational series on `finding word meaning in context`. This notebook is intended `to teach the basic steps involved in working with text files in Python, and introduce the concepts of programmatically reading a text file, analyzing the words it contains, and writing out some results`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` 

**Difficulty:** `Beginner`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, functions, lists, dictionaries)

```

**Knowledge Recommended:**
```
* n/a
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Open a text file and convert it to a list of words
2. "Clean" the words using a function
3. Find counts and frequencies for the words
4. Write out the results of their analysis
```
**Research Pipeline:**
```
1. Gather a file in the .txt format and save it somewhere on your machine
2. Use whatever steps you're interested in from this notebook
3. If you have written out some of your data, explore it in a program like Excel or Google Sheets
4. Interpret!
```
___

# Required Python Libraries

* To keep things simple, we will work with very few libraries in this notebook: just `os`, which comes with Python, so you do not need to install it (though you do need to import it).

## Install Required Libraries


In [None]:
### Import Libraries ###
import os

# Required Data

**Data Format:** 
* plain text (.txt)

**Data Source:**
* included files (though you may supplement these whenever you feel comfortable doing so!)

**Data Quality/Bias:**

Included texts are from freely available sources like Project Gutenberg. They have not been vetted for textual accuracy relative to, say, a scholarly edition.

**Data Description:**

This lesson uses textual data in .txt format (utf-8 encoding) from various sources.

# Introduction...

## ...to the course

The basic idea behind this course is that we can tell a lot about the meaning of a given word by looking at the words that are used "near" it. In our case, this will mean looking at digital copies of texts, treating them as sequences of words, and observing which words tend to appear within a few positions of each other in those sequences.

The notion that meaning relies heavily upon context has a long, rich history in linguistics, philosophy of language, natural language processing, and other fields. One important early thinker in this domain (and we'll cast a pretty broad net in describing the domain here) is Ferdinand de Saussure, who argues that words should not be understood as a bunch of names for objects. Rather, they are signs (and often pretty arbitrary signs), and their meaning is determined not by pointing at objects, but by the internal relationships of the broader system in which they exist.[$^{1}$](#1) 

We find a similar argument at the beginning of Ludwig Wittgenstein's _Philosophical Investigations_, published about a half century after Saussure's work. Wittgenstein begins with a quote from Augustine of Hippo, suggesting that language is learned by pointing at objects and saying their names (an "ostensive" model). Wittgenstein points out that a lot of language doesn't seem to function that way (think of someone yelling "Help!"), and argues that the Augustinian model is flawed. His ideas (in this period at least) are often summarized with his claim that "the meaning of a word is its use in the language".[$^{2}$](#2)  This "use" can be quite complex, and involves components that go beyond the presence of other words. For instance, when you yell "Help!", the meaning will depend on factors like whether you are flailing in a swimming pool vs. about to drop a stack of plates, your relationships to the people around you, and so on. Still, in a broad sense it does seem like "use" must incorporate at least some information about how a word tends to be deployed relative to other words (or in any case, many people have understood Wittgenstein that way).

I mention Saussure and Wittgenstein because of their importance in thinking about language as a non-ostensive system in which meaning seems to have something to do with the entire system of the language. But the most relevant context for our work in this course is probably the distributional hypothesis, which has its home more in linguistics. As you might remember from the course description, the linguist J.R. Firth famously said, "You shall know a word by the company it keeps". This basic idea arguably sits at the foundation of quite a lot of modern work in linguistics and natural language processing.[$^{3}$](#3) For instance, word vector models generally depend on counting how frequently words appear near each other in a large corpus of text.

This leads us to the work of this course, because most distributional models (or contextual theories of language meaning in general) at some point involve _collocates_. "Collocate" is a term of art for any word that appears near some target word (if you think of the name as co-locate, you've basically got its meaning). For example, if we looked at a large collection of 1960s and '70s movie reviews, we would probably find that the collocates of "Connery" would include words like "Sean", "Bond", "007", and so on; depending on how we measured them, they might even include things like "spy" or "martini". 

With enough of these collocates, would we begin to understand what "Connery" _means_ in the context of Cold War cinema? Maybe! By learning to find collocates, we will set ourselves up to think more deeply about questions like these, which remain lively areas of investigation in several disciplines. And we will also make it easier to examine what's going on with specific words or texts (e.g., if you wanted to know how the Supreme Court has discussed "privacy" over time, you could use collocates in your investigation). 

<hr></hr>

##### *Footnotes* #####

1. <a id="1"></a> The classic Saussure text is his _Course in General Linguistics_.
2. <a id="2"></a> _Philosophical Investigations_, 43. For the argument that Wittgenstein's notion of use is _not_ well captured by a distributional semantic model, see Bender, Emily M., and Alexander Koller. "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data". _Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics_. 10.18653/v1/2020.acl-main.463
3. <a id="3"></a> For a nice overview of distributional semantics in general, see Lenci, Alessandro. "Distributional Semantics in Linguistic and Cognitive Research". _Rivista di Linguistica_ 20.1 (2008), pp. 1-31.

## ...to this notebook

To get to collocates, we'll need to know how to get to words in the first place. That's the focus of this notebook.

There are many ways to go about the tasks described below. Some are arguably more efficient than what we'll learn today; others are matters of preference. The aim here is to give you very basic, flexible, lightweight tools, usable with a pretty minimal level of familiarity with Python.

The section headers will give you a good sense of the steps involved, so let's dive in!

# Opening a file

And reading the text in it

The first thing we need is a path to the file. If the file is in the same directory as your notebook, you can just use the name of the file:

```Python
fn = 'Austen_PrideAndPrejudice.txt'
```

If it's anywhere else, you need to give a _path_ to the file, which usually looks something like this:

```Python
fn = '/Users/porterjd/Documents/TAP/Austen_PrideAndPrejudice.txt'
```

On a Mac, you can right click a file in Finder, hold "option", and select "Copy 'Austen_PrideAndPrejudice.txt' as Pathname".

To do the equivalent in Windows, hold shift _before_ right clicking on the file.
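
If you'd rather build a path in code than paste one in, the `os` library we imported earlier can help. This is only a minimal sketch: the folder name `texts` and the variable names `folder` and `candidate` are made up for illustration and are not part of the lesson's files.

```python
import os

# Hypothetical folder name, used only for illustration
folder = 'texts'
filename = 'Austen_PrideAndPrejudice.txt'

# os.path.join inserts the right separator for your operating system
candidate = os.path.join(folder, filename)
print(candidate)

# os.path.exists reports whether anything is actually at that path
if os.path.exists(candidate):
    print('Found it!')
else:
    print('No file there. The working directory is:', os.getcwd())
```

If the check fails, the printed working directory tells you where Python is looking when you give it a bare filename.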

In [None]:
# Put your filename here, as the value of the variable "fn"

fn = 'Austen_PrideAndPrejudice.txt'

## The verbose way

In [None]:
# Open the file (our texts are utf-8 encoded, so we say so explicitly)
file = open(fn,encoding='utf8')

# Read the file
text = file.read()

# Close the file
file.close()

## The 'Pythonic' way

In [None]:
with open(fn,encoding='utf8') as source:
    text = source.read()
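
One reason the `with` version is usually preferred is that it closes the file for us as soon as the indented block ends. Here is a small sketch that shows this, using a throwaway file in your system's temporary folder so it runs anywhere; `demo_path` and `demo_text` are illustrative names, not part of the lesson.

```python
import os
import tempfile

# Make a small throwaway file so the example runs anywhere (hypothetical demo file)
demo_path = os.path.join(tempfile.gettempdir(), 'with_demo.txt')
with open(demo_path, 'w', encoding='utf8') as demo_file:
    demo_file.write('It is a truth universally acknowledged')

# Read it back using the same pattern as the cell above
with open(demo_path, encoding='utf8') as source:
    demo_text = source.read()
    print(source.closed)   # still inside the block, so the file is open

print(source.closed)       # outside the block, the file has been closed for us
print(demo_text)
```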

# Turning a text string into words

The **.split()** method turns a string into a list. Basically, the method breaks up the string into sections, each section being demarcated by a specified character. The character itself is removed. By default, it splits on white space (including tabs and new lines), but you can use other characters, too!

In [None]:
words = text.split()
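
To see what `.split()` does on something smaller than a whole novel, here is a quick sketch with a few made-up strings (`line`, `date`, and `row` are just illustrative names).

```python
line = 'It is a truth\tuniversally\nacknowledged'

# With no argument, split() breaks on any run of whitespace (spaces, tabs, newlines)
print(line.split())

# With an argument, split() breaks on that exact string instead
date = '1813-01-28'
print(date.split('-'))

# Either way, the separator itself is removed from the results
row = 'token,count'
print(row.split(','))
```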

# Counting the words

In [None]:
# First, set up a dictionary. It will take words as keys and store counts as values (though for now it's empty)
counts = {}

# Next, use a for loop to iterate through our words, counting them as we go
# Remember that you can use n += x to increment some variable n by amount x
for w in words:
    if w not in counts:
        counts[w] = 0
    counts[w] += 1

# Notice any problems with our results? That's why we need the next section!

In [None]:
# Compare counts of 'she' to counts of 'She', or 'said' and 'said,'
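
If you're not sure how to make that comparison, here is the same problem in miniature. This sketch uses a made-up sentence (`sample_text`) rather than the novel, and looks up counts with the dictionary method `.get()`, which returns a fallback value when a key is missing.

```python
sample_text = 'She said, "She is here," and she said no more.'

# Same counting loop as above, on a much smaller text
sample_counts = {}
for w in sample_text.split():
    if w not in sample_counts:
        sample_counts[w] = 0
    sample_counts[w] += 1

# .get() returns the second argument when the key is missing
for key in ['She', 'she', '"She', 'said', 'said,']:
    print(repr(key), sample_counts.get(key, 0))
```

Each of these five "words" gets its own entry, even though a reader would say the sentence contains "she" three times and "said" twice.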

# Cleaning the words

In this section, we're going to build a function that takes in a (potentially messy) word and returns a "cleaned up" version of the word. We're going to build the function together in an iterative way, testing it out to see what it can do along the way. This can be a useful way to build functions of all kinds!

In [None]:
# Run this cell to get a useful example sentence

sentence = 'She said, "We should leave," and so - with some reservations - we did.'

In [None]:
sentence_words = sentence.split()

#### Some common case methods
* .upper()
* .lower()
* .title()
* .casefold()
    * Basically a more aggressive form of .lower()
    
#### Some common character checks
 * .isalpha()
 * .isdigit()
 * .isalnum()
 
#### Some common string checks
 * .endswith()
 * .startswith()
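
Here is a quick sketch that tries each of the methods listed above on two example strings. It is meant for poking around, not as a complete tour of Python's string methods.

```python
a = 'apple123'
phrase = 'Ich weiß nicht.'

# Case methods return a new string; the original is unchanged
print(phrase.upper())
print(phrase.lower())
print(phrase.title())
print(phrase.casefold())   # casefold also converts 'ß' to 'ss'

# Character checks return True or False for the whole string
print(a.isalpha())   # False: the digits are not letters
print(a.isdigit())   # False: the letters are not digits
print(a.isalnum())   # True: every character is a letter or a digit

# String checks look at the beginning or end of the string
print(a.startswith('app'))
print(a.endswith('123'))
```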

In [None]:
# Examples for testing the case methods and character checks above

a = 'apple123'

phrase = 'Ich weiß nicht.'

phrase.casefold()

In [None]:
# Takes a word and returns a cleaned up version of the word
def wordcleaner(someword):
    cleanword = someword.casefold()
    while cleanword and not cleanword[-1].isalnum():
        cleanword = cleanword[:-1]
    while cleanword and not cleanword[0].isalnum():
        cleanword = cleanword[1:]
    return cleanword

# Takes a word and returns an aggressively cleaned version of the word
def alphaclean(someword):
    letters = [i for i in someword if i.isalpha()]
    cleanword = "".join(letters)
    cleanword = cleanword.casefold()
    return cleanword

In [None]:
# Getting a sample word in order to test our wordcleaner function as we build it

sample = sentence_words[4]

print(sample)

In [None]:
# Showing one way to get just the alphabetic characters out of a string

letters = [i for i in sample if i.isalpha()]
letters

In [None]:
# Seeing our wordcleaner function in action on our sample word

print(wordcleaner(sample))

In [None]:
# Seeing our wordcleaner in action on our original sentence words

for w in sentence_words:
    print(w,"\t",wordcleaner(w))

In [None]:
# Seeing our alphaclean function in action on our sample word

print(alphaclean(sample))

In [None]:
# Getting our alphacleaned words

alpha_counts = {}

for w in words:
    w = alphaclean(w)
    if w not in alpha_counts:
        alpha_counts[w] = 0
    alpha_counts[w] += 1

In [None]:
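# Word counting results from our wordcleaner method
# (counts still holds the uncleaned words at this point, so we clean each key before checking it)

for w in counts:
    if "elizabeth" in wordcleaner(w):
        print(w,"\t",wordcleaner(w))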

In [None]:
# Word counting results from our alphaclean method

for w in alpha_counts:
    if "elizabeth" in w:
        print(w)
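
The difference between the two cleaning functions is easiest to see side by side. In this sketch the two functions are repeated (with `alphaclean` slightly condensed) so the block runs on its own, and the test words in `tests` are chosen just for illustration.

```python
# Trims non-alphanumeric characters from both ends, keeps everything in the middle
def wordcleaner(someword):
    cleanword = someword.casefold()
    while cleanword and not cleanword[-1].isalnum():
        cleanword = cleanword[:-1]
    while cleanword and not cleanword[0].isalnum():
        cleanword = cleanword[1:]
    return cleanword

# Keeps only the letters, wherever they appear
def alphaclean(someword):
    letters = [i for i in someword if i.isalpha()]
    return "".join(letters).casefold()

# Illustrative test words, chosen for this sketch
tests = ['"We', 'storm-tossed', "ship's", '1813,', '-']
for t in tests:
    print(repr(t), repr(wordcleaner(t)), repr(alphaclean(t)))
```

Notice that `wordcleaner` leaves internal hyphens and apostrophes alone, while `alphaclean` removes them along with any digits. Both return an empty string for a lone dash, which is worth remembering when you count.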

# Counting the words redux

In [None]:
# Let's get our word counts again!
counts = {}

# Using a list comprehension to clean my words
cleanwords = [wordcleaner(w) for w in words]

# This is what the list comprehension would look like as a verbose for loop
# cleanwords = []
# for w in words:
#     cleanwords.append(wordcleaner(w))

for w in cleanwords:
    if w not in counts:
        counts[w] = 0
    counts[w] += 1

In [None]:
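# An alternative way to make your count dictionary
# (start from an empty dictionary again, or the counts from the previous cell will be doubled)
counts = {}

for w in cleanwords:
    if w in counts:
        counts[w] += 1
    else:
        counts[w] = 1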
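
Once we have counts, frequencies are a short step away: a word's relative frequency is its count divided by the total number of words. Here is a minimal sketch using a tiny stand-in dictionary with made-up numbers (`toy_counts`) in place of the real `counts`; `total`, `freqs`, and `ranked` are illustrative names.

```python
# A tiny stand-in for the "counts" dictionary built above (made-up numbers)
toy_counts = {'the': 6, 'of': 3, 'elizabeth': 1}

# The total number of word tokens is the sum of all the counts
total = sum(toy_counts.values())

# Relative frequency = count / total
freqs = {}
for w, c in toy_counts.items():
    freqs[w] = c / total

# Sort the words from most to least frequent
ranked = sorted(toy_counts, key=toy_counts.get, reverse=True)
for w in ranked:
    print(w, toy_counts[w], round(freqs[w], 3))
```

Swapping `toy_counts` for `counts` should give the same kind of results for the whole text. Frequencies are handy when you want to compare texts of different lengths, since raw counts naturally grow with the size of the text.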

## Sidebar: The hardtack/cherry/electron problem!

_Moby-Dick_ 

Chapter 23, "The Lee Shore"

Some chapters back, one Bulkington was spoken of, a tall, newlanded mariner, encountered in New Bedford at the inn.

When on that shivering winter’s night, the Pequod thrust her vindictive bows into the cold malicious waves, who should I see standing at her helm but Bulkington! I looked with sympathetic awe and fearfulness upon the man, who in mid-winter just landed from a four years’ dangerous voyage, could so unrestingly push off again for still another tempestuous term. The land seemed scorching to his feet. Wonderfullest things are ever the unmentionable; deep memories yield no epitaphs; this six-inch chapter is the stoneless grave of Bulkington. Let me only say that it fared with him as with the storm-tossed ship, that miserably drives along the leeward land. The port would fain give succor; the port is pitiful; in the port is safety, comfort, hearthstone, supper, warm blankets, friends, all that’s kind to our mortalities. But in that gale, the port, the land, is that ship’s direst jeopardy; she must fly all hospitality; one touch of land, though it but graze the keel, would make her shudder through and through. With all her might she crowds all sail off shore; in so doing, fights ‘gainst the very winds that fain would blow her homeward; seeks all the lashed sea’s landlessness again; for refuge’s sake forlornly rushing into peril; her only friend her bitterest foe!

Know ye now, Bulkington? Glimpses do ye seem to see of that mortally intolerable truth; that all deep, earnest thinking is but the intrepid effort of the soul to keep the open independence of her sea; while the wildest winds of heaven and earth conspire to cast her on the treacherous, slavish shore?

But as in landlessness alone resides highest truth, shoreless, indefinite as God- so better is it to perish in that howling infinite, than be ingloriously dashed upon the lee, even if that were safety! For worm-like, then, oh! who would craven crawl to land! Terrors of the terrible! is all this agony so vain? Take heart, take heart, O Bulkington! Bear thee grimly, demigod! Up from the spray of thy ocean-perishing- straight up, leaps thy apotheosis!

# Writing out some results

If you're new to coding, this is where things start to seem kind of magical. We can make a new file on our machines just by writing some code in this notebook. Here, we'll start out learning how to write out any old file, and then we'll try writing out a spreadsheet containing our information about word frequencies.

In future work, we'll learn how to write files to different locations on our machines. For now, though, we're going to write things out to the working directory, or in other words the directory containing this notebook. 

In [None]:
# First, let's create a filename for our initial tests of the method
output_fn = 'hello_world.txt'

# Then let's make a string to write out to the file
output_string = "Hello world!"

# Now we can create the file using a modified version of our "Pythonic" approach to opening files
with open(output_fn,'w') as output_file:
    output_file.write(output_string)

# Go check out the file on your machine!
# Then close it, and let's try writing more than one thing. Here's a list of animals:

animals = ['ant','bug','cat','dog','elk','fly','gnu','hog']

# Now let's try with this list:
numbers = [512,1729,1864,11251,6]
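
If you get stuck on these two lists, here is one possible approach, writing one item per line. Treat it as a sketch rather than the only answer; the filenames `animals.txt` and `numbers.txt` are made up for this example.

```python
animals = ['ant','bug','cat','dog','elk','fly','gnu','hog']
numbers = [512,1729,1864,11251,6]

# Hypothetical output filenames, written to the working directory
with open('animals.txt', 'w', encoding='utf8') as output_file:
    for animal in animals:
        # write() does not add a line break, so we add "\n" ourselves
        output_file.write(animal + "\n")

with open('numbers.txt', 'w', encoding='utf8') as output_file:
    for n in numbers:
        # write() only accepts strings, so numbers need str() first
        output_file.write(str(n) + "\n")
```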

There are many ways to write out our data. For example, we could write our dictionary, pretty much in its current form, as a .json file. However, spreadsheets are a very useful format for digital humanities (and other kinds of) work. And one useful skill to have in Python is creating a 'table' from some data you have in another format. To do that, we'll use a list of lists.
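
For the curious, the .json route mentioned above might look something like this. It is a sketch using Python's built-in `json` library with a made-up stand-in dictionary (`toy_counts`) and a hypothetical filename (`demo_counts.json`).

```python
import json

toy_counts = {'the': 6, 'of': 3, 'elizabeth': 1}   # made-up numbers
json_fn = 'demo_counts.json'                       # hypothetical filename

# json.dump writes the dictionary to the file; indent=2 makes it easier to read
with open(json_fn, 'w', encoding='utf8') as output_file:
    json.dump(toy_counts, output_file, indent=2)

# json.load reads it back in as a dictionary
with open(json_fn, encoding='utf8') as source:
    reloaded = json.load(source)

print(reloaded == toy_counts)
```

The rest of this section takes the spreadsheet route instead, using a list of lists.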

In [None]:
# First, let's set up our output table with the header row we want
output_table = [['token_','count']]

# Then, let's iterate through our dictionary using a for loop, filling out the table as we go
for word,count in counts.items():
    new_row = [word,count]
    output_table.append(new_row)

# Last, let's write out the table!
output_fn = 'pp_counts.csv'

with open(output_fn,'w',encoding='utf8') as output_file:
    for row in output_table:
        str_version = [str(i) for i in row]
        output_str = ",".join(str_version)
        output_file.write(output_str + "\n")
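
Joining with commas works well as long as none of our words contains a comma. Since `wordcleaner` only trims the ends of a word, a token with a comma in the middle could slip through and split one cell into two. Python's built-in `csv` library handles that case by quoting such cells. Here is a sketch of that alternative, with a made-up table (`demo_table`) and a hypothetical filename (`demo_counts.csv`).

```python
import csv

# A tiny stand-in table with made-up counts; one word contains a comma
demo_table = [['token_', 'count'], ['well,--and', 2], ['elizabeth', 3]]
demo_fn = 'demo_counts.csv'   # hypothetical filename

# The csv documentation recommends newline='' when opening a file for csv.writer
with open(demo_fn, 'w', newline='', encoding='utf8') as output_file:
    writer = csv.writer(output_file)
    writer.writerows(demo_table)

# Reading it back shows the comma-containing word survived as a single cell
with open(demo_fn, newline='', encoding='utf8') as source:
    rows = list(csv.reader(source))
print(rows)
```

Note that everything comes back from the file as a string, including the counts.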

In [None]:
# A list of lists can replicate the structure of a table

table = [[1,2],[3,4]]

# Here's an example of a list of lists looking kind of like a table!
for row in table:
    print(row)

In [None]:
# You can grab the key and value from a dictionary using .items()

for word,count in counts.items():
    print(word,count)

In [None]:
# A list comprehension to turn each item in a list into the string version of that item
str_version = [str(i) for i in output_table[5]]

# A join that connects everything in a list into one big string (with commas between the original items)
",".join(str_version)