# Python for Digital Humanities

## Unit #3: Beginning Text Processing

* Overview
* NLTK


<font color=blue>---------------------------------------------------------------</font>

### 3.1 Overview 
#### Basic Text Processing

One of the most basic techniques for processing text is to look at the text as a collection of words and perform statistics on the collection.  How many unique words are used? What words are used the most often? Are there word combinations (e.g., inner sanctum, gray skies) that are repeated throughout the text?

In this unit, we will look at tools within Python that will help us do this.  We are fortunate that other people have already written the code that does most of the work for us.  To use that code, though, we need to make sure that these tools are installed with our version of Python.

One of the advantages of using the Anaconda distribution of Python is that it comes with several of the most used packages or modules already installed.


In the IPython console of Spyder, you can type
```
!pip list
``` 
and look through the alphabetical list of the packages/modules.

Or, you can try to import the package/module by name.  If the import finishes without an error message, the package/module is installed.
For example:
```
import math
```
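
If you would rather check several packages at once without triggering an error, the standard library can look a module up by name. The snippet below is a minimal sketch; the helper name `is_installed` is made up for this example.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

# Report on a few modules; change the list to the ones you care about
for name in ["math", "nltk"]:
    status = "available" if is_installed(name) else "NOT installed"
    print(f"{name}: {status}")
```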


### 3.2 The NLTK Package

In Python, we have the `nltk` (Natural Language Toolkit) package that provides some very nice statistical tools that can be used with text.

Before we can take full advantage of `nltk`, we need to have it install some datasets that we can use in our statistical analysis.

<font color='red'>Caution:</font> 
* Before doing this download, make sure that you have **at least 3.5GB of free space on your laptop!**
* The download will take approximately 30 minutes.  Make sure that you won't need your computer for that chunk of time.
* Only do the download once! 

```
import nltk
nltk.download()
```
When a dialog box appears, select "all". 



## Activity:  Using some features of nltk


Make sure that you have downloaded the file "emma_chapter_one.txt".
Type in the code snippets from below to see what `nltk` can do for you.

**1. Read in the file _emma_chapter_one.txt_ into the variable `raw_text`**
```
import nltk
with open('emma_chapter_one.txt') as f:
    raw_text = f.read()
```


**2. Split the text into words by using `word_tokenize`.**

```
words = nltk.word_tokenize(raw_text)
```
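
To get a feel for what tokenizing does, here is a rough sketch that uses only Python's built-in `re` module. It is a simplified stand-in, not the method that `word_tokenize` uses (which handles many more cases). The helper name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Emma Woodhouse, handsome, clever, and rich."
tokens = simple_tokenize(sample)
print(tokens)
```

Notice that each comma and the final period come out as their own tokens.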

**3. Use `FreqDist` to have the program count how many times each word is used in the text.**

By saving the results to the variable `fdist`, you can look up how many times a word occurs.  For example, `fdist['Emma']` will tell us the number of times that the word "Emma" occurs.

```
fdist = nltk.FreqDist(words)
print(fdist['Emma'])
```
You only have to use `FreqDist` one time.  The results are saved in the variable `fdist`. Try different words that you are curious about.  Do they occur more or less often than "Emma"?

```
print(fdist['REPLACE_WITH_YOUR_WORD'])
```
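
A frequency count behaves much like a dictionary that maps each word to a number. The sketch below uses `Counter` from Python's standard library on a tiny, made-up word list to show two things worth knowing: lookups are case-sensitive, and a word that never occurs simply counts as 0.

```python
from collections import Counter

# A tiny made-up token list standing in for the real text
sample_words = ["Emma", "was", "happy", ",", "and", "Emma", "was", "clever", "."]
counts = Counter(sample_words)

print(counts["Emma"])          # capitalized form
print(counts["emma"])          # lowercase form is a different key
print(counts.most_common(3))   # the three most frequent tokens
```

If you want "Emma" and "emma" counted together, one option is to lowercase every word before counting.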

**4. Convert the words back to a text.**  
For the next set of features or tools, we want to have the text back together.  It is very important that we use `nltk`'s command `Text` to do this.
```
text = nltk.Text(words)
```
Now, `text` is a special variable that has some tools associated with it.  Let's look at a handful of those tools.

**5. Suppose we want to know how the word "Emma" is used throughout the text.**
We can use the `concordance` tool to look at the words surrounding "Emma":
```
text.concordance("Emma")
```
Use the `concordance` tool on words of your choice to see how they are used.  (Maybe you noticed another word when you used the concordance tool and you want to see how that word is used throughout the text!)
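
The idea behind a concordance is simple: find every position where the word occurs and show a few words on either side. Here is a minimal sketch of that idea in plain Python. It is not how `nltk` implements it; the function name `keyword_in_context`, the `width` parameter, and the choice to ignore case are assumptions made for this example.

```python
def keyword_in_context(tokens, keyword, width=3):
    """Return each occurrence of keyword with `width` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]      # tokens before the match
            right = tokens[i + 1:i + 1 + width]     # tokens after the match
            lines.append(" ".join(left + [token] + right))
    return lines

sample = "Emma was happy and Emma was clever".split()
for line in keyword_in_context(sample, "Emma", width=2):
    print(line)
```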

**6. We may want to know if other words are used in a way similar to how "Emma" was used.**
To look for these words, we can use the `similar` tool:
```
text.similar('Emma')
```
Again, try words other than "Emma" to see what you can discover.

**7. Suppose we want to determine if there are common contexts between the word "Emma" and the word "he".**
We can use the `common_contexts` tool:
```
text.common_contexts(['Emma', 'he'])
```
Notice that the output is in the form of two words joined by an underscore (e.g., _but_was_).  This means that the phrases "but Emma was" and "but he was" both occur in the text.
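
To make the idea of a "context" concrete, the sketch below collects the word before and the word after each occurrence of a word, then keeps only the pairs that two words have in common. This is a simplified illustration rather than `nltk`'s own code; the function names and the sample sentence are made up.

```python
def contexts(tokens, word):
    """Collect the (previous, next) token pairs that surround `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

def shared_contexts(tokens, word_a, word_b):
    """Return the contexts in which both words appear."""
    return contexts(tokens, word_a) & contexts(tokens, word_b)

sample = "but Emma was glad , but he was tired , and Emma is here".split()
for before, after in sorted(shared_contexts(sample, "Emma", "he")):
    print(f"{before}_{after}")
```

Here "and Emma is" has no matching "and he is", so only the shared pair is printed.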


**8.  Sometimes it helps to visualize how words are placed throughout the text.**

For example, we may want to know how the words "marry", "marriage", "Knightley" and "Weston" are scattered throughout the text.  (Spelling matters here: a name that is misspelled will simply show no tick marks.)  We can create a plot to help us visualize this with the `dispersion_plot` tool:

```
text.dispersion_plot(["marry", "marriage", "Knightley", "Weston"])
```
Think of the horizontal line as being the text from beginning to end.  The lines of tick marks show where in the text the words appeared.  (Tick marks close to the left mean the word was near the beginning of the text, whereas tick marks on the right mean the word was near the end of the text.)  We hope that we will see patterns in the tick marks (e.g., Is there a place in the text where "marriage" is discussed more frequently?  Does a name appear in similar locations as the word "marriage"?).
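
Each tick mark corresponds to a position in the list of words. The sketch below shows that bookkeeping without drawing anything: for each word of interest, it records the positions where the word occurs. The function name `word_offsets` and the sample sentence are made up for this example.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of positions where it occurs."""
    return {
        target: [i for i, token in enumerate(tokens) if token == target]
        for target in targets
    }

sample = "they marry and Weston talks of marriage then Weston leaves".split()
offsets = word_offsets(sample, ["marry", "marriage", "Knightley", "Weston"])
for word, positions in offsets.items():
    print(word, positions)
```

A word that never occurs gets an empty list, which is why it would show no tick marks on the plot.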


**9. Now for some really fancy stuff, let's see what words are commonly paired together.**

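We will discuss how the following lines work later.  If you carefully type the code snippet in, you will see the 25 highest-scoring word pairs.  The pairs are ranked by a likelihood-ratio score, which favors words that appear together more often than chance would suggest, so they are not necessarily the pairs with the highest raw counts.  Interestingly, the tokenizer (the tool that split the raw text into words) often counts punctuation as a separate "token", so some of the pairs will include punctuation marks.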

```
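# Build a finder that counts every pair of adjacent tokens
finder = nltk.BigramCollocationFinder.from_words(words)
# Score each pair with the likelihood-ratio measure (highest scores come first)
scored = finder.score_ngrams(nltk.collocations.BigramAssocMeasures().likelihood_ratio)

# Print the top 25 pairs (or fewer, if the text has fewer pairs)
for pair in scored[:25]:
    print(pair)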
```
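
If the scoring step feels mysterious, it may help to see the simpler idea underneath it: pair each word with the word that follows it, then count the pairs. The sketch below does only that, using raw counts instead of a likelihood-ratio score. The function name `bigram_counts` and the sample phrase are made up for this example.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count how often each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))

sample = "the inner sanctum and the inner circle".split()
pairs = bigram_counts(sample)

# Show the most common pairs first
for pair, count in pairs.most_common(3):
    print(pair, count)
```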

    
<font color=blue>---------------------------------------------------------------</font>

### Summary

In this section, we learned about some basic tools that you can use to explore your text. 

Some of the tools may be beyond the scope of what you know about Python, but they are exposing you to some nice features.  By the end of the workshop, you will have many more Python skills.



<font color=blue>---------------------------------------------------------------</font>