# [Introduction to Text Analysis with Python](https://github.com/michellejm/NLTK_DHRI)
### Digital Humanities Research Institute <br> June 13, 2018
#### Michelle A. McSweeney, PhD 



In [None]:
import nltk
import matplotlib

%matplotlib inline

In [None]:
from nltk.book import *

**Concordance**: Words in their contexts

In [None]:
text1.concordance("whale")

In [None]:
text1.concordance("love")
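
To see what a concordance is doing, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name `simple_concordance`, the `width` parameter, and the sample sentence are all made up for illustration.

```python
def simple_concordance(tokens, target, width=3):
    """Return each occurrence of target with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        # Compare in lowercase so "Whale" and "whale" both match
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# A made-up sentence standing in for a real text
sample = "The whale rose and the crew saw the Whale dive again".split()
for line in simple_concordance(sample, "whale"):
    print(line)
```

Each printed line shows one occurrence of the word with its neighbors, which is the "word in context" view that `concordance` gives for a whole book.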

**Similar**: Words that appear in a similar environment as the target word
This builds on the idea behind concordance: it returns words that occur in the same contexts as the target word.

Try it on two different texts to compare how each author uses a word.

In [None]:
#How does Melville use "love"
text1.similar("love")

In [None]:
#How does Austen use "love"
text2.similar("love")

In [None]:
text5.similar('lol')

In [None]:
#For the next module, we need a plotting package

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
text1.dispersion_plot(["whale", "monster"])

Count a specific word - how many times does this sequence of characters occur in my document?

In [None]:
text1.count("Whale")
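
Counting matches the exact sequence of characters, so capitalization matters. The sketch below uses a plain Python list of made-up tokens (the name `sample_tokens` is hypothetical) and assumes the text's `count` behaves like a list's `count`.

```python
# Made-up tokens; only the first one is spelled exactly "Whale"
sample_tokens = ["Whale", "whale", "whale", "WHALE", "whales"]

# Exact match: only the capitalized form is counted
exact = sample_tokens.count("Whale")

# Lowercase everything first to count the word regardless of case
lowered = [t.lower() for t in sample_tokens]
any_case = lowered.count("whale")

print(exact, any_case)
```

Note that "whales" is a different sequence of characters, so it is not counted either way.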

How many **tokens** are in my text?

A **token** is any sequence of characters treated as one unit. Let's start with an example:

"love", "bowie", "Bowie", "!" and ":)" are all distinct **tokens**

In [None]:
#remove punctuation by only capturing the items that 
#"are alpha" and then lowering those
text1_tokens = []
for t in text1:
    if t.isalpha():
        t=t.lower()
        text1_tokens.append(t)

In [None]:
#First figure out how many words are in our text. 

len(text1_tokens)

How many **unique** words are in my text? 

* First make a set, which collapses repeats so that each distinct "word" (including numbers, punctuation sequences, etc.) appears once - these are the **types**.
* Token = one instance of a word in the text
* Type = a distinct form ("bowie" and "Bowie" are different types - why?)
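
A small sketch of the token/type distinction, using a made-up list (`sample_tokens` is a hypothetical name):

```python
# Seven tokens, some of them repeated
sample_tokens = ["love", "bowie", "Bowie", "!", ":)", "love", "love"]

# Tokens: every instance counts
n_tokens = len(sample_tokens)

# Types: a set keeps one copy of each distinct item
types = set(sample_tokens)
n_types = len(types)

# Lowercasing first merges "bowie" and "Bowie" into a single type
lowered_types = set(t.lower() for t in sample_tokens)

print(n_tokens, n_types, len(lowered_types))
```

"bowie" and "Bowie" stay separate types until the text is lowercased, because Python compares strings character by character.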

In [None]:
#set() keeps only the unique items; len() then counts them
len(set(text1_tokens))

In [None]:
#lexical density: the number of unique words divided by the total number of words
len(set(text1_tokens))/(len(text1_tokens))

In [None]:
text1_slice = text1_tokens[0:10000]

In [None]:
len(set(text1_slice))/(len(text1_slice))
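
The ratio of unique words to total words depends on how long the text is, which is one likely reason to measure it on a fixed-size slice. A minimal sketch with made-up data (the function name `lexical_density` and both sample texts are hypothetical):

```python
def lexical_density(tokens):
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

# Six tokens, five of them unique
short_text = "the cat sat on the mat".split()

# The same words repeated ten times: more tokens, no new types
long_text = short_text * 10

print(lexical_density(short_text))
print(lexical_density(long_text))
```

The longer text scores much lower even though its vocabulary is identical, so comparing slices of equal length gives a fairer comparison between texts.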

To clean our corpus, we need to:
1. Remove capitalization and punctuation (DONE)
2. Remove the stop words
3. Lemmatize the words

In [None]:
#import the stopwords from nltk.corpus
from nltk.corpus import stopwords
stops = stopwords.words('english')

In [None]:
#Remove the stop words by building a new list
#(removing items from a list while looping over it skips some of them)
text1_tokens = [t for t in text1_tokens if t not in stops]
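
A note on removing stop words: calling `remove` on a list while a for-loop is walking through that same list makes the loop skip items, so some stop words survive. Building a new list avoids this. The sketch below uses made-up tokens and a made-up stop list (`sample_stops`, `sample_tokens`, `buggy`, and `safe` are hypothetical names).

```python
sample_stops = {"the", "of", "and"}
sample_tokens = ["the", "of", "whale", "and", "the", "sea"]

# Pitfall: each removal shifts the list, so the loop skips the next item
buggy = list(sample_tokens)
for t in buggy:
    if t in sample_stops:
        buggy.remove(t)

# Safer: build a new list that keeps only the non-stop words
safe = [t for t in sample_tokens if t not in sample_stops]

print(buggy)
print(safe)
```

The first result still contains stop words; the second does not.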
        

In [None]:
len(text1_tokens)

In [None]:
len(set(text1_tokens))

In [None]:
# import the lemmatizer function from nltk.stem 
from nltk.stem import WordNetLemmatizer

#the lemmatizer requires that an instance be created before it is used
wordnet_lemmatizer = WordNetLemmatizer()

In [None]:
#Lemmatize the words
text1_clean = [wordnet_lemmatizer.lemmatize(t) for t in text1_tokens]


In [None]:
len(text1_clean)

In [None]:
len(set(text1_clean))

In [None]:
sorted(set(text1_clean))

In [None]:
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
t1_porter = [porter_stemmer.stem(t) for t in text1_tokens]
sorted(set(t1_porter))

In [None]:
my_dist = FreqDist(text1_clean)

This cell usually prints nothing, so check the **type** to be sure it worked

In [None]:
type(my_dist)

Now let's **plot** the graph

In [None]:
my_dist.plot(50,cumulative=False)

It may be a little easier to look at as a list

In [None]:
my_dist.most_common(20)
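
In recent NLTK versions, `FreqDist` is built on Python's `collections.Counter`, so the counting idea can be sketched with the standard library alone. The token list here is made up for illustration.

```python
from collections import Counter

# Made-up tokens standing in for a cleaned text
sample = ["whale", "sea", "whale", "ship", "sea", "whale"]

# Count how many times each token occurs
counts = Counter(sample)

# The two most frequent tokens, as (word, count) pairs
top_two = counts.most_common(2)

print(top_two)
```

`most_common` returns a list of tuples sorted from most to least frequent, which is the same shape of result shown for the full text.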

Now let's check to see if some words we are interested in appear in our text

In [None]:
b_words = ['god', 'apostle', 'angel']

In [None]:
#collect every word from b_words that also appears in the text
my_list = []
for word in b_words:
    if word in text1_clean:
        my_list.append(word)

In [None]:
print(my_list)
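
The same check can be written more compactly. This is a sketch with made-up data (`wanted`, `sample_clean`, and `vocabulary` are hypothetical names); converting the text to a set first makes each lookup faster on a long text.

```python
wanted = ["god", "apostle", "angel"]

# Made-up tokens standing in for a cleaned text
sample_clean = ["god", "whale", "angel", "sea", "god"]

# A set answers "is this word present?" without scanning the whole list
vocabulary = set(sample_clean)

# Keep the words of interest that occur in the text, in their original order
found = [w for w in wanted if w in vocabulary]

print(found)
```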

Let's pull a book in from the Internet

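Project Gutenberg is a great source! www.gutenberg.org
Let's say I'm interested in writings about East Africa from the early 20th century. I find the book, Zanzibar Tales, translated by George W. Bateman. In Project Gutenberg's database, this is text number 37472. The cells below download text number 996 (*Don Quixote*, hence the `don` and `dq` variable names); the same steps work for any text number.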



To make a book into a Text NLTK can deal with, we have to:

* open the file from a location
* read it and decode it
* tokenize it (go from a string to a list of words)
* wrap the list in nltk.Text()
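
The decode and tokenize steps above can be sketched without a network connection. Here a short made-up bytes object stands in for the downloaded file, and a simple regular expression stands in for NLTK's tokenizer; it is a rough approximation, not what `word_tokenize` actually does.

```python
import re

# Made-up bytes standing in for what file.read() returns
raw_bytes = b"Once there was a hare. The hare ran!"

# Decode: bytes become a string
text = raw_bytes.decode("utf-8")

# Tokenize: runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)

print(type(text))
print(tokens)
# The last step would be nltk.Text(tokens), which needs NLTK installed
```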

In [None]:
#import the urlopen command
from urllib.request import urlopen
#set the url to a variable 
#Use the plain-text (.txt) version: type this link rather than copying the address of the book's web page
my_url = "http://www.gutenberg.org/cache/epub/996/pg996.txt"


In [None]:
#open the file from the url
file = urlopen(my_url)
#read the opened file
raw = file.read()

In [None]:
#decode the raw bytes into a string (UTF-8 is the default encoding)
don = raw.decode('utf-8')


In [None]:
#check the type to be sure it worked. I expect a string now.
type(don)

In [None]:
import nltk

In [None]:
#split the string into tokens with word_tokenize (separates words and punctuation marks)
don_tokens = nltk.word_tokenize(don)

In [None]:
#check to make sure it worked
type(don_tokens)

In [None]:
#get an idea of how big the file is
len(don_tokens)

In [None]:
#look at the first 10 words to be sure it's correct
don_tokens[:10]

In [None]:
#skip the first 120 tokens (presumably the Project Gutenberg header)
dq_text = don_tokens[120:]

In [None]:
#turn the list of words into a text nltk can recognize
dq_nltk_text = nltk.Text(don_tokens[120:])

In [None]:
#Lowercase and remove punctuation first, so capitalized stop words like "The" are caught too
dq_clean = [t.lower() for t in dq_text if t.isalpha()]
#Remove stop words
mystops = set(stopwords.words('english'))
dq_clean = [w for w in dq_clean if w not in mystops]

In [None]:
#Lemmatize
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
dq_clean = [wordnet_lemmatizer.lemmatize(t) for t in dq_clean]

In [None]:
dq_clean

One step further! 
**Part-of-Speech Tagging**

NLTK uses the Penn Treebank tag set.

There are other options as well that you may want to look at (e.g., TreeTagger and polyglot).

In [None]:
#make a new object that has all the words and tags in it
dq_tagged = nltk.pos_tag(dq_clean)

In [None]:
print(dq_tagged[:10])

We have a **list of tuples**


I'm going to put it in a for-loop

Each tuple holds two items, so the loop has to unpack both at once:
<br>for (a, b) in my_list
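
A small sketch of tuple unpacking in a for-loop. The tagged pairs are hand-written for illustration, not real tagger output, and `sample_tagged` is a hypothetical name.

```python
# Hand-written (word, tag) pairs in the same shape as tagger output
sample_tagged = [("knight", "NN"), ("rode", "VBD"), ("horse", "NN")]

words = []
tags = []
# Each pass through the loop splits one tuple into its two parts
for (word, tag) in sample_tagged:
    words.append(word)
    tags.append(tag)

print(words)
print(tags)
```

Without unpacking, each item would arrive as a single tuple and the two parts would have to be pulled out by position, as in `pair[0]` and `pair[1]`.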


In [None]:
#create an empty dictionary
tag_dict = {}
#for every word/tag pair in my list, add one to the count for that tag
for (word, tag) in dq_tagged:
    if tag in tag_dict: 
        tag_dict[tag]+=1
    else:
        tag_dict[tag] = 1


In [None]:
tag_dict

In [None]:
from collections import OrderedDict
tag_dict = OrderedDict(sorted(tag_dict.items(), key=lambda t: t[1], reverse=True))
tag_dict
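
The sorting line packs several steps together. Here they are one at a time on a made-up dictionary (`sample_counts` and the other names are hypothetical). In Python 3.7 and later a plain `dict` also remembers insertion order, so it is used here in place of `OrderedDict`.

```python
# Made-up tag counts
sample_counts = {"NN": 5, "VBD": 2, "JJ": 9}

# .items() gives (tag, count) pairs
pairs = list(sample_counts.items())

# key=lambda t: t[1] tells sorted() to compare the counts (position 1 of each pair);
# reverse=True puts the largest count first
by_count = sorted(pairs, key=lambda t: t[1], reverse=True)

# Turn the sorted pairs back into a dictionary
ordered = dict(by_count)

print(by_count)
```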

How do you know to put all that other stuff in (e.g., sorted, lambda, etc.)?!?

*Read the docs*

https://docs.python.org/3.1/whatsnew/3.1.html

So far, we have counted things in our texts by looking at 

* Concordance
* Words in similar environments
* Words in common contexts
* Unique words 
* Length of words 


Then we performed some operations, but still counted things:

* Frequency Distributions
* Lexical Density 
* Found words from a list in a text
* Part-of-Speech Tags

Now we will perform operations on the Text itself before doing those operations

In [None]:
#Let's check out the lexical density of the cleaned text
len(set(dq_clean))/len(dq_clean)

What if I want to read in my OWN corpus?

In [None]:
#replace this path with the location of your own text file
with open("/Users/mam/books/hungerGames/catchingFire.txt", 'r') as f:
    my_file = f.read()


In [None]:
hg_tokens = nltk.word_tokenize(my_file)
hg_tokens[:100]
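
To try the read-a-file step without a corpus of your own, the sketch below writes a tiny made-up text to a temporary folder and reads it back. The file name `my_corpus.txt` and the sample text are hypothetical; passing `encoding="utf-8"` explicitly is an assumption that your file is saved as UTF-8.

```python
import os
import tempfile

sample_text = "Here is a tiny stand-in corpus.\nIt has two lines."

# A temporary folder that is deleted when the block ends
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_corpus.txt")

    # Write the sample so there is something to read
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample_text)

    # Read it back; "with" closes the file automatically
    with open(path, "r", encoding="utf-8") as f:
        my_text = f.read()

print(my_text)
```

From here, `my_text` is an ordinary string and can be tokenized like the downloaded book.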

**Going Forward**

* Use a text editor to write complete programs
    * Run these in the terminal
* Use Spyder to write complete programs
* It often helps to save the program you write in the same folder as the file you will be working with, to shorten the path.

How do I know where to go?!?
* http://www.nltk.org/book_1ed
* http://www.nltk.org/

