
# Basic natural language operations using Python NLTK

***Natural Language Processing (NLP)*** uses computers to analyze and interpret spoken or written language. The last few years have seen rapid advances driven by neural networks, very large datasets, and massive computing power. Here we focus on basic operations:
* Tokenizing
* Viewing the corpus as a whole (word frequencies)
* Part-of-speech tagging
* Stemming
* Lemmatization
* Collocations

Our source data will be the US Supreme Court's Heller decision (*District of Columbia v. Heller*, 554 U.S. 570 (2008)), which ruled that the Second Amendment's protection of the right to bear arms is not tied to the maintaining of a militia.

We will use the Python `NLTK` (Natural Language Toolkit) library to do the heavy lifting, so that we can concentrate on examining the results.


## Getting and cleaning the data
I created a CSV file, hosted on GitHub, containing the three texts from the decision.
We will read it into a pandas dataframe (think: very sophisticated spreadsheet) to make our work easier.

In [None]:
# Download the data
!wget https://raw.githubusercontent.com/hlapin/DHTeaching/master/data/heller_2008.csv

Check the Files pane in Colab to confirm that the file downloaded.

In [None]:
import pandas as pd

# this is the dataframe that will hold our texts
dfHeller = pd.read_csv('heller_2008.csv')

In [None]:
## what does our dataframe look like?
dfHeller

Remove punctuation and convert all upper case to lower case. This has costs and benefits for us:
* benefit: it reduces the number of distinct tokens (words) that we need to deal with. (It will treat `And`, `and,` and `and` identically.)
* cost: it conflates tokens that should be distinguished, notably proper nouns and common nouns (`Miller` and `miller`).

We also *tokenize* the text: break it up into a list of its constituent parts (words, in our case).
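
The trade-off described here can be seen on a single made-up sentence. This is a minimal sketch using only the standard library: `re.split` stands in for `RegexpTokenizer(..., gaps=True)`, and the names `GAP`, `toy_tokenize`, `sample`, and `sample_tokens` are illustrative assumptions, not part of the notebook.

```python
import re

# Optional punctuation on either side of a run of whitespace.
# The character class is the one used in the notebook's tokenizer.
PUNCT = r'[.;:,\n\'\"‘’“”!?\(\)\[\]]*'
GAP = re.compile(PUNCT + r'\s+' + PUNCT)

def toy_tokenize(text):
    """Lower-case the text, split on the gap pattern, drop empty strings."""
    return [tok for tok in GAP.split(text.lower()) if tok]

sample = "And the miller said, and Miller replied: and so on."  # made-up sentence
sample_tokens = toy_tokenize(sample)
print(sample_tokens)
```

The three variants of `and` collapse into one token, and `Miller` can no longer be told apart from `miller`. The final `on.` keeps its period, because this pattern only strips punctuation that sits next to whitespace.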

In [None]:
import nltk
import matplotlib.pyplot as plt

from nltk.tokenize import RegexpTokenizer

# create a column of tokenized text
# first, lower-case everything
dfHeller['tokens'] = dfHeller['text'].str.lower()

# tokenize on whitespace, dropping any punctuation adjacent to it
# (gaps=True means the pattern describes the separators, not the tokens)
punct_to_exclude = r'[.;:,\n\'\"‘’“”!?\(\)\[\]]*'
tokenizer = RegexpTokenizer(punct_to_exclude + r'\s+' + punct_to_exclude, gaps=True)

dfHeller['tokens'] = dfHeller['tokens'].apply(tokenizer.tokenize)

# what does this column look like?
dfHeller['tokens']

## What does our data look like "whole"?

What are the most frequent words across the corpus?

In [None]:
# create a single flat list of all the tokens
concat_list = [token for tokens in dfHeller['tokens'] for token in tokens]

# uncomment to view the first ten words
# print(concat_list[:10])

freq = nltk.FreqDist(concat_list)

# find the most frequent words (we are choosing 100)
X = freq.most_common(100)

# Show the ten most common
X[:10]

What is the distribution of the most common words?
What does this tell us about extracting meaningful information from this body of words?

Let's graph the frequencies of the most frequent words.

How far down the list do we need to go to get to "meaningful" words?
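
One rough way to put a number on that question is to walk down the ranked list until the first word that is not a common function word. The sketch below uses `collections.Counter` (which `nltk.FreqDist` subclasses) on a toy token list; `toy_tokens` and `toy_stop_words` are assumptions made up for illustration, not the Heller data.

```python
from collections import Counter

# Toy data: illustrative assumptions only
toy_tokens = ("the right of the people to keep and bear arms "
              "shall not be infringed the people").split()
toy_stop_words = {"the", "of", "to", "and", "shall", "not", "be"}

toy_freq = Counter(toy_tokens)
ranked = toy_freq.most_common()   # [(word, count), ...] by descending count

# 1-based rank of the first word that is not in the stop list
first_rank, first_word = next(
    (rank, word)
    for rank, (word, _count) in enumerate(ranked, start=1)
    if word not in toy_stop_words
)
print(first_rank, first_word)
```

On the real corpus the same loop could be run over `X`, with whatever stop list seems appropriate.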

In [None]:
# set up for a graphic representation.
plt.figure(figsize=(20,20)) 
plt.ylabel("Frequency")
plt.xlabel("Words")
plt.xticks(rotation=90)    # rotates x-axis values

# use `count`, not `freq`, so that we don't overwrite the FreqDist defined above
for word, count in X:
    plt.bar(word, count)
plt.show()



In [None]:
# Let's replace our words by their rank order (1 = most frequent)
# and put ranks and frequencies into lists for more graphing
ranks = list(range(1, len(X) + 1))
counts = [count for word, count in X]

Our second graph should look much like the first, except that we are now plotting two series of numbers (rank and frequency) against each other.

In [None]:
plt.figure(figsize=(20,20)) 
plt.ylabel("Frequency")
plt.xlabel("Rank")

plt.bar(ranks, counts)


plt.show()

**"Zipf's Law"**: in a natural-language corpus, a word's frequency is roughly inversely proportional to its rank. The second most frequent word occurs about half as often as the first; the third, one third as often; the tenth, one tenth as often; and so on. (No surprise: it is a bit more complicated.)

If this prediction holds for our set of words, they should fall along a descending straight line when plotted on a log-log scale.
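
The reason for the straight line: if frequency $f$ at rank $r$ follows $f(r) \approx C / r$, then $\log f \approx \log C - \log r$, which is a line with slope $-1$ on log-log axes. A minimal sketch of how one might estimate that slope with a least-squares fit, here on synthetic, perfectly Zipfian counts (the helper `loglog_slope` and the `ideal_*` lists are illustrative assumptions, not part of the notebook):

```python
import math

def loglog_slope(rank_list, count_list):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(r) for r in rank_list]
    ys = [math.log(c) for c in count_list]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic counts that follow C / r exactly (not the Heller data)
ideal_ranks = list(range(1, 101))
ideal_counts = [1000 / r for r in ideal_ranks]

slope = loglog_slope(ideal_ranks, ideal_counts)
print(round(slope, 3))   # should be very close to -1 for this synthetic data
```

Calling `loglog_slope(ranks, counts)` on the notebook's own lists gives a fitted exponent for the Heller text; a value near $-1$ would be consistent with Zipf's law.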

In [None]:
plt.figure(figsize=(20,20)) 
plt.ylabel("Frequency")
plt.xlabel("Rank")

plt.scatter(ranks, counts)
# set up the axes on log scale
plt.xscale("log")
plt.yscale("log")

plt.show()

## Part of Speech Tagging (POS)

We will be using a part-of-speech tagger provided by NLTK. It uses a machine learning model trained on English text (more specifically, the Wall Street Journal).
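
The tagger returns a list of `(word, tag)` pairs, which ordinary Python can then filter or count. A minimal sketch, using a hand-written list in that shape (hypothetical, not actual tagger output) and the four noun tags of the Penn Treebank tag set; the names `tagged`, `NOUN_TAGS`, `nouns`, and `tag_counts` are assumptions for illustration:

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns.
# Hypothetical example, not real tagger output.
tagged = [
    ("the", "DT"), ("court", "NN"), ("held", "VBD"), ("that", "IN"),
    ("the", "DT"), ("amendment", "NN"), ("protects", "VBZ"),
    ("individual", "JJ"), ("rights", "NNS"),
]

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

# keep only the nouns, and count how often each tag occurs
nouns = [word for word, tag in tagged if tag in NOUN_TAGS]
tag_counts = Counter(tag for _word, tag in tagged)

print(nouns)
print(tag_counts.most_common(3))
```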

**Tags** ([source](https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/))

Tag | Gloss
---|---
CC | coordinating conjunction
CD | cardinal number
DT | determiner
EX | existential 'there'
FW | foreign word
IN | preposition/subordinating conjunction
JJ | adjective: 'big'
JJR | adjective, comparative: 'bigger'
JJS | adjective, superlative: 'biggest'
LS | list marker
MD | modal: 'could', 'will'
NN | noun, singular: 'desk'
NNS | noun, plural: 'desks'
NNP | proper noun, singular: 'Harrison'
NNPS | proper noun, plural: 'Americans'
PDT | predeterminer: 'all' the kids
POS | possessive ending: parent's
PRP | personal pronoun: 'I', 'he', 'she'
PRP\$ | possessive pronoun: 'my', 'his', 'her'
RB | adverb: 'very', 'silently'
RBR | adverb, comparative: 'better'
RBS | adverb, superlative: 'best'
RP | particle: give 'up'
TO | 'to': go 'to' the store
UH | interjection: 'errrrrrrrm'
VB | verb, base form: 'take'
VBD | verb, past tense: 'took'
VBG | verb, gerund/present participle: 'taking'
VBN | verb, past participle: 'taken'
VBP | verb, present tense, non-3rd person singular: 'take'
VBZ | verb, present tense, 3rd person singular: 'takes'
WDT | wh-determiner: 'which'
WP | wh-pronoun: 'who', 'what'
WP\$ | possessive wh-pronoun: 'whose'
WRB | wh-adverb: 'where', 'when'

In [None]:
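# get nltk's default tagger
# (NLTK ships without its trained models, so each one is downloaded separately;
#  recent NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead)
nltk.download('averaged_perceptron_tagger')

# let's use the tokenized Scalia opinion
scalia = dfHeller['tokens'][0]
pos = nltk.pos_tag(scalia)

# test on first 20
pos[:20]

# Results are far from perfect:
# the first `scalia` and `columbia` are tagged as adjectives (JJ),
# probably in part because lower-casing removed the capital letters
# that help the tagger recognize proper nouns.
# To experiment: would splitting the text into sentences first improve the results?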

## Stemming

Stemming reduces related word forms to a single common stem, typically by stripping suffixes; the stem need not be a real word. I initially used the Scalia data for this (and you can too), but what the stemmer does is clearer with a short, hand-picked list of words.
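
To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm that NLTK provides, only a sketch of the general approach; `SUFFIXES`, `naive_stem`, and the word list are assumptions invented for this illustration.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")  # tried in this order

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["walk", "walks", "walker", "walking", "walked",
         "choose", "chose", "chosen"]

# group the words by the stem they reduce to
groups = defaultdict(list)
for word in words:
    groups[naive_stem(word)].append(word)

print(dict(groups))
```

The regular forms of `walk` all land in one group, while the irregular forms `choose`, `chose`, and `chosen` stay apart: suffix rules alone cannot relate them.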

In [None]:
# select one of the standard stemming algorithms from NLTK
from nltk.stem.porter import PorterStemmer

# set up our stemmer
stemmer = PorterStemmer()

# our words to stem and lemmatize
tokens = ['space','spacing','spaces','spacer','spacers',
          'choose','chose', 'choice','chosen','choosers', 
          'walk', 'walks', 'walker','walking','walked']

# apply stemmer for each token in our list of words
stemmed = [stemmer.stem(token) for token in tokens]
stemmed

## Lemmatization

Lemmatization maps all the inflected forms of a word to a single common form. Unlike a stem, that form (the lemma) is a real dictionary word.
We are using the same toy set of words in this example as for stemming.
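
A lemmatizer is, at heart, a dictionary lookup that depends on the part of speech. NLTK's `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to `'n'` (noun), which matters for verb forms. The sketch below imitates that behavior with a tiny hand-built table; `LEMMAS` and `toy_lemmatize` are hypothetical stand-ins, not WordNet itself.

```python
# Tiny hand-built lemma table keyed by (word, part of speech).
# Hypothetical data for illustration; WordNet plays this role in NLTK.
LEMMAS = {
    ("walking", "v"): "walk",
    ("walked", "v"): "walk",
    ("chose", "v"): "choose",
    ("chosen", "v"): "choose",
    ("spaces", "n"): "space",
    ("walks", "n"): "walk",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("chose"))           # looked up as a noun: no entry
print(toy_lemmatize("chose", pos="v"))  # looked up as a verb
```

Looked up as a noun, `chose` comes back unchanged; looked up as a verb, it resolves to `choose`. This is why the part of speech matters when lemmatizing.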

In [None]:
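from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless told otherwise (pos='n' by default),
# so verb forms such as 'walked' come back unchanged; pass pos='v' to treat them as verbs
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
lemmatized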

## Collocation
Which words appear together with others, and what might that tell us about the texts we are studying?
We are looking at a very simple case:
* ordered bigrams (pairs of adjacent words)
* most common "stopwords" filtered out

What kinds of questions can we ask on the basis of the resulting list of the most common bigrams from our three authors/texts?
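
The two steps can be sketched with the standard library alone: `zip` pairs each token with its successor (the same pairs `nltk.bigrams` yields) and `Counter` tallies them. The token list and the mini stop list are assumptions made up for this example.

```python
from collections import Counter

toy_tokens = "the right of the people to keep and bear arms".split()
toy_stop_words = {"the", "of", "to", "and"}   # hypothetical mini stop list

def toy_bigrams(seq):
    """Ordered pairs of adjacent items."""
    return list(zip(seq, seq[1:]))

# filter first, then pair up what is left
toy_filtered = [t for t in toy_tokens if t not in toy_stop_words]
pair_counts = Counter(toy_bigrams(toy_filtered))

print(toy_filtered)
print(pair_counts.most_common(3))
```

One design consequence worth noticing: because stopwords are removed before pairing, a bigram such as `('keep', 'bear')` joins two words that were not adjacent in the original text. Whether that is desirable depends on the question being asked.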


In [None]:
# download a default set of stopwords. We could create our own.
nltk.download('stopwords')

# we are going to exclude `stopwords` in order to avoid bigrams
# like `and the`, `and if`, `or if`, `or the`
from nltk.corpus import stopwords
# a set makes the membership test below much faster than a list
stop_words = set(stopwords.words('english'))

dfHeller['filtered'] = dfHeller['tokens'].apply(lambda x: 
                       [token for token in x if token not in stop_words] )

dfHeller['bigrams'] = dfHeller['filtered'].apply(lambda x: 
                      nltk.FreqDist(nltk.bigrams(x)).most_common(20))

# dfHeller[['tokens','filtered','bigrams']]
dfHeller[['authors','bigrams']]