<a href="https://colab.research.google.com/github/todnewman/coe_training/blob/master/Basic_NLP_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Natural Language Processing
**Author**: W. Tod Newman

**Updates**: New release

## Learning Objectives


*   Learn the basics of the Python Natural Language Toolkit
*   Explore concepts of language processing: parts of speech, corpora, stemming, lemmatizing, etc.
*   Get an overview of simple neural network classification


# About Python's Natural Language Toolkit (NLTK)

NLTK is one of the most widely used NLP modules for Python.  It comes with the Anaconda distribution, so it's very easy to start working once Anaconda is in place.  From the NLTK site:

*NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.*

NLTK offers a very large set of open data that can be used to train NLTK models.  This NLTK Data includes many corpora, grammars, trained models, and more.  Without NLTK Data, NLTK is of limited use.  You can find the complete NLTK Data list here: http://nltk.org/nltk_data/

The simplest way to install NLTK Data is to run the Python interpreter and type the commands:

    >>> import nltk
    >>> nltk.download()

This should open the NLTK Downloader window and you can select which modules to download.  The Brown University corpus is one of the most cited artifacts in the field of corpus linguistics.  We'll start by exploring how we can make use of it in our own text classification tasks.


In [None]:
# use natural language toolkit
import nltk

#
# Use the nltk downloader to download corpora, tools, and dictionaries
#
nltk.download('brown')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('names')
nltk.download('tagsets')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('omw-1.4')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already

True

## Corpora

### Brown University corpus

The Brown Corpus was compiled in the 1960s by Henry Kučera and W. Nelson Francis at Brown University in Providence, Rhode Island, as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961.

For more information: https://en.wikipedia.org/wiki/Brown_Corpus

### What will we do here?

We will load the corpus (which we downloaded with the nltk downloader above) and print the first 10 words along with their part-of-speech (POS) tags.


In [None]:
# Import the Brown University Corpus and print the first ten words
from nltk.corpus import brown
print("\nPrinting the first 10 words in the Brown University Corpus:\n")
print(brown.words()[0:10])
print("\nNow printing the POS tags for the first 10 words:\n")
print(brown.tagged_words()[0:10])
# Python 3 strings are Unicode by default, so no u'WORD' prefix appears in the output.
print(f"\nTotal number of words in the corpus: {len(brown.words())}")



Printing the first 10 words in the Brown University Corpus:

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']

Now printing the POS tags for the first 10 words:

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]

Total number of words in the corpus: 1161192
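
One simple use of `(word, tag)` tuples is counting how often each tag occurs. The sketch below works on the ten tagged words printed above, copied by hand so that it runs without the corpus. The `base_tag` helper is made up for this example, and it assumes that a suffix such as `-TL` (used in the Brown tagset for words that appear in a title) can simply be cut off at the first hyphen, which is a simplification.

```python
from collections import Counter

# The first ten tagged words of the Brown corpus, copied from the output above.
tagged_sample = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'),
    ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'),
    ('investigation', 'NN'), ('of', 'IN'),
]

def base_tag(tag):
    """Drop any suffix after the first hyphen (hypothetical helper).

    Falls back to the full tag when nothing precedes the hyphen.
    """
    return tag.split('-')[0] or tag

tag_counts = Counter(base_tag(tag) for _, tag in tagged_sample)

for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```

The same counting pattern should apply to a larger slice of `brown.tagged_words()`, although prefixed tags would need extra handling.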


## Overview of Sentence, Word, and Part of Speech Processing

### What will we do here?

We will bring in a short block of text (a passage about the Norman Lykes house, designed by Frank Lloyd Wright) and process it:

*  Tokenize the text into sentences
*  Tokenize the sentences into words
*  Tag the words with part of speech and demonstrate use cases for POS tags
*  Show how to print out the "key" for NLTK POS tags
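
Sentence tokenization is harder than splitting on periods, because abbreviations such as "Oct." also end in a period. As a rough illustration, the sketch below applies a naive regex splitter to a flattened copy of the sample text used in the next cell. The `naive_sentence_split` helper is written for this example only and is not how NLTK's `sent_tokenize` works.

```python
import re

# Flattened, single-line copy of the sample text used in the next cell.
sample_text = (
    'The legendary Norman Lykes house, designed by architect Frank Lloyd Wright, '
    'will be up for auction on Oct. 16. It was the last residence designed by Wright, '
    'who designed many iconic homes, including "Fallingwater" in Pennsylvania, '
    'as well as the Solomon Guggenheim Museum in New York City.'
)

def naive_sentence_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

naive_sentences = naive_sentence_split(sample_text)
for i, sentence in enumerate(naive_sentences):
    print(f"Naive sentence {i}: {sentence}")
```

The naive splitter breaks after "Oct." and yields three pieces, while `sent_tokenize` in the cell below returns two sentences for the same text.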

In [None]:
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.chunk import ne_chunk

text = """ The legendary Norman Lykes house, designed by architect Frank Lloyd Wright, will be up for auction on Oct. 16. It was the last residence designed by Wright, 
    who designed many iconic homes, including "Fallingwater" in Pennsylvania, as well as the Solomon Guggenheim Museum in New York City.  
"""
sents = sent_tokenize(text) # This will break the text into sentences.

for i,s in enumerate(sents):
    print(f"Sentence{i}: {s}\n")

print ("*** The # of Sentences in the last example is %s" % len(sents))

tokens = word_tokenize(text)

print ("\n*** Printing the tokens (words) out of the sentences\n")
print (tokens)  # Breaks into tokens.  

tagged_tokens = pos_tag(tokens)

print ("\n*** Printing the POS TAGGED tokens (words) out of the sentences\n")

print (tagged_tokens) # Breaks into (Token, POS Tag) tuples

# Let's walk through the tuples and do some grouping

print("\n*** NOW we'll be printing only the tokens (words) that are Nouns\n")


# The loop variable is named `tag` so it does not shadow the imported pos_tag function
for token, tag in tagged_tokens:
    if tag == 'NNP' or tag == 'NN':
        print(token)




Sentence0:  The legendary Norman Lykes house, designed by architect Frank Lloyd Wright, will be up for auction on Oct. 16.

Sentence1: It was the last residence designed by Wright, 
    who designed many iconic homes, including "Fallingwater" in Pennsylvania, as well as the Solomon Guggenheim Museum in New York City.

*** The # of Sentences in the last example is 2

*** Printing the tokens (words) out of the sentences

['The', 'legendary', 'Norman', 'Lykes', 'house', ',', 'designed', 'by', 'architect', 'Frank', 'Lloyd', 'Wright', ',', 'will', 'be', 'up', 'for', 'auction', 'on', 'Oct.', '16', '.', 'It', 'was', 'the', 'last', 'residence', 'designed', 'by', 'Wright', ',', 'who', 'designed', 'many', 'iconic', 'homes', ',', 'including', '``', 'Fallingwater', "''", 'in', 'Pennsylvania', ',', 'as', 'well', 'as', 'the', 'Solomon', 'Guggenheim', 'Museum', 'in', 'New', 'York', 'City', '.']

*** Printing the POS TAGGED tokens (words) out of the sentences

[('The', 'DT'), ('legendary', 'JJ'), ('No
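
The noun filter in the cell above keeps only tokens tagged exactly `NN` or `NNP`, so plural nouns such as "homes" (tagged `NNS`) are skipped. One possible variation is to group tokens by the first two letters of the Penn Treebank tag. The sketch below does this on a hand-copied subset of the tagged tokens so that it runs without NLTK. The `COARSE` mapping and the `group_by_coarse_pos` helper are choices made for this example.

```python
from collections import defaultdict

# Hand-copied subset of the (token, tag) pairs from this notebook's output.
tagged_sample = [
    ('The', 'DT'), ('legendary', 'JJ'), ('Norman', 'NNP'), ('Lykes', 'NNP'),
    ('house', 'NN'), ('designed', 'VBN'), ('by', 'IN'), ('architect', 'NN'),
    ('who', 'WP'), ('designed', 'VBD'), ('many', 'JJ'), ('iconic', 'JJ'),
    ('homes', 'NNS'),
]

# Map the first two letters of a Penn Treebank tag to a coarse category.
COARSE = {'NN': 'noun', 'VB': 'verb', 'JJ': 'adjective'}

def group_by_coarse_pos(tagged):
    """Collect tokens under 'noun', 'verb', 'adjective', or 'other'."""
    groups = defaultdict(list)
    for token, tag in tagged:
        groups[COARSE.get(tag[:2], 'other')].append(token)
    return dict(groups)

groups = group_by_coarse_pos(tagged_sample)
for category, tokens in groups.items():
    print(f"{category}: {tokens}")
```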

In [None]:
print('Here\'s how we can figure out what these Part of Speech Tags mean!')
print('__________________________________________________________________')

# Look up each distinct tag once, in order of first appearance.
# The loop variable is named `tag` so it does not shadow the imported pos_tag function.
for tag in dict.fromkeys(tag for _, tag in tagged_tokens):
    nltk.help.upenn_tagset(tag)

Here's how we can figure out what these Part of Speech Tags mean!
__________________________________________________________________
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster C

## Named Entity Recognition (NER)

In [None]:
# Reusing our tagged_tokens from the above block.  Now we'll do NER on it.
print('\n*** First we will print out the entire NER-tagged tree.\n')
ne_tree = ne_chunk(tagged_tokens)
print(ne_tree)
print('\n*** Extracting the NER Labels below ***\n')
# Named-entity chunks are subtrees with a label; plain tokens are (word, tag) tuples.
for chunk in ne_tree:
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk))


*** First we will print out the entire NER-tagged tree.

(S
  The/DT
  legendary/JJ
  (PERSON Norman/NNP Lykes/NNP)
  house/NN
  ,/,
  designed/VBN
  by/IN
  architect/NN
  (PERSON Frank/NNP Lloyd/NNP Wright/NNP)
  ,/,
  will/MD
  be/VB
  up/RP
  for/IN
  auction/NN
  on/IN
  Oct./NNP
  16/CD
  ./.
  It/PRP
  was/VBD
  the/DT
  last/JJ
  residence/NN
  designed/VBN
  by/IN
  (PERSON Wright/NNP)
  ,/,
  who/WP
  designed/VBD
  many/JJ
  iconic/JJ
  homes/NNS
  ,/,
  including/VBG
  ``/``
  Fallingwater/NNP
  ''/''
  in/IN
  (GPE Pennsylvania/NNP)
  ,/,
  as/RB
  well/RB
  as/IN
  the/DT
  (ORGANIZATION Solomon/NNP Guggenheim/NNP Museum/NNP)
  in/IN
  (GPE New/NNP York/NNP City/NNP)
  ./.)

*** Extracting the NER Labels below ***

PERSON Norman Lykes
PERSON Frank Lloyd Wright
PERSON Wright
GPE Pennsylvania
ORGANIZATION Solomon Guggenheim Museum
GPE New York City
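
The extracted `(label, text)` pairs are easier to work with once they are grouped by label. A minimal sketch follows, with the pairs copied by hand from the output above so that it runs without NLTK. The `group_entities` helper is a name made up for this example.

```python
from collections import defaultdict

# (label, entity text) pairs copied from the NER output above.
entities = [
    ('PERSON', 'Norman Lykes'),
    ('PERSON', 'Frank Lloyd Wright'),
    ('PERSON', 'Wright'),
    ('GPE', 'Pennsylvania'),
    ('ORGANIZATION', 'Solomon Guggenheim Museum'),
    ('GPE', 'New York City'),
]

def group_entities(pairs):
    """Group entity strings by label, keeping first-seen order and skipping repeats."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

entities_by_label = group_entities(entities)
for label, names in entities_by_label.items():
    print(f"{label}: {names}")
```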


# Stemming and Lemmatization (what???)

Stemming and lemmatization are basic text-processing methods for English text. The goal of both stemming and lemmatization is to *reduce inflectional forms of a word to a common base form*. Here are the definitions from Wikipedia for stemming and lemmatization:

In linguistic morphology (i.e., the structure of words) and information retrieval, **stemming** is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form.

**Lemmatization**, in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

For English, which has a fairly simple morphology, this task is generally simple.  For morphologically rich languages (Turkish is a good example), it is absolutely necessary.
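
To make the difference concrete, here is a toy sketch. The stemmer strips suffixes by rule and may return a non-word, while the lemmatizer looks words up in a dictionary and returns a real word. The suffix list and the tiny lemma dictionary are invented for this example; they are not the Lancaster algorithm or WordNet.

```python
# Toy suffix rules, checked in order (invented for illustration).
SUFFIXES = ('ing', 'ly', 'es', 's')

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lemma dictionary (invented for illustration).
TOY_LEMMAS = {
    'wolves': 'wolf',
    'challenges': 'challenge',
    'challenging': 'challenge',
}

def toy_lemmatize(word):
    """Return the dictionary form when known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ['quickly', 'challenging', 'challenges', 'wolves', 'centre']:
    print(f"{word:12} stem={toy_stem(word):10} lemma={toy_lemmatize(word)}")
```

The stems "challeng" and "wolv" are not English words, which is typical of rule-based stemming; the lemmas "challenge" and "wolf" are.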

### What will we do here?
We'll instantiate a Lancaster Stemmer and demonstrate what a stemmer does.  Then we will instantiate a Lemmatizer and demonstrate what a lemmatizer does.

In [None]:
from nltk.stem.lancaster import LancasterStemmer
# word stemmer
stemmer = LancasterStemmer()
print (stemmer.stem('quickly'))
print (stemmer.stem('challenging'))
print (stemmer.stem('challenges'))
print (stemmer.stem('wolves'))
print (stemmer.stem('centre'))

quick
challeng
challeng
wolv
cent


In [None]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print (wordnet_lemmatizer.lemmatize('dogs'))
print (wordnet_lemmatizer.lemmatize('wolves'))
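# Note that the default POS for lemmatize is noun ('n').  Treated as a noun, the verb form 'does'
# is read as the plural of 'doe'; pass pos='v' to lemmatize a word as a verb instead.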
print (wordnet_lemmatizer.lemmatize('does', pos='n'))
print (wordnet_lemmatizer.lemmatize('centre'))
print (wordnet_lemmatizer.lemmatize('challenging'))
print (wordnet_lemmatizer.lemmatize('challenges'))

dog
wolf
doe
centre
challenging
challenge
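
Because the lemmatizer treats every word as a noun unless told otherwise, "challenging" comes back unchanged above. A common pattern is to derive the lemmatizer's `pos` argument from the Penn Treebank tag that `pos_tag` assigned. The helper below is a sketch of that mapping; `penn_to_wordnet` is a name chosen for this example, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r', or 'n')."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # noun, and the fallback for all other tags

for tag in ['NNP', 'VBD', 'VBG', 'JJ', 'RB', 'DT']:
    print(f"{tag} -> {penn_to_wordnet(tag)}")
```

With NLTK loaded, the result could be passed along as `wordnet_lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))` for each `(token, tag)` pair in `tagged_tokens`.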


# What can we do with these NLP techniques??

## Toy Example: Gender-based name classifier

Use the NLTK Name corpus to train a Gender Identification classifier.  This approach determines the likelihood that a name is associated with the 'male name' section of the corpus or the 'female name' section.  In this case, this is a lightweight form of supervised machine learning.

This approach is the basis for more complex classifiers that I have developed.

### What will we do here?

We're going to take the male and female names from the NLTK names corpus, label each name with its gender, shuffle the labeled names, and then display the first few.

In [None]:
# Grab names out of the nltk name corpus.

from nltk.corpus import names
import random

# Label each name with the section of the names corpus (male or female) that it came from.
classified_names = ([(name, 'male') for name in names.words('male.txt')] 
         + [(name, 'female') for name in names.words('female.txt')])

random.shuffle(classified_names)

print("\nLet's output a sample of our shuffled, labeled names:")
classified_names[0:17]
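
Before training a real classifier, it helps to see how a labeled list like this turns into features and predictions. The sketch below uses a single feature, the last letter of the name, and a majority-vote baseline per letter. It is not Naive Bayes and not part of NLTK. The twelve names are a hand-written sample rather than a draw from the corpus, and all helper names are made up for this example.

```python
from collections import Counter, defaultdict

# Small hand-written sample in the same (name, label) format as classified_names.
sample_names = [
    ('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
    ('Emma', 'female'), ('Laura', 'female'), ('Sophia', 'female'),
    ('John', 'male'), ('Robert', 'male'), ('David', 'male'),
    ('Kevin', 'male'), ('Mark', 'male'), ('Peter', 'male'),
]

def name_features(name):
    """Turn a name into a feature dictionary (here, only the last letter)."""
    return {'last_letter': name[-1].lower()}

def train_last_letter_baseline(labeled_names):
    """For each last letter, remember the label seen most often."""
    votes = defaultdict(Counter)
    for name, label in labeled_names:
        votes[name_features(name)['last_letter']][label] += 1
    return {letter: counts.most_common(1)[0][0] for letter, counts in votes.items()}

def predict(model, name, default='unknown'):
    """Predict a label, or return `default` for a last letter never seen in training."""
    return model.get(name_features(name)['last_letter'], default)

model = train_last_letter_baseline(sample_names)
for name in ['Olivia', 'Xavier', 'Zoe']:
    print(name, '->', predict(model, name))
```

A statistical classifier improves on this baseline by combining several features and by giving graded probabilities instead of a hard vote.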

## Improve the Name Classifier and Return Scores

Using some built-in utilities from NLTK, we will train several classifiers (using scikit-learn, another great Python module) to classify names that were held out from the training set.

In [None]:
from nltk.classify.scikitlearn import SklearnClassifier
import numpy as np
from nltk.classify.util import names_demo, binary_names_demo_features
# LogisticRegression lives in sklearn.linear_model in current scikit-learn releases
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Classify names using nltk built in names demo

print("\nClassify names using scikit-learn Naive Bayes:\n")
names_demo(SklearnClassifier(BernoulliNB(binarize=False), dtype=bool).train,
               features=binary_names_demo_features)

print("\nClassify names using scikit-learn logistic regression:\n")
names_demo(SklearnClassifier(LogisticRegression(), dtype=np.float64).train,
               features=binary_names_demo_features)

print("\nClassify names using scikit-learn Random Forest Classifier:\n")
names_demo(SklearnClassifier(RandomForestClassifier(), dtype=np.float64).train,
               features=binary_names_demo_features)

