# NLTK

The Natural Language Toolkit (NLTK) is an open source Python library that performs many NLP functions such as tokenizing, stemming, part-of-speech tagging, and more.

Instructions for installing NLTK can be found [on the NLTK site](https://www.nltk.org/install.html) for Mac/Unix and Windows systems. 
 
After NLTK is installed, the next step is to download NLTK resources. This is best done in a console window:

```
$ python3

>>> import nltk
>>> nltk.download()

```

The nltk.download() call opens a window with a list of items to download. At a minimum, select the 'book' line to download the content associated with the NLTK book. Calling nltk.download('book') instead downloads the same collection without opening the window.

### Tokenizing

**Tokenizing** is the process of breaking text into smaller units. NLTK can divide text into sentences or into individual tokens. A **token** is a word, number, or punctuation mark. The following code block imports NLTK and its word and sentence tokenizers.

In [1]:
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize

#### Python split() versus NLTK tokenize

We can split text into tokens with Python's split() method, as shown below. However, the punctuation stays attached to the adjacent word. Generally, each punctuation mark should be its own token. As the output below shows, NLTK separates the punctuation into its own tokens.

In [2]:
text = "I am Sam. Sam I am. I do not like green eggs and ham."

tokens = text.split()
print(tokens)

tokens = word_tokenize(text)
print(tokens)

['I', 'am', 'Sam.', 'Sam', 'I', 'am.', 'I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham.']
['I', 'am', 'Sam', '.', 'Sam', 'I', 'am', '.', 'I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham', '.']
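
To make the difference concrete, the following is a minimal sketch of a regular-expression tokenizer that separates punctuation from words. It is not how NLTK's word_tokenize works internally; the function name simple_tokenize and the pattern are assumptions chosen for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

text = "I am Sam. Sam I am."
print(text.split())           # ['I', 'am', 'Sam.', 'Sam', 'I', 'am.']
print(simple_tokenize(text))  # ['I', 'am', 'Sam', '.', 'Sam', 'I', 'am', '.']
```

This pattern treats every period as a separate token, so it would split an abbreviation such as 'Mr.' into two tokens. It is only a rough stand-in for a real tokenizer.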


#### Sentence segmentation

The NLTK sentence tokenizer, sent_tokenize, performs sentence segmentation: it splits text into individual sentences.

In [3]:
sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence)

I am Sam.
Sam I am.
I do not like green eggs and ham.


In [4]:
# NLTK will not end a sentence on just any '.'
sent_tokenize('Mr. Smith went to Dr. Jones. Dr. Jones was trained in the U.S.A.')

['Mr. Smith went to Dr. Jones.', 'Dr. Jones was trained in the U.S.A.']
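
For contrast, here is a sketch of a naive splitter that ends a sentence at every '.', '!', or '?' followed by whitespace. The helper name naive_sent_split is hypothetical and this is not NLTK's approach; it only shows why abbreviations need special handling.

```python
import re

def naive_sent_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    # the lookbehind keeps the punctuation attached to the preceding piece
    return re.split(r"(?<=[.!?])\s+", text.strip())

pieces = naive_sent_split('Mr. Smith went to Dr. Jones. Dr. Jones was trained in the U.S.A.')
print(pieces)
# ['Mr.', 'Smith went to Dr.', 'Jones.', 'Dr.', 'Jones was trained in the U.S.A.']
```

The naive rule produces five fragments where sent_tokenize returns two sentences.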

In [5]:
# NLTK keeps '.' with the token for titles and abbreviations
word_tokenize('Mr. Smith went to Dr. Jones. Dr. Jones was trained in the U.S.A.')

['Mr.',
 'Smith',
 'went',
 'to',
 'Dr.',
 'Jones',
 '.',
 'Dr.',
 'Jones',
 'was',
 'trained',
 'in',
 'the',
 'U.S.A',
 '.']

# Preprocessing Text

Raw text is preprocessed for NLP applications. Preprocessing can involve any of the following:

* convert text to lower case
* remove punctuation
* remove numbers or replace them with a generic token like NUM
* stem words - remove affixes
* lemmatize words - convert to dictionary (base) form
* remove stop words - reduce text to content words

Choosing which preprocessing steps are appropriate for a given NLP application is an important decision. Some of the preprocessing actions can be done with Python functions, some with regular expressions, and some with NLTK methods.
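
As one illustration of the options listed above, the sketch below replaces each number with a generic NUM token using a regular expression. The function name replace_numbers, the sample sentence, and the choice of pattern are assumptions for this example, not an NLTK feature.

```python
import re

def replace_numbers(text, placeholder="NUM"):
    """Replace integers and simple decimals with a placeholder token."""
    # \d+ matches the integer part; (?:\.\d+)? optionally matches a decimal part
    return re.sub(r"\d+(?:\.\d+)?", placeholder, text)

sample = "I started teaching there in 2016. As of 2018, the rate was 3.5 percent."
print(replace_numbers(sample))
# I started teaching there in NUM. As of NUM, the rate was NUM percent.
```

Replacing rather than deleting numbers keeps the information that a number occurred at that position, which some applications may want to preserve.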



In [6]:
raw_text = """ I teach at the University of Texas at Dallas. I
     started teaching there in 2016. As of 2018, UTD has been designated 
     as a national research university!"""

#### Lowercase, remove punctuation and numbers, tokenize

The two code blocks below show different ways to remove punctuation and numbers. The first uses a regular expression to replace them with spaces before tokenizing. The second tokenizes first, then uses a list comprehension to keep only alphabetic tokens.

In [7]:
# remove punctuation and numbers with a regular expression
import re

text = re.sub(r'[.?!,:;()\-\n\d]',' ', raw_text.lower())
tokens = word_tokenize(text)
print(tokens)

['i', 'teach', 'at', 'the', 'university', 'of', 'texas', 'at', 'dallas', 'i', 'started', 'teaching', 'there', 'in', 'as', 'of', 'utd', 'has', 'been', 'designated', 'as', 'a', 'national', 'research', 'university']


In [8]:
# tokenize first, then keep only alphabetic tokens with a list comprehension

tokens = [t.lower() for t in word_tokenize(raw_text) if t.isalpha()]
print(tokens)

['i', 'teach', 'at', 'the', 'university', 'of', 'texas', 'at', 'dallas', 'i', 'started', 'teaching', 'there', 'in', 'as', 'of', 'utd', 'has', 'been', 'designated', 'as', 'a', 'national', 'research', 'university']
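
The two approaches give the same result on this text, but they are not equivalent in general. The sketch below uses str.split() as a stand-in tokenizer (an assumption, to keep the example free of NLTK resources) and a hypothetical sample string to show how they can differ on a hyphenated word.

```python
import re

sample = "A well-known result from 2018."

# approach 1: replace punctuation and digits with spaces, then split
cleaned = re.sub(r'[.?!,:;()\-\n\d]', ' ', sample.lower())
by_regex = cleaned.split()
print(by_regex)
# ['a', 'well', 'known', 'result', 'from']

# approach 2: split first, then keep only purely alphabetic tokens
by_filter = [t.lower() for t in sample.split() if t.isalpha()]
print(by_filter)
# ['a', 'result', 'from']
```

The regular expression breaks the hyphenated word into two tokens, while the isalpha() filter discards the whole token because it contains a non-letter character.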


#### Removing stop words

Stop words are words that don't carry much content. They are often function words that help bind a sentence together grammatically. NLTK provides a list of stop words that also includes personal pronouns and common prepositions, but you may find it necessary to build application-dependent stop word lists for different projects.

Let's look at NLTK's stop words.

In [9]:
from nltk.corpus import stopwords
# note: this rebinds the name 'stopwords' from the corpus reader to a plain list of words
stopwords = stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
# get rid of stop words
print('number of tokens before stopword removal:', len(tokens))
tokens_content = [t for t in tokens if t not in stopwords]
print('number of tokens after stopword removal:', len(tokens_content))
print(tokens_content)

number of tokens before stopword removal: 25
number of tokens after stopword removal: 11
['teach', 'university', 'texas', 'dallas', 'started', 'teaching', 'utd', 'designated', 'national', 'research', 'university']
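
For an application-dependent list, one option is to start from a base list and add domain words. The sketch below uses a small hand-written base list and hypothetical extra words so that it runs without NLTK resources; in practice the base could be the NLTK list loaded above. Storing the words in a set keeps each membership test fast.

```python
# small illustrative base list; a stand-in for stopwords.words('english')
base_stop_words = {'i', 'at', 'the', 'of', 'there', 'in', 'as', 'has', 'been', 'a'}

# hypothetical domain-specific additions for a corpus about universities
custom_stop_words = base_stop_words | {'university', 'utd'}

sample_tokens = ['i', 'teach', 'at', 'the', 'university', 'of', 'texas', 'at', 'dallas']
content = [t for t in sample_tokens if t not in custom_stop_words]
print(content)
# ['teach', 'texas', 'dallas']
```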


#### Stemming and Lemmatizing

Stemming removes affixes from words. The resulting stems may not be actual words, or may be words with different meanings:

```
university is stemmed to univers
```

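Lemmatization can do some strange things as well. By default, WordNetLemmatizer treats every token as a noun, so a trailing 's' may be removed as if it were a plural ending: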

```
'as' is lemmatized to 'a'

'has' is lemmatized to 'ha'
```

But in general, word normalization is helpful for grouping together words with the same root meaning. For example, a stemmer aims to reduce "education", "educational", and "educationally" to a single shared stem.

In [11]:
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
stemmed = [stemmer.stem(t) for t in tokens]
print('stemmed tokens:\n', stemmed)

wnl = WordNetLemmatizer()
lemmatized = [wnl.lemmatize(t) for t in tokens]
print('\nlemmatized tokens:\n', lemmatized)

stemmed tokens:
 ['i', 'teach', 'at', 'the', 'univers', 'of', 'texa', 'at', 'dalla', 'i', 'start', 'teach', 'there', 'in', 'as', 'of', 'utd', 'ha', 'been', 'design', 'as', 'a', 'nation', 'research', 'univers']

lemmatized tokens:
 ['i', 'teach', 'at', 'the', 'university', 'of', 'texas', 'at', 'dallas', 'i', 'started', 'teaching', 'there', 'in', 'a', 'of', 'utd', 'ha', 'been', 'designated', 'a', 'a', 'national', 'research', 'university']
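
To see how stems such as 'texa' and 'dalla' can arise, consider a toy suffix stripper. This is a deliberately simplified sketch, not the Porter algorithm; the suffix list and the toy_stem name are assumptions for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical list, tried in this order

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["teach", "teaches", "teaching", "started", "texas", "dallas"]:
    print(w, "->", toy_stem(w))
# teach -> teach
# teaches -> teach
# teaching -> teach
# started -> start
# texas -> texa
# dallas -> dalla
```

A rule that blindly removes a final 's' groups the forms of 'teach' together, but it also truncates proper nouns, which is the same effect visible in the stemmed output above.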


### NLTK in other languages

Some NLTK features support other languages, primarily European languages. Other tools we will look at later in the course support additional languages such as Arabic and Chinese.

In [12]:
import nltk.data
# load the pre-trained Spanish Punkt sentence tokenizer (requires the 'punkt' resource)
tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
tokenizer.tokenize('Hola mi amor. Como estas?')

['Hola mi amor.', 'Como estas?']