# **Working with NLTK**

NLTK stands for the Natural Language Toolkit and is written by two  computational linguists, Steven Bird (Senior Research Associate of the LDC and professor at the University of Melbourne) and Ewan Klein (Professor of Linguistics at Edinburgh University).


In [1]:
#Instalation (if not already installed)

!pip install numpy
!pip install nltk



The top-level library is called nltk and we can refer to the included modules by using their fully qualified dotted names, e.g. nltk.corpus and nltk.utilities. The contents of any such module can then be imported into the top-level namespace by using the standard "from ... import ..." construct in Python.

In [3]:
import nltk

# A new window should open, showing the NLTK Downloader. Downloading everything
# may take a while. For now, we will just download the popular packages.
nltk.download("popular")

# You can also download everything with
# nltk.download('all')

# or you can download specific parts of nltk
# nltk.download("abc")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

## **Using and Exploring the Built-in Corpuses**

We have currently downloaded the most popular corpuses in NLTK so lets take a look at one. If you want to view a list of options you can run the code below which will loop through the corpuses in nltk.corpus and print there names.

In [12]:
# Lists the various corpora and CorpusReader classes in the nltk.corpus module
import nltk.corpus
for name in dir(nltk.corpus):
  if not name.startswith('_'):
    print(name)

AlignedCorpusReader
AlpinoCorpusReader
BCP47CorpusReader
BNCCorpusReader
BracketParseCorpusReader
CHILDESCorpusReader
CMUDictCorpusReader
CategorizedBracketParseCorpusReader
CategorizedCorpusReader
CategorizedPlaintextCorpusReader
CategorizedSentencesCorpusReader
CategorizedTaggedCorpusReader
ChasenCorpusReader
ChunkedCorpusReader
ComparativeSentencesCorpusReader
ConllChunkCorpusReader
ConllCorpusReader
CorpusReader
CrubadanCorpusReader
DependencyCorpusReader
EuroparlCorpusReader
FramenetCorpusReader
IEERCorpusReader
IPIPANCorpusReader
IndianCorpusReader
KNBCorpusReader
LazyCorpusLoader
LinThesaurusCorpusReader
MTECorpusReader
MWAPPDBCorpusReader
MacMorphoCorpusReader
NKJPCorpusReader
NPSChatCorpusReader
NombankCorpusReader
NonbreakingPrefixesCorpusReader
OpinionLexiconCorpusReader
PPAttachmentCorpusReader
PanLexLiteCorpusReader
PanlexSwadeshCorpusReader
Pl196xCorpusReader
PlaintextCorpusReader
PortugueseCategorizedPlaintextCorpusReader
PropbankCorpusReader
ProsConsCorpusReader
RTECorp

For a specific corpus, list the fileids that are available.

In [13]:
print(nltk.corpus.gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


**Exercise:** Explore some of the [Corpus Reader Functions](https://www.nltk.org/api/nltk.corpus.reader.html) using one of the files in the Gutenberg corpus.

In [15]:
from nltk.corpus import gutenberg

austen_words = gutenberg.words('austen-sense.txt')
print(austen_words)

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', ...]


Is this a list? Check the type of gutenberg.words('austen-sense.txt').

In [17]:
print(type(austen_words))

<class 'nltk.corpus.reader.util.StreamBackedCorpusView'>


Some preprocessing steps do not require the use of a library like nltk and use simple string operations such as making all characters in a string lowercase or removing punctuation.

### **Making Lowercase**

In [20]:
import string

text = "Th!s Sh0uLd B3 L0w3rCas3!"
text = text.lower()
print(text)

th!s sh0uld b3 l0w3rcas3!


### **Removing Punctuation**

In [21]:
print(string.punctuation)

text_p = "".join([char for char in text if char not in string.punctuation])
print(text_p)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
ths sh0uld b3 l0w3rcas3


### **Removing Digits**

In [22]:
#removing digits in the corpus
import re
text_d = re.sub(r'\d+','', text_p)
print(text_d)

ths shuld b lwrcas


## **Tokenization**

In [26]:
from nltk.tokenize import sent_tokenize, word_tokenize

text_1883 = "I remember the first time I saw it. I heard a thousand stories… But none could describe this place. It must be witnessed to be understood. And yet, I’ve seen it. And I understand it less than when I first cast eyes on this place."

sentences = sent_tokenize(text_1883)
tokens = word_tokenize(text_1883)


In [27]:
print(sentences)

['I remember the first time I saw it.', 'I heard a thousand stories… But none could describe this place.', 'It must be witnessed to be understood.', 'And yet, I’ve seen it.', 'And I understand it less than when I first cast eyes on this place.']


In [28]:
print(tokens)

['I', 'remember', 'the', 'first', 'time', 'I', 'saw', 'it', '.', 'I', 'heard', 'a', 'thousand', 'stories…', 'But', 'none', 'could', 'describe', 'this', 'place', '.', 'It', 'must', 'be', 'witnessed', 'to', 'be', 'understood', '.', 'And', 'yet', ',', 'I', '’', 've', 'seen', 'it', '.', 'And', 'I', 'understand', 'it', 'less', 'than', 'when', 'I', 'first', 'cast', 'eyes', 'on', 'this', 'place', '.']


NLTK also includes a tweet corpus. Let's compare what the results look like if we tokenize a tweet with a standard word tokenizer vs a tweet tokenizer.

In [29]:
from nltk.corpus import twitter_samples
from nltk.tokenize import TweetTokenizer

# strip_handles=True, reduce_len=True
twt_tokenizer = tweet_tokenizer = TweetTokenizer()
tweet = twitter_samples.strings("positive_tweets.json")[0]
print(tweet)

#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)


In [30]:
print(twt_tokenizer.tokenize(tweet))

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']


What happens when we use the regular word tokenizer?

## **Removing stopwords**


In [67]:
# How do we access possible stopwords?
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words("english")

text = "This is a sample text with some stop words."
words = word_tokenize(text)

filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words)

['sample', 'text', 'stop', 'words', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Remove all of the stopwords in the austen-sense.txt file.

## **Stemming**

In [34]:
from nltk.stem.porter import PorterStemmer

#words = ["program", "programs", "programmer", "programming", "programmers"]

print("Before Stemming:")
print(sentences[2])


Before Stemming:
It must be witnessed to be understood.


In [35]:
porter = PorterStemmer()

for word in tokens[20:50]:
  print(porter.stem(word))

.
it
must
be
wit
to
be
understood
.
and
yet
,
i
’
ve
seen
it
.
and
i
understand
it
less
than
when
i
first
cast
eye
on


**Exercise:** Try using the other stemmers and compare.

In [36]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

stemmer = SnowballStemmer("english")

## **Word Senses and Semantics**

We also have access to Wordnet through nltk though loading the load the WordNet corpus takes a bit of time. WordNet, a long­running lexical resource project from Princeton University, aims to catalog the senses of most words in the English language, along with other lexical relationships. Word senses can also be induced from the context—automatic discovery of word senses from text was actually the first place semi­supervised learning was applied to NLP.

In [37]:
from nltk.corpus import wordnet

syns = wordnet.synsets("car")
print(syns)

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]


In [38]:
print(syns[0].lemmas()[0].name())
print(syns[1].definition())

car
a wheeled vehicle adapted to the rails of railroad


In [41]:
# get examples of each synset
for synset in syns:
    print(f"Synset: {synset.name()}")
    print(f"Examples: {synset.examples()}")

Synset: car.n.01
Examples: ['he needs a car to get to work']
Synset: car.n.02
Examples: ['three cars had jumped the rails']
Synset: car.n.03
Examples: []
Synset: car.n.04
Examples: ['the car was on the top floor']
Synset: cable_car.n.01
Examples: ['they took a cable car to the top of the mountain']


In [39]:
w1 = wordnet.synset('horse.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))

0.3076923076923077


In [40]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print("horses :", lemmatizer.lemmatize("horses"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# optional pos argument which lets you constrain the part of speech of the word
# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos ="a"))

horses : horse
corpora : corpus
better : good


[More sample usage for wordnet](https://www.nltk.org/howto/wordnet.html) in the NLTK documentation.

## POS Tagging and Chunking

Chunk rules are defined in terms of regular expression patterns over "tag strings." A tag string is a string consisting of tags, delimited with angle-brackets. Part of speech tags are denoted with the "<" and ">" and we can also place regular expressions within the tags themselves, so account for things like "all nouns" (<N.*>).

In [64]:
from nltk.chunk import RegexpParser
from nltk import pos_tag
from nltk.corpus import brown
from nltk.tag import UnigramTagger

# Define a grammar for chunking, Note that tag strings do not contain any whitespace.
grammar = r"""
    NP: {<DT>?<JJ>*<NN>*} # determiner, adjective, and noun
"""

( . ) Any character except new line

( * ) Match 0 or more repetitions

( + ) Match 1 or more repetitions

( ? ) Match 0 or 1 repetitions

In [65]:
# Tokenize and part-of-speech tag the text
import nltk
text = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(text)

unigram_tagger = UnigramTagger(train=brown.tagged_sents())
tags = unigram_tagger.tag(tokens)
print(tags)

[('The', 'AT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'AT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]


In [66]:
# Create a RegexpParser object
chunker = RegexpParser(grammar)
# Use the RegexpParser object to chunk the text
chunks = chunker.parse(tags)

# Print the chunks
print(chunks)

(S
  The/AT
  (NP quick/JJ brown/JJ fox/NN)
  jumps/NNS
  over/IN
  the/AT
  (NP lazy/JJ dog/NN)
  ./.)


**Exercise:** Change the grammar to tag noun and adjectives and verb (present tense). Try using Unigram tagger and pos_tag. What is the difference?

Hint: You can use nltk.help.upenn_tagset() to get a list of all the symbols used and what they mean for different tags.

In [46]:
nltk.download("tagsets")
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


## Frequency

Get the top X most common words in a text/corpus. We will use Reuters corpus as an example.

In [70]:
# note you may have to download these corpuses

from nltk.corpus import reuters, stopwords
from nltk.probability import FreqDist

# Get all the words in the corpus
words = reuters.words()

# continue...

In [69]:
# punctuation removal for an entire corpus.
import string
from nltk.tokenize import word_tokenize

words = reuters.words()

reuters_corpus = [word for word in words if word.isalpha()]
print(reuters_corpus[0:50])
# try adding this functionality to the frequency function made above

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', 'S', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', 'S', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', 'reaching', 'economic', 'damage', 'businessmen', 'and', 'officials', 'said', 'They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U']
