# NLTK

This tutorial is based on https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63


NLTK (Natural Language Toolkit) is a Python library to work with human language data. It provides easy-to-use interfaces to many corpora and lexical resources.

Also, it contains a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Best of all, NLTK is a free, open source, community-driven project.

In this tutorial, we’ll cover the following basic NLP tasks:

- Sentence Tokenization
- Word Tokenization
- Text Lemmatization and Stemming
- Stop Words
- Regex





## Installing NLTK

First, we need to install NLTK. We also need to download the 'punkt' package, which NLTK uses to split text into sentences and words:

In [0]:
!pip install --user -U nltk
import nltk
nltk.download('punkt')

Requirement already up-to-date: nltk in /root/.local/lib/python3.6/site-packages (3.4.5)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Sentence tokenization

Sentence tokenization is also called sentence segmentation.

It is the problem of dividing a string of written language into its component sentences. In English and some other languages, a first approximation is to split the text whenever we see sentence-ending punctuation. However, even in English this problem is not trivial, because the full stop character is also used in abbreviations. NLTK does that job for us, so don’t worry too much about the details for now.
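
To see why abbreviations make the problem harder than it looks, here is a minimal sketch of a naive splitter that uses only Python's `re` module. The helper name `naive_sent_split` and the sample sentence are made up for this illustration, and this is not how NLTK's tokenizer works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Two real sentences, but three full stops belong to abbreviations
sample = "Dr. Smith arrived at 5 p.m. yesterday. He was late."
for piece in naive_sent_split(sample):
    print(repr(piece))
```

The rule also cuts after `Dr.` and `p.m.`, so the two sentences come back as four fragments. Handling such cases is what a trained sentence tokenizer is for.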



In [0]:
# Each chunk ends with a space so that sentences do not run together when concatenated
text = 'Billy always listens to his mother. He always does what she says. If his mother says, "Brush your teeth," Billy brushes his teeth. '
text += 'If his mother says, "Go to bed," Billy goes to bed. Billy is a very good boy. A good boy listens to his mother. His mother does not have to ask him again. She asks him to do something one time, and she does not ask again. '
text += 'Billy is a good boy. He does what his mother asks the first time. She does not have to ask again. She tells Billy, "You are my best child." Of course Billy is her best child. Billy is her only child.'

sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

Billy always listens to his mother.

He always does what she says.

If his mother says, "Brush your teeth," Billy brushes his teeth.

If his mother says, "Go to bed," Billy goes to bed.

Billy is a very good boy.

A good boy listens to his mother.

His mother does not have to ask him again.

She asks him to do something one time, and she does not ask again.

Billy is a good boy.

He does what his mother asks the first time.

She does not have to ask again.

She tells Billy, "You are my best child."

Of course Billy is her best child.

Billy is her only child.



## Sentence splitting for other languages

Sentence splitting is also possible for languages other than English. As the output of the following Spanish example shows, it works well here too.

In [0]:
tokenizer_es = nltk.data.load('tokenizers/punkt/spanish.pickle')
text='Esta es la asignatura TESI. Este es un curso de NLP. El segundo tema está dedicado a NLTK, Spacy y expresiones regulares.'
sentences=tokenizer_es.tokenize(text)
print(sentences)
for s in sentences:
  print(s)


['Esta es la asignatura TESI.', 'Este es un curso de NLP.', 'El segundo tema está dedicado a NLTK, Spacy y expresiones regulares.']
Esta es la asignatura TESI.
Este es un curso de NLP.
El segundo tema está dedicado a NLTK, Spacy y expresiones regulares.


## Tokenization

Tokenization is the task of splitting a text into words. The simplest approach is to split on whitespace:


In [0]:
text="Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing"
tokens = text.split()
print(tokens)

['Mr.', "O'Neill", 'thinks', 'that', 'the', "boys'", 'stories', 'about', "Chile's", 'capital', "aren't", 'amusing']


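NLTK provides a function to perform this task, `nltk.word_tokenize`. Note that its tokens differ from those obtained with the previous `split()` call: possessive markers and contractions such as `'s` and `n't` become separate tokens.

It relies on the 'punkt' resource that we downloaded earlier, so no further download is needed. Run the following cell: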

In [0]:
tokens=nltk.word_tokenize(text)
print(tokens)

['Mr.', "O'Neill", 'thinks', 'that', 'the', 'boys', "'", 'stories', 'about', 'Chile', "'s", 'capital', 'are', "n't", 'amusing']


NLTK includes several tokenizers. The following cell compares two of them:

In [0]:
text="Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing"

tokens=nltk.word_tokenize(text)
print('default tokenizer:', tokens)

from nltk.tokenize import WordPunctTokenizer
pu_tokenizer = WordPunctTokenizer()
tokens=pu_tokenizer.tokenize(text)
print('punctuation tokenizer:', tokens)


#from nltk.parse.corenlp import CoreNLPParser
#stanford = CoreNLPParser(url='http://localhost:9000')
#print(list(stanford.tokenize('This is a foo bar sentence.')))

default tokenizer: ['Mr.', "O'Neill", 'thinks', 'that', 'the', 'boys', "'", 'stories', 'about', 'Chile', "'s", 'capital', 'are', "n't", 'amusing']
punctuation tokenizer: ['Mr', '.', 'O', "'", 'Neill', 'thinks', 'that', 'the', 'boys', "'", 'stories', 'about', 'Chile', "'", 's', 'capital', 'aren', "'", 't', 'amusing']
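
The output of `WordPunctTokenizer` can be reproduced with a single regular expression. The sketch below uses the pattern `\w+|[^\w\s]+` (a run of word characters, or a run of characters that are neither word characters nor whitespace), which is the pattern that tokenizer is built on. The helper name `wordpunct_like` is hypothetical and only the standard library is used.

```python
import re

def wordpunct_like(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)

text = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing"
print(wordpunct_like(text))
```

Because every apostrophe and full stop becomes its own token, this style is predictable but breaks up names such as `O'Neill` and contractions such as `aren't`.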



## Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Examples:
- am, are, is => be
- dog, dogs, dog’s, dogs’ => dog


Thus, the sentence 'the boy’s dogs are different sizes' would be reduced to something like 'the boy dog be differ size'.

Stemming and lemmatization are special cases of normalization. However, they are different from each other.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot distinguish between words that have different meanings depending on their part of speech. Stemmers also have some advantages: they are easier to implement and usually run faster. In addition, the reduced “accuracy” may not matter for some applications.

Examples:
- The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

- The word “play” is the base form for the word “playing”, and hence this is matched in both stemming and lemmatization.

- The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context; e.g., “in our last meeting” or “We are meeting again tomorrow”. 

Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
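
The contrast in the examples above can be sketched with two toy functions. Both are assumptions made for illustration: `crude_stem` applies a handful of suffix rules (it is not the Porter algorithm), and `toy_lemmatize` looks words up in a tiny hand-written table keyed by word and part of speech (a real lemmatizer uses a full vocabulary).

```python
def crude_stem(word):
    """Strip the first matching suffix; no dictionary and no context (toy rules)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hand-written lemma table, an assumption for this sketch only
LEMMAS = {
    ("better", "adj"): "good",
    ("playing", "verb"): "play",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, part of speech); fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

examples = [("better", "adj"), ("playing", "verb"),
            ("meeting", "noun"), ("meeting", "verb")]
for word, pos in examples:
    print(word, pos, "| stem:", crude_stem(word), "| lemma:", toy_lemmatize(word, pos))
```

The stemmer returns `meet` for both uses of “meeting” and leaves “better” untouched, while the table-based lemmatizer can give a different answer for each part of speech.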

Now that we know the difference, let’s see some examples using NLTK.

## Lemmatization 

It is the morphological analysis of a word that returns its lemma. For example, 'walk', 'walked', 'walks', 'walking' share the same base form or lemma: 'walk'.

In many NLP applications (such as information retrieval or question answering), it is very common to use lemmas instead of tokens in order to handle the large morphological variation of words.

In [0]:
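from nltk.stem import WordNetLemmatizer
nltk.download('wordnet', quiet=True)  # the lemmatizer needs the WordNet corpus

sentence = 'The boys have listened to lots of sad stories.'
tokens = nltk.word_tokenize(sentence)
lemmatizer = WordNetLemmatizer()
print('Token:\t\tLemma:')
for t in tokens:
    # lemmatize() treats each word as a noun unless a part of speech is given
    print(t, '\t\t', lemmatizer.lemmatize(t))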


Token:		Lemma:
The 		 The
boys 		 boy
have 		 have
listened 		 listened
to 		 to
lots 		 lot
of 		 of
sad 		 sad
stories 		 story
. 		 .
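
Note that `listened` was left unchanged in the output above: `lemmatize` treats every word as a noun unless a part of speech is passed as its second argument. One common workaround is to translate Penn Treebank tags (the kind returned by `nltk.pos_tag`, covered in the PoS tagging section) into WordNet's single-letter codes. The helper below is a sketch; its name `penn_to_wordnet` is hypothetical, and the letters are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.NOUN`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and the fallback for every other tag

for tag in ["NNS", "VBN", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the second argument of `lemmatize`, for example `lemmatize(token, penn_to_wordnet(tag))`.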


## Stemming

It is the process of reducing inflected words to their root (stem). For example, 'fish' is the stem for the following words: 'fishing', 'fished', and 'fisher'.

Both stemming and lemmatization help to reduce the lexical variability of natural language. In many NLP applications (such as information retrieval or question answering), it is very common to use stems instead of tokens.

NLTK provides several stemmers. The following cell applies the Porter stemmer to the tokens from the previous cell:

In [0]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
for t in tokens:
    print(t,'\t\t',porter.stem(t))

The 		 the
boys 		 boy
have 		 have
listened 		 listen
to 		 to
lots 		 lot
of 		 of
sad 		 sad
stories 		 stori
. 		 .


In [0]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
word='seen'

print(word, " Stemmer:", stemmer.stem(word))
print(word, "Lemmatizer:", lemmatizer.lemmatize(word, wordnet.VERB))
print()
word='drove'

print(word, " Stemmer:", stemmer.stem(word))
print(word, "Lemmatizer:", lemmatizer.lemmatize(word, wordnet.VERB))


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
seen  Stemmer: seen
seen Lemmatizer: see

drove  Stemmer: drove
drove Lemmatizer: drive


## Stopwords 

Stopwords are very common words (such as “and”, “the”, and “a”) that carry little semantic meaning on their own. These words usually add a lot of noise when we apply machine learning algorithms, so they are often removed.

The list of stop words can change depending on your application.

NLTK has a predefined list of stopwords that contains the most common words.

In [0]:
nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words("english"))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'bo

This code shows how to remove stopwords from a sentence:

In [0]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
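
One detail to watch: the entries in NLTK's English list are lowercase, as the printed list shows, so the membership test above would keep a capitalised token such as `The`. Punctuation tokens such as `.` are kept too. Below is a minimal sketch of a case-insensitive filter that also drops punctuation. To stay self-contained it uses a tiny hand-written stop word set (`SMALL_STOP_WORDS`) and an already tokenized list (`sample_tokens`); both are assumptions for this example, not NLTK resources.

```python
# Tiny stand-in for stopwords.words("english"), for illustration only
SMALL_STOP_WORDS = {"the", "is", "of", "a", "and"}

def remove_stop_words(tokens, stop_words):
    """Drop stop words (ignoring case) and tokens without any letter or digit."""
    return [
        t for t in tokens
        if t.lower() not in stop_words and any(c.isalnum() for c in t)
    ]

sample_tokens = ["The", "oldest", "of", "the", "board", "games", "is", "Backgammon", "."]
print(remove_stop_words(sample_tokens, SMALL_STOP_WORDS))
```

Lowercasing only for the comparison keeps the original spelling of the remaining tokens, which matters if later steps rely on capitalisation.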


## PoS tagging 

PoS tagging classifies words into their lexical categories (part-of-speech tags). PoS tags provide very useful information for tasks such as Named Entity Recognition or relation extraction.
First, you must tokenize the text. Then, the function **pos_tag** assigns a PoS tag to each token.

In [0]:
nltk.download('averaged_perceptron_tagger')
text="At least four people were dead after a man began shooting at a synagogue in the Squirrel Hill neighbourhood of Pittsburgh on Saturday."
tokens = nltk.word_tokenize(text)
tags=nltk.pos_tag(tokens)
print(tags)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('At', 'IN'), ('least', 'JJS'), ('four', 'CD'), ('people', 'NNS'), ('were', 'VBD'), ('dead', 'JJ'), ('after', 'IN'), ('a', 'DT'), ('man', 'NN'), ('began', 'VBD'), ('shooting', 'VBG'), ('at', 'IN'), ('a', 'DT'), ('synagogue', 'NN'), ('in', 'IN'), ('the', 'DT'), ('Squirrel', 'NNP'), ('Hill', 'NNP'), ('neighbourhood', 'NN'), ('of', 'IN'), ('Pittsburgh', 'NNP'), ('on', 'IN'), ('Saturday', 'NNP'), ('.', '.')]
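
Since the tagger returns a plain list of `(token, tag)` tuples, ordinary Python is enough to work with the result. The sketch below groups tokens by tag and picks out the proper nouns (`NNP`). The list `tagged_sample` is copied from the tail of the output above so that the example runs without NLTK; the variable names are made up for this illustration.

```python
from collections import defaultdict

# Last part of the tagged sentence, copied from the output above
tagged_sample = [
    ('in', 'IN'), ('the', 'DT'), ('Squirrel', 'NNP'), ('Hill', 'NNP'),
    ('neighbourhood', 'NN'), ('of', 'IN'), ('Pittsburgh', 'NNP'),
    ('on', 'IN'), ('Saturday', 'NNP'), ('.', '.'),
]

# Group the tokens under their tag, keeping sentence order
tokens_by_tag = defaultdict(list)
for token, tag in tagged_sample:
    tokens_by_tag[tag].append(token)

proper_nouns = tokens_by_tag['NNP']
print(proper_nouns)
```

Proper nouns are a natural starting point for finding names of people and places, which is the subject of the next section.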


## Named Entity Recognition 

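NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). It takes PoS-tagged tokens as input, so here we reuse the `tags` computed in the previous cell. If the required resources are not installed yet, download 'maxent_ne_chunker' and 'words' with nltk.download() first.

The categories it can assign include PERSON, ORGANIZATION, and GPE (geo-political entity).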


In [0]:
print(nltk.ne_chunk(tags))

(S
  At/IN
  least/JJS
  four/CD
  people/NNS
  were/VBD
  dead/JJ
  after/IN
  a/DT
  man/NN
  began/VBD
  shooting/VBG
  at/IN
  a/DT
  synagogue/NN
  in/IN
  the/DT
  (ORGANIZATION Squirrel/NNP Hill/NNP)
  neighbourhood/NN
  of/IN
  (GPE Pittsburgh/NNP)
  on/IN
  Saturday/NNP
  ./.)
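
In the output above, each recognized entity appears as a labelled group such as `(GPE Pittsburgh/NNP)`. Note also that “Squirrel Hill” is labelled ORGANIZATION although it is a neighbourhood, a reminder that the classifier makes mistakes. The sketch below pulls the label and entity pairs out of the printed form with a regular expression. It is a standard-library illustration that works on an excerpt of the printed text, not on the tree object that `nltk.ne_chunk` returns, and the names `printed_tree` and `entities_from_printed_tree` are hypothetical.

```python
import re

# Excerpt of the printed tree, copied from the output above
printed_tree = """(S
  in/IN
  the/DT
  (ORGANIZATION Squirrel/NNP Hill/NNP)
  neighbourhood/NN
  of/IN
  (GPE Pittsburgh/NNP)
  on/IN
  Saturday/NNP
  ./.)"""

def entities_from_printed_tree(printed):
    """Return (label, entity text) pairs for every innermost bracketed group."""
    entities = []
    # A label, one space, then token/TAG items with no nested parentheses
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", printed):
        words = [item.rsplit("/", 1)[0] for item in body.split()]
        entities.append((label, " ".join(words)))
    return entities

print(entities_from_printed_tree(printed_tree))
```

The pattern only matches groups without nested parentheses, so the outer `(S ...)` sentence node is skipped.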
