**To adapt this notebook to your own needs** and to be able to edit it, please make your own copy via "*File*" -> "*Save a copy ...*".


---



Some of the **NLP techniques** mentioned [in Sect. 2.4 of the ISE 2021 lecture](https://www.slideshare.net/lysander07/02-ise2020-natural-language-processing-1-232058444) are already implemented in the Python library [NLTK](https://www.nltk.org/). Please find some basic NLP examples below.

# Tokenization
Tokenization is the process of separating character sequences into
smaller pieces, called tokens. In this process, certain characters,
such as punctuation, might be omitted (depending on the
tokenizer).
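
To illustrate how the choice of tokenizer decides whether punctuation survives, here is a minimal sketch that uses only Python's `re` module. The two regular expressions and the shortened example sentence are assumptions made for illustration; they are not the rules that NLTK's tokenizers apply.

```python
import re

# Shortened example sentence (an assumption for this sketch)
text = "He introduced the Fourier Analysis, named in honor after him."

# Tokenizer A: runs of word characters, plus each punctuation mark as its own token
with_punct = re.findall(r"\w+|[^\w\s]", text)

# Tokenizer B: runs of word characters only, so punctuation is dropped
without_punct = re.findall(r"\w+", text)

print(with_punct)
print(without_punct)
```

Both tokenizers agree on the words; they differ only in whether the comma and the final period are kept as tokens.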

In [None]:
#First we have to import nltk and download a few required packages
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('treebank')

First let's try **Sentence Splitting**:

In [2]:
text="On March 21, 1768, French mathematician and physicist Jean Baptiste Joseph du Fourier was born. He is probably best known for his work in thermodynamics, where he introduced the concept of the Fourier Analysis, named in honor after him."
# We import the two methods required for (1) word-based tokenization, and (2) sentence splitting
from nltk.tokenize import word_tokenize, sent_tokenize
sents=sent_tokenize(text)
print(sents)

['On March 21, 1768, French mathematician and physicist Jean Baptiste Joseph du Fourier was born.', 'He is probably best known for his work in thermodynamics, where he introduced the concept of the Fourier Analysis, named in honor after him.']


Now, let's try **Words**:

In [None]:
words=[word_tokenize(sent) for sent in sents]
print(words)

[['On', 'March', '21', ',', '1768', ',', 'French', 'mathematician', 'and', 'physicist', 'Jean', 'Baptiste', 'Joseph', 'du', 'Fourier', 'was', 'born', '.'], ['He', 'is', 'probably', 'best', 'known', 'for', 'his', 'work', 'in', 'thermodynamics', ',', 'where', 'he', 'introduced', 'the', 'concept', 'of', 'the', 'Fourier', 'Analysis', ',', 'named', 'in', 'honor', 'after', 'him', '.']]


# Part-of-Speech tagging
Part-of-speech tagging classifies words into their part-of-speech
and labels them according to a specified tagset. Most commonly
the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) is used.

In [None]:
# each word in the text will be assigned a POS tag
nltk.pos_tag(word_tokenize(text))

[('On', 'IN'),
 ('March', 'NNP'),
 ('21', 'CD'),
 (',', ','),
 ('1768', 'CD'),
 (',', ','),
 ('French', 'JJ'),
 ('mathematician', 'NN'),
 ('and', 'CC'),
 ('physicist', 'JJ'),
 ('Jean', 'NNP'),
 ('Baptiste', 'NNP'),
 ('Joseph', 'NNP'),
 ('du', 'NNP'),
 ('Fourier', 'NNP'),
 ('was', 'VBD'),
 ('born', 'VBN'),
 ('.', '.'),
 ('He', 'PRP'),
 ('is', 'VBZ'),
 ('probably', 'RB'),
 ('best', 'RBS'),
 ('known', 'VBN'),
 ('for', 'IN'),
 ('his', 'PRP$'),
 ('work', 'NN'),
 ('in', 'IN'),
 ('thermodynamics', 'NNS'),
 (',', ','),
 ('where', 'WRB'),
 ('he', 'PRP'),
 ('introduced', 'VBD'),
 ('the', 'DT'),
 ('concept', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Fourier', 'NNP'),
 ('Analysis', 'NNP'),
 (',', ','),
 ('named', 'VBN'),
 ('in', 'IN'),
 ('honor', 'NN'),
 ('after', 'IN'),
 ('him', 'PRP'),
 ('.', '.')]

In [None]:
# in case you don't know the meaning of some of the POS tags
nltk.help.upenn_tagset('NNP')
nltk.help.upenn_tagset('JJ')
nltk.help.upenn_tagset('PRP$')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
PRP$: pronoun, possessive
    her his mine my our ours their thy your
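
Once tokens carry tags, they can be selected by tag. The following sketch works on the `(token, tag)` pairs of the first sentence, copied by hand from the tagger output above, so it runs without NLTK. The variable names are chosen for illustration only.

```python
from collections import defaultdict

# (token, tag) pairs of the first sentence, copied from the tagger output above
tagged = [
    ('On', 'IN'), ('March', 'NNP'), ('21', 'CD'), (',', ','), ('1768', 'CD'),
    (',', ','), ('French', 'JJ'), ('mathematician', 'NN'), ('and', 'CC'),
    ('physicist', 'JJ'), ('Jean', 'NNP'), ('Baptiste', 'NNP'), ('Joseph', 'NNP'),
    ('du', 'NNP'), ('Fourier', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('.', '.'),
]

# Group tokens by their exact tag
by_tag = defaultdict(list)
for token, tag in tagged:
    by_tag[tag].append(token)

# All Penn Treebank noun tags (NN, NNS, NNP, NNPS) start with "NN"
nouns = [token for token, tag in tagged if tag.startswith('NN')]

print(by_tag['NNP'])
print(nouns)
```

Note that `physicist` is missing from the noun list because the tagger labeled it `JJ` in this sentence; tagger errors propagate to every step that relies on the tags.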


# Lemmatization
* Lemmatization groups words together that have different inflections so that they can be treated as the same item.
* It reduces a word to its base form (lemma) by looking it up in a lexicon.

*For Lemmatization, NLTK provides an interface to the [WordNet](https://wordnet.princeton.edu/) dictionary. WordNet is a large English lexical database. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.*



In [None]:
# we import the WordNet lemmatizer
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
# new text example
sentence = "On March 21, 1768, French mathematician and physicist Jean Baptiste Joseph du Fourier was born."

lemmatizer = WordNetLemmatizer()
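# lemmatize each token of the sentence, treating it as a verb (pos='v');
# tokens that WordNet does not know as verbs are returned unchanged
for token in word_tokenize(sentence):
  print(lemmatizer.lemmatize(token, pos='v'))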

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


On
March
21
,
1768
,
French
mathematician
and
physicist
Jean
Baptiste
Joseph
du
Fourier
be
bear
.
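
The cell above passes `pos='v'` for every token, which is why only the verbs change (`was` becomes `be`, `born` becomes `bear`). A common refinement is to derive the `pos` argument from each token's Penn Treebank tag. The helper below is a hypothetical sketch of such a mapping, not part of NLTK. It returns the single-letter WordNet part-of-speech codes (`'a'`, `'v'`, `'r'`, `'n'`); falling back to noun for all other tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter (hypothetical helper)."""
    if tag.startswith('J'):
        return 'a'  # adjective: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'  # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('RB'):
        return 'r'  # adverb: RB, RBR, RBS
    return 'n'      # everything else is treated as a noun (assumption)

# A few (token, tag) pairs taken from the POS tagging output earlier in the notebook
for token, tag in [('French', 'JJ'), ('mathematician', 'NN'), ('was', 'VBD'), ('born', 'VBN')]:
    print(token, tag, penn_to_wordnet(tag))
```

In the notebook, the result could then be passed on as `lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))` for each pair returned by `nltk.pos_tag`.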


# Stemming
* Stemming strips words of their suffixes and prefixes. For English, the [Porter Stemmer](http://snowball.tartarus.org/algorithms/porter/stemmer.html) is rather popular.

In [None]:
# we import the Porter Stemmer
from nltk.stem import PorterStemmer
sentence = "On March 21, 1768, French mathematician and physicist Jean Baptiste Joseph du Fourier was born."

ps = PorterStemmer()
# for each word of the sentence
for token in word_tokenize(sentence):
  print(ps.stem(token))

on
march
21
,
1768
,
french
mathematician
and
physicist
jean
baptist
joseph
du
fourier
wa
born
.
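
The output shows that a stem does not have to be a valid word (`wa`, `baptist`). To see where `wa` plausibly comes from, the sketch below implements only the first rule group (step 1a) of the original Porter algorithm. The function name is hypothetical, and the full stemmer has several more steps (NLTK's implementation also adds its own extensions), so results for other words may differ from `PorterStemmer`.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (plural suffixes); a partial sketch."""
    word = word.lower()  # the notebook output above is lowercased, so do the same here
    if word.endswith('sses'):
        return word[:-2]  # sses -> ss
    if word.endswith('ies'):
        return word[:-2]  # ies -> i
    if word.endswith('ss'):
        return word       # ss -> ss (unchanged)
    if word.endswith('s'):
        return word[:-1]  # s -> (removed)
    return word

for w in ['was', 'caresses', 'ponies', 'caress', 'born']:
    print(w, '->', porter_step_1a(w))
```

Because the rule for a trailing `s` applies purely by spelling, `was` is treated like a plural and loses its last letter. A lemmatizer, which consults a lexicon, maps it to `be` instead.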


# Named Entity Recognition (NER)
NER locates atomic elements in a text and classifies them into predefined categories such as **names, persons, organizations, locations, expressions of time, quantities, monetary values**, etc.

*For casual use, NLTK provides us with a method called `ne_chunk` to perform NER on a given text. In order to use `ne_chunk`, the text needs to first be tokenized into words and then POS tagged. After NER, the tagged words depict their respective entity type.*

In [None]:
# For NER, we need tokenization, POS tagging and Named Entity Chunking
from nltk import word_tokenize, pos_tag, ne_chunk
# New text example
sentence = "On March 21, 1768, French mathematician and physicist Jean Baptiste Joseph du Fourier was born."
print(ne_chunk(pos_tag(word_tokenize(sentence))))

(S
  On/IN
  March/NNP
  21/CD
  ,/,
  1768/CD
  ,/,
  (GPE French/JJ)
  mathematician/NN
  and/CC
  physicist/JJ
  (PERSON Jean/NNP Baptiste/NNP Joseph/NNP)
  du/NNP
  Fourier/NNP
  was/VBD
  born/VBN
  ./.)


Now let's try an **alternative NLP library: [spaCy](https://spacy.io/)**.

In [None]:
!python -m spacy download en_core_web_sm

2022-09-15 08:45:51.749977: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

We first start with **Named Entity Recognition**:

In [None]:
doc = nlp(u'On March 21, 1768, French mathematician and physicist Jean Baptiste Joseph du Fourier was born.')

 
# Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
    
# displaCy
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

March 21, 1768 DATE
French NORP
Jean Baptiste PERSON
Joseph du Fourier PERSON


# Dependency Parsing
Dependency parsing describes a sentence through direct binary grammatical relations between words, each linking a head to one of its dependents. These relations approximate the semantic relations between a predicate and its arguments.


In [None]:
# Dependency Parsing

doc = nlp(u'On March 21, 1768, French mathematician and physicist Jean Baptiste Joseph du Fourier was born.')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

On/IN <--prep-- born/VBN
March/NNP <--pobj-- On/IN
21/CD <--nummod-- March/NNP
,/, <--punct-- March/NNP
1768/CD <--nummod-- March/NNP
,/, <--punct-- born/VBN
French/JJ <--amod-- mathematician/NN
mathematician/NN <--nsubjpass-- born/VBN
and/CC <--cc-- mathematician/NN
physicist/NN <--conj-- mathematician/NN
Jean/NNP <--compound-- Baptiste/NNP
Baptiste/NNP <--compound-- Fourier/NNP
Joseph/NNP <--compound-- Fourier/NNP
du/NNP <--compound-- Fourier/NNP
Fourier/NNP <--conj-- mathematician/NN
was/VBD <--auxpass-- born/VBN
born/VBN <--ROOT-- born/VBN
./. <--punct-- born/VBN
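
Since every token points to exactly one head, the arcs printed above form a tree rooted at `born`. The sketch below rebuilds that tree from the `(token, relation, head)` triples, copied by hand from the output, so it runs without spaCy. Punctuation arcs are left out, and tokens are identified by their text. That is a simplification that only works here because the remaining tokens are unique; the names `arcs`, `children` and `subtree` are chosen for illustration.

```python
from collections import defaultdict

# (token, relation, head) triples copied from the output above, punctuation omitted
arcs = [
    ('On', 'prep', 'born'), ('March', 'pobj', 'On'), ('21', 'nummod', 'March'),
    ('1768', 'nummod', 'March'), ('French', 'amod', 'mathematician'),
    ('mathematician', 'nsubjpass', 'born'), ('and', 'cc', 'mathematician'),
    ('physicist', 'conj', 'mathematician'), ('Jean', 'compound', 'Baptiste'),
    ('Baptiste', 'compound', 'Fourier'), ('Joseph', 'compound', 'Fourier'),
    ('du', 'compound', 'Fourier'), ('Fourier', 'conj', 'mathematician'),
    ('was', 'auxpass', 'born'), ('born', 'ROOT', 'born'),
]

# Invert the head links: map each head to its direct dependents
children = defaultdict(list)
root = None
for token, relation, head in arcs:
    if relation == 'ROOT':
        root = token  # the root is listed as its own head
    else:
        children[head].append((relation, token))

def subtree(token):
    """Return the token followed by all of its direct and indirect dependents."""
    result = [token]
    for _, child in children.get(token, []):
        result.extend(subtree(child))
    return result

print(root)
print(subtree('mathematician'))
```

The subtree of `mathematician` covers the whole passive subject of the sentence. In spaCy itself, the same information is available on each token via `token.children` and `token.subtree`.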


In [None]:
# Visualizing Dependency Parsing

from spacy import displacy
 
doc = nlp(u'On March 21, 1768, French mathematician and physicist Jean Baptiste Joseph du Fourier was born.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})