<h1 style="text-align: center">Natural Language Toolkit (NLTK)</h1>
<p> 
    NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to <u>over 50 corpora and lexical resources</u> such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active <u>discussion forum</u>.
</p>
<br>
<h3>Installation</h3>
<ul>
    <li><code>pip install nltk</code></li>
    <li>or <code>conda install -c anaconda nltk</code></li>
</ul>
<br>
<h3>NLTK data installation</h3>
<p>
    NLTK comes with many corpora, toy grammars, trained models, etc. A complete list is posted on the <a href='https://www.nltk.org/nltk_data/'>NLTK data page</a>.
</p>
```python
import nltk
nltk.download()
```
<p>This launches the NLTK downloader GUI, from which you can download corpora.</p>
<img title="NLTK downloader" alt="NLTK downloader window" src="src/img/nltk downloader.PNG">

<h2>Start...</h2>

In [1]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import nltk

In [19]:
# nltk.download()  # uncomment to open the downloader GUI

In [11]:
text = """Monticello wasn't designated as UNESCO World Heritage Site until 1987 """

In [15]:
text # display text

"Monticello wasn't designated as UNESCO World Heritage Site until 1987 "

<b>better word tokenizers</b>

In [16]:
import regex
regex.split(r"[\s.,]", text)  # raw string avoids invalid-escape warnings

['Monticello',
 "wasn't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987',
 '']

In [17]:
nltk.word_tokenize(text)

['Monticello',
 'was',
 "n't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987']

Notice that "wasn't" is split into "was" and "n't", which can be useful in some cases, for example when the negation should be treated as its own token. Also, unlike the regex split, the result contains no empty string.
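
To make the difference concrete, here is a minimal standard-library sketch that mimics the two behaviours seen in the outputs above: it produces no empty strings, and it splits the "n't" clitic off contractions. This is not NLTK's implementation; the name `simple_tokenize` and the regular expression are assumptions made for illustration, and `nltk.word_tokenize` handles many more cases.

```python
import re

# Hypothetical helper for illustration; not part of NLTK.
# Alternatives are tried left to right at each position:
#   1. letters that are directly followed by "n't" (the lookahead leaves "n't" unconsumed)
#   2. the clitic "n't" itself
#   3. any other run of word characters
#   4. a single punctuation character
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?=n't)|n't|\w+|[^\w\s]")

def simple_tokenize(text):
    """Return tokens with no empty strings, splitting "n't" from contractions."""
    return TOKEN_PATTERN.findall(text)

text = "Monticello wasn't designated as UNESCO World Heritage Site until 1987 "
tokens = simple_tokenize(text)
print(tokens)
```

Because `findall` only returns what the pattern matches, the trailing space never turns into an empty token, which is the weakness of splitting on separators.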

<h2 style="text-align: center">Stemming</h2>
<p style="text-align: center">There are multiple stemmers in NLTK. Let's investigate them!</p>

<h3>Porter Stemming</h3>
<p>It applies a sequence of suffix-stripping rules to each word; you can check the rules <a href='https://www.nltk.org/api/nltk.stem.porter.html'>here</a>.</p>

In [31]:
from nltk.stem import PorterStemmer
p_stemmer = PorterStemmer()

# Read more: https://onlymyenglish.com/plural-noun-list/
plurals = [
    'Armies', 'Children', 'Essays', 'Baby', 'Bamboos', 'Benches', 'Birds', 'Boats', 'Bones', 'Boxes'
]

for word in plurals:
    print(f'{word} >>>> {p_stemmer.stem(word)}')

Armies >>>> armi
Children >>>> children
Essays >>>> essay
Baby >>>> babi
Bamboos >>>> bamboo
Benches >>>> bench
Birds >>>> bird
Boats >>>> boat
Bones >>>> bone
Boxes >>>> box
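
The truncated stem `armi` is not a bug; it follows from the stemmer's suffix-stripping rules. As a rough illustration, the sketch below applies only the plural rules from the first step of Porter's published algorithm (step 1a: `sses` to `ss`, `ies` to `i`, `ss` unchanged, `s` removed). The function name `porter_step_1a` is made up for this example, and NLTK's `PorterStemmer` applies further steps and its own extensions, so results for other words can differ.

```python
# Simplified sketch of Porter step 1a only; NLTK's PorterStemmer does much more.
# Rules are listed longest suffix first, and only the first match is applied.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # armies   -> armi
    ("ss", "ss"),    # caress   -> caress (unchanged)
    ("s", ""),       # birds    -> bird
]

def porter_step_1a(word):
    """Lowercase the word and apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["Armies", "Birds", "Boats", "Essays", "Children"]:
    print(f"{word} >>>> {porter_step_1a(word)}")
```

The rules operate purely on spelling, so an irregular plural such as `children` passes through untouched, as it does in the real output above.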


Note: The Porter stemmer is fine to use, but other stemming algorithms, such as the Snowball stemmer, generally give better results and are more convenient.

<h3>Snowball Stemming</h3>
<p>It supports multiple languages.</p>

In [32]:
from nltk.stem import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

In [33]:
sn_stemmer = SnowballStemmer('english')  # English Snowball stemmer, generally an improvement over Porter

In [34]:
sn_stemmer.stem('generously')

'generous'

Compare that with the Porter stemmer:

In [35]:
p_stemmer.stem('generously')

'gener'

Snowball keeps the readable stem 'generous', while Porter cuts the word down to 'gener'.
<br>
<h3>Lemmatization</h3>
<p>Lemmatization reduces a word to its dictionary form (its lemma).</p>

In [36]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for word in plurals:
    print(f'{word} >>>> {lemmatizer.lemmatize(word)}')

Armies >>>> Armies
Children >>>> Children
Essays >>>> Essays
Baby >>>> Baby
Bamboos >>>> Bamboos
Benches >>>> Benches
Birds >>>> Birds
Boats >>>> Boats
Bones >>>> Bones
Boxes >>>> Boxes
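
Every word above comes back unchanged because the inputs are capitalized: `WordNetLemmatizer` looks candidate forms up in WordNet, whose entries are lowercase, and returns the input as-is when nothing matches. Passing `word.lower()` instead lets the lookup succeed. The toy sketch below illustrates the idea with a tiny hand-written dictionary; `TOY_LEMMAS` and `toy_lemmatize` are invented for this example and are not how NLTK stores or queries WordNet.

```python
# Hypothetical miniature lexicon (lowercase keys), standing in for WordNet.
TOY_LEMMAS = {
    "armies": "army",
    "children": "child",
    "boxes": "box",
}

def toy_lemmatize(word):
    """Return the lemma if the exact form is known, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["Armies", "Children", "Boxes"]:
    # The capitalized form misses the lookup; the lowercased form hits it.
    as_is = toy_lemmatize(word)
    lowered = toy_lemmatize(word.lower())
    print(f"{word} >>>> {as_is} | {word.lower()} >>>> {lowered}")
```

Unlike a stemmer, a dictionary-based lemmatizer returns real words, which is why an irregular plural such as `children` can map to `child`.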


<h3>Summary!</h3>
<ul>
    <li>installed NLTK and its data</li>
    <li>used NLTK tokenizers</li>
    <li>used NLTK Stemmers</li>
    <li>used NLTK Lemmatizers</li>
</ul>

Sources
<br>
intro: https://www.youtube.com/watch?v=WYge0KZBhe0&ab_channel=ProgrammingKnowledge