# NLP Text Cleaning and Preprocessing

We are going to see several steps/methods used **to clean and preprocess *text data***. <br/>
**NOT all** of these steps are _always necessary_, and they are ***not always*** performed in the order presented here.

Initially, we are going to look at some popular **NLP packages** and their differences.

## 1. NLP Packages
- NLTK: https://www.nltk.org/
- spaCy: https://spacy.io/
- BeautifulSoup4: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
!pip install nltk
!pip install spacy
!pip install beautifulsoup4

### Natural Language Toolkit (NLTK)
- Popular NLP open source package
- First released in 2001
- Provides many functionalities to treat and process _text data_

### SpaCy
- Open Source Natural Language Processing Library
- Designed to _efficiently_ execute popular algorithms and tackle NLP tasks with maximum _effectiveness_.
- SpaCy typically offers _only one_ implemented approach for various NLP tasks, selecting **the most efficient algorithm** that is currently available.
- Thus, we cannot choose other algorithms for a given NLP task.

### NLTK vs SpaCy
- In general, SpaCy is considerably **faster and more efficient** than NLTK.
- In turn, NLTK provides **pre-trained models** for some applications, such as _sentiment analysis_.

### 1.1 Dive into SpaCy


#### Loading SpaCy for a given Language
 - **English:** https://spacy.io/models/en

In [None]:
!python -m spacy download en_core_web_sm

- **Portuguese:** https://spacy.io/models/pt

In [None]:
!python -m spacy download pt_core_news_sm

#### SpaCy (Preprocessing) Pipeline
SpaCy works with a _Pipeline object_.

<img src='./imgs/spacy_pipeline.png' width=600 />

From [official documentation](https://spacy.io/usage/spacy-101#pipelines):
- When you call `nlp` on a _text_, spaCy first **tokenizes** the _text_ to produce a `Doc object`.
- The `Doc` is then processed in several different steps – this is also referred to as the **processing pipeline**.
- The pipeline used by the _trained pipelines_ typically includes a `tagger`, a `lemmatizer`, a `parser` and an `entity recognizer`.
- Each pipeline component returns the **processed `Doc`**, which is then passed on to the next component.

<img src='./imgs/spacy_pipeline_table.png' width=800 />

#### Example

In [None]:
text = "The sun is a massive, luminous ball of gas that is the center of our solar system. It is estimated to be around 4.6 billion years old and has a diameter of about 1.39 million kilometers. The sun's energy is produced through a process called nuclear fusion, where hydrogen atoms combine to form helium and release a tremendous amount of energy in the process."
text

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
# full text
doc = nlp(text)
doc.text

In [None]:
# language of the document
doc.lang_

In [None]:
# for each token/word


In [None]:
# for each token/word


## 🧹 2. Text Cleaning
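- _Cleaning_ means **removing** unwanted content from the text (e.g., line breaks, HTML tags, links, extra whitespace, emojis).
- We can perform some preprocessing before it (e.g., lowercasing the text).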

### 2.1 Remove newlines `\n` and Tabs `\t`
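A minimal sketch of this step, assuming we simply want every run of newlines and tabs replaced by a single space. The helper name `remove_newlines_tabs` is hypothetical and is not defined elsewhere in this notebook.

```python
import re


def remove_newlines_tabs(text: str) -> str:
    """Replace each run of newlines/tabs with one space and trim both ends."""
    # \r is included so Windows-style line endings are covered as well
    cleaned = re.sub(r"[\n\r\t]+", " ", text)
    return cleaned.strip()


sample = "Cats are fascinating.\n\n\tThey are also social.\n"
print(remove_newlines_tabs(sample))  # Cats are fascinating. They are also social.
```

Spaces that already surround a line break are kept, so the result may still contain double spaces; section 2.4 deals with those.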

In [None]:
cat_text_newlines_tabs = \
"""Cats are fascinating creatures that have captured the hearts of humans for centuries. They are known for their agility, grace, and independence, and their ability to make us laugh and feel comforted. \n

\tDespite their reputation for being aloof, cats are actually quite social animals and thrive on attention and affection from their owners. They are also highly intelligent and can be trained to do a variety of tricks and behaviors, including using a litter box and walking on a leash. \n

\tOne of the most endearing things about cats is their tendency to curl up in cozy spots, whether it's a sunny windowsill, a soft bed, or a warm lap. They are also famous for their love of napping, and can often be found snoozing for hours on end. \n

\tWhile cats may sometimes get a bad rap for being aloof or indifferent, those who have experienced the joy of sharing their lives with a feline friend know that there's nothing quite like the bond between a cat and its human.
"""

In [None]:
print(cat_text_newlines_tabs)

In [None]:
cat_text_newlines_tabs

### 2.2. Strip HTML Tags
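The cells below call `strip_html_tags`, which is not defined in this notebook. One possible definition is sketched here using only the standard library's `html.parser`, so it runs without extra packages. With BeautifulSoup4 installed, `BeautifulSoup(html_text, "html.parser").get_text()` is a common alternative.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html_tags(html_text: str) -> str:
    """Return `html_text` with all tags removed (sketch, not a sanitizer)."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks)


print(strip_html_tags('<p><b>Harry</b> is a <a href="/wiki/x">wizard</a>.</p>'))
```

Note that this sketch also keeps the contents of `<script>` and `<style>` elements, which a production cleaner would usually drop.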

In [None]:
hp_html_text = '''<p><i><b>Harry Potter and the Philosopher's Stone</b></i> is a 1997 <a href="/wiki/Fantasy_novel" class="mw-redirect" title="Fantasy novel">fantasy novel</a> written by British author <a href="/wiki/J._K._Rowling" title="J. K. Rowling">J. K. Rowling</a>. The first novel in the <i><a href="/wiki/Harry_Potter" title="Harry Potter">Harry Potter</a></i> series and Rowling's <a href="/wiki/Debut_novel" title="Debut novel">debut novel</a>, it follows <a href="/wiki/Harry_Potter_(character)" title="Harry Potter (character)">Harry Potter</a>, a young <a href="/wiki/Wizard_(fantasy)" class="mw-redirect" title="Wizard (fantasy)">wizard</a> who discovers his magical heritage on his eleventh birthday, when he receives a letter of acceptance to <a href="/wiki/Hogwarts_School_of_Witchcraft_and_Wizardry" class="mw-redirect" title="Hogwarts School of Witchcraft and Wizardry">Hogwarts School of Witchcraft and Wizardry</a>. Harry makes close friends and a few enemies during his first year at the school and with the help of his friends, <a href="/wiki/Ron_Weasley" title="Ron Weasley">Ron Weasley</a> and <a href="/wiki/Hermione_Granger" title="Hermione Granger">Hermione Granger</a>, he faces an attempted comeback by the dark wizard <a href="/wiki/Lord_Voldemort" title="Lord Voldemort">Lord Voldemort</a>, who killed Harry's parents, but failed to kill Harry when he was just 15 months old.
</p>'''

In [None]:
print(hp_html_text)

In [None]:
strip_html_tags(hp_html_text)

### 2.3. Remove Links
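The cells below call `remove_links`, which is not defined in this notebook. The sketch below is one regex-based possibility. It assumes a link is either prefixed by `http://`, `https://` or `www.`, or is a bare domain ending in one of a few common top-level domains; that TLD list is illustrative, not exhaustive.

```python
import re

# Assumption: prefixed URLs, or bare domains with a handful of common TLDs
URL_PATTERN = re.compile(
    r"(?:https?://|www\.)\S+"
    r"|\b[\w-]+(?:\.[\w-]+)*\.(?:com|org|net|edu)\b(?:/\S*)?",
    flags=re.IGNORECASE,
)


def remove_links(text: str) -> str:
    """Delete URLs from `text`, keeping punctuation that trails a URL."""

    def _keep_trailing_punctuation(match):
        url = match.group(0)
        core = url.rstrip(".,;:!?)")
        # return only the punctuation that was glued to the end of the URL
        return url[len(core):]

    return URL_PATTERN.sub(_keep_trailing_punctuation, text)


print(remove_links("Visit http://www.catster.com/ or petfinder.com/cats/, then relax."))
```

Removing a link leaves the surrounding spaces in place, so this step is usually followed by whitespace cleanup.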

In [None]:
cat_text_with_links = \
'''Cats are amazing animals that have been domesticated for thousands of years. They are beloved by many people for their adorable appearance, unique personalities, and entertaining antics. Cats are also known for their ability to form strong bonds with their human companions.
If you're a cat lover looking for more information about these fascinating creatures, there are many resources available online. You can visit websites like http://www.catster.com/ to learn more about cat breeds, behavior, and health. For more in-depth information, check out https://www.vet.cornell.edu/departments-centers-and-institutes/cornell-feline-health-center, which offers comprehensive resources for cat owners and veterinarians.
If you're interested in adopting a cat or finding a local shelter, petfinder.com/cats/ is a great resource. This website allows you to search for cats available for adoption in your area and provides helpful information about the adoption process.
Whether you're a seasoned cat owner or simply a cat enthusiast, these websites can provide you with valuable information and resources to help you better understand and care for your feline friends.'''

In [None]:
print(cat_text_with_links)

In [None]:
cat_text_with_links

In [None]:
remove_links(cat_text_with_links)

### 2.4. Remove extra whitespaces
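The cells below call `remove_extra_whitespaces`, which is not defined in this notebook. A minimal sketch, assuming every run of whitespace should collapse into a single space:

```python
import re


def remove_extra_whitespaces(text: str) -> str:
    """Collapse each run of whitespace into one space and trim both ends."""
    # \s also matches newlines and tabs, so those are collapsed too
    return re.sub(r"\s+", " ", text).strip()


print(remove_extra_whitespaces("Venice           is a city    unlike any other.   "))
```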

In [None]:
venice_text = \
'''Venice           is a city unlike any other, built on a network of     canals and filled with historic architecture and     cultural treasures.    Its unique layout and beautiful surroundings      have made it a      top tourist destination, attracting millions of visitors    each year.        The city is home to many iconic     landmarks, such as the Rialto Bridge, St. Mark's Basilica,     and the Doge's     Palace, which offer a glimpse into Venice's     rich history and artistic legacy.'''

In [None]:
venice_text

In [None]:
remove_extra_whitespaces(venice_text)

### 2.5. Unicode Normalization
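The cells below call `unicode_normalization`, which is not defined in this notebook. The sketch below is one common interpretation, assuming the goal is ASCII-only text: decompose characters with NFKD, then drop everything that has no ASCII equivalent. This is lossy, since emojis and non-Latin scripts disappear entirely.

```python
import unicodedata


def unicode_normalization(text: str) -> str:
    """Decompose characters (NFKD) and keep only the ASCII part."""
    # NFKD splits "ã" into "a" plus a combining tilde; the tilde is then dropped
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(unicode_normalization("São Paulo 🌇"))
```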

In [None]:
salvador_text = "Salvador is a city full of life, culture, and history 🌇🎭🏰. From the beautiful beaches 🌊🏖️ to the vibrant music and dance scene 💃🕺🎶, there's never a dull moment in this exciting Brazilian city 🇧🇷."

In [None]:
salvador_text

In [None]:
unicode_normalization(salvador_text)

### 2.6. Removing Emojis

In [None]:
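import re

def remove_emojis(text):
    # Unicode ranges covered by this simple filter (not exhaustive)
    regex_emoticons = "\U0001F600-\U0001F64F"
    regex_symbols_pictographs = "\U0001F300-\U0001F5FF"
    regex_transport_map_symbols = "\U0001F680-\U0001F6FF"
    regex_flags_ios = "\U0001F1E0-\U0001F1FF"
    variation_selector = "\uFE0F"  # invisible modifier that may follow an emoji

    # optional leading whitespace, then one or more emoji characters
    emoji_chars = f"{regex_emoticons}{regex_symbols_pictographs}{regex_transport_map_symbols}{regex_flags_ios}{variation_selector}"
    regex = rf"\s*[{emoji_chars}]+"

    return re.sub(regex, '', text)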

In [None]:
remove_emojis(salvador_text)

## 🛠 3. Preprocessing

<img src='./imgs/preprocessing_pipeline.png' width=600 />

**Source:** Vajjala et al. (2020), 'Practical Natural Language Processing'

### 3.1. Lowercasing
To prevent the _same word_ written with _different CASES_ from being interpreted _differently_ in later NLP steps, we perform **lowercasing** to _standardize_ the text with _lowercase letters_.
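
As a small illustration, Python's built-in `str.lower()` is enough for this step. The sample sentence is the one used in this section.

```python
ice_cream_text = 'My favorite ice cream flavor is Chocolate. Ow, I love CHOCOLATE ICE CreAM.'

# after lowercasing, "Chocolate" and "CHOCOLATE" become the same string
lowered = ice_cream_text.lower()
print(lowered)
```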

In [None]:
ice_cream_text = 'My favorite ice cream flavor is Chocolate. Ow, I love CHOCOLATE ICE CreAM.'
ice_cream_text

### 3.2. Sentence segmentation
- We can do **sentence segmentation** by _breaking up_ text into **sentences** at the appearance of _full stops_ and _question marks_.
- However, there may be abbreviations, forms of addresses (Dr., Mr., etc.), or ellipses (...) that may break the simple rule.
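
The simple rule and its weakness can be sketched with a regular expression. The function name `naive_sentence_split` is hypothetical; it only exists to show why dedicated sentence tokenizers are needed.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '?' or '!' whenever whitespace follows."""
    return [part for part in re.split(r"(?<=[.?!])\s+", text) if part]


sentences = naive_sentence_split("Dr. Smith arrived at 9 a.m. sharp. Was he late? No.")
for i, sentence in enumerate(sentences):
    print(f'{i:02d} - {sentence}')
```

The abbreviations "Dr." and "a.m." are wrongly treated as sentence ends, so the three sentences come out as five fragments.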

In [None]:
dl_text = \
'''Deep learning is a subfield of machine learning that uses artificial neural networks to simulate the way the human brain works. It involves training these neural networks on large datasets in order to identify complex patterns and relationships, which can then be used to make predictions or perform other tasks.

One of the key advantages of deep learning is its ability to learn from unstructured data, such as images, audio, and text. This has made it a powerful tool for a wide range of applications, from image and speech recognition to natural language processing and autonomous vehicles.

However, deep learning can also be computationally intensive and requires large amounts of data to train the models effectively. Nonetheless, the potential benefits of deep learning are enormous, and it is rapidly becoming an essential tool in fields such as healthcare, finance, and manufacturing. As researchers continue to improve the algorithms and techniques used in deep learning, we can expect to see even more exciting advances in the years to come.'''

In [None]:
print(dl_text)

#### NLTK

**Punkt Sentence Tokenizer** - `nltk.tokenize.punkt` module
- This **tokenizer** divides _a text_ into **a list of *sentences*** by using an _unsupervised algorithm_ to build a model for abbreviation words, collocations, and words that start sentences.
- It **must** be _trained on a large collection of plaintext_ in the **target language** before it can be used.
- The NLTK data package includes a pre-trained Punkt tokenizer for English.

https://www.nltk.org/api/nltk.tokenize.punkt.html#module-nltk.tokenize.punkt

In [None]:
# downloading the pre-trained tokenizers
# downloaded to: /home/your_user/nltk_data/tokenizers/punkt
# note: recent NLTK releases (3.9+) may ask for the 'punkt_tab' resource instead
import nltk
nltk.download('punkt')

In [None]:
# show the (downloaded) available tokenizers

In [None]:
ls ~/nltk_data/tokenizers/punkt

In [None]:
# example 1 in English
from nltk.tokenize import sent_tokenize
dl_sentences = sent_tokenize(dl_text)

In [None]:
len(dl_sentences)

In [None]:
for i, sentence in enumerate(dl_sentences):
    print(f'{i:02d} - {sentence}')

In [None]:
# example 2 (with HTML) in English
from nltk.tokenize import sent_tokenize
hp_sentences = sent_tokenize(hp_html_text)

In [None]:
for i, sentence in enumerate(hp_sentences):
    print(f'{i:02d} - {sentence}')

The tokenizer **does not** handle HTML tags. See the last segmented sentence, for example.

#### SpaCy

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(dl_text)
dl_sentences_spacy = list(doc.sents)

In [None]:
for i, sentence in enumerate(dl_sentences_spacy):
    print(f'{i:02d} - {sentence}')

`SpaCy` keeps the `\n`.

**`SpaCy` does not** handle HTML tags either. See the last segmented sentence, for example.

#### **Portuguese**

##### **NLTK**

In [None]:
suco_text = \
'''Suco de laranja é uma bebida popular em todo o mundo devido ao seu sabor doce e refrescante. Rico em vitamina C e outros nutrientes, o suco de laranja é conhecido por seus muitos benefícios à saúde, incluindo o fortalecimento do sistema imunológico, a prevenção de doenças cardíacas e a melhoria da digestão.

Muitas pessoas preferem o sabor do suco de laranja fresco, espremido na hora, pois é mais natural e contém mais nutrientes do que o suco de laranja concentrado. Algumas pessoas também gostam de adicionar outras frutas ou ingredientes ao suco de laranja para dar um sabor extra, como limão, abacaxi ou hortelã.

Independentemente de como é preparado, o suco de laranja é uma bebida refrescante e nutritiva que pode ser apreciada em qualquer época do ano. Seja no café da manhã, no almoço ou em qualquer outro momento, um copo de suco de laranja pode ser uma ótima maneira de obter uma dose de vitaminas e se refrescar ao mesmo tempo.'''

In [None]:
print(suco_text)

In [None]:
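from nltk.tokenize import sent_tokenize
suco_sentences = sent_tokenize(suco_text, language='portuguese')
for i, sentence in enumerate(suco_sentences):
    print(f'{i:02d} - {sentence}')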

##### **SpaCy**

In [None]:
import spacy
nlp = spacy.load('pt_core_news_sm')
suco_sentences_spacy = list(nlp(suco_text).sents)

In [None]:
for i, sentence in enumerate(suco_sentences_spacy):
    print(f'{i:02d} - {sentence}')

### 3.3. Word tokenization
#### NLTK
- To **tokenize** a _sentence_ into **words**, we can start with a simple rule that splits text on _whitespace_ and **punctuation marks**.
- The NLTK library allows us to do that.
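
Before turning to NLTK, the simple rule itself can be sketched in one regular expression. The function name `simple_word_tokenize` is hypothetical and is not part of NLTK.

```python
import re


def simple_word_tokenize(sentence):
    """Runs of word characters become tokens; each punctuation mark is its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)


print(simple_word_tokenize("Deep learning isn't magic, is it?"))
# ['Deep', 'learning', 'isn', "'", 't', 'magic', ',', 'is', 'it', '?']
```

The contraction "isn't" is broken into three pieces, which is one reason library tokenizers add special rules on top of this idea.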

In [None]:
dl_sentences

In [None]:
from nltk.tokenize import word_tokenize

for i, sentence in enumerate(dl_sentences):
    print(i, sentence)
    print(word_tokenize(sentence))
    print()

<br/>

We can also perform **word tokenization** in the _entire text_ directly:

In [None]:
word_tokenize(dl_text)

#### SpaCy

In [None]:
dl_text

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
doc = nlp(dl_text)

In [None]:
type(doc)

#### Word Tokenizers may be imprecise

**Example 1:**

In [None]:
sentence_1 = "Mr. Jack O’Neil works at Melitas Marg, located at 245 Yonge Avenue, Austin, 70272."
sentence_1

In [None]:
# NLTK
print(word_tokenize(sentence_1))

Note that, using `NLTK`, `O`, `’`, and `Neil` are identified as three separate tokens, which is wrong.

In [None]:
# spaCy
doc = nlp(sentence_1)

tokens = [token for token in doc]

print(tokens)

Now, using `SpaCy`, O'Neil was considered a single token.

**Example 2:**

In [None]:
sentence_2 = "There are $10,000 and €1000 which are there just for testing a tokenizer"
sentence_2

In [None]:
# NLTK
print(word_tokenize(sentence_2))

While `$` and `10,000` are identified as _separate tokens_, `€1000` is identified as a _single token_ in `NLTK`.

In [None]:
# spaCy
doc = nlp(sentence_2)

tokens = [token for token in doc]

print(tokens)

Both currency signs were identified as _separate tokens_ in `spaCy`.

##### **Casual Word Tokenizers**
In some cases, such as Tweets or posts on other social networks, the _standard word tokenization_ may be _even more imprecise_ due to **casual language** and **custom symbols** (e.g., @, #, ...). <br/>
Some NLP packages, such as NLTK, also provide a specific _word tokenizer_ for such cases.

For example, NLTK `TweetTokenizer`: https://www.nltk.org/api/nltk.tokenize.casual.html
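
A short usage sketch of `TweetTokenizer` with its default settings. The tweet is made up for illustration; this tokenizer is rule-based, so no extra NLTK data download should be needed.

```python
from nltk.tokenize import TweetTokenizer

tweet = "@nlp_fan I looove #NLP :) check it out!!!"

tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(tweet)

# handles (@...), hashtags (#...) and emoticons are kept as single tokens
print(tokens)
```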

##### **Tokenization is language-dependent**

**Tokenization** is also **_heavily_ dependent on language**.

For example, **'N.Y.!'** has a total of _three punctuation marks_, but in English, **N.Y.** stands for New York, hence **'N.Y.'** should be treated as a **single word** and not be tokenized further.

Such _language-specific exceptions_ can be specified in some NLP packages. NLTK handles this specific case.

### 3.4. Correcting mis-spelled words
- Be careful, because this preprocessing task might change the true meaning of the word.

**Required package - Text Blob**
https://textblob.readthedocs.io/en/dev/

`textblob` **only corrects spelling in English**. There are alternative packages, such as [`pyspellchecker`](https://pyspellchecker.readthedocs.io/en/latest/), that support Portuguese.

In [None]:
!pip install textblob

#### **Correct spelling of a *word***
It corrects _simple spelling mistakes_, such as a repeated character or a fat-finger typo.
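
To give an intuition for how such corrections can work, here is a toy sketch in the style of Peter Norvig's well-known spelling corrector, the approach that TextBlob's documentation cites as the basis of its own corrector. It is not TextBlob's implementation: the word counts are invented, and only words within one edit are considered.

```python
import string
from collections import Counter

# Toy frequency table (invented); a real corrector counts words in a large corpus
WORD_COUNTS = Counter({"apple": 50, "apply": 30, "ample": 5, "sentence": 20, "check": 25})


def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word):
    """Return the most frequent known word within one edit, else `word` unchanged."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word


print(correct_word("appple"))  # repeated character -> apple
print(correct_word("chexk"))   # neighbouring key hit by mistake -> check
```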

In [None]:
word_to_be_corrected = 'appple'

#### **Correct spelling of a *sentence***
**`TextBlob()`** is a simple text block representation from the `textblob` library which has many useful methods, especially for **correcting the spelling**.

In [None]:
from textblob import TextBlob
sentence_to_be_corrected = 'A sentencee to checkk!'
sentence = TextBlob(sentence_to_be_corrected)

In [None]:
type(sentence)

In [None]:
# spelling correction
from textblob import TextBlob
corrected_sentence = TextBlob(sentence_to_be_corrected).correct()

In [None]:
type(corrected_sentence)

In [None]:
# converting to string
str(corrected_sentence)

In [None]:
# wrapping function to perform spelling correction
from textblob import TextBlob
def spelling_correction(text):
    return str(TextBlob(text).correct())

In [None]:
text = "My favorit team is Barcelonah, they got sum amazin playas who can do all sortz of trickz with the balll."
text

In [None]:
spelling_correction(text)

<br/>

Note that the **spelling corrector _is not perfect_** but it helps.

### 3.5. Expand Contractions
- To remove **stop words** in the next step, it is crucial that we deal with ***contractions*** first.
- _Contractions_ are shorthand forms for words like _do not_ (***don’t***), _would not_ (***wouldn’t***), _it is_ (***it's***).
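
The cells below call `expand_contractions`, which is not defined in this notebook. The sketch below is one possible regex-based definition. The `contraction_map` and `lower` parameters and the small `SAMPLE_CONTRACTIONS` dictionary are assumptions for illustration; the full `CONTRACTION_MAP` from this section can be passed instead, as long as its keys are lowercase.

```python
import re

# Small illustrative subset of a contraction dictionary (lowercase keys)
SAMPLE_CONTRACTIONS = {
    "can't": "cannot",
    "can't've": "cannot have",
    "he'd": "he would",
    "i'm": "i am",
    "wouldn't": "would not",
}


def expand_contractions(text, contraction_map=SAMPLE_CONTRACTIONS, lower=False):
    """Replace every contraction found in `contraction_map` by its expansion."""
    # treat the typographic apostrophe as a plain one so both spellings match
    text = text.replace("’", "'")
    if lower:
        text = text.lower()
    # longest keys first, so "can't've" is tried before "can't"
    keys = sorted(contraction_map, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(key) for key in keys) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    # keys are lowercase, so the matched text is lowercased before the lookup
    return pattern.sub(lambda match: contraction_map[match.group(0).lower()], text)


print(expand_contractions("I'm sure he'd go, but they can't've known.", lower=True))
# i am sure he would go, but they cannot have known.
```

Because the expansions are stored in lowercase, a capitalized contraction such as "I'm" comes back as "i am" even when `lower=False`.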

In [None]:
CONTRACTION_MAP = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have",
}

In [None]:
text = "ain't, aren't, can't, cause, can't've"
text

In [None]:
expand_contractions(text)

In [None]:
text = "I'm gonna tell you a little story about a guy who loved to travel. He'd go to all sorts of places, from big cities to small towns, and he'd always find something interesting to see or do. Sometimes he'd go with friends, but most of the time he'd go alone. That didn't bother him though, because he liked the freedom of being able to go wherever he wanted without having to worry about anyone else's schedule. And let me tell you, he saw some amazing things. From the top of mountains to the depths of the ocean, he saw it all. And he wouldn't have traded those experiences for anything."
text

In [None]:
expand_contractions(text, lower=True)

### 3.6. Remove _Stop Words_
- Some of the **frequently used words** _are not particularly **useful**_ for some NLP tasks, for example, _tokenization_, _text classification_, _text summarization_, or any similar task.
- In English, some examples of those words are _a, an, the, of, in, ..._
- These words, called **stop words**, don’t carry any content on their own.
- **Stop words** are typically (though not always) _removed_ from further analysis in some NLP problems.
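
The idea can be sketched with a tiny hand-written stop list. Both `SAMPLE_STOP_WORDS` and `remove_stopwords_simple` are hypothetical names used only here; the cells below rely on the much larger lists shipped with NLTK and spaCy.

```python
import re

# Tiny hand-written stop list, for illustration only
SAMPLE_STOP_WORDS = {"a", "an", "the", "of", "in", "is", "this", "off"}


def remove_stopwords_simple(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase `text`, keep only word tokens, and drop those in `stop_words`."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stopwords_simple("This is a sample sentence, showing off the stop words filtration."))
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```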

#### NLTK

In [None]:
# download stopwords
import nltk
nltk.download('stopwords')

In [None]:
ls ~/nltk_data/corpora/stopwords

In [None]:
text = "This is a sample sentence, showing off the stop words filtration."
text

In [None]:
remove_stopwords(text)

##### **Portuguese**

In [None]:
text = "Este é um texto incrível, o melhor texto que você lerá hoje, te garanto."
text

In [None]:
remove_stopwords(text, language='portuguese')

#### SpaCy

In [None]:
text = "This is a sample sentence, showing off the stop words filtration."
text

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
stop_words = nlp.Defaults.stop_words  # set of stop-word strings for this language



In [None]:
list(stop_words)[:10]

In [None]:
remove_stopwords_spacy(text)

In [None]:
remove_stopwords_spacy_2(text)

### 3.7. Stemming and Lemmatization

<img src='./imgs/stemming_vs_lemmatization.png' /> <br/>
Source: https://www.kaggle.com/getting-started/186152

<img src='./imgs/stemming_vs_lemmatization_2.png' /> <br/>
Source: https://www.quora.com/What-is-difference-between-stemming-and-lemmatization

#### **Stemming**
- Refers to the process of **removing suffixes and reducing a word to some *base (stem)*** form such that all different variants of that word can be represented by the _same form_.
- The words obtained (_stems_) are not guaranteed to exist, which can be a problem depending on what we are solving.
- **Porter Stemmer** and **Snowball Stemmer** are popular _stemming algorithms_ included in NLTK.
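
To make the idea concrete, here is a deliberately naive suffix stripper. It is far simpler than the Porter or Snowball algorithms, and the suffix list and the `naive_stem` name are assumptions chosen for this illustration.

```python
SUFFIXES = ("ment", "ing", "ed", "s")  # checked in this order


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


print(f'Word           | Stem')
print(f'---------------|---------------')

for word in ['develop', 'developed', 'developing', 'development', 'running']:
    print(f'{word:14} | {naive_stem(word):14}')
```

All four variants of "develop" collapse to the same stem, while "running" becomes "runn", a stem that is not an actual word.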

In [None]:
word_list = ['develop', 'developed', 'developing', 'development']

In [None]:
# PorterStemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
print(f'Word           | Stem')
print(f'---------------|---------------')

for word in word_list:
    stem = stemmer.stem(word)
    print(f'{word:14} | {stem:14}')

In [None]:
# SnowballStemmer
from nltk.stem import SnowballStemmer
snow_stemmer = SnowballStemmer(language='english')

In [None]:
print(f'Word           | Stem')
print(f'---------------|---------------')

for word in word_list:
    stem = snow_stemmer.stem(word)
    print(f'{word:14} | {stem:14}')

##### **Portuguese**
- **RSLPStemmer** is a popular _stemming algorithm_ designed for **Portuguese**, which is also included in NLTK.
- We can also use **Snowball Stemmer** for _Portuguese_.

In [None]:
word_list = ['amigo', 'amizade', 'amigas', 'amigão']

In [None]:
# RSLPStemmer
import nltk
nltk.download('rslp')

In [None]:
from nltk.stem import RSLPStemmer
stemmer = RSLPStemmer()

In [None]:
print(f'Word           | Stem')
print(f'---------------|---------------')

for word in word_list:
    stem = stemmer.stem(word)
    print(f'{word:14} | {stem:14}')

In [None]:
# SnowballStemmer
from nltk.stem import SnowballStemmer
snow_stemmer = SnowballStemmer(language='portuguese')

In [None]:
print(f'Word           | Stem')
print(f'---------------|---------------')

for word in word_list:
    stem = snow_stemmer.stem(word)
    print(f'{word:14} | {stem:14}')

#### **Lemmatization**
- The process of ***mapping* all the different forms of a word to its *base word (lemma)***.
- Requires more _linguistic knowledge_ than _stemming_.
- **WordNetLemmatizer**, which relies on the _WordNet_ lexical database, is a popular _lemmatizer_ included in NLTK.

In [None]:
word_list = ['drive', 'drives', 'drove', 'driven']

In [None]:
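# download the WordNet data used by the lemmatizer
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()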

In [None]:
print(f'Word           | Lemma')
print(f'---------------|---------------')

for word in word_list:
    # pos='v' means the word to be analyzed is a verb
    # we need to be explicit here
    # there are other options: e.g., 'a' for adjectives
    lemma = lemmatizer.lemmatize(word, pos='v')
    print(f'{word:14} | {lemma:14}')

In [None]:
# lemmatization for an adjective
print(lemmatizer.lemmatize('better', pos='a'))

#### **Lemmatization** in `spaCy`

spaCy does not ship a stemmer; it only provides lemmatization.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')  # small vocabulary

In [None]:
doc = nlp('better')

In [None]:
type(doc)

In [None]:
word_list = ['drive', 'drives', 'drove', 'driven']

In [None]:
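print(f'Word           | Lemma')
print(f'---------------|---------------')

for word in word_list:
    # each word is processed on its own, so the lemmatizer sees no context
    print(f'{word:14} | {nlp(word)[0].lemma_:14}')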

##### **Portuguese**

In [None]:
import spacy
nlp = spacy.load('pt_core_news_sm')  # small vocabulary

In [None]:
word_list = ['amigo', 'amizade', 'amigas', 'amigão']

In [None]:
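print(f'Word           | Lemma')
print(f'---------------|---------------')

for word in word_list:
    # each word is processed on its own, so the lemmatizer sees no context
    print(f'{word:14} | {nlp(word)[0].lemma_:14}')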

### 3.8. Part-Of-Speech (POS) Tagging
- Also called ***grammatical tagging***, it is the _automatic assignment_ of **POS tags** to _words_ in a sentence.
- A **POS** is a _grammatical_ classification that commonly includes verbs, adjectives, adverbs, nouns, etc.
- **POS tags** make it possible for _automatic text processing tools_ to take into account which part of speech each word is.
- This facilitates the use of linguistic criteria in addition to statistics.
- For languages where the same word can have different parts of speech, e.g. work in English, **POS tags** are used to distinguish between the occurrences of the word when used as a noun or verb.

#### View token tags
Recall that you can obtain a particular token by its index position.
* To view the coarse POS tag use `token.pos_`
* To view the fine-grained tag use `token.tag_`
* To view the description of either type of tag use `spacy.explain(tag)`

In [None]:
text_bill = "Bill Gates is an American entrepreneur and philanthropist who co-founded Microsoft Corporation in 1975."

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')  # small vocabulary

In [None]:
# Process the text with spaCy
doc = nlp(text_bill)

In [None]:
print(f'Word           | POS tag  | Tag    | POS explanation')
print(f'---------------|----------|--------|----------------')

for token in doc:
    print(f'{token.text:14} | {token.pos_:8} | {token.tag_:6} | {spacy.explain(token.pos_):14}')

#### Coarse-grained Part-of-speech Tags
Every token is assigned a POS Tag from the following list:


<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
    
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not,*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>

___
#### Fine-grained Part-of-speech Tags
Tokens are subsequently given a fine-grained tag as determined by morphology:
<table>
<tr><th>POS</th><th>Description</th><th>Fine-grained Tag</th><th>Description</th><th>Morphology</th></tr>
<tr><td>ADJ</td><td>adjective</td><td>AFX</td><td>affix</td><td>Hyph=yes</td></tr>
<tr><td>ADJ</td><td></td><td>JJ</td><td>adjective</td><td>Degree=pos</td></tr>
<tr><td>ADJ</td><td></td><td>JJR</td><td>adjective, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADJ</td><td></td><td>JJS</td><td>adjective, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADJ</td><td></td><td>PDT</td><td>predeterminer</td><td>AdjType=pdt PronType=prn</td></tr>
<tr><td>ADJ</td><td></td><td>PRP\$</td><td>pronoun, possessive</td><td>PronType=prs Poss=yes</td></tr>
<tr><td>ADJ</td><td></td><td>WDT</td><td>wh-determiner</td><td>PronType=int rel</td></tr>
<tr><td>ADJ</td><td></td><td>WP\$</td><td>wh-pronoun, possessive</td><td>Poss=yes PronType=int rel</td></tr>
<tr><td>ADP</td><td>adposition</td><td>IN</td><td>conjunction, subordinating or preposition</td><td></td></tr>
<tr><td>ADV</td><td>adverb</td><td>EX</td><td>existential there</td><td>AdvType=ex</td></tr>
<tr><td>ADV</td><td></td><td>RB</td><td>adverb</td><td>Degree=pos</td></tr>
<tr><td>ADV</td><td></td><td>RBR</td><td>adverb, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADV</td><td></td><td>RBS</td><td>adverb, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADV</td><td></td><td>WRB</td><td>wh-adverb</td><td>PronType=int rel</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>CC</td><td>conjunction, coordinating</td><td>ConjType=coor</td></tr>
<tr><td>DET</td><td>determiner</td><td>DT</td><td>determiner</td><td></td></tr>
<tr><td>INTJ</td><td>interjection</td><td>UH</td><td>interjection</td><td></td></tr>
<tr><td>NOUN</td><td>noun</td><td>NN</td><td>noun, singular or mass</td><td>Number=sing</td></tr>
<tr><td>NOUN</td><td></td><td>NNS</td><td>noun, plural</td><td>Number=plur</td></tr>
<tr><td>NOUN</td><td></td><td>WP</td><td>wh-pronoun, personal</td><td>PronType=int rel</td></tr>
<tr><td>NUM</td><td>numeral</td><td>CD</td><td>cardinal number</td><td>NumType=card</td></tr>
<tr><td>PART</td><td>particle</td><td>POS</td><td>possessive ending</td><td>Poss=yes</td></tr>
<tr><td>PART</td><td></td><td>RP</td><td>adverb, particle</td><td></td></tr>
<tr><td>PART</td><td></td><td>TO</td><td>infinitival to</td><td>PartType=inf VerbForm=inf</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>PRP</td><td>pronoun, personal</td><td>PronType=prs</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>NNP</td><td>noun, proper singular</td><td>NounType=prop Number=sing</td></tr>
<tr><td>PROPN</td><td></td><td>NNPS</td><td>noun, proper plural</td><td>NounType=prop Number=plur</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>-LRB-</td><td>left round bracket</td><td>PunctType=brck PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>-RRB-</td><td>right round bracket</td><td>PunctType=brck PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>,</td><td>punctuation mark, comma</td><td>PunctType=comm</td></tr>
<tr><td>PUNCT</td><td></td><td>:</td><td>punctuation mark, colon or ellipsis</td><td></td></tr>
<tr><td>PUNCT</td><td></td><td>.</td><td>punctuation mark, sentence closer</td><td>PunctType=peri</td></tr>
<tr><td>PUNCT</td><td></td><td>''</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>""</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>``</td><td>opening quotation mark</td><td>PunctType=quot PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>HYPH</td><td>punctuation mark, hyphen</td><td>PunctType=dash</td></tr>
<tr><td>PUNCT</td><td></td><td>LS</td><td>list item marker</td><td>NumType=ord</td></tr>
<tr><td>PUNCT</td><td></td><td>NFP</td><td>superfluous punctuation</td><td></td></tr>
<tr><td>SYM</td><td>symbol</td><td>#</td><td>symbol, number sign</td><td>SymType=numbersign</td></tr>
<tr><td>SYM</td><td></td><td>\$</td><td>symbol, currency</td><td>SymType=currency</td></tr>
<tr><td>SYM</td><td></td><td>SYM</td><td>symbol</td><td></td></tr>
<tr><td>VERB</td><td>verb</td><td>BES</td><td>auxiliary "be"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>HVS</td><td>forms of "have"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>MD</td><td>verb, modal auxiliary</td><td>VerbType=mod</td></tr>
<tr><td>VERB</td><td></td><td>VB</td><td>verb, base form</td><td>VerbForm=inf</td></tr>
<tr><td>VERB</td><td></td><td>VBD</td><td>verb, past tense</td><td>VerbForm=fin Tense=past</td></tr>
<tr><td>VERB</td><td></td><td>VBG</td><td>verb, gerund or present participle</td><td>VerbForm=part Tense=pres Aspect=prog</td></tr>
<tr><td>VERB</td><td></td><td>VBN</td><td>verb, past participle</td><td>VerbForm=part Tense=past Aspect=perf</td></tr>
<tr><td>VERB</td><td></td><td>VBP</td><td>verb, non-3rd person singular present</td><td>VerbForm=fin Tense=pres</td></tr>
<tr><td>VERB</td><td></td><td>VBZ</td><td>verb, 3rd person singular present</td><td>VerbForm=fin Tense=pres Number=sing Person=3</td></tr>
<tr><td>X</td><td>other</td><td>ADD</td><td>email</td><td></td></tr>
<tr><td>X</td><td></td><td>FW</td><td>foreign word</td><td>Foreign=yes</td></tr>
<tr><td>X</td><td></td><td>GW</td><td>additional word in multi-word expression</td><td></td></tr>
<tr><td>X</td><td></td><td>XX</td><td>unknown</td><td></td></tr>
<tr><td>SPACE</td><td>space</td><td>_SP</td><td>space</td><td></td></tr>
<tr><td></td><td></td><td>NIL</td><td>missing tag</td><td></td></tr>
</table>
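
The descriptions in the table can also be looked up programmatically. The sketch below uses `spacy.explain`, which needs no trained model; the list of tags is just a sample picked from the table.

```python
import spacy

# A few fine-grained tags picked from the table above
sample_tags = ["NNS", "VBZ", "RBR", "WRB"]

for tag in sample_tags:
    # spacy.explain looks the label up in spaCy's built-in glossary
    print(f"{tag:4} | {spacy.explain(tag)}")

# For a label that is not in the glossary, spacy.explain returns None
unknown = spacy.explain("NOT_A_TAG")
print(unknown)
```

Because unknown labels yield `None`, it is safer to provide a fallback (for example `spacy.explain(tag) or ''`) before applying a format width to the result.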

#### Visualizing Parts of Speech
spaCy ships with a built-in visualizer called **displaCy**, which can draw the dependency parse of a `Doc` together with the POS tag of each token:

In [None]:
# Import the displaCy library
from spacy import displacy

In [None]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

#### Portuguese

In [None]:
text_mau = 'Maurício de Sousa é o criador da Turma da Mônica, uma série de gibis brasileiros que foi criada em 1959.'
text_mau

In [None]:
import spacy
nlp = spacy.load('pt_core_news_sm')  # small trained pipeline

doc = nlp(text_mau)

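# Table header (plain strings: there are no placeholders to format)
print('Word           | POS tag  | Explanation POS | Tag    ')
print('---------------|----------|-----------------|----------------')

# One row per token: text, coarse-grained POS, its explanation, fine-grained tag
for token in doc:
    print(f'{token.text:14} | {token.pos_:8} | {spacy.explain(token.pos_) or "":15} | {token.tag_:6}')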

### 3.9. Named-Entity Recognition (NER)
- **NER** identifies whether a word, or a sequence of words, is a ***named entity*** and, if so, of which type.
- ***Named entities*** are persons, locations, organizations, time expressions, etc.
- The problem can be broken down into _detection of names_ followed by _classification of those names_ into the corresponding categories.
- Words recognized as named entities by **NER** are most often tagged as ***nouns*** (or proper nouns) by a _POS tagger_.
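
The two steps above (detection, then classification) can be illustrated with a deliberately naive sketch. It is **not** how spaCy's statistical NER works: the `GAZETTEER` dictionary, the `toy_ner` function and the sample sentence are hypothetical, and detection is reduced to "runs of capitalized words".

```python
import re

# Hypothetical mini-gazetteer (illustrative only): known names and their categories
GAZETTEER = {
    "Bill Gates": "PERSON",
    "Microsoft": "ORG",
    "Seattle": "GPE",
}

def toy_ner(text):
    """Detect runs of capitalized words, then classify them by dictionary lookup."""
    entities = []
    # Step 1 - detection: candidate names are sequences of capitalized words
    for match in re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text):
        candidate = match.group()
        # Step 2 - classification: assign a category if the name is known
        label = GAZETTEER.get(candidate)
        if label is not None:
            entities.append((candidate, label))
    return entities

sample = "Bill Gates founded Microsoft and was born in Seattle."
toy_entities = toy_ner(sample)
print(toy_entities)
```

A real NER model replaces both steps with learned components, which is why it can also recognize names it has never seen in a dictionary.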

In [None]:
text_bill

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')  # small trained pipeline

In [None]:
doc = nlp(text_bill)

In [None]:
# Table header
print('Word                   | NER           | Explanation')
print('-----------------------|---------------|-------------------------------')

# Print the named entities in the text; spacy.explain returns None for
# labels it does not know, so fall back to an empty string
for ent in doc.ents:
    print(f'{ent.text:22} | {ent.label_:13} | {spacy.explain(ent.label_) or "":30}')

#### Entity annotations
`Doc.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string label</td></tr>
<tr><td>`ent.start`</td><td>The *start* token index of the span in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The *end* token index of the span in the Doc (exclusive)</td></tr>
<tr><td>`ent.start_char`</td><td>The *start* character offset of the entity text in the Doc's text</td></tr>
<tr><td>`ent.end_char`</td><td>The *end* character offset of the entity text in the Doc's text (exclusive)</td></tr>
</table>
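
To make the difference between token indices and character offsets concrete, here is a minimal sketch. It assumes a blank English pipeline (tokenizer only, no trained model) and sets two entities by hand, so the sentence, the labels and the names `nlp_blank` and `doc_demo` are illustrative rather than model output.

```python
import spacy
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, no trained model required
nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("Bill Gates founded Microsoft in 1975.")

# Entities set by hand (illustrative), given as token spans [start, end)
doc_demo.ents = [
    Span(doc_demo, 0, 2, label="PERSON"),
    Span(doc_demo, 3, 4, label="ORG"),
]

for ent in doc_demo.ents:
    # Token indices slice the Doc; character offsets slice the raw text
    token_slice = doc_demo[ent.start:ent.end].text
    char_slice = doc_demo.text[ent.start_char:ent.end_char]
    print(f"{ent.text:12} | {ent.label_:7} "
          f"| tokens [{ent.start}, {ent.end}) -> {token_slice!r} "
          f"| chars [{ent.start_char}, {ent.end_char}) -> {char_slice!r}")
```

Both slices recover the same entity text, which is a handy sanity check when aligning spaCy entities with annotations stored as character offsets.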



We can **extend *NER*** in `spaCy` to label new entity types:
- https://towardsdatascience.com/extend-named-entity-recogniser-ner-to-label-new-entities-with-spacy-339ee5979044
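
One lightweight way to add entities without retraining a model is spaCy's rule-based entity ruler. The sketch below assumes spaCy v3 (where the component is added with `nlp.add_pipe("entity_ruler")`) and a blank Portuguese pipeline, so no trained model is needed. The `COMIC` label, the four-digit date rule, the sentence and the names `nlp_rules` and `doc_rules` are hypothetical choices made for illustration.

```python
import spacy

# Blank Portuguese pipeline (tokenizer only), so no trained model is required
nlp_rules = spacy.blank("pt")

# Add the rule-based entity ruler (spaCy v3 API) and register two patterns
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    # Token pattern: any token made of exactly four digits (a crude rule for years)
    {"label": "DATE", "pattern": [{"SHAPE": "dddd"}]},
    # Phrase pattern: an exact phrase with a hypothetical custom label
    {"label": "COMIC", "pattern": "Turma da Mônica"},
])

doc_rules = nlp_rules("A Turma da Mônica foi criada em 1959.")
rule_entities = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(rule_entities)
```

In a trained pipeline, the ruler is usually placed relative to the statistical component, for example with `nlp.add_pipe("entity_ruler", before="ner")`, so that the rule-based entities are set first and the model works around them.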

#### Visualizing NER
- https://spacy.io/usage/visualizers#ent

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc, style='ent')

#### Portuguese

In [None]:
text_mau

In [None]:
import spacy
nlp = spacy.load('pt_core_news_sm')  # small trained pipeline

In [None]:
doc = nlp(text_mau)

In [None]:
# Table header
print('Word                   | NER           | Explanation')
print('-----------------------|---------------|-------------------------------')

# Print the named entities in the text; spacy.explain returns None for
# labels it does not know, so fall back to an empty string
for ent in doc.ents:
    print(f'{ent.text:22} | {ent.label_:13} | {spacy.explain(ent.label_) or "":30}')

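The year **wasn't** recognized as a **named entity**: the Portuguese model's NER labels (`PER`, `LOC`, `ORG`, `MISC`) do not include a date type, unlike those of the English model.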

In [None]:
displacy.render(doc, style='ent')

## See more:
- https://www.alura.com.br/artigos/lemmatization-vs-stemming-quando-usar-cada-uma
- https://www.askpython.com/python/examples/pos-tagging-in-nlp-using-spacy
- https://towardsdatascience.com/text-preprocessing-in-natural-language-processing-using-python-6113ff5decd8
- https://medium.com/product-ai/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908
