<a href="https://colab.research.google.com/github/pstorniolo/Master2021/blob/main/L1_Text_Cleaning_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing - Lesson 1**


---


### > Schedule
The timeframes are only estimates and may vary according to how the class is progressing.

0. Strings and regex with Python
1. **Text Cleaning and Preprocessing**
2. Word Embedding
3. Going towards BERT

### > Why Colab? 
Colab (Google Colaboratory) is a free cloud service based on Jupyter Notebooks that supports... FREE GPU!!! <3

Lectures will be held through Colab Notebooks. Getting your own copy of each notebook takes just a few simple steps:

( *only one thing is required ... a Google Drive or GitHub account* )

1.   Click on https://bit.ly/3oktucr  
2.   `File` > `Save a copy in Drive` / or `Save a copy on GitHub`
3.   (Drive option) Go to your `Drive` and check that the copy of the notebook is in the `Colab Notebooks` folder 
4.   (GitHub option) Choose which `repository` to copy the notebook to and then `open it with Colab`

## **Lesson 1 - Text Cleaning and Preprocessing**
The Data Scientist's job, as you surely know, has been declared ***The Sexiest Job of the 21st Century***. 

![alt text](https://whatsthebigdata.files.wordpress.com/2016/05/time.jpg?w=768)

Data cleaning is a time-consuming and unenjoyable task, yet it's a very important one. Keep in mind: **"garbage in, garbage out"**. Feeding dirty data into a model will give us meaningless results.

This applies to any Data Science application or project, but it is even more true for NLP projects.


As a Data Scientist, you can use NLP for sentiment analysis (classifying text as having a positive or negative connotation) or to make predictions in classification models, among other things. Typically, whether we’re given the data or have to scrape it, the text will be in its natural human format of sentences, paragraphs, tweets, etc. From there, before we can dig into the analysis, we have to do some cleaning to break the text down into a format the computer can easily understand.

Just to give you a rough idea of what we're talking about, let's look at these excerpts of real text (taken from a corpus of chat conversations):



> ------ Inizio chat: giovedì, giugno 01, 2017, 10:05:31 (+0200) Origine chat: Chat_Prevention Agente VIOLA ( 2s ) VIOLA: Benvenuto/a nel Servizio Clienti <tagged>, sono VIOLA. Posso chiederti il motivo per cui vuoi inviare disdetta? ( 45s ) 1498743565.6.1.fJRQ1qXM.51785098108183337..2P7WhXWCLXggWeIpnzP6q3IZyXFZi0ojXS7d576-OcE.df6e3a7f: buongiorno Viola, non ci interessa più il servizio, vorrei capire che modulo debbo scaricare, grazie mille ( 1m 15s ) VIOLA: Buongiorno ( 1m 21s ) VIOLA: ora ti fornisco tutte le info ( 1m 50s ) 1498743565.6.1.fJRQ1qXM.51785098108183337..2P7WhXWCLXggWeIpnzP6q3IZyXFZi0ojXS7d576-OcE.df6e3a7f: grazie mille per tutto, siamo soddisfatti di <tagged> e sicuramente in tempi diversi torneremo ( 2m 37s )......( 4m 30s ) VIOLA: Sei ancora in linea? ( 6m 3s ) 1498743565.6.1.fJRQ1qXM.51785098108183337..2P7WhXWCLXggWeIpnzP6q3IZyXFZi0ojXS7d576-OcE.df6e3a7f: si viola, ma noi abbiamo due gemelle nate da poco, come può immaginare, in questo momento siamo completamente assorbiti, sarò franca, stiamo pagando un servizio che non sfruttiamo più. Ma ripeto, appena le acque si saranno placate toreneremo ( 6m 23s ) 



> ------ Inizio chat: sabato, giugno 03, 2017, 18:08:20 (+0200) Origine chat: Chat_Amministrativa_Pulsante Agente Miriam ( 0s ) Miriam: Benvenuto, sono Miriam, in cosa posso esserti utile? ( 20s ) GIANNINO: Buonasera Miriam, ( 3m 20s ) GIANNINO: Il giorno 01/12/2016 il mio abbonamento è stato variato su mia richiesta e mi èstata inviata una email confermandomi la variazione e specificando che il nuovo importo mensile era di 48,90 Euro. Fino ad oggi ho pagato 54,90 Euro. Mi può aiutare? ...



> ------ Inizio chat: sabato, giugno 03, 2017, 18:59:23 (+0200) Origine chat: Chat_Amministrativa_Pulsante Agente Marco ( 0s ) Marco: Benvenuto, sono Marco, in cosa posso esserti utile? ( 32s ) Maria Annunziata: VORREI CAMBIARE LA MODALITà DI PAGAMENTO DA ANNUALE HA MENSILE





The pre-processing steps for a problem depend mainly on the domain and the problem itself, hence, we don’t need to apply all steps to every problem.

**Text Cleaning** 

1. Remove whitespace and unwanted text (e.g. `\n`)
2. Make all text lowercase
3. Expand abbreviations
4. Remove punctuation
5. Remove numerical values or convert numbers into words 
6. Handle typos (regex or spell corrector)


**Text Pre-processing**

7. Tokenize text
8. Remove stop words (after TF-IDF)
9. Stemming / lemmatization
10. Parts of speech tagging
11. NER (Named Entity Recognition)

**Vectorization**

12. Bag of Words
13. TF-IDF
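
To make the list above concrete, here is a minimal sketch that chains a few of the cleaning steps with the standard library only. The helper names `clean_text` and `tokenize` are made up for this illustration, and the order of the steps is one possible choice, not a rule.

```python
import re
import string

def clean_text(text):
    """Apply a few of the cleaning steps from the list, one after another."""
    text = text.lower()                                                 # lowercase
    text = "".join(ch for ch in text if ch not in string.punctuation)   # remove punctuation
    text = re.sub(r"\d+", "", text)                                     # remove numerical values
    text = " ".join(text.split())                                       # collapse whitespace, drop \n
    return text

def tokenize(text):
    """Naive whitespace tokenizer (real tokenizers are smarter)."""
    return text.split()

sample = "  The 5 biggest countries,\nby population in 2017!  "
tokens = tokenize(clean_text(sample))
print(tokens)
```

Which steps you keep, and in which order, depends on the problem: for instance, you would skip the digit removal if numbers matter for your analysis.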




---


In this notebook, we are going to see text cleaning and preprocessing in Python. We will be using the **NLTK** (Natural Language Toolkit) library here.

In [None]:
# Import the necessary libraries 
import nltk 
import string 
import re 



nltk.download('stopwords')
nltk.download('wordnet') 
# WordNet is a database which is built for natural language processing. It includes groups of synonyms and a brief definition.
# WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms 
# (synsets), each expressing a distinct concept (https://wordnet.princeton.edu/)
#nltk.download('omw')
#
from nltk import chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
#syn = wordnet.synsets("dolore", lang='ita') 
print(syn[0].definition()) 
print(syn[0].examples())
synonyms = [] 
for syn in wordnet.synsets('computer'): 
    for lemma in syn.lemmas(): 
        synonyms.append(lemma.name()) 
print(synonyms)

a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']
['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']


Let's begin with the most common text-processing practices that aim to reduce the variance of the text: **lowercasing**, **handling stop words** and **punctuation**.

>>>>> DO YOU REMEMBER WHY IT IS SO IMPORTANT?? 



### Text Lowercase
It is a very common practice. Lowercasing the text reduces the size of the vocabulary of our text data.

In [None]:
def text_lowercase(text):
    return text.lower()

input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = text_lowercase(input_str) 
input_str

'the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.'

### Remove punctuation
You should remove punctuation so that you don’t end up with different forms of the same word. If you don’t remove the punctuation, then 

> been. 

> been, 

> been! 

will be treated separately. 

One way of doing this is by looping through the characters of the string with a list comprehension and keeping everything that is not in `string.punctuation`, a string containing all ASCII punctuation characters, available thanks to the `import string` at the beginning.


In [None]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
def remove_punctuation(text): 
    no_punct_text = "".join([char for char in text if char not in string.punctuation])
    return no_punct_text   
input_str = remove_punctuation(input_str) 
input_str

'the 5 biggest countries by population in 2017 are china india united states indonesia and brazil'

But, **be very careful**! If email addresses (or other tokens that rely on punctuation) are important in your application, removing punctuation this way destroys them for good. 

Keep in mind: always think how much a character is important to you.
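
One possible way around this, sketched here under simplifying assumptions: temporarily swap each email address for a punctuation-free placeholder, strip the punctuation, then put the addresses back. The regular expression `EMAIL_RE` is a deliberately simple pattern written for this example (it does not cover every valid address), and the function name and placeholder are hypothetical.

```python
import re
import string

# Simplified email pattern, good enough for an illustration only
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PLACEHOLDER = "EMAILPLACEHOLDER"  # contains no punctuation, so it survives the stripping

def remove_punctuation_keep_emails(text):
    emails = EMAIL_RE.findall(text)               # remember the addresses, in order
    protected = EMAIL_RE.sub(PLACEHOLDER, text)   # hide them behind the placeholder
    stripped = "".join(ch for ch in protected if ch not in string.punctuation)
    for email in emails:                          # restore them one by one, left to right
        stripped = stripped.replace(PLACEHOLDER, email, 1)
    return stripped

print(remove_punctuation_keep_emails("Write to mario.rossi@example.com, please!"))
```

The same trick can be used for URLs, prices or any other pattern whose punctuation carries meaning.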

### Remove numbers
Remove numbers if they are not relevant to your analyses. Usually, regular expressions are used to remove numbers.

In [None]:
re.sub(r'\d+', '', input_str)

'the  biggest countries by population in  are china india united states indonesia and brazil'

If instead numbers are important for the analysis, you can also convert the numbers into words. This can be done by using the `inflect` library.

In [None]:
# inflect generates plurals, singular nouns, ordinals and indefinite articles, and converts numbers to words
import inflect 
p = inflect.engine() 

# convert number into words 
def convert_number(text): 
	# split string into list of words 
	temp_str = text.split() 
	# initialise empty list 
	new_string = [] 

	for word in temp_str: 
		# if the word consists only of digits, convert it 
		# to words and append the result to the new_string list 
		if word.isdigit(): 
			temp = p.number_to_words(word) 
			new_string.append(temp) 

		# append the word as it is 
		else: 
			new_string.append(word) 

	# join the words of new_string to form a string 
	temp_str = ' '.join(new_string) 
	return temp_str 

input_str = convert_number(input_str) 
input_str

'the five biggest countries by population in two thousand and seventeen are china india united states indonesia and brazil'

### Remove whitespaces 
Use the `split` and `join` methods to strip leading and trailing whitespace and collapse every run of whitespace in a string into a single space.
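
A quick sketch of why this works: `str.split()` called without arguments splits on any run of whitespace (spaces, tabs, newlines) and discards empty pieces, so joining the pieces with a single space normalizes everything. The `re.sub` variant is shown only for comparison, and the variable names are just for this example.

```python
import re

messy = "line one\n\n\tline   two \r\n"

# split() with no arguments handles spaces, tabs and newlines alike
collapsed_split = " ".join(messy.split())

# Equivalent regex formulation: collapse whitespace runs, then trim the ends
collapsed_regex = re.sub(r"\s+", " ", messy).strip()

print(repr(collapsed_split))
print(collapsed_split == collapsed_regex)
```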

In [None]:
def remove_whitespace(text): 
    return  " ".join(text.split()) 

remove_whitespace("   we don't need   the given sentence") 

"we don't need the given sentence"

### Remove unwanted text
Use the `str.replace()` and `re.sub()` methods and a laaaarge amount of regular expressions.

In [None]:
def remove_unwanted_text(text):
  new_text = str(text)
  new_text = re.sub(r'\'?(\d+)[-x,\.:]?(\d+)?', r'\1', new_text)  # 41,5 -> 41 | 3,65 -> 3  (drop the decimal part)
  new_text = re.sub(r'3+[0-9]{9}', '<mobilephone>', new_text)  # replace mobile numbers with a <mobilephone> tag
  new_text = re.sub(r'(\:(-)?\)|\:(-)?\(|<3|\:(-)?\/|\:-\/|\:(-)?\||\:(-)?[pP]|\s\:+(-)?([0-9])?\s|\^\^|\s\:+(-)?(\D)?\s)', '', new_text)  # remove emoticons such as :) :-( :P, plus <3 and ^^
  new_text = new_text.replace('1st', 'first')
  new_text = new_text.replace('2nd', 'second')  
  new_text = re.sub('xké|xkè|xchè|xke|xche|perche|perché', 'perchè',new_text, flags=re.IGNORECASE)
  new_text = re.sub('xo|xò', 'però',new_text, flags=re.IGNORECASE) 
  return new_text

print(remove_unwanted_text('Do the sum of 34.5 and 65,7'))
print(remove_unwanted_text('Please call me at 3331234567!'))
print(remove_unwanted_text('Your new car is fantastic :) :-) :P'))
print(remove_unwanted_text('The 1st classified is Luca'))
print(remove_unwanted_text('Ma xke dici così?'))

Do the sum of 34 and 65
Please call me at <mobilephone>!
Your new car is fantastic 
The first classified is Luca
Ma perchè dici così?


### Tokenization
Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens. 

And now, let's start using the **nltk** library. The Natural Language Toolkit (NLTK) is one of the most popular libraries for natural language processing (NLP); it is written in Python and has a big community behind it.

NLTK is also very easy to learn; it is probably one of the easiest NLP libraries you’ll use.

NLTK has a very important module: `tokenize`.

In these applications we will compare NLTK with **spaCy**, a more complex and articulated NLP library that is used for more advanced tasks. 


#### Tokenization - Sentences 

> In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.



In [None]:
import nltk
from nltk.tokenize import sent_tokenize

# The NLTK data package includes a pre-trained Punkt tokenizer for English.
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
text = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort."

In [None]:
sent_tokenize(text)

['In a hole in the ground there lived a hobbit.',
 'Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort.']

And now with **spaCy**. 

In [None]:
import spacy 
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
doc = nlp(text)
for s in doc.sents:
  print(s)

In a hole in the ground there lived a hobbit.
Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat
: it was a hobbit-hole, and that means comfort.



#### Tokenization - Words

Word tokenization: Split a text into individual words.

In [None]:
from nltk import word_tokenize

In [None]:
text = "In a hole in the ground there lived a hobbit."
word_tokenize(text)

['In',
 'a',
 'hole',
 'in',
 'the',
 'ground',
 'there',
 'lived',
 'a',
 'hobbit',
 '.']

In [None]:
doc = nlp(text)
for word in doc:
  print(word)

In
a
hole
in
the
ground
there
lived
a
hobbit
.


In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')  # keeps only runs of word characters, so punctuation is dropped
tokenizer.tokenize("So lucky! I won a lottery. Am I the 1st?")

['So', 'lucky', 'I', 'won', 'a', 'lottery', 'Am', 'I', 'the', '1st']

In [None]:
# Instantiate tokenizer
tokenizer = RegexpTokenizer(r'\w+') 
tokenizer_1 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+') 
tokenizer_2 = RegexpTokenizer(r'[A-Z]\w+')  # only words of 2+ characters that begin with a capital letter 

text = "So lucky! I won a lottery. Am I the 1st?"
print(tokenizer.tokenize(text))
print(tokenizer_1.tokenize(text))
print(tokenizer_2.tokenize(text))
print([t for t in text.split()])

['So', 'lucky', 'I', 'won', 'a', 'lottery', 'Am', 'I', 'the', '1st']
['So', 'lucky', '!', 'I', 'won', 'a', 'lottery', '.', 'Am', 'I', 'the', '1st', '?']
['So', 'Am']
['So', 'lucky!', 'I', 'won', 'a', 'lottery.', 'Am', 'I', 'the', '1st?']


We can’t simply split on spaces.

“They aren’t here.”  --> "They" "are" "n't" "here"
 

In [None]:
print(word_tokenize("They aren’t here."))
doc = nlp("They aren’t here.")
for word in doc:
  print(word)

['They', 'aren', '’', 't', 'here', '.']
They
are
n’t
here
.
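
In the output above, NLTK stumbles on the typographic apostrophe (`’`), while spaCy splits the contraction as expected. A minimal sketch of one workaround is to normalize curly apostrophes to straight ones before tokenizing. The small regex tokenizer below is a toy written for this example (it is not how NLTK or spaCy work internally), and its function names are hypothetical.

```python
import re

def normalize_apostrophes(text):
    """Replace typographic single quotes (U+2018, U+2019) with a plain apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'")

def tokenize_with_contractions(text):
    """Toy tokenizer: detach "n't" from the preceding word, then split words and punctuation."""
    text = normalize_apostrophes(text)
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)      # aren't -> are n't
    return re.findall(r"n't|\w+|[^\w\s]", text)       # n't, words, single punctuation marks

print(tokenize_with_contractions("They aren\u2019t here."))
```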


### N-Gram

An n-gram is a subsequence of n elements of a given sequence.

> In a hole in the ground there lived a hobbit. 

Unigrams - 1 word: “In”, “a”, “hole”, “in”, ...

Bigrams - 2 words: “In a”, “a hole”, “hole in”, ...

Trigrams - 3 words: “In a hole”, “a hole in”, ...  
Etc…

Applications: **Keyword extraction**
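
The definition is simple enough to implement by hand. The sketch below slides a window of size `n` over a token list using `zip`; `make_ngrams` is a name chosen for this example, and the whitespace split is the same naive tokenization used elsewhere in this notebook.

```python
def make_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    # zip the list with itself shifted by 1, 2, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "In a hole in the ground there lived a hobbit.".split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams[:3])
print(trigrams[:2])
```

A sentence of 10 tokens yields 9 bigrams and 8 trigrams: in general, `len(tokens) - n + 1` n-grams.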

In [None]:
from nltk import ngrams

sentence = 'In a hole in the ground there lived a hobbit.'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
  print(grams)

('In', 'a', 'hole', 'in', 'the', 'ground')
('a', 'hole', 'in', 'the', 'ground', 'there')
('hole', 'in', 'the', 'ground', 'there', 'lived')
('in', 'the', 'ground', 'there', 'lived', 'a')
('the', 'ground', 'there', 'lived', 'a', 'hobbit.')


### Remove stop words
"*Stop words*" are the most common words in a language, like "the", "a", "on", "is", "all". These words usually carry little meaning and are often removed from texts.

NLTK ships lists of the most frequently used words, which we import with `from nltk.corpus import stopwords`. You can run `stopwords.words(<language>)` to get the full list for any supported language. There are 179 English words, including ‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘you’, ‘he’, ‘his’, for example.

They are **filtered** from the text for **word-based approaches**.




In [None]:
from nltk.corpus import stopwords
stopwords.words("english") 
stopwords.words("italian")

['ad',
 'al',
 'allo',
 'ai',
 'agli',
 'all',
 'agl',
 'alla',
 'alle',
 'con',
 'col',
 'coi',
 'da',
 'dal',
 'dallo',
 'dai',
 'dagli',
 'dall',
 'dagl',
 'dalla',
 'dalle',
 'di',
 'del',
 'dello',
 'dei',
 'degli',
 'dell',
 'degl',
 'della',
 'delle',
 'in',
 'nel',
 'nello',
 'nei',
 'negli',
 'nell',
 'negl',
 'nella',
 'nelle',
 'su',
 'sul',
 'sullo',
 'sui',
 'sugli',
 'sull',
 'sugl',
 'sulla',
 'sulle',
 'per',
 'tra',
 'contro',
 'io',
 'tu',
 'lui',
 'lei',
 'noi',
 'voi',
 'loro',
 'mio',
 'mia',
 'miei',
 'mie',
 'tuo',
 'tua',
 'tuoi',
 'tue',
 'suo',
 'sua',
 'suoi',
 'sue',
 'nostro',
 'nostra',
 'nostri',
 'nostre',
 'vostro',
 'vostra',
 'vostri',
 'vostre',
 'mi',
 'ti',
 'ci',
 'vi',
 'lo',
 'la',
 'li',
 'le',
 'gli',
 'ne',
 'il',
 'un',
 'uno',
 'una',
 'ma',
 'ed',
 'se',
 'perché',
 'anche',
 'come',
 'dov',
 'dove',
 'che',
 'chi',
 'cui',
 'non',
 'più',
 'quale',
 'quanto',
 'quanti',
 'quanta',
 'quante',
 'quello',
 'quelli',
 'quella',
 'quelle',
 'q

In [None]:
def remove_stopwords(text, lang): 
    stop_words = set(stopwords.words(lang)) 
    word_tokens = tokenizer.tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words] 
    return filtered_text 
  
example_text = "This is a sample sentence and we are going to remove the stopwords from this."
remove_stopwords(example_text, "english") 

['This', 'sample', 'sentence', 'going', 'remove', 'stopwords']
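
Notice that `'This'` survived in the output above: the NLTK lists are lowercase, and the comparison is case-sensitive. A minimal sketch of a case-insensitive variant follows. To keep it self-contained, it uses a tiny hand-written stop list (`STOP_WORDS`, an assumption standing in for `stopwords.words("english")`) and a plain `\w+` regex instead of the `RegexpTokenizer`.

```python
import re

# Hypothetical mini stop list, standing in for stopwords.words("english")
STOP_WORDS = {"this", "is", "a", "and", "we", "are", "to", "the", "from"}

def remove_stopwords_ci(text, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lowercase but keeping the original casing."""
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if tok.lower() not in stop_words]

example_text = "This is a sample sentence and we are going to remove the stopwords from this."
print(remove_stopwords_ci(example_text))
```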


But stop words can also be **different** from the ones generally considered as such: verbs, for example, can carry a lot of meaning in our use case.

For example, take our sentence `In a hole in the ground there lived a hobbit.`

We can decide that our stop words are

> hole, ground, lived, hobbit



In [None]:
#stop = set(stopwords.words('english'))
stop = set(["hole","ground","lived","hobbit"]) 
tokenized = nltk.word_tokenize("In a hole in the ground there lived a hobbit.")
filtered = [w.lower() for w in tokenized
            if w.lower() not in stop and w.isalnum()]
print(tokenized)
print(filtered)

['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit', '.']
['in', 'a', 'in', 'the', 'there', 'a']


### Stemming & Lemmatizing
Both tools reduce words to a **root form**. 

* **Stemming** is a little more aggressive. It cuts off prefixes and/or endings of words based on a list of common ones. The stem, or root, is the part to which inflectional affixes (-ed, -ize, -de, -s, etc.) are added, so the stem of a word is obtained by removing its prefix or suffix. This can sometimes be helpful, but not always, because the result is often so reduced that it loses its actual meaning.

* **Lemmatizing**, on the other hand, maps the inflected forms of a word to one base form. Unlike stemming, it always returns a proper word that can be found in the dictionary.


I prefer lemmatizing to stemming because it can use the context of the sentence to work out the meaning of a word (e.g. distinguish between a verb and a noun) and it returns words that exist in the language, rather than roots that usually have no meaning of their own.

**There is no stemming in spaCy.**
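
To see the difference in miniature, here is a toy sketch: a stemmer that blindly strips a few common suffixes, next to a lemmatizer that looks words up in a (tiny, hand-made) dictionary. Neither is the real Porter or WordNet algorithm; `SUFFIXES`, `TOY_LEMMAS` and both function names are invented for this illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, as long as at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A lemmatizer needs linguistic knowledge; here it is a hand-written lookup table
TOY_LEMMAS = {"mice": "mouse", "cities": "city", "studied": "study"}

def toy_lemmatize(word):
    """Return the dictionary form if we know it, otherwise the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["cities", "studied", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer produces non-words such as `citi` and `studi` and cannot handle the irregular plural `mice`, while the lookup returns real dictionary forms but only for words it knows.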

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
porter = PorterStemmer()
tokenized = nltk.word_tokenize("We don't want any adventures here, thank you!")
for word in tokenized:
    print(porter.stem(word))

We
do
n't
want
ani
adventur
here
,
thank
you
!


In [None]:
# Instantiate lemmatizer
lemmatizer = WordNetLemmatizer()
# Instantiate stemmer
stemmer = PorterStemmer()

def word_lemmatizer(text):
  tokens = tokenizer.tokenize(text)
  lem_text = [lemmatizer.lemmatize(i) for i in tokens]
  return lem_text

def word_stemmer(text):
  tokens = tokenizer.tokenize(text)
  stem_text = [stemmer.stem(i) for i in tokens]
  return stem_text

print(word_lemmatizer("We don't want any adventures here, thank you!"))
print(word_stemmer("We don't want any adventures here, thank you!"))

print(word_lemmatizer('data science uses scientific methods algorithms and many types of processes'))
print(word_stemmer('data science uses scientific methods algorithms and many types of processes'))

print(word_lemmatizer('been had done languages cities mice'))
print(word_stemmer('been had done languages cities mice'))

print(word_lemmatizer('yesterday I studied the mouse cities'))
print(word_stemmer('yesterday I studied the mouse cities'))

['We', 'don', 't', 'want', 'any', 'adventure', 'here', 'thank', 'you']
['We', 'don', 't', 'want', 'ani', 'adventur', 'here', 'thank', 'you']
['data', 'science', 'us', 'scientific', 'method', 'algorithm', 'and', 'many', 'type', 'of', 'process']
['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of', 'process']
['been', 'had', 'done', 'language', 'city', 'mouse']
['been', 'had', 'done', 'languag', 'citi', 'mice']
['yesterday', 'I', 'studied', 'the', 'mouse', 'city']
['yesterday', 'I', 'studi', 'the', 'mous', 'citi']


In [None]:
# Lemmatization with spaCy: here we start to see POS tagging in action.
nlp = en_core_web_sm.load()
doc = nlp("We don't want any adventures here, thank you!")
for word in doc:
    print(word.lemma_)

-PRON-
do
not
want
any
adventure
here
,
thank
-PRON-
!


And for other languages? Let's try with Italian!!





In [None]:
from nltk.stem import SnowballStemmer
print(" ".join(SnowballStemmer.languages)) # See which languages are supported

arabic danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish


In [None]:
arabic_stemmer = SnowballStemmer(language='arabic', ignore_stopwords=False)
arabic_stemmer.stem("شكرا  الله يحفظك ويحميك ويحرص يرص عليك")

'شكرا  الله يحفظك ويحميك ويحرص يرص عل'

In [None]:
ita_stemmer = nltk.stem.snowball.ItalianStemmer()
ita_text = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
print(" ".join([ita_stemmer.stem(i) for i in tokenizer.tokenize(ita_text)])) 
#print(" ".join([ita_stemmer.stem(i) for i in tokenizer_1.tokenize(ita_text)])) 

ier son andat in due supermerc oggi vol andar all ippodrom staser mang la pizz con le verdur


### POS (Part Of Speech) tagging 

*Ci sei alle sei?*

Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context.

General part-of-speech tags: Noun (N), Verb (V), Adjective (ADJ), Adverb (ADV), Preposition (P), Conjunction (CON), Pronoun (PRO), Interjection (INT)


![alt text](https://cdn-media-1.freecodecamp.org/images/1*f6e0uf5PX17pTceYU4rbCA.jpeg)


POS tagging is one of the fundamental tasks of natural language processing.
One important use of POS tagging is **Word Sense Disambiguation**.

Words often occur in different senses as different parts of speech. For example:

She saw a `bear`.

Your efforts will `bear` fruit.

Ci `sei` alle `sei`?

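The word `bear` in the sentences above has completely different senses, but more importantly, one is a noun and the other is a verb (likewise, in the Italian sentence the first `sei` is the verb "you are" and the second is the numeral "six").
A basic form of word sense disambiguation is possible if you can tag words with their POS.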
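
As a rough sketch of that idea: once each word carries a POS tag, the pair (word, tag) can be used as a key to pick a sense. The tiny sense inventory `SENSES` and the hand-tagged sentences below are assumptions made for this example; in practice the tags would come from a tagger and the senses from a resource such as WordNet.

```python
# Hypothetical sense inventory keyed by (word, POS tag)
SENSES = {
    ("bear", "NOUN"): "a large, heavy mammal",
    ("bear", "VERB"): "to produce or yield",
}

def disambiguate(word, pos):
    """Look up the sense of a word given its part of speech."""
    return SENSES.get((word.lower(), pos), "unknown sense")

# Sentences tagged by hand for the illustration
sentence_1 = [("She", "PRON"), ("saw", "VERB"), ("a", "DET"), ("bear", "NOUN")]
sentence_2 = [("Your", "PRON"), ("efforts", "NOUN"), ("will", "AUX"), ("bear", "VERB"), ("fruit", "NOUN")]

for sentence in (sentence_1, sentence_2):
    for word, pos in sentence:
        if word == "bear":
            print(pos, "->", disambiguate(word, pos))
```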

There are many tools containing POS taggers including NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.


[SpaCy official Page](https://spacy.io/)

spaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython. The library is published under the MIT license and currently offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian, Dutch and multi-language NER. 



In [None]:
import pprint

# Download the pretrained English and Italian pipelines (POS tagging, NER, ...)
!python -m spacy download en_core_web_sm
!python -m spacy download it_core_news_sm
import spacy
import it_core_news_sm
nlp = spacy.load('en_core_web_sm')
nlp_ita = spacy.load('it_core_news_sm')

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 7.6 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
Collecting it_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-2.2.5/it_core_news_sm-2.2.5.tar.gz (14.5 MB)
     |████████████████████████████████| 14.5 MB 6.2 MB/s 
Building wheels for collected packages: it-core-news-sm
  Building wheel for it-core-news-sm (setup.py) ... done
  Created wheel for it-core-news-sm: filename=it_core_news_sm-2.2.5-py3-none-any.whl size=14471129 sha256=8f2f5a1a4796813a99e93931797a206e838f9c2e46c935397550aed798cbdb73
  Stored in directory: /tmp/pip-ephem-wheel-cache-l6vvj6qu/wheels/87/88/46/36fd0cabbebd89b2ee247bf113c1ca4f2cb184f8b7a6758ba2
Successfu

In [None]:
nltk.download('averaged_perceptron_tagger')

pp = pprint.PrettyPrinter(indent=2)

tagged_sent = nltk.pos_tag(tokenizer.tokenize("Can you please buy me an Arizona Ice Tea, please? It's $0.99., please?")) # See https://www.nltk.org/book/ch05.html for tags legend

pp.pprint(tagged_sent)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[ ('Can', 'MD'),
  ('you', 'PRP'),
  ('please', 'VB'),
  ('buy', 'VB'),
  ('me', 'PRP'),
  ('an', 'DT'),
  ('Arizona', 'NNP'),
  ('Ice', 'NNP'),
  ('Tea', 'NNP'),
  ('please', 'VB'),
  ('It', 'PRP'),
  ('s', 'VBZ'),
  ('0', 'CD'),
  ('99', 'CD'),
  ('please', 'NN')]


In [None]:
doc = nlp("Can you please buy me an Arizona Ice Tea? It's $0.99.")
for token in doc:
   print(token.text, token.lemma_, token.pos_) #See https://spacy.io/api/annotation for tags legend

Can Can VERB
you -PRON- PRON
please please INTJ
buy buy VERB
me -PRON- PRON
an an DET
Arizona Arizona PROPN
Ice Ice PROPN
Tea Tea PROPN
? ? PUNCT
It -PRON- PRON
's be AUX
$ $ SYM
0.99 0.99 NUM
. . PUNCT


In [None]:
nltk.pos_tag(tokenizer.tokenize("Can you please buy me an Arizona Ice Tea? It's $0.99.".lower())) ## 'Arizona' becomes a JJ (adjective) 

[('can', 'MD'),
 ('you', 'PRP'),
 ('please', 'VB'),
 ('buy', 'VB'),
 ('me', 'PRP'),
 ('an', 'DT'),
 ('arizona', 'JJ'),
 ('ice', 'NN'),
 ('tea', 'NN'),
 ('it', 'PRP'),
 ('s', 'VBD'),
 ('0', 'CD'),
 ('99', 'CD')]

In [None]:
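# Now for Italian. Note: NLTK's pos_tag does not ship an Italian model, so the Penn Treebank-style
# tags below appear to come from the English tagger and are unreliable (compare with spaCy further down).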
tagged_sent_ita = nltk.pos_tag(tokenizer.tokenize("Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo"), lang="italian") 
pp.pprint(tagged_sent_ita)

[ ('Ieri', 'NNP'),
  ('sono', 'NN'),
  ('andato', 'NN'),
  ('in', 'IN'),
  ('due', 'JJ'),
  ('supermercati', 'NNS'),
  ('Oggi', 'NNP'),
  ('volevo', 'NN'),
  ('andare', 'NN'),
  ('all', 'DT'),
  ('ippodromo', 'NN')]


In [None]:
doc_ita = nlp_ita("Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo")
for token in doc_ita:
   print(token.text, token.lemma_, token.pos_)

Ieri Ieri ADV
sono essere AUX
andato andare VERB
in in ADP
due due NUM
supermercati supermercato NOUN
. . PUNCT
Oggi Oggi ADV
volevo volere AUX
andare andare VERB
all' alla SCONJ
ippodromo ippodromo PROPN


Seems easy?? 

**Algorithms for PoS tagging**

*  Rule-based taggers

   A large database of rules is created "manually"; for ambiguous cases, the rules specify the conditions to be verified before assigning each possible tag. E.g. a word is a noun if it is preceded by an article.

*  Probabilistic taggers (HMM, CRF)

   They generally resolve ambiguity by estimating the probability that a specific word has a given tag in a given context, using a reference dataset.

* Other approaches

  * Tagging as a classification problem (each tag corresponds to a class, and a classifier processes text features that describe the context)

  * Rule-based taggers learned from examples
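
To give a feel for the first two families, here is a toy sketch. `rule_based_tag` applies the single rule quoted above (a word is a noun if it is preceded by an article), and `train_most_frequent_tag` is the simplest data-driven baseline: it counts which tag each word receives most often in a reference corpus. The baseline ignores context entirely, so it is far weaker than an HMM or a CRF; the corpus, tag names and function names are all invented for this example.

```python
from collections import Counter, defaultdict

ARTICLES = {"a", "an", "the"}

def rule_based_tag(tokens):
    """One hand-written rule: a token preceded by an article is a NOUN; otherwise unknown ('X')."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() in ARTICLES:
            tags.append("DET")
        elif i > 0 and tokens[i - 1].lower() in ARTICLES:
            tags.append("NOUN")
        else:
            tags.append("X")
    return list(zip(tokens, tags))

def train_most_frequent_tag(tagged_sentences):
    """Learn, for every word, the tag it carries most often in the reference data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

# Tiny hand-tagged reference corpus (an assumption for the sketch)
toy_corpus = [
    [("she", "PRON"), ("saw", "VERB"), ("a", "DET"), ("bear", "NOUN")],
    [("the", "DET"), ("bear", "NOUN"), ("slept", "VERB")],
    [("efforts", "NOUN"), ("bear", "VERB"), ("fruit", "NOUN")],
]
model = train_most_frequent_tag(toy_corpus)

print(rule_based_tag("She saw a bear".split()))
print(model["bear"])
```

The baseline always tags `bear` as a noun, even in "efforts bear fruit": this is exactly the kind of ambiguity that context-aware probabilistic taggers are designed to resolve.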


A state-of-the-art approach to POS tagging is the **BI-LSTM-CRF** model for sequence labeling. 

(Some tutorials for developing a state-of-the-art POS tagger with Keras)

https://github.com/Hironsan/anago

https://www.depends-on-the-definition.com/sequence-tagging-lstm-crf/

https://nlpforhackers.io/lstm-pos-tagger-keras/

### NER (Named Entity Recognition)

<center><img src='https://miro.medium.com/max/725/1*i8IfPgFDFVAIqXBnIdg5yg.jpeg' width="340" height="240"></center>


NER, short for Named Entity Recognition, is probably the first step towards **Information Extraction from unstructured text** (POS tagging is still more data augmentation than information extraction). 
It basically means extracting the real-world entities mentioned in the text (Person, Organization, Event, etc.).

<center><img src='https://miro.medium.com/max/1171/1*OZaHa-z7A4Xny3dN1qbsQg.png' width="630" height="210"></center>

A few use cases of Named Entity Recognition:

* News and Blog Post Classification 

* Efficient Search Algorithms

  What would happen if we searched for a word in a blog of 10,000 articles? The system would have to search within each article and, since it has no memory, repeat the operation for every new query. Such a system, in addition to scaling poorly, is inefficient.

  NER solves this problem by analyzing each article only once, extracting the key words and populating a list of named entities that search queries can use.

* Customer Support (we apply NER to incoming messages so as to route each one to the most relevant department)
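
The search use case can be sketched as an inverted index from entities to articles, built once and then queried many times. In the sketch below, `naive_entities` is only a crude placeholder for a real NER model (it treats runs of capitalized words as entities, so it would also pick up ordinary sentence-initial words); the articles, IDs and function names are invented for the illustration.

```python
import re
from collections import defaultdict

def naive_entities(text):
    """Placeholder for a NER model: runs of capitalized words count as entities."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", text)

def build_entity_index(articles):
    """Analyze every article once and map each entity to the IDs of the articles that mention it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for entity in naive_entities(text):
            index[entity].add(article_id)
    return index

def search(index, entity):
    """Answer a query from the index, without re-reading any article."""
    return sorted(index.get(entity, set()))

articles = {
    1: "Bill works for Apple so he went to Boston for a conference.",
    2: "Apple opened a new store in London.",
}
index = build_entity_index(articles)

print(search(index, "Apple"))
print(search(index, "London"))
```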

In [None]:
text = """London is the capital and most populous city of England and 
the United Kingdom.  Standing on the River Thames in the south east 
of the island of Great Britain, London has been a major settlement 
for two millennia. It was founded by the Romans, who named it Londinium.
\n
Bill works for Apple so he went to Boston for a conference.
"""
doc = nlp(text)

for entity in doc.ents:    
    spacy_expl=spacy.explain(entity.label_)
    print(f"{entity.text} ({entity.label_} : {spacy_expl} )")

London (GPE : Countries, cities, states )
England (GPE : Countries, cities, states )
the United Kingdom (GPE : Countries, cities, states )
the south east (LOC : Non-GPE locations, mountain ranges, bodies of water )
Great Britain (GPE : Countries, cities, states )
London (GPE : Countries, cities, states )
two (CARDINAL : Numerals that do not fall under another type )
Romans (NORP : Nationalities or religious or political groups )
Londinium (ORG : Companies, agencies, institutions, etc. )
Bill (PERSON : People, including fictional )
Apple (ORG : Companies, agencies, institutions, etc. )
Boston (GPE : Countries, cities, states )


In [None]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

* PERSON	People, including fictional.
* NORP	Nationalities or religious or political groups.
* FAC	Buildings, airports, highways, bridges, etc.
* ORG	Companies, agencies, institutions, etc.
* GPE	Countries, cities, states.
* LOC	Non-GPE locations, mountain ranges, bodies of water.
* PRODUCT	Objects, vehicles, foods, etc. (Not services.)
* EVENT	Named hurricanes, battles, wars, sports events, etc.
* WORK_OF_ART	Titles of books, songs, etc.
* LAW	Named documents made into laws.
* LANGUAGE	Any named language.
* DATE	Absolute or relative dates or periods.
* TIME	Times smaller than a day.
* PERCENT	Percentage, including “%”.
* MONEY	Monetary values, including unit.
* QUANTITY	Measurements, as of weight or distance.
* ORDINAL	“first”, “second”, etc.
* CARDINAL	Numerals that do not fall under another type.




## *EXERCISE: Why don't you see how it performs with the Italian language?*

In [None]:
text_ita = """Mattia è un bimbo di 5 anni che passa tutte le sue giornate a disegnare. 
Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio 
la pizza con le verdure.  Che tempo fara' a Bastia Umbra""" 

doc_ita = nlp_ita(text_ita)
for entity in doc_ita.ents:
    #print(spacy.explain(entity.label_))
    spacy_expl=spacy.explain(entity.label_)
    print(f"{entity.text} ({entity.label_} : {spacy_expl} )")
  
displacy.render(doc_ita, style='ent', jupyter=True)

Mattia (MISC : Miscellaneous entities, e.g. events, nationalities, products or works of art )
Stasera (MISC : Miscellaneous entities, e.g. events, nationalities, products or works of art )
Bastia Umbra (LOC : Non-GPE locations, mountain ranges, bodies of water )


And what if we lowercase the sentence?

That sounds easy too, doesn't it?

* *Classical Approaches*

  mostly rule-based. ([Tutorial: How use NLTK to create a rule-based NER](https://www.youtube.com/watch?v=LFXsG7fueyk))
* *Machine Learning Approaches*
  * *Multi-class Classification* (algorithms that label each word on its own, ignoring the context)
  * *Conditional Random Field* (CRF) model 
    * CRFs are widely used to model sequential data, such as the words in a sentence
    * a CRF can capture the features of the current and previous labels in a sequence, but it cannot take into account the context of the labels that follow (let's see [NER With CRF In Python](https://www.depends-on-the-definition.com/named-entity-recognition-conditional-random-fields-python/) )
* *Deep Learning Approaches* 

  Which type of neural network works best for NER, given that text is sequential data? Yeah, you guessed it right… Long Short-Term Memory (LSTM). But not just any LSTM: we need bidirectional LSTMs, because a standard LSTM only takes the “past” information in a sequence into account when making predictions. (A Bi-LSTM is the combination of two LSTMs, one 'forward' running left to right and one 'backward' running right to left.) Let's look at some of the state-of-the-art approaches:
  * *Bidirectional LSTM-CRF* (More details and [implementation](https://www.depends-on-the-definition.com/sequence-tagging-lstm-crf/) in keras)
  * *Bidirectional LSTM-CNNs* (More details and [implementation](https://www.depends-on-the-definition.com/lstm-with-char-embeddings-for-ner/) in keras)
  * *Bidirectional LSTM-CNNs-CRF* (Let's see the paper [here](https://arxiv.org/pdf/1603.01354.pdf))
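
To make the classical, rule-based approach from the list above concrete, here is a minimal sketch that uses only a dictionary lookup and one regular expression. The `GAZETTEER` entries, the year rule and the function name `rule_based_ner` are all assumptions made up for this illustration; it is not how spaCy or NLTK work internally.

```python
import re

# Hypothetical gazetteer: surface form -> entity label.
GAZETTEER = {
    "London": "GPE",
    "Boston": "GPE",
    "United Kingdom": "GPE",
    "Apple": "ORG",
}

# One extra hand-written rule: a four-digit year is tagged as DATE.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def rule_based_ner(text):
    """Return (start, end, text, label) tuples found by dictionary and regex rules."""
    matches = []
    for surface, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            matches.append((m.start(), m.end(), m.group(), label))
    for m in YEAR_PATTERN.finditer(text):
        matches.append((m.start(), m.end(), m.group(), "DATE"))
    return sorted(matches)  # sorted by start offset

sample = "Bill works for Apple so he went to Boston in 2019."
for start, end, span, label in rule_based_ner(sample):
    print(f"{span} ({label}) at {start}:{end}")
```

Note that "Bill" is not found, because it appears in no rule. Every new name needs a new entry, which is the main limitation that the machine learning and deep learning approaches try to overcome.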


  



----------------
### Bag of Words

<center><img src='https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRdXiRXvj1cwjs9_ewoAXEY2ex_HF2A5tG1HA&usqp=CAU'></center>
 

Bag of Words (BoW) is an algorithm that **counts** how many times a word appears in a document. It’s a tally. 

Those word counts allow us to **compare documents** and gauge **their similarities** for applications like **search**, **document classification** and **topic modeling**. 

BoW is also a method for preparing text as input to a deep-learning net.

BoW lists words paired with their word counts per document. 

In the table that stores the words and documents, which effectively become vectors:
* each row is a word
* each column is a document
* each cell is a word count

**Each of the documents in the corpus is represented by columns/vectors** of equal length. Those are wordcount vectors, an output stripped of context.
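
The table described above can be built in a few lines of plain Python. This is a minimal sketch on a toy corpus invented for the example; tokenization is just lowercasing and splitting on whitespace, which is an assumption made for brevity.

```python
from collections import Counter

# Toy corpus, invented for illustration.
documents = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
}

# Count words per document (tokenizing by whitespace only, for brevity).
counts = {name: Counter(text.lower().split()) for name, text in documents.items()}

# The vocabulary fixes the row order, so every column has the same length.
vocabulary = sorted(set().union(*counts.values()))

# Rows are words, columns are documents, cells are word counts.
matrix = [[counts[name][word] for name in documents] for word in vocabulary]

print("word".ljust(6), *documents)
for word, row in zip(vocabulary, matrix):
    print(word.ljust(6), *row)
```

A word that never occurs in a document still gets a cell (with count 0), which is why all the document vectors end up with equal length.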

With BoW, the order of words does not matter...
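
A quick way to see this is to compare two sentences that contain the same words in a different order. The sketch below uses cosine similarity, one common choice for comparing count vectors; the helper names `bag_of_words` and `cosine_similarity` and the example sentences are assumptions made for this illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each token."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = bag_of_words("the dog bites the man")
s2 = bag_of_words("the man bites the dog")
s3 = bag_of_words("the cat sleeps")

print(s1 == s2)                    # same counts, so BoW cannot tell them apart
print(cosine_similarity(s1, s2))   # close to 1: identical vectors
print(cosine_similarity(s1, s3))   # lower: only "the" is shared
```

The first two sentences mean very different things, yet their bags of words are identical. This is the price of stripping away context.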



### Term Frequency-Inverse Document Frequency (TF-IDF)
Term-frequency-inverse document frequency (TF-IDF) is another way to judge the **topic** of an article by the words it contains. 

With TF-IDF, words are given weight – **TF-IDF measures relevance, not frequency**. 

That is, wordcounts are replaced with TF-IDF scores across the whole dataset.
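
As a rough illustration, the sketch below computes TF-IDF by hand with one common textbook variant: term frequency is the word count divided by the document length, and inverse document frequency is `log(N / df)`, where `N` is the number of documents and `df` the number of documents containing the word. This choice of formula is an assumption, since libraries often use smoothed variants, and the toy corpus is invented for the example.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in corpus]

# Document frequency: in how many documents does each word occur?
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens, n_documents=len(tokenized)):
    """Score each word of one document with tf * idf (unsmoothed variant)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_documents / document_frequency[word])
        for word, count in counts.items()
    }

scores = tf_idf(tokenized[0])
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:<4} {score:.3f}")
```

In the first document "the" is the most frequent word, but it occurs in every document, so its score is 0. "mat" occurs only here and gets the highest score, which is the sense in which TF-IDF measures relevance rather than frequency.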

Have a look at the slides...