# The Power of NLP

Natural language processing (NLP) deals with building computational algorithms to automatically analyze and represent human language. NLP-based systems have enabled a wide range of applications such as Google’s powerful search engine, and more recently, Amazon’s voice assistant named Alexa. NLP is also useful to teach machines the ability to perform complex natural language related tasks such as machine translation and dialogue generation.

For a long time, most methods for NLP problems relied on shallow machine learning models and time-consuming, hand-crafted features. This led to problems such as the curse of dimensionality, since linguistic information was encoded in sparse, high-dimensional representations.

However, with the recent popularity and success of word embeddings (low-dimensional, distributed representations), neural models have achieved superior results on various language-related tasks compared to traditional machine learning models.
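
To make the contrast concrete, here is a minimal sketch comparing a sparse one-hot representation with a dense one. The vocabulary and the three-dimensional vectors are made up for illustration; real embeddings are learned from data and usually have a few hundred dimensions.

```python
import math

# Toy vocabulary (hypothetical); a real one has tens of thousands of entries.
vocab = ["king", "queen", "apple", "banana"]

def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hand-picked dense vectors, for illustration only (not learned embeddings).
dense = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.7, 0.2],
    "apple":  [0.1, 0.2, 0.9],
    "banana": [0.0, 0.3, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors of two different words are always orthogonal.
print(cosine(one_hot("king"), one_hot("queen")))
# Dense vectors can place related words close together.
print(cosine(dense["king"], dense["queen"]))
print(cosine(dense["king"], dense["apple"]))
```

With one-hot vectors every pair of distinct words looks equally unrelated, and the vector length grows with the vocabulary. Dense vectors keep a fixed, small size and can encode similarity.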

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/nlp22.png" width="1200">

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/roadmap.jpg" width="900">

## BERT
BERT (Bidirectional Encoder Representations from Transformers) is a language model introduced in a paper by researchers at Google AI Language. It caused a stir in the machine learning community by presenting state-of-the-art results on a wide variety of NLP tasks, including question answering (SQuAD v1.1), natural language inference (MNLI), and others.
BERT’s key technical innovation is applying bidirectional training of the Transformer, a popular attention model, to language modelling.
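
The difference between left-to-right and bidirectional context can be sketched in a few lines. This only illustrates which tokens are visible when predicting a hidden word; it is not BERT's implementation, and the helper names and the `[MASK]` placeholder usage here are illustrative assumptions.

```python
MASK = "[MASK]"

def left_to_right_context(tokens, position):
    """Tokens a left-to-right language model sees when predicting `position`."""
    return tokens[:position]

def bidirectional_context(tokens, position):
    """Tokens visible with bidirectional training: both sides, target hidden."""
    return tokens[:position] + [MASK] + tokens[position + 1:]

tokens = ["the", "bank", "of", "the", "river"]
print(left_to_right_context(tokens, 1))  # only what comes before "bank"
print(bidirectional_context(tokens, 1))  # "river" is visible as well
```

In this toy sentence the word "river", which disambiguates "bank", is only available in the bidirectional case.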

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/bert.jpg" width="900">

## GPT-3

According to the researchers in the [GPT-3 paper](https://arxiv.org/pdf/2005.14165.pdf):

“GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.”

The team also finds that “GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.”

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/gpt.jpg" width="900">

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/model.png" width="900">


Specification:


- A state-of-the-art language model made up of 175 billion parameters.

- A parameter is a value in a neural network that assigns a larger or smaller weight to some aspect of the data, giving that aspect more or less importance in the overall computation.

- These weights are learned during training; they shape how the network transforms its input and determine what it has learned about the data.
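
To get a feel for what "number of parameters" means, the following sketch counts the weights and biases of a small fully connected network. It is an illustration only: GPT-3 is a Transformer with a far more complex layout, and the layer sizes below are hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per output unit
    return total

# Hypothetical toy network: 4 inputs, 8 hidden units, 2 outputs.
print(count_parameters([4, 8, 2]))  # 4*8 + 8 + 8*2 + 2 = 58
```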

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/gpt3.jpg" width="900">

# spaCy
spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage. spaCy also supports deep learning workflows, connecting statistical models trained with popular machine learning libraries such as TensorFlow, PyTorch or MXNet through its own machine learning library, Thinc.

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/nlp23.jpg" width="1200">



<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/nltkspacy1.png" width="600">

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/nltkspacy2.png" width="500">

### Installation and Setup

Installation is a two-step process. First, install spaCy using either conda or pip. Next, download the specific model you want, based on language.<br> For more info visit https://spacy.io/usage/

In [None]:
# !pip install spacy ## Colab already has spaCy installed
# !python -m spacy download en_core_web_sm
#!python -m spacy download it_core_news_sm
!python -m spacy download it_core_news_md
#!python -m spacy download it_core_news_lg

#RESTART RUNTIME

In [None]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('it_core_news_md')

# Create a Doc object from an Italian sentence
itadoc = nlp("Ciao a tutti ragazzi! valle d'Aosta ")
print(itadoc)

Ciao a tutti ragazzi! valle d'Aosta 


In [None]:
# Print each token separately
for token in itadoc:
    print(token.text, token.pos_)

Ciao INTJ
a ADP
tutti DET
ragazzi NOUN
! PUNCT
valle PROPN
d' ADP
Aosta PROPN


In [None]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp('tesla is looking at buying U.S.A startup for $6 million, U.K')

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_) # dep=dependencies



tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S.A PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj
, PUNCT punct
U.K PROPN npadvmod


This doesn't look very user-friendly, but right away we see some interesting things happen:
1. `tesla` is recognized as a proper noun (`PROPN`), even though it is written in lowercase here
2. `U.S.A` is kept together as one unit (we call this a 'token')

As we dive deeper into spaCy we'll see what each of these abbreviations means and how they're derived. We'll also see how spaCy can interpret the three tokens `$6 million` together as referring to ***money***.

___
# spaCy Objects

After importing the spacy module in the cell above we loaded a **model** and named it `nlp`.<br>Next we created a **Doc** object by applying the model to our text, and named it `doc`.<br>

**In the sm/md/lg models:**

- The tagger, morphologizer and parser components listen to the tok2vec component. If the lemmatizer is trainable (v3.3+), lemmatizer also listens to tok2vec.
- The attribute_ruler maps token.tag to token.pos if there is no morphologizer.
- The attribute_ruler additionally makes sure whitespace is tagged consistently and copies token.pos to token.tag if there is no tagger. For English, the attribute ruler can improve its mapping from token.tag to token.pos if dependency parses from a parser are present, but the parser is not required.
- The lemmatizer component for many languages requires token.pos annotation from either tagger+attribute_ruler or morphologizer.
- The ner component is independent with its own internal tok2vec layer.

## Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/python/pipespacy.png" width="1200">

In [None]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7cb59173d010>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7cb59173cb30>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7cb656fd9460>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7cb591746290>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7cb59174f450>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7cb5933e5a80>)]

In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
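
Conceptually, a pipeline is an ordered list of named components, each of which receives the document and adds annotations to it. The sketch below mimics that idea with plain functions and a dictionary. It is not how spaCy implements its components, and all names in it are hypothetical.

```python
def tokenize(text):
    """First step: turn raw text into a dict that later steps annotate."""
    return {"text": text, "tokens": text.split()}

def add_lengths(doc):
    """Annotate each token with its length."""
    doc["lengths"] = [len(t) for t in doc["tokens"]]
    return doc

def add_capitalized(doc):
    """Annotate each token with whether it starts with a capital letter."""
    doc["capitalized"] = [t[0].isupper() for t in doc["tokens"]]
    return doc

# Named components applied in order, like the (name, component) pairs above.
pipeline = [("lengths", add_lengths), ("capitalized", add_capitalized)]

def run(text):
    doc = tokenize(text)
    for name, component in pipeline:
        doc = component(doc)
    return doc

result = run("Tesla is looking at startups")
print([name for name, _ in pipeline])
print(result["capitalized"])
```

Because each component only reads and extends the shared document, components can be added, removed or reordered without touching the others.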

___
## Tokenization
The first step in processing text is to split it into its component parts (words and punctuation), called "tokens".

In [None]:
doc2 = nlp("Tesla isn't   looking into startups anymore.")

for token in doc2:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
   SPACE dep
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


`isn't` has been split into two tokens. spaCy recognizes both the auxiliary verb `is` and the negation attached to it. Notice also that both the extended whitespace and the period at the end of the sentence are assigned their own tokens.
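
The splitting behaviour can be approximated with a single regular expression. This is a toy sketch, not spaCy's tokenizer (which uses language-specific rules and exceptions), and unlike spaCy it discards whitespace instead of keeping it as a token. The name `toy_tokenize` is hypothetical.

```python
import re

# Order matters: try the "n't" pieces before the generic word pattern.
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def toy_tokenize(text):
    """Split text into words, "n't" contractions and punctuation marks."""
    return TOKEN_PATTERN.findall(text)

print(toy_tokenize("Tesla isn't   looking into startups anymore."))
```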

It's important to note that even though `doc2` contains processed information about each token, it also retains the original text:

In [None]:
doc2

Tesla isn't   looking into startups anymore.

In [None]:
doc2[0]

Tesla

In [None]:
doc2[2]

n't

In [None]:
type(doc2)

spacy.tokens.doc.Doc

In [None]:
doc2[-1]

.

___
## Part-of-Speech Tagging (POS)
The next step after splitting the text up into tokens is to assign parts of speech. In the above example, `Tesla` was recognized to be a ***proper noun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.

For a full list of POS Tags visit https://spacy.io/api/annotation#pos-tagging

In [None]:
doc2[0].text

'Tesla'

In [None]:
doc2[0].pos_

'PROPN'

___
## Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [None]:
doc2[0].dep_

'nsubj'

To see the full name of a tag use `spacy.explain(tag)`

In [None]:
spacy.explain('PROPN')

'proper noun'

In [None]:
spacy.explain('nsubj')

'nominal subject'

___
## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [None]:
# Lemmas (the base form of the word):
print(doc2[4].text)
print(doc2[4].lemma_)
print(doc2[4].pos_)
print(doc2[4].tag_)
print(doc2[4].is_alpha)
print(doc2[4].is_stop)

looking
look
VERB
VBG
True
False


In [None]:
spacy.explain('VBG')

'verb, gerund or present participle'

In [None]:
# Simple Parts-of-Speech & Detailed Tags:
print(doc2[4].pos_)
print(doc2[4].tag_ + ' / ' + spacy.explain(doc2[4].tag_))

VERB
VBG / verb, gerund or present participle


In [None]:
# Word Shapes:
print(doc2[0].text+': '+doc2[0].shape_)
print(doc[5].text+' : '+doc[5].shape_)

Tesla: Xxxxx
U.S.A : X.X.X
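
A word shape can be computed by replacing every character with a class symbol. The function below is a sketch modelled on the outputs shown above. The rule that runs of the same symbol are cut after four characters is an assumption about spaCy's behaviour, and the name `word_shape` is only for illustration.

```python
def word_shape(text, max_run=4):
    """Map letters to x/X and digits to d, keeping other characters.

    Runs of the same shape symbol longer than max_run are truncated.
    """
    shape = []
    last, run = "", 0
    for char in text:
        if char.isalpha():
            code = "X" if char.isupper() else "x"
        elif char.isdigit():
            code = "d"
        else:
            code = char
        run = run + 1 if code == last else 1
        last = code
        if run <= max_run:
            shape.append(code)
    return "".join(shape)

print(word_shape("Tesla"))     # Xxxxx
print(word_shape("U.S.A"))     # X.X.X
print(word_shape("startups"))  # xxxx
```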


In [None]:
# Boolean Values:
print(doc2[0].is_alpha) # alphabetic characters only?
print(doc2[0].is_stop)  # is it a stop word?

True
False


___
## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [None]:
doc3 = nlp('Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [None]:
type(doc3)

spacy.tokens.doc.Doc

In [None]:
life_quote = doc3[18:29]
print(life_quote)

is what happens to us while we are making other plans


In [None]:
type(life_quote)

spacy.tokens.span.Span

___
## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [None]:
doc4 = nlp('This is the first sentence. This is another sentence.This is the last sentence.')

In [None]:
for sent in doc4.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.
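
For comparison, here is what a purely rule-based splitter looks like. It is a naive sketch that cuts after every `.`, `!` or `?`, so it breaks on abbreviations such as `U.S.A` and drops a trailing fragment without end punctuation. spaCy's default segmentation instead relies on the dependency parse.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' and strip surrounding whitespace."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]

text = 'This is the first sentence. This is another sentence.This is the last sentence.'
for sentence in naive_sentences(text):
    print(sentence)
```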


In [None]:
doc4[6].text

'This'

In [None]:
doc4[5]

.

In [None]:
doc4[6]

This

In [None]:
doc4[6].is_sent_start

True

## Recap all spaCy tools

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"
doc = nlp(text)
doc

Apple, This is first sentence. and Google this is another one. here 3rd one is

In [None]:
for token in doc:
    print(token)

Apple
,
This
is
first
sentence
.
and
Google
this
is
another
one
.
here
3rd
one
is


In [None]:
# Optional: add a rule-based sentencizer (spaCy v3 syntax)
# nlp.add_pipe('sentencizer', before='parser')
# doc = nlp(text)

In [None]:
for sent in doc.sents:
    print(sent)

Apple, This is first sentence.
and Google this is another one.
here 3rd one is


In [None]:
## Remove stopwords

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
stopwords = list(STOP_WORDS)

In [None]:
print(stopwords)

['afterwards', 'through', 'sometimes', "'s", '‘ll', 'beyond', 'often', 'had', 'we', 'any', 'really', 'make', 'although', 'because', '’ll', '’s', 'give', '’ve', 'himself', 'see', 'been', 'whole', 'whether', 'else', 'until', 'an', 'nine', 'herself', 'together', 'off', 'neither', '’re', 'eight', 'few', 'of', 'enough', 'already', 'nowhere', 'alone', 'you', 'his', 'if', 'someone', 'down', 'ten', 'per', 'perhaps', 'from', 'part', 'no', 'another', 'least', 'might', 'seeming', 'either', 'she', 'two', "'d", "'m", 'how', 'thereupon', 'five', 'regarding', 'as', 'seem', 'therein', 'up', 'between', 'our', 'for', 'unless', 'everyone', 'and', 'latterly', 'cannot', 'almost', '‘ve', 'would', 'anyone', 'nothing', 'used', 'via', 'below', '‘m', 'everywhere', 'becoming', 'are', 'behind', 'once', 'be', 'other', 'by', 'moreover', 'full', 'rather', 'latter', 'itself', 'also', 'one', 'otherwise', 'too', 'onto', 'third', 'own', 'who', 'them', 'what', 'anyway', 'put', 'twelve', 'hereafter', 'others', 'toward', '

In [None]:
len(stopwords)

326

In [None]:
text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"
doc = nlp(text)

for token in doc:
    if not token.is_stop:
        print(token)

Apple
,
sentence
.
Google
.
3rd


In [None]:
### Lemmatization

In [None]:
doc = nlp('run runs running runners')

In [None]:
for lem in doc:
    print(lem.text, lem.lemma_)

run run
runs run
running run
runners runner


In [None]:
## POS

In [None]:
doc = nlp('All is well at your end!')

In [None]:
for token in doc:
    print(token.text, token.pos_)

All PRON
is AUX
well ADV
at ADP
your PRON
end NOUN
! PUNCT


In [None]:
displacy.render(doc, style = 'dep', jupyter=True)

In [None]:
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

docx = nlp("Not all cheeses require this final stage of ripening.")

options = {"compact":True,"color": "darkgreen", "add_lemma": True, "distance": 100 }
displacy.render(docx,style='dep', options=options, jupyter=True)

In [None]:
#Displacy - dep Style
docx = nlp("These oranges are sweet and juicy.")

options = {"color": "Purple", "add_lemma": True}
displacy.render(docx,style='dep', options = options, jupyter=True)

In [None]:
### Named Entity Recognition (NER)

In [55]:
doc = nlp("New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $ 1,000.")

In [56]:
doc[0].pos_  # empty in this run: nlp had been loaded with the tagger disabled (see "Disable some components")

''

In [57]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

In [58]:
displacy.render(doc, style = 'ent',jupyter=True)

In [None]:
doc2 = nlp("Barr Clashes With House Democrats, Defending Responses to Protests and Russia Inquiry. "
"The deployment of federal agents to confront protesters and rioters and attacks on the Russia investigation highlighted a contentious hearing.")

In [None]:
displacy.render(doc2, style = 'ent',jupyter=True)

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Jeff Bezos, founder and CEO of Amazon, an ecommerce company with headquarters in Seattle, became the world's richest man in October 2017 with a net worth of 90 billion USD")

displacy.render(doc, style='ent', jupyter=True)

In [None]:
## Exercise

# Take 10 sentences from the New York Times and apply spaCy's NER to them

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("After meeting for three hours, President Biden and Xi Jinping of China made a cautious pledge to improve ties, while still laying bare a mutual distrust. ")

displacy.render(doc, style='ent', jupyter=True)

## Text Classification

### Bag of Words - The Simplest Word Embedding

This is one of the simplest methods of embedding words into numerical vectors. It is not often used in practice because it oversimplifies language, but it is often the first embedding technique taught in the classroom.
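
A minimal sketch of the idea, using only the standard library: build a vocabulary from all documents, then represent each document by how often each vocabulary word occurs in it. The two example sentences and the whitespace tokenization are simplifying assumptions.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors): one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is lost entirely: "the dog saw the cat" would get the same vector as the second document.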

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/fav/nlp/1.PNG" width="600">

### TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a technique that weights each word by how often it appears in a document (TF) and by how rare it is across all documents (IDF). It addresses a weakness of plain Bag of Words, where very common words dominate the counts, and it is widely used for text classification and for turning text into numbers a machine can work with.
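
The weighting can be written out in a few lines. This sketch uses one common textbook variant, `tf = count / document length` and `idf = log(N / document frequency)`; libraries often apply smoothing or normalization on top, so their numbers will differ. The example sentences are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the cat saw the dog"]
scores = tf_idf(docs)
for word, score in scores[0].items():
    print(word, round(score, 3))
```

Words that occur in every document ("the", "cat") get a score of zero, while words specific to one document ("sat", "dog") stand out.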

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/fav/nlp/2.PNG" width="1000">

## Sentiment Analysis

### TextBlob with spaCy

In [49]:
!pip install spacytextblob -q
!python -m textblob.download_corpora -q

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [50]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'
doc = nlp(text)
doc._.blob.polarity                            # Polarity: -0.125
doc._.blob.subjectivity                        # Subjectivity: 0.9
doc._.blob.sentiment_assessments.assessments   # Assessments: [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]
doc._.blob.ngrams()



[WordList(['I', 'had', 'a']),
 WordList(['had', 'a', 'really']),
 WordList(['a', 'really', 'horrible']),
 WordList(['really', 'horrible', 'day']),
 WordList(['horrible', 'day', 'It']),
 WordList(['day', 'It', 'was']),
 WordList(['It', 'was', 'the']),
 WordList(['was', 'the', 'worst']),
 WordList(['the', 'worst', 'day']),
 WordList(['worst', 'day', 'ever']),
 WordList(['day', 'ever', 'But']),
 WordList(['ever', 'But', 'every']),
 WordList(['But', 'every', 'now']),
 WordList(['every', 'now', 'and']),
 WordList(['now', 'and', 'then']),
 WordList(['and', 'then', 'I']),
 WordList(['then', 'I', 'have']),
 WordList(['I', 'have', 'a']),
 WordList(['have', 'a', 'really']),
 WordList(['a', 'really', 'good']),
 WordList(['really', 'good', 'day']),
 WordList(['good', 'day', 'that']),
 WordList(['day', 'that', 'makes']),
 WordList(['that', 'makes', 'me']),
 WordList(['makes', 'me', 'happy'])]
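
The values in the comments above fit together: the document-level polarity and subjectivity appear to be the plain average of the individual assessments, and the n-grams are a sliding window of three words. The sketch below reproduces both with the standard library. The assessment triples are copied from the comment in the cell, and the helper `ngrams` is a hypothetical re-implementation, not TextBlob's code.

```python
import re

# (words, polarity, subjectivity) triples copied from the assessments above.
assessments = [
    (["really", "horrible"], -1.0, 1.0),
    (["worst", "!"], -1.0, 1.0),
    (["really", "good"], 0.7, 0.6000000000000001),
    (["happy"], 0.8, 1.0),
]

polarity = sum(p for _, p, _ in assessments) / len(assessments)
subjectivity = sum(s for _, _, s in assessments) / len(assessments)
print(round(polarity, 3), round(subjectivity, 3))  # -0.125 0.9

def ngrams(text, n=3):
    """Slide a window of n words over the text, ignoring punctuation."""
    words = re.findall(r"\w+", text)
    return [words[i:i + n] for i in range(len(words) - n + 1)]

text = ("I had a really horrible day. It was the worst day ever! "
        "But every now and then I have a really good day that makes me happy.")
trigrams = ngrams(text)
print(len(trigrams), trigrams[0], trigrams[-1])
```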

## Disable some components:

In [51]:
# Keep only "tok2vec" and "ner"
import spacy
#nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "attribute_ruler", "lemmatizer","ner"]) # tok2vec only
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"]) # tok2vec + ner only

text = 'Jean had one of the most difficult days of her life while her week was bad.But she feels better days are to come.'
doc = nlp(text)
doc.text

'Jean had one of the most difficult days of her life while her week was bad.But she feels better days are to come.'

In [52]:
doc[0]

Jean

In [53]:
doc.pos_  # raises AttributeError: pos_ is a Token attribute, so use doc[0].pos_ instead

AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'pos_'

In [54]:
from spacy import displacy
displacy.render(doc, style = 'ent',jupyter=True)

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/python/end.gif" width="1000">

## Neural Network NLP

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/neural2.png" width="1000">

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/neural.png" width="1000">