<div style="line-height:1.2;">

<h1 style="color:#900C3F; margin-bottom: 0.2em;">Common practices in Machine Learning 2</h1>

</div>

<div style="line-height:1.2;">

<h4 style="margin-top: 0.2em; margin-bottom: 0.5em;">3 SpaCy examples. Focus on tokenization and identification of lemmas and stop words. </h4>

</div>

<div style="margin-top: 5px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline; margin-bottom: 0;">Keywords:</h3> SpaCy (for Natural Language Processing)
</span>
</div>

<h3 style="color:#900C3F"> Recap SpaCy: </h3>
<div style="margin-top: -8px;">
- Show the noun phrases and verbs in a text, as well as the named entities identified by SpaCy,
such as persons, dates, and cardinal numbers. <br>
- Load the English tokenizer, tagger, parser and NER (Named Entity Recognition) components from the "en_core_web_sm" model. <br>
- The "en_core_web_sm" model contains pre-trained components for tokenization, part-of-speech tagging, parsing, and named entity recognition in English. <br>
</div>

In [1]:
%%script echo Skipping since SpaCy is already installed
!pip install -U spacy
!python -m spacy download en_core_web_sm

Skipping since SpaCy is already installed


In [46]:
import os
# To avoid printing warnings on CUDA initialization in TensorFlow when importing SpaCy:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # 0: show all, 1: hide INFO, 2: hide INFO and WARNING, 3: hide INFO, WARNING and ERROR
import spacy
from spacy import displacy
from spacy.tokens import Token

In [3]:
# Load a pre-trained English language model from the spaCy library
nlp = spacy.load("en_core_web_sm")

<h2 style="color:#900C3F"> Example 1:</h2>

In [4]:
text = ("""Maya Angelou's writing career spanned over six decades, 
        during which she became known for her powerful poetry, 
        autobiographical works, and activism. Despite experiencing significant personal and professional obstacles, 
        including racism, poverty, and sexual assault, Angelou persevered and 
        created a body of work that has inspired and touched millions of readers around the world. 
        Her most famous work, "I Know Why the Caged Bird Sings," is a memoir that recounts her childhood experiences 
        and has been praised for its honesty, vulnerability, and powerful message. 
        Angelou's writing has earned her numerous awards and accolades, and she remains one of the most influential writers of the 20th century.""")

<font size="5"> Processing text: </font>
<font size="4"> 
- tokenizes the text <br>
- assigns part-of-speech tags to each token <br>
- performs dependency parsing to identify the syntactic relationships between tokens, <br>
- identifies named entities in the text. <br>
</font>

In [5]:
doc = nlp(text)
doc

Maya Angelou's writing career spanned over six decades, 
        during which she became known for her powerful poetry, 
        autobiographical works, and activism. Despite experiencing significant personal and professional obstacles, 
        including racism, poverty, and sexual assault, Angelou persevered and 
        created a body of work that has inspired and touched millions of readers around the world. 
        Her most famous work, "I Know Why the Caged Bird Sings," is a memoir that recounts her childhood experiences 
        and has been praised for its honesty, vulnerability, and powerful message. 
        Angelou's writing has earned her numerous awards and accolades, and she remains one of the most influential writers of the 20th century.

<font size="5"> Analyzing syntax: </font>

In [6]:
# Find contiguous spans of tokens that form noun phrases in the text.
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

Noun phrases: ["Maya Angelou's writing career", 'over six decades', 'which', 'she', 'her powerful poetry', 'autobiographical works', 'activism', 'significant personal and professional obstacles', 'racism', 'poverty', 'sexual assault', 'Angelou', 'a body', 'work', 'that', 'millions', 'readers', 'the world', 'Her most famous work', 'I', 'Why the Caged Bird Sings', 'a memoir', 'that', 'her childhood experiences', 'its honesty', 'vulnerability', 'powerful message', "Angelou's writing", 'her', 'numerous awards', 'accolades', 'she', 'the most influential writers', 'the 20th century']


In [7]:
# Show the lemmas of all the verbs, filtering for tokens with the part-of-speech tag "VERB" and the "token.lemma_" attribute.
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Verbs: ['span', 'become', 'know', 'experience', 'include', 'persevere', 'create', 'inspire', 'touch', 'know', 'recount', 'praise', 'earn', 'remain']
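The lemma list above contains a repeat ('know' appears twice), so a frequency count is a natural follow-up. The sketch below is only an illustration: it assumes the printed list has been copied into a variable called `verb_lemmas` (a name introduced here), whereas in the notebook the same list comes from the comprehension over `doc`.

```python
from collections import Counter

# Lemmas copied from the output of the cell above
# ('verb_lemmas' is a hypothetical name used only in this sketch).
verb_lemmas = ['span', 'become', 'know', 'experience', 'include', 'persevere',
               'create', 'inspire', 'touch', 'know', 'recount', 'praise',
               'earn', 'remain']

# Count how often each lemma occurs.
lemma_counts = Counter(verb_lemmas)

# Keep only the lemmas that occur more than once.
repeated = {lemma: n for lemma, n in lemma_counts.items() if n > 1}
print("Repeated verb lemmas:", repeated)
```

Counting lemmas rather than surface forms merges inflected variants such as "known" and "Know" into a single entry.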


In [8]:
""" Find and show named entities, phrases and concepts """
for entity in doc.ents:
    print(entity.text, entity.label_)

Maya Angelou's PERSON
six decades DATE
Angelou PERSON
millions CARDINAL
Angelou PERSON
the 20th century DATE
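The entity list above is easier to read when grouped by label. A minimal sketch, assuming the printed (text, label) pairs are copied into a list named `entities` and using a helper called `group_by_label`; both names are introduced here for illustration.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entities = [
    ("Maya Angelou's", "PERSON"),
    ("six decades", "DATE"),
    ("Angelou", "PERSON"),
    ("millions", "CARDINAL"),
    ("Angelou", "PERSON"),
    ("the 20th century", "DATE"),
]

def group_by_label(pairs):
    """Map each label to the distinct entity texts seen, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
for label, texts in by_label.items():
    print(label, texts)
```

With a live `Doc`, the same helper should accept `((ent.text, ent.label_) for ent in doc.ents)` instead of the copied list.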


<h2 style="color:#900C3F"> Example 2:</h2>

In [9]:
text = "Apple is looking at buying U.K. startup for $1 billion"

In [10]:
doc = nlp(text)

In [11]:
""" Print out the entities in the text.
The doc.ents attribute identifies named entities in the text and assigns them entity labels, 
such as "ORG" for organizations and "MONEY" for monetary values.
"""
for ent in doc.ents:
    print(ent.text, ent.label_)
print()

Apple ORG
U.K. GPE
$1 billion MONEY



In [12]:
# Tag and show the part-of-speech tags and dependencies for each token in the text
for token in doc:
    print(token.text, token.pos_, token.dep_)
print()

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj



In [14]:
## Dependency Parsing
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_)

Apple nsubj looking VERB
is aux looking VERB
looking ROOT looking VERB
at prep looking VERB
buying pcomp at ADP
U.K. dobj buying VERB
startup dep looking VERB
for prep startup NOUN
$ quantmod billion NUM
1 compound billion NUM
billion pobj for ADP
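Each row above is one arc of the dependency tree: a token, its relation label, and its head. The root is the token whose head is itself ("looking"). The sketch below rebuilds the tree from the printed rows. The names `arcs`, `build_tree` and `show` are introduced here for illustration, and keying the tree by token text only works because every word in this sentence is unique; with a real `Doc` one would key by `token.i` instead.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
arcs = [
    ("Apple", "nsubj", "looking"), ("is", "aux", "looking"),
    ("looking", "ROOT", "looking"), ("at", "prep", "looking"),
    ("buying", "pcomp", "at"), ("U.K.", "dobj", "buying"),
    ("startup", "dep", "looking"), ("for", "prep", "startup"),
    ("$", "quantmod", "billion"), ("1", "compound", "billion"),
    ("billion", "pobj", "for"),
]

def build_tree(arcs):
    """Return the root token and a head -> children mapping."""
    root = None
    children = defaultdict(list)
    for token, dep, head in arcs:
        if dep == "ROOT":
            root = token          # the root points to itself, so skip the arc
        else:
            children[head].append(token)
    return root, dict(children)

def show(node, children, depth=0):
    """Print the tree with two spaces of indentation per level."""
    print("  " * depth + node)
    for child in children.get(node, []):
        show(child, children, depth + 1)

root, children = build_tree(arcs)
show(root, children)
```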


In [16]:
##  Named Entity Recognition (NER)
for ent in doc.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))

Apple ORG Companies, agencies, institutions, etc.
U.K. GPE Countries, cities, states
$1 billion MONEY Monetary values, including unit


In [15]:
""" Print out the noun chunks in the text. 
N.B.
'doc.noun_chunks' finds contiguous spans of tokens that form noun phrases in the text.
The part-of-speech tags and dependency labels provide information about the syntactic structure of the sentence, 
such as identifying the subject and object of a verb.
"""
for chunk in doc.noun_chunks:
    print(chunk.text)
print()

Apple
U.K.



<h2 style="color:#900C3F"> Example 3:</h2>

In [27]:
""" The en_core_web_sm model in SpaCy is specifically trained for English text! It will not work effectively with Italian text. """ 
text3 = (""" L'Asia, con la sua vasta gamma di paesaggi, culture e popolazioni, è un altro continente che offre una ricchezza di studi per i geografi. +
        Dalle steppe della Mongolia alle foreste pluviali del Sud-Est Asiatico, dalla densamente popolata pianura del Gange in India alle remote vette dell'Himalaya, 
        l'Asia è un mosaico di ambienti naturali e di comunità umane.""")

tokens = nlp(text3)

In [29]:
# Sentence Boundary Detection (SBD)
for sent in tokens.sents:
    print(sent.text)

 L'Asia, con la sua vasta gamma di paesaggi, culture e popolazioni, è un altro continente che offre una ricchezza di studi per i geografi.
+
        Dalle steppe della Mongolia alle foreste pluviali del Sud-Est Asiatico, dalla densamente popolata pianura del Gange in India alle remote vette dell'Himalaya, 
        l'Asia è un mosaico di ambienti naturali e di comunità umane.


In [24]:
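""" Check similarity.
N.B.
The small English model ships no static word vectors, so these scores are not meaningful
(spaCy reports this with UserWarning [W007]); the text is also Italian, not English.
"""
counts = 0
for token1 in tokens:
    for token2 in tokens:
        # Print only the first 100 comparisons
        if counts < 100:
            print(token1.text, token2.text, token1.similarity(token2))
            counts += 1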

    1.0
  L'Asia 0.031051501631736755
  , -0.03211601451039314
  con 0.1844586431980133
  la 0.08773552626371384
  sua 0.13781750202178955
  vasta 0.04182692989706993
  gamma 0.07451114803552628
  di 0.05086849629878998
  paesaggi 0.006577547639608383
  , -0.03244805708527565
  culture 0.03573552519083023
  e 0.09464190155267715
  popolazioni -0.0028980958741158247
  , -0.00538644241169095
  è 0.08970832079648972
  un 0.08541999012231827
  altro 0.024544285610318184
  continente 0.05755249038338661
  che 0.04743819311261177
  offre 0.026201605796813965
  una -0.034233126789331436
  ricchezza 0.07146204262971878
  di 0.20989203453063965
  studi 0.09703853726387024
  per -0.12811018526554108
  i -0.03695463389158249
  geografi -0.04419403895735741
  . 0.044363319873809814
  + -0.008387534879148006
  
         0.8730809688568115
  Dalle -0.015174098312854767
  steppe -0.058458808809518814
  della 0.07139787822961807
  Mongolia 0.08303110301494598
  alle 0.1334519237279892
  foreste -0.005



In [37]:
%%script echo Skipping since already downloaded 
# Install a larger model => md (medium) instead of small (sm)
!python -m spacy download it_core_news_md

Skipping since already downloaded


In [38]:
nlp3 = spacy.load("it_core_news_md")
# Parse the Italian text once; 'tokens' is kept as an alias because the next cell uses both names
doc3 = nlp3(text3)
tokens = doc3

In [41]:
for token1 in doc3:
    for token2 in tokens:
        if token1.has_vector and token2.has_vector:
            print(token1.text, token2.text, token1.similarity(token2))

L' L' 1.0
L' Asia -0.0681772381067276
L' , -0.061594195663928986
L' con -0.03701285272836685
L' la -0.053165894001722336
L' sua 0.18699246644973755
L' vasta 0.07807796448469162
L' gamma -0.02895643748342991
L' di -0.05783780664205551
L' paesaggi -0.11344229429960251
L' , -0.061594195663928986
L' culture -0.07787531614303589
L' e -0.11887722462415695
L' popolazioni -0.12265948951244354
L' , -0.061594195663928986
L' è 0.0446704737842083
L' un 0.16354717314243317
L' altro -0.2577219605445862
L' continente -0.13317744433879852
L' che -0.04715902358293533
L' offre -0.00580450426787138
L' una -0.048209741711616516
L' ricchezza -0.10690784454345703
L' di -0.05783780664205551
L' studi -0.042338840663433075
L' per -0.18151147663593292
L' i -0.23611943423748016
L' geografi 0.017681073397397995
L' . -0.005119629204273224
L' + 0.056686583906412125
L' Dalle 0.025730177760124207
L' steppe -0.11325624585151672
L' della 0.03030782751739025
L' Mongolia -0.05326397716999054
L' alle -0.11888101696968079
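By default, `Token.similarity` is the cosine similarity between the two token vectors. That is why a token compared with itself scores 1.0 and why negative values appear. The sketch below computes the same quantity by hand on toy vectors; the function name `cosine_similarity` and the example vectors are made up for illustration and are not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))   # L2 norm of u
    norm_v = math.sqrt(sum(b * b for b in v))   # L2 norm of v
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                              # undefined for empty vectors; use 0.0 here
    return dot / (norm_u * norm_v)

# Toy vectors (illustrative values only).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, a vector and any positive multiple of it are treated as identical.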


In [42]:
""" Check similarity. 
N.B.1 
Better to narrow this comparison down to specific tokens or token types, to avoid performing meaningless comparisons.
N.B.2
Check 'has_vector' to avoid the UserWarning: [W008] Evaluating Token.similarity based on empty vectors
"""
counts, max_comparisons = 0, 30

for token1 in doc3:
    for token2 in doc3:
        if token1.has_vector and token2.has_vector and counts < max_comparisons:
            print(token1.text, token2.text, token1.similarity(token2))
            counts += 1

L' L' 1.0
L' Asia -0.0681772381067276
L' , -0.061594195663928986
L' con -0.03701285272836685
L' la -0.053165894001722336
L' sua 0.18699246644973755
L' vasta 0.07807796448469162
L' gamma -0.02895643748342991
L' di -0.05783780664205551
L' paesaggi -0.11344229429960251
L' , -0.061594195663928986
L' culture -0.07787531614303589
L' e -0.11887722462415695
L' popolazioni -0.12265948951244354
L' , -0.061594195663928986
L' è 0.0446704737842083
L' un 0.16354717314243317
L' altro -0.2577219605445862
L' continente -0.13317744433879852
L' che -0.04715902358293533
L' offre -0.00580450426787138
L' una -0.048209741711616516
L' ricchezza -0.10690784454345703
L' di -0.05783780664205551
L' studi -0.042338840663433075
L' per -0.18151147663593292
L' i -0.23611943423748016
L' geografi 0.017681073397397995
L' . -0.005119629204273224
L' + 0.056686583906412125
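Most of the pairs above involve punctuation or function words, and each pair would eventually be visited twice (A-B and B-A). One possible way to narrow the comparison is sketched below: keep only alphabetic, non-stop-word tokens that have a vector, then iterate over unordered pairs. The helper `candidate_pairs` and the stand-in objects are hypothetical; the stand-ins only mimic the `Token` attributes the helper reads, with illustrative values.

```python
from itertools import combinations
from types import SimpleNamespace

def candidate_pairs(tokens):
    """Return unordered pairs of content tokens that have a vector."""
    kept = [t for t in tokens if t.is_alpha and not t.is_stop and t.has_vector]
    return combinations(kept, 2)

def fake_token(text, is_alpha=True, is_stop=False, has_vector=True):
    """Stand-in exposing only the attributes used above (not a real spaCy Token)."""
    return SimpleNamespace(text=text, is_alpha=is_alpha,
                           is_stop=is_stop, has_vector=has_vector)

sample = [
    fake_token("L'", is_alpha=False),
    fake_token("Asia"),
    fake_token(",", is_alpha=False),
    fake_token("la", is_stop=True),
    fake_token("paesaggi"),
    fake_token("culture"),
]

pairs = [(a.text, b.text) for a, b in candidate_pairs(sample)]
print(pairs)
```

In the notebook, passing `doc3` to the helper and calling `a.similarity(b)` on each returned pair should give a much shorter and more readable table.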


In [47]:
""" Visualize Dependency Parses
N.B.
To avoid the UserWarning: [W011] It looks like you're calling displacy.serve from within a Jupyter notebook or a similar environment....
Use 'render' method instead of 'serve' when using Jupyter, since there's no need to make 'displaCy' start another local web server.
"""
#displacy.serve(doc3, style="dep")
displacy.render(doc3, style="dep", jupyter=True)

In [44]:
# Word embeddings
tok = doc3[2]
print(tok.text, tok.has_vector, tok.vector_norm, tok.is_oov)

Asia True 39.327988 False


In [49]:
# Stop words 
print(nlp3.Defaults.stop_words)

{'stavate', 'negl', 'milioni', 'riecco', 'avremo', 'sembrato', 'scola', 'quasi', 'avute', 'da', 'sarebbe', 'ognuna', 'sarei', "quest'", 'lungo', 'faranno', 'glielo', 'fosse', 'faremmo', 'sotto', 'staresti', 'farebbero', 'molto', 'dell', 'po', 'peccato', 'una', 'questo', 'altro', 'mediante', 'stessero', 'dietro', 'dello', 'quando', 'dire', 'sue', 'registrazione', 'ai', 'gruppo', 'colui', 'fummo', 'maggior', 'sia', 'starei', "t'", 'co', 'vi', 'subito', 'stetti', 'concernente', 'giorno', 'questa', 'volte', 'nuovo', 'egli', 'avrebbero', 'in', 'alla', 'fece', 'stessimo', 'detto', 'conciliarsi', 'persino', 'relativo', 'sembrare', 'la', 'ad', 'sembri', 'parecchio', 'abbiate', 'piuttosto', 'suo', 'oggi', 'tue', 'tuoi', 'avessimo', 'medesimo', 'avanti', 'nel', 'avuto', 'inc', 'sei', 'gli', 'farebbe', 'preferibilmente', 'codesta', 'sarà', 'furono', 'ecc', 'siamo', 'ne', 'seguente', 'cio', 'dalle', 'avrebbe', 'per', 'lato', 'come', 'tutte', 'città', 'pochissimo', 'starebbe', 'fin', 'quel', 'dall'
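A stop-word set like the one above is typically used to drop function words before counting or comparing tokens. A minimal sketch, assuming a handful of entries visible in the printed set and a short word list taken from the example sentence; `stop_subset`, `words` and `remove_stop_words` are names introduced here for illustration.

```python
# A few entries visible in the printed set above; the real set is much larger.
stop_subset = {"una", "per", "la", "in", "altro", "dalle"}

# Words from the end of the first sentence of the Italian example text.
words = ["una", "ricchezza", "di", "studi", "per", "i", "geografi"]

def remove_stop_words(words, stop_words):
    """Keep the words whose lowercase form is not in the stop-word set."""
    return [w for w in words if w.lower() not in stop_words]

content_words = remove_stop_words(words, stop_subset)
print(content_words)
```

Here "di" and "i" survive only because they are not in the small subset used for the sketch. On a parsed `Doc`, the per-token flag `token.is_stop` gives the same information without handling the set directly.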

In [45]:
"""  Set custom attributes to store custom data about the tokens. """
Token.set_extension("is_color", default=False, force=True)
doc3[3]._.is_color = True
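Setting the flag by hand, as above, marks one token at a time. A custom attribute can also be computed on access through a getter function. The sketch below assumes a tiny color lexicon (`COLOR_WORDS`), a separate attribute name (`is_color_word`, so the notebook's `is_color` is left untouched) and a blank English pipeline; all three are choices made for illustration. It uses `spacy.blank`, so no trained model is needed.

```python
import spacy
from spacy.tokens import Token

# Hypothetical lexicon used only for this sketch.
COLOR_WORDS = {"red", "green", "blue"}

def is_color_word(token):
    """Getter: True when the lowercased token text is in the lexicon."""
    return token.lower_ in COLOR_WORDS

# Register the attribute with a getter instead of a fixed default value.
Token.set_extension("is_color_word", getter=is_color_word, force=True)

# A blank pipeline only tokenizes, which is all the getter needs.
nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("The sky is blue and the grass is green")

colored = [token.text for token in doc_demo if token._.is_color_word]
print(colored)
```

A getter-based attribute is read-only: assigning to `token._.is_color_word` raises an error unless a setter is registered as well.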