<div style="line-height:1.2;">

<h1 style="color:#900C3F; margin-bottom: 0.2em;">Common practices in Machine Learning 2</h1>

</div>

<div style="line-height:1.2;">

<h4 style="margin-top: 0.2em; margin-bottom: 0.5em;">Two examples based on spaCy. Focus on tokenization and named entity identification. </h4>

</div>

<div style="margin-top: 5px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline; margin-bottom: 0;">Keywords:</h3> spaCy (for Natural Language Processing)
</span>
</div>

<h3 style="color:#900C3F"> Recap spaCy: </h3>
<div style="margin-top: -8px;">
Show the noun phrases and verbs in a text, as well as the named entities identified by spaCy, <br>
such as persons, dates, and cardinal numbers. <br>
Load the English pipeline "en_core_web_sm", which bundles a tokenizer, tagger, parser, and named entity recognizer (NER). <br>
Its pre-trained components cover tokenization, part-of-speech tagging, dependency parsing, and named entity recognition in English. <br>
</div>

In [1]:
%%script echo Skipping since SpaCy is already installed
!pip install -U spacy
!python -m spacy download en_core_web_sm

Skipping since SpaCy is already installed


In [3]:
import os
# Reduce TensorFlow's C++ log output (e.g. CUDA initialization messages) that may appear when importing spaCy:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # 0: show all, 1: hide INFO, 2: hide INFO and WARNING, 3: hide INFO, WARNING, and ERROR
import spacy

In [4]:
# Load a pre-trained English language model from the spaCy library
nlp = spacy.load("en_core_web_sm")

<h2 style="color:#900C3F"> Example 1:</h2>

In [4]:
text = ("""Maya Angelou's writing career spanned over six decades, 
        during which she became known for her powerful poetry, 
        autobiographical works, and activism. Despite experiencing significant personal and professional obstacles, 
        including racism, poverty, and sexual assault, Angelou persevered and 
        created a body of work that has inspired and touched millions of readers around the world. 
        Her most famous work, "I Know Why the Caged Bird Sings," is a memoir that recounts her childhood experiences 
        and has been praised for its honesty, vulnerability, and powerful message. 
        Angelou's writing has earned her numerous awards and accolades, and she remains one of the most influential writers of the 20th century.""")

<font size="5"> Processing text: </font>
<font size="4"> 
- tokenizes the text, <br>
- assigns a part-of-speech tag to each token, <br>
- performs dependency parsing to identify the syntactic relationships between tokens, <br>
- identifies named entities in the text. <br>
</font>

In [5]:
doc = nlp(text)
doc

Maya Angelou's writing career spanned over six decades, 
        during which she became known for her powerful poetry, 
        autobiographical works, and activism. Despite experiencing significant personal and professional obstacles, 
        including racism, poverty, and sexual assault, Angelou persevered and 
        created a body of work that has inspired and touched millions of readers around the world. 
        Her most famous work, "I Know Why the Caged Bird Sings," is a memoir that recounts her childhood experiences 
        and has been praised for its honesty, vulnerability, and powerful message. 
        Angelou's writing has earned her numerous awards and accolades, and she remains one of the most influential writers of the 20th century.

<font size="5"> Analyzing syntax: </font>

In [6]:
# Find contiguous spans of tokens that form noun phrases in the text.
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

Noun phrases: ["Maya Angelou's writing career", 'over six decades', 'which', 'she', 'her powerful poetry', 'autobiographical works', 'activism', 'significant personal and professional obstacles', 'racism', 'poverty', 'sexual assault', 'Angelou', 'a body', 'work', 'that', 'millions', 'readers', 'the world', 'Her most famous work', 'I', 'Why the Caged Bird Sings', 'a memoir', 'that', 'her childhood experiences', 'its honesty', 'vulnerability', 'powerful message', "Angelou's writing", 'her', 'numerous awards', 'accolades', 'she', 'the most influential writers', 'the 20th century']


In [7]:
# Show the lemmas of all verbs: keep the tokens whose part-of-speech tag (token.pos_) is "VERB" and collect their token.lemma_ values.
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Verbs: ['span', 'become', 'know', 'experience', 'include', 'persevere', 'create', 'inspire', 'touch', 'know', 'recount', 'praise', 'earn', 'remain']
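
The lemma list above contains one entry per verb token, so a lemma can appear more than once. As a minimal sketch, the list printed above is copied into a plain Python list (no spaCy needed) and counted with `collections.Counter`; the variable names `verb_lemmas`, `lemma_counts`, and `repeated` are illustrative and not part of the notebook.

```python
from collections import Counter

# Lemmas copied verbatim from the "Verbs:" output above.
verb_lemmas = ['span', 'become', 'know', 'experience', 'include', 'persevere',
               'create', 'inspire', 'touch', 'know', 'recount', 'praise',
               'earn', 'remain']

# Count how often each lemma occurs among the verb tokens.
lemma_counts = Counter(verb_lemmas)

# Keep only the lemmas that occur more than once.
repeated = [lemma for lemma, count in lemma_counts.items() if count > 1]
print(repeated)
```

With a live `doc`, the same count could presumably be built directly with `Counter(token.lemma_ for token in doc if token.pos_ == "VERB")`.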


In [8]:
""" Find and show named entities, phrases and concepts """
for entity in doc.ents:
    print(entity.text, entity.label_)

Maya Angelou's PERSON
six decades DATE
Angelou PERSON
millions CARDINAL
Angelou PERSON
the 20th century DATE
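
Since each entity comes as a text plus a label, the output above can be grouped by label. The sketch below copies the printed pairs into a list and groups them with a small helper; `group_by_label` and `entity_pairs` are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output above.
entity_pairs = [
    ("Maya Angelou's", "PERSON"),
    ("six decades", "DATE"),
    ("Angelou", "PERSON"),
    ("millions", "CARDINAL"),
    ("Angelou", "PERSON"),
    ("the 20th century", "DATE"),
]

def group_by_label(pairs):
    """Hypothetical helper: map each label to the entity texts that carry it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

entities_by_label = group_by_label(entity_pairs)
for label, texts in entities_by_label.items():
    print(label, texts)
```

In the notebook itself, the helper could be fed with `(entity.text, entity.label_) for entity in doc.ents` instead of the hard-coded list.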


<h2 style="color:#900C3F"> Example 2:</h2>

In [10]:
text = "Apple is looking at buying U.K. startup for $1 billion"

In [11]:
doc = nlp(text)

In [12]:
""" Print out the entities in the text.
The doc.ents attribute holds the named entities that the pipeline's NER component found in the text. 
Each entity carries a label, such as "ORG" for organizations, "GPE" for countries, cities and states, and "MONEY" for monetary values.
"""
for ent in doc.ents:
    print(ent.text, ent.label_)
print()

Apple ORG
U.K. GPE
$1 billion MONEY



In [13]:
# Show the part-of-speech tag and the dependency label of each token in the text
for token in doc:
    print(token.text, token.pos_, token.dep_)
print()

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj
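
The dependency labels in this output can be used to pick out grammatical roles such as the subject or the root verb. The sketch below works on the printed triples only, copied into a list; `texts_with_dep` is a hypothetical helper, and the labels it looks up (`nsubj`, `ROOT`, `dobj`) are the ones that appear in the output above.

```python
# (text, pos, dep) triples copied from the output above.
tokens = [
    ("Apple", "PROPN", "nsubj"),
    ("is", "AUX", "aux"),
    ("looking", "VERB", "ROOT"),
    ("at", "ADP", "prep"),
    ("buying", "VERB", "pcomp"),
    ("U.K.", "PROPN", "dobj"),
    ("startup", "NOUN", "dep"),
    ("for", "ADP", "prep"),
    ("$", "SYM", "quantmod"),
    ("1", "NUM", "compound"),
    ("billion", "NUM", "pobj"),
]

def texts_with_dep(rows, dep):
    """Hypothetical helper: return the token texts whose dependency label equals dep."""
    return [text for text, _pos, row_dep in rows if row_dep == dep]

subject = texts_with_dep(tokens, "nsubj")        # nominal subject
root = texts_with_dep(tokens, "ROOT")            # head of the sentence
direct_objects = texts_with_dep(tokens, "dobj")  # direct object

print("subject:", subject)
print("root:", root)
print("direct objects:", direct_objects)
```

Note that in this particular parse the `dobj` label lands on `U.K.` rather than on `startup`, so a lookup like this is only as reliable as the parse it reads from.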



In [14]:
""" Print out the noun chunks in the text. 
N.B.
doc.noun_chunks yields contiguous spans of tokens that form noun phrases in the text.
The part-of-speech tags and dependency labels provide information about the syntactic structure of the sentence, 
for example which tokens are the subject and the object of a verb.
"""
for chunk in doc.noun_chunks:
    print(chunk.text)
print()

Apple
U.K.

