<a href="https://colab.research.google.com/github/guilhermelaviola/NaturalLanguageProcessing/blob/main/Class09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing with spaCy**

In [6]:
# Installing and importing all the necessary libraries:
! pip3 install wikipedia
! pip3 install spacy
! spacy download en_core_web_sm

import wikipedia
import spacy

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 52.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [7]:
# Downloading text from Wikipedia:
wikipedia.set_lang('en')

text = wikipedia.page('Frank Zappa').content
text

'Frank Vincent Zappa (December 21, 1940 – December 4, 1993) was an American guitarist, composer, and bandleader. In a career spanning more than 30 years, Zappa composed rock, pop, jazz, jazz fusion, orchestral  and musique concrète works; he also produced almost all of the 60-plus albums that he released with his band the Mothers of Invention and as a solo artist. His work is characterized by nonconformity, improvisation sound experimentation, musical virtuosity and satire of American culture. Zappa also directed feature-length films and music videos, and designed album covers. He is considered one of the most innovative and stylistically diverse musicians of his generation.\nAs a mostly self-taught composer and performer, Zappa had diverse musical influences that led him to create music that was sometimes difficult to categorize. While in his teens, he acquired a taste for 20th-century classical modernism, African-American rhythm and blues, and doo-wop music. He began writing classica

In [8]:
# Loading the spaCy model and processing the text:
nlp = spacy.load('en_core_web_sm')

doc = nlp(text)
doc

Frank Vincent Zappa (December 21, 1940 – December 4, 1993) was an American guitarist, composer, and bandleader. In a career spanning more than 30 years, Zappa composed rock, pop, jazz, jazz fusion, orchestral  and musique concrète works; he also produced almost all of the 60-plus albums that he released with his band the Mothers of Invention and as a solo artist. His work is characterized by nonconformity, improvisation sound experimentation, musical virtuosity and satire of American culture. Zappa also directed feature-length films and music videos, and designed album covers. He is considered one of the most innovative and stylistically diverse musicians of his generation.
As a mostly self-taught composer and performer, Zappa had diverse musical influences that led him to create music that was sometimes difficult to categorize. While in his teens, he acquired a taste for 20th-century classical modernism, African-American rhythm and blues, and doo-wop music. He began writing classical 

In [9]:
# Accessing tokens:
doc[0]
doc[1]
doc[:3]

len(doc)

17825

In [10]:
# Accessing sentences:
# doc.sents is a generator, so it is converted to a list before indexing.
doc.sents

list(doc.sents)[0]
list(doc.sents)[1]

In a career spanning more than 30 years, Zappa composed rock, pop, jazz, jazz fusion, orchestral  and musique concrète works; he also produced almost all of the 60-plus albums that he released with his band the Mothers of Invention and as a solo artist.

In [11]:
# Accessing entities:
print(doc.ents)

for entity in doc.ents:
  print(entity.text, entity.label_)

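# start/end are token indices into doc; start_char/end_char are character offsets into text.
first_entity = doc.ents[0]
first_entity.start, first_entity.end
doc[first_entity.start:first_entity.end]
first_entity.start_char, first_entity.end_char

text[first_entity.start_char:first_entity.end_char]

first_sentence = list(doc.sents)[0]
first_sentence.start_char, first_sentence.end_char

# Hard-coded slice; 215 characters runs past the end of the first sentence.
text[0:215]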

(Frank Vincent Zappa, December 21, 1940, December 4, 1993, American, more than 30 years, Zappa, 60-plus, American, Zappa, Zappa, 20th-century, African-American, the Mothers of Invention, 1966, Zappa, Project/Object, Zappa, U.S., Europe, Zappa, 1995, the Rock and Roll Hall of Fame, 1997, Grammy Lifetime Achievement Award, 1940–1965, Childhood, Zappa, December 21, 1940, Baltimore, Maryland, Rose Marie, Colimore, Francis Vincent Zappa, Sicilian, Greek, Arab, French, Frank, four, Italian, Italian, 6  , Florida, the 1940s, Maryland, Zappa, Edgewood Arsenal, the Aberdeen Proving Ground, the U.S. Army, 20–23, Zappa, Zappa, Zappa, 19, Zappa, Zappa, Zappa, 10, Nasal, Cal Schenkel, Zappa, Baltimore, 20–23, 10, 1952, Monterey, California, the Naval Postgraduate School, 22, San Diego, Clairemont, 46  , El Cajon, San Diego, First, Zappa, the age of 12, summer, Monterey, California, Keith McKillop, Frank, 13, Zappa, first, Mission Bay High School, San Diego, 29, 22, The Rough Guide to Rock, 2003, Za

'Frank Vincent Zappa (December 21, 1940 – December 4, 1993) was an American guitarist, composer, and bandleader. In a career spanning more than 30 years, Zappa composed rock, pop, jazz, jazz fusion, orchestral  and m'
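The entity cell above mixes two kinds of offsets: `start`/`end` count tokens, while `start_char`/`end_char` count characters in the original string. The following is a minimal sketch of how the two relate, written in plain Python. The regex tokenizer and the `span_to_chars` helper are hypothetical stand-ins for illustration only; spaCy's own tokenizer is considerably more sophisticated.

```python
import re

text = "Frank Vincent Zappa was an American guitarist."

# Hypothetical stand-in for a tokenizer: words and punctuation marks,
# each stored together with its character offsets in `text`.
tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def span_to_chars(tokens, start, end):
    """Convert a token span [start, end) into character offsets [start_char, end_char)."""
    return tokens[start][1], tokens[end - 1][2]

# Tokens 0..2 cover the name, so the token span is [0, 3).
start_char, end_char = span_to_chars(tokens, 0, 3)
entity_text = text[start_char:end_char]

print(len(tokens), (start_char, end_char), entity_text)
```

Both ranges are half-open: the end index points one past the last token or character of the span, which is why slicing with them directly returns exactly the span.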

In [13]:
# Putting tokens into a table:
import pandas as pd

columns = 'Token Lemma Alpha Shape POS Dependency Head Stopword Entity IOB'.split(' ')
columns

registries = []
for token in list(doc.sents)[0]:
    registries.append([
        token.text, token.lemma_, token.is_alpha, token.shape_, token.pos_,
        token.dep_, token.head, token.is_stop, token.ent_type_, token.ent_iob_,
    ])

print(registries)

table = pd.DataFrame(registries, columns=columns)
table

spacy.explain('PROPN')
spacy.explain('nsubj')

[['Frank', 'Frank', True, 'Xxxxx', 'PROPN', 'compound', Zappa, False, 'PERSON', 'B'], ['Vincent', 'Vincent', True, 'Xxxxx', 'PROPN', 'compound', Zappa, False, 'PERSON', 'I'], ['Zappa', 'Zappa', True, 'Xxxxx', 'PROPN', 'nsubj', was, False, 'PERSON', 'I'], ['(', '(', False, '(', 'PUNCT', 'punct', Zappa, False, '', 'O'], ['December', 'December', True, 'Xxxxx', 'PROPN', 'appos', Zappa, False, 'DATE', 'B'], ['21', '21', False, 'dd', 'NUM', 'nummod', December, False, 'DATE', 'I'], [',', ',', False, ',', 'PUNCT', 'punct', December, False, 'DATE', 'I'], ['1940', '1940', False, 'dddd', 'NUM', 'nummod', December, False, 'DATE', 'I'], ['–', '–', False, '–', 'PUNCT', 'punct', Zappa, False, '', 'O'], ['December', 'December', True, 'Xxxxx', 'PROPN', 'appos', Zappa, False, 'DATE', 'B'], ['4', '4', False, 'd', 'NUM', 'nummod', December, False, 'DATE', 'I'], [',', ',', False, ',', 'PUNCT', 'punct', December, False, 'DATE', 'I'], ['1993', '1993', False, 'dddd', 'NUM', 'nummod', December, False, 'DATE', 

'nominal subject'
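The last two values in each table row (`token.ent_type_` and `token.ent_iob_`) encode entities token by token: `B` begins an entity, `I` continues it, and `O` marks a token outside any entity. As a rough sketch of how such tags can be folded back into entity spans, the snippet below regroups the first rows of the output above. The `group_entities` helper is hypothetical and not part of spaCy; in practice `doc.ents` already provides the grouped spans.

```python
# (token text, IOB tag, entity type) triples copied from the table output above.
tagged = [
    ("Frank", "B", "PERSON"), ("Vincent", "I", "PERSON"), ("Zappa", "I", "PERSON"),
    ("(", "O", ""),
    ("December", "B", "DATE"), ("21", "I", "DATE"), (",", "I", "DATE"), ("1940", "I", "DATE"),
    ("–", "O", ""),
    ("December", "B", "DATE"), ("4", "I", "DATE"), (",", "I", "DATE"), ("1993", "I", "DATE"),
]

def group_entities(tagged):
    """Fold per-token IOB tags into a list of (tokens, label) entity spans."""
    entities, current, label = [], [], None
    for token_text, iob, ent_type in tagged:
        if iob == "B":
            # A new entity starts; close the previous one if it is still open.
            if current:
                entities.append((current, label))
            current, label = [token_text], ent_type
        elif iob == "I" and current:
            current.append(token_text)
        else:
            # "O" (or a stray "I") ends whatever entity was open.
            if current:
                entities.append((current, label))
            current, label = [], None
    if current:
        entities.append((current, label))
    return entities

entities = group_entities(tagged)
for entity_tokens, entity_label in entities:
    print(entity_label, entity_tokens)
```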

In [14]:
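# Vector similarity:
# Note: en_core_web_sm ships without static word vectors, so these scores are only a rough signal.
# The larger en_core_web_md and en_core_web_lg packages include real word vectors.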
doc[2]
doc[24]
doc[9]
doc[2].similarity(doc[24])
doc[2].similarity(doc[9])
doc[1:10]
doc[400:410]
doc[1:10].similarity(doc[400:410])


0.022568481042981148
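By default, the score returned by `.similarity()` is a cosine similarity between the vectors of the two objects. The sketch below shows that computation in plain Python, assuming the vectors are already available as lists of floats. The three-dimensional vectors are made up for illustration; real spaCy vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any spaCy model.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # orthogonal to a

same_direction = cosine_similarity(a, b)   # close to 1
orthogonal = cosine_similarity(a, c)       # exactly 0 for these inputs
print(same_direction, orthogonal)
```

Because only the angle matters, scaling a vector does not change the score, which is why `a` and `b` come out as near-identical even though their lengths differ.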

In [16]:
# DisplaCy for visualization:
# displaCy ships with spaCy, so no separate installation is needed.
from spacy import displacy

displacy.render(list(doc.sents)[0], style='ent', jupyter=True)
displacy.render(list(doc.sents)[1], style='ent', jupyter=True)
displacy.render(list(doc.sents)[0], style='dep', jupyter=True)