# SPACY BASICS

In this lab we will learn to use the spaCy API to annotate text. Many of the concepts covered in this lab are explained in detail in the spaCy 101 guide:

https://spacy.io/usage/spacy-101 

Here you can configure the kind of spaCy setup (language, annotators, etc.) that you need and get the matching installation commands:

https://spacy.io/usage 


In [1]:
# Install Spacy and learn about Token and Sentence objects
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.2.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting srsly<

# ASSIGNMENT 1

Install the language modules of your choice. 

Read the documentation at https://spacy.io/usage and choose the language modules that you would like to install, according to your interests.
  + TODO: Install the language module(s).
  + TODO: Try different language module versions for one language and compare the results obtained.
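
One possible way to approach the comparison in the second TODO is to collect each module's predictions and look at where they disagree. The sketch below is only an illustration: the helper `disagreements` and the sample predictions are hypothetical, standing in for `(text, label)` entity pairs gathered from each installed module.

```python
from typing import Dict, List, Set, Tuple

Entity = Tuple[str, str]  # (entity text, entity label)


def disagreements(predictions: Dict[str, List[Entity]]) -> Dict[Entity, Set[str]]:
    """Map each entity not predicted by every module to the modules that did predict it."""
    all_modules = set(predictions)
    found_by: Dict[Entity, Set[str]] = {}
    for module, entities in predictions.items():
        for entity in entities:
            found_by.setdefault(entity, set()).add(module)
    return {ent: mods for ent, mods in found_by.items() if mods != all_modules}


# Made-up predictions for illustration only; real output depends on the module versions.
sample = {
    "small": [("Washington University", "ORG"), ("Missouri", "GPE")],
    "large": [("Washington University", "ORG"), ("Missouri", "GPE"),
              ("George Washington", "PERSON")],
}
print(disagreements(sample))
```

An empty result means all modules agreed on every entity; anything else points at the spans worth inspecting by hand.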

In [2]:
# TODO install other language modules of your choice following the https://spacy.io/usage
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_lg

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.2.0
✔ Download and installation successful

# Loading the language modules

The nlp object is a language model instance. You can assume that, throughout this tutorial, nlp refers to the language model loaded by the language package or packages of your choice. In the code below, the loaded English modules are bound to nlp_en_sm, nlp_en_md and nlp_en_lg. In the following steps we will use spaCy to process a string and a text file.

In [3]:
import spacy
#TODO load the installed language module
nlp_en_sm = spacy.load('en_core_web_sm')
nlp_en_md = spacy.load('en_core_web_md')
nlp_en_lg = spacy.load('en_core_web_lg')

In [33]:
doc_en_sm = nlp_en_sm("Washington University, which is located in Missouri, is named after George Washington.")
print(doc_en_sm)

Washington University, which is located in Missouri, is named after George Washington.


In [5]:
doc_en_md = nlp_en_md("Washington University, which is located in Missouri, is named after George Washington.")
print(doc_en_md)

Washington University, which is located in Missouri, is named after George Washington.


In [6]:
doc_en_lg = nlp_en_lg("Washington University, which is located in Missouri, is named after George Washington.")
print(doc_en_lg)

Washington University, which is located in Missouri, is named after George Washington.


# ASSIGNMENT 2

When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

+ TODO: print the tokens in the Doc object. You should get something like the output below.
+ TODO: print the description of each tag (see morphology example, below)
+ TODO: print the entities recognized by iterating over the Doc object (scroll down past the morphology printout to see an example output).
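
The relationship between a Doc and its Token and Span views can be sketched without any trained module. The example below assumes only that spaCy itself is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline, so no tags or entities are predicted here; the variable names are illustrative.

```python
import spacy

# A blank English pipeline only tokenizes; no trained module download is needed.
nlp = spacy.blank("en")
doc = nlp("Washington University, which is located in Missouri, "
          "is named after George Washington.")

first_token = doc[0]    # a Token: a view of one position in the Doc
university = doc[:2]    # a Span: a view of a slice of the Doc
person = doc[-3:-1]     # negative indices work as in Python lists

print(type(doc).__name__, len(doc))
print(type(first_token).__name__, first_token.text, first_token.i)
print(type(university).__name__, university.text)
print(type(person).__name__, person.text)
```

Indexing with an integer gives a Token and slicing gives a Span; both refer back to the same underlying Doc rather than copying its text.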

In [7]:
# TODO add your code here to print the tokens in the Doc object
tokens = [token.text for token in doc_en_sm]
print(tokens)

['Washington', 'University', ',', 'which', 'is', 'located', 'in', 'Missouri', ',', 'is', 'named', 'after', 'George', 'Washington', '.']


+ TODO: print the two entities containing "Washington"


In [8]:
# A slice of the Doc for "Washington University"
print(doc_en_sm[:2])

# A slice of the Doc for "George Washington" (without the ".")
print(doc_en_sm[-3:-1])

Washington University
George Washington


In [9]:
# TODO obtain number of sentences
print(len(list(doc_en_sm.sents)))
print(list(doc_en_sm.sents)[0])

1
Washington University, which is located in Missouri, is named after George Washington.


In [10]:
# morphology and syntax
for token in doc_en_sm:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_tag = token.tag_
    token_lemma = token.lemma_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{token_lemma:<20}{token_dep:<20}")

Washington  PROPN     NNP       Washington          compound            
University  PROPN     NNP       University          nsubjpass           
,           PUNCT     ,         ,                   punct               
which       PRON      WDT       which               nsubjpass           
is          AUX       VBZ       be                  auxpass             
located     VERB      VBN       locate              relcl               
in          ADP       IN        in                  prep                
Missouri    PROPN     NNP       Missouri            pobj                
,           PUNCT     ,         ,                   punct               
is          AUX       VBZ       be                  auxpass             
named       VERB      VBN       name                ROOT                
after       ADP       IN        after               prep                
George      PROPN     NNP       George              compound            
Washington  PROPN     NNP       Washington         

In [11]:
# morphology and syntax
for token in doc_en_sm:
    token_text = token.text
    token_pos = token.pos_
    token_tag = token.tag_
    token_lemma = token.lemma_
    token_dep = token.dep_
    # This is for formatting only
    # TODO modify the code above to print the description of each tag, like so:
    token_desc = spacy.explain(token.tag_)
    print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{token_desc:<50}{token_lemma:<20}{token_dep:<20}")

Washington  PROPN     NNP       noun, proper singular                             Washington          compound            
University  PROPN     NNP       noun, proper singular                             University          nsubjpass           
,           PUNCT     ,         punctuation mark, comma                           ,                   punct               
which       PRON      WDT       wh-determiner                                     which               nsubjpass           
is          AUX       VBZ       verb, 3rd person singular present                 be                  auxpass             
located     VERB      VBN       verb, past participle                             locate              relcl               
in          ADP       IN        conjunction, subordinating or preposition         in                  prep                
Missouri    PROPN     NNP       noun, proper singular                             Missouri            pobj                
,           PUNC

In [12]:
# TODO Iterate over the predicted entities
entities = [(ent.text, ent.label_) for ent in doc_en_sm.ents]
print(entities)

[('Washington University', 'ORG'), ('Missouri', 'GPE'), ('George Washington', 'PERSON')]


In [13]:
# TODO modify the code above to iterate over the predicted entities at token level, like so:
# iob2 entities

for token in doc_en_sm:
    token_text = token.text
    token_pos = token.pos_
    token_tag = token.tag_
    token_lemma = token.lemma_
    token_dep = token.dep_
    token_ent = token.ent_iob_
    if token.ent_type_:
        token_ent += "-" + token.ent_type_
    token_desc = spacy.explain(token.tag_)
    print(f"{token_text:<12}{token_pos:<10}{token_tag:<10}{token_desc:<50}{token_lemma:<20}{token_dep:<20}{token_ent:<20}")

Washington  PROPN     NNP       noun, proper singular                             Washington          compound            B-ORG               
University  PROPN     NNP       noun, proper singular                             University          nsubjpass           I-ORG               
,           PUNCT     ,         punctuation mark, comma                           ,                   punct               O                   
which       PRON      WDT       wh-determiner                                     which               nsubjpass           O                   
is          AUX       VBZ       verb, 3rd person singular present                 be                  auxpass             O                   
located     VERB      VBN       verb, past participle                             locate              relcl               O                   
in          ADP       IN        conjunction, subordinating or preposition         in                  prep                O                   
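
The token-level labels in the last column follow the IOB2 scheme: `B-` marks the first token of an entity, `I-` the tokens that continue it, and `O` everything outside an entity. As a rough illustration of how such tags relate to entity spans, here is a plain-Python sketch; the helper `iob2_tags` is hypothetical and is not part of spaCy.

```python
from typing import List, Tuple


def iob2_tags(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Build IOB2 tags from (start, end, label) token spans; end is exclusive."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens of the entity
    return tags


tokens = ["Washington", "University", ",", "which", "is", "located", "in",
          "Missouri", ",", "is", "named", "after", "George", "Washington", "."]
# Token spans matching the three entities printed earlier: ORG, GPE, PERSON.
spans = [(0, 2, "ORG"), (7, 8, "GPE"), (12, 14, "PERSON")]

for token, tag in zip(tokens, iob2_tags(len(tokens), spans)):
    print(f"{token:<12}{tag}")
```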

In [14]:
# easy feature extraction
for token in doc_en_sm:
    print(token, token.idx, token.text_with_ws,
          token.is_alpha, token.is_punct, token.is_space,
          token.shape_, token.is_stop)

Washington 0 Washington  True False False Xxxxx False
University 11 University True False False Xxxxx False
, 21 ,  False True False , False
which 23 which  True False False xxxx True
is 29 is  True False False xx True
located 32 located  True False False xxxx False
in 40 in  True False False xx True
Missouri 43 Missouri True False False Xxxxx False
, 51 ,  False True False , False
is 53 is  True False False xx True
named 56 named  True False False xxxx False
after 62 after  True False False xxxx True
George 68 George  True False False Xxxxx False
Washington 75 Washington True False False Xxxxx False
. 85 . False True False . False
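
The `shape_` feature in the output above abstracts each character into a class: uppercase letters become `X`, lowercase letters `x`, and other characters are kept. The function below is a rough re-implementation for illustration, not spaCy's own code; the name `word_shape_sketch` is hypothetical, the `d` class for digits is an assumption, and the cap of four repeated shape characters is inferred from outputs such as `Xxxxx` for "Washington".

```python
def word_shape_sketch(text: str) -> str:
    """Approximate a word-shape feature: X/x/d classes, runs capped at 4."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char             # punctuation is kept as-is
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:                      # drop repeats beyond the fourth
            shape.append(shape_char)
    return "".join(shape)


for word in ["Washington", "located", "is", ","]:
    print(word, word_shape_sketch(word))
```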


In [15]:
# stopwords available for English
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS
len(spacy_stopwords)
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

thereupon
several
anyhow
over
seem
of
no
former
latterly
other


# ASSIGNMENT 3

+ TODO: Remove stopwords from doc
+ TODO: print only the 3rd person singular present verbs (tag VBZ) and the singular proper nouns (tag NNP)

In [16]:
# TODO remove stopwords
for token in doc_en_sm:
    # token.is_stop ignores case, unlike a membership test on token.text
    if not token.is_stop:
        print(token)

Washington
University
,
located
Missouri
,
named
George
Washington
.
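
The stop words printed earlier are all lowercase, so a case-sensitive membership test on the raw token text can miss capitalized stop words, for example at the start of a sentence. A minimal sketch of the difference, using a tiny hypothetical stop list rather than spaCy's full set:

```python
# A tiny, hypothetical stop list for illustration; spaCy's English list is much larger.
stop_words = {"which", "is", "in", "after"}

tokens = ["Which", "university", "is", "in", "Missouri", "?"]

case_sensitive = [t for t in tokens if t not in stop_words]
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # 'Which' slips through the filter
print(case_insensitive)
```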


In [17]:
# TODO print only verbs, 3rd person singular present and proper singular nouns
nouns = []
verbs = []

# TODO add your code here
for token in doc_en_sm:
    if token.tag_ == "NNP":
        nouns.append(token.text)
    if token.tag_ == "VBZ":
        verbs.append(token.text)

print(nouns)
print(verbs)

['Washington', 'University', 'Missouri', 'George', 'Washington']
['is', 'is']
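
When more than two tags are of interest, grouping tokens by tag in one pass may be more convenient than one list per tag. A small sketch under that assumption, using `(text, tag)` pairs copied from the morphology table printed earlier; the variable names are illustrative.

```python
from collections import defaultdict

# (text, tag) pairs copied from the morphology table printed earlier.
tagged = [("Washington", "NNP"), ("University", "NNP"), (",", ","), ("which", "WDT"),
          ("is", "VBZ"), ("located", "VBN"), ("in", "IN"), ("Missouri", "NNP"),
          (",", ","), ("is", "VBZ"), ("named", "VBN"), ("after", "IN"),
          ("George", "NNP"), ("Washington", "NNP")]

by_tag = defaultdict(list)
for text, tag in tagged:
    by_tag[tag].append(text)

print(by_tag["NNP"])
print(by_tag["VBZ"])
```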


# ASSIGNMENT 4 (BONUS 1)

Visualizations with spaCy. Check the documentation at https://spacy.io/usage/visualizers and render the dependencies and NER annotations, like so:



In [36]:
from spacy import displacy
from IPython.display import display, HTML
displacy.render(doc_en_sm, style="ent", jupyter=True)
displacy.render(doc_en_sm, style="dep", jupyter=True)

# ASSIGNMENT 5 (BONUS 2)

In this task you will be annotating a movie review at document and sentence level.

1. Open the file '/content/drive/My Drive/Colab Notebooks/2022-ILTAPP/resources/movie-review.txt'
2. Predict and print the various annotations seen previously (POS, NER, lemmas, etc.) for each of the sentences in the document using at least two language modules for one language of your interest (most basic and most advanced).
3. Visualize the results.
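
Step 1 can be sketched as a small helper that reads a text file into a list of non-empty lines. The helper name `read_paragraphs`, the temporary sample file, and the UTF-8 encoding are assumptions made for illustration; adjust them to the actual review file.

```python
import tempfile
from pathlib import Path
from typing import List


def read_paragraphs(path: Path) -> List[str]:
    """Return the non-empty lines of a text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:  # encoding is an assumption
        return [line.rstrip("\n") for line in f if line.strip()]


# Demonstration on a temporary file whose last line has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("First paragraph.\n\nSecond paragraph", encoding="utf-8")
    paragraphs = read_paragraphs(sample_path)

print(paragraphs)
```

Stripping with `rstrip("\n")` keeps the last character of the final line even when the file does not end with a newline.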



In [19]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [20]:
%cd /content/drive/MyDrive/LAP/Subjects/AP1/labs

/content/drive/MyDrive/LAP/Subjects/AP1/labs


## Guardian

In [42]:
file_name = '../resources/guardian.txt'
with open(file_name) as f:
    lines = f.readlines()

# Drop blank lines; rstrip is safe even if the last line has no trailing newline
lines = [line.rstrip("\n") for line in lines if line != "\n"]
for line in lines:
    print(line)
    print()

Twelve years after the fall of the Taliban, Afghanistan is heading for a near-record opium crop as instability pushes up the amount of land planted with illegal but lucrative poppies, according to a bleak UN report.

The rapid growth of poppy farming as western troops head home reflects particularly badly on Britain, which was designated "lead nation" for counter-narcotics work over a decade ago.

"Poppy cultivation is not only expected to expand in areas where it already existed in 2012 … but also in new areas or areas where poppy cultivation was stopped," the Afghanistan Opium Winter Risk Assessment found.

The growth in opium cultivation reflects both spreading instability and concerns about the future. Farmers are more likely to plant the deadly crop in areas of high violence or where they have not received any agricultural aid, the report said.

Opium traders are often happy to provide seeds, fertilisers and even advance payments to encourage crops, leaving farmers who do not have

### Entities

In [25]:
for line in lines:
    doc_en_sm = nlp_en_sm(line)
    displacy.render(doc_en_sm, style="ent", jupyter=True)
    doc_en_md = nlp_en_md(line)
    displacy.render(doc_en_md, style="ent", jupyter=True)
    doc_en_lg = nlp_en_lg(line)
    displacy.render(doc_en_lg, style="ent", jupyter=True)



### Dependencies and POS

In [44]:
for line in lines:
    doc_en_sm = nlp_en_sm(line)
    displacy.render(doc_en_sm, style="dep", jupyter=True, options={'compact':True})
    doc_en_md = nlp_en_md(line)
    displacy.render(doc_en_md, style="dep", jupyter=True, options={'compact':True})
    doc_en_lg = nlp_en_lg(line)
    displacy.render(doc_en_lg, style="dep", jupyter=True, options={'compact':True})

## Movie review

In [51]:
file_name = '../resources/movie-review.txt'
with open(file_name) as f:
    lines = f.readlines()

# Drop blank lines; rstrip is safe even if the last line has no trailing newline
lines = [line.rstrip("\n") for line in lines if line != "\n"]
print(lines[0])

Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.


### Entities

In [47]:
for line in lines:
    doc_en_sm = nlp_en_sm(line)
    displacy.render(doc_en_sm, style="ent", jupyter=True)
    doc_en_md = nlp_en_md(line)
    displacy.render(doc_en_md, style="ent", jupyter=True)
    doc_en_lg = nlp_en_lg(line)
    displacy.render(doc_en_lg, style="ent", jupyter=True)

In [49]:
for line in lines:
    doc_en_sm = nlp_en_sm(line)
    sents_en_sm = list(doc_en_sm.sents)
    displacy.render(sents_en_sm, style="ent", jupyter=True)
    doc_en_md = nlp_en_md(line)
    sents_en_md = list(doc_en_md.sents)
    displacy.render(sents_en_md, style="ent", jupyter=True)
    doc_en_lg = nlp_en_lg(line)
    sents_en_lg = list(doc_en_lg.sents)
    displacy.render(sents_en_lg, style="ent", jupyter=True)



### Dependencies and POS

In [48]:
for line in lines:
    doc_en_sm = nlp_en_sm(line)
    displacy.render(doc_en_sm, style="dep", jupyter=True, options={'compact':True})
    doc_en_md = nlp_en_md(line)
    displacy.render(doc_en_md, style="dep", jupyter=True, options={'compact':True})
    doc_en_lg = nlp_en_lg(line)
    displacy.render(doc_en_lg, style="dep", jupyter=True, options={'compact':True})

In [50]:
for line in lines:
    doc_en_sm = nlp_en_sm(line)
    sents_en_sm = list(doc_en_sm.sents)
    displacy.render(sents_en_sm, style="dep", jupyter=True)
    doc_en_md = nlp_en_md(line)
    sents_en_md = list(doc_en_md.sents)
    displacy.render(sents_en_md, style="dep", jupyter=True)
    doc_en_lg = nlp_en_lg(line)
    sents_en_lg = list(doc_en_lg.sents)
    displacy.render(sents_en_lg, style="dep", jupyter=True)