# Parsing & POS application 

```
Plan : 
  1. Parts of Speech (POS) Tagging
  2. Shallow Parsing or Chunking
  3. Constituency Parsing
  4. Dependency Parsing
  5. Application of POS (Named Entity Recognition)
```

## 1. Parts of Speech (POS) Tagging


1. Use spaCy and NLTK and compare the results.


**More details in next lab**

In [None]:
import nltk
from nltk import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [None]:
sentence = word_tokenize("allow us to add lines in list of allow actions")
print(sentence)
nltk.pos_tag(sentence)

['allow', 'us', 'to', 'add', 'lines', 'in', 'list', 'of', 'allow', 'actions']


[('allow', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('add', 'VB'),
 ('lines', 'NNS'),
 ('in', 'IN'),
 ('list', 'NN'),
 ('of', 'IN'),
 ('allow', 'JJ'),
 ('actions', 'NNS')]
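
To compare two taggers, one simple option is to line up their outputs token by token and list the positions where they disagree. The sketch below assumes both taggers used the same tokenization. `nltk_tags` is copied from the output above, while `other_tags` is a hypothetical second tagging made up for illustration (it is not real spaCy output), and `tag_disagreements` is a helper name introduced here.

```python
def tag_disagreements(tags_a, tags_b):
    """Return (token, tag_a, tag_b) for every position where two taggings differ."""
    if [tok for tok, _ in tags_a] != [tok for tok, _ in tags_b]:
        raise ValueError("both taggings must cover the same tokens")
    return [(tok, a, b) for (tok, a), (_, b) in zip(tags_a, tags_b) if a != b]


# NLTK tags, copied from the cell output above
nltk_tags = [('allow', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('add', 'VB'),
             ('lines', 'NNS'), ('in', 'IN'), ('list', 'NN'), ('of', 'IN'),
             ('allow', 'JJ'), ('actions', 'NNS')]

# Hypothetical output of a second tagger (assumption, for illustration only)
other_tags = list(nltk_tags)
other_tags[8] = ('allow', 'VB')

diffs = tag_disagreements(nltk_tags, other_tags)
agreement = (len(nltk_tags) - len(diffs)) / len(nltk_tags)

for token, tag_a, tag_b in diffs:
    print(f"{token}: {tag_a} vs {tag_b}")
print(f"agreement: {agreement:.0%}")
```

With spaCy, Penn Treebank style tags are available as `token.tag_`, so a list of `(token.text, token.tag_)` pairs could be passed as the second argument, provided the tokens match.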

### 1.1 Try it yourself

Using Python libraries, download the Wikipedia page on a topic of your choice and tag its text with parts of speech.

You can use the Wikipedia-API package to retrieve the page: [`pip install Wikipedia-API`](https://pypi.org/project/Wikipedia-API/)

In [None]:
!pip install wikipedia-API

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia-API
  Downloading Wikipedia_API-0.5.8-py3-none-any.whl (13 kB)
Installing collected packages: wikipedia-API
Successfully installed wikipedia-API-0.5.8


In [None]:
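# based on https://wikipedia-api.readthedocs.io/en/latest/README.html?badge=latest
import wikipediaapi

# This call matches Wikipedia-API 0.5.8 (the version installed above);
# newer releases may also require a user_agent argument.
wiki_wiki = wikipediaapi.Wikipedia('en')
page_py = wiki_wiki.page('Python_(programming_language)')

# tokenize the page text and tag every token
text = page_py.text
tokens = word_tokenize(text)
nltk.pos_tag(tokens)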

ModuleNotFoundError: ignored

## 2. Shallow Parsing or Chunking

Shallow parsing, or chunking, is the process of extracting phrases from unstructured text. Chunking groups adjacent tokens into phrases on the basis of their POS tags. Standard, well-known chunk types include noun phrases, verb phrases, and prepositional phrases.

There are five major categories of phrases: **noun phrase (NP), adjective phrase (ADJP), verb phrase (VP), prepositional phrase (PP), and adverb phrase (ADVP)**.
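
To make "grouping adjacent tokens by their POS tags" concrete, here is a minimal sketch of a greedy noun-phrase chunker in plain Python. The rule (an optional determiner, any number of adjectives, then one or more nouns) and the function name `chunk_noun_phrases` are assumptions chosen for illustration. Real chunkers are usually rule sets or trained models.

```python
def chunk_noun_phrases(tagged):
    """Greedy NP chunker: optional DT, any number of JJ, then one or more NN* tags."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":          # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:                          # at least one noun was found
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:                              # no noun phrase starts here
            i += 1
    return chunks


# Hand-typed (word, tag) pairs for a short sentence fragment
tagged = [("has", "VBZ"), ("helped", "VBN"), ("to", "TO"), ("prevent", "VB"),
          ("a", "DT"), ("freefall", "NN"), ("in", "IN"), ("sterling", "NN"),
          ("over", "IN"), ("the", "DT"), ("past", "JJ"), ("week", "NN"), (".", ".")]

print(chunk_noun_phrases(tagged))
# → ['a freefall', 'sterling', 'the past week']
```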

In [None]:
from nltk.corpus import conll2000
nltk.download('conll2000')

data = conll2000.chunked_sents()
train_data = data[:10900]
test_data = data[10900:] 

print(len(train_data), len(test_data))
print(train_data[1])

[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\conll2000.zip.


10900 48
(S
  Chancellor/NNP
  (PP of/IN)
  (NP the/DT Exchequer/NNP)
  (NP Nigel/NNP Lawson/NNP)
  (NP 's/POS restated/VBN commitment/NN)
  (PP to/TO)
  (NP a/DT firm/NN monetary/JJ policy/NN)
  (VP has/VBZ helped/VBN to/TO prevent/VB)
  (NP a/DT freefall/NN)
  (PP in/IN)
  (NP sterling/NN)
  (PP over/IN)
  (NP the/DT past/JJ week/NN)
  ./.)


## 3. Constituency Parsing

Constituent-based grammars are used to analyze and determine the constituents of a sentence. They model the internal structure of a sentence as a hierarchically ordered structure of its constituents. A constituency parser can be built from such grammars and rules, so the grammar has to be defined first.
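
As a small illustration of what "defining the grammar" means, the sketch below writes a toy context-free grammar by hand and parses one short sentence with NLTK's chart parser. The grammar rules are an assumption made for this example and cover only this sentence. They are not the grammar used by any real parser.

```python
import nltk

# Hand-written toy grammar (assumption: covers only the example sentence)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | NNP NNP
VP -> V NP
Det -> 'a'
N -> 'university'
V -> 'is'
NNP -> 'Innopolis' | 'University'
""")

parser = nltk.ChartParser(grammar)
tokens = "Innopolis University is a university".split()

# parse() yields every tree the grammar allows for the token sequence
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

A sentence containing a word that is missing from the grammar cannot be parsed with it, which is one reason practical parsers rely on large grammars learned from treebanks.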

One popular constituency parsing implementation comes from Stanford: a **probabilistic context-free grammar parser**.

**TODO: Implement an example** <br>
The parser can be downloaded here: `https://nlp.stanford.edu/software/stanford-parser-4.2.0.zip`

Online tutorial : 

In [None]:
# Download and unzip the parser
# !pip install unzip
# !pip install wget
#!wget https://nlp.stanford.edu/software/stanford-parser-4.2.0.zip
!unzip stanford-parser-4.2.0.zip

Archive:  stanford-parser-4.2.0.zip
   creating: stanford-parser-full-2020-11-17/
  inflating: stanford-parser-full-2020-11-17/ejml-simple-0.38.jar  
  inflating: stanford-parser-full-2020-11-17/StanfordDependenciesManual.pdf  
  inflating: stanford-parser-full-2020-11-17/Makefile  
  inflating: stanford-parser-full-2020-11-17/ShiftReduceDemo.java  
  inflating: stanford-parser-full-2020-11-17/slf4j-api-1.7.12-sources.jar  
  inflating: stanford-parser-full-2020-11-17/ejml-core-0.38.jar  
  inflating: stanford-parser-full-2020-11-17/build.xml  
  inflating: stanford-parser-full-2020-11-17/lexparser-gui.sh  
   creating: stanford-parser-full-2020-11-17/data/
 extracting: stanford-parser-full-2020-11-17/data/chinese-onesent-unseg-gb18030.txt  
 extracting: stanford-parser-full-2020-11-17/data/arabic-onesent-utf8.txt  
  inflating: stanford-parser-full-2020-11-17/data/chinese-onesent-unseg-utf8.txt  
  inflating: stanford-parser-full-2020-11-17/data/testsent.txt  
  inflating: stanford-pa

In [None]:
import nltk, os
from nltk.parse.stanford import StanfordParser

os.environ['CLASSPATH'] = '/content/stanford-parser-full-2020-11-17/*'

scp = StanfordParser('/content/stanford-parser-full-2020-11-17/stanford-parser.jar','/content/stanford-parser-full-2020-11-17/stanford-parser-4.2.0-models.jar')

sentence = "Innopolis University is a university located in the city of Innopolis."

result = list(scp.raw_parse(sentence))
print(result[0])

Please use nltk.parse.corenlp.CoreNLPParser instead.
  scp = StanfordParser('/content/stanford-parser-full-2020-11-17/stanford-parser.jar','/content/stanford-parser-full-2020-11-17/stanford-parser-4.2.0-models.jar')


(ROOT
  (S
    (NP (NNP Innopolis) (NNP University))
    (VP
      (VBZ is)
      (NP
        (NP (DT a) (NN university))
        (VP
          (VBN located)
          (PP
            (IN in)
            (NP
              (NP (DT the) (NN city))
              (PP (IN of) (NP (NNP Innopolis))))))))
    (. .)))


In [None]:
## To display in colab
!apt-get install -y xvfb # Install X Virtual Frame Buffer

os.system('Xvfb :1 -screen 0 1600x1200x16  &')    # create virtual display with size 1600x1200 and 16 bit color. Color can be changed to 24 or 8
os.environ['DISPLAY']=':1.0'  

!apt install ghostscript python3-tk


For the meaning of the tags, see [this list of Penn Treebank tags](https://web.archive.org/web/20130517134339/http://bulba.sdsu.edu/jeanette/thesis/PennTags.html).

In [None]:
from IPython.display import display
display(result[0])

ModuleNotFoundError: ignored

Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NNP', ['Innopolis']), Tree('NNP', ['University'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('NN', ['university'])]), Tree('VP', [Tree('VBN', ['located']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['city'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NNP', ['Innopolis'])])])])])])])]), Tree('.', ['.'])])])
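
When a graphical display is not available, the tree can still be inspected as text. The sketch below rebuilds the tree from the bracketed output shown earlier with `Tree.fromstring`, lists its noun phrases, and draws it as ASCII art with `pretty_print()`. It assumes only that NLTK is installed and does not need the Stanford parser.

```python
from nltk import Tree

# Bracketed parse copied from the parser output above
bracketed = """
(ROOT
  (S
    (NP (NNP Innopolis) (NNP University))
    (VP
      (VBZ is)
      (NP
        (NP (DT a) (NN university))
        (VP
          (VBN located)
          (PP
            (IN in)
            (NP
              (NP (DT the) (NN city))
              (PP (IN of) (NP (NNP Innopolis))))))))
    (. .)))
"""

tree = Tree.fromstring(bracketed)

# Collect the words under every NP node, outermost phrases first
noun_phrases = [" ".join(subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)

# Text rendering of the tree; no GUI or Tk installation is needed
tree.pretty_print()
```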

### 3.1 Try it on your sentence

## 4. Dependency Parsing

In dependency parsing, we use dependency-based grammars to analyze and infer both the structure and the semantic dependencies and relationships between the tokens of a sentence.

![](https://files.realpython.com/media/displacy_dependency_parse.de72f9b1d115.png)

Dependency parsing is used in shallow parsing and named entity recognition.
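
A dependency parse can be stored as one arc per token: the token, the index of its head, and the relation label. The sketch below uses hand-annotated arcs for a short sentence. The labels are an assumption for illustration and a real parser may choose different ones. The convention that the root token points to itself follows spaCy, and the `Arc`, `root`, and `children` names are introduced here.

```python
from collections import namedtuple

Arc = namedtuple("Arc", "index text head dep")

# Hand-annotated arcs (assumption, not parser output); the root is its own head
arcs = [
    Arc(0, "Innopolis", 1, "compound"),
    Arc(1, "University", 2, "nsubj"),
    Arc(2, "is", 2, "ROOT"),
    Arc(3, "a", 4, "det"),
    Arc(4, "university", 2, "attr"),
]


def root(arcs):
    """Return the arc whose head is the token itself."""
    return next(arc for arc in arcs if arc.head == arc.index)


def children(arcs, head_index):
    """Return the direct dependents of the token at head_index."""
    return [arc for arc in arcs
            if arc.head == head_index and arc.index != head_index]


for arc in arcs:
    print(f"{arc.text:<12} --{arc.dep}--> {arcs[arc.head].text}")

print("root:", root(arcs).text)
print("dependents of root:", [arc.text for arc in children(arcs, root(arcs).index)])
```

With spaCy, the same triples can be read from a parsed `doc` as `token.text`, `token.head.i`, and `token.dep_`.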



In [9]:
import spacy
import nltk

nlp = spacy.load("en_core_web_sm")

# The first sentence is overwritten by the second one, which is the one rendered below
sentence_nlp = nlp("US unveils world's most powerful supercomputer, beats China")
sentence_nlp = nlp("Innopolis University is a university located in the city of Innopolis.")

In [None]:
from spacy import displacy

displacy.render(sentence_nlp, jupyter=True, 
                options={'distance': 110,
                         'arrow_stroke': 2,
                         'arrow_width': 8})

### 4.1 Try it on your sentence

In [None]:
sentence_nlp = nlp("This is a totally random sentence and most probably has a wrong grammer. I am way too hungry to exist")

displacy.render(sentence_nlp, jupyter=True, 
                options={'distance': 110,
                         'arrow_stroke': 2,
                         'arrow_width': 8})

## 5. Named Entity Recognition

Named Entity Recognition (NER) is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.

In [13]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

### 5.1 Get data

In [None]:
sentence = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'


### Preprocess the data

In [None]:
def preprocess(sent):
  sent = nltk.word_tokenize(sent)
  sent = nltk.pos_tag(sent)
  return sent

sent = preprocess(sentence)
sent

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]

### Define pattern to parse the data

In [None]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)

(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)


### Parse data and visualize

In [None]:
NPChunker = nltk.RegexpParser(pattern) 
result = NPChunker.parse(sent)

In [None]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('European', 'JJ', 'O'),
 ('authorities', 'NNS', 'O'),
 ('fined', 'VBD', 'O'),
 ('Google', 'NNP', 'O'),
 ('a', 'DT', 'B-NP'),
 ('record', 'NN', 'I-NP'),
 ('$', '$', 'O'),
 ('5.1', 'CD', 'O'),
 ('billion', 'CD', 'O'),
 ('on', 'IN', 'O'),
 ('Wednesday', 'NNP', 'O'),
 ('for', 'IN', 'O'),
 ('abusing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('power', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('mobile', 'JJ', 'I-NP'),
 ('phone', 'NN', 'I-NP'),
 ('market', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('company', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('alter', 'VB', 'O'),
 ('its', 'PRP$', 'O'),
 ('practices', 'NNS', 'O')]
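
In the IOB format, `B-` marks the first token of a chunk, `I-` marks a token inside the current chunk, and `O` marks a token outside any chunk. The sketch below groups such triples back into phrases. `iob_to_chunks` is a helper name introduced here and the sample is a slice of the output above.

```python
def iob_to_chunks(iob_tagged):
    """Group (word, pos, iob) triples into (chunk_type, text) pairs."""
    chunks, current_type, current_words = [], None, []

    def flush():
        # Close the chunk being built, if any
        if current_words:
            chunks.append((current_type, " ".join(current_words)))

    for word, _pos, iob in iob_tagged:
        if iob.startswith("I-") and current_type == iob[2:]:
            current_words.append(word)            # continue the current chunk
        elif iob.startswith(("B-", "I-")):
            flush()                               # a new chunk starts here
            current_type, current_words = iob[2:], [word]
        else:                                     # "O": outside any chunk
            flush()
            current_type, current_words = None, []
    flush()
    return chunks


# A slice of the IOB output shown above
sample = [('its', 'PRP$', 'O'), ('power', 'NN', 'B-NP'), ('in', 'IN', 'O'),
          ('the', 'DT', 'B-NP'), ('mobile', 'JJ', 'I-NP'), ('phone', 'NN', 'I-NP'),
          ('market', 'NN', 'B-NP'), ('and', 'CC', 'O')]

print(iob_to_chunks(sample))
# → [('NP', 'power'), ('NP', 'the mobile phone'), ('NP', 'market')]
```

Because `market` carries its own `B-NP` tag, it forms a separate chunk rather than extending `the mobile phone`. This matches the pattern, which allows only a single `<NN>` per noun phrase.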


## NER with spaCy

spaCy's named entity recognition has been trained on the OntoNotes 5 corpus.

In [10]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

### Get data

**en_core_web_sm** : English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer

In [None]:
nlp = en_core_web_sm.load()
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
pprint([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]
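
A quick way to summarize NER output is to count how often each label occurs, which is one possible use for the `Counter` imported earlier. The sketch below works on the entity list printed above, copied in by hand so that it runs without loading a model.

```python
from collections import Counter

# (text, label) pairs copied from the output above
entities = [('European', 'NORP'),
            ('Google', 'ORG'),
            ('$5.1 billion', 'MONEY'),
            ('Wednesday', 'DATE')]

# Count how many entities were found per label
label_counts = Counter(label for _text, label in entities)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

With a parsed `doc`, the same summary can be obtained with `Counter(ent.label_ for ent in doc.ents)`.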


In [None]:
doc.text

'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

## TASK

1. Using Python libraries, download the Wikipedia page on a topic of your choice and apply NER (using NLTK and spaCy).
1. Compare the results and visualize one paragraph with the entities assigned to words (you can try the spaCy visualization tool [`displacy`](https://spacy.io/usage/visualizers)).

In [11]:
# downloading a random wiki page
# based on https://wikipedia-api.readthedocs.io/en/latest/README.html?badge=latest
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')
page_py = wiki_wiki.page('Barack_Obama')

text = page_py.text

# load the spaCy model
nlp = en_core_web_sm.load()

# run the model on the text retrieved from Wikipedia
doc = nlp(text)

# visualize the entities with displaCy
# (the 'distance' and 'arrow_*' options only apply to style="dep", so they are omitted here)
displacy.render(doc, jupyter=True, style="ent")

In [14]:
# NER but using NLTK
# based on https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

#tokenizing and pos
# def preprocess(sent):
#     sent = nltk.word_tokenize(sent)
#     sent = nltk.pos_tag(sent)
#     return sent

# sent = preprocess(text)

# #parsing
# pattern = 'NP: {<DT>?<JJ>*<NN>}'
# cp = nltk.RegexpParser(pattern)
# cs = cp.parse(sent)

ne_tree = ne_chunk(pos_tag(word_tokenize(text)))
print(ne_tree)

(S
  (PERSON Barack/NNP)
  (PERSON Hussein/NNP Obama/NNP II/NNP)
  (/(
  (/(
  listen/VBN
  )/)
  bə-RAHK/JJ
  hoo-SAYN/JJ
  oh-BAH-mə/NN
  ;/:
  born/VBN
  August/NNP
  4/CD
  ,/,
  1961/CD
  )/)
  is/VBZ
  an/DT
  (GPE American/JJ)
  retired/JJ
  politician/NN
  who/WP
  served/VBD
  as/IN
  the/DT
  44th/CD
  president/NN
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  from/IN
  2009/CD
  to/TO
  2017/CD
  ./.
  (PERSON Obama/NNP)
  ,/,
  a/DT
  member/NN
  of/IN
  the/DT
  (ORGANIZATION Democratic/NNP Party/NNP)
  ,/,
  was/VBD
  the/DT
  first/JJ
  African-American/JJ
  president/NN
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  ./.
  He/PRP
  previously/RB
  served/VBD
  as/IN
  a/DT
  (GPE U.S./NNP)
  senator/NN
  from/IN
  (GPE Illinois/NNP)
  from/IN
  2005/CD
  to/TO
  2008/CD
  and/CC
  as/IN
  an/DT
  Illinois/NNP
  state/NN
  senator/NN
  from/IN
  1997/CD
  to/TO
  2004/CD
  ,/,
  and/CC
  previously/RB
  worked/VBN
  as/IN
  a/DT
  civil/JJ
  rights/NNS
  lawyer/NN


In [4]:
!pip install wikipedia-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia-api
  Downloading Wikipedia_API-0.5.8-py3-none-any.whl (13 kB)
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.5.8


In [7]:
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download