### An Introduction to Part of Speech Tagging, Treebanks and Universal Dependencies

# Imports

In [None]:
!pip install stanza

import zipfile
from bs4 import BeautifulSoup
import stanza
from stanza.utils.conll import CoNLL
import pandas as pd
import numpy as np

Collecting stanza
  Downloading stanza-1.3.0-py3-none-any.whl (432 kB)
Collecting emoji
  Downloading emoji-1.6.1.tar.gz (170 kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.6.1-py3-none-any.whl size=169314 sha256=60f3e45786d7be5e178416684623577ac181d3e6df2e6c034467375377ea4d16
  Stored in directory: /root/.cache/pip/wheels/ea/5f/d3/03d313ddb3c2a1a427bb4690f1621eea60fe6f2a30cc95940f
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully installed emoji-1.6.1 stanza-1.3.0


# Text read-in and Stanza

In [None]:
## Get texts from my github
!wget https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/gibbon_by_paragraph.csv ## CSV of the Gibbon text divided into paragraphs
!wget https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/gibbonfortm.xml ## Original XML file

--2021-12-22 21:25:38--  https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/gibbon_by_paragraph.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6471147 (6.2M) [text/plain]
Saving to: ‘gibbon_by_paragraph.csv’


2021-12-22 21:25:38 (217 MB/s) - ‘gibbon_by_paragraph.csv’ saved [6471147/6471147]

--2021-12-22 21:25:38--  https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/gibbonfortm.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10542769 (10M) [text/plain]
Saving to: ‘gibbonf

In [None]:
gibbon_text = pd.read_csv('/content/gibbon_by_paragraph.csv', index_col=0) ## The Decline and Fall in paragraphs
gibbon_text 

Unnamed: 0,StringText
Paragraph 1,"In the second century of the Christian era, th..."
Paragraph 2,The principal conquests of the Romans were ach...
Paragraph 3,"His generals, in the early part of his reign, ..."
Paragraph 4,"Happily for the repose of mankind, the moderat..."
Paragraph 5,The only accession which the Roman empire rece...
...,...
Paragraph 2172,The abolition at Rome of the ancient games mus...
Paragraph 2173,"This use of the amphitheatre was a rare, perha..."
Paragraph 2174,When Petrarch first gratified his eyes with a ...
Paragraph 2175,But the clouds of Barbarism were gradually dis...


In [None]:
stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2021-12-22 21:25:39 INFO: Downloading default packages for language: en (English)...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2021-12-22 21:26:00 INFO: Finished downloading models and saved to /root/stanza_resources.
2021-12-22 21:26:00 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |
| depparse  | combined |

2021-12-22 21:26:00 INFO: Use device: cpu
2021-12-22 21:26:00 INFO: Loading: tokenize
2021-12-22 21:26:00 INFO: Loading: pos
2021-12-22 21:26:00 INFO: Loading: lemma
2021-12-22 21:26:00 INFO: Loading: depparse
2021-12-22 21:26:01 INFO: Done loading processors!


In [None]:
gibbon_text['StringText'].iloc[2] ## .iloc selects by position: the third row, 'Paragraph 3'

'His generals, in the early part of his reign, attempted the reduction of Aethiopia and Arabia Felix. They marched near a thousand miles to the south of the tropic; but the heat of the climate soon repelled the invaders and protected the unwarlike natives of those sequestered regions. The northern  countries of Europe scarcely deserved the expense and labour of conquest. The forests and morasses of Germany were filled with a hardy race of barbarians, who despised life when it was separated from freedom; and though, on the first attack, they seemed to yield to the weight of the Roman power, they soon, by a signal act of despair, regained their independence, and reminded Augustus of the vicissitude of fortune. On the death of that emperor his testament was publicly read in the senate. He bequeathed, as a valuable legacy to his successors, the advice of confining the empire within those limits which nature seemed to have placed as its permanent bulwarks and boundaries; on the west the Atl

In [None]:
doc_p2 = nlp(gibbon_text['StringText'].iloc[2]) ## position 2 is the third row, 'Paragraph 3'

##  This is a 'Document' object. 
 

*   ```
# doc.text
    ```
returns the raw text of the document

*   ```
# doc.sentences
    ```
returns each sentence of the document as Sentence objects

*   ```
# doc.entities
    ```
returns a list of named entities (empty unless the pipeline includes the `ner` processor, which the one above does not)

*   ```
# doc.num_tokens or doc.num_words
    ```
returns the number of tokens or words in the document




In [None]:
doc_p2

## This is a 'Sentence' object.

In [None]:
first_sentence = doc_p2.sentences[0]
first_sentence.text ## Again, returns the text.

'His generals, in the early part of his reign, attempted the reduction of Aethiopia and Arabia Felix.'

In [None]:
first_word = first_sentence.words[0]
first_word

{
  "id": 1,
  "text": "His",
  "lemma": "his",
  "upos": "PRON",
  "xpos": "PRP$",
  "feats": "Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs",
  "head": 2,
  "deprel": "nmod:poss",
  "start_char": 0,
  "end_char": 3
}

In [None]:
first_token = first_sentence.tokens[0]
first_token

[
  {
    "id": 1,
    "text": "His",
    "lemma": "his",
    "upos": "PRON",
    "xpos": "PRP$",
    "feats": "Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs",
    "head": 2,
    "deprel": "nmod:poss",
    "start_char": 0,
    "end_char": 3
  }
]

## This is a 'Word' object
Each value represents a grammatical category of language. Always be aware of which language model you are using, since the tags it assigns depend on the data it was trained on.

In [None]:
first_word.id, first_word.text, first_word.lemma ## The word's ID, the text and the 'lemma' (dictionary form)

(1, 'His', 'his')
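
The `feats` value shown in the Word output packs several morphological features into one string, with `|` between features and `=` between each name and its value. Here is a minimal sketch of unpacking it into a dictionary. `parse_feats` is a hypothetical helper written for this illustration, not part of Stanza, and the feature string is copied from the output for 'His'.

```python
def parse_feats(feats):
    """Split a 'Name=Value|Name=Value' feature string into a dict."""
    # Assumption: a word without features carries None (or '_' in a CoNLL-U file)
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Feature string copied from the Word output for 'His'
his_feats = parse_feats("Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs")
print(his_feats["PronType"], his_feats["Number"])
```

With a real Word object you could call `parse_feats(first_word.feats)` and then filter words by a single feature, such as `Number`.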

In [None]:
## Part of Speech tagging (you may see this online as just POS) for the word: UPOS
print(first_word.upos) ## Returns a very general part of speech description. Here, the word 'His' is identified as a pronoun.

print('\n')

print(first_sentence.words[1].upos) ## Here the second word 'generals' is identified as a noun. 

print('\n')

for word in first_sentence.words: ## In fact we can iterate through the whole list.
  print(word.upos)

PRON


NOUN


PRON
NOUN
PUNCT
ADP
DET
ADJ
NOUN
ADP
PRON
NOUN
PUNCT
VERB
DET
NOUN
ADP
PROPN
CCONJ
PROPN
PROPN
PUNCT


In [None]:
## XPOS
print(first_word.xpos) ## Returns a more specific, language-dependent part of speech tag. Here, the word 'His' is tagged 'PRP$' (possessive pronoun).

print('\n')

## These part of speech tags provide a more detailed look at the grammatical structure. Here is a link from the University of Pennsylvania 
## that shows what each abbreviation stands for: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html,
## and, if you are interested, here is the original paper describing the project: https://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports.

for word in first_sentence.words:
  print(word.xpos)

PRP$


PRP$
NNS
,
IN
DT
JJ
NN
IN
PRP$
NN
,
VBD
DT
NN
IN
NNP
CC
NNP
NNP
.


In [None]:
print(first_sentence.text)
print('\n')

for word in first_sentence.words:
  print(f'{word.text} ({word.lemma}):\t{word.upos}, {word.xpos}')

His generals, in the early part of his reign, attempted the reduction of Aethiopia and Arabia Felix.


His (his):	PRON, PRP$
generals (general):	NOUN, NNS
, (,):	PUNCT, ,
in (in):	ADP, IN
the (the):	DET, DT
early (early):	ADJ, JJ
part (part):	NOUN, NN
of (of):	ADP, IN
his (he):	PRON, PRP$
reign (reign):	NOUN, NN
, (,):	PUNCT, ,
attempted (attempt):	VERB, VBD
the (the):	DET, DT
reduction (reduction):	NOUN, NN
of (of):	ADP, IN
Aethiopia (Aethiopia):	PROPN, NNP
and (and):	CCONJ, CC
Arabia (Arabia):	PROPN, NNP
Felix (Felix):	PROPN, NNP
. (.):	PUNCT, .
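
To see how the two tag sets relate, we can group the XPOS tags under the UPOS tag they appeared with. The sketch below uses plain tuples copied from the output above so it runs without the Stanza models. The name `xpos_by_upos` is introduced only for this illustration.

```python
from collections import defaultdict

# (text, upos, xpos) triples copied from the output for the first sentence
tagged = [
    ("His", "PRON", "PRP$"), ("generals", "NOUN", "NNS"), (",", "PUNCT", ","),
    ("in", "ADP", "IN"), ("the", "DET", "DT"), ("early", "ADJ", "JJ"),
    ("part", "NOUN", "NN"), ("of", "ADP", "IN"), ("his", "PRON", "PRP$"),
    ("reign", "NOUN", "NN"), (",", "PUNCT", ","), ("attempted", "VERB", "VBD"),
    ("the", "DET", "DT"), ("reduction", "NOUN", "NN"), ("of", "ADP", "IN"),
    ("Aethiopia", "PROPN", "NNP"), ("and", "CCONJ", "CC"), ("Arabia", "PROPN", "NNP"),
    ("Felix", "PROPN", "NNP"), (".", "PUNCT", "."),
]

# Collect every XPOS tag seen under each UPOS tag
xpos_by_upos = defaultdict(set)
for text, upos, xpos in tagged:
    xpos_by_upos[upos].add(xpos)

for upos in sorted(xpos_by_upos):
    print(upos, sorted(xpos_by_upos[upos]))
```

In this sentence the single UPOS tag NOUN covers both NN (singular) and NNS (plural), which is the extra detail XPOS provides.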


In [None]:
fifth_sentence = doc_p2.sentences[4]
print(fifth_sentence.text)
print('\n')

for word in fifth_sentence.words:
  print(f'{word.text} ({word.lemma}):\t{word.upos}, {word.xpos}')

On the death of that emperor his testament was publicly read in the senate.


On (on):	ADP, IN
the (the):	DET, DT
death (death):	NOUN, NN
of (of):	ADP, IN
that (that):	DET, DT
emperor (emperor):	NOUN, NN
his (he):	PRON, PRP$
testament (testament):	NOUN, NN
was (be):	AUX, VBD
publicly (publicly):	ADV, RB
read (read):	VERB, VBN
in (in):	ADP, IN
the (the):	DET, DT
senate (senate):	NOUN, NN
. (.):	PUNCT, .


In [None]:
## searching using a lemma and xpos
for sentence in doc_p2.sentences:
  for word in sentence.words:
    if (word.lemma == 'be') and (word.xpos == 'VB'):
      print((word, sentence.text)) 

In [None]:
## compare to above
for sentence in doc_p2.sentences:
  for word in sentence.words:
    if (word.lemma == 'be') and (word.xpos == 'VBD'):
      print((word, sentence.text)) 

({
  "id": 7,
  "text": "were",
  "lemma": "be",
  "upos": "AUX",
  "xpos": "VBD",
  "feats": "Mood=Ind|Tense=Past|VerbForm=Fin",
  "head": 8,
  "deprel": "aux:pass",
  "start_char": 409,
  "end_char": 413
}, 'The forests and morasses of Germany were filled with a hardy race of barbarians, who despised life when it was separated from freedom; and though, on the first attack, they seemed to yield to the weight of the Roman power, they soon, by a signal act of despair, regained their independence, and reminded Augustus of the vicissitude of fortune.')
({
  "id": 21,
  "text": "was",
  "lemma": "be",
  "upos": "AUX",
  "xpos": "VBD",
  "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
  "head": 22,
  "deprel": "aux:pass",
  "start_char": 480,
  "end_char": 483
}, 'The forests and morasses of Germany were filled with a hardy race of barbarians, who despised life when it was separated from freedom; and though, on the first attack, they seemed to yield to the weight of the Ro
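
The two searches above differ only in the tag they test, so the pattern can be wrapped in a small reusable function. This is a sketch, not a Stanza API: `find_words` is a hypothetical helper, and the `SimpleNamespace` objects are a tiny stand-in for a Document so that the cell runs without loading any models.

```python
from types import SimpleNamespace

def find_words(doc, **criteria):
    """Yield (word, sentence) pairs where every given attribute matches."""
    for sentence in doc.sentences:
        for word in sentence.words:
            if all(getattr(word, name, None) == value for name, value in criteria.items()):
                yield word, sentence

# Stand-in for a Stanza Document (an assumption made so the example is self-contained)
words = [
    SimpleNamespace(text="his", lemma="he", xpos="PRP$"),
    SimpleNamespace(text="testament", lemma="testament", xpos="NN"),
    SimpleNamespace(text="was", lemma="be", xpos="VBD"),
    SimpleNamespace(text="read", lemma="read", xpos="VBN"),
]
toy_doc = SimpleNamespace(sentences=[SimpleNamespace(text="his testament was read", words=words)])

for word, sentence in find_words(toy_doc, lemma="be", xpos="VBD"):
    print(word.text, "|", sentence.text)
```

With the real pipeline, `find_words(doc_p2, lemma='be', xpos='VBD')` should visit the same words as the loop above, because it only reads the `sentences`, `words`, `lemma` and `xpos` attributes.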

## An introduction to Universal Dependencies

In [None]:
## Dependency Relation or .deprel
print(first_word.deprel) ## Returns the Universal Dependencies relation between this word and its head. Here, the word 'His' is labeled 'nmod:poss'.

## This means it is a nominal modifier and possessive. These Universal Dependencies labels correspond to agreed-upon categories for parsing
## grammatical structure in text. Look into the English Universal Dependencies documentation: https://universaldependencies.org/en/dep/index.html.
## The greatest value of Universal Dependencies is that they express detail about the word as well as about the word which it modifies, its so-called 'head'.
## We can visualize the sentence as a tree of modifications.

## Stanza's 'depparse' processor creates a dependency tree for each sentence in the Document object. The model
## is trained on treebanks of thousands of hand-annotated and verified sentences. From this data, it can generate
## dependency trees for new text. 

nmod:poss


In [None]:
## .head gives the id of the word which our current word directly modifies. Ids start at 1; a head of 0 marks the root of the sentence.
print(first_sentence.text)
print('\n')

for word in first_sentence.words:
  head_text = first_sentence.words[word.head-1].text if word.head > 0 else 'ROOT' ## without this check, head 0 would index words[-1], the final '.'
  print(f'{word.text} ({word.deprel}, {word.id}):\t {head_text} ({word.head})')

His generals, in the early part of his reign, attempted the reduction of Aethiopia and Arabia Felix.


His (nmod:poss, 1):	 generals (2)
generals (nsubj, 2):	 attempted (12)
, (punct, 3):	 generals (2)
in (case, 4):	 part (7)
the (det, 5):	 part (7)
early (amod, 6):	 part (7)
part (obl, 7):	 attempted (12)
of (case, 8):	 reign (10)
his (nmod:poss, 9):	 reign (10)
reign (nmod, 10):	 part (7)
, (punct, 11):	 generals (2)
attempted (root, 12):	 ROOT (0)
the (det, 13):	 reduction (14)
reduction (obj, 14):	 attempted (12)
of (case, 15):	 Aethiopia (16)
Aethiopia (nmod, 16):	 reduction (14)
and (cc, 17):	 Arabia (18)
Arabia (conj, 18):	 Aethiopia (16)
Felix (conj, 19):	 Aethiopia (16)
. (punct, 20):	 attempted (12)
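
The id and head columns above are enough to rebuild the tree. The sketch below copies them into plain tuples, groups each word under its head, and prints the sentence as an indented outline. The names `rows`, `children` and `print_tree` are introduced only for this illustration.

```python
from collections import defaultdict

# (id, text, head, deprel) rows copied from the output above; head 0 marks the root
rows = [
    (1, "His", 2, "nmod:poss"), (2, "generals", 12, "nsubj"), (3, ",", 2, "punct"),
    (4, "in", 7, "case"), (5, "the", 7, "det"), (6, "early", 7, "amod"),
    (7, "part", 12, "obl"), (8, "of", 10, "case"), (9, "his", 10, "nmod:poss"),
    (10, "reign", 7, "nmod"), (11, ",", 2, "punct"), (12, "attempted", 0, "root"),
    (13, "the", 14, "det"), (14, "reduction", 12, "obj"), (15, "of", 16, "case"),
    (16, "Aethiopia", 14, "nmod"), (17, "and", 18, "cc"), (18, "Arabia", 16, "conj"),
    (19, "Felix", 16, "conj"), (20, ".", 12, "punct"),
]

text_by_id = {word_id: text for word_id, text, _, _ in rows}
deprel_by_id = {word_id: deprel for word_id, _, _, deprel in rows}

# Map each head id to the ids of the words that modify it
children = defaultdict(list)
for word_id, _, head, _ in rows:
    children[head].append(word_id)

def print_tree(word_id, depth=0):
    """Print a word, then each of its dependents indented one level deeper."""
    print("  " * depth + f"{text_by_id[word_id]} ({deprel_by_id[word_id]})")
    for child in children[word_id]:
        print_tree(child, depth + 1)

for root_id in children[0]:  # the words whose head is 0
    print_tree(root_id)
```

Looking words up by id in a dictionary, rather than by list position, avoids the off-by-one step (`word.head-1`) and never wraps around to the last word when the head is 0.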


In [None]:
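## The main verb of the sentence is generally the root, although this can vary from language to language. 
## In English, for instance, the root of a sentence containing the verb 'to be' depends on usage: as an auxiliary ('was read'),
## 'be' attaches to the main verb; as a copula ('he was emperor'), the root is the predicate noun or adjective.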

In [None]:
## With universal dependencies, we can search for words with certain 
## modifiers or that are modifying other words

## Find which words directly modify the subject of the verb (the word labeled 'nsubj')
for word in first_sentence.words:
  if first_sentence.words[word.head-1].deprel == 'nsubj':
    print(word.text, word.upos)

## Try this out by changing 'nsubj' to another tag, such as 'root'.

His PRON
, PUNCT
, PUNCT


In [None]:
def findRootDep(stanza_doc):
  ## For each sentence, print the words that directly modify the root, then the root itself.
  for sentence in stanza_doc.sentences:
    root = next(word for word in sentence.words if word.head == 0) ## a head of 0 marks the root
    for word in sentence.words:
      if (word.head == root.id) and (word.upos != 'PUNCT'):
        print(word.text, word.upos)
    print(root.text, root.upos)
    print('\n')

In [None]:
findRootDep(doc_p2)

generals NOUN
part NOUN
reduction NOUN
attempted VERB


They PRON
miles NOUN
south NOUN
repelled VERB
marched VERB


countries NOUN
scarcely ADV
expense NOUN
deserved VERB


forests NOUN
were AUX
race NOUN
seemed VERB
regained VERB
filled VERB


death NOUN
testament NOUN
was AUX
publicly ADV
senate NOUN
read VERB


He PRON
legacy NOUN
bequeathed VERB




In [None]:
## We can also pair each word with the word it modifies (its head), leaving out punctuation and words attached directly to the root

for word in first_sentence.words:
  if (word.head > 0) and (first_sentence.words[word.head-1].deprel != 'root') and (word.upos != 'PUNCT'): ## head 0 is the root itself, which has no head
    print((word.text, first_sentence.words[word.head-1].text))

('His', 'generals')
('in', 'part')
('the', 'part')
('early', 'part')
('of', 'reign')
('his', 'reign')
('reign', 'part')
('the', 'reduction')
('of', 'Aethiopia')
('Aethiopia', 'reduction')
('and', 'Arabia')
('Arabia', 'Aethiopia')
('Felix', 'Aethiopia')


In [None]:
first_sentence.dependencies ## Lists the sentence's grammatical 'tree' as (head, relation, dependent) triples. Let's explore what that means.

[({
    "id": 2,
    "text": "generals",
    "lemma": "general",
    "upos": "NOUN",
    "xpos": "NNS",
    "feats": "Number=Plur",
    "head": 12,
    "deprel": "nsubj",
    "start_char": 4,
    "end_char": 12
  }, 'nmod:poss', {
    "id": 1,
    "text": "His",
    "lemma": "his",
    "upos": "PRON",
    "xpos": "PRP$",
    "feats": "Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs",
    "head": 2,
    "deprel": "nmod:poss",
    "start_char": 0,
    "end_char": 3
  }), ({
    "id": 12,
    "text": "attempted",
    "lemma": "attempt",
    "upos": "VERB",
    "xpos": "VBD",
    "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
    "head": 0,
    "deprel": "root",
    "start_char": 46,
    "end_char": 55
  }, 'nsubj', {
    "id": 2,
    "text": "generals",
    "lemma": "general",
    "upos": "NOUN",
    "xpos": "NNS",
    "feats": "Number=Plur",
    "head": 12,
    "deprel": "nsubj",
    "start_char": 4,
    "end_char": 12
  }), ({
    "id": 2,
    "text": "generals

In [None]:
for dep in first_sentence.dependencies:
  print(dep)
  print('\n')

({
  "id": 2,
  "text": "generals",
  "lemma": "general",
  "upos": "NOUN",
  "xpos": "NNS",
  "feats": "Number=Plur",
  "head": 12,
  "deprel": "nsubj",
  "start_char": 4,
  "end_char": 12
}, 'nmod:poss', {
  "id": 1,
  "text": "His",
  "lemma": "his",
  "upos": "PRON",
  "xpos": "PRP$",
  "feats": "Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs",
  "head": 2,
  "deprel": "nmod:poss",
  "start_char": 0,
  "end_char": 3
})


({
  "id": 12,
  "text": "attempted",
  "lemma": "attempt",
  "upos": "VERB",
  "xpos": "VBD",
  "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
  "head": 0,
  "deprel": "root",
  "start_char": 46,
  "end_char": 55
}, 'nsubj', {
  "id": 2,
  "text": "generals",
  "lemma": "general",
  "upos": "NOUN",
  "xpos": "NNS",
  "feats": "Number=Plur",
  "head": 12,
  "deprel": "nsubj",
  "start_char": 4,
  "end_char": 12
})


({
  "id": 2,
  "text": "generals",
  "lemma": "general",
  "upos": "NOUN",
  "xpos": "NNS",
  "feats": "Number=Plur",
  "he

In [None]:
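## As above, we can search the dependency triples for a relation. Here we find the possessives ('nmod:poss');
## can you change the tag to find all of the passive verbs (for example, with 'aux:pass')?

for dep in first_sentence.dependencies:
  if (dep[1] == 'nmod:poss'):
    for word in dep:
      if not isinstance(word, str): ## skip the relation label, keep the two Word objects
        print(word.text)
    print('\n')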

generals
His


reign
his




In [None]:
## This analysis is very scalable but always be sure to understand what your Document object is.
## Here, I have moved up a rung and taken a whole chapter as my Document.

with open('/content/gibbonfortm.xml') as gibbon: ## 'with' closes the file once it has been parsed
  soup_chapter = BeautifulSoup(gibbon, 'lxml') ## This is the same code as what I showed you on 10/13.

chapter_dict = {}

## Find the chapter divisions once, then store each chapter's text under 'Chapter N'
chapter_divs = soup_chapter.find_all('div', attrs={"type": "textpart", "subtype": "chaptertext"})
for i, chapter in enumerate(chapter_divs):
  chapter_dict[f"Chapter {i+1}"] = chapter.get_text()

In [None]:
doc_ch_1 = nlp(chapter_dict['Chapter 1']) ## This line will take the longest, as we must convert the string text to a Stanza doc, not more than a couple minutes though.

In [None]:
## Let's look at all of the proper nouns that are modified by adjectives.
## We will return to this example when we discuss sentiment analysis.

for sentence in doc_ch_1.sentences:
  for word in sentence.words:
    if (sentence.words[word.head-1].xpos == 'NNP') and (word.xpos == 'JJ'): ## NNP marks singular proper nouns; JJ marks adjectives
      print((word.text, sentence.words[word.head-1].text))

('Roman', 'senate')
('virtuous', 'Agricola')
('Lower', 'Danube')
('modern', 'Europe')
('ancient', 'Baetica')
('Ancient', 'Gaul')
('modern', 'France')
('Celtic', 'Gaul')
('Lower', 'Germany')
('Ottoman', 'Porte')
('ancient', 'Greece')
('Achaean', 'league')
('Roman', 'Asia')
('genuine', 'Mauritania')


In [None]:
def getProperNounChunk(doc):
  bigram_list = []

  for sentence in doc.sentences:
    for word in sentence.words:
      if (sentence.words[word.head-1].xpos == 'NNP') and (word.xpos == 'JJ'):
        bigram_list.append((word.text, sentence.words[word.head-1].text))
  return bigram_list

In [None]:
doc_ch_71 = nlp(chapter_dict['Chapter 71'])
getProperNounChunk(doc_ch_71)

[('ancient', 'Rome'),
 ('noble', 'Annibaldi'),
 ('venerable', 'Bede'),
 ('fair', 'Jacova'),
 ('ancient', 'Rome')]

In [None]:
for bigram in first_sentence.dependencies: ## We can use dependencies to search for Word objects.
  for dep in bigram:
    if (type(dep) != str) and (dep.text == 'principal'):
      print(dep.text,first_sentence.words[dep.head-1].text)    

## Let's brainstorm some grammatical units that we would want to isolate.
(some ideas)

*   Noun chunks
*   Prepositional phrases



# Bigram comparison

In [None]:
def findBigrams(doc, search_word):
  bigram_dict = {}
  bigram_list = []

  for sentence in doc.sentences:  
    for bigram in sentence.dependencies:
      for dep in bigram:
        if (type(dep) != str) and (dep.text == search_word):
          bigram_list.append((bigram[0].text,bigram[2].text))

  for bigram in bigram_list:
    if bigram in bigram_dict:
      bigram_dict[bigram] = bigram_dict[bigram] + 1
    else:
      bigram_dict[bigram] = 1

  return bigram_dict, bigram_list    
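
The counting loop at the end of `findBigrams` can also be written with `collections.Counter` from the standard library, which counts hashable items such as tuples. The pairs below are hypothetical and stand in for `bigram_list`.

```python
from collections import Counter

# Hypothetical (head, dependent) pairs standing in for findBigrams' bigram_list
bigram_list = [("power", "Roman"), ("empire", "Roman"), ("power", "Roman")]

# Counter builds the same bigram -> count mapping as the if/else loop
bigram_counts = Counter(bigram_list)
print(bigram_counts.most_common(2))
```

A `Counter` is a dictionary subclass, so it works wherever `bigram_dict` is used, and it returns 0 for a bigram it has never seen instead of raising a `KeyError`.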

In [None]:
!wget https://raw.githubusercontent.com/gregorycrane/DHFall2021/master/texts/hume/hume-all.txt

--2021-12-22 21:27:38--  https://raw.githubusercontent.com/gregorycrane/DHFall2021/master/texts/hume/hume-all.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6894911 (6.6M) [text/plain]
Saving to: ‘hume-all.txt’


2021-12-22 21:27:39 (310 MB/s) - ‘hume-all.txt’ saved [6894911/6894911]



In [None]:
# hume_full = open('/content/hume-all.txt')
# hume = [] 

# for line in hume_full:
#   for word in line.replace('.', ' ').replace(',', ' ').split(' '):
#     if len(hume) <= 1000000:
#       hume.append(word) 

# hume_string = ' '.join(hume)

# hume_doc = nlp(hume_string) ## Will take 28 minutes
# CoNLL.write_doc2conll(hume_doc, "hume_output.conllu") ## exports data so we don't need to run this cell every time

In [None]:
!git clone https://github.com/pnadelofficial/FallDHCourseMaterials

Cloning into 'FallDHCourseMaterials'...
remote: Enumerating objects: 420, done.
remote: Counting objects: 100% (420/420), done.
remote: Compressing objects: 100% (417/417), done.
remote: Total 420 (delta 161), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (420/420), 73.15 MiB | 5.53 MiB/s, done.
Resolving deltas: 100% (161/161), done.
Checking out files: 100% (158/158), done.


In [None]:
with zipfile.ZipFile('/content/FallDHCourseMaterials/stanza_data/hume.zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

In [None]:
hume_doc = CoNLL.conll2doc('/content/hume/hume_output.conllu')
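
The `.conllu` file loaded here is plain text in the CoNLL-U format, where each word occupies one line of ten tab-separated columns. A minimal sketch of reading one such line by hand follows. `parse_conllu_line` is a hypothetical helper, and the sample line is built from the Word output for 'His' rather than copied from `hume_output.conllu`.

```python
# The ten CoNLL-U columns, in order
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

def parse_conllu_line(line):
    """Turn one tab-separated CoNLL-U word line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))

# Illustrative line (an assumption, not real file content); '_' marks an empty column
line = "1\tHis\this\tPRON\tPRP$\tGender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs\t2\tnmod:poss\t_\t_"
row = parse_conllu_line(line)
print(row["form"], row["upos"], row["head"], row["deprel"])
```

Every value comes back as a string, so `row["head"]` would need `int(...)` before being used as an index.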

In [None]:
# gibbon = []

# for chapter in soup_chapter.find_all('div', attrs={"type": "textpart", "subtype": "chaptertext"}):
#   for word in chapter.get_text().split(' '):
#     if len(gibbon) <= 1000000:
#       gibbon.append(word)

# len(gibbon)
# gibbon_string = ' '.join(gibbon) 

# gibbon_doc = nlp(gibbon_string) ## Will take 35 minutes
# CoNLL.write_doc2conll(gibbon_doc, "gibbon_output.conllu") ## exports data so we don't need to run this cell every time

In [None]:
with zipfile.ZipFile('/content/FallDHCourseMaterials/stanza_data/gibbon.zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

In [None]:
gibbon_doc = CoNLL.conll2doc('/content/gibbon/gibbon_output.conllu')

In [None]:
def compareGibbontoHume(gibbon, hume, search_word):
  gib_dict, gib_list = findBigrams(gibbon, search_word)
  hume_dict, hume_list = findBigrams(hume, search_word)

  print(f'There are {sum(gib_dict.values())} instance(s) in Gibbon and {sum(hume_dict.values())} instance(s) in Hume.') 
  print('\n')

  print('Comparison:')
  print('Gibbon', 'Hume', 'Bigram',sep='\t')
  for tup in sorted(gib_dict,key=gib_dict.get,reverse=True):    
    hume_val = hume_dict.get(tup, 0) ## 0 if the bigram never occurs in Hume
    print(gib_dict[tup], hume_val, tup,sep='\t')

In [None]:
###########################################
## Search Terms from reading ##
  # searchs = 'balance' #Parker, p. 167
  # searchs = 'temperate' #Parker, p. 167
  # searchs = 'philosophic' #Parker, p. 168
  # searchs = 'impartial'  #Parker, p. 168
## Among others ##
###########################################

## (⊃｡•́‿•̀｡)⊃━☆ﾟ.*･｡ﾟ Interactive function (◕‿◕✿) ##
search_term = input("Enter search term:")
compareGibbontoHume(gibbon_doc, hume_doc, search_term)

Enter search term:e
There are 0 instance(s) in Gibbon and 13278 instance(s) in Hume.


Comparison:
Gibbon	Hume	Bigram


## For next week: Consider how you can improve this model of comparison


*   Case-sensitivity
*   Remove punctuation
*   Add other functionality from Dr. Crane's code (e.g., Hume to Gibbon, not just Gibbon to Hume)
*   How can we generate more relevant search terms?
*   Statistical relevance
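
One possible starting point for the first three items is sketched below, under stated assumptions: the input is a list of (head, dependent) text pairs like the second value returned by `findBigrams`, and the pairs shown are made up for illustration. `normalize_bigrams` and `compare_counts` are hypothetical names, not part of the notebook or of Dr. Crane's code.

```python
import string
from collections import Counter

def normalize_bigrams(bigram_list):
    """Lowercase both words and drop pairs in which either word is only punctuation."""
    counts = Counter()
    for head, dependent in bigram_list:
        head, dependent = head.lower(), dependent.lower()
        if head.strip(string.punctuation) and dependent.strip(string.punctuation):
            counts[(head, dependent)] += 1
    return counts

def compare_counts(counts_a, counts_b):
    """Return (bigram, count_a, count_b) rows for every bigram seen in either text."""
    all_bigrams = set(counts_a) | set(counts_b)
    rows = [(bigram, counts_a[bigram], counts_b[bigram]) for bigram in all_bigrams]
    # Most frequent overall first; ties broken alphabetically
    return sorted(rows, key=lambda row: (-(row[1] + row[2]), row[0]))

# Hypothetical bigram lists standing in for the Gibbon and Hume results
gibbon_pairs = [("Power", "Roman"), ("power", "Roman"), ("power", ",")]
hume_pairs = [("power", "royal"), ("power", "roman")]

gibbon_counts = normalize_bigrams(gibbon_pairs)
hume_counts = normalize_bigrams(hume_pairs)
for bigram, gib_n, hume_n in compare_counts(gibbon_counts, hume_counts):
    print(gib_n, hume_n, bigram, sep="\t")
```

Taking the union of both sets of bigrams makes the comparison symmetric, so bigrams that occur only in Hume are listed too. Raw counts are still not comparable across texts of different lengths, which is where the question of statistical relevance comes in.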

