In [3]:
%autosave 300
%dirs

Autosaving every 300 seconds


[]

# introduction

spaCy relies on machine learning models that were trained on a large number of carefully labeled texts. These texts were, in fact, often labeled and corrected by hand. This is similar to our topic modeling work from the previous lesson, except that our topic model wasn’t using labeled data.

The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more. Like a lot of other major machine learning projects, OntoNotes was sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.

spaCy offers models for other languages including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.

spaCy also offers language and tokenization support for other languages via external dependencies, such as PyviKonlpy for Korean.

__to download the language model:__

!python -m spacy download en_core_web_sm


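note: spacy.load() did not work in this environment, so the model is imported as a package and loaded with en_core_web_sm.load() below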
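When `spacy.load()` fails, one possible workaround is to import the installed model package directly, which is what the cells below do. The following is a minimal sketch of that fallback, not spaCy's recommended procedure. `load_pipeline` is a hypothetical helper name introduced only for illustration, and it assumes the model was installed as a regular Python package.

```python
import importlib


def load_pipeline(name="en_core_web_sm"):
    """Hypothetical helper: try spacy.load(), then fall back to the model package."""
    try:
        import spacy
        return spacy.load(name)
    except (ImportError, OSError):
        # spacy.load() raises OSError when it cannot find the model.
        # Model packages installed with pip expose their own load() function.
        module = importlib.import_module(name)
        return module.load()
```

If neither route finds the model, the helper lets the `ModuleNotFoundError` from the import propagate, which is usually a sign that the download step above has not been run in the active environment.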

# import package

In [26]:
import spacy
import en_core_web_sm
from spacy import displacy
from collections import Counter
import pandas as pd
from pprint import pprint


In [8]:
import os
os.listdir('data/NYT-Obituaries')

['1945-Adolf-Hitler.txt',
 '1915-F-W-Taylor.txt',
 '1975-Chiang-Kai-shek.txt',
 '1984-Ethel-Merman.txt',
 '1953-Jim-Thorpe.txt',
 '1964-Nella-Larsen.txt',
 '1955-Margaret-Abbott.txt',
 '1984-Lillian-Hellman.txt',
 '1959-Cecil-De-Mille.txt',
 '1928-Mabel-Craty.txt',
 '1973-Eddie-Rickenbacker.txt',
 '1989-Ferdinand-Marcos.txt',
 '1991-Martha-Graham.txt',
 '1997-Deng-Xiaoping.txt',
 '1938-George-E-Hale.txt',
 '1885-Ulysses-Grant.txt',
 '1909-Sarah-Orne-Jewett.txt',
 '1957-Christian-Dior.txt',
 '1987-Clare-Boothe-Luce.txt',
 '1976-Jacques-Monod.txt',
 '1954-Getulio-Vargas.txt',
 '1979-Stan-Kenton.txt',
 '1990-Leonard-Bernstein.txt',
 '1972-Jackie-Robinson.txt',
 '1998-Fred-W-Friendly.txt',
 '1991-Leo-Durocher.txt',
 '1915-B-T-Washington.txt',
 '1997-James-Stewart.txt',
 '1981-Joe-Louis.txt',
 '1983-Muddy-Waters.txt',
 '1942-George-M-Cohan.txt',
 '1989-Samuel-Beckett.txt',
 '1962-Marilyn-Monroe.txt',
 '2000-Charles-M-Schulz.txt',
 '1967-Gregory-Pincus.txt',
 '1894-R-L-Stevenson.txt',
 '1978

# basics

In [42]:
filepath = "data/NYT-Obituaries/1945-Adolf-Hitler.txt"
text = open(filepath, encoding='utf-8').read()


In [43]:
nlp = en_core_web_sm.load()

In [44]:
document = nlp(text)


In [34]:
# number of entities found in the document
len(document.ents)

17

In [35]:
# labels of the entities recognized in the document
[x.label_ for x in document.ents]

['ORG',
 'GPE',
 'NORP',
 'GPE',
 'GPE',
 'GPE',
 'DATE',
 'ORG',
 'PERSON',
 'PERSON',
 'PERSON',
 'PERSON',
 'GPE',
 'NORP',
 'LOC',
 'ORG',
 'GPE']

In [13]:
displacy.render(document, style="ent")


# spacy named entity

Each of the named entities in document.ents contains more information about itself, which we can access by iterating through the document.ents with a simple for loop.

<table>
  <tr><td>
    <img src="https://miro.medium.com/max/1400/1*qQggIPMugLcy-ndJ8X_aAA.png"
         alt="Fashion MNIST sprite"  width="600">
  </td></tr>
</table>

In [14]:
document.ents


(May 2, 1945,
 OBITUARY,
 Hitler Fought Way,
 Adolf Hitler,
 one,
 Austrian,
 Germany,
 Reich,
 Europe,
 Lenin,
 Mussolini,
 Lenin,
 the Empire of the Czars,
 Mussolini,
 Rome,
 Hitler,
 Germany,
 Hohenzollerns,
 Lenin,
 Mussolini,
 Hitler,
 1914-18,
 three,
 Lenin,
 the Russian Revolution,
 Mussolini,
 Socialist,
 Hitler,
 Germans,
 world revolution,
 Mussolini,
 Hitler,
 1918,
 Germany,
 supreme Fuehrer,
 nine,
 Europe,
 millions,
 Germans,
 Fatherland,
 Austria,
 7,000,000,
 More than 2,000,000,
 Germans,
 Sudeten,
 Czechoslovakia,
 10,000,000,
 Czechs,
 Slovaks,
 State,
 Central Europe,
 Nazi,
 more than six years,
 January, 1933,
 Hitler,
 Poland,
 Anglo,
 Czechoslovakia Hitler,
 Poland,
 France,
 England,
 less than a year later,
 Europe,
 one,
 "Mein Kampf",
 Czechoslovakia,
 Munich,
 Prague,
 German,
 first,
 June, 1934,
 Ernst Roehm,
 Nazis,
 Nazi,
 thousands,
 Jews,
 Catholic,
 Nazis,
 Hitler,
 Hitler,
 these Pastor Niemoeller,
 Niemoeller,
 Christianity,
 Nazi,
 State,
 Germ

For each named_entity in document.ents, we extract the entity text and its corresponding label, named_entity.label_.

In [30]:
# for named_entity in document.ents:
#     print(named_entity, named_entity.label_)


pprint([(x.text, x.label_) for x in document.ents])

[('Palace', 'ORG'),
 ('Berlin', 'GPE'),
 ('Nordic', 'NORP'),
 ('Wotan', 'GPE'),
 ('Favorite City', 'GPE'),
 ('Munich', 'GPE'),
 ('1935', 'DATE'),
 ('Prinzregentenstrasse', 'ORG'),
 ('Frau Angella Raubal', 'PERSON'),
 ('Martin Hammisch', 'PERSON'),
 ('Haus Wachenfeld', 'PERSON'),
 ('Hitler', 'PERSON'),
 ('Berchtesgaden', 'GPE'),
 ('Bavarian', 'NORP'),
 ('Alps', 'LOC'),
 ('Fuehrer', 'ORG'),
 ('Austria', 'GPE')]
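Once the (text, label) pairs are available, it can be handy to group the entity texts under each label. The sketch below uses plain Python on a few pairs copied from the output above. `group_by_label` is a hypothetical helper name, not part of spaCy.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the output above.
pairs = [
    ("Berlin", "GPE"),
    ("Nordic", "NORP"),
    ("Munich", "GPE"),
    ("Frau Angella Raubal", "PERSON"),
    ("Hitler", "PERSON"),
    ("Alps", "LOC"),
]


def group_by_label(pairs):
    """Hypothetical helper: map each entity label to the list of texts tagged with it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(pairs)
print(grouped["GPE"])     # texts tagged as geopolitical entities
print(grouped["PERSON"])  # texts tagged as people
```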


In [16]:
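import math

# split the text into roughly 80 equal-sized character chunks
number_of_chunks = 80

chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

# slice the text every chunk_size characters; the last chunk may be shorter
for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)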


In [17]:
chunked_documents = list(nlp.pipe(text_chunks))
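The fixed-size slices above cut the text at arbitrary character positions, so a chunk boundary can fall in the middle of a word or an entity. One possible alternative, sketched here under the assumption that paragraphs are separated by blank lines, is to group whole paragraphs into chunks. `chunk_by_paragraph` and `max_chars` are hypothetical names used only for illustration.

```python
def chunk_by_paragraph(text, max_chars=2000):
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
paragraph_chunks = chunk_by_paragraph(sample, max_chars=40)
print(paragraph_chunks)
```

The resulting list could be passed to nlp.pipe in the same way as text_chunks. The trade-off is that chunks are no longer of equal size.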


## to pandas

In [22]:
org = []

# collect the text of every ORG entity across all chunks
# (the loop variable is chunk_doc so that it does not overwrite `document`)
for chunk_doc in chunked_documents:
    for named_entity in chunk_doc.ents:
        if named_entity.label_ == "ORG":
            org.append(named_entity.text)

# tally how often each organization name appears
org_tally = Counter(org)

df = pd.DataFrame(org_tally.most_common(), columns=['character', 'count'])

In [23]:
df

Unnamed: 0,character,count
0,State,8
1,Fuehrer,5
2,Reichstag,5
3,Hindenburg,3
4,the Iron Cross,2
...,...,...
83,Siegfried Wagner,1
84,neurasthenia,1
85,the Vienna Academy,1
86,the Brown\n\n House,1
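The last row shown above contains line breaks inside the entity text ("the Brown\n\n House"), so the same name written with different whitespace would be counted separately. A minimal sketch of one way to handle this is to collapse whitespace before tallying. The `normalize` helper and the sample list are made up for illustration. Only the first string is taken from the output above.

```python
from collections import Counter


def normalize(entity_text):
    """Collapse any run of whitespace (including newlines) into a single space."""
    return " ".join(entity_text.split())


# Illustrative sample; only the first string comes from the table above.
raw_orgs = ["the Brown\n\n House", "the Brown House", "Reichstag", "Reichstag"]

normalized_tally = Counter(normalize(text) for text in raw_orgs)
print(normalized_tally)
```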


# spacy tokens

The examples above worked at the entity level. The following example shows token-level entity annotation, using the IOB tagging scheme exposed by each token's ent_iob_ attribute to describe the entity boundaries.

- "B" means the token begins an entity
- "I" means the token is inside an entity
- "O" means the token is outside an entity
- "" means no entity tag is set
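To make the B, I and O tags concrete, the sketch below rebuilds whole entities from a short list of (token, tag, label) triples. It is plain Python, not spaCy's own implementation. `group_entities` is a hypothetical helper name, and the sample triples mirror the kind of token-level output shown further down.

```python
def group_entities(tagged_tokens):
    """Merge (token, tag, label) triples into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None
    for text, tag, label in tagged_tokens:
        if tag == "B":
            # A new entity starts; close any entity still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], label
        elif tag == "I" and current_tokens:
            # The token continues the entity that is currently open.
            current_tokens.append(text)
        else:
            # "O" (or no tag): the token is outside any entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


tagged = [
    ("Adolf", "B", "PERSON"),
    ("Hitler", "I", "PERSON"),
    (",", "O", ""),
    ("one", "B", "CARDINAL"),
    ("-", "O", ""),
    ("time", "O", ""),
    ("Austrian", "B", "NORP"),
]
entities = group_entities(tagged)
print(entities)
```

Joining tokens with a single space is a simplification. spaCy itself keeps the original spacing of the text.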

<table>
  <tr><td>
    <img src="https://miro.medium.com/max/1400/1*_sYTlDj2p_p-pcSRK25h-Q.png"
         alt="Fashion MNIST sprite"  width="600">
  </td></tr>
</table>



In [45]:
pprint([(x, x.ent_iob_, x.ent_type_) for x in document])

[(May, 'B', 'DATE'),
 (2, 'I', 'DATE'),
 (,, 'I', 'DATE'),
 (1945, 'I', 'DATE'),
 (

 , 'O', ''),
 (OBITUARY, 'B', 'ORG'),
 (

 , 'O', ''),
 (Hitler, 'B', 'PERSON'),
 (Fought, 'I', 'PERSON'),
 (Way, 'I', 'PERSON'),
 (to, 'O', ''),
 (Power, 'O', ''),
 (Unique, 'O', ''),
 (in, 'O', ''),
 (Modern, 'O', ''),
 (History, 'O', ''),
 (

 , 'O', ''),
 (BY, 'O', ''),
 (THE, 'O', ''),
 (NEW, 'O', ''),
 (YORK, 'O', ''),
 (TIMES, 'O', ''),
 (

 , 'O', ''),
 (Adolf, 'B', 'PERSON'),
 (Hitler, 'I', 'PERSON'),
 (,, 'O', ''),
 (one, 'B', 'CARDINAL'),
 (-, 'O', ''),
 (time, 'O', ''),
 (Austrian, 'B', 'NORP'),
 (vagabond, 'O', ''),
 (who, 'O', ''),
 (rose, 'O', ''),
 (to, 'O', ''),
 (be, 'O', ''),
 (the, 'O', ''),
 (dictator, 'O', ''),
 (of, 'O', ''),
 (Germany, 'B', 'GPE'),
 (,, 'O', ''),
 (", 'O', ''),
 (augmenter, 'O', ''),
 (of, 'O', ''),
 (the, 'O', ''),
 (Reich, 'B', 'PERSON'),
 (", 'O', ''),
 (and, 'O', ''),
 (the, 'O', ''),
 (scourge, 'O', ''),
 (of, 'O', ''),
 (Europe, 'B', 'LOC'),
 (,, 'O', ''),

In [46]:
Counter([x.label_ for x in document.ents])

Counter({'DATE': 181,
         'ORG': 111,
         'PERSON': 363,
         'CARDINAL': 60,
         'NORP': 243,
         'GPE': 321,
         'LOC': 29,
         'EVENT': 12,
         'WORK_OF_ART': 10,
         'ORDINAL': 18,
         'TIME': 13,
         'PRODUCT': 3,
         'LAW': 8,
         'MONEY': 3,
         'PERCENT': 1,
         'FAC': 3,
         'LANGUAGE': 1})

In [47]:
Counter([x.label_ for x in document.ents]).most_common(3)

[('PERSON', 363), ('GPE', 321), ('NORP', 243)]
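Raw counts are easier to compare as proportions of all entities. The sketch below, which assumes the counts printed above, converts a Counter of labels into shares. `label_shares` is a hypothetical helper name.

```python
from collections import Counter

# Label counts copied from the Counter output above.
label_counts = Counter({
    "DATE": 181, "ORG": 111, "PERSON": 363, "CARDINAL": 60, "NORP": 243,
    "GPE": 321, "LOC": 29, "EVENT": 12, "WORK_OF_ART": 10, "ORDINAL": 18,
    "TIME": 13, "PRODUCT": 3, "LAW": 8, "MONEY": 3, "PERCENT": 1,
    "FAC": 3, "LANGUAGE": 1,
})


def label_shares(counts):
    """Return each label's share of all entities, most frequent label first."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


shares = label_shares(label_counts)
for label, share in list(shares.items())[:3]:
    print(f"{label}: {share:.1%}")
```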

In [48]:
sentences = [x for x in document.sents]
print(sentences[5])

Like Lenin and Mussolini, Hitler came out of the blood and chaos of 1914-18, but of the three he was the strangest phenomenon.


In [49]:
displacy.render(nlp(str(sentences[5])), jupyter=True, style='ent')

Using spaCy’s built-in displaCy visualizer, here’s what the above sentence and its dependencies look like:

In [50]:
displacy.render(nlp(str(sentences[5])), style='dep', jupyter = True, options = {'distance': 120})

In [51]:
# as dictionary; the sentence is:
# "Like Lenin and Mussolini, Hitler came out of the blood and
# chaos of 1914-18, but of the three he was the strangest phenomenon."
dict([(str(x), x.label_) for x in nlp(str(sentences[5])).ents])


{'Lenin': 'PERSON',
 'Mussolini': 'PERSON',
 'Hitler': 'PERSON',
 '1914-18': 'DATE',
 'three': 'CARDINAL'}