# 00 Introduction to NER (Named Entity Recognition)

In [1]:
"""
Key Terminology:

    Natural Language Processing (NLP)
    
    Named Entity Recognition (NER)
    
    Information Extraction (IE)
    
    Gazetteer (Rules-Based Method)
    
    Linguistic Ambiguity
    
    Domain Adaptation
    
    Generalize



Key Libraries:

    NLP           ==> spaCy & NLTK
    
    Word Vectors  ==> Gensim
"""

# images
from IPython.display import HTML, Image
Image(url= "https://miro.medium.com/max/602/1*bx85lgIdG9PWdCCnfNpsjQ.png")

What is an entity?

An entity is a "Thing" existing in the real world or in our imagination.

Words are a way to describe a "Thing", but aren't the "Thing" itself.

Rather than describing a thing every time we mention it, we give almost everything a name, and often several:

e.g.: Lord Voldemort
- He Who Must Not Be Named
- You-Know-Who
- Voldemort
- You Know Who
- Dark Lord Voldemort
- Dark Lord
- Tom Riddle
- Tom Marvolo Riddle
- He-Who-Must-Not-Be-Named


Even when we forget the name of an entity, language lets us describe it well enough that others know what we mean:

e.g.: Lord Voldemort
- a fictional character and the main antagonist in J. K. Rowling's series of Harry Potter novels
- that evil wizard from Harry Potter missing a nose


https://www.wikidata.org/wiki/Q176132
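
The idea that many surface forms point to one entity can be sketched as a simple lookup table. The snippet below is only an illustration: `ALIASES` and `resolve` are hypothetical names, the aliases are taken from the list above, and Q176132 is the Wikidata item linked above.

```python
# Hypothetical alias table: every surface form maps to one entity ID.
# Q176132 is the Wikidata item linked in the text; nothing else is looked up online.
ALIASES = {
    "lord voldemort": "Q176132",
    "voldemort": "Q176132",
    "you-know-who": "Q176132",
    "you know who": "Q176132",
    "he who must not be named": "Q176132",
    "he-who-must-not-be-named": "Q176132",
    "dark lord": "Q176132",
    "tom marvolo riddle": "Q176132",
}

def resolve(mention):
    """Return the entity ID for a mention, or None if the mention is unknown."""
    key = " ".join(mention.lower().split())  # normalize case and whitespace
    return ALIASES.get(key)

print(resolve("You-Know-Who"))
print(resolve("  Dark   Lord "))
print(resolve("Hagrid"))
```

A free-text description such as "that evil wizard from Harry Potter" cannot be resolved with a table like this; it needs context, which is why the later chapters look at statistical models.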

In [2]:
import spacy
# https://spacy.io/models/en#en_core_web_lg
# the large English model (700 MB+) has to be downloaded separately:
#!python3 -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

# Where general-purpose models struggle with entity recognition
show_doc = "University Bremen"
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Universität Bremen"
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "University Stuttgart"
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Universität Stuttgart"
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "University Wuppertal"
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Universität Wuppertal"
spacy.displacy.render(nlp(show_doc), style="ent")

## Harry Potter domain

In [3]:
show_doc = "Dursley is the uncle of Harry Potter."
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Mr. Dursley is the uncle of Harry Potter."
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Durley is the uncle of Harry Potter."
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Mr. Durley is the uncle of Harry Potter."
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Mr. Randomname is the uncle of Harry Potter."
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Lord Randomname is the uncle of Harry Potter."
spacy.displacy.render(nlp(show_doc), style="ent")

In [4]:
show_doc = "Lord Voldemort is the evil character in Harry Potter."
spacy.displacy.render(nlp(show_doc), style="ent")
spacy.displacy.render(nlp(show_doc), style='dep', jupyter=True)

In [5]:
show_doc = "Voldemort is the evil character in the book Harry Potter and the Sorcerer's Stone."
spacy.displacy.render(nlp(show_doc), style="ent")

In [7]:
show_doc = "Voldemort is my dog."
spacy.displacy.render(nlp(show_doc), style="ent")

show_doc = "Voldemort is a company."
spacy.displacy.render(nlp(show_doc), style="ent")

# 01 Get data and do some preprocessing

This chapter contains:
- preprocessing text
- loading text into spaCy

In [2]:
import requests, re

# get the first Harry Potter Book
r = requests.get('''http://www.pauladaunt.com/books/Children's/Harry_Potter1-4/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt''')
hp = r.content.decode('utf-8')
print('length: ',len(hp))
print('\n')
print(hp[:550])

length:  439742


Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last
people you'd expect to be involved in anything strange or mysterious,
because they just didn't hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did
have a very large mustache. Mrs. Dursley was thin and blonde and had
nearly 


In [3]:
# collapse line breaks and repeated whitespace; blank lines are treated as sentence boundaries
hp_flat = re.sub(r'\s+', ' ', hp.replace('\n\n','. ').replace('\n',' ').strip())
# collapse runs of dots, then drop punctuation left dangling before an inserted period
hp_flat = re.sub(r'[.]+', '.', hp_flat)
hp_flat = hp_flat.replace(':.','.').replace(';.','.').replace(',.','.')
hp_flat[:550]

"Harry Potter and the Sorcerer's Stone. CHAPTER ONE. THE BOY WHO LIVED. Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly tw"

In [4]:
# check characters
chars = ''.join(sorted(set(hp.replace('\n','').replace('\t',''))))
print('Characters used in corpus:',chars)

# filter for symbols
symbols = re.sub('[A-Za-z0-9!?., ]','', chars)
print('Symbols:',symbols)

# get frequency of each symbol
from collections import Counter
print('\n')
print({k:v for k,v in Counter(hp_flat).items() if k in symbols})

Characters used in corpus:  !"'()*,-.0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ\abcdefghijklmnopqrstuvwxyz~
Symbols: "'()*-:;\~


{"'": 3141, ';': 129, '-': 1990, '"': 4758, '(': 30, ')': 33, ':': 48, '~': 1, '*': 2, '\\': 1}


In [5]:
# check usage of symbols in text
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

for symbol in symbols:
    symbol_position = find_nth(hp_flat, symbol, 1)
    print(f"""{symbol}: {hp_flat[max(0,symbol_position-50):symbol_position+50]}""")

": g a tantrum and throwing his cereal at the walls. "Little tyke," chortled Mr. Dursley as he left the
': Harry Potter and the Sorcerer's Stone. CHAPTER ONE. THE BOY WHO LIVED. Mr. and 
(: er daughter and how Dudley had learned a new word ("Won't!"). Mr. Dursley tried to act normally. Whe
): er and how Dudley had learned a new word ("Won't!"). Mr. Dursley tried to act normally. When Dudley 
*: ue, but Ron kicked him behind their cauldron. "Doi* push it," he muttered, "I've heard Snape can tur
-: n't have a sister, because her sister and her good-for-nothing husband were as unDursleyish as it wa
::  to persuade people to call him by his proper name: Voldemort." Professor McGonagall flinched, but D
;: ey's sister, but they hadn't met for several years; in fact, Mrs. Dursley pretended she didn't have 
\: to save me?". "Of course," said Quirrell coolly. "\Why do you think he wanted to referee your next m
~: er ours. Some of 'em came outta kinda trances. Don~ reckon they could've done if h

In [6]:
# check bracket contents
re.findall(r"\([^()]+\)", hp_flat)

['("Won\'t!")',
 "(he'd be able to watch what he wanted on television for a change and maybe even have a go on Dudley's computer)",
 '(brown with orange puff balls)',
 '(as he shouted at Uncle Vernon through the locked door of his cupboard)',
 '(or maybe hoped)',
 "(usually Uncle Vernon's sister, Marge)",
 '(two minutes to go)',
 '(Order of Merlin, First Class, Grand Sorc., Chf. Warlock, Supreme Mugwump, International Confed. of Wizards)',
 '(black)',
 '(black)',
 '(dragon hide or similar)',
 '(black, silver fastenings)',
 '(Grade 1)',
 '(pewter, standard size 2)',
 '(chocolate and raspberry with chopped nuts)',
 '(Bewitch Your Friends and Befuddle Your Enemies with the Latest Revenges: Hair Loss, Jelly-Legs, Tongue- Tying and Much, Much More)',
 '("It says pewter on yer list")',
 '(five Knuts a scoop)',
 '(feeling foolish)',
 '(the sandwiches lay forgotten)',
 '(though I have none)',
 '("I do hope they start right away, there\'s so much to learn, I\'m particularly interested in Transf

In [7]:
# delete unnecessary symbols
hp_flat = re.sub("[*~]", "", hp_flat).replace("\\", "")

# delete brackets together with their content (it doesn't seem to be important here), then any stray brackets
hp_flat = re.sub(r"\([^()]+\)", "", hp_flat).replace('(','').replace(')','')

The text should now be ready to be fed into the spaCy pipeline.
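
The cleaning steps from the cells above can be bundled into one function, which makes them easy to try on a small sample. This is a sketch that mirrors the order of the notebook cells; the name `clean_text` and the sample string are assumptions made for illustration.

```python
import re

def clean_text(text):
    """Apply the notebook's cleaning steps, in the same order, to any text."""
    # blank lines (headings, paragraph breaks) become ". ", single line breaks become spaces
    flat = text.replace("\n\n", ". ").replace("\n", " ").strip()
    flat = re.sub(r"\s+", " ", flat)            # collapse repeated whitespace
    flat = re.sub(r"[.]+", ".", flat)           # collapse runs of adjacent dots
    flat = flat.replace(":.", ".").replace(";.", ".").replace(",.", ".")
    flat = re.sub(r"[*~\\]", "", flat)          # remove stray symbols
    flat = re.sub(r"\([^()]+\)", "", flat)      # remove bracketed asides
    return flat.replace("(", "").replace(")", "")

# Invented sample in the style of the corpus
sample = "CHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. Dursley tried to act\nnormally..."
print(clean_text(sample))
```

Two side effects are worth knowing: an ellipsis is reduced to a single period, and removing a bracketed aside leaves the spaces that surrounded it, because whitespace is collapsed before the brackets are removed.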

In [8]:
import spacy
nlp = spacy.load("en_core_web_lg")

# nlp is spaCy's processing pipeline; calling it on a text returns a Doc with tokens, sentences and entities
doc = nlp(hp_flat)

# collect the sentences of the doc in a plain list for easier usage
hp_sentences = []
for sentence in doc.sents:
    #filter empty sentences
    if sentence.text.strip():
        hp_sentences.append(sentence)
hp_sentences[:10]

[Harry Potter and the Sorcerer's Stone.,
 CHAPTER ONE.,
 THE BOY WHO LIVED.,
 Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.,
 They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.,
 Mr. Dursley was the director of a firm called Grunnings, which made drills.,
 He was a big, beefy man with hardly any neck, although he did have a very large mustache.,
 Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors.,
 The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.,
 The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it.]

# 02 Explore some spacy functions

This chapter contains:
- exploring parts of spaCy
- a first check for entities
- the issue of general models

In [9]:
# https://spacy.io/usage/linguistic-features

# let's see if we can identify some entities using spacy
for sentence in hp_sentences[:6]:
    print('\n')
    print(sentence)
    for ent in sentence.ents:
        print(ent.text, ent.label_)
        
# Not bad! Spacy got some good hits there! But does spacy really recognize those entities or was it just luck?



Harry Potter and the Sorcerer's Stone.
Harry Potter PERSON


CHAPTER ONE.
ONE CARDINAL


THE BOY WHO LIVED.


Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
Dursley PERSON
four CARDINAL
Privet Drive FAC


They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.


Mr. Dursley was the director of a firm called Grunnings, which made drills.
Dursley PERSON
Grunnings ORG


In [10]:
# We can actually check out what spaCy associates with a single word

import numpy as np

# A function that looks up the word vector of the input word and prints the 10 entries with the most similar vectors
def spacy_similarity(word):
    ms = nlp.vocab.vectors.most_similar(
        np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=10)
    words = [nlp.vocab.strings[w] for w in ms[0][0]]
    distances = ms[2]
    print(words)

In [11]:
# Seems like Spacy knows some Harry Potter stuff
print(spacy_similarity("Harry"))
print(spacy_similarity("Potter"))

# As we can see <Potter> is in the top 5 in the <Harry>-Vector
# Also <Harry> is in the top 5 in the <Potter>-Vector
# No surprise, spacy nailed that one!

['HARRY', 'Harry', 'harry', 'POTTER', 'Potter', 'potter', 'Hermione', 'HERMIONE', 'hermione', 'Hallows']
None
['POTTER', 'Potter', 'potter', 'HARRY', 'harry', 'Harry', 'Hallows', 'hallows', 'deathly', 'Deathly']
None
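
As far as the spaCy documentation describes it, `most_similar` ranks entries by cosine similarity. The following sketch shows that ranking on a hand-made table; the three-dimensional vectors are invented toy values, not spaCy's real vectors, and `cosine` and `most_similar` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, table, n=3):
    """Return the n words whose vectors are closest to the vector of `word`."""
    query = table[word]
    ranked = sorted(table, key=lambda other: cosine(query, table[other]), reverse=True)
    return ranked[:n]

# Toy vectors, invented for illustration only
toy_vectors = {
    "harry":    [0.9, 0.1, 0.0],
    "potter":   [0.8, 0.2, 0.1],
    "hermione": [0.7, 0.3, 0.0],
    "drill":    [0.0, 0.1, 0.9],
}
print(most_similar("harry", toy_vectors))
```

As in the notebook output, the query word itself comes first because its similarity to itself is the maximum.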


In [12]:
# The vectors closest to <Dursley> are mostly place names, yet spaCy still tagged <Dursley> as a PERSON above. Good job, spaCy!
print("Vector for <Dursley>:")
spacy_similarity("Dursley")
print('\n')

# spaCy does this by recognizing patterns. The striking pattern here was <Mr.> <Noun>.
temp = nlp('Mr. Dursley is a person.')
for ent in temp.ents:
    print('Mr. Dursley is a person. #',ent.text, ent.label_)

# If we remove the <Mr.> pattern, spaCy fails to detect <Dursley> as a person, although the sentence says so explicitly.
temp = nlp('Dursley is a person.')
for ent in temp.ents:
    print('Dursley is a person. #', ent.text, ent.label_)

# Apparently spaCy doesn't know anything about Mr. Dursley from Harry Potter.
temp = nlp('Dursley from Harry Potter.')
for ent in temp.ents:
    print('Dursley from Harry Potter. #', ent.text, ent.label_)

# Without any context, spaCy again labels <Dursley> as a place (GPE).
temp = nlp('Dursley')
for ent in temp.ents:
    print('Dursley #', ent.text, ent.label_)

Vector for <Dursley>:
['Dursley', 'dursley', 'wallasey', 'prescot', 'whitchurch', 'petersfield', 'clitheroe', 'Westhill', 'westhill', 'consett']


Mr. Dursley is a person. # Dursley PERSON
Dursley is a person. # Dursley GPE
Dursley from Harry Potter. # Dursley GPE
Dursley from Harry Potter. # Harry Potter PERSON
Dursley # Dursley GPE


In [13]:
# Nagini is a talking snake and Voldemort's pet. It doesn't look like spaCy knows Nagini yet.
try:
    spacy_similarity("Nagini")
except KeyError:
    print("Nagini not in data.")

Nagini not in data.


In [14]:
# Oh no, what is spacy doing here?
temp = nlp('Nagini is a person.')
for ent in temp.ents:
    print('Nagini is a person. #',ent.text, ent.label_)

Nagini is a person. # Nagini ORG


In [15]:
# Seems like the pattern '<token> is a person.' makes spaCy default to ORG ...
temp = nlp('Randomname is a person.')
for ent in temp.ents:
    print('Randomname is a person. #',ent.text, ent.label_)
    
# ... unless we make it even more obvious ...
temp = nlp('Harry is a person.')
for ent in temp.ents:
    print('Harry is a person. #',ent.text, ent.label_)

# ... or maybe spaCy doesn't really understand us?
temp = nlp('Harry is a organization.')
for ent in temp.ents:
    print('Harry is a organization. #',ent.text, ent.label_)

Randomname is a person. # Randomname ORG
Harry is a person. # Harry PERSON
Harry is a organization. # Harry PERSON


# 03 Get Metadata about the domain of interest from Wikidata

This chapter contains:
- collecting metadata for entities from wikidata

Recognizing entities in an unknown domain can be very difficult.

Normally we would have to label entities manually, but with the help of the internet we may save some tedious work.

Knowledge bases such as Wikidata are among the biggest sources of this kind of information.

With some luck, we can even manage to find a whole group working on our specific domain: e.g.  https://www.wikidata.org/wiki/Wikidata:WikiProject_Harry_Potter

Besides wikidata there are various other sources, which may need some more preprocessing, but still can be very helpful.

Warning: Always check data that you did not create yourself.

In [16]:
import sys, pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

# We are using the Wikidata SPARQL API to easily receive Metadata of good quality
endpoint_url = "https://query.wikidata.org/sparql"


############################################
# Creating a query for receiving data - Here the domain is the "Harry Potter Universe"

# ?item wdt:P1441 wd:Q8337.               -> items that are <present in work> (P1441) <Harry Potter Books> (Q8337)
# ?item wdt:P31  wd:Q3658341.             -> items that are <instance of> (P31) <literary character> (Q3658341)
# SERVICE wikibase:label { bd:service ... -> Select labels in <english> (en)
# Optional{?item skos:altLabel ?altLabel. -> <Optional>: also return <alternative labels> where they exist
# FILTER (lang(?altLabel) = "en")}        -> Keep only alternative labels (?altLabel) in English
############################################
query = """
            SELECT ?item ?itemLabel ?altLabel
            WHERE 
            {
              ?item wdt:P1441 wd:Q8337.
              ?item wdt:P31 wd:Q3658341.
              SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"}.
              Optional{?item skos:altLabel ?altLabel . FILTER (lang(?altLabel) = "en")}
            }

        """

What the query looks like in the API:

https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3FaltLabel%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20wdt%3AP1441%20wd%3AQ8337.%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ3658341.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22%7D.%0A%20%20Optional%7B%3Fitem%20skos%3AaltLabel%20%3FaltLabel%20.%20FILTER%20%28lang%28%3FaltLabel%29%20%3D%20%22en%22%29%7D%0A%7D%0A
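
The link above is simply the query text, percent-encoded and appended to the address of the query service. A minimal sketch of how such a link could be built with the standard library; `build_query_url` is a hypothetical helper, not part of SPARQLWrapper.

```python
from urllib.parse import quote

def build_query_url(sparql, base="https://query.wikidata.org/#"):
    """Percent-encode a SPARQL query and append it to the query-service address."""
    # safe="" also encodes "/", so the whole query is treated as plain text
    return base + quote(sparql, safe="")

# Shortened query for illustration
print(build_query_url("SELECT ?item WHERE { ?item wdt:P31 wd:Q3658341. }"))
```

Opening such a link in a browser shows the query in the web editor, which is convenient for checking results by hand before running the Python code.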

In [17]:
# Function to receive the data in json format
def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

# Function to unpack the wanted values and remove empty results
def try2unpack(x):
    try:
        x=x['value']
    except (KeyError, TypeError):
        pass
    
    try:
        if 'http://www.wikidata.org/entity/Q' in x or 'http://www.wikidata.org/entity/P' in x:
            return x.rsplit('/',1)[1]
        else:
            return x
    except TypeError:
        return x
    
# Function to transform the data into a pandas DataFrame
def json2pandas(data):
    return pd.DataFrame(data['results']['bindings'], columns=data['head']['vars']).applymap(lambda x: try2unpack(x))


# Execute
results = get_results(endpoint_url, query)
characters = json2pandas(results).fillna('')
characters.sort_values('item')

Unnamed: 0,item,itemLabel,altLabel
259,Q1000118,Peter Pettigrew,Wormtail
258,Q1000118,Peter Pettigrew,Scabbers
116,Q10264933,Demelza Robins,
117,Q10266348,Dilys Derwent,
118,Q10269234,Damocles Rowle,
...,...,...,...
111,Q9617281,Anthony Goldstein,
112,Q9631186,Artemisia Lufkin,
113,Q9640654,Avery,
114,Q9659949,Bob Ogden,


In [18]:
# Convert the data to a dictionary, so that every word representation points to its entity

characters_dictionary = {}

# 1) point alternative word representation to the entity
for index, row in characters.iterrows():
    if row['altLabel']:
        characters_dictionary[ row['altLabel'] ]  = row['item']
        
# 2) point the primary word representation to the entity. Because 2) runs last, it may overwrite entries from 1).
for index, row in characters.iterrows():
    if row['itemLabel']:
        characters_dictionary[ row['itemLabel'] ] = row['item']

list(characters_dictionary.items())[:10]

[('Spout-Hole', 'Q15257494'),
 ('Cedrella Weasley', 'Q15269290'),
 ('Callidora Longbottom', 'Q15269305'),
 ('Ursula Black', 'Q15269368'),
 ('Sirius Black', 'Q713701'),
 ('Melania Black', 'Q15272785'),
 ('Hesper Black', 'Q15272848'),
 ('Irma Black', 'Q15272917'),
 ('Ronald Bilius "Ron" Weasley', 'Q173998'),
 ('Ronald Bilius Weasley', 'Q173998')]

In [19]:
# 'Lord Voldemort' and 'You Know Who' now refer to the same entity: Q176132.
print('Lord Voldemort:', characters_dictionary.get('Lord Voldemort'))
print('You Know Who:', characters_dictionary.get('You Know Who'))

# Although Lord Voldemort is also known as 'Tom Riddle', 'Tom Riddle' is the name of Voldemort's father as well: Q2182794.
print('\nTom Riddle:', characters_dictionary.get('Tom Riddle'))

# A word representation can point to two different entities.
# To figure out which entity is the correct one, we have to take the surrounding context of the word representation into account.
# For now we leave 'Tom Riddle' as Q2182794, the father of Voldemort (Q176132)

Lord Voldemort: Q176132
You Know Who: Q176132

Tom Riddle: Q2182794
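
Because the dictionary keeps only one entity per label, an ambiguous label such as 'Tom Riddle' is silently overwritten. One way to make such cases visible is to collect all candidate entities per label first. This is a sketch under assumptions: the rows below only imitate the shape of the `characters` table, and the exact labels attached to each ID are illustrative.

```python
from collections import defaultdict

# Illustrative rows in the shape (item, itemLabel, altLabel).
# The two IDs are the ones printed above; the labels per row are assumptions.
rows = [
    ("Q176132", "Lord Voldemort", "Tom Riddle"),
    ("Q176132", "Lord Voldemort", "You Know Who"),
    ("Q2182794", "Tom Riddle", ""),
]

def build_candidates(rows):
    """Map each label to the set of all entity IDs it can refer to."""
    candidates = defaultdict(set)
    for item, label, alt_label in rows:
        for name in (label, alt_label):
            if name:  # skip empty alternative labels
                candidates[name].add(item)
    return dict(candidates)

candidates = build_candidates(rows)
ambiguous = {name: sorted(ids) for name, ids in candidates.items() if len(ids) > 1}
print(ambiguous)  # {'Tom Riddle': ['Q176132', 'Q2182794']}
```

Labels that end up in `ambiguous` are the ones where the surrounding context has to decide which entity is meant.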


We have now collected some clean representations for our entities. In the next step we need to identify these entities in the actual text.

# 04 Gazetteer and NER (Rules-Based NER)

This chapter contains:
- creating our own entity labels for spaCy with very little effort

In [20]:
# Let's see if we can fix Dursley with some simple methods of our own

# Let's see which Wikidata entities we have for <Dursley>
characters[characters['itemLabel'].str.contains('Dursley')]

# We have 3 entities
    # Dudley Dursley (son) #Q5700415
    # Vernon Dursley (father) #Q11910388
    # Petunia Dursley (mother) #Q1963397

Unnamed: 0,item,itemLabel,altLabel
89,Q5700415,Dudley Dursley,
272,Q11910388,Vernon Dursley,Mr. Dursley
273,Q11910388,Vernon Dursley,Uncle Vernon
277,Q1963397,Petunia Dursley,Aunt Petunia
278,Q1963397,Petunia Dursley,Mrs. Dursley
279,Q1963397,Petunia Dursley,Petunia Evans


In [21]:
# get all entities for the Dursleys
dursleys_entities = [v for k,v in characters_dictionary.items() if 'Dursley' in k]

# get all variations for the Dursleys
dursleys_variations = [[k for k,v in characters_dictionary.items() if entity==v] for entity in dursleys_entities]

# flatten list
dursleys_variations = [variation for entity_variations in dursleys_variations for variation in entity_variations]

# add single names
dursleys_variations = set(dursleys_variations + [name.split()[0] for name in dursleys_variations] + [name.split()[-1] for name in dursleys_variations])
dursleys_variations

{'Aunt',
 'Aunt Petunia',
 'Dudley',
 'Dudley Dursley',
 'Dursley',
 'Evans',
 'Mr.',
 'Mr. Dursley',
 'Mrs.',
 'Mrs. Dursley',
 'Petunia',
 'Petunia Dursley',
 'Petunia Evans',
 'Uncle',
 'Uncle Vernon',
 'Vernon',
 'Vernon Dursley'}

In [22]:
# finding entities via list

sentence = nlp('Mr. and Mrs. Dursley visited their son Dudley.')

print(sentence)
print('\n')
for token in sentence:
    if token.text in dursleys_variations:
        print(token)

Mr. and Mrs. Dursley visited their son Dudley.


Mr.
Mrs.
Dursley
Dudley


In [23]:
# create a more complex rule: combine a title with a name found within the next few tokens
tokenized_sentence = [token.text for token in sentence]

window = 3                  # how many tokens after a title are inspected
token_already_used = set()  # positions already consumed by a title + name match
for i, actual_token in enumerate(tokenized_sentence):
    if actual_token in ['Mr.', 'Mrs.']:
        # pair the title with each of the next `window` tokens, stopping at the end of the sentence
        for j in range(i+1, min(i+1+window, len(tokenized_sentence))):
            window_bigram = actual_token+' '+tokenized_sentence[j]

            if window_bigram in dursleys_variations:
                token_already_used.add(j)
                print(window_bigram)

    elif actual_token in dursleys_variations and i not in token_already_used:
        print(actual_token)

Mr. Dursley
Mrs. Dursley
Dudley
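
A common alternative to a hand-written window rule is a greedy longest-match lookup, which always prefers the longest gazetteer entry that starts at the current token. The sketch below is not the notebook's rule: `longest_match` is a hypothetical helper, the tokens are written out by hand, and the small gazetteer is an assumed subset of `dursleys_variations` without the bare titles.

```python
def longest_match(tokens, gazetteer, max_len=3):
    """Greedy left-to-right lookup that prefers the longest matching token span."""
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:
                matches.append((i, i + n, candidate))
                i += n  # skip the tokens that were consumed by the match
                break
        else:
            i += 1      # no gazetteer entry starts at this token
    return matches

# Tokens written out by hand; in the notebook they would come from spaCy
tokens = ["Mr.", "and", "Mrs.", "Dursley", "visited", "their", "son", "Dudley", "."]
gazetteer = {"Mr. Dursley", "Mrs. Dursley", "Dursley", "Dudley", "Dudley Dursley"}
print(longest_match(tokens, gazetteer))  # [(2, 4, 'Mrs. Dursley'), (7, 8, 'Dudley')]
```

Unlike the window rule, this lookup only matches contiguous tokens, so it does not connect "Mr." with "Dursley" across the word "and". Which behavior is preferable depends on the texts at hand.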


# 05 Manipulate spaCy's Named Entity Recognition

This chapter contains:
- adding custom entities to spacy

In [24]:
import spacy
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
import json

In [25]:
# create customized labels
def creat_training_data(data, data_type):
    patterns = []
    
    for item in data:
        pattern = {
            "label": data_type,
            "pattern": item           
        }
        patterns.append(pattern)
    return patterns
        
# prepare the customized entities; you may want to modify or clean them first
characters = list(characters_dictionary.keys())    
characters += ' '.join(characters).split()
stoplist = ['the','and', 'i', 'he']
characters = [character for character in characters if character.lower() not in stoplist]
characters += [character.upper() for character in characters] + [character.title() for character in characters]

# apply the new entities and save them as "pattern"
patterns = creat_training_data(data=sorted(set(characters)), data_type="HP_CHAR")

print(patterns[:10])

[{'label': 'HP_CHAR', 'pattern': '"BILL"'}, {'label': 'HP_CHAR', 'pattern': '"Bill"'}, {'label': 'HP_CHAR', 'pattern': '"RON"'}, {'label': 'HP_CHAR', 'pattern': '"ROSIE"'}, {'label': 'HP_CHAR', 'pattern': '"Ron"'}, {'label': 'HP_CHAR', 'pattern': '"Rosie"'}, {'label': 'HP_CHAR', 'pattern': '"TEDDY"'}, {'label': 'HP_CHAR', 'pattern': '"Teddy"'}, {'label': 'HP_CHAR', 'pattern': "'MAD-EYE'"}, {'label': 'HP_CHAR', 'pattern': "'Mad-Eye'"}]


In [26]:
# create a new blank model and save the new "patterns" in it
def generate_rules(patterns):
    nlp = English()
    ruler = EntityRuler(nlp)
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)
    # the model will be saved as 'hp_ner', you may choose a different name
    nlp.to_disk('hp_ner')
    
generate_rules(patterns)

In [29]:
# we can also load our customized patterns from disk
nlp = spacy.load('hp_ner')

# a quick test shows that the reloaded model still labels text as expected
sentence = 'Mr. and Mrs. Dursley visited their son Dudley.'
doc = nlp(sentence)

for ent in doc.ents:
    print(ent.text, ent.label_)

Mr. HP_CHAR
Mrs. Dursley HP_CHAR
Dudley HP_CHAR
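
The standalone `Mr.` match comes from splitting every name into single tokens, which turns titles into patterns of their own. If that is not wanted, generic tokens can be filtered out while the names are expanded. A minimal sketch, assuming a hand-picked list of titles; `TITLES` and `expand_names` are hypothetical names, and the stoplist is the one from the cell above.

```python
# Hand-picked titles, chosen for illustration; extend as needed for the corpus
TITLES = {"mr.", "mrs.", "aunt", "uncle", "lord", "professor"}
STOPLIST = {"the", "and", "i", "he"}  # same stoplist as in the notebook

def expand_names(names, blocked):
    """Return full names plus their single tokens, skipping blocked tokens."""
    expanded = set()
    for name in names:
        expanded.add(name)  # the full name is always kept
        for token in name.split():
            if token.lower() not in blocked:
                expanded.add(token)
    return sorted(expanded)

print(expand_names(["Mr. Dursley", "Uncle Vernon"], TITLES | STOPLIST))
# ['Dursley', 'Mr. Dursley', 'Uncle Vernon', 'Vernon']
```

With such a filter, "Mr." alone would no longer be a pattern, while "Mr. Dursley" as a whole still would be.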


In [30]:
# let's retry the previously mislabeled sentences

doc = nlp('Nagini is a company')
for ent in doc.ents:
    print(ent.text, ent.label_)
    
doc = nlp('Dursley is a company')
for ent in doc.ents:
    print(ent.text, ent.label_)

Nagini HP_CHAR
Dursley HP_CHAR


# 06 Training a spaCy NER model

This chapter contains:
- training a customized model

In [31]:
# random will be used to shuffle the train data
import random

In [33]:
# use the customized entities to tag the data: we re-parse every sentence with our new model using nlp()
hp_sentences_new = [nlp(sentence.text) for sentence in hp_sentences]

In [35]:
# create train data

#############
#
# The training data consists of the raw sentence text plus the character offsets and labels of all entities (the end offset is exclusive)
# Dudley is the son of Dursley. entities: [(0, 6, 'HP_CHAR'), (21, 28, 'HP_CHAR')] 
# 0-----6..............21-----28
#
# TRAIN_DATA[(text, {"entities":[(start_entity, end_entity, label)]})]
#
# In this example we are only using the customized label 'HP_CHAR'. 
# In more complex models there can also be multiple labels; the data structure would be the same.
#
##############

from tqdm import tqdm
train_data = []

#loop over sentences
for sentence in tqdm(hp_sentences_new):
    # get and save entities per sentence
    entities = []
    for ent in sentence.ents:
        entities.append( (ent.start_char, ent.end_char, ent.label_) )
        
    # if sentence doesn't contain any entities skip it
    if len(entities)>0:
        result = [sentence.text, {"entities": entities}]
        train_data.append(result)
        
train_data[:5]

100%|██████████| 7391/7391 [00:00<00:00, 53503.71it/s]


[["Harry Potter and the Sorcerer's Stone.",
  {'entities': [(0, 12, 'HP_CHAR')]}],
 ['THE BOY WHO LIVED.', {'entities': [(0, 17, 'HP_CHAR')]}],
 ['Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.',
  {'entities': [(0, 3, 'HP_CHAR'), (8, 20, 'HP_CHAR')]}],
 ['Mr. Dursley was the director of a firm called Grunnings, which made drills.',
  {'entities': [(0, 11, 'HP_CHAR')]}],
 ['Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors.',
  {'entities': [(0, 12, 'HP_CHAR')]}]]
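
The offsets in this format are character positions, with the end position exclusive, so `text[start:end]` returns exactly the entity text. The following sketch computes such offsets with a regular expression instead of spaCy, which is handy for checking annotations by hand; `find_entity_spans` is a hypothetical helper and the name list is an assumption.

```python
import re

def find_entity_spans(text, names, label="HP_CHAR"):
    """Return non-overlapping (start, end, label) tuples; `end` is exclusive."""
    if not names:
        return []
    # try longer names first so that "Mr. Dursley" wins over "Dursley"
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(re.escape(n) for n in ordered) + r")(?!\w)")
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]

text = "Dudley is the son of Dursley."
spans = find_entity_spans(text, ["Dudley", "Dursley", "Mr. Dursley"])
print(spans)  # [(0, 6, 'HP_CHAR'), (21, 28, 'HP_CHAR')]
for start, end, _ in spans:
    print(repr(text[start:end]))  # slicing with the offsets recovers the entity text
```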

In [36]:
#training the actual model
def train_spacy(data, iterations):
    train_data = data
    
    # create a blank model
    nlp = spacy.blank("en")
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
        
    # add entity labels to the new model
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
            
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    
    # machine learning
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itn in tqdm(range(iterations)):
            print("Starting iteration " + str(itn))
            random.shuffle(train_data)
            losses = {}
            for text, annotations in train_data:
                nlp.update(
                    [text],
                    [annotations],
                    drop = 0.2,
                    sgd = optimizer,
                    losses = losses
                )
            print(losses)
    return nlp

# train the model for 30 iterations
#nlp= train_spacy(data = train_data, iterations = 30)

# save the model to disk so it can be reloaded later
#nlp.to_disk("hp_ner_model")

  0%|          | 0/30 [00:00<?, ?it/s]

Starting iteration 0


  3%|▎         | 1/30 [02:27<1:11:07, 147.16s/it]

{'ner': 1343.4488567641315}
Starting iteration 1


  7%|▋         | 2/30 [05:02<1:10:50, 151.79s/it]

{'ner': 429.6318789849923}
Starting iteration 2


 10%|█         | 3/30 [07:41<1:09:50, 155.21s/it]

{'ner': 321.8166859355644}
Starting iteration 3


 13%|█▎        | 4/30 [10:16<1:07:16, 155.26s/it]

{'ner': 371.4937843162249}
Starting iteration 4


 17%|█▋        | 5/30 [12:56<1:05:24, 156.98s/it]

{'ner': 271.7208533134708}
Starting iteration 5


 20%|██        | 6/30 [15:34<1:02:52, 157.20s/it]

{'ner': 286.43854036229544}
Starting iteration 6


 23%|██▎       | 7/30 [18:13<1:00:32, 157.94s/it]

{'ner': 231.19553804370543}
Starting iteration 7


 27%|██▋       | 8/30 [20:53<58:06, 158.49s/it]  

{'ner': 177.20698522298508}
Starting iteration 8


 30%|███       | 9/30 [24:11<59:48, 170.86s/it]

{'ner': 199.3591256304135}
Starting iteration 9


 33%|███▎      | 10/30 [27:39<1:00:46, 182.30s/it]

{'ner': 212.03158236458927}
Starting iteration 10


 37%|███▋      | 11/30 [31:06<1:00:09, 189.95s/it]

{'ner': 139.06240680536624}
Starting iteration 11


 40%|████      | 12/30 [34:31<58:17, 194.29s/it]  

{'ner': 154.96043161341953}
Starting iteration 12


 43%|████▎     | 13/30 [37:57<56:07, 198.09s/it]

{'ner': 194.7991615954541}
Starting iteration 13


 47%|████▋     | 14/30 [41:24<53:28, 200.54s/it]

{'ner': 150.81754314918365}
Starting iteration 14


 50%|█████     | 15/30 [45:06<51:47, 207.17s/it]

{'ner': 158.36678989770317}
Starting iteration 15


 53%|█████▎    | 16/30 [49:10<50:52, 218.06s/it]

{'ner': 147.6767240298115}
Starting iteration 16


 57%|█████▋    | 17/30 [53:16<49:06, 226.66s/it]

{'ner': 127.65077386526642}
Starting iteration 17


 60%|██████    | 18/30 [57:50<48:09, 240.80s/it]

{'ner': 105.09673162892827}
Starting iteration 18


 63%|██████▎   | 19/30 [1:02:02<44:45, 244.11s/it]

{'ner': 132.0116254106755}
Starting iteration 19


 67%|██████▋   | 20/30 [1:06:10<40:54, 245.49s/it]

{'ner': 121.40402985431997}
Starting iteration 20


 70%|███████   | 21/30 [1:10:17<36:52, 245.89s/it]

{'ner': 139.82463400002737}
Starting iteration 21


 73%|███████▎  | 22/30 [1:14:25<32:52, 246.51s/it]

{'ner': 84.64178850856159}
Starting iteration 22


 77%|███████▋  | 23/30 [1:18:42<29:06, 249.47s/it]

{'ner': 93.21968562463945}
Starting iteration 23


 80%|████████  | 24/30 [1:23:19<25:47, 257.84s/it]

{'ner': 96.6871883819936}
Starting iteration 24


 83%|████████▎ | 25/30 [1:28:09<22:16, 267.37s/it]

{'ner': 109.59376955643506}
Starting iteration 25


 87%|████████▋ | 26/30 [1:33:03<18:21, 275.40s/it]

{'ner': 70.5665457706036}
Starting iteration 26


 90%|█████████ | 27/30 [1:37:45<13:52, 277.42s/it]

{'ner': 97.33595069507032}
Starting iteration 27


 93%|█████████▎| 28/30 [1:43:07<09:41, 290.76s/it]

{'ner': 92.62564450873163}
Starting iteration 28


 97%|█████████▋| 29/30 [1:48:46<05:05, 305.20s/it]

{'ner': 83.8828541889144}
Starting iteration 29


100%|██████████| 30/30 [1:54:22<00:00, 228.76s/it]

{'ner': 125.54900472928666}





In [55]:
# test the model
nlp = spacy.load('hp_ner_model')
doc = nlp("""Dorsley was the director of a firm called Grunnings, which made drills.""")

for ent in doc.ents:
    print(ent.text, ent.label_)

Dorsley HP_CHAR


In [56]:
doc = nlp("""Mr. Dursley was the director of a firm called Grunnings, which made drills.""")

for ent in doc.ents:
    print(ent.text, ent.label_)

Mr. Dursley HP_CHAR


In [94]:
# <Randomword> is not in the corpus, but it is still labeled HP_CHAR without any context:
# spaCy has learnt that a word of this shape may fit "HP_CHAR".
# Because we trained on little data and only one label, most unseen words are labeled based on very weak patterns.

doc = nlp("""Randomword""")

for ent in doc.ents:
    print(ent.text, ent.label_)

Randomword HP_CHAR


In [100]:
doc = nlp("""Randomwo""")

for ent in doc.ents:
    print(ent.text, ent.label_)

In [98]:
doc = nlp("""Randomw""")

for ent in doc.ents:
    print(ent.text, ent.label_)

Randomw HP_CHAR


In [104]:
doc = nlp("""Rando""")

for ent in doc.ents:
    print(ent.text, ent.label_)

Rando HP_CHAR


In [105]:
doc = nlp("""Rand""")

for ent in doc.ents:
    print(ent.text, ent.label_)

In [106]:
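try:
    # test each character name for the substring, not the list as a whole
    matches = [char for char in final_characters if "Rand" in char]
    print(matches or 'not in corpus')
except NameError:
    # final_characters is not defined in this session, so no check actually ran
    print('not in corpus')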

not in corpus
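
The manual prefix checks above (Randomword, Randomwo, Randomw, Rando, Rand) can be sketched as a small loop. This is a minimal sketch, not part of the original notebook: `probe_prefixes` and `min_length` are hypothetical names, and `nlp` is assumed to be any callable that returns an object with an `ents` attribute, such as the `hp_ner_model` pipeline loaded earlier.

```python
def probe_prefixes(nlp, word, min_length=1):
    """Run the pipeline on ever shorter prefixes of `word` and collect the entities found.

    `nlp` is assumed to behave like a spaCy pipeline: calling it returns a
    document whose `ents` have `text` and `label_` attributes.
    """
    results = []
    for end in range(len(word), min_length - 1, -1):
        prefix = word[:end]
        doc = nlp(prefix)
        results.append((prefix, [(ent.text, ent.label_) for ent in doc.ents]))
    return results


# Usage with the trained model from this notebook (assumes it is loaded as `nlp`):
# for prefix, ents in probe_prefixes(nlp, "Randomword", min_length=4):
#     print(prefix, ents)
```

Collecting the results in one list makes it easier to see where the label stops being assigned, which in the cells above happens between "Rando" and "Rand".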


In [150]:
# So far our model has learnt a few patterns

# on its own, the model doesn't recognize <War>
doc = nlp("""War""")
for ent in doc.ents:
    print(ent.text, ent.label_)

# but in combination with <Harry> it is recognized as part of Harry
doc = nlp("""Harry War""")
for ent in doc.ents:
    print(ent.text, ent.label_)
    
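# check that <Harry War> is not part of any name in the original data
try:
    # test each character name for the substring, not the list as a whole
    matches = [char for char in final_characters if "Harry War" in char]
    print(matches or 'not in corpus')
except NameError:
    # final_characters is not defined in this session, so no check actually ran
    print('not in corpus')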

Harry War HP_CHAR
not in corpus


In [151]:
doc = nlp("""
Harry James Potter was an English half-blood wizard, and one of the most famous wizards of modern times. 
The only child and son of James and Lily Potter (née Evans), 
Harry's birth was overshadowed by a prophecy, 
naming either himself or Neville Longbottom as the one with the power to vanquish Lord Voldemort.
After half of the prophecy was reported to Voldemort, courtesy of Severus Snape, Harry was chosen 
as the target due to his many similarities with the Dark Lord. In turn, this caused the Potter 
family to go into hiding. 
Voldemort made his first vain attempt to circumvent the prophecy when 
Harry was a year and three months old. During this attempt, he murdered 
Harry's parents as they tried to protect him, but this unsuccessful attempt to kill 
Harry led to Voldemort's first downfall. 
This downfall marked the end of the War, and to Harry henceforth being known as "The Boy Who Lived", as he was the only known survivor of the Killing Curse.
""".replace('\n',' ').strip() )

for ent in doc.ents:
    print(ent.text, ent.label_)

Harry James HP_CHAR
Potter HP_CHAR
James HP_CHAR
Lily Potter HP_CHAR
Harry HP_CHAR
Neville Longbottom HP_CHAR
Lord Voldemort HP_CHAR
Voldemort HP_CHAR
Severus Snape HP_CHAR
Harry HP_CHAR
Dark Lord HP_CHAR
Potter HP_CHAR
Voldemort HP_CHAR
Harry HP_CHAR
Harry HP_CHAR
Harry HP_CHAR
Voldemort HP_CHAR
Harry HP_CHAR
Who HP_CHAR


In [80]:
for sentence in hp_sentences[:100]:
    doc = nlp(sentence.text)
    for ent in doc.ents:
        print(ent.text, ent.label_)

Harry Potter HP_CHAR
THE BOY WHO LIVED HP_CHAR
Mr. HP_CHAR
Mrs. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mrs. Dursley HP_CHAR
Dudley HP_CHAR
Mrs. HP_CHAR
Potter HP_CHAR
Mrs. Dursley HP_CHAR
Mrs. Dursley HP_CHAR
Dudley HP_CHAR
Mr. HP_CHAR
Mrs. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mrs. Dursley HP_CHAR
Dudley HP_CHAR
Mr. Dursley HP_CHAR
Mrs. Dursley HP_CHAR
Dudley HP_CHAR
Dudley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Most HP_CHAR
Mr. Dursley HP_CHAR
Harry HP_CHAR
Mr. Dursley HP_CHAR
Potter HP_CHAR
Potter HP_CHAR
Harry HP_CHAR
Harry HP_CHAR
Mrs. Dursley HP_CHAR
Mr. Dursley HP_CHAR
You-Know-Who HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Shoo HP_CHAR
Mr. Dursley HP_CHAR
Mr. Dursley HP_CHAR
Mrs. Dursley HP_CHAR
Mrs. HP_CHAR
Dudley HP_CHAR
Mr. Dursley HP_CHAR


In [85]:
for sentence in hp_sentences[900:920]:
    doc = nlp(sentence.text)
    for ent in doc.ents:
        print(ent.text, ent.label_)

Mr. HP_CHAR
Potter HP_CHAR
Mr. HP_CHAR
Potter HP_CHAR
Harry HP_CHAR
Uncle Vernon HP_CHAR
Uncle Vernon HP_CHAR
Aunt Petunia HP_CHAR
Uncle Vernon HP_CHAR
Dudley HP_CHAR
Aunt Petunia HP_CHAR
Uncle Vernon HP_CHAR


In [89]:
for sentence in hp_sentences[2000:2020]:
    doc = nlp(sentence.text)
    for ent in doc.ents:
        print(ent.text, ent.label_)

Harry HP_CHAR
Mr. HP_CHAR
Ollivander HP_CHAR
Try HP_CHAR
Harry HP_CHAR
Mr. HP_CHAR
Ollivander HP_CHAR
Harry HP_CHAR
Mr. HP_CHAR
Ollivander HP_CHAR
Mr. HP_CHAR
Ollivander HP_CHAR
Not HP_CHAR
Harry HP_CHAR
Hagrid HP_CHAR
Mr. HP_CHAR
Ollivander HP_CHAR


# Bonus Material: 07 Generating Custom Word Vectors in Gensim (Word2Vec)

In [47]:
import json, re
from gensim.models.word2vec import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
import multiprocessing

In [50]:
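def training(model_name):
    sentences = hp_sentences
    cores = multiprocessing.cpu_count()
    w2v_model = Word2Vec(
        min_count=5,        # ignore words that occur fewer than 5 times
        window=2,           # context words considered on each side of the target
        vector_size=500,    # dimensionality of each word vector
        sample=6e-5,        # downsampling threshold for very frequent words
        alpha=0.03,         # initial learning rate
        min_alpha=0.0007,   # learning rate at the end of training
        negative=20,        # number of negative samples per positive example
        workers=max(1, cores - 1),  # keep one core free, but use at least one worker
    )

    w2v_model.build_vocab(sentences)
    w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30)
    # the word_vectors/ directory must already exist
    w2v_model.save(f"word_vectors/{model_name}.model")
    w2v_model.wv.save_word2vec_format(f"word_vectors/word2vec_{model_name}.txt")

training("hp_ner_model_01")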
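
Gensim's `Word2Vec` expects its input as an iterable of sentences, where each sentence is a list of string tokens. The notebook does not show how `hp_sentences` was prepared for this step, so the following is only a sketch under assumptions: `to_token_lists` is a hypothetical helper, and dropping apostrophes is a guess based on tokens such as "Theyre" in the similarity results.

```python
import re


def to_token_lists(sentences):
    """Turn raw sentence strings into lists of word tokens, one list per sentence."""
    token_lists = []
    for sentence in sentences:
        # Assumption: apostrophes are dropped, so "They're" becomes "Theyre"
        cleaned = sentence.replace("'", "").replace("\u2019", "")
        # Keep runs of letters, digits and hyphens; everything else separates tokens
        tokens = re.findall(r"[A-Za-z0-9-]+", cleaned)
        if tokens:  # skip sentences that contain no word tokens
            token_lists.append(tokens)
    return token_lists


example = to_token_lists(["They're in Gryffindor, Harry said.", "   "])
print(example)  # → [['Theyre', 'in', 'Gryffindor', 'Harry', 'said']]
```

If `hp_sentences` holds spaCy sentence spans, as the earlier cells suggest, the strings could be taken from `sentence.text` before tokenizing.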

In [51]:
def gen_similarity(word):
    model = KeyedVectors.load_word2vec_format("word_vectors/word2vec_hp_ner_model_01.txt", binary=False)
    results = model.most_similar(positive=[word])
    print(results)

In [52]:
gen_similarity("Harry")

[('shouted', 0.9997987747192383), ('noticed', 0.9997981786727905), ('Theyre', 0.9997920393943787), ('For', 0.9997915625572205), ('turning', 0.9997914433479309), ('sharply', 0.9997906684875488), ('decided', 0.9997904896736145), ('Quirrell', 0.999790370464325), ('You', 0.9997897744178772), ('safely', 0.9997897148132324)]


In [53]:
gen_similarity("Gryffindor")

[('Slytherin', 0.9997866153717041), ('Theres', 0.9997844099998474), ('taken', 0.9997795224189758), ('points', 0.9997770190238953), ('If', 0.9997751712799072), ('Slytherins', 0.9997743368148804), ('place', 0.9997742772102356), ('So', 0.9997740983963013), ('warned', 0.9997740387916565), ('This', 0.9997732639312744)]
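
`most_similar` ranks words by the cosine similarity of their vectors. Scores that all sit around 0.9998, as in the two results above, mean the vectors point in almost the same direction, which may indicate that the model has not yet separated the words well. The ranking itself can be sketched in plain Python; the two-dimensional `toy_vectors` below are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for this example
toy_vectors = {
    "Harry": [1.0, 0.0],
    "Potter": [0.9, 0.1],
    "drills": [0.0, 1.0],
}
ranking = most_similar("Harry", toy_vectors)
print(ranking)
```

In the toy data "Potter" ranks first because its vector is nearly parallel to that of "Harry", while "drills" is orthogonal and scores 0.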


# Bonus Material: 08 Importing Custom Word Vectors from Gensim into spaCy

In [54]:
import spacy
import subprocess
import sys

word_vectors = "word_vectors/word2vec_hp_ner_model_01.txt"
model_name = "hp_model_test"

def load_word_vectors(model_name, word_vectors):
    # build a spaCy model directory from the exported vectors (spaCy v2 CLI: init-model)
    subprocess.run(
        [
            sys.executable,
            "-m",
            "spacy",
            "init-model",
            "en",
            model_name,
            "--vectors-loc",
            word_vectors,
        ]
    )

load_word_vectors(model_name, word_vectors)
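
The `init-model` command used above belongs to the spaCy v2 command line. In spaCy v3 the equivalent step is `init vectors`, which takes its arguments in a different order. The following is a minimal sketch of building the right argument list for either version: `build_init_command` and its `spacy_major` parameter are hypothetical, and the function only constructs the command without running it.

```python
import sys


def build_init_command(spacy_major, lang, vectors_path, output_dir):
    """Return the argument list that creates a spaCy model from a word-vector file."""
    if spacy_major >= 3:
        # spaCy v3: python -m spacy init vectors <lang> <vectors_loc> <output_dir>
        return [sys.executable, "-m", "spacy", "init", "vectors",
                lang, vectors_path, output_dir]
    # spaCy v2: python -m spacy init-model <lang> <output_dir> --vectors-loc <vectors_loc>
    return [sys.executable, "-m", "spacy", "init-model",
            lang, output_dir, "--vectors-loc", vectors_path]


command = build_init_command(
    2, "en", "word_vectors/word2vec_hp_ner_model_01.txt", "hp_model_test"
)
print(command[1:])
```

The returned list can be passed to `subprocess.run` as in the cell above. The installed major version could be read from `spacy.__version__` instead of being passed in by hand.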