# Named Entity Recognition

-------------------
**Contents of this notebook**

[Finding all named entities in a document](#section-1)

[Finding the most frequent Named Entities of a given type](#section-2)

[Finding sentences that contain a given Named Entity keyword](#section-3)

[Tuning the Named Entity Recognizer](#section-4)

-------------------

In this notebook, we're going to use spaCy to find Named Entities in a text.

In [None]:
#Import the libraries we need
import spacy
from collections import Counter

#Download the language model you're interested in
!python -m spacy download en_core_web_md

In [None]:
#Load language model
nlp = spacy.load('en_core_web_md')

In [None]:
#Create spaCy document
text = open('soderberg-corpus/1897_Drizzle.txt', encoding='utf-8').read()
document = nlp(text)

<a id='section-1'></a>
#### Finding all named entities in a document

In [None]:
# We can use `.ents` to pull out all the Named Entities spaCy recognizes in the document
document.ents

In [None]:
#Get Named Entities and their label
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

In [None]:
#Visualize all the Named Entities using displacy
from spacy import displacy
displacy.render(document, style="ent")

In [None]:
#Get only Named Entities of a certain type (e.g. people with PERSON)
for named_entity in document.ents:
    if named_entity.label_ == 'PERSON':
        print(named_entity)

<a id='section-2'></a>
#### Finding the most frequent Named Entities of a given type

In [None]:
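#Define a function that counts the Named Entities of a given label (or all Named Entities if no label is given)
def find_most_frequent_NE(doc, NE_label=None):
    
    named_entities = []
    
    #Read from the `doc` argument rather than the global `document`
    for named_entity in doc.ents:
        if NE_label is None or named_entity.label_ == NE_label:
            named_entities.append(named_entity.text)
    #Return (entity text, count) pairs, most frequent first
    return Counter(named_entities).most_common()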

In [None]:
#Call your function for a given NE (e.g. PERSON, or DATE or TIME)
find_most_frequent_NE(document, NE_label='DATE')
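
To see the shape of the returned value without loading a model, the tallying step can be tried on a hand-written list. This is only a sketch: the strings in `sample_entities` are made up for illustration and are not output from the corpus.

```python
from collections import Counter

# Hypothetical entity strings standing in for `named_entity.text` values
sample_entities = ["Lord", "Devil", "Lord", "Stockholm", "Lord", "Devil"]

# most_common() returns (item, count) pairs sorted from most to least frequent
tally = Counter(sample_entities).most_common()
print(tally)

# Passing a number keeps only the top n entries
top_two = Counter(sample_entities).most_common(2)
print(top_two)
```

Each pair holds the entity text and the number of times it occurred, so the first element of the list is always the most frequent entity.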

<a id='section-3'></a>
#### Finding sentences that contain a given Named Entity keyword

We can also find every sentence in which a Named Entity contains a given keyword, and display the NER label assigned to that entity in that sentence.

In [None]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):
    
    #If no labels are requested, accept every label in the list below
    if not desired_ner_labels:
        desired_ner_labels = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']
        
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        #Process each sentence on its own (this uses the `nlp` pipeline loaded earlier in the notebook)
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            #Check to see if the keyword is in the Named Entity (and ignore capitalization by making both lowercase)
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                #Use the regex library to replace linebreaks and to make the Named Entity bolded, again ignoring capitalization
                sentence_text = re.sub('\n', ' ', sentence.text)
                #Escape the entity text so that characters such as "." or "(" are matched literally
                sentence_text = re.sub(re.escape(named_entity.text), lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

                display(Markdown('---'))
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))

In [None]:
get_ner_in_context('autumn', document)
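
The bolding step can be looked at on its own. Entity text can contain characters that mean something special in a regular expression (the "." in "Mr." for example), so the sketch below escapes it before substituting. `bold_keyword` is a hypothetical helper written for this illustration, and the example sentence is invented.

```python
import re

def bold_keyword(sentence_text, entity_text):
    """Hypothetical helper: wrap every case-insensitive match of entity_text in Markdown bold."""
    # re.escape makes characters such as "." match literally
    pattern = re.escape(entity_text)
    # A function replacement keeps the capitalization found in the sentence
    return re.sub(pattern, lambda match: f"**{match.group(0)}**", sentence_text, flags=re.IGNORECASE)

example = bold_keyword("It rained on Mr. Berg, and MR. BERG sighed.", "Mr. Berg")
print(example)
```

Using a function as the replacement means the sentence keeps its original capitalization, rather than having each match overwritten with the casing of the entity text.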

<a id='section-4'></a>
#### Tuning the Named Entity Recognizer

We're going to use the `EntityRuler` to customize the Named Entity Recognizer. 

The `EntityRuler` allows us to create a set of patterns with corresponding labels. Once we have created the `EntityRuler` and given it a set of patterns, we can add it to the spaCy pipeline as a new pipe. The cells below show how to add an `EntityRuler` component to the `nlp` pipeline.
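
As a small standalone illustration, here is a sketch that assumes spaCy v3 is installed and uses a blank English pipeline, so no trained model is needed. Besides plain strings, a pattern can also be a list of token descriptions; the `LOWER` attribute used here matches a token regardless of capitalization. The example sentence and the variable names `blank_nlp`, `demo_ruler`, and `found` are made up for this sketch.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components (assumes spaCy v3)
blank_nlp = spacy.blank("en")
demo_ruler = blank_nlp.add_pipe("entity_ruler")

# Token patterns: one dictionary per token; LOWER compares the lowercased token text
demo_ruler.add_patterns([
    {"label": "PERSON", "pattern": [{"LOWER": "devil"}]},
    {"label": "PERSON", "pattern": [{"LOWER": "the"}, {"LOWER": "good"}, {"LOWER": "lord"}]},
])

demo_doc = blank_nlp("The Good Lord frowned at the devil.")
found = [(entity.text, entity.label_) for entity in demo_doc.ents]
print(found)
```

Because the blank pipeline has no statistical `ner` component, every entity in `found` comes from the ruler alone, which makes it easier to check that a pattern behaves as intended before adding it to the full model.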

In [None]:
#Import the libraries we need
import spacy
from collections import Counter

#Download the language model you're interested in
!python -m spacy download en_core_web_md

In [None]:
#Load language model
nlp = spacy.load('en_core_web_md')

In [None]:
text = open('soderberg-corpus/1897_Drizzle.txt', encoding='utf-8').read()

In [None]:
document = nlp(text)

In [None]:
#Get Named Entities and their label
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

I want to add some characters in the story (the Devil, the Lord, etc.) to the NER that are currently not recognized as PERSONs.

In [None]:
#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler", before="ner")

#List of Entities and Patterns
patterns = [
    {"label": "PERSON", "pattern": "Devil"},
    {"label": "PERSON", "pattern": "Lord"},
    {"label": "PERSON", "pattern": "the good Lord"},
    {"label": "PERSON", "pattern": "God"}
]

#Add patterns to the ruler
ruler.add_patterns(patterns)

In [None]:
#Create new spaCy document to check updated Named Entities
document = nlp(text)
    
#Get Named Entities and their label
for named_entity in document.ents:
    print(named_entity.text, named_entity.label_)

Finally, we can count the most frequent Named Entities of a given type, now including the entities added by the `EntityRuler`.

In [None]:
#Count the most frequent Named Entities of a given type (here PERSON)

named_entities = []

for named_entity in document.ents:
    if named_entity.label_ == 'PERSON':
        named_entities.append(named_entity.text)

entity_tally = Counter(named_entities)
most_frequent_entities = entity_tally.most_common()
most_frequent_entities

_Acknowledgements_: This notebook is inspired by Melanie Walsh’s [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/12-Named-Entity-Recognition.html) and William Mattingly's [Introduction to spaCy](https://github.com/wjbmattingly/tap-2023-spacy-01/tree/main).