# Named Entity Recognition


A named entity is an object, person, location or organisation which has been assigned a proper name. *Named Entity Recognition* (NER) is a computational technique that seeks to identify all the named entities that are mentioned within texts. Applications making use of Named Entity Recognition can generally extract most of the occurrences of such named entities, and they can also classify these entities using pre-defined categories such as ‘Person’, ‘Location’, ‘Work of Art’ and ‘Organisation’.

Named Entity Recognition applications typically make use of statistical models created using Machine Learning algorithms. Such models are often trained using large numbers of texts in which all the people, locations, organisations and named objects have been labelled manually by human readers. On the basis of a careful analysis of the frequencies and the contexts of all of these named entities, computers can eventually be enabled to recognise similar types of entities in new, unlabelled texts.
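
To make the idea of learning from labelled examples more concrete, the sketch below counts which word directly precedes each labelled entity in a tiny hand-labelled sample. The sample sentences, the labels and the function name `context_counts` are all invented for this illustration. Real NER models learn from far richer features and from much larger corpora, so this is a toy illustration of the principle only, and not a description of how any particular tool is trained.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled data: each token is paired with an entity
# label, or with 'O' when it is not part of a named entity.
labelled_sentences = [
    [('Mr', 'O'), ('Emerson', 'PERSON'), ('lives', 'O'), ('in', 'O'), ('Florence', 'GPE')],
    [('Mr', 'O'), ('Beebe', 'PERSON'), ('arrived', 'O'), ('in', 'O'), ('Rome', 'GPE')],
    [('Miss', 'O'), ('Bartlett', 'PERSON'), ('stayed', 'O'), ('in', 'O'), ('Florence', 'GPE')],
]

def context_counts(sentences):
    """Count, per entity label, the word directly preceding each labelled token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, (token, label) in enumerate(sentence):
            if label != 'O' and i > 0:
                previous_word = sentence[i - 1][0]
                counts[label][previous_word] += 1
    return counts

counts = context_counts(labelled_sentences)
for label, counter in counts.items():
    print(label, counter.most_common())
```

In this sample, every location is preceded by 'in' and every person by a title such as 'Mr' or 'Miss'. Regularities of this kind are what a statistical model can pick up on when it meets names it has not seen before.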

## Stanza

One of the tools that can be used for NER is [Stanza](https://stanfordnlp.github.io/stanza/). Stanza is a Python library developed by the Stanford NLP Group, which also maintains the `Stanford CoreNLP` application. `Stanza` can be installed via `pip`. 

In [None]:
# import sys
# !pip install stanza
# !pip install tqdm   

Once the package has been installed successfully, you can import the code. 

In [None]:
import stanza

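If you want to recognise named entities, you first need to create a `Pipeline` object. The `lang` parameter specifies the language model you want to work with ('en' stands for English). The `processors` parameter lists the processing steps that need to be run, in this case tokenisation followed by NER. 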

In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

You can find more models in [the documentation of the stanza package](https://stanfordnlp.github.io/stanza/available_models.html). The language code for 'Dutch' is 'nl'.

The `Pipeline` object, which was named `nlp` in the code above, can be used to find the named entities in strings. The text can be supplied directly as an argument when you call the `nlp` object. 

In [None]:
sentence = '''James Joyce (born February 2, 1882, Dublin, Ireland - died January 13, 1941,
    Zürich, Switzerland) was Irish novelist noted for his experimental use of language and 
    exploration of new literary methods in such large works of fiction as Ulysses (1922)
    and Finnegans Wake (1939).'''

doc = nlp(sentence)


The result of this call is assigned to a variable named `doc`. This variable has a property named `ents` which lists all the named entities that were found in the string. Each named entity has a `text` and a `type` property. 

You can navigate across all of the named entities in a `for` loop. 

In [None]:
for ne in doc.ents:
    print( f'{ne.text} (type: {ne.type})' )
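
When a text contains many entities, it can be more convenient to group them by category than to print them one by one. The following is a minimal sketch of such a grouping. The function name `group_by_type` is hypothetical, and the sample entities are stand-in objects with illustrative values (not actual model output), so that the example can run without a language model.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_type(entities):
    """Map each entity type to the list of entity texts found for it."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.type].append(entity.text)
    return dict(grouped)

# Stand-in objects with the same `text` and `type` properties as the
# entities in `doc.ents`; the values are illustrative only.
sample_ents = [
    SimpleNamespace(text='James Joyce', type='PERSON'),
    SimpleNamespace(text='Dublin', type='GPE'),
    SimpleNamespace(text='Ireland', type='GPE'),
    SimpleNamespace(text='Ulysses', type='WORK_OF_ART'),
]

grouped = group_by_type(sample_ents)
for entity_type, texts in grouped.items():
    print(f'{entity_type}: {texts}')
```

Since the function only relies on the `text` and `type` properties, calling `group_by_type(doc.ents)` on the result of a real pipeline should work in the same way.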

The model for the English language was trained on the basis of a large annotated corpus named *[OntoNotes](https://catalog.ldc.upenn.edu/LDC2013T19)*. This model can assign named entities to the following categories:

* PERSON: People, including fictional 
* NORP: Nationalities or religious or political groups 
* FAC: Buildings, airports, highways, bridges, etc. 
* ORG: Companies, agencies, institutions, etc. 
* GPE: Countries, cities, states 
* LOC: Non-GPE locations, mountain ranges, bodies of water 
* PRODUCT: Objects, vehicles, foods, etc. (not services) 
* EVENT: Named hurricanes, battles, wars, sports events, etc. 
* WORK_OF_ART: Titles of books, songs, etc. 
* LAW: Named documents made into laws. 
* LANGUAGE: Any named language 
* DATE: Absolute or relative dates or periods 
* TIME: Times smaller than a day 
* PERCENT: Percentage, including "%" 
* MONEY: Monetary values, including unit 
* QUANTITY: Measurements, as of weight or distance 
* ORDINAL: "first", "second", etc. 
* CARDINAL: Numerals that do not fall under another type 




## Finding NER tags in longer texts

One important limitation of the *Stanza* tagger is that it is best applied to texts consisting of fewer than 1,000,000 characters. As a rough guide, the parser requires about 1GB of memory per 100,000 characters, and texts containing more than 1,000,000 characters tend to cause memory allocation errors. 
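
Taking the figure of roughly 1GB per 100,000 characters at face value, a back-of-the-envelope calculation shows why shorter segments are much easier to handle. The function name `estimated_memory_gb` is hypothetical, and the linear relation between text length and memory use is an assumption made for the sake of the estimate; actual memory use depends on the model and the hardware.

```python
def estimated_memory_gb(n_characters, gb_per_100k=1.0):
    """Rough memory estimate, assuming usage grows linearly with text length."""
    return n_characters / 100_000 * gb_per_100k

# Compare a short segment, a medium-sized text and a very long text
for n in (5_000, 100_000, 1_000_000):
    print(f'{n:>9,} characters: ~{estimated_memory_gb(n):.2f} GB')
```

Under these assumptions, a segment of 5,000 characters needs only a small fraction of the memory that a text of 1,000,000 characters would need.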

The code below tries to avoid such errors. It sets the `max_length` of the texts to be parsed to a safe value of 5000 characters, and it divides the full text into segments, each of which is shorter than this `max_length`.  

After this, these shorter segments are all parsed one by one. These tagged texts are stored in a dictionary named `tagged_segments`. 

Tagging texts of ca. 5000 characters still demands a fair amount of memory, and a full novel consists of many such segments. The code below may therefore take some time to complete.   

In [None]:
from os.path import join
# sent_tokenize needs NLTK's Punkt sentence tokenizer data (see nltk.download)
from nltk.tokenize import sent_tokenize
from tqdm.notebook import tqdm

segments = dict()
tagged_segments = dict()
segment_nr = 0

text = 'ARoomWithAView.txt'
# Named corpus_dir rather than dir, to avoid shadowing the built-in dir()
corpus_dir = 'Corpus'
path = join( corpus_dir, text )
max_length = 5000

with open(path, encoding = 'utf-8') as file_handler:
    full_text = file_handler.read()
    
print( f'Total number of characters in {text}: {len(full_text)}' )

sentences = sent_tokenize(full_text)

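segment = ''

for s in sentences:
    # Close the current segment when adding this sentence would reach max_length.
    # A single sentence longer than max_length still becomes a segment of its own.
    if segment and len(segment) + len(s) + 1 >= max_length:
        segments[segment_nr] = segment
        segment_nr += 1
        segment = ''
    segment += s + ' '

# Store the final segment, which is usually shorter than the others
if len(segment) > 0:
    segments[segment_nr] = segment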
    
print(f'{len(segments)} segments were created.')
print( 'Annotating the text segments ... ')    
for i in tqdm(segments,total = len(segments),desc="Progress: "):
    tagged_segments[i] = nlp(segments[i])
print('Done')

The annotated texts can be analysed in a variety of ways. The code below lists the personal names that are mentioned most frequently in the text. 


In [None]:
from collections import Counter

freq = Counter()

for segment_id in tagged_segments:
    for named_entity in tagged_segments[segment_id].ents:
        # 'PERSON' is the category used for personal names
        if named_entity.type == 'PERSON':
            name = named_entity.text
            #name = name.strip()
            freq.update([name])

# Print the 20 most frequent names, together with their counts
for ne, count in freq.most_common(20):
    print(ne, count)
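
Entity texts taken from a running text may differ only in their whitespace, for instance when a name is split across a line break. Such variants would then be counted as separate names. The sketch below shows one possible way to normalise names before counting them. The function name `normalise_name` is hypothetical, and the list of names is invented for illustration rather than taken from actual tagger output.

```python
import re
from collections import Counter

def normalise_name(name):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r'\s+', ' ', name).strip()

# Illustrative entity texts, not actual tagger output
raw_names = ['Miss Bartlett', 'Miss\nBartlett', ' Lucy', 'Lucy', 'Mr.  Beebe']

normalised_freq = Counter(normalise_name(n) for n in raw_names)
for name, count in normalised_freq.most_common():
    print(name, count)
```

In the counting loop, the same function could be applied to `named_entity.text` before the `Counter` is updated.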

## spaCy

The NLP library named *spaCy* is a second tool that you can work with to analyse named entities. For more information on how to install *spaCy*, or on how to load specific language models, please read the notebook on *NLP*.

*spaCy* offers support for a wide range of languages. As is the case for `stanza`, the model for the English language was trained on the basis of a large annotated corpus named *[OntoNotes](https://catalog.ldc.upenn.edu/LDC2013T19)*. This model can be downloaded using the following command.

In [None]:
!python -m spacy download en_core_web_sm

After the model has been downloaded, it needs to be loaded, so that you can work with it in your code. The `load()` function in `spaCy` creates a new object which can be used to add linguistic and semantic annotations. In the code below, this object is given the name `nlp`. 

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

This `nlp` object can annotate a given string in a number of ways. SpaCy can be used not only to describe properties such as the parts of speech and the lemmatised versions of words, but also to find the named entities. 
In the code below, the output of the call to `nlp()` is assigned to a variable named `tagged_text`.

In [None]:
tagged_text = nlp("James Joyce (born February 2, 1882, Dublin, Ireland - died January 13, 1941, Zürich, Switzerland) was Irish novelist noted for his experimental use of language and exploration of new literary methods in such large works of fiction as Ulysses (1922) and Finnegans Wake (1939).")

The tags added by `nlp()` can be visualised effectively using the `render()` function from the `displacy` module. When you add the parameter `style`, with value `ent`, this visualisation concentrates on the named entities that have been found. 

In [None]:
from spacy import displacy
displacy.render(tagged_text, style="ent")
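
Next to the visualisation, the entities can also be listed as plain rows of data. In *spaCy*, each entity has the properties `text`, `label_`, `start_char` and `end_char`. The sketch below collects these properties in tuples. The function name `entity_rows` is hypothetical, and the sample entities are stand-in objects with illustrative values, so that the example can run without a downloaded model.

```python
from types import SimpleNamespace

def entity_rows(ents):
    """Return (text, label, start, end) tuples for spaCy-style entity spans."""
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in ents]

# Stand-ins mimicking spaCy entity spans; the values are illustrative only
sample_text = 'James Joyce was born in Dublin.'
sample_spans = [
    SimpleNamespace(text='James Joyce', label_='PERSON', start_char=0, end_char=11),
    SimpleNamespace(text='Dublin', label_='GPE', start_char=24, end_char=30),
]

rows = entity_rows(sample_spans)
for ent_text, label, start, end in rows:
    print(f'{ent_text:<12} {label:<8} {start:>3}-{end}')
```

With the objects created earlier, `entity_rows(tagged_text.ents)` should produce the same kind of rows for the sentence about James Joyce. Note that *spaCy* uses `label_` where *Stanza* uses `type`.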

The meaning of specific *spaCy* codes can be found using the `explain()` function, as is demonstrated in the following code.  

In [None]:
tags = ['PERSON','NORP','FAC','ORG','GPE','LOC','PRODUCT','EVENT','WORK_OF_ART','LAW','LANGUAGE','DATE','TIME','PERCENT','MONEY','QUANTITY','ORDINAL','CARDINAL']

for t in tags: 
    print( f'{t}: {spacy.explain(t)} ' )