# 13. Named Entity Recognition


A named entity is an object, person, location or organisation which has been assigned a proper name. *Named Entity Recognition* (NER) is a computational technique that seeks to identify all the named entities that are mentioned within texts. Applications making use of Named Entity Recognition can generally extract most of the occurrences of such named entities, and they can also classify these entities using pre-defined categories such as ‘Person’, ‘Location’, ‘Work of Art’ and ‘Organisation’. 

Named Entity Recognition applications typically make use of statistical models created using Machine Learning algorithms. Such models are often trained on large numbers of texts in which all the people, locations, organisations and named objects have been labelled manually by human readers. By analysing the frequencies and the contexts of these named entities, the models learn to recognise similar types of entities in new, unlabelled texts. 

This notebook explains how you can work with Named Entity Recognition using the `stanza` package. This package includes [a range of state-of-the-art NLP models](https://stanfordnlp.github.io/stanza/available_models.html) which enable you to carry out tasks such as *part of speech tagging*, *sentiment analysis* and *named entity recognition*. 

The package can be installed as follows (remove the leading `#` to run the command). After the installation, you may need to restart the kernel before you can make use of the package. 

In [9]:
#pip install stanza

After a successful installation, you should be able to import the package. 

In [10]:
import stanza

Next, you need to create an NLP pipeline. As you do this, you need to specify the language of the texts in your corpus, together with the NLP processors you want to work with. In the cell below, the pipeline is given the name `nlp`.

In [11]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

2024-10-30 11:36:40 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2024-10-30 11:36:40 INFO: Use device: cpu
2024-10-30 11:36:40 INFO: Loading: tokenize
2024-10-30 11:36:40 INFO: Loading: ner
2024-10-30 11:36:40 INFO: Done loading processors!


This pipeline object can then be used to parse natural language texts. Calling the pipeline on a text returns a document object. If the 'ner' processor is selected, this document has an attribute named `ents`, containing all the named entities that were found in the text. Each entity has a `text` and a `type`. 

In [14]:
sentence = "James Joyce (born February 2, 1882, Dublin, Ireland - died January 13, 1941, Zürich, Switzerland) was Irish novelist noted for his experimental use of language and exploration of new literary methods in such large works of fiction as Ulysses (1922) and Finnegans Wake (1939)."

doc = nlp(sentence)

for named_entity in doc.ents:
    print(f"{named_entity.text}\ttype: {named_entity.type}")


James Joyce	type: PERSON
February 2, 1882	type: DATE
Dublin	type: GPE
Ireland	type: GPE
January 13, 1941	type: DATE
Zürich	type: GPE
Switzerland	type: GPE
Irish	type: NORP
Ulysses (1922)	type: WORK_OF_ART
Finnegans Wake	type: WORK_OF_ART
1939	type: DATE
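Once the entities are available as pairs of text and type, it is often convenient to group them by type. The sketch below is only an illustration: it uses pairs copied from the output shown above instead of a live `stanza` document, and the function name `group_by_type` is an assumption rather than part of `stanza`.

```python
from collections import defaultdict

# (text, type) pairs copied from the output above. With a real stanza
# document, the same list could be built as:
#   [(ent.text, ent.type) for ent in doc.ents]
entities = [
    ("James Joyce", "PERSON"),
    ("February 2, 1882", "DATE"),
    ("Dublin", "GPE"),
    ("Ireland", "GPE"),
    ("January 13, 1941", "DATE"),
    ("Zürich", "GPE"),
    ("Switzerland", "GPE"),
    ("Irish", "NORP"),
    ("Ulysses (1922)", "WORK_OF_ART"),
    ("Finnegans Wake", "WORK_OF_ART"),
    ("1939", "DATE"),
]

def group_by_type(pairs):
    """Return a dict mapping each entity type to the list of its texts."""
    grouped = defaultdict(list)
    for text, ent_type in pairs:
        grouped[ent_type].append(text)
    return dict(grouped)

grouped = group_by_type(entities)
for ent_type, texts in grouped.items():
    print(f"{ent_type}: {', '.join(texts)}")
```

Grouping in this way makes it easy to inspect one category at a time, for example `grouped["GPE"]`.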


`stanza` works with the following pre-defined NER labels: 

* PERSON: People, including fictional 
* NORP: Nationalities or religious or political groups 
* FAC: Buildings, airports, highways, bridges, etc. 
* ORG: Companies, agencies, institutions, etc. 
* GPE: Countries, cities, states 
* LOC: Non-GPE locations, mountain ranges, bodies of water 
* PRODUCT: Objects, vehicles, foods, etc. (not services) 
* EVENT: Named hurricanes, battles, wars, sports events, etc. 
* WORK_OF_ART: Titles of books, songs, etc. 
* LAW: Named documents made into laws. 
* LANGUAGE: Any named language 
* DATE: Absolute or relative dates or periods 
* TIME: Times smaller than a day 
* PERCENT: Percentage, including "%" 
* MONEY: Monetary values, including unit 
* QUANTITY: Measurements, as of weight or distance 
* ORDINAL: "first", "second", etc. 
* CARDINAL: Numerals that do not fall under another type 
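Several of the labels listed above describe dates, numbers and amounts rather than names in the strict sense. A minimal sketch of how such labels could be filtered out is given below. The grouping in `NUMERIC_LABELS` and the helper `keep_names` are assumptions made for this illustration; they are not defined by `stanza`.

```python
# Labels from the list above that describe dates, numbers and amounts.
# Which labels count as "numeric" is a choice made for this example.
NUMERIC_LABELS = {
    "DATE", "TIME", "PERCENT", "MONEY",
    "QUANTITY", "ORDINAL", "CARDINAL",
}

def keep_names(pairs):
    """Keep only the (text, label) pairs whose label is not numeric."""
    return [(text, label) for text, label in pairs
            if label not in NUMERIC_LABELS]

# A few (text, label) pairs in the format used earlier in this notebook
sample = [
    ("James Joyce", "PERSON"),
    ("February 2, 1882", "DATE"),
    ("Dublin", "GPE"),
    ("1939", "DATE"),
]

names_only = keep_names(sample)
print(names_only)
```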

## Finding NER tags in longer texts

When you work with `stanza`, it is useful to bear in mind that parsing requires roughly 1GB of memory per 100,000 characters. Texts containing more than 1,000,000 characters tend to cause memory allocation errors. 

The code below tries to avoid such errors. It divides the full text into segments, each of which is shorter than a maximum length named `max_length`. This `max_length` has been set to a safe value of 100,000 characters. 

After this, these shorter segments are parsed one by one. The tagged texts are stored in a dictionary named `tagged_segments`. 

Tagging texts of ca. 100,000 characters still demands a considerable amount of memory. The code below may take some time to complete because of this.   

In [21]:
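from os.path import join
# sent_tokenize needs NLTK's Punkt sentence tokenizer data; if it is missing,
# run nltk.download('punkt') once ('punkt_tab' in recent NLTK releases)
from nltk.tokenize import sent_tokenize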

segments = dict()
tagged_segments = dict()
segment_nr = 0

text = 'PrideAndPrejudice.txt'
corpus_dir = 'Corpus'  # named corpus_dir to avoid shadowing the built-in dir()
path = join( corpus_dir, text )
max_length = 100000

with open(path, encoding = 'utf-8') as file_handler:
    full_text = file_handler.read()
    
print( f'Total number of characters in {text}: {len(full_text)}' )

sentences = sent_tokenize(full_text)

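length = 0 
segment = ''

for s in sentences:
    # Count the sentence plus the space that is appended after it
    length += len(s) + 1
    if length < max_length:
        segment += s + ' '
    else:
        # The current segment is full: store it and start a new one with s
        segments[segment_nr] = segment
        segment = s + ' '
        # The new segment already contains s, so its length must be counted
        length = len(s) + 1
        segment_nr += 1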
        
if len(segment) > 0:
    segments[segment_nr] = segment
    
print(f"The text has been divided into {len(segments)} segments.")
    
print( 'Annotating the text segments ... ')    
for i in segments:
    print(i)
    tagged_segments[i] = nlp(segments[i])
print('Done')

Total number of characters in PrideAndPrejudice.txt: 685008
The text has been divided into 7 segments.
Annotating the text segments ... 
0
1
2
3
4
5
6
Done
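The segmentation step can also be isolated into a small function, which makes it easier to try out on a short list of sentences before running it on a full novel. The sketch below is a simplified variant of the loop above, not a replacement for it: the function name `split_into_segments` and the sample sentences are assumptions made for illustration, and the result is a list rather than a dictionary.

```python
def split_into_segments(sentences, max_length):
    """Group sentences into segments of at most max_length characters.

    A single sentence that is longer than max_length still ends up in a
    segment of its own, which then exceeds the limit.
    """
    segments = []
    current = ''
    for s in sentences:
        piece = s + ' '
        # Start a new segment if adding this sentence would pass the limit
        if current and len(current) + len(piece) > max_length:
            segments.append(current)
            current = piece
        else:
            current += piece
    # Store the final, partially filled segment
    if current:
        segments.append(current)
    return segments

# Invented sample sentences and a very small limit, for demonstration only
sentences = [
    "It is a truth.",
    "Universally acknowledged.",
    "That a man.",
    "Must want a wife.",
]
segments = split_into_segments(sentences, max_length=30)

for number, segment in enumerate(segments):
    print(number, len(segment), repr(segment))
```

Because the function only depends on a list of strings, it can be tested without loading any NLP models.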


The annotated texts can be analysed in a variety of ways. The code below lists the personal names that are mentioned most frequently in the text. 


In [None]:
from tdm import sortedByValue

freq = dict()

for segment_id in tagged_segments:
    for named_entity in tagged_segments[segment_id].ents:
        # stanza stores the label in `type` and the matched string in `text`
        if named_entity.type == 'PERSON':
            name = named_entity.text.strip()
            freq[ name ] = freq.get( name , 0 ) + 1
        
for name in reversed( sortedByValue(freq) ):
    print( f'{name}: {freq[name]}' )

To view all the works of art that are referred to in the text, for instance, you need to replace the label 'PERSON' in the code above with the label 'WORK_OF_ART'.
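As an alternative way of counting, the standard library class `collections.Counter` can do the tallying and sorting in one go, with the label passed in as a parameter. The sketch below is only an illustration: the helper `count_entities` is an assumption, and the sample pairs are invented rather than taken from real `stanza` output. With real data, the pairs could be collected from `ents` using the `text` and `type` of each entity.

```python
from collections import Counter

def count_entities(pairs, label):
    """Count how often each entity text occurs with the given label."""
    return Counter(
        text.strip() for text, ent_type in pairs if ent_type == label
    )

# Invented (text, type) pairs, used here for demonstration only
sample = [
    ("Elizabeth", "PERSON"),
    ("Darcy", "PERSON"),
    ("Elizabeth", "PERSON"),
    ("Longbourn", "GPE"),
    ("Darcy", "PERSON"),
    ("Elizabeth ", "PERSON"),
]

person_counts = count_entities(sample, "PERSON")

# most_common() returns the names sorted by descending frequency
for name, count in person_counts.most_common():
    print(f"{name}: {count}")
```

Changing the second argument, for example to `"WORK_OF_ART"`, produces the counts for a different category without any other changes to the code.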

### Exercise 13.1

Using the code that is explained in this notebook, try to find all the locations that are mentioned in *Pride and Prejudice*. 