# Named Entity Recognition


A named entity is an object, person, location or organisation that has been given a proper name. *Named Entity Recognition* (NER) is a computational technique that seeks to identify the named entities that are mentioned in texts. Applications that use Named Entity Recognition can generally extract most occurrences of such entities, and they can also classify these entities using pre-defined categories such as ‘Person’, ‘Location’, ‘Work of Art’ and ‘Organisation’.

Named Entity Recognition applications typically make use of statistical models created using Machine Learning algorithms. Such models are often trained on large numbers of texts in which all the people, locations, organisations and named objects have been labelled manually by human readers. By analysing the form and the contexts of these labelled entities, such models learn to recognise similar types of entities in new, unlabelled texts.

## spaCy

This notebook explains how you can work with Named Entity Recognition using an NLP library named *spaCy*. For more information on how to install *spaCy*, or on how to load specific language models, please read the notebook on *NLP*.

*spaCy* offers support for a wide range of languages. The model for the English language was trained on a large annotated corpus named *[OntoNotes](https://catalog.ldc.upenn.edu/LDC2013T19)*. This model can be downloaded using the following command:

In [None]:
!python -m spacy download en_core_web_sm

After the model has been downloaded, it needs to be loaded so that you can work with it in your code. The `load()` function in *spaCy* creates a new object which can be used to add linguistic and semantic annotations. In the code below, this object is given the name `nlp`.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

This `nlp` object can annotate a given string in a number of ways. *spaCy* can be used not only to describe properties such as the parts of speech and the lemmatised versions of words, but also to find the named entities.
In the code below, the output of `nlp()` is assigned to a variable named `tagged_text`.

In [None]:
tagged_text = nlp("James Joyce (born February 2, 1882, Dublin, Ireland - died January 13, 1941, Zürich, Switzerland) was Irish novelist noted for his experimental use of language and exploration of new literary methods in such large works of fiction as Ulysses (1922) and Finnegans Wake (1939).")

The tags added by `nlp()` can be visualised effectively using the `render()` function from the `displacy` module. When you set the parameter `style` to `ent`, this visualisation concentrates on the named entities that have been found.

In [None]:
from spacy import displacy
displacy.render(tagged_text, style="ent")

*spaCy* works with the following pre-defined NER labels: 

* PERSON: People, including fictional 
* NORP: Nationalities or religious or political groups 
* FAC: Buildings, airports, highways, bridges, etc. 
* ORG: Companies, agencies, institutions, etc. 
* GPE: Countries, cities, states 
* LOC: Non-GPE locations, mountain ranges, bodies of water 
* PRODUCT: Objects, vehicles, foods, etc. (not services) 
* EVENT: Named hurricanes, battles, wars, sports events, etc. 
* WORK_OF_ART: Titles of books, songs, etc. 
* LAW: Named documents made into laws. 
* LANGUAGE: Any named language 
* DATE: Absolute or relative dates or periods 
* TIME: Times smaller than a day 
* PERCENT: Percentage, including "%" 
* MONEY: Monetary values, including unit 
* QUANTITY: Measurements, as of weight or distance 
* ORDINAL: "first", "second", etc. 
* CARDINAL: Numerals that do not fall under another type 

The meaning of specific *spaCy* codes can be looked up using the `explain()` function, as is demonstrated in the following code.

In [None]:
tags = ['PERSON','NORP','FAC','ORG','GPE','LOC','PRODUCT','EVENT','WORK_OF_ART','LAW','LANGUAGE','DATE','TIME','PERCENT','MONEY','QUANTITY','ORDINAL','CARDINAL']

for t in tags: 
    print( f'{t}: {spacy.explain(t)} ' )

## Finding NER tags in longer texts

One important limitation of the *spaCy* tagger is that, by default, it can only be applied to texts of at most 1,000,000 characters (the default value of `nlp.max_length`). The parser requires roughly 1GB of memory per 100,000 characters, and texts containing more than 1,000,000 characters tend to cause memory allocation errors.

The code below tries to avoid such errors. It sets a conservative limit of 500,000 characters in a variable named `max_length`, and it divides the full text into segments, each of which is shorter than this `max_length`.

After this, these shorter segments are all parsed one by one. These tagged texts are stored in a dictionary named `tagged_segments`. 

Tagging texts of roughly 500,000 characters still requires a considerable amount of memory, so the code below may take some time to complete.

In [None]:
from os.path import join
from nltk.tokenize import sent_tokenize

segments = dict()
tagged_segments = dict()
segment_nr = 0

text = 'ARoomWithAView.txt'
# Named corpus_dir because 'dir' would shadow a Python built-in
corpus_dir = 'Corpus'
path = join( corpus_dir, text )
max_length = 500000

with open(path, encoding = 'utf-8') as file_handler:
    full_text = file_handler.read()
    
print( f'Total number of characters in {text}: {len(full_text)}' )

sentences = sent_tokenize(full_text)

length = 0 
segment = ''

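for s in sentences:
    # Count the sentence plus the space that is appended after it
    length += len(s) + 1
    if length < max_length:
        segment += s + ' '
    else:
        # Store the completed segment and start a new one with this sentence
        segments[segment_nr] = segment
        segment = s + ' '
        # The new segment already contains the current sentence
        length = len(segment)
        segment_nr += 1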
        
if len(segment) > 0:
    segments[segment_nr] = segment
    

print( 'Annotating the text segments ... ')    
for i in segments:
    print(i)
    tagged_segments[i] = nlp(segments[i])
print('Done')
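
The segmentation step described above can also be isolated in a small function, which makes it easier to try out on a short example. The following is only a sketch: the function name `split_into_segments` and the demo sentences are invented for illustration, and a single sentence that is longer than `max_length` would still end up in an oversized segment.

```python
def split_into_segments(sentences, max_length):
    """Group sentences into segments that stay shorter than max_length."""
    segments = []
    current = ''
    for sentence in sentences:
        piece = sentence + ' '
        # Close the current segment if this sentence would reach the limit
        if current and len(current) + len(piece) >= max_length:
            segments.append(current)
            current = ''
        current += piece
    # Do not forget the final, partially filled segment
    if current:
        segments.append(current)
    return segments


# Made-up sentences and a tiny limit, just to make the behaviour visible
demo = ['One two.', 'Three four five.', 'Six.', 'Seven eight nine ten.']
for number, segment in enumerate(split_into_segments(demo, 30)):
    print(number, len(segment), repr(segment))
```

Because no sentence is dropped or duplicated, joining the segments gives back the original sequence of sentences.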

The annotated texts can be analysed in a variety of ways. The code below lists the personal names that are mentioned most frequently in the text. 


In [None]:
from tdm import sortedByValue

freq = dict()

for doc in tagged_segments:
    for named_entity in tagged_segments[doc].ents:
        if named_entity.label_ == 'PERSON':
            name = str(named_entity) 
            name = name.strip()
            freq[ name ] = freq.get( name , 0 ) + 1
        
for name in reversed( sortedByValue(freq) ):
    print( f'{name}: {freq[name]}' )
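
The counting step can also be written with `Counter` from Python's standard library, which avoids the need for a separate sorting helper. The sketch below works on a hand-made list of (text, label) pairs rather than on real *spaCy* output; the names and labels in `entities` are invented for illustration.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for the entities found by the tagger
entities = [
    ('Lucy', 'PERSON'), ('Florence', 'GPE'), ('Cecil', 'PERSON'),
    ('Lucy ', 'PERSON'), ('George', 'PERSON'), ('Lucy', 'PERSON'),
    ('Cecil', 'PERSON'),
]

# Count only the entities labelled PERSON, ignoring surrounding whitespace
person_freq = Counter(
    text.strip() for text, label in entities if label == 'PERSON'
)

# most_common() returns the names sorted by descending frequency
for name, count in person_freq.most_common():
    print(f'{name}: {count}')
```

With real data, the pairs could be taken from the `.ents` of each tagged segment, using the text of the entity and its `label_`.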

To view all the works of art that are referred to in the text, for instance, replace the label 'PERSON' in the code above with the tag 'WORK_OF_ART'.

The NER tagger, as mentioned, also tries to identify the places that are mentioned in the text. Countries, cities and states are assigned the code 'GPE'. Once we have identified the geographical locations that are mentioned in a text, we can also try to visualise all of these locations on a map.

The code below first stores all the locations that are found by *spaCy* in a list named `locations`.

In [None]:
# Collect the text of every entity labelled as GPE
locations = []

for doc in tagged_segments:
    for named_entity in tagged_segments[doc].ents:
        if named_entity.label_ == 'GPE':
            name = str(named_entity) 
            name = name.strip()
            locations.append( name )
        
locations = list( set(locations) )

for l in locations:
    print(l)

The Named Entity Recognition process does not function flawlessly. It probably misses some of the locations that are mentioned in the text, and it is also likely to have labelled some non-locations as locations. If necessary, you can edit the `locations` list manually, making use of the `remove()` method, for instance.
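
Such manual corrections might look as follows. This is a minimal sketch using a made-up list named `demo_locations`, so that the real `locations` list is left untouched; which items count as false positives or missed places is an assumption made purely for illustration.

```python
# Hypothetical tagger output; 'Cecil' and 'Baedeker' stand in for false positives
demo_locations = ['Florence', 'Cecil', 'Rome', 'Baedeker', 'Surrey']
not_locations = ['Cecil', 'Baedeker', 'Vyse']

for name in not_locations:
    # remove() raises a ValueError for a missing item, so check first
    if name in demo_locations:
        demo_locations.remove(name)

# Add a place that the tagger is assumed to have missed
demo_locations.append('Fiesole')

print(demo_locations)
```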

The following code tries to find the geographic coordinates (i.e. the latitude and the longitude) of the items listed in `locations`, using the Nominatim search API of OpenStreetMap.

In [None]:
import pandas as pd
import xml.etree.ElementTree as ET
import re
import requests

def remove_brackets(text):
    # Raw string, so that the backslashes reach the regex engine unchanged
    text = re.sub( r'(\[)|(\])' , '' , text )
    return text

locations_coord = dict()
    
for loc in locations:
    if loc not in locations_coord:
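        # Let requests URL-encode the query. Nominatim's usage policy asks for
        # an identifying User-Agent (the value below is only a placeholder)
        url = 'https://nominatim.openstreetmap.org/search'
        params = { 'q': loc , 'format': 'xml' }
        headers = { 'User-Agent': 'ner-notebook-example' }

        response = requests.get( url , params = params , headers = headers )
        root = ET.fromstring( response.text )
        places = root.findall('place')

        # Keep only the first result that is returned for this location
        if places:
            lat = places[0].attrib['lat']
            lon = places[0].attrib['lon']
            locations_coord[ loc ] = ( lat , lon )
            print(f'{loc}: {lat} {lon}')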
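
The two steps of this lookup, building the query URL and reading the coordinates from the XML response, can be tried out without any network access. The sketch below uses only the standard library. The function names and the sample response are invented for illustration; the sample merely mimics the `place` elements with `lat` and `lon` attributes that the code above expects, and it is not real API output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_search_url(query):
    """Build a Nominatim search URL with the query safely URL-encoded."""
    base = 'https://nominatim.openstreetmap.org/search'
    return base + '?' + urlencode({'q': query, 'format': 'xml'})


def first_coordinates(xml_text):
    """Return (lat, lon) of the first <place> element, or None if there is none."""
    root = ET.fromstring(xml_text)
    place = root.find('place')
    if place is None:
        return None
    return place.attrib['lat'], place.attrib['lon']


# Made-up response in the shape assumed above (illustrative values only)
sample = '''<searchresults>
  <place lat="43.77" lon="11.25" display_name="Florence"/>
  <place lat="34.19" lon="-79.76" display_name="Florence, South Carolina"/>
</searchresults>'''

print(build_search_url('San Gimignano'))
print(first_coordinates(sample))
print(first_coordinates('<searchresults/>'))
```

Returning `None` for an empty result makes it explicit which locations could not be resolved, so that they can be skipped or corrected by hand.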
     

Finally, the locations are drawn on a map using `Leaflet`, a JavaScript library for creating interactive maps.

In [None]:
out = open( 'map.html' , 'w' , encoding = 'utf-8')
import re

out.write('''
<!DOCTYPE html>
<html>
<head>

                <title>Locations</title>

                <meta charset="utf-8" />
                <meta name="viewport" content="width=device-width, initial-scale=1.0">

                <link rel="shortcut icon" type="image/x-icon" href="docs/images/favicon.ico" />

    <link rel="stylesheet" href="https://unpkg.com/leaflet@1.7.1/dist/leaflet.css" integrity="sha512-xodZBNTC5n17Xt2atTPuE1HxjVMSvLVW9ocqUKLsCC5CXdbqCmblAshOMAS6/keqq/sMZMZ19scR4PsZChSR7A==" crossorigin=""/>
    <script src="https://unpkg.com/leaflet@1.7.1/dist/leaflet.js" integrity="sha512-XQoYMqMTK8LvdxXYG3nZ448hOEQiglfqkJs1NOQV44cWnUrBc8PkAOcXy20w0vlaXaVUearIOBhiXZ5V3ynxwA==" crossorigin=""></script>



</head>
<body>





<div id="mapid" style="width: 600px; height: 400px;"></div>
<script>

                var mymap = L.map('mapid').setView([52.0799838, 4.3113461], 4);

                L.tileLayer('https://api.mapbox.com/styles/v1/{id}/tiles/{z}/{x}/{y}?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpejY4NXVycTA2emYycXBndHRqcmZ3N3gifQ.rJcFIG214AriISLbB6B5aw', {
                                maxZoom: 18,
                                attribution: 'Map data &copy; <a href="https://www.openstreetmap.org/">OpenStreetMap</a> contributors, ' +
                                                '<a href="https://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>, ' +
                                                'Imagery &copy; <a href="https://www.mapbox.com/">Mapbox</a>',
                                id: 'mapbox/streets-v11',
                                tileSize: 512,
                                zoomOffset: -1
                }).addTo(mymap); 
''')

for l in locations_coord:
    display_name = re.sub( '\'' , '' , l )
    out.write( f' L.marker([ { locations_coord[l][0] }, { locations_coord[l][1] }  ]).addTo(mymap) ')
    out.write( f" .bindPopup('{display_name}.') ")  
    out.write( ';' )
    
out.write(
'''
</script>



</body>
</html>

''')

out.close()
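
The code above deletes apostrophes from place names so that they cannot break the JavaScript string in the popup. An alternative, sketched below under the assumption that the names are plain text, is to let `json.dumps()` produce the quoted string, so that names keep their apostrophes. The helper name `marker_js` and the example values are invented for illustration, and the sketch does not guard against a name that contains `</script>`.

```python
import json


def marker_js(name, lat, lon):
    """Return one Leaflet marker statement with the popup text safely quoted."""
    # json.dumps yields a double-quoted literal that is also valid JavaScript
    return (
        f'L.marker([{float(lat)}, {float(lon)}]).addTo(mymap)'
        f'.bindPopup({json.dumps(name)});'
    )


# Illustrative values only
print(marker_js("King's Lynn", '52.75', '0.39'))
```

Converting the coordinates with `float()` also ensures that only numbers, and no stray text from the response, are written into the script.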

In [None]:
from IPython.display import IFrame
from os.path import exists

IFrame(src= 'map.html' , width=700, height=600)