# Named Entity Recognition (NER)

Named entity recognition (NER) is a branch of natural language processing that extracts names and other semantically distinct items from a larger text and tags each one according to its meaning within a system (person, place, organization, and so on). NER approaches fall into two broad categories: algorithms that use deterministic rules to find names (for example, find all tokens that match the regular expression `([A-Z][a-z]+)`) and statistical models that make guesses about where an entity begins and ends.
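
To make the rule-based category concrete, here is a minimal sketch that applies the regular expression above with Python's built-in `re` module. The sample sentence is made up for illustration and is not taken from Gibbon. Note that the rule also picks up the sentence-initial word "In", which is not a name.

```python
import re

# Rule from the text: a capital letter followed by one or more lowercase letters
CAPITALIZED = re.compile(r"([A-Z][a-z]+)")

# Illustrative sentence (an assumption, not a quotation)
sample = "In the second century, Rome ruled Britain and Egypt."

candidates = CAPITALIZED.findall(sample)
print(candidates)  # ['In', 'Rome', 'Britain', 'Egypt']
```

False positives like "In" are one reason purely rule-based approaches usually need extra filtering, and one motivation for the statistical models used in the rest of this notebook.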

In this class, we'll use `spaCy`'s small English model (`en_core_web_sm`) to explore how statistical models can recognize named entities. Then, we'll look at some of the practical considerations we must take into account when working with NER. Our workflow will look like this:
* Reacquaint ourselves with `spaCy`
* Pass a single chapter of Gibbon's *Decline and Fall* into the `spaCy` parser
* Examine the NER results
* Filter out all place names
* Use the Pleiades gazetteer to get coordinates for all valid place names
* Save data to CSV

Finally, you'll have the chance to boil down our process into a function and test it out on other chapters from Gibbon (or other texts). Keep in mind that, for next class, you must have your own version of the data we produce today. That's because, next week, Carolyn Talmadge from the DataLab will be showing us how to turn our CSVs into webmaps and applications. 

## Preparing our data

In [1]:
# redownload spaCy's small model - should see 'requirement already satisfied'
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm') # good idea to initialize here

In [3]:
# downloading gibbon text from my gh
import wget
import os
if not os.path.isfile('gibbon_text.csv'):
    wget.download('https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/gibbon_text.csv')

In [4]:
gibbon_by_chapter = pd.read_csv('gibbon_text.csv').rename(columns={'Unnamed: 0':'chapter'})
gibbon_by_chapter

Unnamed: 0,chapter,StringText
0,Chapter 1,"\nIn the second century of the Christian era, ..."
1,Chapter 2,\nIt is not alone by the rapidity or extent of...
2,Chapter 3,\nThe obvious definition of a monarchy seems t...
3,Chapter 4,"\nThe mildness of Marcus, which the rigid disc..."
4,Chapter 5,\nThe power of the sword is more sensibly felt...
...,...,...
66,Chapter 67,\nThe respective merits of Rome and Constantin...
67,Chapter 68,\nThe siege of Constantinople by the Turks att...
68,Chapter 69,\nIn the first ages of the decline and fall of...
69,Chapter 70,"\nIn the apprehension of modern times, Petrarc..."


In [5]:
first_chapter = gibbon_by_chapter['StringText'][0]

## Using `spaCy`'s off-the-shelf NER model

This model was trained on a wide variety of modern sources, none of them much like Gibbon's eighteenth-century prose, so we can't expect it to be completely accurate here. We'll revisit that problem soon. Training our own NER model would likely do better, but it takes time and requires a lot of setup. If you're interested in training your own model, reach out to me and we can work together on it.

In [6]:
# pass the first chapter into spaCy parser
first_chapter_doc = nlp(first_chapter)

In [7]:
for entity in first_chapter_doc.ents: # can access NER with the .ents attribute
    print(entity.text, entity.label_, sep='\t')

the second century	DATE
Christian	NORP
Rome	GPE
Roman	NORP
more than fourscore years	DATE
Nerva	GPE
Trajan	GPE
Hadrian	NORP
two	CARDINAL
Antonines	NORP
two	CARDINAL
Marcus Antoninus	GPE
Romans	NORP
senate	ORG
seven first centuries	DATE
Augustus	ORG
Rome	GPE
every day	DATE
Augustus	ORG
Rome	GPE
Parthians	NORP
Crassus	PRODUCT
Aethiopia	GPE
Arabia Felix	PERSON
a thousand miles	QUANTITY
Europe	LOC
Germany	GPE
first	ORDINAL
Roman	NORP
Augustus	PERSON
senate	ORG
the Atlantic Ocean	LOC
Rhine	PERSON
Danube	PERSON
Euphrates	LOC
Arabia	GPE
Africa	LOC
Augustus	ORG
first	ORDINAL
Caesars	ORG
Imperial	ORG
Roman	NORP
Roman	NORP
the first century	DATE
Christian	NORP
Britain	GPE
Caesar	ORG
Augustus	ORG
Gaul	PERSON
Britain	GPE
continental	ORG
about forty years	DATE
Roman	NORP
Britons	PERSON
Caractacus	ORG
Boadicea	GPE
Druids	ORG
Imperial	ORG
Domitian	PERSON
Agricola	ORG
Caledonians	NORP
Grampian	NORP
Roman	NORP
Britain	GPE
Agricola	ORG
Ireland	GPE
one	CARDINAL
Britons	PERSON
Agricola	ORG
Britain	GPE
two

Wales	GPE
Phoenicia	ORG
Palestine	GPE
America	GPE
Europe	LOC
Syria	GPE
Euphrates	LOC
the Red Sea	LOC
Arabs	NORP
Roman	NORP
Egypt	GPE
Africa	LOC
Asia	LOC
Egypt	GPE
Roman	NORP
Ptolemies	ORG
Mamelukes	ORG
Turkish	NORP
Nile	LOC
five hundred miles	QUANTITY
Cancer	PERSON
Mediterranean	LOC
Cyrene	PERSON
first	ORDINAL
Greek	NORP
Egypt	GPE
Barca	GPE
Cyrene	PERSON
Africa	LOC
above fifteen hundred miles	QUANTITY
Mediterranean	LOC
Sahara	LOC
an hundred miles	QUANTITY
Romans	NORP
Africa	LOC
Phoenician	NORP
Libyans	NORP
Tripoli	GPE
Algiers	LOC
Numidia	GPE
Massinissa	PERSON
Jugurtha	GPE
Augustus	ORG
Numidia	GPE
at least two-thirds	CARDINAL
Mauritania	GPE
Caesariensis	PRODUCT
Mauritania	GPE
Moors	ORG
Tingi	ORG
Tangier	ORG
Tingitana	GPE
Sallè	NORP
Ocean	LOC
Romans	NORP
Mequinez	ORG
Morocco	GPE
Segelmessa	PERSON
Roman	NORP
Africa	LOC
Mount Atlas	ORG
Roman	NORP
Africa	LOC
Spain	GPE
about twelve miles	QUANTITY
Atlantic	LOC
Mediterranean	LOC
Hercules	GPE
two	CARDINAL
European	NORP
Gibraltar	LOC
the Mediter
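
The listing above shows the same string receiving different labels (for example, `Augustus` appears as both `ORG` and `PERSON`). One way to surface such inconsistencies is to tally the labels assigned to each entity text. The sketch below uses a small hand-copied sample of `(text, label)` pairs from the output above rather than a live `spaCy` document; with a real document you would build the pairs from `first_chapter_doc.ents`. The variable names are illustrative.

```python
from collections import Counter, defaultdict

# Hand-copied sample of (entity text, label) pairs from the output above
sample_ents = [
    ("Rome", "GPE"), ("Augustus", "ORG"), ("Rome", "GPE"),
    ("Augustus", "ORG"), ("Augustus", "PERSON"), ("Rhine", "PERSON"),
    ("Euphrates", "LOC"), ("Augustus", "ORG"),
]

# Map each entity text to a tally of the labels it received
labels_by_text = defaultdict(Counter)
for text, label in sample_ents:
    labels_by_text[text][label] += 1

# Keep only the entity texts that received more than one distinct label
inconsistent = {t: dict(c) for t, c in labels_by_text.items() if len(c) > 1}
print(inconsistent)  # {'Augustus': {'ORG': 3, 'PERSON': 1}}
```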

In [8]:
# filter by place name
for entity in first_chapter_doc.ents:
    if (entity.label_ == 'GPE') or (entity.label_ == 'LOC'):
        print(entity.text)

Rome
Nerva
Trajan
Marcus Antoninus
Rome
Rome
Aethiopia
Europe
Germany
the Atlantic Ocean
Euphrates
Arabia
Africa
Britain
Britain
Boadicea
Britain
Ireland
Britain
Scotland
Antoninus Pius
Antoninus
Edinburgh
gloomy hills
Rome
Trajan
the Euxine Sea
Trajan
Philip
Tigris
Armenia
the Persian gulf
Arabia
India
Bosphorus
Colchos
Iberia
Albania
Armenia
Mesopotamia
Assyria
Jupiter
Jupiter
Armenia
Mesopotamia
Assyria
Euphrates
Trajan
Antoninus
Caledonia
the Upper Egypt
Italy
Rome
Antoninus
Marcus
Euphrates
Europe
Rome
Italy
Spain
East
oblong
Rome
Britain
Lower
the Upper Germany
Noricum
Pannonia
Maesia
Syria
Egypt
Africa
Spain
Italy
Marseilles
Mediterranean
Italy
Misenum
Naples
Liburnians
Misenum
Mediterranean
Provence
Britain
Spain
Europe
Mediterranean
the Atlantic Ocean
Lusitania
Baetica
Portugal
East
North
Grenada
Andalusia
Baetica
Spain
Asturias
Biscay
Castilles
Murcia
Valencia
Catalonia
Arragon
Tarragona
Rome
Alps
Rhine
Ocean
France
Alsace
Switzerland
Rhine
Liege
Luxemburg
Mediterranean
Langu

In [9]:
# putting all of the data into a dictionary
import collections
place_freq = collections.defaultdict(int)
for entity in first_chapter_doc.ents:
    if (entity.label_ == 'GPE') or (entity.label_ == 'LOC'):
        place_freq[entity.text] += 1 # the utility of defaultdict!
place_freq = dict(place_freq)
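
As an aside, the standard library's `collections.Counter` can produce the same kind of tally in one step. This is a sketch of an alternative, not the approach used in the cell above; the `labelled` list is a hypothetical stand-in for the `(entity.text, entity.label_)` pairs you would get from `first_chapter_doc.ents`.

```python
from collections import Counter

# Hypothetical stand-in for [(ent.text, ent.label_) for ent in first_chapter_doc.ents]
labelled = [
    ("Rome", "GPE"), ("Roman", "NORP"), ("Europe", "LOC"),
    ("Rome", "GPE"), ("Rhine", "PERSON"),
]

PLACE_LABELS = {"GPE", "LOC"}

# Count only the entities whose label marks them as a place
place_counts = Counter(text for text, label in labelled if label in PLACE_LABELS)
print(place_counts.most_common())  # [('Rome', 2), ('Europe', 1)]
```

Notice that "Rhine" is lost here because the model labelled it `PERSON`; filtering by label can only be as good as the labels themselves.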

In [10]:
# dict -> df
place_freq_df = pd.DataFrame.from_dict(place_freq, orient='index').reset_index().rename(columns={'index':'place_name',0:'frequency'})
place_freq_df

Unnamed: 0,place_name,frequency
0,Rome,12
1,Nerva,1
2,Trajan,5
3,Marcus Antoninus,1
4,Aethiopia,1
...,...,...
147,Great Britain,1
148,Sardinia,1
149,Sicily,1
150,Malta,1


In [11]:
# downloading the Pleiades data we need (also from my gh)
if not os.path.isfile('places.csv'):
    wget.download('https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/places.csv')
if not os.path.isfile('names.csv'):
    wget.download('https://raw.githubusercontent.com/pnadelofficial/FallDHCourseMaterials/main/names.csv')

In [13]:
places = pd.read_csv('places.csv')
places.columns

Index(['created', 'description', 'details', 'provenance', 'title', 'uri', 'id',
       'representative_latitude', 'representative_longitude',
       'bounding_box_wkt'],
      dtype='object')

In [15]:
# let's find Rome in places (its Pleiades title is 'Roma')
places.loc[places['title'] == 'Roma']

Unnamed: 0,created,description,details,provenance,title,uri,id,representative_latitude,representative_longitude,bounding_box_wkt
21483,2018-06-07T19:48:13Z,The capital of the Roman Republic and Empire.,<p>The Barrington Atlas Directory notes: Roma/...,Barrington Atlas: BAtlas 43 B2 Roma,Roma,https://pleiades.stoa.org/places/423025,423025,41.891775,12.486137,"POLYGON ((12.486137 41.891775, 12.486137 41.89..."


In [16]:
names = pd.read_csv('names.csv')
names.columns

Index(['created', 'description', 'details', 'provenance', 'title', 'uri', 'id',
       'place_id', 'name_type', 'language_tag', 'attested_form',
       'romanized_form_1', 'romanized_form_2', 'romanized_form_3',
       'association_certainty', 'transcription_accuracy',
       'transcription_completeness', 'year_after_which', 'year_before_which'],
      dtype='object')

In [18]:
# let's find 'Rome' in names
names.loc[names['romanized_form_1'] == 'Rome'].place_id

20810    423025
Name: place_id, dtype: int64

In [19]:
def get_pleiades_id(term):
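    """
    Looks for an exact match for `term` in each name column of names.csv
    (attested_form, then romanized_form_1 through romanized_form_3).
    Returns the place_id from the first column with exactly one matching row;
    returns None if no column has exactly one match.
    """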
    name_row = names.loc[names['attested_form'] == term]
    if len(name_row) == 1:
        return int(name_row.place_id.iloc[0])
    else:
        name_row = names.loc[names['romanized_form_1'] == term]
        if len(name_row) == 1:
            return int(name_row.place_id.iloc[0])
        else:
            name_row = names.loc[names['romanized_form_2'] == term]
            if len(name_row) == 1:
                return int(name_row.place_id.iloc[0])
            else:
                name_row = names.loc[names['romanized_form_3'] == term]
                if len(name_row) == 1:
                    return int(name_row.place_id.iloc[0])
                else:
                    return None

The above function looks very complicated, but all it's doing is checking the results of several `loc` and `iloc` calls on our `DataFrame`. You will rarely see `for` loops over rows when using `pandas`. Instead, you will see programmers taking advantage of its efficient column-wise operations: a comparison like `names['attested_form'] == term` is applied to the whole column at once, and `loc` then selects the matching rows. This style is sometimes called 'broadcasting' (you will also see it called vectorization). Check out this [Medium post](https://medium.com/@michaeleby1/broadcasting-versus-iteration-6c06539dc1d5) for a further discussion of the benefits of broadcasting over writing loops.
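
To see the difference in miniature, the sketch below runs the same lookup twice: once with an explicit loop over rows and once with a column-wise comparison and `loc`. The `toy_names` table is a made-up stand-in for `names.csv` that reuses a few IDs shown in this notebook's output; the real file has many more rows and columns.

```python
import pandas as pd

# Toy stand-in for names.csv (illustrative rows only)
toy_names = pd.DataFrame({
    "romanized_form_1": ["Rome", "Athens", "Sparta"],
    "place_id": [423025, 579885, 570685],
})

term = "Athens"

# Loop version: inspect one row at a time
loop_ids = [row.place_id for row in toy_names.itertuples()
            if row.romanized_form_1 == term]

# Column-wise version: compare the whole column to the term at once,
# then use the resulting boolean mask to select rows
mask = toy_names["romanized_form_1"] == term
vectorized_ids = toy_names.loc[mask, "place_id"].tolist()

print(loop_ids, vectorized_ids)
```

Both versions find the same ID. On a table this small the difference is invisible, but on a large table the column-wise form is generally much faster and shorter to write.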

In [20]:
place_freq_df['pleiades_id'] = place_freq_df['place_name'].apply(get_pleiades_id)
place_freq_df

Unnamed: 0,place_name,frequency,pleiades_id
0,Rome,12,423025.0
1,Nerva,1,
2,Trajan,5,
3,Marcus Antoninus,1,
4,Aethiopia,1,
...,...,...,...
147,Great Britain,1,
148,Sardinia,1,472014.0
149,Sicily,1,462492.0
150,Malta,1,462311.0


Why are there so many `NaN` values? How do we deal with them? How do we want to deal with them?

In [21]:
place_freq_final = place_freq_df.dropna().reset_index(drop=True)
place_freq_final

Unnamed: 0,place_name,frequency,pleiades_id
0,Rome,12,423025.0
1,Euphrates,6,912849.0
2,Arabia,2,981506.0
3,Africa,7,775.0
4,Ireland,1,20487.0
5,India,1,50004.0
6,Iberia,1,863807.0
7,Caledonia,1,89132.0
8,Misenum,2,432941.0
9,Naples,3,433014.0


Now that we have the Pleiades IDs, we can finally get the coordinates from `places.csv`. Let's look back at the Rome example.

In [24]:
rid = place_freq_final.pleiades_id.iloc[0]
places.loc[places['id'] == rid].representative_latitude.iloc[0]

41.891775

In [25]:
# could've written just one function here, but it's not too much trouble to write two
def get_lat(pl_id):
    places_row = places.loc[places['id'] == pl_id]
    if len(places_row) == 1:
        return places_row.representative_latitude.iloc[0]
    
def get_long(pl_id):
    places_row = places.loc[places['id'] == pl_id]
    if len(places_row) == 1:
        return places_row.representative_longitude.iloc[0]
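
As an alternative to two lookup functions, a single `merge` can attach both coordinates at once. This is a sketch under assumptions, not the method used in this notebook: `toy_freq` and `toy_places` are small stand-ins built from values shown in the outputs here, and the column renaming is just one possible choice.

```python
import pandas as pd

# Toy stand-ins for place_freq_final and places.csv
toy_freq = pd.DataFrame({
    "place_name": ["Rome", "Naples"],
    "frequency": [12, 3],
    "pleiades_id": [423025, 433014],
})
toy_places = pd.DataFrame({
    "id": [423025, 433014, 432941],
    "representative_latitude": [41.891775, 40.839995, 40.786279],
    "representative_longitude": [12.486137, 14.25287, 14.084884],
})

# Rename the gazetteer columns so the join key and the output names line up
coords = toy_places.rename(columns={
    "id": "pleiades_id",
    "representative_latitude": "lat",
    "representative_longitude": "long",
})

# A left merge keeps every row of toy_freq and adds lat/long where the ID matches
merged = toy_freq.merge(coords, on="pleiades_id", how="left")
print(merged)
```

One difference to keep in mind: `get_lat` and `get_long` return nothing unless exactly one row matches, whereas a merge would produce one output row per match if an ID appeared more than once in the gazetteer.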

In [26]:
place_freq_final['lat'] = place_freq_final['pleiades_id'].apply(get_lat)
place_freq_final['long'] = place_freq_final['pleiades_id'].apply(get_long)

In [27]:
place_freq_final

Unnamed: 0,place_name,frequency,pleiades_id,lat,long
0,Rome,12,423025.0,41.891775,12.486137
1,Euphrates,6,912849.0,35.54331,39.606018
2,Arabia,2,981506.0,27.5,32.5
3,Africa,7,775.0,32.5,7.5
4,Ireland,1,20487.0,53.184028,-7.717526
5,India,1,50004.0,22.5,77.5
6,Iberia,1,863807.0,41.836468,44.689138
7,Caledonia,1,89132.0,57.5,-4.5
8,Misenum,2,432941.0,40.786279,14.084884
9,Naples,3,433014.0,40.839995,14.25287


In [28]:
place_freq_final.to_csv('ch1gibbon_places.csv')

## In-class Activity

Now pair up and try to turn all of the above steps into a single function, so that all you need to do is pass in the text of a chapter and you get back a DataFrame with place names and coordinates. To walk through the steps again:

* Consider what all of your inputs are
* Use `spaCy` syntax to parse input text
* Filter and count by place name and label
* Use `pandas` to manipulate the data into a useful form

Try out some different chapters. Do you notice anything about the places? Do the types of places or the accuracy of Pleiades get better or worse over the chapters? I encourage you to Google or look up any entities you don't know on Wikipedia.

In [41]:
import collections

def find_name_and_coords(gibbon_chapter):
    """
    Takes the text of one chapter and returns a DataFrame of place names,
    frequencies, Pleiades IDs and coordinates.
    Relies on nlp, names, places, get_pleiades_id, get_lat and get_long
    from the cells above.
    """
    chapter_doc = nlp(gibbon_chapter)

    # count every entity that spaCy labels as a place (GPE or LOC)
    place_freq = collections.defaultdict(int)
    for entity in chapter_doc.ents:
        if (entity.label_ == 'GPE') or (entity.label_ == 'LOC'):
            place_freq[entity.text] += 1
    place_freq_df = pd.DataFrame.from_dict(dict(place_freq), orient='index').reset_index().rename(columns={'index':'place_name',0:'frequency'})

    # look up Pleiades IDs, then drop the names with no unique match
    place_freq_df['pleiades_id'] = place_freq_df['place_name'].apply(get_pleiades_id)
    place_freq_final = place_freq_df.dropna().reset_index(drop=True)

    # attach coordinates from places.csv
    place_freq_final['lat'] = place_freq_final['pleiades_id'].apply(get_lat)
    place_freq_final['long'] = place_freq_final['pleiades_id'].apply(get_long)

    print(place_freq_final)
    place_freq_final.to_csv('gibbon_places.csv')
    return place_freq_final

In [40]:
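# the function needs the chapter text as its argument; here, the second chapter
chapter_places = find_name_and_coords(gibbon_by_chapter['StringText'][1])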

   place_name  frequency  pleiades_id        lat       long
0        Rome         32     423025.0  41.891775  12.486137
1        Nile          4     727172.0  19.211409  30.567330
2      Athens         11     579885.0  37.972634  23.722746
3      Sparta          1     570685.0  37.077905  22.427298
4       Padua          2     393473.0  45.409561  11.876975
5     Arpinum          1     432700.0  41.648422  13.609876
6     Etruria          1     413122.0  42.758127  11.546721
7      Latium          2     432900.0  41.590543  13.192265
8      Alesia          1     177434.0  47.536622   4.503884
9      Africa          4        775.0  32.500000   7.500000
10  Euphrates          1     912849.0  35.543310  39.606018
11      Capua          1     432754.0  41.086092  14.250207
12     Verona          2     383816.0  45.442130  10.995736
13      Troas          1     550944.0  39.837052  26.336944
14      Milan          1     383706.0  45.463746   9.188060
15     London          1      79574.0  5