## spaCy Named Entity Extraction

### Objective
This notebook documents experiments with the dLOC as Data newspaper data. The goal was to identify the names of places mentioned near hurricane mentions that had already been identified using AntConc. The document used for this experiment contains all of the collocates surrounding the word "hurricane" found in the Barbados Mercury and  from 1783-1848.
#### Created by Molly Castro
#### Date: 9/14/2020


Install spaCy and download the English language model.

In [11]:
pip install spacy && python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Users/molly/opt/anaconda3/lib/python3.7/site-packages/en_core_web_sm -->
/Users/molly/opt/anaconda3/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')
Note: you may need to restart the kernel to use updated packages.


In [12]:
import spacy
# Load the small English pipeline installed in the previous step
nlp = spacy.load("en_core_web_sm")


Load the document you plan to use (the AntConc collocation output).

In [17]:
import os.path

In [18]:
fname = os.path.expanduser('~/desktop/antconc_results_bm.txt')
with open(fname) as fp:
    data = fp.read()


Run the loaded NLP pipeline on the text.

In [19]:
doc = nlp(data)

I started with this “test” to print the first ten entities (ent) and their labels. The complete list of named entity labels in spaCy can be found here: https://spacy.io/api/annotation#named-entities

In [22]:
# Print the text and label of the first ten entities
for ent in doc.ents[:10]:
    print('{} -> {}'.format(ent.text, ent.label_))

1 -> CARDINAL
Benjamin Alleyne Cox -> PERSON
Eig -> PERSON
Ortley Capt -> ORG
London  -> GPE
each day  -> DATE
ten  -> CARDINAL
more than one  -> CARDINAL
2 -> CARDINAL
LIC  -> ORG


For this project we are interested in geopolitical entities, which spaCy labels GPE. Other labels of interest for future exploration are LOC, for non-GPE locations such as mountain ranges or bodies of water, and EVENT, which includes named hurricanes.
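
The labels mentioned above could be collected in a single pass over the entities. The following is a minimal sketch rather than part of the original experiment. It assumes the entities are supplied as (text, label) pairs, for example `(ent.text, ent.label_)` for each `ent` in `doc.ents`. The helper name `group_by_label` and the sample pairs are hypothetical and stand in for real spaCy output so the block runs without a model.

```python
from collections import defaultdict

# Labels of interest: GPE (the focus here), plus LOC and EVENT for later work.
LABELS_OF_INTEREST = {"GPE", "LOC", "EVENT"}


def group_by_label(entities, labels=LABELS_OF_INTEREST):
    """Group entity texts by label, keeping only the requested labels.

    `entities` is any iterable of (text, label) pairs, for example
    ((ent.text, ent.label_) for ent in doc.ents).
    """
    grouped = defaultdict(list)
    for text, label in entities:
        if label in labels:
            grouped[label].append(text.strip())
    return dict(grouped)


# Hypothetical sample pairs for illustration only (not output from the corpus).
sample = [
    ("Jamaica", "GPE"),
    ("Caribbean Sea", "LOC"),
    ("ten", "CARDINAL"),
    ("Cuba", "GPE"),
]
print(group_by_label(sample))
```

With the real document, the call would be `group_by_label((ent.text, ent.label_) for ent in doc.ents)`.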

To pull up just the GPE entities for the entire document, we use this:


In [23]:
# Keep only the entities whose string label is GPE
[ent for ent in doc.ents if ent.label_ == "GPE"]

[London,
 London,
 Spcights Town,
 London,
 Londo,
 Adventire,
 London,
 Cumberland,
 Benjamin,
 Meffrs,
 Melis,
 America,
 St. Michael,
 AA00047511_00035.txt,
 Jamaica,
 Cuba,
 Pert Roya,
 Iphigenia,
 Cadiz,
 Sept,
 Sty Joliph's,
 France,
 St. Michael,
 Island,
 Thu,
 Pallas,
 Jamaica,
 —By,
 Island,
 Goverument,
 Ordinary,
 Norway,
 Sweden,
 Ordinary,
 France,
 St. Bartholomew,
 Charlestown,
 South Carolina,
 San Guillermo,
 Great Britain,
 St. Lucia,
 Lat,
 Yorkshire,
 Dominica,
 Lydia,
 St. Vincent,
 Dominica,
 St. Mich. Regt,
 Antigua,
 St. Mich,
 St. Denys,
 Barb,
 Ordinary,
 Ordinary,
 Youn SrooneR.,
 Ceuta,
 S.E.,
 Island,
 England,
 Jamaica,
 Cuba,
 Dominica,
 Tr,
 Barbados,
 Concord,
 Barbados,
 Lancasler,
 Barbados,
 St. Pierre,
 St.  ,
 Island,
 the West Indies,
 Island,
 St. Lucia,
 Pifunids,
 St. Lucia,
 St. Lucia,
 Cobham,
 ——,
 Dominica,
 Lodge,
 St. Denys,
 Ordinary,
 Jamaica,
 Sugar,
 Newfoundtand,
 States,
 Ordinary,
 Oct.,
 St. Thomas,
 Ordinary,
 Island,
 Island,
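
The list above repeats several names (London, Jamaica, St. Lucia) and includes obvious noise, such as a source file name and a run of dashes. One possible next step is to tally the strings. This is a sketch under stated assumptions, not the method used in the experiment. The input is assumed to be a list of entity strings such as `[ent.text for ent in doc.ents if ent.label_ == "GPE"]`. The helper name `count_places` and its two noise rules are illustrative choices. The sample holds only a few strings copied from the output above.

```python
from collections import Counter


def count_places(names):
    """Tally place-name strings, skipping obvious non-names.

    Assumed noise rules (illustrative only): drop strings that contain
    no letters, and strings that look like file names ending in '.txt'.
    """
    counts = Counter()
    for name in names:
        cleaned = name.strip()
        if not any(ch.isalpha() for ch in cleaned):
            continue  # punctuation-only or empty strings
        if cleaned.lower().endswith(".txt"):
            continue  # a source file name picked up as an entity
        counts[cleaned] += 1
    return counts


# A few strings copied from the GPE output above.
sample = [
    "London", "London", "Jamaica", "AA00047511_00035.txt",
    "——", "St. Lucia", "Jamaica", "London",
]
for place, n in count_places(sample).most_common():
    print(f"{place}: {n}")
```

OCR variants such as "Londo" would still be counted as separate names, and non-places such as "Ordinary" or "Sept" would pass these rules, so the tally is only a starting point for manual review.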
 