# Named Entity Recognition for GeoPolitical Entities

## About

This notebook is a simple test.  It performs the following: 

* Loads data from an S3 (Minio) source populated by [Gleaner](https://gleaner.io) into a Pandas data frame mediated by Dask
* Loads SWEET labels into an additional Pandas data frame
* Leverages SpaCy to extract entities from the data graph descriptions.  
* Compares the extracted entities (from descriptions) and the keywords from the data graph to the SWEET labels to look for matches.  

This pattern could be used with other elements of the data graph or extracted entities and other vocabularies for linking. 

Additionally, it provides an example of comparing generated knowledge graphs from various sources with each other to resolve relations.  

## Domain Entities

What we really need is a domain entity linker like the one described at
https://github.com/allenai/scispacy#entitylinker in SciSpaCY.  However, we are not going to use SciSpaCY here, as it is aimed at biomedical text.  

For now we will stick to SpaCY in this example.  Later it would be good to train an entity extractor for the geosciences, for SpaCY or other NLP packages. 

## Checking WikiData

We could do something like what we do with SWEET, but for WikiData.  However, a search at https://query.wikidata.org/ like:

```
SELECT * 
WHERE { 
  SERVICE wikibase:mwapi 
          { bd:serviceParam wikibase:api "EntitySearch" . 
           bd:serviceParam wikibase:endpoint "www.wikidata.org" . 
           bd:serviceParam mwapi:search "cascade" . 
           bd:serviceParam mwapi:language "en" . 
           ?item wikibase:apiOutputItem mwapi:item . 
           ?num wikibase:apiOrdinal true . 
          } 
}
```

returns a large number of results.   Reference details about this call are at https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI. 

Knowing whether any of these is in fact a true match is difficult.  Even presenting them to a human for assessment is non-trivial.  
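
To run the same search for terms other than "cascade", the query above can be generated from a template. The following is a minimal sketch: the function name `build_entity_search_query` is hypothetical, and it only builds the query text. Sending it to a SPARQL endpoint (for example with the `get_sparql_dataframe` helper defined later in this notebook) is left out.

```python
def build_entity_search_query(term: str, language: str = "en") -> str:
    """Return the MWAPI EntitySearch query text for one search term."""
    # Escape backslashes and double quotes so the term stays inside
    # the SPARQL string literal.
    safe_term = term.replace("\\", "\\\\").replace('"', '\\"')
    return f"""SELECT *
WHERE {{
  SERVICE wikibase:mwapi {{
    bd:serviceParam wikibase:api "EntitySearch" .
    bd:serviceParam wikibase:endpoint "www.wikidata.org" .
    bd:serviceParam mwapi:search "{safe_term}" .
    bd:serviceParam mwapi:language "{language}" .
    ?item wikibase:apiOutputItem mwapi:item .
    ?num wikibase:apiOrdinal true .
  }}
}}"""


query = build_entity_search_query("cascade")
print(query)
```

Generating the query does not solve the matching problem described above. It only makes it easier to try each extracted entity in turn.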

## Reporting

How to present these results back to data providers in an effective and usable manner is an important open issue.  I am not sure what the best approach is at this time, and it might vary from community to community. 

## UN References
* https://unstats.un.org/sdgapi/swagger/#!/GeoArea/V1SdgGeoAreaListGet
* https://unstats.un.org/sdgapi/swagger/
* https://en.wikipedia.org/wiki/United_Nations_geoscheme
* https://github.com/iodepo/odis-arch/tree/schema-dev-jm/data/un-countries

```
curl -X GET --header 'Accept: application/json' 'https://unstats.un.org/sdgapi/v1/sdg/GeoArea/List'
```
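
The curl call above can also be made from Python with the standard library. This is a sketch under one assumption: that the service returns a JSON array of objects with `geoAreaCode` and `geoAreaName` fields. The helper names `build_geo_area_request` and `parse_geo_areas` are hypothetical, and the sample payload is hand-written so the block runs without network access.

```python
import json
import urllib.request

GEO_AREA_URL = "https://unstats.un.org/sdgapi/v1/sdg/GeoArea/List"


def build_geo_area_request(url: str = GEO_AREA_URL) -> urllib.request.Request:
    """Build the same GET request as the curl command, asking for JSON."""
    return urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )


def parse_geo_areas(payload: str) -> dict:
    """Map geoAreaCode -> geoAreaName (field names are an assumption)."""
    return {
        str(item["geoAreaCode"]): item["geoAreaName"]
        for item in json.loads(payload)
    }


# Hand-written sample in the assumed response shape (not real service output).
sample_payload = (
    '[{"geoAreaCode": "4", "geoAreaName": "Afghanistan"},'
    ' {"geoAreaCode": "8", "geoAreaName": "Albania"}]'
)
geo_areas = parse_geo_areas(sample_payload)
print(geo_areas)

# To call the live service (network access needed):
# with urllib.request.urlopen(build_geo_area_request()) as response:
#     geo_areas = parse_geo_areas(response.read().decode("utf-8"))
```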



### Core imports

In [1]:
# Import packages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  ## remove pandas future warning

import scispacy
import spacy
#import en_core_sci_sm
from spacy import displacy
import getpass
import pandas as pd
import urllib.request, json
import dask, boto3
import s3fs
import dask.dataframe as dd

## Model Installs

If you know you have these installed already you can skip the install in the next two cells.  It can take a while to download and install the models

In [14]:
# Install a SciSpaCY model
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

In [15]:
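# Install the large SciSpaCY model.  The SpaCY Web Large (en_core_web_lg) install is commented out below;
# later cells import en_core_web_lg, so uncomment that line if the model is not already installed.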
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz
# !pip install -q https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1.tar.gz


### Helping function(s)
The following block is a helper that turns SPARQL results into a Pandas data frame.  You may need to run it to load the function, per standard notebook practice.

In [2]:
#@title
from SPARQLWrapper import SPARQLWrapper, JSON  # needed by the helper below

def get_sparql_dataframe(service, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)
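
The core of this helper is the step that flattens SPARQL JSON bindings into rows. The sketch below reproduces only that step on a hand-made response in the standard SPARQL JSON results layout, so it runs without a SPARQL endpoint or Pandas. The function name `bindings_to_rows` and the sample data are hypothetical.

```python
import json

# Hand-made sample in the SPARQL JSON results layout (not real query output).
sample_response = json.loads("""
{
  "head": {"vars": ["item", "label"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://example.org/a"},
     "label": {"type": "literal", "value": "Alpha"}},
    {"item": {"type": "uri", "value": "http://example.org/b"}}
  ]}
}
""")


def bindings_to_rows(results: dict) -> tuple:
    """Return (column names, rows); unbound variables become None."""
    cols = results["head"]["vars"]
    rows = [
        [row.get(c, {}).get("value") for c in cols]
        for row in results["results"]["bindings"]
    ]
    return cols, rows


cols, rows = bindings_to_rows(sample_response)
print(cols)
print(rows)
```

The second binding has no `label`, which is why the helper uses `row.get(c, {})` rather than `row[c]`: a variable that is unbound in a given result is simply absent from that binding.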

### Pandas config

In [3]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None means no limit (the older -1 is deprecated)

## Gleaner Data

First let's load up some of the data Gleaner has collected.  These are just simple data graph objects and not any graphs or other processed products from Gleaner.

In [4]:
# ## Anonymous S3 File system
# oss = s3fs.S3FileSystem(
#     anon=True,
#     client_kwargs = {"endpoint_url":"https://oss.geodex.org"}
# )

# Access controlled s3
# session = boto3.Session(profile_name='default' ,   region_name="us-east-1")
# s3 = session.client('s3')  # needed later for listing objects
# s3r = session.resource('s3')
# oss = s3fs.S3FileSystem( profile="default")

## Manual code access
ACCESS_CODE = getpass.getpass()
SECRET_CODE = getpass.getpass()

oss = s3fs.S3FileSystem(
    anon=False,
    key=ACCESS_CODE,
    secret=SECRET_CODE,
    client_kwargs = {"endpoint_url":"http://192.168.202.114:49155"}
)

In [6]:
import json

@dask.delayed()
def read_a_file(fn):
    # or preferably open in text mode and json.load from the file
    with oss.open(fn, 'rb') as f:
        #return json.loads(f.read().replace('\n',' '))
        return json.loads(f.read().decode("utf-8", "ignore").replace('\n',' '))
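
The comment in `read_a_file` suggests opening the file in text mode and using `json.load` instead. A minimal sketch of that variant follows, using an in-memory `io.BytesIO` in place of the S3 file object. The function name `load_json_document` is hypothetical, and whether `io.TextIOWrapper` behaves the same on the object returned by `oss.open` is an assumption that should be checked.

```python
import io
import json


def load_json_document(binary_file):
    """Decode a binary file object as UTF-8 text and parse it as JSON."""
    # errors="ignore" drops undecodable bytes, like the decode() call above.
    text_file = io.TextIOWrapper(binary_file, encoding="utf-8", errors="ignore")
    # strict=False lets the parser accept raw newlines inside JSON strings.
    return json.load(text_file, strict=False)


# Sample bytes with one invalid UTF-8 byte and a raw newline inside a string.
sample = io.BytesIO(b'{"name": "Caf\xff", "description": "line one\nline two"}')
doc = load_json_document(sample)
print(doc)
```

One difference from the original: `strict=False` keeps the newline in the parsed string, whereas `replace('\n',' ')` turns it into a space.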


In [9]:
filenames = oss.ls('gleaner.oih/summoned/edmo')
output = [read_a_file(f) for f in filenames]

# Collect rows in a list and build the data frame once at the end;
# DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
rows = []

for delayed_doc in output:
# for delayed_doc in output[:10]:  # quick test on the first 10 documents
    try:
        jld = delayed_doc.compute()
    except Exception:
        print("Doc has bad encoding")
        continue  # skip this document instead of reusing the previous one

    # TODO  Really need to flatten and or frame this, framing would also allow a default value.
    # For now .get() supplies the default for entries missing from a data graph.
    rows.append({
        'name': jld["name"],
        'url': jld["url"],
        'keywords': jld.get("keywords", ""),
        'description': jld.get("description", ""),
    })

gldf = pd.DataFrame(rows, columns=['name', 'url', 'keywords', 'description'])
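
The TODO in the loop above asks for a way to supply default values for entries missing from a data graph. Short of full JSON-LD framing, a small extraction function can do this. The sketch below is one possible approach: the function name `extract_fields` is hypothetical, the idea that `keywords` may arrive as a list is an assumption, and the second sample document is invented for illustration.

```python
def extract_fields(jld: dict) -> dict:
    """Pull the four columns used in this notebook out of one document."""
    keywords = jld.get("keywords", "")
    # Assumption: keywords may be a list; join it so the column holds a string.
    if isinstance(keywords, list):
        keywords = ", ".join(str(k) for k in keywords)
    return {
        "name": jld["name"],  # required: a KeyError flags a malformed document
        "url": jld["url"],
        "keywords": keywords,
        "description": jld.get("description", ""),
    }


sample_docs = [
    {
        "name": "Estonian Marine Institute",
        "url": "https://edmo.seadatanet.org/report/714",
        "description": "Estonian Marine Institute is one of many ...",
    },
    {
        # Invented document, only to show the list handling.
        "name": "Example Organisation",
        "url": "https://example.org/report/1",
        "keywords": ["marine", "geology"],
    },
]
rows = [extract_fields(d) for d in sample_docs]
print(rows)
```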


In [10]:
gldf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4616 entries, 0 to 4615
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         4616 non-null   object
 1   url          4616 non-null   object
 2   keywords     4616 non-null   object
 3   description  4616 non-null   object
dtypes: object(4)
memory usage: 144.4+ KB


### Quick Check

Let's grab one of our descriptions, feed it through spacy, and check our entities.  The data frame ID 122 was a good one to try (the cell below uses ID 1); feel free to experiment.

In [18]:
text = gldf.at[1,'description']
print(text)

Estonian Marine Institute is one of many Tartu University&rsquo;s contemporary scientific institutions. The main purpose of Estonian Marine Institute is marine research and promotion of the given sphere in Estonia and the Baltic region. This institute is one of the biggest organizations in Estonia carrying out marine exploration contributing research in several marine study fields. Our research ranges from water physics to biology, from microscopic scale to ecosystems having unique expert opinion and qualification in most research fields. Estonian Marine Institute is an educational basis for the marine biology-oriented postgraduate students and is actively improving the advanced and extensive higher marine education in Estonia. The active progress of marine sciences in Estonia in the last decades has enhanced the awarness of Estonian science around the Baltic Sea countries and also in Europe. Estonian Marine Institute has very international nature and thus the research is conducted in 

In [19]:
kwtest = gldf.at[1,'keywords']
print(kwtest)




In [20]:
from spacy import displacy
import spacy
import en_core_web_lg  # interesting need for import here...

nlp = en_core_web_lg.load()
# nlp = spacy.load("en_core_web_trf")
# doc2 = nlp(''.join(kwtest)) # nlp(text)
doc2 = nlp(text) # nlp(text)

displacy_image = displacy.render(doc2, jupyter = True, style = 'ent')

### Description Entities

Now let's loop over all of the data graphs and pull the descriptions.  We will then use spacy to identify entities and hold these for later use.

* NORP == Nationalities or religious or political groups
* GPE == Countries, cities, states (the output of `spacy.explain("GPE")`)

In [21]:
# you can ask spacy to explain a label with
spacy.explain("NORP")

'Nationalities or religious or political groups'

In [22]:
from spacy import displacy
import spacy
import en_core_web_lg  # interesting need for import here...

nlp = en_core_web_lg.load()

# Collect one row per entity and build the data frame once at the end;
# DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
entity_rows = []

for i in range(len(gldf)):
  doc3 = nlp( gldf.at[i,'description'])
  for entity in doc3.ents:
    entity_rows.append({'name': gldf.at[i,'name'], 'label': entity.label_, 'text':entity.text, 'url': gldf.at[i,'url']})

df2 = pd.DataFrame(entity_rows, columns=['name', 'label', 'text', 'url'])

In [23]:
df2.head(10)

|   | name | label | text | url |
|---|------|-------|------|-----|
| 0 | Oceaneering International, Inc. | ORG | Oceaneering | https://edmo.seadatanet.org/report/3600 |
| 1 | Estonian Marine Institute | ORG | Estonian Marine Institute | https://edmo.seadatanet.org/report/714 |
| 2 | Estonian Marine Institute | CARDINAL | one | https://edmo.seadatanet.org/report/714 |
| 3 | Estonian Marine Institute | GPE | Tartu | https://edmo.seadatanet.org/report/714 |
| 4 | Estonian Marine Institute | ORG | Estonian Marine Institute | https://edmo.seadatanet.org/report/714 |
| 5 | Estonian Marine Institute | GPE | Estonia | https://edmo.seadatanet.org/report/714 |
| 6 | Estonian Marine Institute | NORP | Baltic | https://edmo.seadatanet.org/report/714 |
| 7 | Estonian Marine Institute | GPE | Estonia | https://edmo.seadatanet.org/report/714 |
| 8 | Estonian Marine Institute | ORG | Estonian Marine Institute | https://edmo.seadatanet.org/report/714 |
| 9 | Estonian Marine Institute | GPE | Estonia | https://edmo.seadatanet.org/report/714 |


In [27]:
df_gpe = df2[df2['label']=="GPE"].drop_duplicates( keep='first')

In [29]:
df_gpe.head(20)

|    | name | label | text | url |
|----|------|-------|------|-----|
| 3  | Estonian Marine Institute | GPE | Tartu | https://edmo.seadatanet.org/report/714 |
| 5  | Estonian Marine Institute | GPE | Estonia | https://edmo.seadatanet.org/report/714 |
| 27 | Oxford Archaeology (South) | GPE | Oxford | https://edmo.seadatanet.org/report/5102 |
| 30 | Oxford Archaeology (South) | GPE | England | https://edmo.seadatanet.org/report/5102 |
| 35 | Appalachian State University | GPE | Boone | https://edmo.seadatanet.org/report/3516 |
| 36 | Appalachian State University | GPE | N.C. | https://edmo.seadatanet.org/report/3516 |
| 47 | PowerGen Plc | GPE | UK | https://edmo.seadatanet.org/report/75 |
| 51 | PowerGen Plc | GPE | Haslemere | https://edmo.seadatanet.org/report/75 |
| 52 | PowerGen Plc | GPE | Surrey | https://edmo.seadatanet.org/report/75 |
| 54 | PowerGen Plc | GPE | United Kingdom | https://edmo.seadatanet.org/report/75 |


In [30]:
gpeunique = df_gpe['text'].unique()
print(gpeunique)

['Tartu' 'Estonia' 'Oxford' ... 'San Mateo CA' 'M&uuml;lheim' 'Nashville']


## Notes

At this point we could compare this list against a list of known countries.  Those could even have WKT or grid cell data with them to allow for better search or display.

The NER model is going to find a LOT of false positive GPEs at this stage and how much we can improve that is not known.  However, matching against a known list of countries would likely work and allow us to feed some level of improved location/spatial data back into the graph.
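
The matching idea described above can be sketched with a simple normalized lookup. This is not a tested part of the workflow: the function names are hypothetical, the country records are a small illustrative sample in an assumed `geoAreaCode` / `geoAreaName` shape, and the entry `'albania '` is invented to show the normalization. Because some extracted strings contain HTML entities (for example `M&uuml;lheim` in the output above), the normalization also decodes those.

```python
import html


def normalize_place(name: str) -> str:
    """Decode HTML entities, trim whitespace and case-fold for comparison."""
    return html.unescape(name).strip().casefold()


def match_countries(gpe_texts, countries):
    """Return {gpe_text: geoAreaCode} for GPE strings equal to a country name."""
    lookup = {
        normalize_place(c["geoAreaName"]): c["geoAreaCode"] for c in countries
    }
    matches = {}
    for text in gpe_texts:
        code = lookup.get(normalize_place(text))
        if code is not None:
            matches[text] = code
    return matches


# Illustrative sample records (assumed shape), not the full country list.
countries = [
    {"geoAreaCode": 4, "geoAreaName": "Afghanistan"},
    {"geoAreaCode": 8, "geoAreaName": "Albania"},
    {"geoAreaCode": 233, "geoAreaName": "Estonia"},
]
gpe_texts = ["Tartu", "Estonia", "Oxford", "albania "]
matches = match_countries(gpe_texts, countries)
print(matches)
```

Exact matching like this discards cities such as Tartu and Oxford, which removes many false positives but also any sub-national detail. Abbreviations such as "UK" or "N.C." would need an alias table or a fuzzier comparison.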


In [32]:
with urllib.request.urlopen("https://raw.githubusercontent.com/iodepo/odis-arch/schema-dev-jm/data/un-countries/un-countries-with-regions-391.json") as url:
    data = json.loads(url.read().decode())
    undf = pd.DataFrame(data)

In [33]:
undf.head(10)

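|   | geoAreaCode | geoAreaName |
|---|-------------|-------------|
| 0 | 4 | Afghanistan |
| 1 | 248 | Åland Islands |
| 2 | 8 | Albania |
| 3 | 12 | Algeria |
| 4 | 16 | American Samoa |
| 5 | 20 | Andorra |
| 6 | 24 | Angola |
| 7 | 660 | Anguilla |
| 8 | 10 | Antarctica |
| 9 | 28 | Antigua and Barbuda |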
