# Named Entity Recognition for GeoPolitical Entities

## About

This notebook is a simple test.  It performs the following: 

* Loads data from an S3 (Minio) source populated by [Gleaner](https://gleaner.io) into a Pandas data frame mediated by Dask
* Loads SWEET labels into an additional Pandas data frame
* Uses spaCy to extract entities from the data graph descriptions.
* Compares the extracted entities (from descriptions) and the keywords from the data graph to the SWEET labels to look for matches.

This pattern could be used with other elements of the data graph or extracted entities and other vocabularies for linking. 

Additionally, it provides an example of comparing knowledge graphs generated from various sources with each other to resolve relations.

## Domain Entities

What we really need is a domain entity linker like the one described at:
https://github.com/allenai/scispacy#entitylinker in scispaCy.  However, we are not going to use scispaCy for extraction here, as it is aimed at biomedical text.

For now we will stick to spaCy in this example.  Later it would be good to train an entity extractor for the geosciences for spaCy or other NLP packages.

## Checking WikiData

We could do something similar to the SWEET comparison, but against Wikidata.  However, a search at https://query.wikidata.org/ like:

```
SELECT * 
WHERE { 
  SERVICE wikibase:mwapi 
          { bd:serviceParam wikibase:api "EntitySearch" . 
           bd:serviceParam wikibase:endpoint "www.wikidata.org" . 
           bd:serviceParam mwapi:search "cascade" . 
           bd:serviceParam mwapi:language "en" . 
           ?item wikibase:apiOutputItem mwapi:item . 
           ?num wikibase:apiOrdinal true . 
          } 
}
```

returns a large number of results.  Details about this call are at https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI.

Determining whether any of these is in fact a true match is difficult.  Even presenting them to a human for assessment is non-trivial.
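
One way to make the candidates easier to review is to pull the label and description of each hit and show them side by side. The sketch below assumes the MediaWiki `wbsearchentities` action on `www.wikidata.org`, which is a different route from the MWAPI SPARQL service used above. The helper names `build_search_url`, `summarize_hits` and `search_wikidata`, the User-Agent string, and the sample payload are all hypothetical illustrations, not part of this notebook.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def build_search_url(term, language="en", limit=10):
    """Build a wbsearchentities request URL for a search term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)

def summarize_hits(payload):
    """Reduce a wbsearchentities response to (id, label, description) tuples."""
    return [
        (hit.get("id"), hit.get("label"), hit.get("description"))
        for hit in payload.get("search", [])
    ]

def search_wikidata(term, language="en", limit=10):
    """Query the live API (needs network access) and summarize the hits."""
    # Wikimedia asks API clients to identify themselves; this value is a placeholder.
    req = Request(build_search_url(term, language, limit),
                  headers={"User-Agent": "ner-notebook-example/0.1"})
    with urlopen(req) as resp:
        return summarize_hits(json.load(resp))

# Hand-written sample shaped like an API response (not real results).
sample = {"search": [
    {"id": "Q0000001", "label": "cascade", "description": "example description"},
    {"id": "Q0000002", "label": "Cascade"},
]}
for row in summarize_hits(sample):
    print(row)
```

Showing the description next to each label gives a reviewer something to judge, although it does not remove the need for that judgement.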

## Reporting

How to present these results back to data providers in an effective and usable manner is an important open issue.  I am not sure what the best approach is at this time, though it might vary from community to community.




## Installs


### Pip installs

In [None]:
# %%capture
# !pip install -q spacy
# !pip install -q scispacy
# !pip install -q mimesis
# !pip install -q minio 
# !pip install -q SPARQLWrapper
# !pip install -q boto3
# !pip install -q rdflib
# !pip install -q rdflib-jsonld
# !pip install -q PyLD==2.0.2
# !pip install -q qwikidata
# !pip install 'fsspec>=0.3.3'
# !pip install s3fs


In [None]:
# Install a SciSpaCY model
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz



In [None]:
# Install SpaCY Web Large
!pip install -q https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz



### Core imports

In [4]:
# Import packages
import scispacy
import spacy
#import en_core_sci_sm
from spacy import displacy
import pandas as pd
import dask, boto3
import dask.dataframe as dd



### Helper function(s)
The following block defines a SPARQL-to-Pandas helper.  Run the cell to load the function before it is used later in the notebook.

In [5]:
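#@title
# The helper depends on these names, so import them in the same cell
import json
from SPARQLWrapper import SPARQLWrapper, JSON

def get_sparql_dataframe(service, query):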
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)
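
To make the shape of the data concrete, the sketch below applies the same row-building logic to a hand-written result in the SPARQL 1.1 JSON results format. The function name `bindings_to_rows` and the sample values are hypothetical. The point it illustrates is that a variable missing from a binding becomes `None` rather than raising a `KeyError`.

```python
def bindings_to_rows(results):
    """Flatten a SPARQL JSON result into (columns, rows)."""
    cols = results["head"]["vars"]
    rows = [
        # .get(c, {}) tolerates variables that are unbound in this row
        [binding.get(c, {}).get("value") for c in cols]
        for binding in results["results"]["bindings"]
    ]
    return cols, rows

# Hand-written sample; the second binding has no value for ?text.
sample = {
    "head": {"vars": ["sub", "text"]},
    "results": {"bindings": [
        {"sub": {"type": "uri", "value": "http://example.org/a"},
         "text": {"type": "literal", "value": "age of"}},
        {"sub": {"type": "uri", "value": "http://example.org/b"}},
    ]},
}

cols, rows = bindings_to_rows(sample)
print(cols)
print(rows)
```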

### Pandas config

In [6]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated


## Gleaner Data

First, let's load some of the data Gleaner has collected.  These are just simple data graph objects, not graphs or other processed products from Gleaner.

In [7]:
# Set up our S3FileSystem object
import s3fs 

# anon=True requests anonymous access, so no key or secret is passed
oss = s3fs.S3FileSystem(
      anon=True,
      client_kwargs = {"endpoint_url":"https://oss.geodex.org"}
   )

In [None]:
# A simple example of grabbing one item...  

import json 

with oss.open('gleaner/summoned/opentopo/231f7fa996be8bd5c28b64ed42907b65cca5ee30.jsonld', 'rb') as f:
    jld = f.read().decode("utf-8", "ignore").replace('\n',' ')

# Use a name other than `json` so the module is not shadowed
item = json.loads(jld)

text = item['name']
print(text)

High Resolution Topography near Santa Cruz, CA 2017


In [None]:
import json

@dask.delayed
def read_a_file(fn):
    # or preferably open in text mode and json.load from the file
    with oss.open(fn, 'rb') as f:
        #return json.loads(f.read().replace('\n',' '))
        return json.loads(f.read().decode("utf-8", "ignore").replace('\n',' '))

filenames = oss.ls('gleaner/summoned/opentopo')
output = [read_a_file(f) for f in filenames]

In [None]:
rows = []

for doc in range(len(output)):
#for doc in range(10):
  try:
    jld = output[doc].compute()
  except Exception:
    print("Doc has bad encoding")
    continue  # skip this doc rather than reuse the previous one

  # TODO  Really need to flatten and or frame this

  rows.append({'name': jld["name"], 'url': jld["url"],
               'keywords': jld["keywords"], 'description': jld["description"]})

# DataFrame.append was removed in pandas 2.0, so build the frame once from a list
gldf = pd.DataFrame(rows, columns=['name', 'url', "keywords", "description"])
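
The TODO above notes that the records really should be flattened or framed. Short of that, a tolerant extractor can at least avoid a `KeyError` when a field is missing. The sketch below is only an illustration: `extract_fields` is a hypothetical helper, and the handling of `@graph` (taking the first node as the dataset) is an assumption about the data that this notebook does not confirm.

```python
FIELDS = ("name", "url", "keywords", "description")

def extract_fields(doc):
    """Pull the four fields used in this notebook from one JSON-LD document."""
    graph = doc.get("@graph")
    # Assumption: when @graph is present, its first node describes the dataset.
    node = graph[0] if isinstance(graph, list) and graph else doc
    # Missing fields come back as None instead of raising KeyError.
    return {field: node.get(field) for field in FIELDS}

# Hand-written examples, not real Gleaner records.
flat = {"name": "Example survey", "url": "http://example.org/1",
        "keywords": "lidar, topography", "description": "An example."}
wrapped = {"@graph": [{"name": "Wrapped survey"}]}

print(extract_fields(flat))
print(extract_fields(wrapped))
```

Rows with `None` values can then be dropped or reported explicitly rather than stopping the loop.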


In [None]:
gldf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 899 entries, 0 to 898
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         899 non-null    object
 1   url          899 non-null    object
 2   keywords     899 non-null    object
 3   description  899 non-null    object
dtypes: object(4)
memory usage: 28.2+ KB


### Quick Check

Let's grab one of our descriptions and feed it through spaCy to check the entities.  Data frame row 122 was a good one to try; feel free to experiment.

In [None]:
text = gldf.at[122,'description']
print(text)

kwtest = gldf.at[1,'keywords']
print(kwtest)

High lake levels are reducing beach area along the Lake Michigan coastline and allowing wave action to erode the bases of coastal bluffs at the highest rate of the past 30 years. Sediment budget calculations have shown that bluff erosion is the dominant source of sand and gravel-sized particles that are mobilized into beaches and the nearshore system. Researchers have found that the leading cause of bluff erosion is shallow to intermediate depth translational landslides. Therefore, estimating lake sediment budgets depends on an understanding of the mechanisms that lead to landslide failure. This study will provide a comprehensive analysis of bluff stability for bluffs affected by landslide failure coupled with an analysis of bluff composition to determine the composition of sediment contributions of coastal bluffs to the southeast Lake Michigan sediment budget. This dataset is part of a series of repeat surveys documenting temporal changes to a 0.5 km extent of unconsolidated coastal b

In [None]:
from spacy import displacy
import spacy
import en_core_web_lg  # interesting need for import here...

nlp = en_core_web_lg.load()
doc2 = nlp(text)
displacy_image = displacy.render(doc2, jupyter = True, style = 'ent')

### Description Entities

Now let's loop over all of the data graphs and pull the descriptions.  We will then use spaCy to identify entities and hold these for later use.

In [None]:
from spacy import displacy
import spacy
import en_core_web_lg  # interesting need for import here...

nlp = en_core_web_lg.load()
rows = []

for i in range(len(gldf)):
  doc3 = nlp( gldf.at[i,'description'])
  for entity in doc3.ents:
    rows.append({'name': gldf.at[i,'name'], 'label': entity.label_, 'text':entity.text, 'url': gldf.at[i,'url']})

# Build the frame once; DataFrame.append was removed in pandas 2.0
df2 = pd.DataFrame(rows, columns=['name', 'label', 'text', 'url'])

In [None]:
df2.head(5)

Unnamed: 0,name,label,text,url
0,"High Resolution Topography near Santa Cruz, CA 2017",ORG,the National Center for Airborne Laser Mapping,https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042020.6339.2
1,"High Resolution Topography near Santa Cruz, CA 2017",ORG,NCALM,https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042020.6339.2
2,"High Resolution Topography near Santa Cruz, CA 2017",PERSON,Alison Duvall,https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042020.6339.2
3,"High Resolution Topography near Santa Cruz, CA 2017",ORG,the University of Washington,https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042020.6339.2
4,"High Resolution Topography near Santa Cruz, CA 2017",GPE,Watsonville,https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.042020.6339.2


## Get SWEET labels

Grab the SWEET labels into a data frame to match against.


In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import shapely

cor = "http://cor.esipfed.org/sparql"

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON

swt1 = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
where
{
  ?sub rdfs:label ?text
}
"""

swtdf = get_sparql_dataframe(cor, swt1)

In [None]:
swtdf.head(5)

Unnamed: 0,sub,text
0,https://www.inf.ufrgs.br/bdi/ontologies/geocoreontology#UFRGS:GeoCoreOntology_age_of,age of
1,https://www.inf.ufrgs.br/bdi/ontologies/geocoreontology#UFRGS:GeoCoreOntology_constituted_by,constituted by
2,https://www.inf.ufrgs.br/bdi/ontologies/geocoreontology#UFRGS:GeoCoreOntology_generated_by,generated by
3,https://www.inf.ufrgs.br/bdi/ontologies/geocoreontology#UFRGS:GeoCoreOntology_generated_in,generated in
4,https://www.inf.ufrgs.br/bdi/ontologies/geocoreontology#UFRGS:GeoCoreOntology_has_age,has age


## Description entities checked against SWEET labels

We can now check the entities identified in the descriptions against the SWEET labels.

### (Merge the data frames)

We can use some of the pandas goodness like in
https://stackoverflow.com/questions/61106803/python-panda-search-for-value-in-a-df-from-another-df



In [None]:
# merge swtdf and df2
# review the following snippet example too
# df_merged = pd.merge(df_address, df_CountryMapping, left_on=df_address["Country"].str.lower(), right_on=df_CountryMapping["NAME"].str.lower(), how="left")

## CHEAT
#dfcheat = df.append({'label': "test label", 'text':"Seasonal ice"}, ignore_index=True)
#dfcheat['text'] = dfcheat['text'].str.lower()

df2['text'] = df2['text'].str.lower()
swtdf['text'] = swtdf['text'].str.lower()

check = df2.merge(swtdf, on=['text'], how='outer').dropna()
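
The outer merge followed by `dropna()` keeps only rows where the text appears in both frames. The sketch below uses two tiny hand-made frames (the names and `example.org` URIs are hypothetical) to show that, for frames without missing values, this yields the same rows as an inner merge.

```python
import pandas as pd

# Hand-made stand-ins for df2 (entities) and swtdf (labels).
entities = pd.DataFrame({
    "name": ["Dataset A", "Dataset B"],
    "text": ["lidar", "watsonville"],
})
labels = pd.DataFrame({
    "sub": ["http://example.org/vocab/LIDAR", "http://example.org/vocab/Ice"],
    "text": ["lidar", "seasonal ice"],
})

# Outer merge keeps unmatched rows with NaN; dropna() then removes them.
outer = entities.merge(labels, on="text", how="outer").dropna()
# Inner merge keeps only the matched rows in the first place.
inner = entities.merge(labels, on="text", how="inner")

print(outer)
print(inner)
```

The two approaches can differ when the inputs already contain missing values in other columns, since `dropna()` would remove those rows as well.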


In [None]:
check.head()

Unnamed: 0,name,label,text,url,sub
1244,"Mackenzie, Canterbury, New Zealand 2015",ORG,lidar,https://doi.org/10.5069/G99G5JWQ,http://sweetontology.net/matrInstrument/LIDAR
1245,Jemez River Basin Snow-off Lidar Survey,ORG,lidar,https://doi.org/10.5069/G9RB72JV,http://sweetontology.net/matrInstrument/LIDAR
1246,Jemez River Basin Snow-off Lidar Survey,ORG,lidar,https://doi.org/10.5069/G9RB72JV,http://sweetontology.net/matrInstrument/LIDAR
1247,"Yosemite National Park, CA: Rockfall Studies",ORG,lidar,https://doi.org/10.5069/G9D798B8,http://sweetontology.net/matrInstrument/LIDAR
1248,"Hurunui Rivers, Canterbury, New Zealand 2013",ORG,lidar,https://doi.org/10.5069/G9PC3093,http://sweetontology.net/matrInstrument/LIDAR


In [None]:
def display_():     
  pd.set_option("display.max_rows", None)     
  from IPython.display import display
  display(check)

display_()

Unnamed: 0,name,label,text,url,sub
1244,"Mackenzie, Canterbury, New Zealand 2015",ORG,lidar,https://doi.org/10.5069/G99G5JWQ,http://sweetontology.net/matrInstrument/LIDAR
1245,Jemez River Basin Snow-off Lidar Survey,ORG,lidar,https://doi.org/10.5069/G9RB72JV,http://sweetontology.net/matrInstrument/LIDAR
1246,Jemez River Basin Snow-off Lidar Survey,ORG,lidar,https://doi.org/10.5069/G9RB72JV,http://sweetontology.net/matrInstrument/LIDAR
1247,"Yosemite National Park, CA: Rockfall Studies",ORG,lidar,https://doi.org/10.5069/G9D798B8,http://sweetontology.net/matrInstrument/LIDAR
1248,"Hurunui Rivers, Canterbury, New Zealand 2013",ORG,lidar,https://doi.org/10.5069/G9PC3093,http://sweetontology.net/matrInstrument/LIDAR
1249,Jemez River Basin Snow-on Lidar Survey,ORG,lidar,https://doi.org/10.5069/G9W37T86,http://sweetontology.net/matrInstrument/LIDAR
1250,Jemez River Basin Snow-on Lidar Survey,ORG,lidar,https://doi.org/10.5069/G9W37T86,http://sweetontology.net/matrInstrument/LIDAR
1251,"Marlborough, New Zealand 2018",ORG,lidar,https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.072019.2193.1,http://sweetontology.net/matrInstrument/LIDAR
1252,2010 CU-Boulder Campus and Flatirons,ORG,lidar,https://doi.org/10.5069/G9ZC80SR,http://sweetontology.net/matrInstrument/LIDAR
1253,2010 CU-Boulder Campus and Flatirons,ORG,lidar,https://doi.org/10.5069/G9ZC80SR,http://sweetontology.net/matrInstrument/LIDAR


## Keyword check against SWEET Labels

### ToDo
Do lookup based on Keywords
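
One possible approach to this lookup is sketched below. It assumes that keywords arrive either as a comma-separated string or as a list, and that the labels are held in a dictionary from lowercased label to URI (which could be built from `swtdf`). The names `split_keywords`, `match_keywords` and `label_index` are hypothetical, and only exact matches after lowercasing are found.

```python
def split_keywords(raw):
    """Normalize keywords (string, list, or None) to lowercased terms."""
    if isinstance(raw, str):
        parts = raw.split(",")
    else:
        parts = list(raw or [])
    return [str(p).strip().lower() for p in parts if str(p).strip()]

def match_keywords(raw_keywords, label_index):
    """Return (keyword, uri) pairs for keywords that exactly match a label."""
    return [
        (kw, label_index[kw])
        for kw in split_keywords(raw_keywords)
        if kw in label_index
    ]

# A one-entry index using the LIDAR URI seen in the results above.
label_index = {"lidar": "http://sweetontology.net/matrInstrument/LIDAR"}

matches = match_keywords("Lidar, topography , DEM", label_index)
print(matches)
```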