# OIH Data Science Notebook:  Geocoverage inspection on combined parquet

This notebook demonstrates query approach for the pre-processed resources from the OIH Graph

Notes:

Need to download the model with _python -m spacy download en_core_web_lg_

you can also downloiad with

```
pip install -q
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz
```

In [11]:
import duckdb

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  ## remove pandas future warning

import scispacy
import spacy
#import en_core_sci_sm
from spacy import displacy
import getpass
import pandas as pd
import urllib.request, json
import dask, boto3
import s3fs

In [12]:
# Install SpaCY Web Large
# !pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz


## Pre-processed OIH Graph

In [13]:
## load the combined graph
url = "http://ossapi.oceaninfohub.org/public/combined.parquet"
duckdb.install_extension("httpfs")

# Instantiate the DuckDB connection
con = duckdb.connect()
# con.execute("CREATE TABLE my_table AS SELECT * FROM read_parquet('{}')".format(url))  # load from url
con.execute("CREATE TABLE my_table AS SELECT * FROM read_parquet('./inputs/combined.parquet')") # load from local parquet


<duckdb.DuckDBPyConnection at 0x7f44806689f0>

In [14]:

# Now you can execute SQL queries on the Parquet file as if it was a regular table
# r = con.execute("SELECT DISTINCT provder FROM my_table").fetchdf()
# r = con.execute(" SELECT DISTINCT provder, type, ANY_VALUE(s),  COUNT(*) AS count FROM my_table GROUP BY provder, type  order by count desc").fetchdf()
r = con.execute(" SELECT DISTINCT provder, type, COUNT(*) AS count FROM my_table GROUP BY provder, type").fetchdf()

print(r)


                provder                      type   count
0          oceanexperts              schema:Event   20606
1          oceanexperts             schema:Course     491
2          oceanexperts     schema:CourseInstance     491
3                   pdh  schemawrong:Organization    3562
4                   pdh       schemawrong:Dataset   32807
5                   NaN            schmea:Dataset   20101
6             africaioc    schema:ResearchProject     176
7             africaioc       schema:Organization      52
8             africaioc            schema:Vehicle      30
9             africaioc              schema:Event      59
10            africaioc       schema:CreativeWork       1
11            africaioc            schema:Dataset       1
12            africaioc             schema:Person       1
13             aquadocs       schema:CreativeWork  261364
14     invemardocuments             schema:Person   13351
15     invemardocuments       schema:CreativeWork   18647
16       invem

In [15]:
# test data

text = "Estonian Marine Institute is one of many Tartu University&rsquo;s contemporary scientific institutions. The main purpose of Estonian Marine Institute is marine research and promotion of the given sphere in Estonia and the Baltic region. This institute is one of the biggest organizations in Estonia carrying out marine exploration contributing research in several marine study fields. Our research ranges from water physics to biology, from microscopic scale to ecosystems having unique expert opinion and qualification in most research fields. Estonian Marine Institute is an educational basis for the marine biology-oriented postgraduate students and is actively improving the advanced and extensive higher marine education in Estonia. The active progress of marine sciences in Estonia in the last decades has enhanced the awarness of Estonian science around the Baltic Sea countries and also in Europe. Estonian Marine Institute has very international nature and thus the research is conducted in high level and in close co-operation with other specialist around the world. The success of the Estonian marine scientists at the international level refers to their high-rated scientific publications and increase in the number of international projects. On the other hand a significant part of the research is aimed at finding solutions to the local scientific problems important for Estonia&rsquo;s well-being. We hope our website provides you easy access to all of the information you seek about the Estonian Marine Institute and useful links to other related themes."

In [16]:
from spacy import displacy
import spacy
import en_core_web_lg  # intersting need for import here...

nlp = en_core_web_lg.load()
# nlp = spacy.load("en_core_web_trf")
# doc2 = nlp(''.join(kwtest)) # nlp(text)
doc2 = nlp(text) # nlp(text)

displacy_image = displacy.render(doc2, jupyter = True, style = 'ent')

### Description Entities

Now lets loop on all of the data graphs and pull the descriptions.  We will then use spacy to identify entities and hold these for later use.

NORP == Nationalities or religious or political groups
GPE == spacy.explain("GPE")

In [17]:
# load the parquet into pandas
oihdf = pd.read_parquet("./inputs/combined.parquet")

In [18]:
oihdf.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 685406 entries, 0 to 20100
Data columns (total 7 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   s         685406 non-null  object
 1   type      685406 non-null  object
 2   name      652304 non-null  object
 3   keywords  427200 non-null  object
 4   url       520435 non-null  object
 5   desc      465528 non-null  object
 6   provder   665305 non-null  object
dtypes: object(7)
memory usage: 41.8+ MB


In [19]:
oihdf.head(10)

Unnamed: 0,s,type,name,keywords,url,desc,provder
0,<https://ioc-africa.org/dbs/jsonld/oceanProjec...,schema:ResearchProject,Integrated Natural Resource Management Program...,,,,africaioc
1,<https://ioc-africa.org/dbs/jsonld/previousExp...,schema:ResearchProject,1 cruise,,,,africaioc
2,<https://ioc-africa.org/dbs/jsonld/previousExp...,schema:ResearchProject,TR SALOUI Cruise,,,,africaioc
3,<https://ioc-africa.org/dbs/jsonld/oceanProjec...,schema:ResearchProject,SEYCHELLES THIRD FISCAL SUSTAINABILITY AND CLI...,,,,africaioc
4,<https://ioc-africa.org/dbs/jsonld/previousExp...,schema:ResearchProject,Kenya Shallow-water CrustaceanTrawl Survey,,,,africaioc
5,<https://ioc-africa.org/dbs/jsonld/previousExp...,schema:ResearchProject,VT 51 / OISO 6 Cruise,,,,africaioc
6,<https://ioc-africa.org/dbs/jsonld/oceanProjec...,schema:ResearchProject,Ghana - West Africa Regional Fisheries Program,,,,africaioc
7,<https://ioc-africa.org/dbs/jsonld/oceanProjec...,schema:ResearchProject,Marine and Coastal Environment Management,,,,africaioc
8,<https://ioc-africa.org/dbs/jsonld/previousExp...,schema:ResearchProject,VT 79 / OISO 12 Cruise,,,,africaioc
9,<https://ioc-africa.org/dbs/jsonld/oceanProjec...,schema:ResearchProject,Integrated Coastal and Marine Biodiversity Man...,,,,africaioc


In [20]:
%%time 
from spacy import displacy
import spacy
import en_core_web_lg  # intersting need for import here...
import numpy as np
from multiprocessing import Pool

nlp = en_core_web_lg.load()
nlpdf = pd.DataFrame(columns=['name', 'label', 'text', 'url'])

for i in range(len(oihdf)):
# for i in range(100):
    text = ' '.join(oihdf.at[i,'desc'].astype(str))
    doc3 = nlp(text)
    for entity in doc3.ents:
        nlpdf = nlpdf.append({'name': oihdf.at[i,'name'], 'label': entity.label_, 'text':entity.text, 'url': oihdf.at[i,'s']}, ignore_index=True)


KeyboardInterrupt: 

In [12]:
nlpdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705 entries, 0 to 1704
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    1705 non-null   object
 1   label   1705 non-null   object
 2   text    1705 non-null   object
 3   url     1705 non-null   object
dtypes: object(4)
memory usage: 53.4+ KB


In [13]:
nlpdf.head(10)

Unnamed: 0,name,label,text,url
0,0 Integrated Natural Resource Management Pr...,ORG,Integrated Natural Resource Management Program...,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
1,0 Integrated Natural Resource Management Pr...,ORG,Hang Rai,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
2,0 Integrated Natural Resource Management Pr...,GPE,Ninh Thuan,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
3,0 Integrated Natural Resource Management Pr...,ORG,NTT Fault Model Aaron Samuel Bracho Mosquera R...,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
4,0 Integrated Natural Resource Management Pr...,ORG,JOSE DE CALDAS,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
5,0 Integrated Natural Resource Management Pr...,ORG,BOGOTÁ,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
6,0 Integrated Natural Resource Management Pr...,ORG,UDFJC University,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
7,0 Integrated Natural Resource Management Pr...,ORG,Antofagasta ->,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
8,0 Integrated Natural Resource Management Pr...,ORG,Facultad de ciencias,0 <https://ioc-africa.org/dbs/jsonld/oceanP...
9,0 Integrated Natural Resource Management Pr...,DATE,mar y,0 <https://ioc-africa.org/dbs/jsonld/oceanP...


In [14]:
test_df= oihdf.head(1000)


In [None]:
%%time

from spacy import displacy
import spacy
import en_core_web_lg  # intersting need for import here...
import numpy as np
from multiprocessing import Pool

nlp = en_core_web_lg.load()
# nlpdf = pd.DataFrame(columns=['name', 'label', 'text', 'url'])

def getents(input):
    nlpdf = pd.DataFrame(columns=['name', 'label', 'text', 'url'])
    tj =  input #' '.join(input)        #' '.join(input.astype(str))
    if not tj is None:
        doc3 = nlp(tj)
        for entity in doc3.ents:
            nlpdf = nlpdf.append({'name': oihdf.at[i,'name'], 'label': entity.label_, 'text':entity.text, 'url': oihdf.at[i,'s']}, ignore_index=True)

    return nlpdf

    # Create a Pool of processes
with Pool(processes=10) as pool:
    # Apply the process_data function to each chunk of the DataFrame
    result_dfs = pool.map(getents, oihdf['desc'])

final_df = pd.concat(result_dfs)


#     # Create a Pool of processes
# with Pool(processes=10) as pool:
#     # Apply the process_data function to each chunk of the DataFrame
#     defresult_dfs = pool.map(getents, oihdf['desc'])

    
# deffinal_df = pd.concat(defresult_dfs)

# combined_df = pd.concat([final_df, deffinal_df], ignore_index=True)

    

In [16]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1248737 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   name    1248737 non-null  object
 1   label   1248737 non-null  object
 2   text    1248737 non-null  object
 3   url     1248737 non-null  object
dtypes: object(4)
memory usage: 47.6+ MB


In [15]:
combined_df.head(10)

Unnamed: 0,name,label,text,url
0,99 OISO5 (VT 4...,ORG,Integrated Natural Resource Management Program...,99 <https://ioc-africa.org/dbs/jsonld/previ...
0,99 OISO5 (VT 4...,CARDINAL,1,99 <https://ioc-africa.org/dbs/jsonld/previ...
0,99 OISO5 (VT 4...,GPE,Kenya,99 <https://ioc-africa.org/dbs/jsonld/previ...
1,99 OISO5 (VT 4...,ORG,CrustaceanTrawl,99 <https://ioc-africa.org/dbs/jsonld/previ...
0,99 OISO5 (VT 4...,ORG,VT 51,99 <https://ioc-africa.org/dbs/jsonld/previ...
1,99 OISO5 (VT 4...,CARDINAL,6,99 <https://ioc-africa.org/dbs/jsonld/previ...
0,99 OISO5 (VT 4...,GPE,Ghana,99 <https://ioc-africa.org/dbs/jsonld/previ...
0,99 OISO5 (VT 4...,ORG,Marine,99 <https://ioc-africa.org/dbs/jsonld/previ...
1,99 OISO5 (VT 4...,ORG,Coastal Environment Management,99 <https://ioc-africa.org/dbs/jsonld/previ...
0,99 OISO5 (VT 4...,CARDINAL,12,99 <https://ioc-africa.org/dbs/jsonld/previ...


In [23]:
filtered_df = final_df[final_df['label'] == 'GPE']

# Count the occurrences of each unique term in the "text" column
counts = filtered_df['text'].value_counts()

# Convert the counts Series to a DataFrame
new_df = counts.reset_index()

# Rename the columns
new_df.columns = ['place', 'count']

new_df.head(10)

Unnamed: 0,place,count
0,Iran,5391
1,Argentina,3648
2,Nigeria,3306
3,Kenya,2700
4,Florida,2106
5,India,2062
6,Cuba,2006
7,Canada,1521
8,California,1351
9,Venezuela,1284


In [24]:
new_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11679 entries, 0 to 11678
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   place   11679 non-null  object
 1   count   11679 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 182.6+ KB


In [25]:
new_df.to_csv('descCounts.csv', index=False)
