<a href="https://colab.research.google.com/github/ayushanand18/pyobis/blob/dataset-viz/notebooks/contributions_quantification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantifying and visualizing contributions from persons and organizations

This notebook presents a quantitative analysis of contributions made by individuals and organisations to the OBIS database.

We will use `pyobis` to fetch data.

## Installing `pyobis`

In [1]:
try:
  import pyobis
except ImportError:  # install only when pyobis is missing, without hiding other errors
  !pip install -q "git+https://github.com/iobis/pyobis.git"

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Building wheel for pyobis (PEP 517) ... done


#### Importing other modules

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

In [3]:
from pyobis import occurrences as occ
from pyobis import dataset

## Grabbing data

For our analysis, we will restrict ourselves to a specific set of datasets. We will fetch all the datasets recorded between 2017 and 2018 and look at their distribution among providers.

In [4]:
res = dataset.search(startdate="2017-01-01", enddate="2018-12-12")["results"]

Now we will convert this data into a pandas DataFrame.

In [5]:
data = pd.DataFrame(res)
data

Unnamed: 0,id,url,archive,published,created,updated,core,extensions,statistics,extent,...,citation_id,abstract,intellectualrights,feed,institutes,contacts,nodes,keywords,downloads,records
0,aac5ca81-638a-4335-9aa7-5c2bda67a362,https://ipt.inbo.be/resource?r=lbbg_zeebrugge,https://ipt.inbo.be/archive.do?r=lbbg_zeebrugge,2022-06-10T14:46:22.000Z,2022-06-01T14:47:54.732Z,2022-06-11T08:55:36.376Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 0, 'Occu...","POLYGON((-17.77747 12.355771,-17.77747 57.4918...",...,,This animal tracking dataset is derived from S...,"To the extent possible under law, the publishe...","{'id': '2b52ff52-bd4f-4800-97b6-882bc7698a22',...","[{'name': 'University of Amsterdam, Faculty of...","[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '4bf79a01-65a9-4db6-b37b-18434f26ddfc'...,"[{'keyword': 'animal movement', 'thesaurus': '...","[{'year': 2022, 'downloads': 638, 'records': 3...",651423
1,74d67d71-2d25-4fa1-9a1e-df71c6af891e,https://www.marine.csiro.au/ipt/resource?r=bio...,https://www.marine.csiro.au/ipt/archive.do?r=b...,2022-08-03T02:04:26.000Z,2020-06-21T12:17:59.993Z,2022-08-04T18:52:19.151Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 495943, ...",,...,,The Australian Marine Microbial Biodiversity I...,This work is licensed under a Creative Common...,"{'id': '42683bd2-2415-405b-8ddf-f7e9ca0d339e',...",[{'name': 'CSIRO National Collections and Mari...,"[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '2a57cd59-6799-4579-955e-27c9af97aea4'...,"[{'keyword': 'occurrence', 'thesaurus': 'GBIF ...","[{'year': 2022, 'downloads': 1580, 'records': ...",535546
2,2ac6b17f-3b3a-4e0f-b712-3b640bf79147,https://www.marine.csiro.au/ipt/resource?r=bio...,https://www.marine.csiro.au/ipt/archive.do?r=b...,2022-08-02T23:52:58.000Z,2020-06-21T12:44:11.592Z,2022-08-03T21:09:35.585Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 290825, ...",,...,,The Australian Marine Microbial Biodiversity I...,This work is licensed under a Creative Common...,"{'id': '42683bd2-2415-405b-8ddf-f7e9ca0d339e',...",[{'name': 'CSIRO National Collections and Mari...,"[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '2a57cd59-6799-4579-955e-27c9af97aea4'...,"[{'keyword': 'occurrence', 'thesaurus': 'GBIF ...","[{'year': 2022, 'downloads': 1858, 'records': ...",504709
3,2c4f07c6-1c02-4082-88da-9e51b7897f24,https://www.marine.csiro.au/ipt/resource?r=bio...,https://www.marine.csiro.au/ipt/archive.do?r=b...,2022-06-24T04:39:34.000Z,2020-06-20T14:01:17.907Z,2022-06-24T15:44:28.300Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 505736, ...",,...,,The Australian Marine Microbial Biodiversity I...,This work is licensed under a Creative Common...,"{'id': '42683bd2-2415-405b-8ddf-f7e9ca0d339e',...",[{'name': 'CSIRO National Collections and Mari...,"[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '2a57cd59-6799-4579-955e-27c9af97aea4'...,"[{'keyword': 'occurrence', 'thesaurus': 'GBIF ...","[{'year': 2022, 'downloads': 1763, 'records': ...",483556
4,8b0d5fdd-6a3f-48c7-a4aa-84f39f2df647,https://www.marine.csiro.au/ipt/resource?r=bio...,https://www.marine.csiro.au/ipt/archive.do?r=b...,2022-08-03T02:05:07.000Z,2020-06-21T12:31:22.183Z,2022-08-05T10:51:58.796Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 538392, ...",,...,,The Australian Marine Microbial Biodiversity I...,This work is licensed under a Creative Common...,"{'id': '42683bd2-2415-405b-8ddf-f7e9ca0d339e',...",[{'name': 'CSIRO National Collections and Mari...,"[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '2a57cd59-6799-4579-955e-27c9af97aea4'...,"[{'keyword': 'occurrence', 'thesaurus': 'GBIF ...","[{'year': 2022, 'downloads': 1653, 'records': ...",464032
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
688,dc4fe74c-cb85-48e9-a1d8-ff5a39737e40,http://ipt.env.duke.edu/resource?r=zd_1704,https://ipt.env.duke.edu/archive.do?r=zd_1704,2022-06-07T23:49:07.000Z,2020-10-12T23:30:18.100Z,2022-06-09T10:05:57.744Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 0, 'Occu...","POLYGON((172.925134 -43.846403,172.925134 -41....",...,https://seamap.env.duke.edu/dataset/1704,Original provider:\nHappywhale\n\nDataset cred...,This work is licensed under a Creative Common...,"{'id': '18954703-9b9d-4584-b46d-87846532c5ee',...",,"[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '573654c1-4ce7-4ea2-b2f1-e4d42f8f9c31'...,"[{'keyword': 'Occurrence,PhotoID;Visual sighti...","[{'year': 2022, 'downloads': 920, 'records': 4...",1
689,ebe7ffad-b134-40d3-b3df-21b675c4bcb1,http://ipt.vliz.be/eurobis/resource?r=nbn_ga00...,http://ipt.vliz.be/eurobis/archive.do?r=nbn_ga...,2022-05-18T14:12:13.000Z,,2022-06-27T08:57:10.521Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 2249, 'O...",POLYGON((-8.637567944960944 54.638541844356304...,...,https://doi.org/10.15468/faxvgd,The dataset comprises: species records from be...,This work is licensed under a Creative Common...,"{'id': 'e3dad797-a123-4e78-8473-5b0a295d3685',...","[{'name': 'Vlaams Instituut voor de Zee', 'oce...","[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '4bf79a01-65a9-4db6-b37b-18434f26ddfc'...,"[{'keyword': 'Biodiversity', 'thesaurus': 'ASF...","[{'year': 2022, 'downloads': 14484, 'records':...",1
690,ee38cf35-ccff-4e2e-bd57-ea5a350bfa6a,http://ipt.env.duke.edu/resource?r=zd_1714,http://ipt.env.duke.edu/archive.do?r=zd_1714,2021-05-25T13:31:18.000Z,2020-10-12T23:30:04.907Z,2021-07-06T23:01:39.054Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 0, 'Occu...","POLYGON((-44.5055 -25.341,-44.5055 -0.126667,-...",...,https://seamap.env.duke.edu/dataset/1714,Original provider:\nHappywhale\n\nDataset cred...,This work is licensed under a Creative Common...,"{'id': '18954703-9b9d-4584-b46d-87846532c5ee',...",,"[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '573654c1-4ce7-4ea2-b2f1-e4d42f8f9c31'...,"[{'keyword': 'Occurrence,PhotoID;Visual sighti...","[{'year': 2022, 'downloads': 1006, 'records': ...",1
691,f49b2ebe-d278-4454-8b3e-c04eb3c0362d,http://ipt.env.duke.edu/resource?r=zd_1988,http://ipt.env.duke.edu/archive.do?r=zd_1988,2021-05-25T13:45:53.000Z,2020-10-12T23:46:42.677Z,2021-07-06T23:05:53.718Z,occurrence,[],"{'Event': 0, 'absence': 0, 'dropped': 0, 'Occu...",,...,https://seamap.env.duke.edu/dataset/1988,Original provider:\nHappywhale\n\nDataset cred...,This work is licensed under a Creative Common...,"{'id': '18954703-9b9d-4584-b46d-87846532c5ee',...",,"[{'role': None, 'type': 'creator', 'givenname'...",[{'id': '573654c1-4ce7-4ea2-b2f1-e4d42f8f9c31'...,"[{'keyword': 'Occurrence,PhotoID;Visual sighti...","[{'year': 2022, 'downloads': 909, 'records': 9...",1


Now let us look at the data providers' details.

In [6]:
data["contacts"]

0      [{'role': None, 'type': 'creator', 'givenname'...
1      [{'role': None, 'type': 'creator', 'givenname'...
2      [{'role': None, 'type': 'creator', 'givenname'...
3      [{'role': None, 'type': 'creator', 'givenname'...
4      [{'role': None, 'type': 'creator', 'givenname'...
                             ...                        
688    [{'role': None, 'type': 'creator', 'givenname'...
689    [{'role': None, 'type': 'creator', 'givenname'...
690    [{'role': None, 'type': 'creator', 'givenname'...
691    [{'role': None, 'type': 'creator', 'givenname'...
692    [{'role': None, 'type': 'creator', 'givenname'...
Name: contacts, Length: 693, dtype: object

We can see that the data provider details are nested as lists of dictionaries. We need to unnest this data to gather information about each dataset's owner.

In [8]:
unnest_df = pd.json_normalize(res, "contacts", ["id"]) # we will identify the records using dataset UUID
unnest_df

Unnamed: 0,role,type,givenname,surname,organization,position,email,url,organization_oceanexpert_id,type_display,id
0,,creator,Eric W.M.,Stienen,Research Institute for Nature and Forest,,,,17039.0,Creator,aac5ca81-638a-4335-9aa7-5c2bda67a362
1,,creator,Peter,Desmet,Research Institute for Nature and Forest (INBO),,,,17039.0,Creator,aac5ca81-638a-4335-9aa7-5c2bda67a362
2,,creator,Tanja,Milotic,Research Institute for Nature and Forest,,,,17039.0,Creator,aac5ca81-638a-4335-9aa7-5c2bda67a362
3,,creator,Francisco,Hernandez,Flanders Marine Institute,,,,,Creator,aac5ca81-638a-4335-9aa7-5c2bda67a362
4,,creator,Klaas,Deneudt,Flanders Marine Institute,,,,,Creator,aac5ca81-638a-4335-9aa7-5c2bda67a362
...,...,...,...,...,...,...,...,...,...,...,...
5263,distributor,associatedParty,,OBIS-SEAMAP,"Marine Geospatial Ecology Lab, Duke University",,seamap-contact@duke.edu,http://seamap.env.duke.edu,19393.0,Distributor,fda6d4a4-8d14-45bc-9b4c-24d89fd213ab
5264,owner,associatedParty,Marina,Costa,Tethys Research Institute,Primary contact,marinza.costa@gmail.com,www.tethys.org,,Owner,fda6d4a4-8d14-45bc-9b4c-24d89fd213ab
5265,originator,associatedParty,Giuseppe,Notarbartolo di Sciara,Tethys Research Institute,Secondary contact,tethys@tethys.org,www.tethys.org,,Originator,fda6d4a4-8d14-45bc-9b4c-24d89fd213ab
5266,owner,personnel,Marina,Costa,,,,,,Owner,fda6d4a4-8d14-45bc-9b4c-24d89fd213ab
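The unnesting step above can be tried on a small example. The sketch below uses a made-up miniature of the API response (the `toy_res` records and their names are hypothetical, not OBIS data) to show how `pd.json_normalize` turns each nested contact into its own row while carrying the parent dataset `id` along.

```python
import pandas as pd

# Hypothetical miniature of the API response: two datasets with nested contacts.
toy_res = [
    {"id": "dataset-a", "contacts": [
        {"givenname": "Ada", "surname": "Lovelace", "type_display": "Owner"},
        {"givenname": "Alan", "surname": "Turing", "type_display": "Creator"},
    ]},
    {"id": "dataset-b", "contacts": [
        {"givenname": "Grace", "surname": "Hopper", "type_display": "Owner"},
    ]},
]

# record_path="contacts" yields one row per contact;
# meta=["id"] repeats the parent dataset id on every row.
toy_unnested = pd.json_normalize(toy_res, record_path="contacts", meta=["id"])
print(toy_unnested)
```

Each dataset contributes as many rows as it has contacts, which is why the unnested table is much longer than the original list of datasets.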


Now that we have unnested the contact data, let us count, for each dataset, the entries whose position is "Primary contact".

In [18]:
unnest_df[unnest_df["position"]=="Primary contact"].groupby("id").givenname.count()

id
00e92406-7ace-4fae-aae1-a294c7dec151    3
026d9e46-1734-44b3-84da-9048df73b4f6    3
034e9fa8-2640-49df-89cc-81590b5d11d2    3
03d772f9-454f-42f6-91df-4d1faf2ead16    3
03dc7648-e915-45b8-ab5b-b3a16a3664b4    3
                                       ..
fe558aab-4a93-4f9f-b525-b481ad68e2b6    3
fecb2219-2c7d-4324-8a13-e95fe552a743    3
fef0bdef-04de-4968-a5e2-d267463d00af    3
fefcba02-f27c-448e-8038-e4a5e0bd51d7    3
ff91dc4c-b725-45d9-9bfa-f046b36a9f0b    3
Name: givenname, Length: 223, dtype: int64
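One detail worth keeping in mind when reading these numbers: `count()` tallies only non-null values of `givenname`, so contacts without a given name are not counted. The sketch below, on hypothetical toy data, contrasts it with `size()`, which counts every row.

```python
import pandas as pd

# Hypothetical contact rows; one primary contact has no given name.
toy_contacts = pd.DataFrame({
    "id": ["dataset-a", "dataset-a", "dataset-a", "dataset-b"],
    "position": ["Primary contact"] * 4,
    "givenname": ["Ada", None, "Ada", "Grace"],
})

primary = toy_contacts[toy_contacts["position"] == "Primary contact"]

# count() ignores null given names; size() counts all rows in the group.
per_dataset_count = primary.groupby("id")["givenname"].count()
per_dataset_size = primary.groupby("id").size()

print(per_dataset_count)
print(per_dataset_size)
```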

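Now we will fetch each dataset owner's name and add it to the corresponding record in our original DataFrame.

In [64]:
# object dtype so that string names can be stored alongside missing values
data["owner_name"] = pd.Series(np.nan, index=data.index, dtype=object)
for dataset_id in unnest_df["id"].unique():  # avoid shadowing the built-in id()
  # keep only the primary contact listed as owner of this dataset
  x = unnest_df[(unnest_df["id"]==dataset_id) & (unnest_df["position"]=="Primary contact") & (unnest_df["type_display"]=="Owner")]
  if(x["givenname"].any() and x["surname"].any()): # skip datasets without a usable given name and surname
    # .loc accepts a boolean mask, whereas .at expects a single scalar label
    data.loc[data["id"]==dataset_id, "owner_name"] = x["givenname"].values[0] + " " + x["surname"].values[0]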

Now let us look back at the dataframe.

In [65]:
data["owner_name"]

0                NaN
1                NaN
2                NaN
3                NaN
4                NaN
           ...      
688    Ted Cheeseman
689              NaN
690    Ted Cheeseman
691    Ted Cheeseman
692     Marina Costa
Name: owner_name, Length: 693, dtype: object
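Once an `owner_name` column exists, contributions can be tallied per owner. The sketch below is one possible way to do this, shown on hypothetical toy data rather than the real `data` and `unnest_df`: it builds the owner name without a Python loop (rows with a missing given name or surname are dropped, which differs slightly from the loop above), then counts datasets and sums the `records` column per owner.

```python
import pandas as pd

# Hypothetical stand-ins for `data` and `unnest_df`.
toy_data = pd.DataFrame({"id": ["a", "b", "c"], "records": [10, 5, 1]})
toy_contacts = pd.DataFrame({
    "id": ["a", "a", "b", "c"],
    "givenname": ["Ada", "Alan", "Ada", None],
    "surname": ["Lovelace", "Turing", "Lovelace", "Hopper"],
    "position": ["Primary contact", "Secondary contact",
                 "Primary contact", "Primary contact"],
    "type_display": ["Owner", "Originator", "Owner", "Owner"],
})

# Primary-contact owners with a complete name, one row per dataset.
owners = toy_contacts[
    (toy_contacts["position"] == "Primary contact")
    & (toy_contacts["type_display"] == "Owner")
].dropna(subset=["givenname", "surname"]).drop_duplicates("id")

# Map dataset id -> "givenname surname" and attach it to the dataset table.
owner_names = owners.assign(
    owner_name=owners["givenname"] + " " + owners["surname"]
).set_index("id")["owner_name"]
toy_data["owner_name"] = toy_data["id"].map(owner_names)

# Two simple contribution measures per owner (NaN owners are excluded).
datasets_per_owner = toy_data["owner_name"].value_counts()
records_per_owner = toy_data.groupby("owner_name")["records"].sum()

print(datasets_per_owner)
print(records_per_owner)
```

Either measure can then be passed to `plot(kind="bar")` for a quick visual comparison of contributors.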

+ spatial distribution --> plot on a world map (matplotlib)
+ temporal distribution --> year-wise distribution of records (maybe a histogram)
+ taxonomic distribution --> a stacked bar plot for each record
+ what MoFs are included? --> list the MoFs for each dataset, as done for the previous one

+ what makes this dataset unique?

    > Not yet known.

----
We first grab some data for *Mola mola*, then get the contacts of the owner.

In [None]:
id_df = pd.json_normalize(res, "contacts", ["id"])
id_df

Unnamed: 0,role,type,givenname,surname,organization,position,email,url,organization_oceanexpert_id,type_display,id
0,,creator,Olivier,Van Canneyt,"Observatoire Pelagis UMS 3462, University La R...",Primary contact,olivier.van-canneyt@univ-lr.fr,,,Creator,2101d4c5-c20b-49c0-a44b-3d6484c4c891
1,,creator,Hélène,Peltier,"Observatoire PELAGIS, UMS 3462, University La ...",Secondary contact,hpeltier@univ-lr.fr,,,Creator,2101d4c5-c20b-49c0-a44b-3d6484c4c891
2,,contact,Olivier,Van Canneyt,"Observatoire Pelagis UMS 3462, University La R...",Primary contact,olivier.van-canneyt@univ-lr.fr,,,Contact,2101d4c5-c20b-49c0-a44b-3d6484c4c891
3,,metadataProvider,,OBIS-SEAMAP,"Marine Geospatial Ecology Lab, Duke University",,seamap-contact@duke.edu,https://seamap.env.duke.edu,,Metadata Provider,2101d4c5-c20b-49c0-a44b-3d6484c4c891
4,distributor,associatedParty,,OBIS-SEAMAP,"Marine Geospatial Ecology Lab, Duke University",,seamap-contact@duke.edu,https://seamap.env.duke.edu,,Distributor,2101d4c5-c20b-49c0-a44b-3d6484c4c891
...,...,...,...,...,...,...,...,...,...,...,...
1228,owner,personnel,Monique,Pool,,,,,,Owner,f306ca42-4493-4bec-b9c9-cf10764a81bd
1229,originator,personnel,Marijke,de Boer,,,,,,Originator,f306ca42-4493-4bec-b9c9-cf10764a81bd
1230,,creator,,,Academia Sinica Biodiversity Research Centre,,,,20778.0,Creator,ff8b7809-41bc-40ad-8160-0e33862817a0
1231,,contact,,OBIS Secretariat,Intergovernmental Oceanographic Commission of ...,,info@iobis.org,http://www.iobis.org,6860.0,Contact,ff8b7809-41bc-40ad-8160-0e33862817a0


In [None]:
id_df["position"].unique()  # distinct position titles, matching the array shown below

array(['Primary contact', 'Secondary contact', None, 'Scientist',
       'Data Officer', 'OBIS Australia Node manager',
       'Deputy Center Director', 'Biologist', 'Collections Associate',
       'Director', 'Senior Manager of Biodiversity Informatics',
       'Principal Investigator', 'Marine Database Manager', 'Co-author',
       'researcher', 'biologist', 'geomatician',
       'OBIS Canada node manager', 'Collection Manager',
       'OBIS Australia Data manager', 'OBIS Australia Data Manager',
       'OBIS Technician', 'Marine Data Manager', 'Curator',
       'Senior Curator', 'Supervisory Research Fish Biologist',
       'Chief Scientist and Database Manager',
       'Information Technology Specialist', 'Curator, Vertebrate Zoology',
       'Manager, Fish Collection',
       'Assistant Collections Information Manager',
       'Assistant Collection Manager', 'Collection Technician',
       'Curator of Fishes and Collections Manager', 'Laboratory Manager',
       'data management t