# QA of DSL Release v1.21 - GRID Tickets only
The purpose of this notebook is to review and test the new GRID search features in DSL 1.21.

Docs: https://docs.dimensions.ai/dsl/1.21.0-preview/


## Prerequisites

Please install the latest versions of these libraries to run this notebook. 

In [84]:
import dimcli
from dimcli.shortcuts import dslquery, dslqueryall
from tqdm import tqdm
import pandas as pd
import json
dimcli.login(instance="test")
# OR EXPLICITLY: dimcli.login(user="", password="", endpoint='https://integration.ds-metrics.com')
dsl = dimcli.Dsl() 

DimCli v0.6.1 - Succesfully connected to <https://integration.ds-metrics.com> (method: dsl.ini file)


---

## DSL-247 GRID API Function `extract_affiliations`
https://uberresearch.atlassian.net/browse/DSL-247

Let's pull out some real-world affiliations.

Note that empty values sometimes come through as `None` and sometimes as an empty string, so we normalize both to an empty string.

In [None]:
res = dslquery("""search publications for "malaria" where authors is not empty return publications[authors]""")
testdata = []
for pub in res.publications:
    for author in pub['authors']:
        affiliations = author.get('affiliations')
        if affiliations:
            first = affiliations[0]  # use only the author's first affiliation
            row = [first.get('name', ""), first.get('city', ""), first.get('state', ""), first.get('country', "")]
            row = [v if v is not None else "" for v in row]  # normalize None to ""
            if row not in testdata:
                testdata.append(row)
for item in enumerate(testdata[:15]):
    print(item)
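
The normalization described above can also be factored into a small helper. This is only a sketch: `normalize_affiliation` is a hypothetical name, not part of dimcli, and the sample dictionary is made up to cover the three cases (a `None` value, an empty string, a missing key).

```python
# Hypothetical helper (not part of dimcli): turn one affiliation dict into
# a [name, city, state, country] list, mapping None and missing keys to "".
AFFILIATION_KEYS = ("name", "city", "state", "country")

def normalize_affiliation(affiliation):
    return [affiliation.get(key) or "" for key in AFFILIATION_KEYS]

# Made-up sample input covering None, an empty string and a missing key
sample = {"name": "Example University", "city": None, "state": ""}
print(normalize_affiliation(sample))
```

Using `or ""` treats `None`, `""` and absent keys the same way, which is the behaviour the note above asks for.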

### Structured data

#### 1.1 Single query

In [None]:
# `results` not specified: rely on the default output
for d in tqdm(testdata[:15]):
    res = dsl.query(f"""extract_affiliations(name="{d[0]}", city="{d[1]}", state="{d[2]}", country="{d[3]}")""")
    print(res.json)

Also test passing each `results` option.

In [None]:

results_opts = ['basic', 'full', 'publisher']
for d in tqdm(testdata[:10], desc="affiliations outer loop"):
    for r in tqdm(results_opts):
        res = dsl.query(f"""extract_affiliations(name="{d[0]}", city="{d[1]}", state="{d[2]}", country="{d[3]}", results="{r}")""")
        print(res.json)
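
Real-world affiliation names can contain double quotes, which would break the f-string queries above. Below is a minimal sketch of a query builder that escapes them first. `escape_dsl` and `build_extract_query` are hypothetical helpers, and backslash-escaping of `"` inside DSL string literals is an assumption that should be checked against the DSL docs.

```python
def escape_dsl(value):
    # Assumption: the DSL accepts backslash-escaped quotes inside string literals
    return value.replace("\\", "\\\\").replace('"', '\\"')

def build_extract_query(name, city="", state="", country="", results=None):
    # Build the structured form of extract_affiliations as a query string
    args = [
        f'name="{escape_dsl(name)}"',
        f'city="{escape_dsl(city)}"',
        f'state="{escape_dsl(state)}"',
        f'country="{escape_dsl(country)}"',
    ]
    if results:
        args.append(f'results="{results}"')
    return "extract_affiliations(" + ", ".join(args) + ")"

# Made-up affiliation name containing quotes
print(build_extract_query('The "Sapienza" University', "Rome", "", "Italy", results="basic"))
```

The resulting string could then be passed to `dsl.query(...)` in place of the inline f-string.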

#### 1.2 Batch matching

Build a JSON object from the affiliations test data above

In [None]:
jsondata = []
for el in testdata:
    jsondata.append({"name":el[0],"city":el[1],"state":el[2],"country":el[3]})
# json.dumps(jsondata)

In [None]:
res = dsl.query(f"""extract_affiliations(json={json.dumps(jsondata)})""")
print(res.json)

### Unstructured data

#### 2.1 Single query

In [None]:
# `results` not specified: rely on the default output
for d in tqdm(testdata[:15]):
    res = dsl.query(f"""extract_affiliations(affiliation="{d[0]} {d[1]} {d[2]} {d[3]}")""")
    print(res.json)
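
Because empty fields are normalized to `""`, the f-string above can produce doubled or trailing spaces in the affiliation text. Below is a small sketch that joins only the non-empty parts. `affiliation_text` is a hypothetical helper name, and this notebook does not verify whether the extra spaces affect matching.

```python
def affiliation_text(parts):
    # Join name/city/state/country, skipping empty or whitespace-only strings
    return " ".join(p.strip() for p in parts if p and p.strip())

# Sample row in the same [name, city, state, country] layout as testdata
row = ["Parthenope University of Naples", "Naples", "", "Italy"]
print(affiliation_text(row))
```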

Also test passing each `results` option.

In [None]:
results_opts = ['basic', 'full', 'publisher']
for d in tqdm(testdata[:10], desc="affiliations outer loop"):
    for r in tqdm(results_opts):
        res = dsl.query(f"""extract_affiliations(affiliation="{d[0]} {d[1]} {d[2]} {d[3]}", results="{r}")""")
        print(res.json)

#### 2.2 Batch matching

Build a JSON object from the affiliations test data above

In [None]:
jsondata = []
for d in testdata:
    jsondata.append({"affiliation": f"{d[0]} {d[1]} {d[2]} {d[3]}"})
# json.dumps(jsondata)

In [None]:
res = dsl.query(f"""extract_affiliations(json={json.dumps(jsondata)})""")
print(res.json)

## DSL-207 GRID API phase 2 basic keyword and ID search 

https://uberresearch.atlassian.net/browse/DSL-207

In [37]:
%dsl search organizations where id="grid.410356.5" return organizations[basics]

Returned Organizations: 1 (total = 1)


<dimcli.Result object #4761677968. Records: 1/1>

In [39]:
%dsl search organizations where types = "Education" return organizations

Returned Organizations: 20 (total = 18775)


<dimcli.Result object #4495583056. Records: 20/18775>

In [32]:
res = dsl.query("""search organizations for "naples" return organizations limit 10""")
res.as_dataframe().head(3)

Returned Organizations: 10 (total = 10)


Unnamed: 0,types,latitude,state_name,city_name,country_name,name,longitude,linkout,acronym,id
0,,,Florida,Naples,United States,Cancer Alliance of Naples,,,,grid.427620.5
1,[Other],26.213648,Florida,Naples,United States,Naples Anesthesia & Physician Associates,-81.73307,[http://painfreenaples.com/],NAPA,grid.477562.5
2,[Education],40.83726,,Naples,Italy,Parthenope University of Naples,14.253195,[http://www.uniparthenope.it/],,grid.17682.3a


In [42]:
res = dsl.query("""search organizations for "naples" where state_name is not empty return organizations limit 10""")
res.as_dataframe().head(3)

Returned Organizations: 4 (total = 4)


Unnamed: 0,types,latitude,state_name,city_name,country_name,name,longitude,linkout,acronym,id
0,,,Florida,Naples,United States,Cancer Alliance of Naples,,,,grid.427620.5
1,[Other],26.213648,Florida,Naples,United States,Naples Anesthesia & Physician Associates,-81.73307,[http://painfreenaples.com/],NAPA,grid.477562.5
2,[Nonprofit],26.15077,Florida,Naples,United States,Naples Community Hospital Healthcare System,-81.798485,[http://www.nchmd.org/],NCH,grid.489100.4


In [29]:
%dsldf search organizations where latitude > 5 return organizations limit 10

Returned Errors: 1
Semantic Error
Semantic errors found:
	Field 'latitude' can not be used in filters. It can only be returned in the result.


In [30]:
%dsldf search organizations where longitude > 5 return organizations limit 10

Returned Errors: 1
Semantic Error
Semantic errors found:
	Field 'longitude' can not be used in filters. It can only be returned in the result.
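
Since `latitude` and `longitude` can only be returned, not filtered on, any coordinate filter has to run client-side on the returned records. Below is a minimal sketch, assuming the records are plain dictionaries like those in `res.organizations`. The helper name `filter_by_min_latitude` is hypothetical, and the sample rows are copied from the "naples" output above.

```python
def filter_by_min_latitude(records, min_latitude):
    # Keep records whose latitude is present and above the threshold
    return [r for r in records
            if r.get("latitude") is not None and r["latitude"] > min_latitude]

sample = [
    {"id": "grid.427620.5", "name": "Cancer Alliance of Naples", "latitude": None},
    {"id": "grid.477562.5", "name": "Naples Anesthesia & Physician Associates", "latitude": 26.213648},
    {"id": "grid.17682.3a", "name": "Parthenope University of Naples", "latitude": 40.83726},
]
print([r["id"] for r in filter_by_min_latitude(sample, 30)])
```

Records without coordinates are dropped rather than raising an error, which matters because some organizations have no latitude at all.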


#### More interesting queries

Faceting works OK.

In [70]:
%dsldf search organizations for "nhs" return city_name

Returned City_name: 20


Unnamed: 0,id,count
0,London,38
1,Birmingham,7
2,Liverpool,7
3,Leeds,5
4,Manchester,5
5,Cambridge,3
6,Cardiff,3
7,Edinburgh,3
8,Exeter,3
9,Norwich,3


Searching using `types`:

In [44]:
%dsldf search organizations where country_name="Italy" and types="Company" return organizations[basics+acronym] limit 10

Returned Organizations: 10 (total = 627)


Unnamed: 0,types,latitude,state_name,city_name,country_name,name,longitude,linkout,acronym,id
0,[Company],44.428673,,Genoa,Italy,Ansaldo (Italy),8.886877,[http://www.ansaldoenergia.com/],,grid.12513.34
1,[Company],41.961998,,Rome,Italy,Telecom Italia (Italy),12.461547,[http://www.telecomitalia.com/tit/en.html],,grid.14587.3f
2,[Company],43.780216,,Florence,Italy,Menarini (Italy),11.276119,[http://www.menarini.com/],,grid.417562.3
3,[Company],43.471596,,Ancona,Italy,Loccioni (Italy),13.074696,[http://www.loccioni.com/?lang=en],,grid.423688.3
4,[Company],41.827152,,Rome,Italy,Eni (Italy),12.471406,[http://www.eni.com/en_IT/home.html],,grid.423791.a
5,[Company],41.92051,,Rome,Italy,Finmeccanica (Italy),12.469245,[http://www.finmeccanica.com/en/home],,grid.423952.b
6,[Company],45.61292,,Trieste,Italy,Eurospital (Italy),13.819512,[http://www.eurospital.com/],,grid.433771.5
7,[Company],44.71974,,Milan,Italy,UniCredit (Italy),8.579898,[https://www.unicreditgroup.eu/en.html#],,grid.436156.3
8,[Company],45.467724,,Milan,Italy,Edison (Italy),9.1781,[http://www.edison.it/],,grid.436460.3
9,[Company],43.62175,,Sansepolcro,Italy,Aboca (Italy),12.120622,[http://www.aboca.com/it],,grid.467166.4


In [45]:
%dsldf search organizations where country_name="Italy" and types in ["Education", "Company"] return organizations[basics+acronym] limit 10

Returned Organizations: 10 (total = 854)


Unnamed: 0,types,latitude,state_name,city_name,country_name,name,longitude,linkout,acronym,id
0,[Education],45.478863,,Milan,Italy,Polytechnic University of Milan,9.228206,[http://www.polimi.it/en/],,grid.4643.5
1,[Education],40.84722,Campania,Naples,Italy,University of Naples Federico II,14.256944,[http://www.unina.it/index.jsp],,grid.4691.a
2,[Education],46.080833,,Udine,Italy,University of Udine,13.211667,[http://www.uniud.it/en/uniud-international?se...,,grid.5390.f
3,[Education],43.71643,Toscana,Pisa,Italy,University of Pisa,10.398687,[http://www.unipi.it/],UniPi,grid.5395.a
4,[Education],44.40292,Liguria,Genoa,Italy,University of Genoa,8.958889,[http://www.unige.it/],UniGe,grid.5606.5
5,[Education],45.406387,Veneto,Padua,Italy,University of Padua,11.877446,[http://www.unipd.it/en/home-page],UNIPD,grid.5608.b
6,[Education],45.43667,Veneto,Verona,Italy,University of Verona,11.003611,[http://www.univr.it/jsp/index.jsp],,grid.5611.3
7,[Education],44.49389,Emilia-Romagna,Bologna,Italy,University of Bologna,11.342778,[http://www.unibo.it/en/homepage],UNIBO,grid.6292.f
8,[Education],41.85015,Lazio,Rome,Italy,University of Rome Tor Vergata,12.597991,[http://web.uniroma2.it/home.php?newlang=italian],,grid.6530.0
9,[Education],43.617065,,Ancona,Italy,Marche Polytechnic University,13.51269,[http://www.univpm.it/Entra/],,grid.7010.6


In [46]:
%dsldf search organizations where id="grid.477562.5" return organizations[basics+acronym]

Returned Organizations: 1 (total = 1)


Unnamed: 0,types,latitude,state_name,city_name,country_name,name,longitude,linkout,acronym,id
0,[Other],26.213648,Florida,Naples,United States,Naples Anesthesia & Physician Associates,-81.73307,[http://painfreenaples.com/],NAPA,grid.477562.5


Searching publications using the new organization fields (**neat**).

**NOTE** the following two queries are syntactically valid but don't work properly due to sub-querying limitations (see the `Query is too long or complex` message in their output). Do we have a workaround? Otherwise we should just tell people not to do this.

In [73]:
%dsldf search publications where research_orgs.country_name="Italy" and research_orgs.types="Company" 

Returned Publications: 20 (total = 9866)
Query is too long or complex. Please see https://docs.dimensions.ai/dsl/faq.html for more information. [code: 4]


Unnamed: 0,id,year,volume,type,author_affiliations,pages,title,issue,journal.id,journal.title,journal
0,pub.1121864022,2020,12.0,article,"[[{'first_name': 'Takafumi', 'last_name': 'Tan...",1-1,Autonomous network diagnosis from the carrier ...,3.0,jour.1138977,Journal of Optical Communications and Networking,
1,pub.1120848982,2020,26.0,article,"[[{'first_name': 'Matteo', 'last_name': 'Buffo...",1-8,Investigation of Current-Driven Degradation of...,2.0,jour.1033570,IEEE Journal of Selected Topics in Quantum Ele...,
2,pub.1121564312,2020,26.0,article,"[[{'first_name': 'Lorenzo', 'last_name': 'Colu...",1-10,Efficient and Optical Feedback Tolerant Hybrid...,2.0,jour.1033570,IEEE Journal of Selected Topics in Quantum Ele...,
3,pub.1121066317,2020,12.0,article,"[[{'first_name': 'Takafumi', 'last_name': 'Tan...",a9-a17,Autonomous network diagnosis from the carrier ...,1.0,jour.1138977,Journal of Optical Communications and Networking,
4,pub.1122449006,2020,185.0,article,"[[{'first_name': 'F.I.', 'last_name': 'Mulder'...",13-19,Edoxaban for treatment of venous thromboemboli...,,jour.1015410,Thrombosis Research,
5,pub.1113506834,2020,35.0,article,"[[{'first_name': 'Roberto', 'last_name': 'Rizz...",430-442,An Isolated Multilevel Quasi-Resonant Multipha...,1.0,jour.1033560,IEEE Transactions on Power Electronics,
6,pub.1121964341,2020,330.0,article,"[[{'first_name': 'Todd J.', 'last_name': 'Levy...",108467,An impedance matching algorithm for common-mod...,,jour.1089656,Journal of Neuroscience Methods,
7,pub.1121577855,2020,1726.0,article,"[[{'first_name': 'Giordano', 'last_name': 'de ...",146502,"Increases in compulsivity, inflammation, and n...",,jour.1117575,Brain Research,
8,pub.1121884391,2020,,chapter,"[[{'first_name': 'Andrew', 'last_name': 'Livin...",1-18,Chapter 1 Challenges and Directions for Green ...,,,,
9,pub.1117495008,2019,9.0,article,"[[{'first_name': 'Octavio E.', 'last_name': 'S...",9220,How typhoons trigger turbidity currents in sub...,1.0,jour.1045337,Scientific Reports,


In [74]:
%dsldf search publications where count(research_orgs) > 1 and research_orgs.types="Company" return research_orgs

Returned Research_orgs: 20
Query is too long or complex. Please see https://docs.dimensions.ai/dsl/faq.html for more information. [code: 4]


Unnamed: 0,id,count,acronym,city_name,state_name,name,longitude,types,country_name,latitude,linkout
0,grid.410513.2,15164,Pfizer,New York,New York,Pfizer (United States),-73.97254,[Company],United States,40.750362,[http://www.pfizer.com/]
1,grid.417993.1,12410,MSD,Kenilworth,New Jersey,MSD (United States),-74.272575,[Company],United States,40.678677,[http://www.merck.com/]
2,grid.419666.a,10993,,Seoul,,Samsung (South Korea),127.0269,[Company],South Korea,37.49661,[http://www.samsung.com/sec/home/]
3,grid.418158.1,10870,Roche,Nutley,New Jersey,Roche (United States),-74.15691,[Company],United States,40.83348,[http://www.roche.com/careers/usa.htm]
4,grid.419481.1,9462,,Basel,,Novartis (Switzerland),7.579728,[Company],Switzerland,47.57432,[https://www.novartis.com/]
5,grid.418236.a,8962,GSK,London,,GlaxoSmithKline (United Kingdom),-0.31669,[Company],United Kingdom,51.4882,[http://uk.gsk.com/]
6,grid.418424.f,8950,NPC,New York,New York,Novartis (United States),-74.50225,[Company],United States,40.844296,[http://www.pharma.us.novartis.com/index.jsp]
7,grid.417540.3,8837,Eli Lilly & Company,Indianapolis,Indiana,Eli Lilly (United States),-86.15307,[Company],United States,39.75627,[http://www.lilly.com/]
8,grid.419815.0,7843,,Redmond,Washington,Microsoft (United States),-122.128334,[Company],United States,47.63972,[https://www.microsoft.com/en-us]
9,grid.5406.7,7321,,Munich,,Siemens (Germany),11.524718,[Company],Germany,48.13432,[http://www.siemens.com/entry/cc/en/]
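
One possible workaround, offered only as an untested sketch: resolve the organizations first, then filter publications by an explicit list of GRID IDs instead of by nested organization fields. `build_pubs_by_org_ids_query` is a hypothetical helper, and the `research_orgs.id in [...]` filter syntax is an assumption that should be confirmed against the 1.21 docs. Long ID lists may also need to be split into smaller chunks.

```python
def build_pubs_by_org_ids_query(org_ids, limit=20):
    # Assumption: publications can be filtered with `research_orgs.id in [...]`
    id_list = ", ".join(f'"{i}"' for i in org_ids)
    return (f"search publications where research_orgs.id in [{id_list}] "
            f"return publications limit {limit}")

# Step 1 (not run here): fetch the IDs with an organizations query, e.g. the
# Italy/Company search shown earlier in this notebook.
# Step 2: feed those IDs into the publications query.
ids = ["grid.12513.34", "grid.14587.3f"]
print(build_pubs_by_org_ids_query(ids))
```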


## DSL-288 GRID API phase 3 all fields 

https://uberresearch.atlassian.net/browse/DSL-288

In [77]:
%dsl search organizations where id="grid.410356.5" return organizations[basics]

Returned Organizations: 1 (total = 1)


<dimcli.Result object #4779448016. Records: 1/1>

#### Getting fields dynamically

In [78]:
res = dsl.query("describe schema")

docs_for = ['organizations']
header = "sources"

# one list per output column; every schema field adds one entry to each list
d = {"sources": [], 'field': [], 'type': [], 'description': [], 'is_filter': [], 'is_entity': [], 'is_facet': []}
for source in docs_for:
    schema_fields = res.json[header][source]['fields']
    for name in sorted(schema_fields):
        info = schema_fields[name]
        d[header].append(source)
        d['field'].append(name)
        d['type'].append(info['type'])
        d['description'].append(info['description'])
        d['is_filter'].append(info['is_filter'])
        d['is_facet'].append(info.get('is_facet', False))    # optional in the schema
        d['is_entity'].append(info.get('is_entity', False))  # optional in the schema

fields = pd.DataFrame.from_dict(d)
fields

Unnamed: 0,sources,field,type,description,is_filter,is_entity,is_facet
0,organizations,acronym,string,"GRID acronym of the organization. E.g., ""UT"" f...",True,False,False
1,organizations,city_name,string,"GRID name of the organization country. E.g., ""...",True,False,True
2,organizations,cnrs_ids,string,CNRS IDs for this organization,True,False,False
3,organizations,country_name,string,"GRID name of the organization country. E.g., ""...",True,False,True
4,organizations,established,integer,Year when the organization was estabilished,True,False,False
5,organizations,external_ids_fundref,string,Fundref IDs for this organization,True,False,False
6,organizations,hesa_ids,string,HESA IDs for this organization,True,False,False
7,organizations,id,string,"GRID ID of the organization. E.g., ""grid.26999...",True,False,False
8,organizations,ificlaims_ids,string,IFI Claims IDs for this organization,True,False,False
9,organizations,isni_ids,string,ISNI IDs for this organization,True,False,False


Documentation seems good.
(Note: the DataFrame needs to be built manually because dimcli doesn't have `organizations` as a valid source yet.)

COMMENTS

* All 'related' IDs could be called `organization_parent_ids`, `organization_child_ids`, `organization_related_ids`.
* We can drop the field `other_organization_ids` for now, as it's confusing.
    * TODO: check whether 'related' is the same as in https://grid.ac/institutes/grid.410356.5
* `external` IDs: the name is a bit ugly and not aligned with the other external IDs we already have (e.g. `pmid`, `pmcid`, `altmetric_id`, `orcid_id`). Is there a chance to rationalize things here?




#### Automated query builder 


In [59]:
%dsl search organizations where wikipedia_url is not empty return organizations

Returned Organizations: 20 (total = 97308)


<dimcli.Result object #4779353552. Records: 20/97308>

In [79]:
# one with `is not empty` for filters or attributes 
q1 = """search organizations where {} is not empty return organizations[basics+{}]"""
# one with facet 
q2 = """search organizations where {} is not empty return {}"""


for index, row in fields.iterrows():
    print("\n===\n", row['field'], "\n===")
    q = q1.format(row['field'], row['field'])
    print(">>> " + q)
    dsl.query(q)
    if row['is_facet']: 
        q = q2.format(row['field'], row['field'])
        print("\n>>> " + q)
        dsl.query(q)


===
 acronym 
===
>>> search organizations where acronym is not empty return organizations[basics+acronym]
Returned Organizations: 20 (total = 33117)

===
 city_name 
===
>>> search organizations where city_name is not empty return organizations[basics+city_name]
Returned Organizations: 20 (total = 97240)

>>> search organizations where city_name is not empty return city_name
Returned City_name: 20

===
 cnrs_ids 
===
>>> search organizations where cnrs_ids is not empty return organizations[basics+cnrs_ids]
Returned Organizations: 20 (total = 937)

===
 country_name 
===
>>> search organizations where country_name is not empty return organizations[basics+country_name]
Returned Organizations: 20 (total = 97306)

>>> search organizations where country_name is not empty return country_name
Returned Country_name: 20

===
 established 
===
>>> search organizations where established is not empty return organizations[basics+established]
Returned Organizations: 20 (total = 97308)

===
 extern

#### Data Checks 

E.g. against the web version at https://grid.ac/institutes/grid.410356.5

In [80]:
%dsl search organizations where id="grid.410356.5" return organizations[all]

Returned Organizations: 1 (total = 1)


<dimcli.Result object #4774276880. Records: 1/1>

Let's have a look at the fields returned

In [81]:
dsl_last_results.organizations[0].keys()

dict_keys(['id', 'external_ids_fundref', 'orgref_ids', 'city_name', 'acronym', 'state_name', 'name', 'ukprn_ids', 'organization_parent_ids', 'hesa_ids', 'types', 'ucas_ids', 'country_name', 'latitude', 'ificlaims_ids', 'organization_related_ids', 'longitude', 'organization_child_ids', 'wikipedia_url', 'cnrs_ids', 'established', 'isni_ids', 'linkout', 'wikidata_ids'])

`weight` should not be there!!! 

In [82]:
dsl_last_results.organizations[0]

{'id': 'grid.410356.5',
 'external_ids_fundref': ['501100003321', '501100003322', '501100006127'],
 'orgref_ids': ['7955551'],
 'city_name': 'Kingston',
 'acronym': None,
 'state_name': 'Ontario',
 'name': "Queen's University",
 'ukprn_ids': None,
 'organization_parent_ids': None,
 'hesa_ids': None,
 'types': ['Education'],
 'ucas_ids': None,
 'country_name': 'Canada',
 'latitude': 44.225502,
 'ificlaims_ids': ['069064'],
 'organization_related_ids': ['grid.413560.5',
  'grid.413632.1',
  'grid.415354.2',
  'grid.449116.f'],
 'longitude': -76.49516,
 'organization_child_ids': None,
 'wikipedia_url': "http://en.wikipedia.org/wiki/Queen's_University",
 'cnrs_ids': None,
 'established': 1841,
 'isni_ids': ['0000 0004 1936 8331'],
 'linkout': ['http://www.queensu.ca/'],
 'wikidata_ids': ['Q1420038']}
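
Checks like the `weight` one are easy to automate. Below is a sketch, assuming a record is a plain dictionary as printed above. `audit_record` is a hypothetical helper, and the sample is a shortened copy of that record with an extra `weight` key added purely for illustration.

```python
def audit_record(record, unexpected=("weight",)):
    # Return (unexpected fields that are present, fields whose value is None)
    present = [f for f in unexpected if f in record]
    empty = sorted(k for k, v in record.items() if v is None)
    return present, empty

sample = {
    "id": "grid.410356.5",
    "name": "Queen's University",
    "acronym": None,
    "ukprn_ids": None,
    "established": 1841,
    "weight": 1.0,  # added for illustration only, not in the output above
}
print(audit_record(sample))
```

In the notebook the same function could be called on `dsl_last_results.organizations[0]`.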

Now let's look at the contents of the data:
* `external_ids_ror` is missing (in the UI it's https://ror.org/02y72wh86), but we'll add it as soon as it becomes available in Solr.
