# APIs and Datasets for Scholarly Publications 

Using the [lens.org](https://www.lens.org/) API to retrieve scholarly data.

## Prerequisites

1. Clone the [GitHub repository](https://github.com/kaust-library/using_lens_org): https://github.com/kaust-library/using_lens_org
1. Create your virtual environment: `python -m venv venv`.
1. Activate your environment: `. .\venv\Scripts\activate` (Windows) or `. venv/bin/activate` (Linux).
1. Install the required packages: `pip install -r requirements.txt`.

## Loading the Packages and Env File

Load the packages:

In [160]:
import dotenv as DE
import os as OS
import requests as RQ
import pprint as PP
import json as JN
import csv as CSV
import pandas as PD

You may need to create a `.env` file containing your _token_ in the _root_ directory of your project:

```
(venv) PS C:\Users\garcm0b\Work\lens_org> cat .env
MY_TOKEN=(...)
(venv) PS C:\Users\garcm0b\Work\lens_org>
```

Make sure that your `.env` file is listed in `.gitignore`, so that you do not upload your credentials by accident.

In [161]:
DE.load_dotenv()
api_passwd = OS.environ['MY_TOKEN']
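If `MY_TOKEN` is not defined, `OS.environ['MY_TOKEN']` fails with a bare `KeyError`. A minimal sketch of a friendlier check is shown below; the helper name `read_token` is hypothetical and not part of the repository.

```python
import os


def read_token(name: str = "MY_TOKEN", env=os.environ) -> str:
    """Return the API token stored under `name`, or raise a readable error."""
    # Treat a missing variable and an empty/whitespace-only one the same way.
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your .env file and call load_dotenv() first."
        )
    return token
```

After `DE.load_dotenv()`, calling `read_token()` would return the same value as `OS.environ['MY_TOKEN']` when the token is present.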

## Example

### Simple Search

We use the [`requests`](https://docs.python-requests.org/en/latest/index.html) library to test the Lens.org API, starting with a simple example from the Swagger API test page. The query uses the following request fields:

* [Query](https://docs.api.lens.org/request-scholar.html#terms-query): operates on a single term and searches for the _exact_ term in the given field.
* [Match](https://docs.api.lens.org/request-scholar.html#match-query): the main use case of the match query is full-text search. It matches each word separately.
* [From/Size](https://docs.api.lens.org/request-scholar.html#offsetsize-based-pagination): use the `from` parameter to define the offset and `size` to specify the number of records expected.
* [Include/Exclude](https://docs.api.lens.org/request-scholar.html#projection): only request specific fields from the API endpoint.
* [Sort](https://docs.api.lens.org/request-scholar.html#sorting): results can be retrieved in ascending or descending order.
* [Scroll/Scroll_id](https://docs.api.lens.org/request-scholar.html#cursor-based-pagination): you can specify the number of records per page using `size` (default 20, maximum 1000) and the context alive time using `scroll` (default 1 minute). The response contains a `scroll_id`, which should be passed in the request body to access the next set of results.
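The fields listed above can also be assembled as a Python dictionary and serialized with `json.dumps`, which avoids quoting mistakes in hand-written JSON strings. This is only a sketch: `build_search_payload` and its parameters are hypothetical names introduced here for illustration.

```python
import json


def build_search_payload(term, field="title", size=5, offset=0, include=None, sort=None):
    """Assemble a search request body as a dict (hypothetical helper)."""
    payload = {
        "query": {"match": {field: term}},
        "size": size,
        "from": offset,
    }
    if include:
        payload["include"] = list(include)
    if sort:
        # Each sort entry maps one field name to "asc" or "desc".
        payload["sort"] = [{name: order} for name, order in sort]
    return payload


payload = build_search_payload(
    "Malaria",
    include=["title", "lens_id", "patent_citations_count"],
    sort=[("created", "desc"), ("year_published", "asc")],
)
# JSON string that can be passed as the `data=` argument of requests.post.
body = json.dumps(payload)
```

Optional fields that are not needed are simply left out instead of being sent as `null`.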

In [162]:
headers= {"Authorization": api_passwd, "Content-Type": "application/json"}

In [163]:
payload = '''
{
  "query": {
    "match": {
      "title": "Malaria"
    }
  },
  "size": 5,
  "from": 0,
  "include": [
    "title",
    "lens_id",
    "patent_citations_count"
  ],
  "sort": [
    {
      "created": "desc"
    },
    {
      "year_published": "asc"
    }
  ],
  "exclude": null,
  "scroll": null,
  "scroll_id": null
}
'''

rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

After the query, we check if our request was successful or not by checking the `status_code`. The value `200` means a valid answer from the server, and [any other value](https://docs.api.lens.org/getting-started.html#http-responses) means an error. Next we print the result of the query:

In [164]:
if rr.status_code == 200:
    print("Your request was successful")
    PP.pprint(rr.text)
else:
    print(f"Something went wrong. The return code was '{rr.status_code}'")

Your request was successful
('{"total":107644,"data":[{"lens_id":"189-780-393-989-766","title":"A world '
 'free of malaria: It is time for Africa to actively champion and take '
 'leadership of elimination and eradication '
 'strategies"},{"lens_id":"085-144-192-014-163","title":"Reflections from the '
 'first South Sudan malaria '
 'conference"},{"lens_id":"091-205-353-493-016","title":"ASYMPTOMATIC MALARIA '
 'INFECTION AND ANAEMIA AMONG SECONDARY SCHOOL CHILDREN IN IPOGUN, ONDO STATE, '
 'NIGERIA"},{"lens_id":"007-476-373-861-646","title":"EFFECT OF MALARIA ON '
 'VISUAL ACUITY (V.A)"},{"lens_id":"027-325-084-244-85X","title":"Peer Review '
 '#2 of Cohesin is involved in transcriptional repression of stage-specific '
 'genes in the human malaria parasite"}],"results":5}')
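The response above reports 107644 matches but returns only 5 records. The cursor-based pagination described in the field list could be used to walk through larger result sets, roughly as sketched below. The function `iter_scroll` is hypothetical, the `"1m"` value for `scroll` and the shape of the follow-up request body are assumptions based on the description above, and the sketch stops on an empty page because the exact end-of-results behaviour was not verified.

```python
def iter_scroll(post, query, size=100, scroll="1m", max_pages=10):
    """Yield records page by page, following `scroll_id` (sketch).

    `post` is any callable that takes a request body (dict) and returns the
    decoded JSON response (dict), so the loop can be tried without network access.
    """
    body = {"query": query, "size": size, "scroll": scroll}
    for _ in range(max_pages):
        page = post(body)
        records = page.get("data") or []
        if not records:
            break  # assumption: an empty page means the cursor is exhausted
        yield from records
        scroll_id = page.get("scroll_id")
        if not scroll_id:
            break
        # Assumption: follow-up requests carry the cursor instead of the query.
        body = {"scroll_id": scroll_id, "scroll": scroll}
```

With `requests`, `post` could be something like `lambda body: RQ.post('https://api.lens.org/scholarly/search', json=body, headers=headers).json()`. The `max_pages` guard keeps an unexpected response from causing an endless loop.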


## Creating a Dataframe

Using [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) as container for our data:

In [165]:
text = JN.loads(rr.text)
df = PD.DataFrame(text['data'])

## Saving to CSV File

In the next example we query for articles that have an abstract and save the output to a CSV file. This can be extended by tokenizing the abstracts and using the tokens for machine learning.

In [166]:
payload = '''{
     "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "catalyzed",
                        "fields": [
                            "title",
                            "abstract",
                            "full_text"
                        ],
                        "default_operator": "or"
                    }
                }
            ],
            "filter": [
                {
                    "term": {
                        "has_abstract": true
                    }
                }
            ]
        }
    },
     "size": 10
}
'''

In [167]:
rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

### Pure Python

Using pure Python to save to a CSV file:

The function below extracts the first and last name of each author. In the response, `authors` is a [structure with several fields](https://docs.api.lens.org/response-scholar.html#author), such as affiliations, ids, and initials. Here we only want the names, and when there is more than one author we separate them with a different character (`;`) so that they are not confused with the commas separating the CSV fields.

In [168]:
def get_authors(aulist: list) -> str:
    """
    Return the first and last name of each author, separated by '; '.
    """

    # str.join() puts the separator only between names, so there is no trailing '; ' to strip.
    return "; ".join(aa['first_name'] + " " + aa['last_name'] for aa in aulist)

We use the [`loads`](https://docs.python.org/3/library/json.html#json.loads) function to parse the response text into Python objects. Next we save the items as a CSV file, using the [DictWriter](https://docs.python.org/3.10/library/csv.html#csv.DictWriter) class to write it.

In [169]:
text = JN.loads(rr.text)
data = text['data']

fields = ['lens_id', 'title', 'year_published', 'authors', 'abstract']
row_csv = []

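for dd in data:
    # Copy only the selected fields; .get() yields None for fields missing from a record.
    row = {ff: dd.get(ff) for ff in fields}
    # Flatten the list of author structures into a single 'First Last; ...' string.
    row['authors'] = get_authors(dd.get('authors', []))
    row_csv.append(row)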

with open('metadata.csv', "w", newline="", encoding='utf-8') as csvfile:
    writer = CSV.DictWriter(csvfile, fieldnames=fields)
    writer.writeheader()
    writer.writerows(row_csv)

Let's give an example of authors as given by the API and after our function:

In [170]:
print("First the output from the API:")
PP.pprint(dd['authors'])
print("\nAfter the function 'get_authors'")
PP.pprint(row['authors'])

First the output from the API:
[{'affiliations': [],
  'first_name': 'Andrei K.',
  'ids': [{'type': 'magid', 'value': '1890584584'}],
  'initials': 'AK',
  'last_name': 'Yudin'},
 {'affiliations': [],
  'first_name': 'John F.',
  'ids': [{'type': 'magid', 'value': '2091457642'}],
  'initials': 'JF',
  'last_name': 'Hartwig'}]

After the function 'get_authors'
'Andrei K. Yudin; John F. Hartwig'
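`get_authors` assumes that every author has both `first_name` and `last_name`. A more tolerant variant is sketched below, under the unverified assumption that some records may omit one of the name parts or the `authors` list itself; the name `format_authors` is hypothetical.

```python
def format_authors(authors, sep="; "):
    """Join author names, tolerating missing name parts (hypothetical helper)."""
    names = []
    # `authors or []` also covers records where the list is None.
    for author in authors or []:
        parts = [author.get("first_name"), author.get("last_name")]
        full = " ".join(p for p in parts if p)
        if full:
            names.append(full)
    return sep.join(names)
```

For the two authors printed above it returns the same string as `get_authors`, and it returns an empty string when there are no authors.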


### Pandas

This is an interesting example of using the `apply` method to apply a function to a column of the dataframe. We then save only selected fields to the CSV file.

In [171]:
text = JN.loads(rr.text)
df = PD.DataFrame(text['data'])
df['authors'] = df['authors'].apply(get_authors)
df.to_csv('metadata_pd.csv', index=False, columns=['lens_id', 'title', 'year_published', 'authors', 'abstract'])
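The tokenization step mentioned at the start of this section could begin with something as simple as the sketch below. It uses a made-up sentence in place of a real abstract, and the `tokenize` helper is a hypothetical starting point, not a full text-processing pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase `text` and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Made-up sentence standing in for the `abstract` field of one record.
abstract = "Palladium-catalyzed amination: a catalyzed route to aryl amines."
counts = Counter(tokenize(abstract))
print(counts.most_common(1))  # [('catalyzed', 2)]
```

The same function could be applied to the `abstract` column with `df['abstract'].apply(tokenize)`.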

## Working with Fields

In [172]:
payload = '''{
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase":{
                        "author.affiliation.name": "King Abdullah University of Science and Technology"
                    }
                }, {
                    "range": {
                        "year_published": {
                            "gte": "2018",
                            "lte": "2020"
                        }
                    }                
                }
            ],
            "filter": [
                {
                    "term": {
                        "publication_type": "journal article"
                    }
                }, {
                    "term": {
                        "is_open_access": "true"
                    }
                }
            ]
        }
    },
    "include": [
        "lens_id",
        "title",
        "year_published",
        "open_access.colour"
    ],
    "sort": [
        {
            "year_published": "desc"
        }
    ],
    "size": 500
}
'''

In [173]:
rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

In [174]:
if rr.status_code == 200:
    print("Your request was successful")
else:
    print(f"Something went wrong. The return code was '{rr.status_code}'")

Your request was successful


In [175]:
text = JN.loads(rr.text)
df = PD.DataFrame(text['data'])

Checking if the dataframe is correct:

In [176]:
df.head()

| | lens_id | title | year_published | open_access |
|---|---|---|---|---|
| 0 | 000-883-392-158-888 | Poly(A)-DG: A deep-learning-based domain gener... | 2020 | {'colour': 'gold'} |
| 1 | 002-453-593-170-202 | In Situ Growth of Lithiophilic MOF Layer Enabl... | 2020 | {'colour': 'gold'} |
| 2 | 000-999-964-665-354 | Ultrafast Charge Dynamics in Dilute-Donor vers... | 2020 | {'colour': 'hybrid'} |
| 3 | 007-457-455-708-625 | A framework for experimental scenarios of glob... | 2020 | {'colour': 'gold'} |
| 4 | 005-462-289-428-962 | Classes of Full-Duplex Channels With Capacity ... | 2020 | {'colour': 'green'} |


We can query specific fields of the dataframe. For example, the _title_ and _open access colour_ of the 101st article (the count starts at `0`).

In [177]:
print(f"title: {df.iloc[100]['title']}, open access colour: {df.iloc[100]['open_access']['colour']}")

title: Assessing the age- and gender-dependence of the severity and case fatality rates of COVID-19 disease in Spain., open access colour: gold


In [178]:
df['open_access'].value_counts()

{'colour': 'green'}     259
{'colour': 'gold'}      147
{'colour': 'hybrid'}     80
{'colour': 'bronze'}      8
{}                        6
Name: open_access, dtype: int64

In [179]:
df[df['open_access'] == {'colour': 'bronze'}]

| | lens_id | title | year_published | open_access |
|---|---|---|---|---|
| 57 | 080-587-395-479-957 | Solar Water Splitting: Over 17% Efficiency Sta... | 2020 | {'colour': 'bronze'} |
| 63 | 126-123-125-201-260 | A pseudo-kinetic model to simulate phase chang... | 2020 | {'colour': 'bronze'} |
| 142 | 161-191-262-426-101 | Global adjoint tomography—model GLAD-M25 | 2020 | {'colour': 'bronze'} |
| 164 | 004-044-936-135-704 | Author Correction: Efficient near-infrared lig... | 2020 | {'colour': 'bronze'} |
| 213 | 118-117-899-312-545 | High-Resolution Operational Ocean Forecast and... | 2020 | {'colour': 'bronze'} |
| 255 | 040-314-227-624-903 | Anisotropic Growth of Al-Intercalated Vanadate... | 2020 | {'colour': 'bronze'} |
| 270 | 079-195-414-281-623 | Uncovering Atomic and Nano-scale Deformations ... | 2020 | {'colour': 'bronze'} |
| 275 | 097-933-906-728-238 | A Prolonged High-Salinity Event in the Norther... | 2020 | {'colour': 'bronze'} |
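Counting and comparing whole dictionaries, as in the two cells above, can be brittle because dictionaries are not hashable and the behaviour may depend on the pandas version. A sketch of an alternative is to extract the colour into a plain string column first. The toy dataframe and the column name `oa_colour` below are illustrative assumptions, not data from the API.

```python
import pandas as pd

# Toy rows shaped like the `open_access` column shown above (made-up IDs).
df_demo = pd.DataFrame(
    {
        "lens_id": ["id-1", "id-2", "id-3", "id-4"],
        "open_access": [{"colour": "gold"}, {"colour": "green"}, {}, {"colour": "gold"}],
    }
)

# Pull the colour out of each dict; rows without one are labelled "unknown".
df_demo["oa_colour"] = df_demo["open_access"].apply(
    lambda oa: oa.get("colour", "unknown") if isinstance(oa, dict) else "unknown"
)

colour_counts = df_demo["oa_colour"].value_counts()   # counts per colour string
gold_rows = df_demo[df_demo["oa_colour"] == "gold"]   # plain string comparison
```

With a string column, filtering for a colour such as bronze becomes an ordinary comparison, `df[df['oa_colour'] == 'bronze']`.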


In [193]:
payload = '''{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "catalyzed",
                        "fields": [
                            "title",
                            "abstract",
                            "full_text"
                        ],
                        "default_operator": "or"
                    }
                }
            ],
            "filter": [
                {
                    "term": {
                        "has_abstract": true
                    }
                }, {
                    "term": {
                        "is_open_access": "true"
                    }
                }
            ]
        }
    },
     "size": 100
}
'''

rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

if rr.status_code == 200:
    print("Your request was successful")
else:
    print(f"Something went wrong. The return code was '{rr.status_code}'")

Your request was successful


In [194]:
text = JN.loads(rr.text)
df = PD.DataFrame(text['data'])

In [195]:
df.head()

Unnamed: 0,lens_id,title,publication_type,year_published,date_published,date_published_parts,created,external_ids,open_access,authors,...,scholarly_citations,author_count,is_open_access,patent_citations,patent_citations_count,publication_supplementary_type,mesh_terms,chemicals,funding,keywords
0,001-085-590-248-665,Theoretical Studies on the Addition Reactions ...,journal article,2008.0,2008-07-20T00:00:00.000000+00:00,"[2008, 7, 20]",2018-05-11T20:33:24.230000+00:00,"[{'type': 'magid', 'value': '2100529744'}, {'t...",{'colour': 'bronze'},"[{'first_name': 'Chang Kon', 'last_name': 'Kim...",...,"[032-363-097-009-822, 033-768-753-171-206, 073...",6.0,True,,,,,,,
1,001-337-217-179-970,Adenylate Kinase-catalyzed Phosphoryl Transfer...,journal article,1995.0,1995-03-31T00:00:00.000000+00:00,"[1995, 3, 31]",2018-05-08T19:55:24.404000+00:00,"[{'type': 'magid', 'value': '2069998681'}, {'t...","{'license': 'CC BY, CC BY-NC-ND', 'colour': 'g...","[{'first_name': 'Robert J.', 'last_name': 'Zel...",...,"[000-020-372-597-177, 000-831-481-070-875, 002...",3.0,True,"[{'lens_id': '109-733-766-615-414'}, {'lens_id...",2.0,"[comparative study, research support, non-u.s....","[{'mesh_heading': 'Adenosine Diphosphate', 'qu...","[{'substance_name': 'Lactates', 'registry_numb...","[{'org': 'NIGMS NIH HHS', 'funding_id': 'GM288...",
2,001-410-588-597-790,Role of tunable acid catalysis in decompositio...,journal article,2014.0,2014-10-02T00:00:00.000000+00:00,"[2014, 10, 2]",2018-05-12T02:36:30.230000+00:00,"[{'type': 'pmid', 'value': '25234427'}, {'type...",{'colour': 'green'},"[{'first_name': 'Manoj', 'last_name': 'Kumar',...",...,"[002-141-861-880-954, 002-637-810-201-332, 005...",4.0,True,,,"[research support, u.s. gov't, non-p.h.s.]","[{'mesh_heading': 'Aldehydes', 'qualifier_name...","[{'substance_name': 'Aldehydes', 'registry_num...",[{'org': 'National Institute of Food and Agric...,
3,002-641-540-376-854,Recent advances in the Pd-catalyzed carboxylat...,journal article,2020.0,2020-07-13T00:00:00.000000+00:00,"[2020, 7, 13]",2021-03-19T16:46:44.345000+00:00,"[{'type': 'magid', 'value': '3138078936'}, {'t...",{'colour': 'bronze'},"[{'first_name': 'Wenfang', 'last_name': 'Xiong...",...,[167-147-285-931-792],3.0,True,,,,,,,
4,003-065-746-965-463,Iterative catalyst controlled diastereodiverge...,journal article,2016.0,,[2016],2018-05-12T06:02:51.938000+00:00,"[{'type': 'magid', 'value': '2463085273'}, {'t...",{'colour': 'green'},"[{'first_name': 'Diederik', 'last_name': 'Roke...",...,"[012-548-425-699-03X, 064-473-844-677-559, 114...",3.0,True,,,,,,,


In [205]:
df_author_3 = df[(df['publication_type'] == "journal article") & (df['author_count'] > 3.0)]
df_author_3.head()

Unnamed: 0,lens_id,title,publication_type,year_published,date_published,date_published_parts,created,external_ids,open_access,authors,...,scholarly_citations,author_count,is_open_access,patent_citations,patent_citations_count,publication_supplementary_type,mesh_terms,chemicals,funding,keywords
0,001-085-590-248-665,Theoretical Studies on the Addition Reactions ...,journal article,2008.0,2008-07-20T00:00:00.000000+00:00,"[2008, 7, 20]",2018-05-11T20:33:24.230000+00:00,"[{'type': 'magid', 'value': '2100529744'}, {'t...",{'colour': 'bronze'},"[{'first_name': 'Chang Kon', 'last_name': 'Kim...",...,"[032-363-097-009-822, 033-768-753-171-206, 073...",6.0,True,,,,,,,
2,001-410-588-597-790,Role of tunable acid catalysis in decompositio...,journal article,2014.0,2014-10-02T00:00:00.000000+00:00,"[2014, 10, 2]",2018-05-12T02:36:30.230000+00:00,"[{'type': 'pmid', 'value': '25234427'}, {'type...",{'colour': 'green'},"[{'first_name': 'Manoj', 'last_name': 'Kumar',...",...,"[002-141-861-880-954, 002-637-810-201-332, 005...",4.0,True,,,"[research support, u.s. gov't, non-p.h.s.]","[{'mesh_heading': 'Aldehydes', 'qualifier_name...","[{'substance_name': 'Aldehydes', 'registry_num...",[{'org': 'National Institute of Food and Agric...,
6,003-676-868-590-527,Mechanistic and selectivity investigations int...,journal article,2022.0,,"[2022, 1]",2021-12-08T22:35:57.664000+00:00,"[{'type': 'openalex', 'value': 'W4200296038'},...",{'colour': 'bronze'},"[{'first_name': 'Juping', 'last_name': 'Wang',...",...,"[004-123-103-774-523, 175-393-110-319-140]",5.0,True,,,,,,[{'org': 'Natural Science Foundation of Guangd...,
7,003-797-831-865-647,High efficiency of a sequential recombinase-me...,journal article,2010.0,2010-10-06T00:00:00.000000+00:00,"[2010, 10, 6]",2018-05-12T00:29:35.066000+00:00,"[{'type': 'pmcid', 'value': 'pmc2980817'}, {'t...",{'colour': 'green'},"[{'first_name': 'Natalia', 'last_name': 'Malch...",...,"[052-299-755-827-619, 085-028-856-572-662, 090...",7.0,True,,,"[research support, n.i.h., extramural, researc...",[{'mesh_heading': 'DNA Nucleotidyltransferases...,"[{'substance_name': 'Recombinases', 'registry_...","[{'org': 'NIGMS NIH HHS', 'funding_id': 'R01 G...",
10,004-351-907-257-809,Sulfenylphosphinoferrocenes: Novel Planar Chir...,journal article,2006.0,2006-07-25T00:00:00.000000+00:00,"[2006, 7, 25]",2018-05-12T15:24:07.332000+00:00,"[{'type': 'magid', 'value': '2949060142'}, {'t...",{'colour': 'green'},"[{'first_name': 'Silvia', 'last_name': 'Cabrer...",...,,6.0,True,,,[repository],,,,
