# APIs and Datasets for Scholarly Publications 

Using the [lens.org](https://www.lens.org/) API to retrieve scholarly data.

## Prerequisites

1. Clone the [GitHub repository](https://github.com/kaust-library/using_lens_org): https://github.com/kaust-library/using_lens_org
1. Create your virtual environment: `python -m venv venv`.
1. Activate your environment: `. .\venv\Scripts\activate` (Windows) or `. venv/bin/activate` (Linux).
1. Install the required packages: `pip install -r requirements.txt`.

## Loading the Packages and Env File

Load the packages:

In [1]:
import dotenv as DE
import os as OS
import requests as RQ
import pprint as PP
import json as JN
import csv as CSV

You may need to create a `.env` file containing your _token_ in the _root_ directory of your project:

```
(venv) PS C:\Users\garcm0b\Work\lens_org> cat .env
MY_TOKEN=(...)
(venv) PS C:\Users\garcm0b\Work\lens_org>
```

Make sure that your `.env` file is listed in the `.gitignore` file, so you do not upload your credentials by accident.

In [2]:
DE.load_dotenv()
api_passwd = OS.environ['MY_TOKEN']
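
If `MY_TOKEN` is not defined, the lookup above fails with a bare `KeyError`. A minimal sketch of a friendlier check follows. The helper name `read_token` is hypothetical (it is not part of the repository), and it accepts the environment as a parameter only so that it can be tried without touching the real one.

```python
import os


def read_token(env=None, name="MY_TOKEN"):
    """Return the token stored under `name`, or raise a readable error.

    `read_token` is a hypothetical helper, not part of the original notebook.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return token


# Usage with an explicit mapping instead of the real environment:
print(read_token({"MY_TOKEN": "example-token"}))
```

In the notebook this would be called as `read_token()` after `DE.load_dotenv()`.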

## Examples

### Simple Search

We use the [`requests`](https://docs.python-requests.org/en/latest/index.html) library to test the Lens.org API. We start with a simple example from the Swagger API test page. The request body can use the following fields:

* [Query](https://docs.api.lens.org/request-scholar.html#terms-query): operates on a single term and searches for the _exact_ term in the given field.
* [Match](https://docs.api.lens.org/request-scholar.html#match-query): the main use case of the match query is full-text search. It matches each word separately.
* [From/Size](https://docs.api.lens.org/request-scholar.html#offsetsize-based-pagination): use the parameter `from` to define the offset and `size` to specify the number of records expected.
* [Include/Exclude](https://docs.api.lens.org/request-scholar.html#projection): only request specific fields from the API endpoint.
* [Sort](https://docs.api.lens.org/request-scholar.html#sorting): results can be retrieved in ascending or descending order.
* [Scroll/Scroll_id](https://docs.api.lens.org/request-scholar.html#cursor-based-pagination): You can specify records per page using `size` (default 20 and max 1000) and context alive time `scroll` (default 1 minute). You will receive a `scroll_id` in response, which should be passed via request body to access next set of results.
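
The fields above can also be assembled as a Python dictionary and serialized with `json.dumps`, which avoids hand-editing a JSON string. This is only a sketch: `build_payload` is a hypothetical helper, and its keys simply mirror the fields listed above.

```python
import json


def build_payload(term, size=5, offset=0, include=None, sort=None):
    """Build a search request body as a JSON string.

    `build_payload` is a hypothetical helper; the keys mirror the fields
    listed above (`query`/`match`, `from`, `size`, `include`, `sort`).
    """
    body = {
        "query": {"match": {"title": term}},
        "size": size,
        "from": offset,
    }
    if include:
        body["include"] = list(include)
    if sort:
        body["sort"] = list(sort)
    return json.dumps(body)


# Second page of five records (offset 5), sorted by year, newest first.
payload = build_payload(
    "Malaria",
    size=5,
    offset=5,
    include=["title", "lens_id"],
    sort=[{"year_published": "desc"}],
)
print(payload)
```

Increasing `offset` by `size` on each request walks through the result set page by page.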

In [3]:
headers = {"Authorization": api_passwd, "Content-Type": "application/json"}

payload = '''
{
  "query": {
    "match": {
      "title": "Malaria"
    }
  },
  "size": 5,
  "from": 0,
  "include": [
    "title",
    "lens_id",
    "patent_citations_count"
  ],
  "sort": [
    {
      "created": "desc"
    },
    {
      "year_published": "asc"
    }
  ],
  "exclude": null,
  "scroll": null,
  "scroll_id": null
}
'''

rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

After the query, we check if our request was successful or not by checking the `status_code`. The value `200` means a valid answer from the server, and [any other value](https://docs.api.lens.org/getting-started.html#http-responses) means an error. Next we print the result of the query:

In [4]:
if rr.status_code == 200:
    print("Your request was successful")
    PP.pprint(rr.text)
else:
    print(f"Something went wrong. The return code was '{rr.status_code}'")

Your request was successful
('{"total":107644,"data":[{"lens_id":"189-780-393-989-766","title":"A world '
 'free of malaria: It is time for Africa to actively champion and take '
 'leadership of elimination and eradication '
 'strategies"},{"lens_id":"085-144-192-014-163","title":"Reflections from the '
 'first South Sudan malaria '
 'conference"},{"lens_id":"091-205-353-493-016","title":"ASYMPTOMATIC MALARIA '
 'INFECTION AND ANAEMIA AMONG SECONDARY SCHOOL CHILDREN IN IPOGUN, ONDO STATE, '
 'NIGERIA"},{"lens_id":"007-476-373-861-646","title":"EFFECT OF MALARIA ON '
 'VISUAL ACUITY (V.A)"},{"lens_id":"027-325-084-244-85X","title":"Peer Review '
 '#2 of Cohesin is involved in transcriptional repression of stage-specific '
 'genes in the human malaria parasite"}],"results":5}')
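
The output above is one long JSON string, which is hard to read. A small sketch of parsing it and printing one record per line follows. The `sample` string is a shortened stand-in with the same shape as the response above (two of the five records, with `results` adjusted to match); with a live response you would parse `rr.text` instead.

```python
import json

# Shortened sample in the same shape as the response printed above.
sample = (
    '{"total":107644,"data":['
    '{"lens_id":"085-144-192-014-163",'
    '"title":"Reflections from the first South Sudan malaria conference"},'
    '{"lens_id":"007-476-373-861-646",'
    '"title":"EFFECT OF MALARIA ON VISUAL ACUITY (V.A)"}],'
    '"results":2}'
)

result = json.loads(sample)
print(f"{result['results']} of {result['total']} records returned")
for record in result["data"]:
    print(f"{record['lens_id']}: {record['title']}")
```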


### Saving to CSV File

In the next example we query for articles that have an abstract and save the output to a CSV file. This can be expanded further by tokenizing the abstracts and using the tokens for machine learning.

In [5]:
payload = '''{
     "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "catalyzed",
                        "fields": [
                            "title",
                            "abstract",
                            "full_text"
                        ],
                        "default_operator": "or"
                    }
                }
            ],
            "filter": [
                {
                    "term": {
                        "has_abstract": true
                    }
                }
            ]
        }
    },
     "size": 10
}
'''

In [6]:
rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

The function below extracts the first and last names of the author(s). In the response, each author is a [structure with several fields](https://docs.api.lens.org/response-scholar.html#author), such as affiliations, ids, and initials. Here we just want the names, and when there is more than one author we separate them with a different character (`;`) so they don't get mixed up with the commas separating the CSV fields.

In [7]:
def get_authors(count: int, aulist: list) -> str:
    """
    Return the authors' first and last names, separated by '; '.
    """
    # `count` is kept so existing calls still work; `join` handles both
    # a single author and several authors without a trailing separator.
    return "; ".join(aa['first_name'] + " " + aa['last_name'] for aa in aulist)
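
The function above assumes that every author has both a `first_name` and a `last_name`. If some records lack one of them (an assumption, not something the output here demonstrates), a more defensive variant could look like the sketch below. `format_authors` is a hypothetical name, and the third author in the sample is made up to show the fallback.

```python
def format_authors(aulist):
    """Join author names with '; ', skipping name parts that are missing.

    Hypothetical variant of `get_authors`; it assumes `first_name` or
    `last_name` may be absent from a record.
    """
    names = []
    for author in aulist:
        parts = [author.get("first_name"), author.get("last_name")]
        full_name = " ".join(part for part in parts if part)
        if full_name:
            names.append(full_name)
    return "; ".join(names)


authors = [
    {"first_name": "Andrei K.", "last_name": "Yudin"},
    {"first_name": "John F.", "last_name": "Hartwig"},
    {"last_name": "Example"},  # made-up record without a first name
]
print(format_authors(authors))
```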

We use the function [`loads`](https://docs.python.org/3/library/json.html#json.loads) to parse the response text into a Python dictionary. Next we save the items as a CSV file, using the [DictWriter](https://docs.python.org/3.10/library/csv.html#csv.DictWriter) class.

In [8]:
text = JN.loads(rr.text)
data = text['data']

fields = ['lens_id', 'title', 'year_published', 'authors', 'abstract']
row_csv = []

for dd in data:
    row = {}
    for ff in fields:
        row.update({ff: dd[ff]})
    # print(dd['author_count'])
    row['authors'] = get_authors(dd['author_count'], dd['authors'])
    row_csv.append(row)

with open('metadata.csv', "w", newline="", encoding='utf-8') as csvfile:
    writer = CSV.DictWriter(csvfile, fieldnames=fields)
    writer.writeheader()
    writer.writerows(row_csv)
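
To check that commas inside a title and the `;` between authors survive the round trip, the rows can be read back with `csv.DictReader`. The sketch below uses an in-memory buffer and made-up rows so it runs without the API; with the real file you would open `metadata.csv` instead.

```python
import csv
import io

fields = ["lens_id", "title", "authors"]
# Made-up rows: the second title contains a comma, its authors a ';'.
rows = [
    {"lens_id": "000-000-000-000-001", "title": "A plain title",
     "authors": "A. Author"},
    {"lens_id": "000-000-000-000-002", "title": "Title, with a comma",
     "authors": "A. Author; B. Author"},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)

# Rewind and read the rows back as dictionaries.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[1]["title"])
print(read_back[1]["authors"])
```

The writer quotes any field that contains a comma, so the title comes back as a single value.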

Let's compare the authors as given by the API with the output of our function, using the last record processed in the loop above:

In [9]:
print("First the output from the API:")
PP.pprint(dd['authors'])
print("\nAfter the function 'get_authors'")
PP.pprint(row['authors'])

First the output from the API:
[{'affiliations': [],
  'first_name': 'Andrei K.',
  'ids': [{'type': 'magid', 'value': '1890584584'}],
  'initials': 'AK',
  'last_name': 'Yudin'},
 {'affiliations': [],
  'first_name': 'John F.',
  'ids': [{'type': 'magid', 'value': '2091457642'}],
  'initials': 'JF',
  'last_name': 'Hartwig'}]

After the function 'get_authors'
'Andrei K. Yudin; John F. Hartwig'


### Building a Query

Next we will build a search step by step. We start by querying for KAUST output, requesting the following fields: _title_, _lens\_id_, and _year of publication_.

In [10]:
payload = '''{
    "query": {
        "match_phrase": {
            "author.affiliation.name": "King Abdullah University of Science and Technology"
        }
    },
    "include": [
        "title",
        "lens_id",
        "year_published"
    ],
    "size": 10
}
'''

In [11]:
rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

In [12]:
if rr.status_code == 200:
    print("Your request was successful")
else:
    print(f"Something went wrong. The return code was '{rr.status_code}'")

Your request was successful


Next we add a range for the year of publication and restrict the results to journal articles.

In [13]:
payload = '''{
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase":{
                        "author.affiliation.name": "King Abdullah University of Science and Technology"
                    }
                }, {
                    "range": {
                        "year_published": {
                            "gte": "2018",
                            "lte": "2020"
                        }
                    }                
                }
            ],
            "filter": [
                {
                    "term": {
                        "publication_type": "journal article"
                    }
                }
            ]
        }
    },
    "include": [
        "lens_id",
        "title",
        "year_published"
    ],
    "size": 10
}
'''

In [14]:
rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

In [15]:
if rr.status_code == 200:
    print("Your request was successful")
else:
    print(f"Something went wrong. The return code was '{rr.status_code}'")

Your request was successful
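
The two payloads above differ only in the clauses inside `query`, so they can be composed from optional parts in Python. This is a sketch under that assumption: `affiliation_query` is a hypothetical helper, and the field names are copied from the payloads above.

```python
import json


def affiliation_query(name, year_from=None, year_to=None, publication_type=None):
    """Compose the `bool` query used above from optional parts.

    Hypothetical helper; field names are copied from the payloads above.
    """
    must = [{"match_phrase": {"author.affiliation.name": name}}]
    if year_from is not None or year_to is not None:
        year_range = {}
        if year_from is not None:
            year_range["gte"] = str(year_from)
        if year_to is not None:
            year_range["lte"] = str(year_to)
        must.append({"range": {"year_published": year_range}})

    query = {"bool": {"must": must}}
    if publication_type:
        query["bool"]["filter"] = [
            {"term": {"publication_type": publication_type}}
        ]
    return query


body = {
    "query": affiliation_query(
        "King Abdullah University of Science and Technology",
        year_from=2018,
        year_to=2020,
        publication_type="journal article",
    ),
    "include": ["lens_id", "title", "year_published"],
    "size": 10,
}
payload = json.dumps(body, indent=4)
print(payload)
```

Calling `affiliation_query` with only the name produces a `bool` query with a single `must` clause.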


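The results of the query above come in no particular order: there are articles from 2020, followed by articles from 2018, and back to 2020. Next we sort the articles by year in descending order. We also keep only open access articles, request the open access colour, and increase `size` to 500.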

In [16]:
payload = '''{
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase":{
                        "author.affiliation.name": "King Abdullah University of Science and Technology"
                    }
                }, {
                    "range": {
                        "year_published": {
                            "gte": "2018",
                            "lte": "2020"
                        }
                    }                
                }
            ],
            "filter": [
                {
                    "term": {
                        "publication_type": "journal article"
                    }
                }, {
                    "term": {
                        "is_open_access": "true"
                    }
                }
            ]
        }
    },
    "include": [
        "lens_id",
        "title",
        "year_published",
        "open_access.colour"
    ],
    "sort": [
        {
            "year_published": "desc"
        }
    ],
    "size": 500
}
'''

In [17]:
rr = RQ.post('https://api.lens.org/scholarly/search', data=payload, headers=headers)

In [18]:
if rr.status_code == 200:
    print("Your request was successful")
else:
    print(f"Something went wrong. The return code was '{rr.status_code}'")

Your request was successful
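
A single request returns at most 1000 records, so larger result sets need the cursor-based pagination described under Simple Search. The loop below is a sketch, not a tested client. `scroll_all` and `fake_post` are hypothetical names, the `"1m"` keep-alive value and the stop conditions are assumptions, and `fake_post` stands in for the API so the example runs offline.

```python
def scroll_all(post, body, keep_alive="1m", max_pages=10):
    """Collect records page by page using cursor-based pagination.

    `post` is a hypothetical callable that sends a request body (a dict)
    and returns the decoded JSON response (a dict).
    """
    records = []
    request = dict(body, scroll=keep_alive)
    for _ in range(max_pages):
        response = post(request)
        page = response.get("data") or []
        if not page:
            break  # assumed stop condition: an empty page
        records.extend(page)
        scroll_id = response.get("scroll_id")
        if not scroll_id:
            break
        # Later requests carry the cursor returned by the previous page.
        request = {"scroll_id": scroll_id, "scroll": keep_alive}
    return records


# Stand-in for the API: two pages of made-up records, then an empty page.
def fake_post(request):
    pages = {
        None: {"data": [{"lens_id": "a"}, {"lens_id": "b"}], "scroll_id": "c1"},
        "c1": {"data": [{"lens_id": "c"}], "scroll_id": "c2"},
        "c2": {"data": [], "scroll_id": "c2"},
    }
    return pages[request.get("scroll_id")]


all_records = scroll_all(fake_post, {"query": {"match": {"title": "Malaria"}}, "size": 2})
print(len(all_records))
```

With `requests`, `post` could be a small wrapper such as `lambda b: RQ.post(url, json=b, headers=headers).json()`, where `url` is the search endpoint used above.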


## Using Pandas

We use a [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) as the container for our data. First we import the library:

In [19]:
import pandas as PD

Next we create a dataframe from the request object:

In [20]:
text = JN.loads(rr.text)
df = PD.DataFrame(text['data'])

Checking if the dataframe is correct:

In [21]:
df.head()

| | lens_id | title | year_published | open_access |
|---|---|---|---|---|
| 0 | 000-883-392-158-888 | Poly(A)-DG: A deep-learning-based domain gener... | 2020 | {'colour': 'gold'} |
| 1 | 002-453-593-170-202 | In Situ Growth of Lithiophilic MOF Layer Enabl... | 2020 | {'colour': 'gold'} |
| 2 | 000-999-964-665-354 | Ultrafast Charge Dynamics in Dilute-Donor vers... | 2020 | {'colour': 'hybrid'} |
| 3 | 007-457-455-708-625 | A framework for experimental scenarios of glob... | 2020 | {'colour': 'gold'} |
| 4 | 005-462-289-428-962 | Classes of Full-Duplex Channels With Capacity ... | 2020 | {'colour': 'green'} |


We can access a column of the dataframe like a dictionary key:

In [22]:
df['lens_id']

0      000-883-392-158-888
1      002-453-593-170-202
2      000-999-964-665-354
3      007-457-455-708-625
4      005-462-289-428-962
              ...         
495    069-140-290-778-572
496    070-878-495-309-549
497    080-586-502-760-943
498    081-347-569-585-738
499    079-059-555-990-777
Name: lens_id, Length: 500, dtype: object

Querying the data types of the dataframe:

In [23]:
df.dtypes

lens_id           object
title             object
year_published     int64
open_access       object
dtype: object

We can query for specific fields of the dataframe. For example, the _title_ and _open access colour_ of the 101st article (the count starts at `0`, so its index is `100`):

In [24]:
print(f"title: {df.iloc[100]['title']}, open access colour: {df.iloc[100]['open_access']['colour']}")

title: Assessing the age- and gender-dependence of the severity and case fatality rates of COVID-19 disease in Spain., open access colour: gold

Next we count how many articles fall under each open access colour:


In [25]:
df['open_access'].value_counts()

{'colour': 'green'}     259
{'colour': 'gold'}      147
{'colour': 'hybrid'}     80
{'colour': 'bronze'}      8
{}                        6
Name: open_access, dtype: int64

We can also filter the rows, for example to list only the _bronze_ articles:

In [26]:
df[df['open_access'] == {'colour': 'bronze'}]

| | lens_id | title | year_published | open_access |
|---|---|---|---|---|
| 57 | 080-587-395-479-957 | Solar Water Splitting: Over 17% Efficiency Sta... | 2020 | {'colour': 'bronze'} |
| 63 | 126-123-125-201-260 | A pseudo-kinetic model to simulate phase chang... | 2020 | {'colour': 'bronze'} |
| 144 | 161-191-262-426-101 | Global adjoint tomography—model GLAD-M25 | 2020 | {'colour': 'bronze'} |
| 164 | 004-044-936-135-704 | Author Correction: Efficient near-infrared lig... | 2020 | {'colour': 'bronze'} |
| 213 | 118-117-899-312-545 | High-Resolution Operational Ocean Forecast and... | 2020 | {'colour': 'bronze'} |
| 255 | 040-314-227-624-903 | Anisotropic Growth of Al-Intercalated Vanadate... | 2020 | {'colour': 'bronze'} |
| 270 | 079-195-414-281-623 | Uncovering Atomic and Nano-scale Deformations ... | 2020 | {'colour': 'bronze'} |
| 275 | 097-933-906-728-238 | A Prolonged High-Salinity Event in the Norther... | 2020 | {'colour': 'bronze'} |
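
The colour counts and the bronze filter can also be computed directly on the raw records, without relying on a dataframe column that holds dictionaries. The sketch below uses made-up records in the same shape as `text['data']`. `colour_of` is a hypothetical helper, and labelling a missing colour as `'unknown'` is an assumption.

```python
from collections import Counter

# Made-up records in the same shape as `text['data']` above.
records = [
    {"lens_id": "r1", "open_access": {"colour": "gold"}},
    {"lens_id": "r2", "open_access": {"colour": "green"}},
    {"lens_id": "r3", "open_access": {"colour": "green"}},
    {"lens_id": "r4", "open_access": {}},  # colour missing
    {"lens_id": "r5", "open_access": {"colour": "bronze"}},
]


def colour_of(record):
    """Return the open access colour of a record, or 'unknown'."""
    return (record.get("open_access") or {}).get("colour", "unknown")


# Count records per colour, then select the ids of the bronze ones.
counts = Counter(colour_of(record) for record in records)
bronze_ids = [r["lens_id"] for r in records if colour_of(r) == "bronze"]

print(counts.most_common())
print(bronze_ids)
```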
