# Getting started: handsearching with paperfetcher

*Written on Aug 2, 2021 by Akash Pallath.* 

*Last updated on May 22, 2022 by Akash Pallath.*

---

To get started, let's import paperfetcher's handsearch module.

In [1]:
from paperfetcher import handsearch

Let's start with a simple task: finding all journal articles in the *Journal of Physical Chemistry B* (JPCB) published between January 1, 2021 and June 1, 2021.

A quick Google search reveals that the ISSN for the web edition of JPCB is 1520-5207.

Now let's use this information to create a search object:

In [2]:
# Create a search object
search = handsearch.CrossrefSearch(ISSN="1520-5207",
                                   from_date="2021-01-01",
                                   until_date="2021-06-01")

Let's run the search!

(Ignore the warning for now. We'll get to it in a bit!)

In [3]:
search()

2022-05-22 19:55:10.894 INFO    paperfetcher.handsearch: Fetching 568 works.
Fetching 29 batches of 20 articles: 100%|████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:30<00:00,  1.06s/it]
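
The log output above reports 29 batches of 20 articles for 568 works. That number is just a ceiling division, which the short sketch below reproduces. The helper name `batch_offsets` is hypothetical, and whether paperfetcher pages through results with offsets (as assumed here) or some other mechanism is not something this tutorial states.

```python
import math


def batch_offsets(total_works, batch_size=20):
    """Return the starting offset of every batch needed to cover total_works.

    Hypothetical helper for illustration only; not part of paperfetcher.
    """
    n_batches = math.ceil(total_works / batch_size)
    return [i * batch_size for i in range(n_batches)]


offsets = batch_offsets(568)
print(len(offsets))  # 29 batches, matching the progress bar
print(offsets[-1])   # 560, so the last batch holds the remaining 8 works
```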


How many works did our search return?

In [4]:
len(search)

568

This was rather slow... Can we speed this up?

*Yes, we can!*

**Why was the search so slow?**

Paperfetcher retrieved all the metadata available on Crossref for each paper. Each paper can have a lot of metadata (abstract, citations, keywords, funding information, etc.) deposited on Crossref, and retrieving all this information can take a lot of time (and memory!).

**How do we make it faster?**

By retrieving only the metadata we need!

For example, if all we need is article DOIs and abstracts, we can do the following:

In [5]:
search = handsearch.CrossrefSearch(ISSN="1520-5207",
                                   from_date="2021-01-01",
                                   until_date="2021-06-01")

search(select=True, select_fields=['DOI', 'abstract'])

2022-05-22 19:55:42.434 INFO    paperfetcher.handsearch: Fetching 568 works.
Fetching 29 batches of 20 articles: 100%|████████████████████████████████████████████████████████████████████████████████████| 29/29 [00:25<00:00,  1.13it/s]


In [6]:
len(search)

568
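
To get a feel for why selecting fields helps, here is a rough sketch of the kind of request the Crossref REST API accepts. The `filter`, `select`, and `rows` query parameters and the `/journals/{issn}/works` route are part of Crossref's public API, but the helper `build_crossref_url` is hypothetical, and the assumption that paperfetcher builds its requests in exactly this way is ours. The sketch only constructs the URL; it does not contact Crossref.

```python
from urllib.parse import urlencode


def build_crossref_url(issn, from_date, until_date, select_fields=None, rows=20):
    """Build a Crossref works URL for one journal and one date range.

    Hypothetical helper for illustration only; not part of paperfetcher.
    """
    params = {
        "filter": f"from-pub-date:{from_date},until-pub-date:{until_date}",
        "rows": rows,
    }
    if select_fields:
        # Ask Crossref to return only these fields for each work
        params["select"] = ",".join(select_fields)
    # Keep ':' and ',' readable instead of percent-encoding them
    return f"https://api.crossref.org/journals/{issn}/works?" + urlencode(params, safe=":,")


url = build_crossref_url("1520-5207", "2021-01-01", "2021-06-01",
                         select_fields=["DOI", "abstract"])
print(url)
```

With `select` in the query, each record in the response carries only the requested fields, so there is less data to transfer and hold in memory.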

We can refine our search with keywords.

For example, let's search for all articles in the journal *Proceedings of the National Academy of Sciences* (online ISSN: 1091-6490) published between January 1, 2020 and January 1, 2022 containing the keyword 'hydrophobic'.

As before, we create a search object, but this time, pass the keyword to the search using the `keyword_list` argument. We'll also fetch more metadata!

In [7]:
# Create a search object
search = handsearch.CrossrefSearch(ISSN="1091-6490", 
                                   keyword_list=["hydrophobic"], 
                                   from_date="2020-01-01",
                                   until_date="2022-01-01")

search(select=True, select_fields=['DOI', 'URL', 'title', 'author', 'issued', 'abstract'])

2022-05-22 19:56:08.522 INFO    paperfetcher.handsearch: Fetching 7 works.
Fetching 1 batches of 20 articles: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.12it/s]


How many works did our search return?

In [8]:
len(search)

7

## Extracting data from the search results

Paperfetcher provides several ways to access search results through special data structures called Datasets.

For example, we can make a Dataset of DOIs from the search results:

In [9]:
doi_ds = search.get_DOIDataset()

We can display this as a DataFrame:

In [10]:
doi_ds.to_df()

|   | DOI |
|---|-----|
| 0 | 10.1073/pnas.2023867118 |
| 1 | 10.1073/pnas.2018234118 |
| 2 | 10.1073/pnas.2020205118 |
| 3 | 10.1073/pnas.2008122117 |
| 4 | 10.1073/pnas.2009310117 |
| 5 | 10.1073/pnas.2008209117 |
| 6 | 10.1073/pnas.1918981117 |


Or save it to a text file:

In [11]:
doi_ds.save_txt("out/handsearching_DOIs.txt")

**What if we want more information?**

We can extract information from any of the fields that Crossref stores and collect it in a `CitationsDataset`. The way in which Crossref stores some of these fields can be pretty complex, so paperfetcher provides 'parsers' to convert these fields into human-readable strings.

Let's create a dataset containing the DOI, URL, article title, author list, and publication date. As per the Crossref API, these fields are:
`DOI`, `URL`, `title`, `author`, and `issued`.

`title`, `author`, and `issued` require special parsers. The rest don't.

In order to extract these fields, however, the metadata for these fields needs to be available. That's why we selected all these fields when running the search! If we didn't, we wouldn't be able to perform the next few steps.
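
To see why parsers are needed, it helps to look at how Crossref structures these fields: `title` is a list of strings, `author` is a list of dictionaries (with keys such as `given` and `family`), and `issued` wraps the date in a nested `date-parts` list. The sketch below uses hypothetical stand-in functions and a made-up placeholder record to turn those structures into strings. It is not paperfetcher's implementation; the parsers in `paperfetcher.parsers` may behave differently.

```python
def title_to_str(title_field):
    """Crossref stores 'title' as a list of strings; join them into one."""
    return " ".join(title_field)


def authors_to_str(author_field):
    """Join author family names, falling back to 'name' for group authors."""
    return ", ".join(a.get("family") or a.get("name", "") for a in author_field)


def date_to_str(issued_field):
    """Join the available date parts (year, then optional month and day)."""
    parts = issued_field["date-parts"][0]
    return "-".join(str(p) for p in parts)


# Made-up placeholder record shaped like a Crossref work (not real data)
record = {
    "title": ["A placeholder title"],
    "author": [{"given": "Jane", "family": "Doe"},
               {"given": "Richard", "family": "Roe"}],
    "issued": {"date-parts": [[2021, 2]]},
}

print(title_to_str(record["title"]))     # A placeholder title
print(authors_to_str(record["author"]))  # Doe, Roe
print(date_to_str(record["issued"]))     # 2021-2
```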

In [12]:
# Import the parsers module
from paperfetcher import parsers

In [13]:
ds = search.get_CitationsDataset(field_list=['DOI', 'URL', 'title', 'author', 'issued'],
                                 field_parsers_list=[None, None, parsers.crossref_title_parser,
                                                     parsers.crossref_authors_parser, 
                                                     parsers.crossref_date_parser])

We can view the data as a pandas DataFrame:

In [14]:
ds.to_df()

|   | DOI | URL | title | author | issued |
|---|-----|-----|-------|--------|--------|
| 0 | 10.1073/pnas.2023867118 | http://dx.doi.org/10.1073/pnas.2023867118 | Size dependence of hydrophobic hydration at el... | Serva, Salanne, Havenith, Pezzotti | 2021-4-5 |
| 1 | 10.1073/pnas.2018234118 | http://dx.doi.org/10.1073/pnas.2018234118 | Identifying hydrophobic protein patches to inf... | Rego, Xi, Patel | 2021-2 |
| 2 | 10.1073/pnas.2020205118 | http://dx.doi.org/10.1073/pnas.2020205118 | Affinity of small-molecule solutes to hydropho... | Monroe, Jiao, Davis, Robinson Brown, Katz, Shell | 2020-12-28 |
| 3 | 10.1073/pnas.2008122117 | http://dx.doi.org/10.1073/pnas.2008122117 | Comparative roles of charge, <i>π</i> , and hy... | Das, Lin, Vernon, Forman-Kay, Chan | 2020-11-2 |
| 4 | 10.1073/pnas.2009310117 | http://dx.doi.org/10.1073/pnas.2009310117 | Spontaneous outflow efficiency of confined liq... | Gao, Li, Zhang, Lu, Xu | 2020-9-28 |
| 5 | 10.1073/pnas.2008209117 | http://dx.doi.org/10.1073/pnas.2008209117 | Enhanced receptor binding of SARS-CoV-2 throug... | Wang, Liu, Gao | 2020-6-5 |
| 6 | 10.1073/pnas.1918981117 | http://dx.doi.org/10.1073/pnas.1918981117 | Short solvent model for ion correlations and h... | Gao, Remsing, Weeks | 2020-1-7 |


We can also save this to a text file using the `save_txt` method:

In [15]:
ds.save_txt("out/handsearching_citations.txt")

Or save it as a CSV file using the `save_csv` method:

In [16]:
ds.save_csv("out/handsearching_citations.csv")

Or save it as an Excel file using the `save_excel` method:

In [17]:
ds.save_excel("out/handsearching_citations.xlsx")

## Exporting data to RIS format

Citation data stored in the RIS (Research Information Systems) file format can easily be imported into systematic review screening tools (such as Covidence) and citation management software (such as Zotero). Paperfetcher can export search results to RIS files. Let's take a look:

**Exporting to RIS format without abstracts**

Paperfetcher uses [Crossref's content negotiation service](https://www.crossref.org/documentation/retrieve-metadata/content-negotiation/) to get RIS data for each DOI. Unfortunately, this does not contain abstracts. However, there is a workaround, which we'll get to in a bit.
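
For the curious, content negotiation means asking the DOI resolver for a specific representation of a work through the HTTP `Accept` header. The sketch below builds such a request for RIS using only the standard library. The MIME type `application/x-research-info-systems` is the one Crossref documents for RIS; the helper `build_ris_request` is hypothetical, and how paperfetcher issues its requests internally is an assumption we don't rely on here. The request is only constructed, not sent.

```python
from urllib.request import Request


def build_ris_request(doi):
    """Build (but do not send) a content-negotiation request for RIS data.

    Hypothetical helper for illustration only; not part of paperfetcher.
    """
    return Request(f"https://doi.org/{doi}",
                   headers={"Accept": "application/x-research-info-systems"})


req = build_ris_request("10.1073/pnas.2023867118")
print(req.full_url)              # https://doi.org/10.1073/pnas.2023867118
print(req.get_header("Accept"))  # application/x-research-info-systems

# To actually fetch the RIS text (needs network access):
# from urllib.request import urlopen
# ris_text = urlopen(req).read().decode("utf-8")
```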

First, let's see how to export data to RIS format without abstracts:

In [18]:
ds = search.get_RISDataset()

Converting results to RIS format.: 100%|███████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  3.96it/s]


In [19]:
ds.save_ris("out/handsearching.ris")

**Exporting to RIS format with abstracts**

Recall that we have already retrieved abstracts during our search. We can insert these abstracts as an extra field into the RIS dataset. Here's how:

In [20]:
ds = search.get_RISDataset(extra_field_list=["abstract"],
                           extra_field_parser_list=[None],
                           extra_field_rispy_tags=["notes_abstract"])

Converting results to RIS format.: 100%|███████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.13it/s]


In [21]:
ds.save_ris("out/handsearching_abstracts.ris")
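
To make the idea of an "extra field" concrete: an RIS record is a sequence of lines of the form `TAG  - value`, closed by an `ER  -` line, so adding an abstract amounts to adding one more tagged line before the end of the record. The sketch below does this on a plain string. It is an illustration under assumptions, not paperfetcher's implementation: the helper `add_ris_field` is hypothetical, the record and abstract are placeholders, and the choice of the `N2` tag reflects our understanding that rispy's default mapping associates `notes_abstract` with `N2`.

```python
def add_ris_field(ris_record, tag, value):
    """Insert a 'TAG  - value' line just before the ER (end of record) line.

    Hypothetical helper for illustration only; not part of paperfetcher.
    """
    out = []
    inserted = False
    for line in ris_record.strip().splitlines():
        if line.startswith("ER  -") and not inserted:
            out.append(f"{tag}  - {value}")
            inserted = True
        out.append(line)
    if not inserted:
        raise ValueError("No ER end-of-record line found")
    return "\n".join(out) + "\n"


# Minimal placeholder record (not the full RIS data Crossref returns)
record = "TY  - JOUR\nDO  - 10.1073/pnas.2023867118\nER  - \n"

# Assumption: 'notes_abstract' corresponds to the N2 tag
with_abstract = add_ris_field(record, "N2", "Placeholder abstract text.")
print(with_abstract)
```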