# How to obtain data



### Obtaining a set of relevant data sources

The first step of the data extraction process is to collect a set of potentially relevant data sources. You can assemble this set manually or use a tool to automate and speed up the process. The Crossref API is a useful tool for collecting the metadata of relevant articles, and several Python libraries make access to the API easier. One of them is [crossrefapi](https://github.com/fabiobatalha/crossrefapi). In the example below, a sample of 10 sources on the topic 'buchwald-hartwig coupling' is retrieved together with its metadata and saved to a JSON file.

In [9]:
from crossref.restful import Works
import json

works = Works(timeout=60)

# Query Crossref for works on buchwald-hartwig coupling and draw a sample of 10 records
query_result = works.query(bibliographic='buchwald-hartwig coupling').select('DOI', 'title', 'author', 'type', 'publisher', 'issued').sample(10)

results = [item for item in query_result]

# Save the 10 results including their metadata in a JSON file
with open('buchwald-hartwig_coupling_results.json', 'w') as file:
    json.dump(results, file)
    
print(results)

[{'DOI': '10.1021/jo061366i.s001', 'issued': {'date-parts': [[None]]}, 'publisher': 'American Chemical Society (ACS)', 'title': ['Synthesis of Cyclic Peptides Constrained with Biarylamine Linkers Using Buchwald-Hartwig C-N Coupling'], 'type': 'component'}, {'DOI': '10.1002/chin.200723250', 'author': [{'given': 'Peter', 'family': 'Kettler', 'sequence': 'first', 'affiliation': []}], 'issued': {'date-parts': [[2007, 5, 16]]}, 'publisher': 'Wiley', 'title': ['Carbon—Nitrogen Coupling and Buchwald—Hartwig Amination'], 'type': 'journal-article'}, {'DOI': '10.1055/s-0039-1690303', 'issued': {'date-parts': [[2019, 10, 18]]}, 'publisher': 'Georg Thieme Verlag KG', 'title': ['Buchwald–Hartwig Coupling of Piperidines with Hetaryl Bromides'], 'type': 'journal-article'}, {'DOI': '10.1002/chin.200704170', 'author': [{'given': 'V.', 'family': 'Balraju', 'sequence': 'first', 'affiliation': []}, {'given': 'Javed', 'family': 'Iqbal', 'sequence': 'additional', 'affiliation': []}], 'issued': {'date-parts'
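
The records returned by Crossref do not all carry the same fields: in the output above, the first record has no `author` entry and its `issued` date is `None`. The following is a minimal sketch of how such records could be flattened into a uniform shape before further processing. The helper name `summarize` and the output keys are illustrative assumptions, not part of crossrefapi, and the two sample records are copied from the output above.

```python
def summarize(records):
    """Flatten Crossref-style records into simple dicts, tolerating missing fields."""
    rows = []
    for rec in records:
        # 'title' is a list in Crossref metadata; fall back to an empty title
        titles = rec.get('title') or ['']
        # 'issued' holds nested date parts such as [[2007, 5, 16]] or [[None]]
        date_parts = rec.get('issued', {}).get('date-parts', [[None]])
        year = date_parts[0][0] if date_parts and date_parts[0] else None
        # 'author' may be absent entirely, so default to an empty list
        authors = [
            ' '.join(part for part in (a.get('given'), a.get('family')) if part)
            for a in rec.get('author', [])
        ]
        rows.append({'doi': rec.get('DOI'), 'title': titles[0],
                     'year': year, 'authors': authors})
    return rows


# Two records copied from the output above
sample = [
    {'DOI': '10.1021/jo061366i.s001',
     'issued': {'date-parts': [[None]]},
     'publisher': 'American Chemical Society (ACS)',
     'title': ['Synthesis of Cyclic Peptides Constrained with Biarylamine Linkers Using Buchwald-Hartwig C-N Coupling'],
     'type': 'component'},
    {'DOI': '10.1002/chin.200723250',
     'author': [{'given': 'Peter', 'family': 'Kettler', 'sequence': 'first', 'affiliation': []}],
     'issued': {'date-parts': [[2007, 5, 16]]},
     'publisher': 'Wiley',
     'title': ['Carbon—Nitrogen Coupling and Buchwald—Hartwig Amination'],
     'type': 'journal-article'},
]

for row in summarize(sample):
    print(row)
```

To apply the same helper to the saved file, load it with `json.load` and pass the resulting list to `summarize`.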

### Data mining from an available database

Several openly available datasets can be used for data mining. The [paperscraper](https://github.com/jannisborn/paperscraper) tool can download full-text documents from open-access libraries. In this example, full-text articles on the topic 'buchwald-hartwig coupling' are downloaded from ChemRxiv.

In [15]:
from paperscraper.get_dumps import chemrxiv

# Download the ChemRxiv metadata dump (this can take more than an hour, see the output below)
chemrxiv(save_path='chemrxiv_2020-11-10.jsonl')

23763it [1:24:56,  4.66it/s]
100%|██████████| 23766/23766 [00:07<00:00, 3167.05it/s]


INFO:paperscraper.get_dumps.utils.chemrxiv.utils:Done, shutting down


In [30]:
from paperscraper.xrxiv.xrxiv_query import XRXivQuery
from paperscraper.pdf import save_pdf_from_dump
import pandas as pd

df = pd.read_json('./chemrxiv_2020-11-10.jsonl', lines=True)

# define keywords for the paper search
synthesis = ['synthesis']
reaction = ['buchwald-hartwig']

# combine keywords 
query = [synthesis, reaction]

# search the ChemRxiv dump for relevant papers and save the matches to a new file
querier = XRXivQuery('./chemrxiv_2020-11-10.jsonl')
querier.search_keywords(query, output_filepath='buchwald-hartwig_coupling_ChemRxiv.jsonl')

# Save the PDFs in the ./PDFs folder and name the files by their DOI
save_pdf_from_dump('./buchwald-hartwig_coupling_ChemRxiv.jsonl', pdf_path='./PDFs', key_to_save='doi')

Processing paper 5/5: 100%|██████████| 5/5 [00:07<00:00,  1.55s/it]
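
The nested `query` list above combines keywords as follows: the entries of each inner list are treated as alternatives, and every inner list has to be matched. The sketch below reproduces this logic in plain Python to make it explicit. It is an illustration only and not paperscraper's implementation; the function names, the searched fields (`title` and `abstract`), and the three example records with placeholder DOIs are assumptions.

```python
import json


def matches(text, keyword_groups):
    """True if every group has at least one keyword occurring in text (case-insensitive)."""
    lowered = text.lower()
    return all(any(kw.lower() in lowered for kw in group) for group in keyword_groups)


def filter_jsonl_lines(lines, keyword_groups, fields=('title', 'abstract')):
    """Parse JSONL lines and keep the records whose selected fields match all groups."""
    hits = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        record = json.loads(line)
        text = ' '.join(str(record.get(field, '')) for field in fields)
        if matches(text, keyword_groups):
            hits.append(record)
    return hits


# Hypothetical records standing in for lines of the downloaded dump
dump_lines = [
    json.dumps({'title': 'Synthesis of aryl amines',
                'abstract': 'A Buchwald-Hartwig coupling route.',
                'doi': '10.0000/example.1'}),
    json.dumps({'title': 'Buchwald-Hartwig mechanism',
                'abstract': 'A computational study.',
                'doi': '10.0000/example.2'}),
    json.dumps({'title': 'Synthesis of polymers',
                'abstract': 'Radical polymerization.',
                'doi': '10.0000/example.3'}),
]

query = [['synthesis'], ['buchwald-hartwig']]
selected = filter_jsonl_lines(dump_lines, query)
print([record['doi'] for record in selected])
```

Only the first record contains both a keyword from the first group and one from the second, so it is the only one selected. Adding synonyms to an inner list, for example `['buchwald-hartwig', 'c-n coupling']`, would widen the search without relaxing the requirement that every group is matched.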


### Data annotation

Annotation tools such as doccano can speed up the annotation process. Before doccano can be used, its database has to be initialized and a user account has to be created.

In [None]:
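# initialize the database, create a user account and start the web server
$ doccano init
$ doccano createuser --username admin --password pass
$ doccano webserver --port 8000

# in a second terminal: start the task queue that handles file upload and download
$ doccano task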

Afterwards, open the doccano [web interface](http://0.0.0.0:8000), log in with the created account, and upload the unannotated articles. You can then annotate the dataset by adding relevant labels to the text of the articles.
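
Before uploading, the article texts have to be in a file format that doccano can import. One common option is JSONL with one document per line. The sketch below writes such a file, assuming that the text of each article is already available as a Python string and that the `text` key matches the column name selected in doccano's import dialog. The helper name `to_jsonl`, the output file name, and the example sentences are hypothetical, and the supported formats depend on the doccano version.

```python
import json


def to_jsonl(texts):
    """Serialize each text as one JSON object per line, using the key 'text'."""
    return '\n'.join(json.dumps({'text': text}, ensure_ascii=False) for text in texts)


# Hypothetical article snippets standing in for the extracted full texts
articles = [
    'The aryl bromide was coupled with morpholine using Pd2(dba)3 and XPhos.',
    'The reaction mixture was stirred at 100 °C for 12 h.',
]

payload = to_jsonl(articles)

# Write the file that is later selected in the doccano upload dialog
with open('unannotated_articles.jsonl', 'w', encoding='utf-8') as file:
    file.write(payload + '\n')

print(payload)
```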