# From DIMENSIONS data to a dataframe of abstracts

### Search for documents in the [DIMENSIONS](https://app.dimensions.ai/discover/publication) database and create an Excel file containing the search results ("search_results.xlsx").
* Run a query in the search bar. 

(I chose the search option "DOI" and searched for known DOIs. One of my documents has no DOI, so for that one I chose the search option "Title and abstract" and searched for a few sentences from the document's "SUMMARY" section.)

* Click on "Save/Export" (on the right, next to the search bar) and export the list of search results. Choose to export the results as an Excel file.

(In this way I created "hcq.xlsx".)

### Make a dataframe out of the Excel file

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json

In [2]:
# make a dataframe from "hcq.xlsx": df_hcq
# skiprows=[0] skips the first row of the sheet, so the second row supplies the column names
df_hcq = pd.read_excel("hcq.xlsx", skiprows=[0])
df_hcq.head(3)

Unnamed: 0,Rank,Publication ID,DOI,PMID,PMCID,Title,Source title,Anthology title,MeSH terms,Publication Date,...,Corresponding Author,Authors Affiliations,Times cited,Recent citations,RCR,FCR,Source Linkout,Dimensions URL,FOR (ANZSRC) Categories,Sustainable Development Goals
0,113,pub.1126880632,10.1186/s12969-020-00422-z,32321540.0,PMC7175817,COVID-19 and what pediatric rheumatologists sh...,Pediatric Rheumatology,,"Adolescent; Anti-Inflammatory Agents, Non-Ster...",2020-12,...,"Giani, Teresa (Meyer Children's Hospital; Univ...","Licciardi, Francesco (University of Turin); Gi...",1,1,,,https://ped-rheum.biomedcentral.com/track/pdf/...,https://app.dimensions.ai/details/publication/...,1108 Medical Microbiology; 11 Medical and Heal...,
1,65,pub.1127125006,10.1186/s13054-020-02894-7,32345336.0,PMC7187670,Shining a light on the evidence for hydroxychl...,Critical Care,,Betacoronavirus; Coronavirus Infections; Human...,2020-12,...,"Ingraham, Nicholas E. (University of Minnesota)","Ingraham, Nicholas E. (University of Minnesota...",0,0,,,https://ccforum.biomedcentral.com/track/pdf/10...,https://app.dimensions.ai/details/publication/...,11 Medical and Health Sciences,
2,58,pub.1125978056,10.1016/j.medmal.2020.03.006,32240719.0,PMC7195369,No Evidence of Rapid Antiviral Clearance or Cl...,Médecine et Maladies Infectieuses,,Adult; Aged; Antiviral Agents; Azithromycin; B...,2020-06,...,"Molina, Jean Michel (Hôpital Saint-Louis)","Molina, Jean Michel (Hôpital Saint-Louis); Del...",67,67,,,https://doi.org/10.1016/j.medmal.2020.03.006,https://app.dimensions.ai/details/publication/...,,


## Extract metadata from the Dimensions web pages

In [3]:
pubkeys = [
    'title',
    'aff_org_name',
    'researcher_dim_id',
    'researcher_dim_count',
    'journal_title',
    'language',
    'abstract',
    'open_access',
    'publisher',
    'aff_city_name',
    'aff_country_name',
    'doi',
    'pub_date',
    'pub_year',
    'times_cited',
    'altmetric_id',
    'altmetric',
    'authors_full'
    ]

In [4]:
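# define function getmeta(): fetch a record's Dimensions page and return the metadata listed in pubkeys
def getmeta(x):
    d = {}
    html = ""
    url = x["Dimensions URL"]

    # empty cells read by read_excel are NaN (not None), so test with pd.notna()
    if pd.notna(url):
        html = requests.get(url, timeout=30).content

    soup = BeautifulSoup(html, "lxml")
    # the metadata is stored as JSON in the "data-doc" attribute of a <div>
    datadoc = soup.find("div", attrs={"data-doc": True})
    attr = datadoc.get("data-doc") if datadoc is not None else ""

    if attr:
        d = json.loads(attr)

    # no metadata found: every key gets an empty string
    if not d:
        return {key: "" for key in pubkeys}

    # keys missing from the JSON get the placeholder string "NaN"
    return {key: d.get(key, "NaN") for key in pubkeys}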
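
The parsing step inside getmeta() can be tried offline. The following is a minimal sketch, not the notebook's own method: it swaps BeautifulSoup for the standard library's HTMLParser so that it runs without network access or extra packages. The names DataDocParser, extract_metadata and sample_html, as well as the HTML snippet itself, are made up for illustration; real Dimensions pages may be structured differently.

```python
import html
import json
from html.parser import HTMLParser


class DataDocParser(HTMLParser):
    """Remember the first non-empty `data-doc` attribute seen on any tag."""

    def __init__(self):
        super().__init__()
        self.data_doc = None

    def handle_starttag(self, tag, attrs):
        if self.data_doc is None:
            for name, value in attrs:
                if name == "data-doc" and value:
                    self.data_doc = value


def extract_metadata(page, keys, default="NaN"):
    """Return the selected keys from the JSON stored in `data-doc`."""
    parser = DataDocParser()
    parser.feed(page)
    # No attribute found: fall back to an empty record.
    record = json.loads(parser.data_doc) if parser.data_doc else {}
    return {key: record.get(key, default) for key in keys}


# Hypothetical page: the JSON is HTML-escaped inside the attribute value.
payload = json.dumps({"title": "Example title", "abstract": "Example abstract."})
sample_html = f'<html><body><div data-doc="{html.escape(payload)}"></div></body></html>'

meta = extract_metadata(sample_html, ["title", "abstract", "publisher"])
print(meta)
```

Keys that are absent from the JSON (here "publisher") receive the placeholder, mirroring the default used in getmeta().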

In [5]:
# create a new column "seldict" of dataframe df_hcq
# store metadata in column "seldict" of dataframe df_hcq
df_hcq["seldict"] = df_hcq.apply(getmeta, axis=1)
df_hcq["seldict"].head(3)

0    {'title': 'COVID-19 and what pediatric rheumat...
1    {'title': 'Shining a light on the evidence for...
2    {'title': 'No Evidence of Rapid Antiviral Clea...
Name: seldict, dtype: object

In [6]:
df_hcq.head(3)

Unnamed: 0,Rank,Publication ID,DOI,PMID,PMCID,Title,Source title,Anthology title,MeSH terms,Publication Date,...,Authors Affiliations,Times cited,Recent citations,RCR,FCR,Source Linkout,Dimensions URL,FOR (ANZSRC) Categories,Sustainable Development Goals,seldict
0,113,pub.1126880632,10.1186/s12969-020-00422-z,32321540.0,PMC7175817,COVID-19 and what pediatric rheumatologists sh...,Pediatric Rheumatology,,"Adolescent; Anti-Inflammatory Agents, Non-Ster...",2020-12,...,"Licciardi, Francesco (University of Turin); Gi...",1,1,,,https://ped-rheum.biomedcentral.com/track/pdf/...,https://app.dimensions.ai/details/publication/...,1108 Medical Microbiology; 11 Medical and Heal...,,{'title': 'COVID-19 and what pediatric rheumat...
1,65,pub.1127125006,10.1186/s13054-020-02894-7,32345336.0,PMC7187670,Shining a light on the evidence for hydroxychl...,Critical Care,,Betacoronavirus; Coronavirus Infections; Human...,2020-12,...,"Ingraham, Nicholas E. (University of Minnesota...",0,0,,,https://ccforum.biomedcentral.com/track/pdf/10...,https://app.dimensions.ai/details/publication/...,11 Medical and Health Sciences,,{'title': 'Shining a light on the evidence for...
2,58,pub.1125978056,10.1016/j.medmal.2020.03.006,32240719.0,PMC7195369,No Evidence of Rapid Antiviral Clearance or Cl...,Médecine et Maladies Infectieuses,,Adult; Aged; Antiviral Agents; Azithromycin; B...,2020-06,...,"Molina, Jean Michel (Hôpital Saint-Louis); Del...",67,67,,,https://doi.org/10.1016/j.medmal.2020.03.006,https://app.dimensions.ai/details/publication/...,,,{'title': 'No Evidence of Rapid Antiviral Clea...


In [7]:
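# lidi collects one flat record (dict) per publication
lidi = []
# newpub() combines columns of the Excel export with values from "seldict"
def newpub(x):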
    di = dict()
    d = x["seldict"]
    di["Publication ID"] = x["Publication ID"]
    di["DOI"] = x["DOI"]
    di["title"] = x["Title"]
    di["authors"] = d["authors_full"]
    di["publisher"] = d["publisher"]
    di["source"] = x["Source title"]
    di["aff_org_name"] = d["aff_org_name"]
    di["aff_country"] = d["aff_country_name"]
    di["pub_date"] = d["pub_date"]
    di["abstract"] = d["abstract"]
    di["openaccess"] = d["open_access"]
    di["di_URL"] = x["Dimensions URL"]
    lidi.append(di)

In [8]:
# call newpub() once per row; the records accumulate in lidi
df_hcq.apply(newpub, axis=1)
df_hcq_Lit = pd.DataFrame(lidi)
df_hcq_Lit.head(3)

Unnamed: 0,Publication ID,DOI,title,authors,publisher,source,aff_org_name,aff_country,pub_date,abstract,openaccess,di_URL
0,pub.1126880632,10.1186/s12969-020-00422-z,COVID-19 and what pediatric rheumatologists sh...,"[Francesco Licciardi, Teresa Giani, Letizia Ba...",Springer Nature,Pediatric Rheumatology,"[University of Siena, University of Turin, Div...",[Italy],2020-12,"On March 11th, 2020 the World Health Organizat...",True,https://app.dimensions.ai/details/publication/...
1,pub.1127125006,10.1186/s13054-020-02894-7,Shining a light on the evidence for hydroxychl...,"[Nicholas E. Ingraham, David Boulware, Matthew...",Springer Nature,Critical Care,"[University of North Carolina at Chapel Hill, ...",[United States],2020-12,,True,https://app.dimensions.ai/details/publication/...
2,pub.1125978056,10.1016/j.medmal.2020.03.006,No Evidence of Rapid Antiviral Clearance or Cl...,"[Jean Michel Molina, Constance Delaugerre, Jer...",Elsevier,Médecine et Maladies Infectieuses,"[University of Paris-Sud, U944 INSERM, Univers...",[France],2020-06,,True,https://app.dimensions.ai/details/publication/...
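
The same flattening can be done without appending to a global list. This is a sketch of an alternative, assuming a column of dictionaries like "seldict"; the tiny dataframe and its values are invented for illustration and are not taken from the Dimensions export.

```python
import pandas as pd

# Hypothetical stand-in for df_hcq with a column of metadata dictionaries.
df = pd.DataFrame({
    "Publication ID": ["pub.0000000001", "pub.0000000002"],
    "Title": ["Example title A", "Example title B"],
    "seldict": [
        {"abstract": "Example abstract.", "publisher": "Example Press"},
        {"abstract": "", "publisher": "Another Press"},
    ],
})

# Expand the dictionaries into their own columns, one row per publication.
meta = pd.DataFrame(df["seldict"].tolist(), index=df.index)

# Put the spreadsheet columns and the scraped columns side by side.
flat = pd.concat(
    [df[["Publication ID", "Title"]], meta[["publisher", "abstract"]]],
    axis=1,
)
print(flat)
```

Because nothing is appended to a shared list, re-running the cell does not duplicate rows.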


## Transforming source data

In [9]:
# select columns
df_HCQ = df_hcq_Lit[["Publication ID", "title", "abstract"]]
df_HCQ.head(3)

Unnamed: 0,Publication ID,title,abstract
0,pub.1126880632,COVID-19 and what pediatric rheumatologists sh...,"On March 11th, 2020 the World Health Organizat..."
1,pub.1127125006,Shining a light on the evidence for hydroxychl...,
2,pub.1125978056,No Evidence of Rapid Antiviral Clearance or Cl...,


In [10]:
df_HCQ.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Publication ID  19 non-null     object
 1   title           19 non-null     object
 2   abstract        19 non-null     object
dtypes: object(3)
memory usage: 584.0+ bytes
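
Note that info() reports 19 non-null abstracts although the abstracts of rows 1 and 2 above look blank. Presumably this is because getmeta() fills in strings ("" or "NaN") rather than real missing values. A minimal sketch of counting such placeholders, using an invented three-row dataframe:

```python
import pandas as pd

# Hypothetical data: one real abstract and two placeholder strings.
df = pd.DataFrame({
    "Publication ID": ["pub.0000000001", "pub.0000000002", "pub.0000000003"],
    "abstract": ["Example abstract.", "", "NaN"],
})

# Strings that stand in for "no abstract" (an assumption based on getmeta()).
placeholders = ["", "NaN"]

# A row counts as missing if it is a real null or one of the placeholders.
is_missing = df["abstract"].isna() | df["abstract"].isin(placeholders)
n_missing = int(is_missing.sum())

# Keep only the rows that actually carry an abstract.
with_abstract = df[~is_missing]
print(n_missing, len(with_abstract))
```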


In [11]:
# export df_HCQ as a JSON file: df_HCQ.json
df_HCQ.to_json("df_HCQ.json")
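
Called without arguments, to_json() writes a dataframe column by column. If one JSON object per publication is easier to consume downstream, orient="records" is an option. The sketch below compares both layouts on an invented two-row dataframe and works on strings instead of files.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for df_HCQ.
df = pd.DataFrame({
    "Publication ID": ["pub.0000000001", "pub.0000000002"],
    "title": ["Example title A", "Example title B"],
    "abstract": ["First example abstract.", "Second example abstract."],
})

# Default layout: {column -> {row label -> value}}.
as_columns = json.loads(df.to_json())

# Record layout: a list with one dictionary per row.
as_records_text = df.to_json(orient="records")
as_records = json.loads(as_records_text)

# Reading the record layout back into a dataframe.
restored = pd.read_json(io.StringIO(as_records_text), orient="records")
print(restored)
```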