# From DIMENSIONS data to a dataframe of abstracts

### Search for documents in the [DIMENSIONS](https://app.dimensions.ai/discover/publication) database and create an Excel file containing the search results ("search_results.xlsx").
* Run a query in the search bar. 

(I chose search option "DOI" and searched for known DOIs. One of my documents has no DOI, so I chose search option "Title and abstract" and searched for some sentences of this document's "SUMMARY" section.)

* Click on "Save/Export" (to the right of the search bar) and export the list of search results as an Excel file.

(In this way I created "hcq.xlsx".)

### Make a dataframe out of the Excel file

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
import numpy as np
import re

In [2]:
# make a dataframe from "hcq.xlsx": df_hcq
df_hcq = pd.read_excel("hcq.xlsx",skiprows=[0])
df_hcq.head(3)

Unnamed: 0,Rank,Publication ID,DOI,PMID,PMCID,Title,Source title,Anthology title,MeSH terms,Publication Date,...,Corresponding Author,Authors Affiliations,Times cited,Recent citations,RCR,FCR,Source Linkout,Dimensions URL,FOR (ANZSRC) Categories,Sustainable Development Goals
0,113,pub.1126880632,10.1186/s12969-020-00422-z,32321540.0,PMC7175817,COVID-19 and what pediatric rheumatologists sh...,Pediatric Rheumatology,,"Adolescent; Anti-Inflammatory Agents, Non-Ster...",2020-12,...,"Giani, Teresa (Meyer Children's Hospital; Univ...","Licciardi, Francesco (University of Turin); Gi...",1,1,,,https://ped-rheum.biomedcentral.com/track/pdf/...,https://app.dimensions.ai/details/publication/...,1108 Medical Microbiology; 11 Medical and Heal...,
1,65,pub.1127125006,10.1186/s13054-020-02894-7,32345336.0,PMC7187670,Shining a light on the evidence for hydroxychl...,Critical Care,,Betacoronavirus; Coronavirus Infections; Human...,2020-12,...,"Ingraham, Nicholas E. (University of Minnesota)","Ingraham, Nicholas E. (University of Minnesota...",0,0,,,https://ccforum.biomedcentral.com/track/pdf/10...,https://app.dimensions.ai/details/publication/...,11 Medical and Health Sciences,
2,58,pub.1125978056,10.1016/j.medmal.2020.03.006,32240719.0,PMC7195369,No Evidence of Rapid Antiviral Clearance or Cl...,Médecine et Maladies Infectieuses,,Adult; Aged; Antiviral Agents; Azithromycin; B...,2020-06,...,"Molina, Jean Michel (Hôpital Saint-Louis)","Molina, Jean Michel (Hôpital Saint-Louis); Del...",67,67,,,https://doi.org/10.1016/j.medmal.2020.03.006,https://app.dimensions.ai/details/publication/...,,


## Exploit Websites

In [3]:
pubkeys = [
    'title',
    'aff_org_name',
    'researcher_dim_id',
    'researcher_dim_count',
    'journal_title',
    'language',
    'abstract',
    'open_access',
    'publisher',
    'aff_city_name',
    'aff_country_name',
    'doi',
    'pub_date',
    'pub_year',
    'times_cited',
    'altmetric_id',
    'altmetric',
    'authors_full'
    ]

In [4]:
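# define function getmeta(): fetch a publication's Dimensions page and
# return the metadata fields listed in 'pubkeys' as a dictionary
def getmeta(x):
    html = ""
    d = None
    url = x["Dimensions URL"]

    # skip missing URLs (empty Excel cells are read as NaN, not None)
    if pd.notna(url):
        html = requests.get(url).content

    soup = BeautifulSoup(html, "lxml")
    # look at the first <div> of the page
    datadoc = soup.find("div")

    # the "data-doc" attribute, if present, holds the metadata as a JSON string
    if datadoc is not None:
        attr = datadoc.get("data-doc")
        if attr:
            d = json.loads(attr)

    dicpub = dict()

    for i in pubkeys:
        # "NaN" marks a key that is absent from the metadata;
        # "" is stored when no metadata could be read at all
        dicpub[i] = d.get(i, "NaN") if d is not None else ""

    return dicpub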
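
The key-selection step inside `getmeta()` can be tried offline, without requesting a page. The following is a minimal sketch, assuming the metadata arrives as a JSON string; the sample string and the helper name `select_keys` are hypothetical and not part of the notebook.

```python
import json

# Hypothetical stand-in for the JSON string read from the "data-doc" attribute
sample_attr = '{"title": "Of chloroquine and COVID-19", "abstract": "", "pub_year": 2020}'


def select_keys(attr, keys, missing="NaN"):
    """Parse a JSON string and keep only the requested keys."""
    doc = json.loads(attr)
    # dict.get() returns the 'missing' marker for keys absent from the JSON
    return {key: doc.get(key, missing) for key in keys}


selected = select_keys(sample_attr, ["title", "abstract", "doi"])
print(selected)
```

Note that an empty abstract (`''`) and an absent key (`'NaN'`) end up as different values in the resulting dictionary.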

In [5]:
# create a new column "seldict" of dataframe df_hcq
# store metadata in column "seldict" of dataframe df_hcq
df_hcq["seldict"] = df_hcq.apply(getmeta, axis=1)
df_hcq["seldict"].head(3)

0    {'title': 'COVID-19 and what pediatric rheumat...
1    {'title': 'Shining a light on the evidence for...
2    {'title': 'No Evidence of Rapid Antiviral Clea...
Name: seldict, dtype: object

In [6]:
df_hcq.head(3)

Unnamed: 0,Rank,Publication ID,DOI,PMID,PMCID,Title,Source title,Anthology title,MeSH terms,Publication Date,...,Authors Affiliations,Times cited,Recent citations,RCR,FCR,Source Linkout,Dimensions URL,FOR (ANZSRC) Categories,Sustainable Development Goals,seldict
0,113,pub.1126880632,10.1186/s12969-020-00422-z,32321540.0,PMC7175817,COVID-19 and what pediatric rheumatologists sh...,Pediatric Rheumatology,,"Adolescent; Anti-Inflammatory Agents, Non-Ster...",2020-12,...,"Licciardi, Francesco (University of Turin); Gi...",1,1,,,https://ped-rheum.biomedcentral.com/track/pdf/...,https://app.dimensions.ai/details/publication/...,1108 Medical Microbiology; 11 Medical and Heal...,,{'title': 'COVID-19 and what pediatric rheumat...
1,65,pub.1127125006,10.1186/s13054-020-02894-7,32345336.0,PMC7187670,Shining a light on the evidence for hydroxychl...,Critical Care,,Betacoronavirus; Coronavirus Infections; Human...,2020-12,...,"Ingraham, Nicholas E. (University of Minnesota...",0,0,,,https://ccforum.biomedcentral.com/track/pdf/10...,https://app.dimensions.ai/details/publication/...,11 Medical and Health Sciences,,{'title': 'Shining a light on the evidence for...
2,58,pub.1125978056,10.1016/j.medmal.2020.03.006,32240719.0,PMC7195369,No Evidence of Rapid Antiviral Clearance or Cl...,Médecine et Maladies Infectieuses,,Adult; Aged; Antiviral Agents; Azithromycin; B...,2020-06,...,"Molina, Jean Michel (Hôpital Saint-Louis); Del...",67,67,,,https://doi.org/10.1016/j.medmal.2020.03.006,https://app.dimensions.ai/details/publication/...,,,{'title': 'No Evidence of Rapid Antiviral Clea...


In [7]:
lidi = []
def newpub(x):
    di = dict()
    d = x["seldict"]
    di["Publication ID"] = x["Publication ID"]
    di["DOI"] = x["DOI"]
    di["title"] = x["Title"]
    di["authors"] = d["authors_full"]
    di["publisher"] = d["publisher"]
    di["source"] = x["Source title"]
    di["aff_org_name"] = d["aff_org_name"]
    di["aff_country"] = d["aff_country_name"]
    di["pub_date"] = d["pub_date"]
    di["abstract"] = d["abstract"]
    di["openaccess"] = d["open_access"]
    di["di_URL"] = x["Dimensions URL"]
    lidi.append(di)

In [8]:
df_hcq.apply(lambda x:newpub(x), axis=1)
df_hcq_Lit = pd.DataFrame(lidi)
df_hcq_Lit.head(3)

Unnamed: 0,Publication ID,DOI,title,authors,publisher,source,aff_org_name,aff_country,pub_date,abstract,openaccess,di_URL
0,pub.1126880632,10.1186/s12969-020-00422-z,COVID-19 and what pediatric rheumatologists sh...,"[Francesco Licciardi, Teresa Giani, Letizia Ba...",Springer Nature,Pediatric Rheumatology,"[University of Siena, University of Turin, Div...",[Italy],2020-12,"On March 11th, 2020 the World Health Organizat...",True,https://app.dimensions.ai/details/publication/...
1,pub.1127125006,10.1186/s13054-020-02894-7,Shining a light on the evidence for hydroxychl...,"[Nicholas E. Ingraham, David Boulware, Matthew...",Springer Nature,Critical Care,"[University of North Carolina at Chapel Hill, ...",[United States],2020-12,,True,https://app.dimensions.ai/details/publication/...
2,pub.1125978056,10.1016/j.medmal.2020.03.006,No Evidence of Rapid Antiviral Clearance or Cl...,"[Jean Michel Molina, Constance Delaugerre, Jer...",Elsevier,Médecine et Maladies Infectieuses,"[University of Paris-Sud, U944 INSERM, Univers...",[France],2020-06,,True,https://app.dimensions.ai/details/publication/...
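
A table like the one above could also be built without the global list `lidi`. This is a minimal sketch, assuming each `seldict` entry is a plain dictionary; the two-row frame `demo` is invented sample data, not the notebook's `df_hcq`.

```python
import pandas as pd

# Invented sample data mimicking two rows of df_hcq
demo = pd.DataFrame({
    "Publication ID": ["pub.1", "pub.2"],
    "Title": ["First title", "Second title"],
    "seldict": [
        {"abstract": "Some abstract text.", "publisher": "Publisher A"},
        {"abstract": "", "publisher": "Publisher B"},
    ],
})

# Expand the dictionaries into columns; the result shares demo's index
meta = pd.DataFrame(demo["seldict"].tolist(), index=demo.index)

# Combine selected spreadsheet columns with selected metadata columns
demo_lit = pd.concat(
    [demo[["Publication ID", "Title"]], meta[["publisher", "abstract"]]],
    axis=1,
)
print(demo_lit)
```

Because no list is filled as a side effect, re-running the cell does not append duplicate rows.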


## Transforming source data

In [9]:
# select columns
df_HCQ = df_hcq_Lit[["Publication ID", "title", "abstract"]]
df_HCQ.head()

Unnamed: 0,Publication ID,title,abstract
0,pub.1126880632,COVID-19 and what pediatric rheumatologists sh...,"On March 11th, 2020 the World Health Organizat..."
1,pub.1127125006,Shining a light on the evidence for hydroxychl...,
2,pub.1125978056,No Evidence of Rapid Antiviral Clearance or Cl...,
3,pub.1127834352,Hydroxychloroquine or chloroquine with or with...,"BACKGROUND: Hydroxychloroquine or chloroquine,..."
4,pub.1126667578,Hydroxychloroquine in patients mainly with mil...,Abstract Objectives To assess the efficacy and...


We now have a dataframe that contains the publication ID, title and abstract of every selected publication. As the cell above shows, the abstract is missing in some rows. pandas does not register these abstracts as missing, however: all 19 entries in the column 'abstract' are counted as "non-null" (see the next cell below), because an empty string ('') is treated as an existing, if empty, value. The next step is therefore to remove the rows that contain an empty string in the 'abstract' column.

In [10]:
df_HCQ.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Publication ID  19 non-null     object
 1   title           19 non-null     object
 2   abstract        19 non-null     object
dtypes: object(3)
memory usage: 584.0+ bytes
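
The counting behaviour can be reproduced on a small example. This sketch uses an invented Series rather than the notebook's data; it shows that an empty string counts as non-null until it is explicitly turned into a missing value.

```python
import pandas as pd

# Invented sample: the second entry is an empty string, not a missing value
abstracts = pd.Series(["Some abstract text.", "", "Another abstract."])

# All three entries count as non-null
n_before = int(abstracts.notna().sum())

# mask() replaces entries where the condition is True with NaN
marked = abstracts.mask(abstracts == "")
n_after = int(marked.notna().sum())

print(n_before, n_after)
```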


In [11]:
# Drop rows of 'df_HCQ' that contain in column 'abstract' an empty string ('') instead of an abstract: df_HCQ_dropped

drop_list = []
for i in range(len(df_HCQ)):
    if df_HCQ["abstract"].iloc[i] == '':
        drop_list.append(i)
        
df_HCQ_dropped = df_HCQ.drop(drop_list)

In [12]:
# Create new index for 'df_HCQ_dropped'
df_HCQ_dropped.index = [i for i in range(len(df_HCQ_dropped))]
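
The loop and the manual re-indexing can also be expressed with a boolean mask. This is a sketch on invented data, offered as an alternative rather than as the notebook's method.

```python
import pandas as pd

# Invented sample data; the second abstract is an empty string
demo = pd.DataFrame({
    "Publication ID": ["pub.1", "pub.2", "pub.3"],
    "abstract": ["Some abstract text.", "", "Another abstract."],
})

# Keep only rows with a non-empty abstract, then renumber the rows from 0
demo_dropped = demo[demo["abstract"] != ""].reset_index(drop=True)
print(demo_dropped)
```

Unlike assigning a list to `.index`, `reset_index(drop=True)` yields a `RangeIndex`, so the index line printed by `.info()` would differ accordingly.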

In [13]:
df_HCQ_dropped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17 entries, 0 to 16
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Publication ID  17 non-null     object
 1   title           17 non-null     object
 2   abstract        17 non-null     object
dtypes: object(3)
memory usage: 544.0+ bytes


In [14]:
df_HCQ_dropped["abstract"].iloc[1]

'BACKGROUND: Hydroxychloroquine or chloroquine, often in combination with a second-generation macrolide, are being widely used for treatment of COVID-19, despite no conclusive evidence of their benefit. Although generally safe when used for approved indications such as autoimmune disease or malaria, the safety and benefit of these treatment regimens are poorly evaluated in COVID-19.\nMETHODS: We did a multinational registry analysis of the use of hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19. The registry comprised data from 671 hospitals in six continents. We included patients hospitalised between Dec 20, 2019, and April 14, 2020, with a positive laboratory finding for SARS-CoV-2. Patients who received one of the treatments of interest within 48 h of diagnosis were included in one of four treatment groups (chloroquine alone, chloroquine with a macrolide, hydroxychloroquine alone, or hydroxychloroquine with a macrolide), and patients who receiv

In [15]:
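# Define function to remove (potentially disturbing) characters from abstracts: clean_abstract()
def clean_abstract(abstract):
    # replace thin space, punctuation space, non-breaking space and newline with a regular space
    cleaned = re.sub("[\u2009\u2008\xa0\n]", " ", abstract)
    # decode the HTML entity for an apostrophe (must happen before '#' is removed)
    cleaned = re.sub("&#x27;", "'", cleaned)
    # replace each run of '#' with one space; the earlier version replaced single '#'
    # first, so its later "##" and "###" substitutions could never match
    cleaned = re.sub("#+", " ", cleaned)
    # remove HTML-encoded quotation marks
    cleaned = re.sub("&quot;", " ", cleaned)
    cleaned = re.sub("& s", "'", cleaned)

    return cleaned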
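
For comparison, the standard library can decode HTML entities in general. The sketch below is an alternative, not the notebook's function: the name `clean_text` and the sample string are invented, and it behaves differently from `clean_abstract()` in that it keeps quotation marks, leaves `#` untouched and collapses repeated whitespace.

```python
import html


def clean_text(text):
    # Decode HTML entities such as &#x27; and &quot;
    decoded = html.unescape(text)
    # str.split() without arguments splits on any Unicode whitespace,
    # including "\u2009", "\u2008", "\xa0" and "\n"
    return " ".join(decoded.split())


sample = "COVID&#x27;s\u2009effect:\n&quot;poorly\xa0evaluated&quot;"
cleaned = clean_text(sample)
print(cleaned)
```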

In [16]:
df_HCQ_dropped["abstract_clean"] = df_HCQ_dropped["abstract"].apply(clean_abstract)

In [17]:
df_HCQ_dropped.head()

Unnamed: 0,Publication ID,title,abstract,abstract_clean
0,pub.1126880632,COVID-19 and what pediatric rheumatologists sh...,"On March 11th, 2020 the World Health Organizat...","On March 11th, 2020 the World Health Organizat..."
1,pub.1127834352,Hydroxychloroquine or chloroquine with or with...,"BACKGROUND: Hydroxychloroquine or chloroquine,...","BACKGROUND: Hydroxychloroquine or chloroquine,..."
2,pub.1126667578,Hydroxychloroquine in patients mainly with mil...,Abstract Objectives To assess the efficacy and...,Abstract Objectives To assess the efficacy and...
3,pub.1125404383,Of chloroquine and COVID-19,Recent publications have brought attention to ...,Recent publications have brought attention to ...
4,pub.1127182972,An independent appraisal and re-analysis of hy...,A recent open-label study claimed that hydroxy...,A recent open-label study claimed that hydroxy...


In [18]:
df_HCQ_dropped["abstract_clean"].iloc[1]

"BACKGROUND: Hydroxychloroquine or chloroquine, often in combination with a second-generation macrolide, are being widely used for treatment of COVID-19, despite no conclusive evidence of their benefit. Although generally safe when used for approved indications such as autoimmune disease or malaria, the safety and benefit of these treatment regimens are poorly evaluated in COVID-19. METHODS: We did a multinational registry analysis of the use of hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19. The registry comprised data from 671 hospitals in six continents. We included patients hospitalised between Dec 20, 2019, and April 14, 2020, with a positive laboratory finding for SARS-CoV-2. Patients who received one of the treatments of interest within 48 h of diagnosis were included in one of four treatment groups (chloroquine alone, chloroquine with a macrolide, hydroxychloroquine alone, or hydroxychloroquine with a macrolide), and patients who receive

In [19]:
# export 'df_HCQ_dropped' as a JSON file: HCQ_clean_abstracts.json
# (the prefix "data/" stores the file in the directory 'data', which must already exist)
df_HCQ_dropped.to_json("data/HCQ_clean_abstracts.json")
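
Since `to_json()` does not create missing directories, the target folder can be created first and the export checked by reading it back. A minimal sketch on invented data; the temporary directory and the file name `demo_abstracts.json` are stand-ins for the project folder and the real output file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented sample data in place of df_HCQ_dropped
demo = pd.DataFrame({
    "Publication ID": ["pub.1", "pub.2"],
    "abstract_clean": ["Some abstract text.", "Another abstract."],
})

# A temporary directory stands in for the project folder
base = Path(tempfile.mkdtemp())
out_dir = base / "data"
out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

# Write the frame, then read it back to check the round trip
out_file = out_dir / "demo_abstracts.json"
demo.to_json(out_file)
restored = pd.read_json(out_file)
print(restored)
```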