First, we load the single-cell studies database:

In [1]:
import pandas as pd
data = pd.read_csv('http://nxn.se/single-cell-studies/data.tsv', sep='\t')

In [2]:
data.head(3)

Unnamed: 0,Shorthand,DOI,Authors,Journal,Title,Date,bioRxiv DOI,Reported cells total,Organism,Tissue,...,Developmental stage,Number of reported cell types or clusters,Cell clustering,Pseudotime,RNA Velocity,PCA,tSNE,H5AD location,Isolation,BC --> Cell ID _OR_ BC --> Cluster ID
0,Cauli et al PNAS,10.1073/pnas.97.11.6144,"B. Cauli, J. T. Porter, K. Tsuzuki, B. Lambole...",Proceedings of the National Academy of Sciences,Classification of fusiform neocortical interne...,20020726,-,85,Rat,Brain,...,21-27 days,3.0,Yes,No,No,Yes,No,,Patch-clamp,
1,Malnic et al Cell,10.1016/S0092-8674(00)80581-4,"Bettina Malnic, Junzo Hirono, Takaaki Sato, Li...",Cell,Combinatorial Receptor Codes for Odors,20040410,-,18,Mouse,Brain,...,,,,,,,,,,
2,Tietjen et al Neuron,10.1016/S0896-6273(03)00229-0,"Ian Tietjen, Jason M. Rihel, Yanxiang Cao, Geo...",Neuron,Single-Cell Transcriptional Analysis of Neuron...,20040415,-,37,"Human, Mouse",Brain,...,,6.0,,,,,,,"Manual, LCM",


In [3]:
data.columns

Index(['Shorthand', 'DOI', 'Authors', 'Journal', 'Title', 'Date',
       'bioRxiv DOI', 'Reported cells total', 'Organism', 'Tissue',
       'Technique', 'Data location', 'Panel size', 'Measurement',
       'Cell source', 'Disease', 'Contrasts', 'Developmental stage',
       'Number of reported cell types or clusters', 'Cell clustering',
       'Pseudotime', 'RNA Velocity', 'PCA', 'tSNE', 'H5AD location',
       'Isolation', 'BC --> Cell ID _OR_ BC --> Cluster ID'],
      dtype='object')

The goal is to reconcile the most useful fields against Wikidata. The following columns are of interest:


In [4]:
cols_of_interest = ["DOI", "Organism", "Tissue", "Technique", "Data location" ]


In [5]:
# Take a copy so that assigning to a column later does not trigger SettingWithCopyWarning.
data = data[cols_of_interest].copy()

In [6]:
data

Unnamed: 0,DOI,Organism,Tissue,Technique,Data location
0,10.1073/pnas.97.11.6144,Rat,Brain,sc-RT-mPCR,
1,10.1016/S0092-8674(00)80581-4,Mouse,Brain,PCR,
2,10.1016/S0896-6273(03)00229-0,"Human, Mouse",Brain,PCR,
3,10.1093/cercor/bhj081,Rat,Brain,sc-RT-mPCR,
4,10.1093/nar/gkl050,Mouse,ICM,aRNA amplification,GSE4309
...,...,...,...,...,...
1214,10.1186/s13059-021-02267-5,Human,Culture,Chromium,GSE142392
1215,10.1038/s42003-020-01625-6,Human,"Blood, Tumor",Chromium,GSE121638
1216,10.7554/eLife.62586,,,,
1217,10.3390/biom11020177,Human,Culture,,


In [7]:

from wikidata2df import wikidata2df

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
        
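def get_doi_df(doi_list):
    """Look up Wikidata items by DOI (property P356) for a list of DOI strings."""
    # Quote each DOI so it can be placed inside the SPARQL VALUES block.
    doi_list = ['"' + doi + '"' for doi in doi_list]
    doi_string = " ".join(doi_list)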
    query = """
    SELECT ?normalized_doi ?item  ?itemLabel
    WHERE {
      {
        SELECT ?item ?normalized_doi WHERE {
          VALUES ?doi {

          """ + doi_string +"""

          }
          BIND(UCASE(?doi) AS ?normalized_doi)
          ?item wdt:P356 ?normalized_doi.
        }
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }
    """
    doi_df = wikidata2df(query)
    return doi_df
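
To make the batching step concrete without calling the query service, here is a minimal offline sketch of how a list of DOIs becomes the strings that go into the `VALUES` block. `build_values_clause` is a hypothetical helper that mirrors the first two lines of `get_doi_df`; dropping missing DOIs first is my assumption, since concatenating a string with a missing value would fail.

```python
import pandas as pd

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def build_values_clause(dois):
    """Hypothetical helper: quote each DOI and join them for a SPARQL VALUES block."""
    return " ".join('"' + doi + '"' for doi in dois)

# A few DOIs from the table above, plus one missing value.
dois = pd.Series([
    "10.1093/cercor/bhj081",
    None,
    "10.1093/nar/gkl050",
    "10.7554/eLife.62586",
])

# Drop missing entries before quoting (assumption: the real column may contain gaps).
clean = dois.dropna().tolist()

# Batches of 2 here only to keep the example small.
batches = list(chunks(clean, 2))
clauses = [build_values_clause(batch) for batch in batches]
print(clauses)
```

Each element of `clauses` is what would be spliced into one query, so the number of requests equals the number of batches.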

In [8]:
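# Missing DOIs cannot be quoted for the query, so drop them first.
doi_list = data["DOI"].dropna().tolist()

# Query in batches of 99 DOIs to keep each SPARQL request small.
dois_in_chunks = chunks(doi_list, 99)

# DataFrame.append was removed in pandas 2.0: collect the per-batch results and concatenate once.
frames = [get_doi_df(doi_chunk) for doi_chunk in dois_in_chunks]

df_now = pd.concat(frames)[["itemLabel", "normalized_doi", "item"]]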
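
Per-batch results can be gathered in a list and combined with a single `pd.concat` call, which also works on pandas 2.0 and later, where `DataFrame.append` is no longer available. The sketch below uses `fake_lookup`, a made-up stand-in for `get_doi_df`, so it runs without network access; the DOIs and item IDs are placeholders, not real records.

```python
import pandas as pd

def fake_lookup(dois):
    """Stand-in for get_doi_df: returns one row per DOI without touching the network."""
    return pd.DataFrame({
        "itemLabel": ["title for " + d for d in dois],
        "normalized_doi": [d.upper() for d in dois],
        "item": ["Q" + str(i) for i, _ in enumerate(dois)],  # placeholder IDs
    })

batches = [["10.1/a", "10.1/b"], ["10.1/c"]]

# One DataFrame per batch, combined in a single call.
frames = [fake_lookup(batch) for batch in batches]
result = pd.concat(frames, ignore_index=True)
print(result)
```

Passing `ignore_index=True` gives the combined table a fresh 0..n-1 index instead of repeating each batch's own row numbers.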

In [9]:
df_now


Unnamed: 0,itemLabel,normalized_doi,item
0,Classification of NPY-expressing neocortical i...,10.1523/JNEUROSCI.0058-09.2009,Q30490498
1,Single-cell transcriptional analysis of neuron...,10.1016/S0896-6273(03)00229-0,Q30921133
2,mRNA-Seq whole-transcriptome analysis of a sin...,10.1038/NMETH.1315,Q28240611
3,Modelling and measuring single cell RNA expres...,10.1186/1471-2164-9-268,Q33340138
4,Combinatorial receptor codes for odors,10.1016/S0092-8674(00)80581-4,Q29616773
...,...,...,...
1,Distinct developmental pathways from blood mon...,10.1016/J.IMMUNI.2020.12.003,Q104684648
2,Spatiotemporal analysis of human intestinal de...,10.1016/J.CELL.2020.12.016,Q104754414
3,Single-Cell Mapping of Progressive Fetal-to-Ad...,10.1016/J.CELREP.2020.108573,Q104754435
4,Characterization of a common progenitor pool o...,10.1126/SCIENCE.ABB2986,Q104794656


In [10]:
# Upper-case the DOIs so they match the normalized_doi values returned by the query.
data["DOI"] = data["DOI"].str.upper()

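The requests above were split into chunks to retrieve all DOIs. Now let's merge the Wikidata matches back into the table, joining on the upper-cased DOI: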

In [11]:
data = data.merge(df_now, left_on="DOI", right_on="normalized_doi", how="left")
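
As a small illustration of what the left merge does, the sketch below runs the same kind of join on toy data and uses the `indicator=True` option of `merge` to list the DOIs that found no match. The frames, DOIs, and item IDs are invented for the example.

```python
import pandas as pd

# Toy stand-ins for the studies table and the Wikidata lookup result.
studies = pd.DataFrame({"DOI": ["10.1/A", "10.1/B", "10.1/C"]})
matches = pd.DataFrame({
    "normalized_doi": ["10.1/A", "10.1/C"],
    "item": ["Q1", "Q2"],  # placeholder IDs
})

# how="left" keeps every study; indicator=True adds a "_merge" column
# saying whether each row was found on both sides or only on the left.
merged = studies.merge(
    matches, left_on="DOI", right_on="normalized_doi", how="left", indicator=True
)

missing = merged.loc[merged["_merge"] == "left_only", "DOI"].tolist()
print(missing)
```

A list like `missing` could serve as the starting point for adding the absent articles later.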

In [19]:
data["normalized_doi"].isnull().sum()

187

187 articles in the list did not match a Wikidata item by DOI.
Eventually we could build something to add those automatically.
For now, I will stick to the ones that are already present.

In [25]:
data.dropna(subset=["normalized_doi"]).to_csv("reconciled_articles.csv", sep="\t", index=False)
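
Note that the file has a `.csv` extension but is tab-separated, so it has to be read back with `sep="\t"`. Here is a minimal round-trip sketch on an in-memory buffer with made-up rows. It also shows that multi-valued cells such as "Human, Mouse" survive intact, because the comma is not the delimiter.

```python
import io
import pandas as pd

# Toy row with placeholder values; "Human, Mouse" contains a comma on purpose.
df = pd.DataFrame({
    "DOI": ["10.1/A"],
    "Organism": ["Human, Mouse"],
    "item": ["Q1"],
})

# Write tab-separated text to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)

# Read it back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
print(restored)
```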