# Antimicrobial Resistance Knowledge Graph 

## Abstract

This study explores the integration of antibiotic resistance data from GenBank and RefSeq databases into the publicly accessible Knowledge Graph, Wikidata. Antibiotic resistance poses a significant global health threat, with the misuse of antibiotics leading to the emergence of resistant strains. The project focuses on modeling proteins, genes, chromosomes, bacterial strains, and species in Wikidata, using a comprehensive dataset from the National Center for Biotechnology Information (NCBI). The data cleaning process involves addressing discrepancies and extracting crucial information from the linked databases. Despite challenges, the implementation in Wikidata progresses, with ongoing efforts to link bacterial strains, genes, and proteins. The study highlights the need for standardized entries in databases and emphasizes the potential impact of integrating antibiotic resistance information into Wikidata for global accessibility and collaborative contributions.

## Background and Motivation

Antibiotics play a crucial role in the treatment of bacterial infections and save millions of lives every year. They are also used to accelerate the growth of livestock in factory farming. Due to this increased and sometimes unnecessary use, microorganisms develop resistance, rendering antibiotics ineffective. Resistance arises from genetic mutations that occur randomly in bacteria. If a mutation improves a bacterium's chance of survival, the bacterium survives the antibiotic treatment and passes the beneficial trait on to the next generation or to other bacteria.

Improper use of antibiotics, especially in developing countries, has led to the prevalence of many antibiotic-resistant bacteria today. Just as different antibiotics have different mechanisms of action (for example, beta-lactams target the bacterial cell wall), bacteria develop various defense mechanisms. For instance, bacteria can form efflux pumps, which expel antibiotics that have already entered the cell.

Infection with antibiotic-resistant bacteria poses a high risk to humans, as medical treatment becomes impossible. The World Health Organization (WHO) classifies Antimicrobial Resistance (AMR) as one of the three greatest medical threats, referring to it as a silent pandemic due to the high annual death toll (1.27 million people). By 2050, the WHO estimates that the number could rise to 10 million deaths per year, far surpassing the annual mortality rate of cancer.

Although various health organizations maintain data on numerous bacterial strains encoding antibiotic-resistant proteins, their databases are not interconnected, despite the urgent need for such integration. This work focuses on incorporating the largest publicly accessible database from the National Center for Biotechnology Information (NCBI) into a publicly accessible Knowledge Graph (Wikidata). Medical staff, scientists, and other interested people can then easily access this knowledge through Wikidata. [1]

## Material and Methods

Since the data in the publicly accessible Wikidata Knowledge Graph is intended to be available to everyone, the implementation concept follows a chain of protein, gene, chromosome, bacterial strain, and bacterial species. The premise is that the bacterial species is already present in Wikidata, so only the rest of the chain needs to be modeled.

As the data source, a public database from the National Center for Biotechnology Information (NCBI) is chosen [2]. It contains approximately 10,000 proteins that cause antibiotic resistance in bacteria. Each protein is linked to its type of antibiotic resistance and to the corresponding nucleotide and protein records in the RefSeq and GenBank databases. To model the chain described above (protein, gene, chromosome, and bacterial strain, starting from the bacterial species), the names of the gene and the bacterial strain must be extracted from the linked RefSeq and GenBank records. This is done with the "bioservice EUtils" and the RefSeq or GenBank accessions. Redundancies are built in deliberately (e.g., extracting the bacterial species via both RefSeq and GenBank, and via both protein and nucleotide records) so that the information can later be combined or the most complete data source chosen.
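
To make the extraction step more concrete, the following is a minimal sketch of how organism, strain, and taxonomy ID could be read from the `source` feature of a GenBank flat-file record. It is not the code of the `bioservice_fetcher` module used below, whose internals are not shown here. The sample record, the function names, and the regular expression are assumptions for illustration only, and the record would normally be fetched over the network first.

```python
import re

# Hypothetical excerpt of a GenBank flat-file "source" feature. The values mirror
# one row of the dataframe (Shigella sonnei, strain 119, taxon:624); the location
# "1..1185" is invented for this example.
SAMPLE_RECORD = '''
FEATURES             Location/Qualifiers
     source          1..1185
                     /organism="Shigella sonnei"
                     /mol_type="genomic DNA"
                     /strain="119"
                     /db_xref="taxon:624"
'''

# Matches qualifiers of the form /key="value"
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_source_qualifiers(record_text: str) -> dict:
    """Return the first value of every /key="value" qualifier in the text."""
    qualifiers = {}
    for key, value in QUALIFIER.findall(record_text):
        # Keep the first occurrence; later features may repeat keys such as db_xref.
        qualifiers.setdefault(key, value)
    return qualifiers


def organism_strain_taxid(record_text: str) -> tuple:
    """Extract (organism, strain, tax_id); missing qualifiers become None."""
    q = parse_source_qualifiers(record_text)
    return q.get("organism"), q.get("strain"), q.get("db_xref")


parsed = organism_strain_taxid(SAMPLE_RECORD)
print(parsed)
```

Returning `None` for missing qualifiers keeps the gaps visible, which is what the later comparison of redundant sources relies on.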


In [1]:
import pandas as pd 
import urllib.error
import urllib.request
import time 
import numpy as np 
import bioservice_fetcher as biof 
import os 
from typing import Optional 

### Fetch data from NCBI database 

The dataframe that is read in contains keys to access the protein and nucleotide records, either via the Reference Sequence (RefSeq) database or via GenBank. Let's load the dataframe and look at some values. The first attempt is to get the data from the RefSeq database.

In [2]:
# Reads everything from the linked RefSeq and GenBank databases that could possibly be interesting for this project

def fetch_data(read_from_web: bool = False, amount: Optional[int] = None) -> Optional[pd.DataFrame]: 
    """
    If read_from_web is True, the data is fetched from NCBI and the relevant fields are read from the GenBank or RefSeq database -- caution: this takes a very long time.
    Otherwise the data is read from the CSV file saved by the last run -- this should be used in most cases.
    If amount is given, only a random sample of that many rows is fetched.
    Returns None if no cached file exists.
    """
    if read_from_web: 
        # will take ~ 12h 
        url = "https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest/ReferenceGeneCatalog.txt"
        df = pd.read_csv(urllib.request.urlopen(url), delimiter="\t")
        if amount is not None:
            df = df.sample(amount)
        df[["refseq_parent_taxon", "refseq_protein", "refseq_parent_taxon2"]] = df.apply(biof.get_protein_and_parent, axis=1)
        print(1)
        df[["refseq_gene", "refseq_protein2", "refseq_genome", "refseq_organism", "refseq_tax_id"]] = df.apply(biof.get_strain_and_gene, axis=1)
        print(2)
        df[["genbank_organism", "genbank_strain"]] = df.apply(biof.get_organism_strain_via_prot,axis=1)
        print(3)
        df[["genbank_organism2", "genbank_strain2", "genbank_tax_id"]] = df.apply(biof.get_organism_strain_via_nuc, axis=1)
        df.to_csv("resistance_df.csv", index=False)
    else: 
        if not os.path.exists("resistance_df.csv"): 
            print("Cannot read from hard drive because file does not exist -- set read from web switch to True")
            return None
        df = pd.read_csv("resistance_df.csv")
    return df


df = fetch_data(False)
df.sample(5, random_state=19)


Unnamed: 0,allele,gene_family,whitelisted_taxa,product_name,scope,type,subtype,class,subclass,refseq_protein_accession,...,refseq_gene,refseq_protein2,refseq_genome,refseq_organism,refseq_tax_id,genbank_organism,genbank_strain,genbank_organism2,genbank_strain2,genbank_tax_id
3774,blaOXA-823,blaOXA,,OXA-10 family class D beta-lactamase OXA-823,core,AMR,AMR,BETA-LACTAM,BETA-LACTAM,WP_136512103.1,...,Pseudomonas aeruginosa HUPM19015969 blaOXA,OXA-10 family class D beta-lactamase OXA-823,,Pseudomonas aeruginosa,taxon:287,Pseudomonas aeruginosa,HUPM19015969,Pseudomonas aeruginosa,HUPM19015969,taxon:287
6706,,narB,,ionophore ABC transporter permease subunit NarB,plus,AMR,AMR,IONOPHORE,MADURAMICIN/NARASIN/SALINOMYCIN,,...,,,,,,Enterococcus faecium,WT1145,Enterococcus faecium,WT1145,taxon:1352
139,aac(6')-29a,aac(6')-29,,aminoglycoside N-acetyltransferase AAC(6')-29a,core,AMR,AMR,AMINOGLYCOSIDE,AMINOGLYCOSIDE,WP_064190968.1,...,Pseudomonas aeruginosa aac(6')-29,aminoglycoside N-acetyltransferase AAC(6')-29a,,Pseudomonas aeruginosa,taxon:287,,,Pseudomonas aeruginosa,,taxon:287
7508,,tet(D),,tetracycline efflux MFS transporter Tet(D),core,AMR,AMR,TETRACYCLINE,TETRACYCLINE,WP_001039466.1,...,Shigella sonnei 119 tet(D),tetracycline efflux MFS transporter Tet(D),,Shigella sonnei,taxon:624,Shigella sonnei,119,Shigella sonnei,119,taxon:624
1398,blaCAR-1,blaCAR,,subclass B3 metallo-beta-lactamase CAR-1,core,AMR,AMR,BETA-LACTAM,CARBAPENEM,WP_011094382.1,...,Pectobacterium atrosepticum SCRI1043 blaCAR,subclass B3 metallo-beta-lactamase CAR-1,,Pectobacterium atrosepticum SCRI1043,taxon:218491,Pectobacterium atrosepticum SCRI1043,SCRI1043,Pectobacterium atrosepticum SCRI1043,SCRI1043,taxon:218491


## Strain

Because several redundant columns are present, my goal is to extract the correct name of the bacterial strain. To do so, I am going to find the best data source for the organism name and combine it with the best source for the exact strain name.

In [3]:
df.keys()

Index(['allele', 'gene_family', 'whitelisted_taxa', 'product_name', 'scope',
       'type', 'subtype', 'class', 'subclass', 'refseq_protein_accession',
       'refseq_nucleotide_accession', 'curated_refseq_start',
       'genbank_protein_accession', 'genbank_nucleotide_accession',
       'genbank_strand', 'genbank_start', 'genbank_stop', 'refseq_strand',
       'refseq_start', 'refseq_stop', 'pubmed_reference', 'blacklisted_taxa',
       'synonyms', 'hierarchy_node', 'db_version', 'refseq_parent_taxon',
       'refseq_protein', 'refseq_parent_taxon2', 'refseq_gene',
       'refseq_protein2', 'refseq_genome', 'refseq_organism', 'refseq_tax_id',
       'genbank_organism', 'genbank_strain', 'genbank_organism2',
       'genbank_strain2', 'genbank_tax_id'],
      dtype='object')

In [4]:
# Just by looking at this small randomly sampled dataframe, "refseq_organism", "genbank_organism" and "genbank_organism2" yield similar results, 
# although "refseq_organism" has less information
# "refseq_parent_taxon" and "refseq_parent_taxon2" sometimes have the same information as the others and sometimes have a much higher taxon (e.g. line 8816 "Bacteria")
# For extracting the right organism name I am going to look closer into "refseq_organism", "genbank_organism" and "genbank_organism2"
df[["refseq_parent_taxon", "refseq_parent_taxon2",  "refseq_organism", "genbank_organism", "genbank_organism2"]].sample(10, random_state=100)

Unnamed: 0,refseq_parent_taxon,refseq_parent_taxon2,refseq_organism,genbank_organism,genbank_organism2
758,,,,Acidiphilium multivorum,Acidiphilium multivorum
9222,Enterobacteriaceae,Enterobacteriaceae,,Shigella sonnei,Shigella sonnei
8816,Bacteria,Bacteria,,Escherichia coli str. K-12 substr. MG1655,Escherichia coli str. K-12 substr. MG1655
3993,Klebsiella oxytoca,Klebsiella oxytoca,Klebsiella oxytoca,Klebsiella oxytoca,Klebsiella oxytoca
4874,Escherichia coli,Escherichia coli,Escherichia coli,Escherichia coli,Escherichia coli
6956,Citrobacter freundii,Citrobacter freundii,Citrobacter freundii,,Citrobacter freundii
277,Pseudomonadaceae,Pseudomonadaceae,Pseudomonas putida,,Pseudomonas putida
3978,Klebsiella michiganensis,Klebsiella michiganensis,Klebsiella michiganensis,Klebsiella michiganensis,Klebsiella michiganensis
8247,Enterococcus,Enterococcus,,Enterococcus faecium,Enterococcus faecium
2520,Lelliottia amnigena,Lelliottia amnigena,Lelliottia amnigena,Lelliottia amnigena,Lelliottia amnigena


In [5]:
# The above impression is confirmed: refseq_organism clearly contains the most empty fields
df[["genbank_organism", "genbank_organism2", "refseq_organism"]].isna().sum()

genbank_organism      368
genbank_organism2       0
refseq_organism      2499
dtype: int64

In [6]:
# Cases where the organisms from GenBank via protein (genbank_organism), GenBank via nucleotide (genbank_organism2) and RefSeq (refseq_organism) do not match 

diff_organism_df = df.loc[(df["genbank_organism"] != df["genbank_organism2"]) | (df["genbank_organism"] != df["refseq_organism"]), ["genbank_organism", 
                                                                                                                                    "genbank_organism2", 
                                                                                                                                    "refseq_organism"]]
diff_organism_df.sample(10, random_state=8)

Unnamed: 0,genbank_organism,genbank_organism2,refseq_organism
7123,Staphylococcus aureus,Staphylococcus aureus,
6122,Escherichia coli,Escherichia coli,
770,Enterococcus sp. JM4C,Enterococcus sp. JM4C,
8458,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,
6540,Enterobacter roggenkampii,Enterobacter roggenkampii,
999,,Acinetobacter baumannii AB4A3,Acinetobacter baumannii AB4A3
8553,Pseudomonas aeruginosa,Pseudomonas aeruginosa,
5718,Escherichia coli,Escherichia coli,
3126,,Pseudomonas aeruginosa,Pseudomonas aeruginosa
5948,Bacillus bingmayongensis,Bacillus bingmayongensis,


In [7]:
# As can be seen there are 35 issues where the organism found via GenBank (nucleotide) and via GenBank (protein) are different 
# What should be done here? Are these alternative names? -- Expert knowledge required: I'm going to drop them 
(x1 := diff_organism_df.loc[diff_organism_df["genbank_organism"] != diff_organism_df["genbank_organism2"], ["genbank_organism", "genbank_organism2"]].dropna())

Unnamed: 0,genbank_organism,genbank_organism2
126,Salmonella enterica subsp. enterica serovar Ty...,Salmonella virus Fels2
601,Bacillus anthracis str. Ames,Bacillus phage lambda Ba02
5475,Escherichia coli K-12,Escherichia coli
5697,Clostridioides difficile 630,Peptoclostridium phage p630P2
5721,Bacillus cereus ATCC 14579,Bacillus phage phBC6A52
5942,Bacillus cereus ATCC 14579,Bacillus phage phBC6A52
6214,Bacillus cereus ATCC 14579,Bacillus phage phBC6A52
6217,Staphylococcus epidermidis RP62A,Staphylococcus epidermidis RP62A phage SP-beta
6220,Bacillus phage lambda Ba03,Bacillus phage lambda Ba02
6257,Salmonella enterica subsp. enterica serovar Ty...,Salmonella virus Fels2


In [8]:
# Furthermore there are 5 issues between GenBank via protein (genbank_organism) and RefSeq 
# Are these alternative names? Expert knowledge required: I'm going to drop them as I'm not sure
(x2 := diff_organism_df.loc[diff_organism_df["genbank_organism"] != diff_organism_df["refseq_organism"], ["genbank_organism", 
                                                                                                          "refseq_organism"]].dropna())

Unnamed: 0,genbank_organism,refseq_organism
2032,Acinetobacter sp.,Acinetobacter baumannii
4045,Klebsiella michiganensis,Klebsiella michiganensis M5al
5056,Salmonella enterica,Salmonella enterica subsp. enterica serovar In...
6220,Bacillus phage lambda Ba03,Bacillus anthracis str. Ames
6236,Cytobacillus massiliigabonensis,Bacillus massiliigabonensis


I am going to choose the organism found in GenBank via the nucleotide record (genbank_organism2) as the base organism, because it has no NaN values and seems to be complete.

In [9]:
# drop data instances where I'm unsure 
df.drop(set(x1.index).union(set(x2.index)), inplace=True)

In [10]:
# genbank_strain carries either redundant information or no information at all -- genbank_strain2 will therefore be selected
df.loc[(df["genbank_strain2"] != df["genbank_strain"]) & ~(df["genbank_strain"].isna()), ["genbank_strain2", "genbank_strain", "genbank_organism2"]]

Unnamed: 0,genbank_strain2,genbank_strain,genbank_organism2


In [11]:
# Often the strain name is, as expected, in genbank_strain2; sometimes it is already included in genbank_organism2
# e.g. line 7743: Clostridium sp. MLG080-1 
df.loc[:, ["genbank_strain2", "genbank_organism2"]].sample(10, random_state=50)

Unnamed: 0,genbank_strain2,genbank_organism2
9018,FA19,Neisseria gonorrhoeae
7743,MLG080-1,Clostridium sp. MLG080-1
9165,WHO_U,Neisseria gonorrhoeae
3013,2318902,Pseudomonas aeruginosa
4,FC1K,Mycolicibacterium fortuitum
6795,SKLX003475,Klebsiella pneumoniae
1363,G4074,Elizabethkingia miricola
4295,185584,Pseudomonas aeruginosa
9014,FA19,Neisseria gonorrhoeae
5168,13S00929-3,Escherichia coli


To find the correct and full strain name, I will combine "genbank_organism2" and "genbank_strain2", but only if the strain is not already included in the organism name. Otherwise, the organism name alone is used.

In [12]:
def check_for_combination(df_row: pd.Series) -> bool:
    """
    Checks if genbank_organism and genbank_strain should be connected to one strain
    """
    if not isinstance(df_row["genbank_strain2"], str): 
        # don't combine organism and strain if is NaN --> Is this correct? 
        return False
    return not (df_row["genbank_strain2"].upper() in df_row["genbank_organism2"].upper())
    
    
df["strain"] = np.where(df.apply(check_for_combination , axis=1), 
                           df["genbank_organism2"] + " " + df["genbank_strain2"], 
                           df["genbank_organism2"])
df[["genbank_strain2", "genbank_organism2", "strain"]].sample(10, random_state=5)

Unnamed: 0,genbank_strain2,genbank_organism2,strain
5358,,Staphylococcus aureus,Staphylococcus aureus
850,HA-2,Hafnia alvei,Hafnia alvei HA-2
9038,VRCO0432,Klebsiella pneumoniae,Klebsiella pneumoniae VRCO0432
4021,SG271,Klebsiella spallanzanii,Klebsiella spallanzanii SG271
7357,NIPH56,Acinetobacter baumannii,Acinetobacter baumannii NIPH56
5862,13H1,Escherichia coli,Escherichia coli 13H1
2529,HD24,Klebsiella pneumoniae,Klebsiella pneumoniae HD24
613,,uncultured bacterium,uncultured bacterium
8818,K-12,Escherichia coli str. K-12 substr. MG1655,Escherichia coli str. K-12 substr. MG1655
4861,KK19,Klebsiella pneumoniae,Klebsiella pneumoniae KK19


In [13]:
# As can be seen, strain sometimes contains only the parent taxon name (e.g. Staphylococcus aureus), which is not correct. 
# I will drop instances where strain is no longer than two words. Expert knowledge required  
df = df[~(df["strain"].str.split().str.len() <= 2)]
df[["genbank_strain2", "genbank_organism2", "strain"]].sample(10, random_state=5)

Unnamed: 0,genbank_strain2,genbank_organism2,strain
6455,EC3769,Escherichia coli,Escherichia coli EC3769
5176,CLSiS 1590/96,Klebsiella pneumoniae,Klebsiella pneumoniae CLSiS 1590/96
8198,VC4477,Burkholderia cenocepacia,Burkholderia cenocepacia VC4477
2566,blaCLHK-3,Laribacter hongkongensis,Laribacter hongkongensis blaCLHK-3
1466,BS,Escherichia coli,Escherichia coli BS
5081,3343,Escherichia coli,Escherichia coli 3343
8839,PAO1,Pseudomonas aeruginosa PAO1,Pseudomonas aeruginosa PAO1
3313,268/2C,Acinetobacter baumannii,Acinetobacter baumannii 268/2C
3345,XM1570,Acinetobacter calcoaceticus,Acinetobacter calcoaceticus XM1570
9152,PNUSAS062732,Salmonella enterica,Salmonella enterica PNUSAS062732


## Gene

In [14]:
# The gene name is available in only one column, so every row with NaN there is dropped 
# gene_family was included in the original dataframe and matches the extracted gene -- should be fine, but expert knowledge is required 
df = df.dropna(subset=["refseq_gene"])
df[["gene_family", "refseq_gene"]].sample(10, random_state=10)

Unnamed: 0,gene_family,refseq_gene
4850,blaSHV,Klebsiella pneumoniae 90088 blaSHV
3156,blaOXA,Acinetobacter bereziniae Nec blaOXA
4217,blaPDC,Pseudomonas aeruginosa 163613 blaPDC
1131,blaADC,Acinetobacter pittii 56 GEIH blaADC
1183,blaADC,Acinetobacter nosocomialis 12A183 blaADC
2890,blaOXA,Acinetobacter baumannii U21-Benz-S1-1 blaOXA
7372,sul2,Salmonella enterica subsp. enterica serovar Ty...
4373,blaPDC,Pseudomonas aeruginosa 1800176 blaPDC
5726,dfrA1,Vibrio cholerae non-O1/non-O139 dfrA1
77,aac(6'),Yersinia mollaretii FE82747 aac(6')


## Search for connection to Wikidata

In [15]:
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

In [16]:


def search_parent_taxon(df_row: pd.Series) -> str | list[str] | None:
    """
    Searches Wikidata for a species of bacterium whose label matches the first two words of the strain name.
    Some names contain abbreviations, for example "Achromobacter sp.", which could stand for
    Achromobacter spanius or Achromobacter spiritinus; such matches are ambiguous and of little use.
    Returns one entity URI, a list of URIs for ambiguous abbreviations, or None if nothing matches.
    """
    parent = df_row["strain"].split()[:2]
    print(df_row["strain"])
    abbreviation = False
    for i, word in enumerate(parent): 
        if word[-1] == ".": 
            abbreviation = True
            parent[i] = word[:-1]
    parent = " ".join(parent).lower()
    query = f"""SELECT ?item ?itemLabel ?itemDescription
    WHERE {{
      ?item rdfs:label ?label;
            schema:description "species of bacterium"@en.
      
      FILTER(LANG(?label) = "en" && CONTAINS(LCASE(?label), "{parent}"))
      
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
    }}
    LIMIT 10
    """
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
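    results = []
    for _ in range(5):  # bounded retries instead of looping forever
        try:
            results = sparql.query().convert().get("results").get("bindings")
            break
        except urllib.error.HTTPError:
            # most likely rate limiting by the endpoint; wait and retry
            time.sleep(5)
        except Exception as err:
            # any other error will not go away by retrying; treat as "not found"
            print(f"Query failed for '{parent}': {err}")
            break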
    results = [(item.get("item").get("value"), item.get("itemLabel").get("value")) for item in results]
    if not results:
        return None
    if abbreviation: 
        return [res[0] for res in results] if len(results) > 1 else results[0][0]
    else: 
        for res in results: 
            if res[1].lower() in parent: 
                return res[0]


get_from_wikidata_switch = False
if get_from_wikidata_switch:
    # Will take ~ 2h
    df["parent_taxon"] = df.apply(search_parent_taxon, axis=1)
    df.to_csv("resistance_df2.csv", index=False)
else: 
    df = pd.read_csv("resistance_df2.csv")
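
Because the Wikidata lookup runs for hours, a query that keeps failing should not block the whole run. The following is a minimal, generic sketch of a bounded retry with exponential backoff. The helper name `retry`, its parameters, and the `flaky` demo function are assumptions for illustration and are not part of SPARQLWrapper or of the original notebook.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 5, base_delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Run call() up to `attempts` times, doubling the wait after each failure.

    Returns None when every attempt fails, so the caller can treat the row as
    "not found" instead of hanging.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception as err:  # in real use, narrow this to HTTP errors
            print(f"attempt {attempt + 1} failed: {err}")
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Demo with a function that fails twice and then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporarily unavailable")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = retry(flaky, sleep=waits.append)
print(result, waits)
```

In the lookup function, the query could then be wrapped as `retry(lambda: sparql.query().convert())`, with a `None` result handled like an empty result set.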

In [17]:
# The dataframe now contains a column called parent taxon which leads to the corresponding wikidata species of bacterium 
# Some parent taxon could not be found -- will be dropped later 
df[["product_name", "refseq_organism",  "strain", "parent_taxon", "refseq_gene"]].sample(5, random_state=9)

Unnamed: 0,product_name,refseq_organism,strain,parent_taxon,refseq_gene
1420,extended-spectrum class A beta-lactamase CTX-M-74,Enterobacter cloacae,Enterobacter cloacae JF216,http://www.wikidata.org/entity/Q4038096,Enterobacter cloacae JF216 blaCTX-M
5304,tetracycline efflux MFS transporter Tet(L),Latilactobacillus sakei,Latilactobacillus sakei Rits9,,Lactobacillus sakei Rits9 pLS55 tet(L)
1453,class C beta-lactamase DHA-18,Morganella morganii,Morganella morganii 984080,http://www.wikidata.org/entity/Q2696880,Morganella morganii 984080 blaDHA
3415,class C beta-lactamase PDC-156,Pseudomonas aeruginosa,Pseudomonas aeruginosa 1231451,http://www.wikidata.org/entity/Q31856,Pseudomonas aeruginosa 1231451 blaPDC
780,extended-spectrum class C beta-lactamase ADC-256,Acinetobacter baumannii,Acinetobacter baumannii 20A3025,http://www.wikidata.org/entity/Q3241189,Acinetobacter baumannii 20A3025 blaADC


### Handle special cases

For example: Escherichia coli K-12. 
I found this case by chance; there could be more. Expert knowledge required.

In [18]:
df.keys()

Index(['allele', 'gene_family', 'whitelisted_taxa', 'product_name', 'scope',
       'type', 'subtype', 'class', 'subclass', 'refseq_protein_accession',
       'refseq_nucleotide_accession', 'curated_refseq_start',
       'genbank_protein_accession', 'genbank_nucleotide_accession',
       'genbank_strand', 'genbank_start', 'genbank_stop', 'refseq_strand',
       'refseq_start', 'refseq_stop', 'pubmed_reference', 'blacklisted_taxa',
       'synonyms', 'hierarchy_node', 'db_version', 'refseq_parent_taxon',
       'refseq_protein', 'refseq_parent_taxon2', 'refseq_gene',
       'refseq_protein2', 'refseq_genome', 'refseq_organism', 'refseq_tax_id',
       'genbank_organism', 'genbank_strain', 'genbank_organism2',
       'genbank_strain2', 'genbank_tax_id', 'strain', 'parent_taxon'],
      dtype='object')

In [19]:
df["parent_taxon"] = np.where((df["genbank_organism2"].str.lower().str.contains("escherichia coli")) & (df["genbank_strain2"] == "K-12"), 
                             "https://www.wikidata.org/entity/Q21399437", 
                             df["parent_taxon"])
df[df["parent_taxon"] == "https://www.wikidata.org/entity/Q21399437"]

Unnamed: 0,allele,gene_family,whitelisted_taxa,product_name,scope,type,subtype,class,subclass,refseq_protein_accession,...,refseq_genome,refseq_organism,refseq_tax_id,genbank_organism,genbank_strain,genbank_organism2,genbank_strain2,genbank_tax_id,strain,parent_taxon
459,,aph(4)-Ia,,aminoglycoside O-phosphotransferase APH(4)-Ia,core,AMR,AMR,AMINOGLYCOSIDE,HYGROMYCIN,WP_000742814.1,...,,Escherichia coli K-12,taxon:83333,Escherichia coli K-12,K-12,Escherichia coli K-12,K-12,taxon:83333,Escherichia coli K-12,https://www.wikidata.org/entity/Q21399437
4443,,catA2,,type A-2 chloramphenicol O-acetyltransferase C...,core,AMR,AMR,PHENICOL,CHLORAMPHENICOL,WP_012477888.1,...,,Escherichia coli K-12,taxon:83333,Escherichia coli K-12,K-12,Escherichia coli K-12,K-12,taxon:83333,Escherichia coli K-12,https://www.wikidata.org/entity/Q21399437


## Drop useless data instances

In [20]:
# drop rows where no wikidata parent taxon (species of bacterium) was found 
df = df.dropna(subset="parent_taxon")

In [21]:
# These data instances are made up of abbreviations, which make it unclear to which taxon they belong. 
# e.g. Streptomyces sp. 769 could belong to Streptomyces sp. myrophorea (Q60748847), Streptomyces spiramyceticus (Q104909301) or Streptomyces sporangiiformans (Q104957131)
# Expert Knowledge required -- They also need to be dropped 
df.loc[:, "parent_taxon"] = df["parent_taxon"].astype(str)
df.loc[df["parent_taxon"].str.contains(r"\[|\]"), ["strain", "parent_taxon"]]

Unnamed: 0,strain,parent_taxon
38,Streptomyces sp. 769,"['http://www.wikidata.org/entity/Q60748845', '..."
46,Streptomyces sp. GBA 94-10 4N24,"['http://www.wikidata.org/entity/Q60748845', '..."
52,Streptomyces sp. SPB78,"['http://www.wikidata.org/entity/Q60748845', '..."
59,Streptomyces sp. NRRL S-1831,"['http://www.wikidata.org/entity/Q60748845', '..."
74,Streptomyces sp. MBRL 601,"['http://www.wikidata.org/entity/Q60748845', '..."
80,Streptomyces sp. M10,"['http://www.wikidata.org/entity/Q60748845', '..."
82,Streptomyces sp. KE1,"['http://www.wikidata.org/entity/Q60748845', '..."
89,Streptomyces sp. NRRL F-4711,"['http://www.wikidata.org/entity/Q60748845', '..."
90,Streptomyces sp. NRRL F-4707,"['http://www.wikidata.org/entity/Q60748845', '..."
1953,Streptomyces sp. NRRL S-1868,"['http://www.wikidata.org/entity/Q60748845', '..."


In [22]:
df = df[~df["parent_taxon"].str.contains(r"\[|\]")]

In [23]:
len(df) # About half of the data is lost after everything is cleaned

4563

In [24]:
# I now have a dataframe which contains antibiotic resistance class / subclass, protein name, gene, species of bacterium (wikidata) and bacterial strain 
# Could be implemented into wikidata like this 
df[["class", "subclass", "product_name", "refseq_gene", "parent_taxon", "strain"]].sample(5, random_state=5)

Unnamed: 0,class,subclass,product_name,refseq_gene,parent_taxon,strain
3752,BETA-LACTAM,CEPHALOSPORIN,inhibitor-resistant class C beta-lactamase PDC...,Pseudomonas aeruginosa 208176 blaPDC,http://www.wikidata.org/entity/Q31856,Pseudomonas aeruginosa 208176
266,AMINOGLYCOSIDE,KANAMYCIN/TOBRAMYCIN,aminoglycoside 6'-N-acetyltransferase AacA34,Klebsiella pneumoniae KP-PNK-1 aacA34,http://www.wikidata.org/entity/Q132592,Klebsiella pneumoniae KP-PNK-1
543,BETA-LACTAM,CEPHALOSPORIN,cephalosporin-hydrolyzing class C beta-lactama...,Enterobacter cloacae 963327 blaACT,http://www.wikidata.org/entity/Q4038096,Enterobacter cloacae 963327
5074,QUINOLONE,QUINOLONE,quinolone resistance pentapeptide repeat prote...,Citrobacter freundii V1 pCFV1 qnrB,http://www.wikidata.org/entity/Q5122842,Citrobacter freundii V1
1095,BETA-LACTAM,CEPHALOSPORIN,class C beta-lactamase CMY-124,Citrobacter freundii DNS-2 blaCMY,http://www.wikidata.org/entity/Q5122842,Citrobacter freundii DNS-2


In [25]:
# The NCBI taxonomy ID is the same in all data instances, regardless of whether it was accessed via RefSeq or GenBank 
any(df["genbank_tax_id"] != df["refseq_tax_id"])

False

In [26]:
# However, the NCBI taxonomy ID does not always have enough depth -- e.g. taxon 470 (Acinetobacter baumannii 16-02P46T-1 below)
# resolves in https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi to Acinetobacter baumannii, 
# but not to the corresponding strain (16-02P46T-1), which does not exist there yet. 
# Serratia marcescens W2.3 resolves to the wanted strain-level result: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi --> 1218513
# What should be done here? Expert knowledge required -- I am not comfortable importing the data without the correct taxonomy ID
# I will drop rows where strain and genbank_organism2 don't match -- basically undoing the combination step further up
df[["strain", "genbank_organism2", "genbank_strain2", "genbank_tax_id"]].sample(10, random_state=1)

Unnamed: 0,strain,genbank_organism2,genbank_strain2,genbank_tax_id
5129,Citrobacter braakii 107,Citrobacter braakii,107,taxon:57706
2998,Acinetobacter baumannii 16-02P46T-1,Acinetobacter baumannii,16-02P46T-1,taxon:470
3013,Acinetobacter baumannii 17A1872,Acinetobacter baumannii,17A1872,taxon:470
4113,Klebsiella pneumoniae 1409130,Klebsiella pneumoniae,1409130,taxon:573
30,Serratia marcescens W2.3,Serratia marcescens W2.3,W2.3,taxon:1218513
3965,Pseudomonas libanensis DSM 17149,Pseudomonas libanensis,DSM 17149,taxon:75588
865,Acinetobacter baumannii 23A3701,Acinetobacter baumannii,23A3701,taxon:470
4389,Vibrio alginolyticus Vb1833,Vibrio alginolyticus,Vb1833,taxon:663
445,Salmonella enterica subsp. enterica serovar Ty...,Salmonella enterica subsp. enterica serovar Ty...,,taxon:90371
3497,Pseudomonas aeruginosa 163604,Pseudomonas aeruginosa,163604,taxon:287


In [27]:
# These are the data instances that I feel comfortable adding to Wikidata, because I have the correct NCBI taxonomy ID for others to check. 
# Now only 312 instances are left 
df = df.loc[df.apply(lambda x: str(x["genbank_strain2"]) in x["genbank_organism2"], axis=1), :]
print(len(df))
df[["strain", "genbank_organism2", "genbank_strain2", "genbank_tax_id"]].sample(10)

312


Unnamed: 0,strain,genbank_organism2,genbank_strain2,genbank_tax_id
138,Acinetobacter baumannii 146457,Acinetobacter baumannii 146457,146457,taxon:1310623
269,Acinetobacter baumannii TG02011,Acinetobacter baumannii TG02011,TG02011,taxon:1315135
4954,Exiguobacterium sp. S3-2,Exiguobacterium sp. S3-2,S3-2,taxon:1389960
159,Enterococcus faecium SD3B-2,Enterococcus faecium SD3B-2,SD3B-2,taxon:1244155
161,Enterococcus hirae ATCC 9790,Enterococcus hirae ATCC 9790,ATCC 9790,taxon:768486
1597,Elizabethkingia anophelis NUHP1,Elizabethkingia anophelis NUHP1,NUHP1,taxon:1338011
5355,Bifidobacterium longum subsp. longum F8,Bifidobacterium longum subsp. longum F8,F8,taxon:722911
5194,Nocardia farcinica IFM 10152,Nocardia farcinica IFM 10152,IFM 10152,taxon:247156
1022,Vibrio parahaemolyticus S105,Vibrio parahaemolyticus S105,S105,taxon:1394641
78,Serratia marcescens MC620,Serratia marcescens MC620,MC620,taxon:1333585


In [28]:
df.to_csv("resistance_df3.csv", index=False)

## Results

After selecting and combining the best data sources (see the commented code above), a dataframe remains containing information about the protein, its encoding gene, and the bacterial strain. To establish a connection with Wikidata, the first two words of the bacterial strain (often the name of the corresponding bacterial species) are used in a SPARQL Wikidata query to find the associated bacterial species and its Wikidata item identifier (QID). In addition to the respective names, identifiers are available to link each name with the NCBI database. This connection is crucial to provide users with more comprehensive information and to allow experts to make improvements.
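
The species lookup can be illustrated with a small query builder. The sketch below is an alternative to the label search used in the notebook, not a copy of it: it assumes that an exact match on the taxon name property (P225) restricted to the rank species (P105, Q7432) is sufficient, and it skips abbreviated names instead of searching for them. The function names are hypothetical, and no request is sent.

```python
from typing import Optional


def species_name(strain: str) -> Optional[str]:
    """Return the first two words of a strain name, or None for 'sp.'-style names."""
    words = strain.split()[:2]
    if len(words) < 2 or any(word.endswith(".") for word in words):
        # Abbreviations such as "Streptomyces sp." do not identify one species.
        return None
    return " ".join(words)


def build_species_query(name: str) -> str:
    """Build a SPARQL query that matches the taxon name exactly (assumed properties)."""
    # Escape backslashes and quotes so the name cannot break the string literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE {\n"
        f'  ?item wdt:P225 "{escaped}" ;\n'   # P225: taxon name
        "        wdt:P105 wd:Q7432 .\n"        # P105: taxon rank, Q7432: species
        "}\n"
        "LIMIT 2"
    )


query = build_species_query(species_name("Klebsiella pneumoniae KP-PNK-1"))
print(query)
```

An exact match avoids scanning all English labels with `CONTAINS`, which may shorten the lookup, but it would miss species whose taxon name in Wikidata differs from the GenBank organism name.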

Furthermore, data is removed where details are missing or conflicting at critical points. The most extensive cleaning step, associated with the greatest data loss, follows from the observation that the NCBI taxonomy ID found for a row does not always refer to the bacterial strain but often to a higher taxon. This is unacceptable for the Wikidata implementation, so such instances are removed.
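
The strain-level check can be restated on a few rows taken from the tables above. This is a plain-Python sketch of the same idea as the dataframe filter, under the assumption that a taxonomy ID is strain-specific exactly when the GenBank organism name already contains the strain designation. The function name is hypothetical.

```python
from typing import Optional


def has_strain_level_taxon(organism: str, strain: Optional[str]) -> bool:
    """True when the organism name tied to the taxonomy ID already names the strain."""
    if not isinstance(strain, str) or not strain:
        # Missing strain designations (None or NaN) can never be strain-level.
        return False
    return strain in organism


# (genbank_organism2, genbank_strain2, genbank_tax_id) as shown in the sample output
rows = [
    ("Acinetobacter baumannii", "16-02P46T-1", "taxon:470"),
    ("Serratia marcescens W2.3", "W2.3", "taxon:1218513"),
    ("Pseudomonas aeruginosa", "163604", "taxon:287"),
]

kept = [row for row in rows if has_strain_level_taxon(row[0], row[1])]
print(kept)
```

Only the Serratia marcescens W2.3 row survives, which reflects why this step removes most of the data: species-level IDs such as taxon:470 are shared by many strains.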

In the final step, the data is added to Wikidata. A connection is established using "pywikibot". First, if not already present, the bacterial strain is created with its NCBI taxonomy ID. Next, the identified gene is created, referencing the bacterial strain. Finally, the protein encoded by the gene, which makes the bacterium antibiotic-resistant, is created with a reference to the gene and to the QuickGO term "response to antibiotic" (GO:0046677).
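
The planned statements can be written down as plain data before any edit is made, which makes a dry run easy to review. The sketch below only builds such a plan and does not call pywikibot. The property IDs (P685 NCBI taxonomy ID, P703 found in taxon, P688 encodes, P702 encoded by, P682 biological process) are my assumption of the matching Wikidata properties and should be verified. The `Claim` class, the `plan_claims` function, and the gene and protein labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str  # label of the item the statement is added to
    prop: str     # Wikidata property ID (assumed, verify before use)
    value: str    # literal, or label of the target item


def plan_claims(strain: str, tax_id: str, gene: str, protein: str) -> list:
    """List the statements linking strain, gene and protein, without editing Wikidata."""
    # The dataframe stores IDs as "taxon:1218513"; Wikidata expects the bare number.
    ncbi_id = tax_id.split(":", 1)[-1]
    return [
        Claim(strain, "P685", ncbi_id),        # NCBI taxonomy ID
        Claim(gene, "P703", strain),           # found in taxon
        Claim(gene, "P688", protein),          # encodes
        Claim(protein, "P702", gene),          # encoded by
        # In Wikidata the value would be the item of the GO term, not the raw GO ID.
        Claim(protein, "P682", "GO:0046677"),  # biological process: response to antibiotic
    ]


# Placeholder gene and protein labels; only strain and taxonomy ID come from the data.
claims = plan_claims("Serratia marcescens W2.3", "taxon:1218513",
                     "example gene", "example protein")
for claim in claims:
    print(claim)
```

Keeping the plan separate from the upload step mirrors the `sim` switch of the class below, which prints what would be written without changing anything.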

As of the submission deadline, the Wikidata implementation is still in progress. For instance, the bacterial strain Serratia marcescens VGH107 (Q124664471) has already been implemented. Additional data will be integrated in the coming days.

In [184]:
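import pandas as pd
import pywikibot
from SPARQLWrapper import JSON

# NOTE: `sparql` is assumed to be the SPARQLWrapper instance for the
# Wikidata query endpoint created in an earlier cell.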


class WikidataAdder: 

    site = pywikibot.Site("wikidata")
    repo = site.data_repository()

    def __init__(self, df_row: pd.Series, sim: bool): 
        self.df_row = df_row 
        self.strain_page: pywikibot.ItemPage = None
        self.sim = sim

    def return_strain_page(self): 
        if self.sim:
            return ""
        return self.strain_page.getID()

    def create_strain(self) -> None:
        """Look up the strain item by its English label (dry run if self.sim)."""
        if self.sim:
            print("Label: " + self.df_row["genbank_organism2"])
            print("Description: " + "bacterial strain")
            print("Alias: " + self.df_row["genbank_organism2"].split()[-1])
        else:
            query = f"""
                        SELECT ?item WHERE 
                            {{?item rdfs:label "{self.df_row["genbank_organism2"]}"@en}}
                    """
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert().get("results").get("bindings")
            print(results)
            if len(results) >= 2:
                # Ambiguous label: switch to simulation so later steps do not edit Wikidata
                print("more than one item found")
                self.sim = True
                return
            if not results:
                # Item creation is still disabled, so there is no page to inspect yet
                # self.strain_page = pywikibot.ItemPage(WikidataAdder.site)
                # self.strain_page.editLabels({"en": self.df_row["genbank_organism2"]}, summary="Setting new label")
                # self.strain_page.editDescriptions({"en": "bacterial strain"}, summary="Setting new description")
                # self.strain_page.editAliases({"en": [self.df_row["genbank_organism2"].split()[-1]]}, summary="Setting new alias")
                print("no item found")
                return
            # Exactly one match: load the existing item and show its current data
            self.strain_page = pywikibot.ItemPage(self.repo, results[0].get("item").get("value").split("/")[-1])
            data = self.strain_page.get()
            print(data["labels"])
            print(data["descriptions"])
            print(data["aliases"]["en"])
        

    def add_instance_of_strain(self):
        if not self.sim:
            claim = pywikibot.Claim(WikidataAdder.repo, u"P31")
            target = pywikibot.ItemPage(WikidataAdder.repo, u"Q855769")
            claim.setTarget(target)
            self.strain_page.addClaim(claim, summary=u'Adding claim')

    def add_taxon_name(self): 
        if not self.sim:
            stringclaim = pywikibot.Claim(WikidataAdder.repo, u'P225')
            stringclaim.setTarget(self.df_row["genbank_organism2"])
            self.strain_page.addClaim(stringclaim, summary=u'Adding taxon name')

    def add_parent_taxon(self):
        if not self.sim:
            claim = pywikibot.Claim(self.repo, u"P171")
            target = pywikibot.ItemPage(self.repo, self.df_row["parent_taxon"].split("/")[-1])
            claim.setTarget(target)
            self.strain_page.addClaim(claim, summary=u'Adding parent taxon')


def wikidata_wrapper(df_row: pd.Series) -> str:
    # sim=False queries and edits the live Wikidata; use sim=True for a dry run
    wa = WikidataAdder(df_row, sim=False)
    wa.create_strain()
    return "123"  # placeholder until the strain's Wikidata ID is returned here

In [186]:
##################
### DISCLAIMER ###
##################
# This is not finished yet - I have to dive deeper into this 
# will be finished in the coming days 

z = df.iloc[0:7].copy()
z["strain_wd_id"] = z.apply(wikidata_wrapper, axis=1)
z

[{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q124664031'}}]
<class 'pywikibot.page._collections.LanguageDict'>({'en': 'Pseudomonas aeruginosa PA38182'})
<class 'pywikibot.page._collections.LanguageDict'>({'en': 'bacterial strain'})
['PA38182']
[{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q124663344'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q124664085'}}]
more than one item found
[{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q21102987'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q124664088'}}]
more than one item found
[{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q21398890'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q124664402'}}, {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q124664429'}}]
more than one item found
[{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q21398562'}}, {'item': {

Unnamed: 0,allele,gene_family,whitelisted_taxa,product_name,scope,type,subtype,class,subclass,refseq_protein_accession,...,refseq_organism,refseq_tax_id,genbank_organism,genbank_strain,genbank_organism2,genbank_strain2,genbank_tax_id,strain,parent_taxon,strain_wd_id
0,,aac(2')-I(A267),,aminoglycoside N-acetyltransferase AAC(2')-I(A...,core,AMR,AMR,AMINOGLYCOSIDE,GENTAMICIN/TOBRAMYCIN,WP_025297907.1,...,Pseudomonas aeruginosa PA38182,taxon:1407059,Pseudomonas aeruginosa PA38182,PA38182,Pseudomonas aeruginosa PA38182,PA38182,taxon:1407059,Pseudomonas aeruginosa PA38182,http://www.wikidata.org/entity/Q31856,123
2,,aac(2')-IIb,,kasugamycin N-acetyltransferase AAC(2')-IIb,core,AMR,AMR,AMINOGLYCOSIDE,KASUGAMYCIN,WP_071224044.1,...,Paenibacillus sp. LC231,taxon:1120679,Paenibacillus sp. LC231,LC231,Paenibacillus sp. LC231,LC231,taxon:1120679,Paenibacillus sp. LC231,http://www.wikidata.org/entity/Q26270468,123
4,,aac(2')-Ic,,aminoglycoside N-acetyltransferase AAC(2')-Ic,core,AMR,AMR,AMINOGLYCOSIDE,GENTAMICIN/TOBRAMYCIN,WP_003899880.1,...,Mycobacterium tuberculosis H37Rv,taxon:83332,Mycobacterium tuberculosis H37Rv,H37Rv,Mycobacterium tuberculosis H37Rv,H37Rv,taxon:83332,Mycobacterium tuberculosis H37Rv,http://www.wikidata.org/entity/Q130971,123
20,,aac(3)-Ig,,aminoglycoside N-acetyltransferase AAC(3)-Ig,core,AMR,AMR,AMINOGLYCOSIDE,GENTAMICIN,WP_011468318.1,...,Saccharophagus degradans 2-40,taxon:203122,Saccharophagus degradans 2-40,2-40,Saccharophagus degradans 2-40,2-40,taxon:203122,Saccharophagus degradans 2-40,http://www.wikidata.org/entity/Q7396606,123
21,,aac(3)-Ii,,aminoglycoside N-acetyltransferase AAC(3)-Ii,core,AMR,AMR,AMINOGLYCOSIDE,GENTAMICIN,WP_011540937.1,...,Sphingopyxis alaskensis RB2256,taxon:317655,Sphingopyxis alaskensis RB2256,RB2256,Sphingopyxis alaskensis RB2256,RB2256,taxon:317655,Sphingopyxis alaskensis RB2256,http://www.wikidata.org/entity/Q21324563,123
27,,aac(6'),,aminoglycoside 6'-N-acetyltransferase,core,AMR,AMR,AMINOGLYCOSIDE,AMINOGLYCOSIDE,WP_004874306.1,...,Yersinia mollaretii ATCC 43969,taxon:349967,Yersinia mollaretii ATCC 43969,ATCC 43969,Yersinia mollaretii ATCC 43969,ATCC 43969,taxon:349967,Yersinia mollaretii ATCC 43969,http://www.wikidata.org/entity/Q16994539,123
30,,aac(6'),,aminoglycoside 6'-N-acetyltransferase,core,AMR,AMR,AMINOGLYCOSIDE,AMINOGLYCOSIDE,WP_019453091.1,...,Serratia marcescens W2.3,taxon:1218513,,,Serratia marcescens W2.3,W2.3,taxon:1218513,Serratia marcescens W2.3,http://www.wikidata.org/entity/Q140004,123
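The output above shows that a label lookup often returns more than one item. A lookup by NCBI Taxonomy ID (P685) may be less ambiguous, because the identifier is meant to be unique per taxon. The sketch below only builds such a query. `build_taxid_query` is a hypothetical helper, and whether this resolves the duplicates shown above has not been verified.

```python
def build_taxid_query(tax_id: str) -> str:
    """Build a SPARQL query that finds items by NCBI Taxonomy ID (P685).

    Accepts either the bare number or the "taxon:<number>" form used in the dataframe.
    """
    bare = tax_id.strip().removeprefix("taxon:")
    if not bare.isdigit():
        raise ValueError(f"not a numeric NCBI Taxonomy ID: {tax_id!r}")
    return f'SELECT ?item WHERE {{ ?item wdt:P685 "{bare}" . }}'


print(build_taxid_query("taxon:1407059"))
```

The resulting string could be passed to `sparql.setQuery` in the same way as the label query in `create_strain`.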


In [159]:
def add_ncbi_taxonomy_id(item_id, ncbi_taxonomy_id):
    site = pywikibot.Site("wikidata", "wikidata")
    repo = site.data_repository()

    # Load the Wikidata item
    item = pywikibot.ItemPage(repo, item_id)
    item.get()

    # Check if the identifier is already present
    if "P685" in item.claims:
        # P685 is an external identifier, so each claim target is a plain string
        existing_identifiers = [claim.getTarget() for claim in item.claims["P685"]]
        if ncbi_taxonomy_id in existing_identifiers:
            print(f"NCBI Taxonomy ID {ncbi_taxonomy_id} already exists for {item_id}.")
            return

    # Add the identifier to the item
    new_claim = pywikibot.Claim(repo, "P685")
    new_claim.setTarget(ncbi_taxonomy_id)
    item.addClaim(new_claim)

    print(f"NCBI Taxonomy ID {ncbi_taxonomy_id} added to {item_id}.")

# Example usage:
wikidata_item_id = "Q124664471"  # Replace with the Wikidata item ID you're working with
ncbi_taxonomy_id_to_add = "1263833"  # Replace with the NCBI Taxonomy ID you want to add

add_ncbi_taxonomy_id(wikidata_item_id, ncbi_taxonomy_id_to_add)

NCBI Taxonomy ID 1263833 added to Q124664471.


In [154]:
site = pywikibot.Site("test", "wikidata")
repo = site.data_repository()
item = pywikibot.ItemPage(repo, "Q124664471")
item.getID()

'Q124664471'

In [156]:

qualifier = pywikibot.Claim(repo, u'P685')
target = pywikibot.ItemPage(repo, "Q35409")
qualifier.setTarget(target)
claim.addQualifier(qualifier, summary=u'Adding a qualifier.')

AttributeError: 'str' object has no attribute 'on_item'

In [152]:
dir(item)

['DATA_ATTRIBUTES',
 '__abstractmethods__',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_cache_attrs',
 '_check_bot_may_edit',
 '_cmpkey',
 '_cosmetic_changes_hook',
 '_defined_by',
 '_getInternals',
 '_initialize_empty',
 '_latest_cached_revision',
 '_link',
 '_namespace',
 '_normalizeData',
 '_revisions',
 '_save',
 'addClaim',
 'applicable_protections',
 'autoFormat',
 'backlinks',
 'botMayEdit',
 'categories',
 'change_category',
 'clear_cache',
 'concept_uri',
 'content_model',
 'contributors',
 'coordinates',
 'create_short_link',
 'data_item',
 'data_repository',
 'defaultsort',
 'delete',
 'de

In [None]:
# iloc[0]: 
# iloc[1]: 
# iloc[2]: 
# iloc[3]: already exists
# iloc[4]: Q124664441
# iloc[5]: Q124664454
# iloc[6]: Q124664466
# iloc[7]: Q124664471

In [85]:
df.head(7)[["genbank_organism2", "parent_taxon"]]

Unnamed: 0,genbank_organism2,parent_taxon
0,Pseudomonas aeruginosa PA38182,http://www.wikidata.org/entity/Q31856
2,Paenibacillus sp. LC231,http://www.wikidata.org/entity/Q26270468
4,Mycobacterium tuberculosis H37Rv,http://www.wikidata.org/entity/Q130971
20,Saccharophagus degradans 2-40,http://www.wikidata.org/entity/Q7396606
21,Sphingopyxis alaskensis RB2256,http://www.wikidata.org/entity/Q21324563
27,Yersinia mollaretii ATCC 43969,http://www.wikidata.org/entity/Q16994539
30,Serratia marcescens W2.3,http://www.wikidata.org/entity/Q140004


## Discussion and Conclusion

In conclusion, data was successfully queried from two databases (GenBank and RefSeq). However, because the data is incomplete or inconsistent in some cases (references to higher taxa, abbreviations that make the bacterial species ambiguous, or different names for the same organism in GenBank and RefSeq), the original dataset of approximately 10,000 entries was reduced to 300 entries that can be reliably implemented.

Following the model of other bacteria, genes, and proteins already in Wikidata (with the addition of the "response to antibiotic" annotation), these selected entries are implemented. However, unlike for other genes already present in Wikidata, no "Entrez Gene ID" could be found for them. Despite this missing link, the genes are added to Wikidata in the hope that users will supply this information in the future.

In summary, only a small percentage of the antibiotic-resistant bacteria known to NCBI can currently be implemented in Wikidata. The main obstacles to implementing further data are incomplete or non-standardized entries in the GenBank and RefSeq databases and a lack of expert knowledge.

## Link to GitHub


https://github.com/gjmm07/DS_LOD_and_Knowledge_Graphs_2023_Finn_Heydemann

## Literature

[1] Salam et al.: Antimicrobial Resistance: A Growing Serious Threat for Global Public Health, 2023

[2] National Center for Biotechnology Information: National Database of Antibiotic Resistant Organisms (NDARO), URL: https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/, last accessed: 26.02.2024