# Antimicrobial Resistance Knowledge Graph 

## Abstract


abc

## Background and Motivation

Antibiotics play a crucial role in the treatment of bacterial infections. They save millions of lives every year, but they are also used to accelerate the growth of livestock in factory farming. Due to this increased and sometimes unnecessary use, individual microorganisms develop resistance, rendering antibiotics ineffective. Antibiotic resistance arises from genetic mutations that occur randomly in bacteria. If a mutation improves the bacterium's chance of survival, the bacterium survives the antibiotic treatment and passes the beneficial trait on to other bacteria or to the next generation.

Improper use of antibiotics, especially in developing countries, has led to the prevalence of many antibiotic-resistant bacteria today. Just as different antibiotics have different mechanisms of action (for example, beta-lactams target the bacterial cell wall), bacteria develop various defense mechanisms. For instance, bacteria can form an efflux pump, which expels antibiotics that have already entered the cell.

Infection with antibiotic-resistant bacteria poses a high risk to humans, because standard antibiotic treatment may no longer work. The World Health Organization (WHO) classifies antimicrobial resistance (AMR) as one of the three greatest medical threats and refers to it as a silent pandemic because of its high annual death toll (1.27 million people). By 2050, the WHO estimates that the number could rise to 10 million deaths per year, surpassing the annual number of deaths from cancer.

Although various health organizations know of numerous bacterial strains that encode antibiotic resistance proteins, their databases are not interconnected, despite the urgent need for such integration. This work focuses on incorporating the largest publicly accessible database, provided by the National Center for Biotechnology Information (NCBI), into a publicly accessible knowledge graph (Wikidata). Medical staff, scientists, and other interested people can then easily access this knowledge through Wikidata.

## Material and Methods

Because the data in the publicly accessible Wikidata knowledge graph is intended to be available to everyone, the data model follows the chain of proteins, genes, chromosomes, bacterial strains, and bacterial species. The premise is that the bacterial species is already present in Wikidata, so only the rest of the chain (strain, chromosome, gene, protein) needs to be modeled.
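The chain described above can be pictured as a small set of statements that hang a new strain, gene, and protein onto an existing species item. The following is a minimal sketch only: the `Statement` class and `build_chain` function are hypothetical, the chromosome level is left out for brevity, and the choice of Wikidata properties (P171 "parent taxon", P703 "found in taxon", P702 "encoded by") is an assumption, not necessarily the mapping used in this project. The example values are taken from the cleaned dataframe shown later in this notebook.

```python
from dataclasses import dataclass

# Wikidata property IDs, used here for illustration only (assumed mapping)
PARENT_TAXON = "P171"    # strain -> species
FOUND_IN_TAXON = "P703"  # gene or protein -> strain
ENCODED_BY = "P702"      # protein -> gene


@dataclass(frozen=True)
class Statement:
    """One subject-property-object statement of the planned graph (hypothetical helper)."""
    subject: str
    prop: str
    obj: str


def build_chain(species_qid: str, strain: str, gene: str, protein: str) -> list:
    """Return the statements that link protein -> gene -> strain -> existing species."""
    return [
        Statement(strain, PARENT_TAXON, species_qid),
        Statement(gene, FOUND_IN_TAXON, strain),
        Statement(protein, ENCODED_BY, gene),
        Statement(protein, FOUND_IN_TAXON, strain),
    ]


chain = build_chain(
    species_qid="Q31856",  # Pseudomonas aeruginosa, as found in the parent_taxon column
    strain="Pseudomonas aeruginosa NCGM 3848",
    gene="blaIMP",
    protein="subclass B1 metallo-beta-lactamase IMP-78",
)
for statement in chain:
    print(statement)
```

Only the species item is assumed to exist already; every other subject in the list would become a new item.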

A public database from NCBI serves as the data source. It contains approximately 10,000 proteins that cause antibiotic resistance in bacteria. Each protein entry states the type of antibiotic resistance and references the corresponding nucleotide and protein records in the RefSeq and GenBank databases. To model the chain described above (protein, gene, chromosome, and bacterial strain, starting from the bacterial species), the names of the gene and the bacterial strain must be extracted from the linked RefSeq and GenBank records. This information is retrieved with the "bioservice EUtils" using the RefSeq or GenBank references. Redundancies are deliberately built in (e.g., extracting the bacterial species via RefSeq and GenBank, as well as via protein and nucleotide) so that the information can later be combined or the most complete data source can be chosen.
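One way such redundancy could be used is to take the first non-missing value across the redundant columns in a fixed priority order. The sketch below illustrates that idea on toy data; the helper `coalesce_columns` is hypothetical, and the notebook itself later settles on a single source column instead of merging.

```python
import pandas as pd


def coalesce_columns(df: pd.DataFrame, columns: list) -> pd.Series:
    """Return, per row, the first non-missing value among `columns` (in the given order)."""
    result = df[columns[0]]
    for col in columns[1:]:
        # fill only the gaps that are still open with values from the next source
        result = result.fillna(df[col])
    return result


toy = pd.DataFrame({
    "genbank_organsim_nuc": ["Serratia marcescens", None, None],
    "genbank_organsim_prot": [None, "Paracoccus denitrificans PD1222", None],
    "refseq_organism": ["Serratia marcescens", None, None],
})
merged = coalesce_columns(
    toy, ["genbank_organsim_nuc", "genbank_organsim_prot", "refseq_organism"]
)
print(merged)
```

The priority order matters whenever two sources disagree, so it should be chosen only after the sources have been compared.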


In [2]:
import pandas as pd 
import urllib.error
import urllib.request
import time 
import numpy as np 
import bioservice_fetcher as biof 
import os 

### Fetch data from NCBI database 

The dataframe contains keys to access the protein and nucleotide records, either via the Reference Sequence (RefSeq) database or via GenBank. Let's load the dataframe and look at some values. The first attempt is to get the data from the RefSeq database.

In [2]:
# Reads everything that could possibly be interesting for this project

def fetch_data(read_from_web: bool = False) -> pd.DataFrame: 
    """
    If read_from_web is True, the catalog is downloaded from NCBI and the fields of interest are read from the GenBank and RefSeq databases -- caution: this takes a very long time.
    Otherwise the data is read from the CSV file saved by the last download -- this should be used in most cases.
    Returns None if no saved file exists.
    """
    if read_from_web: 
        url = "https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest/ReferenceGeneCatalog.txt"
        df = pd.read_csv(urllib.request.urlopen(url), delimiter="\t")
        # df = df.sample(10)
        df.apply(biof.get_protein_and_parent, axis=1)
        df.apply(biof.get_strain_and_gene, axis=1)
        df.apply(biof.get_organism_strain_via_prot,axis=1)
        df.apply(biof.get_organism_strain_via_nuc, axis=1)
        df.to_csv("resistance_df.csv", index=False)
    else: 
        if not os.path.exists("resistance_df.csv"): 
            print("Cannot read from hard drive because file does not exist -- set read from web switch to True")
            return None
        df = pd.read_csv("resistance_df.csv")
    return df


df = fetch_data()
df.sample(5, random_state=19)


Unnamed: 0,allele,gene_family,whitelisted_taxa,product_name,scope,type,subtype,class,subclass,refseq_protein_accession,...,refseq_protein,refseq_genome,refseq_organism,refseq_parent,rrefseq_protein,refseq_same_parent,genbank_organsim_nuc,genbank_strain_nuc,genbank_organsim_prot,genbank_strain_prot
4666,blaPLA-4,blaPLA,,class A beta-lactamase PLA-4,core,AMR,AMR,BETA-LACTAM,BETA-LACTAM,WP_032687534.1,...,class A beta-lactamase PLA-4,,Raoultella planticola,Raoultella planticola,class A beta-lactamase PLA-4,Raoultella planticola,Raoultella planticola,42-1,Raoultella planticola,42-1
5397,catB6,catB,,type B-3 chloramphenicol O-acetyltransferase C...,core,AMR,AMR,PHENICOL,CHLORAMPHENICOL,WP_063843229.1,...,type B-3 chloramphenicol O-acetyltransferase C...,,Pseudomonas aeruginosa,Gammaproteobacteria,type B-3 chloramphenicol O-acetyltransferase C...,Gammaproteobacteria,Pseudomonas aeruginosa,,Pseudomonas aeruginosa,
8260,fusA_G452S,fusA,Staphylococcus_aureus,elongation factor G,core,AMR,POINT,FUSIDIC ACID,FUSIDIC ACID,WP_000090315.1,...,,,,Staphylococcus,elongation factor G,Staphylococcus,Staphylococcus aureus subsp. aureus Mu50,Mu50,Staphylococcus aureus subsp. aureus Mu50,Mu50
1990,blaFLC-1,blaFRI,,FRI family carbapenem-hydrolyzing class A beta...,core,AMR,AMR,BETA-LACTAM,CARBAPENEM,WP_123061077.1,...,FRI family carbapenem-hydrolyzing class A beta...,,Enterobacter cloacae,Enterobacter cloacae,FRI family carbapenem-hydrolyzing class A beta...,Enterobacter cloacae,Enterobacter cloacae,FRI-3442,Enterobacter cloacae,FRI-3442
7825,,vanXY-N,,D-Ala-D-Ala dipeptidase/D-Ala-D-Ala carboxypep...,core,AMR,AMR,GLYCOPEPTIDE,VANCOMYCIN,WP_063856820.1,...,D-Ala-D-Ala dipeptidase/D-Ala-D-Ala carboxypep...,,Enterococcus faecium,Enterococcus,D-Ala-D-Ala dipeptidase/D-Ala-D-Ala carboxypep...,Enterococcus,Enterococcus faecium,UCN71,Enterococcus faecium,UCN71


## Strain

Because quite a bit of redundant data is present, the goal is to extract the correct name of the bacterial strain. To do so, I am going to find the best data source for the organism name and combine it with the best source for the exact strain name.

In [3]:
df.keys()

Index(['allele', 'gene_family', 'whitelisted_taxa', 'product_name', 'scope',
       'type', 'subtype', 'class', 'subclass', 'refseq_protein_accession',
       'refseq_nucleotide_accession', 'curated_refseq_start',
       'genbank_protein_accession', 'genbank_nucleotide_accession',
       'genbank_strand', 'genbank_start', 'genbank_stop', 'refseq_strand',
       'refseq_start', 'refseq_stop', 'pubmed_reference', 'blacklisted_taxa',
       'synonyms', 'hierarchy_node', 'db_version', 'refseq_gene',
       'refseq_protein', 'refseq_genome', 'refseq_organism', 'refseq_parent',
       'rrefseq_protein', 'refseq_same_parent', 'genbank_organsim_nuc',
       'genbank_strain_nuc', 'genbank_organsim_prot', 'genbank_strain_prot'],
      dtype='object')

In [4]:
# Judging by this small random sample, "refseq_organism", "genbank_organsim_nuc" and "genbank_organsim_prot" yield very similar results, 
# although "refseq_organism" carries less information
# "refseq_parent" and "refseq_same_parent" sometimes carry the same organism name, but sometimes a higher taxon
# To extract the right organism name I am going to look closer at "refseq_organism", "genbank_organsim_nuc" and "genbank_organsim_prot"
df[["refseq_organism", "refseq_parent", "refseq_same_parent", "genbank_organsim_nuc", "genbank_organsim_prot"]].sample(10, random_state=10)

Unnamed: 0,refseq_organism,refseq_parent,refseq_same_parent,genbank_organsim_nuc,genbank_organsim_prot
389,Simplicispira metamorpha,Simplicispira metamorpha,Simplicispira metamorpha,Simplicispira metamorpha,Simplicispira metamorpha
2293,Klebsiella pneumoniae subsp. pneumoniae,Klebsiella pneumoniae,Klebsiella pneumoniae,Klebsiella pneumoniae subsp. pneumoniae,Klebsiella pneumoniae subsp. pneumoniae
6959,Escherichia coli,Gammaproteobacteria,Gammaproteobacteria,Escherichia coli,Escherichia coli
8684,,Klebsiella,Klebsiella,Klebsiella pneumoniae,Klebsiella pneumoniae
4144,Pseudomonas aeruginosa,Pseudomonas aeruginosa,Pseudomonas aeruginosa,Pseudomonas aeruginosa,Pseudomonas aeruginosa
1563,Klebsiella pneumoniae,Enterobacteriaceae,Enterobacteriaceae,Klebsiella pneumoniae,Klebsiella pneumoniae
6553,Acinetobacter baumannii,Bacteria,Bacteria,Acinetobacter baumannii,Acinetobacter baumannii
3522,Campylobacter jejuni,Campylobacter jejuni,Campylobacter jejuni,Campylobacter jejuni,Campylobacter jejuni
7193,,,,Escherichia phage 933W,Escherichia phage 933W
4930,Escherichia coli,Escherichia coli,Escherichia coli,Escherichia coli,Escherichia coli


In [5]:
df[["genbank_organsim_prot", "genbank_organsim_nuc", "refseq_organism"]]

Unnamed: 0,genbank_organsim_prot,genbank_organsim_nuc,refseq_organism
0,Pseudomonas aeruginosa PA38182,Pseudomonas aeruginosa PA38182,Pseudomonas aeruginosa PA38182
1,Burkholderia glumae,Burkholderia glumae,Burkholderia glumae
2,Paenibacillus sp. LC231,Paenibacillus sp. LC231,Paenibacillus sp. LC231
3,Providencia stuartii,Providencia stuartii,Providencia stuartii
4,Mycolicibacterium fortuitum,Mycolicibacterium fortuitum,Mycolicibacterium fortuitum
...,...,...,...
9184,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252,
9185,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252,
9186,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252,
9187,Burkholderia mallei ATCC 23344,Burkholderia mallei ATCC 23344,


In [6]:
# 2651 cases where the organisms found in genbank via protein, genbank via nucleotide and refseq do not match 

organism_cols = ["genbank_organsim_prot", "genbank_organsim_nuc", "refseq_organism"]
diff_organism_df = df.loc[(df["genbank_organsim_nuc"] != df["genbank_organsim_prot"])
                          | (df["genbank_organsim_nuc"] != df["refseq_organism"]), organism_cols]
diff_organism_df

Unnamed: 0,genbank_organsim_prot,genbank_organsim_nuc,refseq_organism
34,,Serratia marcescens,Serratia marcescens
36,,uncultured bacterium,uncultured bacterium
45,Paracoccus denitrificans PD1222,Paracoccus denitrificans PD1222,
52,,Escherichia coli APEC O1,Escherichia coli APEC O1
58,,Streptomyces xiaopingdaonensis,Streptomyces xiaopingdaonensis
...,...,...,...
9184,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252,
9185,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252,
9186,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252,
9187,Burkholderia mallei ATCC 23344,Burkholderia mallei ATCC 23344,


In [7]:
# But genbank via nucleotide has no NaN -- genbank via nucleotide seems to be the best source: have a look at the exact differences 
diff_organism_df.isna().sum()

genbank_organsim_prot     363
genbank_organsim_nuc        0
refseq_organism          2396
dtype: int64
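Because NaN never compares equal to anything in pandas, the mismatch count from the `!=` comparison mixes two different situations: rows where one source is simply missing and rows where two sources truly disagree. The following minimal sketch separates the two on toy data; the function `split_mismatches` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def split_mismatches(a: pd.Series, b: pd.Series) -> pd.Series:
    """Label each row 'equal', 'missing' (at least one value is NaN) or 'conflict'."""
    missing = a.isna() | b.isna()
    conflict = ~missing & (a != b)
    status = pd.Series("equal", index=a.index)
    status[missing] = "missing"
    status[conflict] = "conflict"
    return status


toy = pd.DataFrame({
    "genbank_organsim_nuc": ["Escherichia coli", "Serratia marcescens", "Bacillus phage lambda Ba02"],
    "genbank_organsim_prot": ["Escherichia coli", None, "Bacillus anthracis str. Ames"],
})
status = split_mismatches(toy["genbank_organsim_nuc"], toy["genbank_organsim_prot"])
print(status.value_counts())
```

Only the rows labelled `conflict` need expert review; rows labelled `missing` are resolved by picking the source that has a value.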

In [8]:
# As can be seen, there are 35 cases where the organism found via genbank (nucleotide) and via genbank (protein) differ 
# What should be done here? Are these alternative names? Expert knowledge required 
nuc_vs_prot = diff_organism_df["genbank_organsim_nuc"] != diff_organism_df["genbank_organsim_prot"]
diff_organism_df.loc[nuc_vs_prot, ["genbank_organsim_nuc", "genbank_organsim_prot"]].dropna()

Unnamed: 0,genbank_organsim_nuc,genbank_organsim_prot
126,Salmonella virus Fels2,Salmonella enterica subsp. enterica serovar Ty...
599,Bacillus phage lambda Ba02,Bacillus anthracis str. Ames
5382,Escherichia coli,Escherichia coli K-12
5600,Peptoclostridium phage p630P2,Clostridioides difficile 630
5612,Bacillus phage phBC6A52,Bacillus cereus ATCC 14579
5833,Bacillus phage phBC6A52,Bacillus cereus ATCC 14579
6103,Bacillus phage phBC6A52,Bacillus cereus ATCC 14579
6106,Staphylococcus epidermidis RP62A phage SP-beta,Staphylococcus epidermidis RP62A
6109,Bacillus phage lambda Ba02,Bacillus phage lambda Ba03
6145,Salmonella virus Fels2,Salmonella enterica subsp. enterica serovar Ty...


In [9]:
# Furthermore, there are 9 cases where genbank via nucleotide and refseq differ 
# Are these alternative names? Expert knowledge required 
nuc_vs_refseq = diff_organism_df["genbank_organsim_nuc"] != diff_organism_df["refseq_organism"]
diff_organism_df.loc[nuc_vs_refseq, ["genbank_organsim_nuc", "refseq_organism"]].dropna()

Unnamed: 0,genbank_organsim_nuc,refseq_organism
2020,Acinetobacter sp.,Acinetobacter baumannii
4017,Klebsiella michiganensis,Klebsiella michiganensis M5al
5010,Salmonella enterica,Salmonella enterica subsp. enterica serovar In...
5382,Escherichia coli,Escherichia coli K-12
6103,Bacillus phage phBC6A52,Bacillus cereus ATCC 14579
6106,Staphylococcus epidermidis RP62A phage SP-beta,Staphylococcus epidermidis RP62A
6109,Bacillus phage lambda Ba02,Bacillus anthracis str. Ames
6125,Cytobacillus massiliigabonensis,Bacillus massiliigabonensis
6991,Bacillus phage lambda Ba02,Bacillus anthracis str. Ames


I am going to choose the organism found in GenBank via nucleotide as the base organism, because it has no NaN values and appears to be complete.

In [10]:
# It looks like genbank_strain_nuc has more information, but the strain is sometimes already included in genbank_organsim_nuc 
# e.g. "Serratia marcescens MC620" already contains the strain MC620
df.loc[df["genbank_strain_nuc"] != df["genbank_strain_prot"], ["genbank_strain_nuc", "genbank_strain_prot", "genbank_organsim_nuc"]].sample(10, random_state=1)

Unnamed: 0,genbank_strain_nuc,genbank_strain_prot,genbank_organsim_nuc
2492,,,Stenotrophomonas maltophilia
8003,MS4,,Staphylococcus aureus
7040,,,Micromonospora zionensis
300,M16,,Stenotrophomonas maltophilia
8008,SWU02,,Streptococcus pneumoniae
6030,,,Escherichia coli
3656,,,Pseudomonas aeruginosa
107,MC620,,Serratia marcescens MC620
460,BWH49,,Escherichia coli
7224,,,Pseudomonas aeruginosa


In [11]:
# genbank_strain_prot carries either redundant information or no information at all -- the genbank strain via nucleotide will therefore be selected
df.loc[(df["genbank_strain_nuc"] != df["genbank_strain_prot"]) & ~df["genbank_strain_nuc"].isna(), ["genbank_strain_nuc", "genbank_strain_prot", "genbank_organsim_nuc"]]

Unnamed: 0,genbank_strain_nuc,genbank_strain_prot,genbank_organsim_nuc
52,APEC O1,,Escherichia coli APEC O1
58,DUT 180,,Streptomyces xiaopingdaonensis
59,W2.3,,Serratia marcescens W2.3
62,DSM 12546,,Saccharospirillum impatiens DSM 12546
87,NRRL ISP-5461,,Streptomyces lydicus
...,...,...,...
8630,NG196,,Neisseria gonorrhoeae
8631,NG196,,Neisseria gonorrhoeae
8675,TUM15753,,Neisseria gonorrhoeae
8676,TUM15753,,Neisseria gonorrhoeae


To find the correct and full strain name, I will combine "genbank strain via nucleotide" with "genbank organism via nucleotide", but only if the strain is not already included in the organism name. Otherwise the organism name itself is used.

In [12]:
def check_for_combination(df_row: pd.Series) -> bool:
    """
    Checks whether genbank_organsim_nuc and genbank_strain_nuc should be joined into one strain name
    """
    if not isinstance(df_row["genbank_strain_nuc"], str): 
        # strain is NaN: there is nothing to append, so organism and strain are not combined
        return False
    # combine only if the strain is not already part of the organism name (case-insensitive)
    return df_row["genbank_strain_nuc"].upper() not in df_row["genbank_organsim_nuc"].upper()


df["strain"] = np.where(df.apply(check_for_combination, axis=1), 
                        df["genbank_organsim_nuc"] + " " + df["genbank_strain_nuc"], 
                        df["genbank_organsim_nuc"])
df[["genbank_strain_nuc", "genbank_organsim_nuc", "strain"]]

Unnamed: 0,genbank_strain_nuc,genbank_organsim_nuc,strain
0,PA38182,Pseudomonas aeruginosa PA38182,Pseudomonas aeruginosa PA38182
1,5091,Burkholderia glumae,Burkholderia glumae 5091
2,LC231,Paenibacillus sp. LC231,Paenibacillus sp. LC231
3,,Providencia stuartii,Providencia stuartii
4,FC1K,Mycolicibacterium fortuitum,Mycolicibacterium fortuitum FC1K
...,...,...,...
9184,MRSA252,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252
9185,MRSA252,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252
9186,MRSA252,Staphylococcus aureus subsp. aureus MRSA252,Staphylococcus aureus subsp. aureus MRSA252
9187,ATCC 23344,Burkholderia mallei ATCC 23344,Burkholderia mallei ATCC 23344
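The substring test used above treats a strain as "already included" whenever its characters appear anywhere in the organism name, so a very short designation such as "A" would be found inside "Escherichia coli" and never appended. A possible refinement, shown here only as a sketch with a hypothetical function name, is to require the strain to appear as a whole token:

```python
import re


def build_strain_name(organism: str, strain) -> str:
    """Append `strain` to `organism` unless it already occurs there as a whole token."""
    if not isinstance(strain, str) or not strain.strip():
        return organism  # missing strain: keep the organism name as it is
    strain = strain.strip()
    # the lookarounds make sure the match is not embedded inside a longer word
    pattern = r"(?<!\w)" + re.escape(strain) + r"(?!\w)"
    if re.search(pattern, organism, flags=re.IGNORECASE):
        return organism
    return f"{organism} {strain}"


print(build_strain_name("Burkholderia glumae", "5091"))
print(build_strain_name("Pseudomonas aeruginosa PA38182", "PA38182"))
print(build_strain_name("Escherichia coli", "A"))
```

Whether this stricter rule changes any row of the real dataset has not been checked here; it mainly guards against short strain designations.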


## Gene

In [13]:
df["refseq_gene"]

0       Pseudomonas aeruginosa PA38182 aac(2')-I(A267)
1                 Burkholderia glumae 5091 aac(2')-IIa
2                  Paenibacillus sp. LC231 aac(2')-IIb
3                      Providencia stuartii aac(2')-Ia
4              Mycobacterium fortuitum FC1K aac(2')-Ib
                             ...                      
9184                                               NaN
9185                                               NaN
9186                                               NaN
9187                                               NaN
9188                                               NaN
Name: refseq_gene, Length: 9189, dtype: object
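In the values shown, `refseq_gene` consists of the organism and strain followed by the gene symbol. If only the symbol is needed, it could be split off as sketched below. This assumes that the gene symbol is the trailing part of the string and contains no spaces when the strain prefix does not match; the function `extract_gene_symbol` is hypothetical.

```python
def extract_gene_symbol(refseq_gene, strain_name=None):
    """Return the gene symbol at the end of a refseq_gene string, or None if it is missing."""
    if not isinstance(refseq_gene, str) or not refseq_gene.strip():
        return None  # NaN or empty entry
    text = refseq_gene.strip()
    # preferred: strip the known strain name from the front
    if isinstance(strain_name, str) and text.lower().startswith(strain_name.lower()):
        rest = text[len(strain_name):].strip()
        if rest:
            return rest
    # fallback: the names differ (e.g. renamed genus), so take the last token
    return text.split()[-1]


print(extract_gene_symbol("Pseudomonas aeruginosa PA38182 aac(2')-I(A267)"))
print(extract_gene_symbol("Mycobacterium fortuitum FC1K aac(2')-Ib",
                          "Mycolicibacterium fortuitum FC1K"))
```

The second call shows why the fallback is needed: RefSeq and GenBank can use different genus names for the same strain.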

## Search for connection to Wikidata

In [14]:
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

In [15]:


def search_parent_taxon(df_row: pd.Series) -> "str | list | None":
    """
    Searches Wikidata for a species of bacterium whose English label contains the first two words of the strain name.
    Some names are abbreviated, for example "Achromobacter sp." -- it is unclear whether this means Achromobacter spanius
    or Achromobacter spiritinus, so such rows are of little use.
    Returns one entity URI, a list of URIs (abbreviated name with several candidates), or None if nothing matches.
    """
    parent = df_row["strain"].split()[:2]
    print(df_row["strain"])  # progress output
    abbreviation = False
    for i, word in enumerate(parent): 
        if word.endswith("."): 
            abbreviation = True
            parent[i] = word[:-1]
    parent = " ".join(parent).lower()
    # escape backslashes and double quotes so the name cannot break the SPARQL string literal
    label_filter = parent.replace("\\", "\\\\").replace('"', '\\"')
    query = f"""SELECT ?item ?itemLabel ?itemDescription
    WHERE {{
      ?item rdfs:label ?label;
            schema:description "species of bacterium"@en.
      
      FILTER(LANG(?label) = "en" && CONTAINS(LCASE(?label), "{label_filter}"))
      
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
    }}
    LIMIT 10
    """
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = []
    for _ in range(5):  # retry a few times instead of looping forever on a persistent error
        try:
            results = sparql.query().convert()["results"]["bindings"]
            break
        except urllib.error.HTTPError: 
            time.sleep(5)  # endpoint busy or rate limited: wait before retrying
        except Exception: 
            time.sleep(1)  # any other failure: brief pause, then retry
    results = [(item["item"]["value"], item["itemLabel"]["value"]) for item in results]
    if not results:
        return None
    if abbreviation: 
        return [res[0] for res in results] if len(results) > 1 else results[0][0]
    for res in results: 
        if res[1].lower() in parent: 
            return res[0]
    return None


get_from_wikidata_switch = True
if get_from_wikidata_switch:
    df["parent_taxon"] = df.apply(search_parent_taxon, axis=1)
else: 
    df = pd.read_csv("resistance_df2.csv")

Pseudomonas aeruginosa PA38182
Burkholderia glumae 5091
Paenibacillus sp. LC231
Providencia stuartii
Mycolicibacterium fortuitum FC1K
Mycobacterium tuberculosis H37Rv
Mycolicibacterium smegmatis MC2 155
Mycobacterium leprae Br4923
Pseudomonas aeruginosa
Salmonella enterica subsp. enterica serovar Typhimurium PNUSAS018916
Pseudomonas aeruginosa
Pseudomonas aeruginosa Ps142
uncultured bacterium
Pseudomonas aeruginosa PST-1
Pseudomonas aeruginosa WH-SGI-V-07065
Pseudomonas aeruginosa
Plasmid pWP113a
Serratia marcescens
Escherichia coli clinical isolate 164
Citrobacter freundii
Plasmid R
Acinetobacter baumannii PAU
Escherichia coli
uncultured bacterium HHV35
Enterobacter cloacae
Enterobacter cloacae
Escherichia coli
Cupriavidus gilardii W2-2
Micromonospora chalcea
Klebsiella pneumoniae
Salmonella enterica subsp. enterica serovar Typhimurium
Pseudomonas aeruginosa
uncultured bacterium
Acinetobacter baumannii Aci 33
Serratia marcescens
Pseudomonas aeruginosa
uncultured bacterium
Pseudomonas 
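The query above scans English labels with `CONTAINS`, which is slow and depends on the exact wording of the item description. An alternative that may be worth trying is an exact match on the Wikidata property "taxon name" (P225), restricted to items with taxon rank (P105) species (Q7432). The sketch below only builds the query string; `build_taxon_query` is a hypothetical helper, and whether every relevant species item carries these statements is an assumption that would need to be checked against Wikidata.

```python
def build_taxon_query(taxon_name: str, limit: int = 10) -> str:
    """Build a SPARQL query for species items whose taxon name equals `taxon_name` exactly."""
    # escape backslashes and double quotes so the name stays inside the string literal
    escaped = taxon_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P225 "{escaped}" ;
        wdt:P105 wd:Q7432 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}"""


# Taxon names are case-sensitive, so the name is passed as written, not lowercased
print(build_taxon_query("Pseudomonas aeruginosa"))
```

The resulting string can be passed to `sparql.setQuery` in the same way as the query above. Abbreviated names such as "Paenibacillus sp." cannot be resolved by an exact match and would still need separate handling.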

In [16]:
df[["product_name", "refseq_organism",  "strain", "parent_taxon", "refseq_gene"]].sample(5)

Unnamed: 0,product_name,refseq_organism,strain,parent_taxon,refseq_gene
7513,tetracycline resistance ribosomal protection p...,uncultured bacterium,uncultured bacterium,,uncultured bacterium tet(W)
5738,trimethoprim-resistant dihydrofolate reductase...,Enterococcus faecalis,Enterococcus faecalis TX0638,http://www.wikidata.org/entity/Q140014,Enterococcus faecalis TX0638 dfrF
2022,extended-spectrum class A beta-lactamase GES-26,Pseudomonas aeruginosa,Pseudomonas aeruginosa,http://www.wikidata.org/entity/Q31856,Pseudomonas aeruginosa blaGES
3565,OXA-23 family carbapenem-hydrolyzing class D b...,Acinetobacter baumannii,Acinetobacter baumannii 1772979,http://www.wikidata.org/entity/Q3241189,Acinetobacter baumannii 1772979 blaOXA
5427,type B-3 chloramphenicol O-acetyltransferase C...,Providencia rustigianii,Providencia rustigianii C-B10C,,Providencia rustigianii C-B10C catB8


## Handle special cases
For example Escherichia coli K-12 


In [17]:
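# Assign the Wikidata item for Escherichia coli K-12 directly (same URI form as the query results)
is_ecoli_k12 = df["genbank_organsim_nuc"].str.lower().str.contains("escherichia coli") & (df["genbank_strain_nuc"] == "K-12")
df["parent_taxon"] = np.where(is_ecoli_k12, "http://www.wikidata.org/entity/Q21399437", df["parent_taxon"])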

## Time to drop useless data instances

In [40]:
# drop rows where no gene and no wikidata parent taxon was found 
df = df.dropna(subset=["refseq_gene", "parent_taxon"])

In [41]:
# drop rows where strain is no longer than two words 
df = df[~(df["strain"].str.split().str.len() <= 2)]

In [55]:
# These data instances come from abbreviated names, which makes it unclear to which taxon they belong 
# (the search returned a list of candidates). They also need to be dropped 
df["parent_taxon"] = df["parent_taxon"].astype(str)
df.loc[df["parent_taxon"].str.contains(r"\[|\]"), ["strain", "parent_taxon"]]

Unnamed: 0,strain,parent_taxon
67,Streptomyces sp. 769,"['http://www.wikidata.org/entity/Q60748845', '..."
75,Streptomyces sp. GBA 94-10 4N24,"['http://www.wikidata.org/entity/Q60748845', '..."
81,Streptomyces sp. SPB78,"['http://www.wikidata.org/entity/Q60748845', '..."
88,Streptomyces sp. NRRL S-1831,"['http://www.wikidata.org/entity/Q60748845', '..."
103,Streptomyces sp. MBRL 601,"['http://www.wikidata.org/entity/Q60748845', '..."
111,Streptomyces sp. M10,"['http://www.wikidata.org/entity/Q60748845', '..."
113,Streptomyces sp. KE1,"['http://www.wikidata.org/entity/Q60748845', '..."
120,Streptomyces sp. NRRL F-4711,"['http://www.wikidata.org/entity/Q60748845', '..."
121,Streptomyces sp. NRRL F-4707,"['http://www.wikidata.org/entity/Q60748845', '..."
2472,Streptomyces sp. NRRL S-1868,"['http://www.wikidata.org/entity/Q60748845', '..."


In [56]:
df = df[~df["parent_taxon"].str.contains(r"\[|\]")]

In [57]:
len(df) # About half of the data is lost after everything is cleaned

4522
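Since about half of the rows are lost, it can help to see how many rows each cleaning step removes. The following sketch applies the three filters from the cells above to a toy dataframe and records the row count after each step; `apply_filters_with_report` is a hypothetical helper, and the toy rows are made up from values that appear in this notebook.

```python
import pandas as pd


def apply_filters_with_report(df: pd.DataFrame, filters: list):
    """Apply (description, keep-mask function) pairs in order and report the rows left."""
    rows = [("start", len(df))]
    for description, keep in filters:
        df = df[keep(df)]
        rows.append((description, len(df)))
    report = pd.DataFrame(rows, columns=["step", "rows_left"])
    # rows removed by each step (0 for the starting row)
    report["dropped"] = -report["rows_left"].diff().fillna(0).astype(int)
    return df, report


toy = pd.DataFrame({
    "strain": ["Pseudomonas aeruginosa PST-1", "Acinetobacter baumannii",
               "Streptomyces sp. 769", "uncultured bacterium HHV35"],
    "parent_taxon": ["http://www.wikidata.org/entity/Q31856",
                     "http://www.wikidata.org/entity/Q3241189",
                     "['http://www.wikidata.org/entity/Q60748845', '...']",
                     None],
})
filters = [
    ("parent taxon found", lambda d: d["parent_taxon"].notna()),
    ("strain longer than two words", lambda d: d["strain"].str.split().str.len() > 2),
    ("unambiguous parent taxon", lambda d: ~d["parent_taxon"].astype(str).str.contains(r"\[|\]")),
]
cleaned, report = apply_filters_with_report(toy, filters)
print(report)
```

On the real dataframe, such a report would show which of the three criteria accounts for most of the loss.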

In [58]:
df[["product_name", "strain", "parent_taxon"]].sample(5)

Unnamed: 0,product_name,strain,parent_taxon
4822,extended-spectrum class A beta-lactamase SHV-18,Klebsiella pneumoniae K6; ATCC 700603,http://www.wikidata.org/entity/Q132592
4492,inhibitor-resistant class C beta-lactamase PDC...,Pseudomonas aeruginosa 229474,http://www.wikidata.org/entity/Q31856
1145,extended-spectrum class C beta-lactamase ADC-256,Acinetobacter baumannii 20A3025,http://www.wikidata.org/entity/Q3241189
2452,carbapenem-hydrolyzing class A beta-lactamase ...,Klebsiella pneumoniae M25752,http://www.wikidata.org/entity/Q132592
1982,cephalosporin-hydrolyzing class C beta-lactama...,Aeromonas allosaccharophila 10CC9,http://www.wikidata.org/entity/Q16825681


In [59]:
df.keys()

Index(['allele', 'gene_family', 'whitelisted_taxa', 'product_name', 'scope',
       'type', 'subtype', 'class', 'subclass', 'refseq_protein_accession',
       'refseq_nucleotide_accession', 'curated_refseq_start',
       'genbank_protein_accession', 'genbank_nucleotide_accession',
       'genbank_strand', 'genbank_start', 'genbank_stop', 'refseq_strand',
       'refseq_start', 'refseq_stop', 'pubmed_reference', 'blacklisted_taxa',
       'synonyms', 'hierarchy_node', 'db_version', 'refseq_gene',
       'refseq_protein', 'refseq_genome', 'refseq_organism', 'refseq_parent',
       'rrefseq_protein', 'refseq_same_parent', 'genbank_organsim_nuc',
       'genbank_strain_nuc', 'genbank_organsim_prot', 'genbank_strain_prot',
       'strain', 'parent_taxon'],
      dtype='object')

In [60]:
df[["class", "subclass", "product_name", "refseq_gene", "parent_taxon", "strain"]].sample(5)

Unnamed: 0,class,subclass,product_name,refseq_gene,parent_taxon,strain
2234,BETA-LACTAM,CARBAPENEM,subclass B1 metallo-beta-lactamase IMP-78,Pseudomonas aeruginosa NCGM 3848 blaIMP,http://www.wikidata.org/entity/Q31856,Pseudomonas aeruginosa NCGM 3848
4382,BETA-LACTAM,CEPHALOSPORIN,inhibitor-resistant extended-spectrum class C ...,Pseudomonas aeruginosa 2013996 blaPDC,http://www.wikidata.org/entity/Q31856,Pseudomonas aeruginosa 2013996
4307,BETA-LACTAM,CEPHALOSPORIN,class C beta-lactamase PDC-344,Pseudomonas aeruginosa 1875293 blaPDC,http://www.wikidata.org/entity/Q31856,Pseudomonas aeruginosa 1875293
1137,BETA-LACTAM,CEPHALOSPORIN,extended-spectrum class C beta-lactamase ADC-249,Acinetobacter baumannii ARLG1929 blaADC,http://www.wikidata.org/entity/Q3241189,Acinetobacter baumannii ARLG1929
1899,BETA-LACTAM,CEPHALOSPORIN,class C beta-lactamase DHA-16,Morganella morganii 943174 blaDHA,http://www.wikidata.org/entity/Q2696880,Morganella morganii 943174


In [61]:
df.to_csv("resistance_df2.csv", index=False)

## Results

assaf


## Discussion and Conclusion

daf

## Link to GitHub


https://github.com/gjmm07/DS_LOD_and_Knowledge_Graphs_2023_Finn_Heydemann

## Literature

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10340576/