# Utilising reconXMapper (SPARQL Query) to aid Free Text Mapping

The query suggests related identifiers based solely on a metabolite's name. If an exact match is found, only the identifiers related to that metabolite are returned. If only partial matches are found, each partial match is returned along with its related identifiers. Partial matches make the output messy, so exact and partial matches should be separated into two different dataframes. Ideally the majority of names would match exactly, but differing naming conventions across papers and sites make this unlikely. The process should still speed up mapping, because it suggests probable metabolites to inspect instead of requiring a manual search of each site.
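
The exact/partial separation described above can be sketched in plain Python. This is only an illustration: `split_matches` is a hypothetical helper, the identifiers are invented placeholders, and case-insensitive comparison is an assumption rather than something the query is known to do.

```python
def split_matches(query, candidates):
    """Return (exact, partial) lists of candidate rows for one free-text name.

    `candidates` is an iterable of (label, identifier) pairs. Following the
    behaviour described above, partial matches are reported only when no
    exact match exists.
    """
    needle = query.strip().casefold()
    exact, partial = [], []
    for label, identifier in candidates:
        row = {"query": query, "label": label, "identifier": identifier}
        haystack = label.strip().casefold()
        if haystack == needle:
            exact.append(row)
        elif needle and needle in haystack:
            partial.append(row)
    if exact:
        return exact, []
    return exact, partial


# Placeholder identifiers, invented for illustration only.
candidates = [
    ("ethanol", "EX:0001"),
    ("ethanolamine", "EX:0002"),
    ("acetone", "EX:0003"),
]

exact_rows, partial_rows = split_matches("ethanol", candidates)
only_partial = split_matches("ethan", candidates)[1]
```

Each returned list is a list of dicts, so it can be passed directly to `pd.DataFrame` to build the two separate dataframes.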

In [54]:
import os
import pandas as pd
import re

Need to import `openTECR recuration.xlsx` and select two sheets, **table metadata** and **actual data**
- load each sheet into its own df for now once extracted

In [55]:
# import the file
recuration = "/home/jackmcgoldrick/openTECR/data/openTECR recuration.xlsx"

In [57]:
table_metadata = pd.read_excel(recuration, sheet_name='table metadata')

In [8]:
table_metadata

Unnamed: 0,part,page,col l/r,table from top,reaction,reference_code,method,buffer,pH,Cofactor,Evaluation,effort_needed,table_code,ECNumber,Enzyme,reference,curator comment
0,1,522,1,1,benzyl alcohol(aq) + NAD(aq) = benzaldehyde(aq...,80COO/BLA,spectrophotometry,-,7.5 - 9.5,-,C,low,80COO/BLA_2,1.1.1.1,alcohol dehydrogenase,"Cook, P.F.; Blanchard, J.S.; Cleland, W.W.; Bi...",1
1,1,522,1,2,1-butanol(aq) + NAD(aq) = butanal(aq) + NADH(aq),68ERI,spectrophotometry,sodium pyrophosphate (0.01 mol dm-3),8.2 - 8.4,-,B,low,68ERI_3,1.1.1.1,alcohol dehydrogenase,"Eriksson, C.E.; J. Food Sci.; 33, 525 (1968).",1
2,1,522,2,1,1-butanol(aq) + NAD(aq) = butanal(aq) + NADH(aq),83BRA,calorimetry,Tris and glycylglycine,8.8,-,A,low,83BRA_4,1.1.1.1,alcohol dehydrogenase,"Brattlie, W.J.; ""Thermochemistry of the Nicoti...",1
3,1,522,2,2,cyclohexanol(aq) + NAD(aq) = cyclohexanone(aq)...,59MER/TOM,spectrophotometry,phosphate (0.001 mol dm-3),7.2 - 9.5,-,C,low,59MER/TOM_5,1.1.1.1,alcohol dehydrogenase,"Merritt, A.D.; Tomkins, G.M.; J. Biol. Chem.; ...",1
4,1,522,2,3,cyclohexanol(aq) + NAD(aq) = cyclohexanone(aq)...,80COO/BLA,spectrophotometry,Tris (0.1 mol dm-3) + HCl,8,-,B,low,80COO/BLA_6,1.1.1.1,alcohol dehydrogenase,"Cook, P.F.; Blanchard, J.S.; Cleland, W.W.; Bi...",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1655,7,1386,1,1,ATP(aq) + L-serine(aq) + tRNA-Ser(aq) = AMP(aq...,06AIR2,radioactivity,Hepes (0.050 mol dm-3) + KOH,7.4,Mg(acetate)2,A,low,06AIR2_1505,6.1.1.1,tyrosine-tRNA ligase,,added
1656,7,1386,2,1,ATP(aq) + L-arginine(aq) + tRNA-Arg(aq) = AMP(...,06AIR,radioactivity,Hepes (0.050 mol dm-3) + KOH,7.4,Mg(acetate)2,A,low,,6.1.1.1,tyrosine-tRNA ligase,,added
1657,7,1386,2,2,ATP(aq) + L-phenylalanine(aq) + tRNA-Phe(aq) =...,06AIR2,radioactivity,Hepes (0.050 mol dm-3) + KOH,7.4,Mg(acetate)2,A,low,06AIR2_1508,6.1.1.1,tyrosine-tRNA ligase,,added
1658,7,1387,1,1,ATP(aq) + L-histidine(aq) + tRNA-His(aq) = AMP...,06AIR2,radioactivity,Hepes (0.050 mol dm-3) + KOH,7.4,Mg(acetate)2,A,low,06AIR2_1506,6.1.1.1,tyrosine-tRNA ligase,,added


In [9]:
# do the same for actual data
actual_data = pd.read_excel(recuration, sheet_name='actual data')

In [10]:
actual_data

Unnamed: 0,id,EC,reference_code,reaction,K,temperature,ionic_strength,p_h,p_mg,K_prime,...,wrong_value,ph_not_present_in_pdf,missing_value_to_be_added,kprime_added,misannotated_value_type,todo_check_in_primary_literature,value_given_with_approximate_sign,virtual_entry,typo_in_pdf_corrected_in_situ,additional data
0,https://w3id.org/related-to/doi.org/10.5281/ze...,1.1.1.1,80COO/BLA,benzyl alcohol(aq) + NAD(aq) = benzaldehyde(aq...,,298.15,,7.5,,0.00098,...,,,,,,,,,,
1,https://w3id.org/related-to/doi.org/10.5281/ze...,1.1.1.1,80COO/BLA,benzyl alcohol(aq) + NAD(aq) = benzaldehyde(aq...,,298.15,,8.0,,0.00310,...,,,,,,,,,,
2,https://w3id.org/related-to/doi.org/10.5281/ze...,1.1.1.1,80COO/BLA,benzyl alcohol(aq) + NAD(aq) = benzaldehyde(aq...,,298.15,,8.5,,0.00980,...,,,,,,,,,,
3,https://w3id.org/related-to/doi.org/10.5281/ze...,1.1.1.1,80COO/BLA,benzyl alcohol(aq) + NAD(aq) = benzaldehyde(aq...,,298.15,,9.0,,0.03100,...,,,,,,,,,,
4,https://w3id.org/related-to/doi.org/10.5281/ze...,1.1.1.1,80COO/BLA,benzyl alcohol(aq) + NAD(aq) = benzaldehyde(aq...,,298.15,,9.5,,0.09800,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5740,https://w3id.org/related-to/doi.org/10.5281/ze...,4.3.-.-,+59BLA,"THF(aq) + formaldehyde(aq) = 5,10-CH2-THF(aq)",,293.15,,7.2,,7700.00000,...,,,,,,,,,,[added by Elad]
5741,https://w3id.org/related-to/doi.org/10.5281/ze...,4.1.2.43,+74FER,D-Ribulose 5-phosphate + Formaldehyde = D-arab...,,303.15,,7.0,,25000.00000,...,,,,,,,,,,[added by Elad]
5742,https://w3id.org/related-to/doi.org/10.5281/ze...,5.3.1.27,+74FER,D-arabino-Hex-3-ulose 6-phosphate = D-Fructose...,,303.15,,7.0,,188.00000,...,,,,,,,,,,[added by Elad]
5743,https://w3id.org/related-to/doi.org/10.5281/ze...,2.4.1.216,01AND/LEV,",-trehalose 6-phosphate(aq) + orthophosphate(a...",,308.15,,7.0,,0.03200,...,,,,,,,,,,"[presumably should have been in part 7, becaus..."


In [11]:
# extract empty "id" rows into new df for further analysis
unmapped_rxns = actual_data[actual_data["id"].isnull()]

In [12]:
unmapped_rxns

Unnamed: 0,id,EC,reference_code,reaction,K,temperature,ionic_strength,p_h,p_mg,K_prime,...,wrong_value,ph_not_present_in_pdf,missing_value_to_be_added,kprime_added,misannotated_value_type,todo_check_in_primary_literature,value_given_with_approximate_sign,virtual_entry,typo_in_pdf_corrected_in_situ,additional data
82,,1.1.1.1,38SCH/HEL,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,,298.15,,6.39,,0.000009,...,,,,,,,,,,
83,,1.1.1.1,38SCH/HEL,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,,298.15,,6.60,,0.000030,...,,,,,,,,,,
84,,1.1.1.1,38SCH/HEL,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,,298.15,,6.85,,0.000051,...,,,,,,,,,,
85,,1.1.1.1,38SCH/HEL,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,,298.15,,7.18,,0.000150,...,,,,,,,,,,
86,,1.1.1.1,38SCH/HEL,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,,298.15,,7.31,,0.000230,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5558,,4.6.1.3,02TEW/HAW,2-dehydro-3-deoxy-D-arabino-heptonate 7-phosph...,,298.15,,7.50,,4.600000,...,,,,,,,,,,I_m = 0.065 mol/kg; [Kprime noted as K'm]; [co...
5560,,4.6.1.3,02TEW/HAW,2-dehydro-3-deoxy-D-arabino-heptonate 7-phosph...,,298.15,,7.46,,,...,,,,,,,,,,I_m = 0.07 mol/kg; pMn = 3.30; [corrects https...
5564,,5.1.1.1,03WAT/YAM,L-alanine(aq) = D-alanine(aq),,310.15,,9.00,,0.800000,...,,,,,,1.0,,,,[corrects https://w3id.org/related-to/doi.org/...
5630,,6.1.1.1,06AIR2,ATP(aq) + L-tyrosine(aq) + tRNA-Tyr(aq) = AMP(...,,303.15,,7.40,1.74,3.030000,...,,,,,,,,,,c(Mg2+) = 18 mM; c(spermidine) = 1 mM


In [14]:
print(unmapped_rxns.columns)

Index(['id', 'EC', 'reference_code', 'reaction', 'K', 'temperature',
       'ionic_strength', 'p_h', 'p_mg', 'K_prime', 'part', 'page', 'col l/r',
       'table from top', 'entry nr',
       'authoritative_version_of_a_duplicate_table', 'duplicate_table',
       'solvent_reaction', 'enthalpy', 'error_correction', 'wrong_value',
       'ph_not_present_in_pdf', 'missing_value_to_be_added', 'kprime_added',
       'misannotated_value_type', 'todo_check_in_primary_literature',
       'value_given_with_approximate_sign', 'virtual_entry',
       'typo_in_pdf_corrected_in_situ', 'additional data'],
      dtype='object')


'reference_code' is the column we can potentially use to map rows to the more reliable **table_metadata** dataframe (the reactions have been corrected in this df).

To retrieve only the reliable reactions which haven't been mapped yet:
- Extract both reference_code and reaction from unmapped_rxns

- Compare these to reference_code and reaction in table_metadata

- Keep only the rows where both of these match in both dfs

- This produces a set of unmapped reactions which have been corrected

- Proceed to work from this new set of reactions
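
The matching rule in the steps above can be sketched without pandas to show why the join returns one row per measurement rather than one per reaction. `inner_join` is a hypothetical helper and the rows are toy values (the second pH is invented); `DataFrame.merge(..., how="inner")` applies the same pairing rule to the real dataframes.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, keys):
    """Pair every left row with every right row that shares all `keys`."""
    index = defaultdict(list)
    for row in right_rows:
        index[tuple(row[k] for k in keys)].append(row)

    joined = []
    for left in left_rows:
        for right in index.get(tuple(left[k] for k in keys), []):
            joined.append({**left, **right})
    return joined


# Toy stand-ins for table_metadata and unmapped_rxns.
metadata = [
    {"reference_code": "03WAT/YAM",
     "reaction": "L-alanine(aq) = D-alanine(aq)",
     "method": "chromatography"},
    {"reference_code": "68ERI",
     "reaction": "1-butanol(aq) + NAD(aq) = butanal(aq) + NADH(aq)",
     "method": "spectrophotometry"},
]
unmapped = [
    {"reference_code": "03WAT/YAM",
     "reaction": "L-alanine(aq) = D-alanine(aq)", "p_h": 9.0},
    {"reference_code": "03WAT/YAM",
     "reaction": "L-alanine(aq) = D-alanine(aq)", "p_h": 8.0},  # invented row
]

joined = inner_join(metadata, unmapped, ["reference_code", "reaction"])
```

One metadata row matched two measurement rows here, so the result has two rows. Because the keys are compared as exact strings, any difference in spelling or spacing between the two sheets silently drops a reaction from the result.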

## Mapping reactions/ref codes from actual_data to table_metadata

In [18]:
# Perform an inner join on the columns "reference_code" and "reaction"
matched_data = table_metadata.merge(
    unmapped_rxns,
    on=["reference_code", "reaction"],  # Columns to match on
    how="inner"  # Only keep rows where matches are found
)

# Display the resulting DataFrame
print(matched_data)

     part_x  page_x  col l/r_x  table from top_x  \
0         1     525          2                 2   
1         1     525          2                 2   
2         1     525          2                 2   
3         1     525          2                 2   
4         1     525          2                 2   
..      ...     ...        ...               ...   
447       7    1381          1                 1   
448       7    1381          1                 2   
449       7    1381          1                 2   
450       7    1381          2                 1   
451       7    1385          1                 2   

                                              reaction reference_code  \
0    ethanol(aq) + desamino NAD(aq) = acetaldehyde(...      38SCH/HEL   
1    ethanol(aq) + desamino NAD(aq) = acetaldehyde(...      38SCH/HEL   
2    ethanol(aq) + desamino NAD(aq) = acetaldehyde(...      38SCH/HEL   
3    ethanol(aq) + desamino NAD(aq) = acetaldehyde(...      38SCH/HEL   
4    ethan

In [19]:
matched_data.to_csv('/home/jackmcgoldrick/openTECR/results/freeText_rxns_tobe_mapped.csv')

In [20]:
matched_data

Unnamed: 0,part_x,page_x,col l/r_x,table from top_x,reaction,reference_code,method,buffer,pH,Cofactor,...,wrong_value,ph_not_present_in_pdf,missing_value_to_be_added,kprime_added,misannotated_value_type,todo_check_in_primary_literature,value_given_with_approximate_sign,virtual_entry,typo_in_pdf_corrected_in_situ,additional data
0,1,525,2,2,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,38SCH/HEL,spectrophotometry,-,6.39 - 8.06,-,...,,,,,,,,,,
1,1,525,2,2,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,38SCH/HEL,spectrophotometry,-,6.39 - 8.06,-,...,,,,,,,,,,
2,1,525,2,2,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,38SCH/HEL,spectrophotometry,-,6.39 - 8.06,-,...,,,,,,,,,,
3,1,525,2,2,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,38SCH/HEL,spectrophotometry,-,6.39 - 8.06,-,...,,,,,,,,,,
4,1,525,2,2,ethanol(aq) + desamino NAD(aq) = acetaldehyde(...,38SCH/HEL,spectrophotometry,-,6.39 - 8.06,-,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
447,7,1381,1,1,2-dehydro-3-deoxy-D-arabino-heptonate 7-phosph...,02TEW/HAW,chromatography,Hepes + NaOH,7.5,NAD(ox) and Zn2+(aq),...,,,,,,,,,,I_m = 0.07 mol/kg; pMn = 3.30; [corrects https...
448,7,1381,1,2,2-dehydro-3-deoxy-D-arabino-heptonate 7-phosph...,02TEW/HAW,calorimetry,Hepes + NaOH,7.5,NAD(ox) and Zn2+(aq),...,,,,,,,,,,I_m = 0.065 mol/kg; [Kprime noted as K'm]; [co...
449,7,1381,1,2,2-dehydro-3-deoxy-D-arabino-heptonate 7-phosph...,02TEW/HAW,calorimetry,Hepes + NaOH,7.5,NAD(ox) and Zn2+(aq),...,,,,,,,,,,I_m = 0.07 mol/kg; pMn = 3.30; [corrects https...
450,7,1381,2,1,L-alanine(aq) = D-alanine(aq),03WAT/YAM,chromatography,Bis-tris propane (0.10 mol dm-3),9,-,...,,,,,,1.0,,,,[corrects https://w3id.org/related-to/doi.org/...


452 rows returned (index 0 to 451). Many of them share the same reaction and differ only in the conditions (pH etc.) recorded in actual data.

- Now need to extract metabolite names from this list using a regex pattern, ensuring no duplicates are included

In [21]:
# access the contents of reaction column
reactions = matched_data['reaction']

reactions

0      ethanol(aq) + desamino NAD(aq) = acetaldehyde(...
1      ethanol(aq) + desamino NAD(aq) = acetaldehyde(...
2      ethanol(aq) + desamino NAD(aq) = acetaldehyde(...
3      ethanol(aq) + desamino NAD(aq) = acetaldehyde(...
4      ethanol(aq) + desamino NAD(aq) = acetaldehyde(...
                             ...                        
447    2-dehydro-3-deoxy-D-arabino-heptonate 7-phosph...
448    2-dehydro-3-deoxy-D-arabino-heptonate 7-phosph...
449    2-dehydro-3-deoxy-D-arabino-heptonate 7-phosph...
450                        L-alanine(aq) = D-alanine(aq)
451    ATP(aq) + L-tyrosine(aq) + tRNA-Tyr(aq) = AMP(...
Name: reaction, Length: 452, dtype: object

### Extracting metabolite names from reactions

In [22]:
matched_data['metabolites'] = matched_data['reaction'].str.findall(r"([A-Za-z0-9\s,\.\-'\(\)]+)(?=\s*\+|\s*\=|$)")

In [23]:
matched_data['metabolites']

0      [ethanol(aq) ,  desamino NAD(aq) ,  acetaldehy...
1      [ethanol(aq) ,  desamino NAD(aq) ,  acetaldehy...
2      [ethanol(aq) ,  desamino NAD(aq) ,  acetaldehy...
3      [ethanol(aq) ,  desamino NAD(aq) ,  acetaldehy...
4      [ethanol(aq) ,  desamino NAD(aq) ,  acetaldehy...
                             ...                        
447    [2-dehydro-3-deoxy-D-arabino-heptonate 7-phosp...
448    [2-dehydro-3-deoxy-D-arabino-heptonate 7-phosp...
449    [2-dehydro-3-deoxy-D-arabino-heptonate 7-phosp...
450                     [L-alanine(aq) ,  D-alanine(aq)]
451    [ATP(aq) ,  L-tyrosine(aq) ,  tRNA-Tyr(aq) ,  ...
Name: metabolites, Length: 452, dtype: object
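
The extracted names still carry surrounding whitespace, and the `findall` pattern breaks a name at any character outside its character class, which may explain some of the fragments that appear in later output. A possible alternative, sketched here as an assumption and not the method used in this notebook, is to split on the separators instead. `split_reaction` is a hypothetical helper, and it assumes separators are always a `+` or `=` surrounded by whitespace.

```python
import re

# A separator is "+" or "=" with whitespace on both sides, so a "+" or "-"
# embedded in a compound name is left alone.
SEPARATOR = re.compile(r"\s+[+=]\s+")


def split_reaction(reaction):
    """Split a reaction string into its whitespace-stripped participants."""
    return [part.strip() for part in SEPARATOR.split(reaction) if part.strip()]


names = split_reaction("1-butanol(aq) + NAD(aq) = butanal(aq) + NADH(aq)")
isomers = split_reaction("L-alanine(aq) = D-alanine(aq)")
```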

In [24]:
# need to strip (aq) etc. from each
def strip_state_specifiers(metabolites):
    if isinstance(metabolites, list):
        return [re.sub(r'\s*\(.*?\)', '', metabolite) for metabolite in metabolites]
    return metabolites
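
Note that the pattern `\s*\(.*?\)` removes every parenthesised group, not only a trailing state specifier. The sketch below shows a stricter variant that strips only a specifier at the end of the name. The set of state labels (`aq`, `s`, `l`, `g`) is an assumption, and `(R)-2-octanol(aq)` is a made-up name used purely to show the difference.

```python
import re

# Only a state label at the very end of the name is removed.
STATE_SUFFIX = re.compile(r"\s*\((?:aq|s|l|g)\)\s*$")


def strip_trailing_state(name):
    """Remove a trailing state specifier such as "(aq)" and outer whitespace."""
    return STATE_SUFFIX.sub("", name).strip()


example = "(R)-2-octanol(aq)"  # hypothetical name, for illustration only

# Pattern used in the cell above: also removes the leading "(R)".
original = re.sub(r"\s*\(.*?\)", "", example)

# Anchored variant: keeps parentheses that belong to the name.
anchored = strip_trailing_state(example)
simple = strip_trailing_state("ethanol(aq) ")
```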

In [25]:
# apply to our use case
matched_data['metabolites'] = matched_data['metabolites'].apply(strip_state_specifiers)

print(matched_data)

     part_x  page_x  col l/r_x  table from top_x  \
0         1     525          2                 2   
1         1     525          2                 2   
2         1     525          2                 2   
3         1     525          2                 2   
4         1     525          2                 2   
..      ...     ...        ...               ...   
447       7    1381          1                 1   
448       7    1381          1                 2   
449       7    1381          1                 2   
450       7    1381          2                 1   
451       7    1385          1                 2   

                                              reaction reference_code  \
0    ethanol(aq) + desamino NAD(aq) = acetaldehyde(...      38SCH/HEL   
1    ethanol(aq) + desamino NAD(aq) = acetaldehyde(...      38SCH/HEL   
2    ethanol(aq) + desamino NAD(aq) = acetaldehyde(...      38SCH/HEL   
3    ethanol(aq) + desamino NAD(aq) = acetaldehyde(...      38SCH/HEL   
4    ethan

In [26]:
matched_data['metabolites']

0      [ethanol ,  desamino NAD ,  acetaldehyde ,  de...
1      [ethanol ,  desamino NAD ,  acetaldehyde ,  de...
2      [ethanol ,  desamino NAD ,  acetaldehyde ,  de...
3      [ethanol ,  desamino NAD ,  acetaldehyde ,  de...
4      [ethanol ,  desamino NAD ,  acetaldehyde ,  de...
                             ...                        
447    [2-dehydro-3-deoxy-D-arabino-heptonate 7-phosp...
448    [2-dehydro-3-deoxy-D-arabino-heptonate 7-phosp...
449    [2-dehydro-3-deoxy-D-arabino-heptonate 7-phosp...
450                             [L-alanine ,  D-alanine]
451    [ATP ,  L-tyrosine ,  tRNA-Tyr ,  AMP ,  pyrop...
Name: metabolites, Length: 452, dtype: object

State specifiers removed. Next step is to remove duplicates.
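
A `set` of strings has no stable order between Python sessions, so the saved list can come out in a different row order on each run. A minimal sketch of an order-preserving alternative follows. `unique_in_order` is a hypothetical helper, and dropping empty strings is a choice made here, not something the notebook does.

```python
def unique_in_order(lists_of_names):
    """Collect stripped, non-empty names, keeping first-seen order."""
    seen = {}  # dicts preserve insertion order
    for names in lists_of_names:
        for name in names:
            cleaned = name.strip()
            if cleaned:
                seen[cleaned] = None
    return list(seen)


# Toy rows shaped like matched_data['metabolites'].
rows = [
    ["ethanol ", " desamino NAD "],
    [" ethanol", ""],
    ["L-alanine ", " D-alanine"],
]
unique_names = unique_in_order(rows)
```

Stripping before deduplicating also collapses variants such as `'pyrophosphate '` and `' pyrophosphate'` into one entry.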

In [32]:
# Initialising a set to store unique metabolite names
unique_mets = set()

# loop through lists in metabolites 
for mets in matched_data['metabolites']:
    if isinstance(mets, list):
        unique_mets.update(mets)

# convert set back to a list of unique metabolites
unique_mets_list = list(unique_mets)

print(unique_mets_list)

[' phosphoribosyl-1-O-', '', ' 2 L-methionine', ' L-carnitine ', ' cyclohexanol ', 'pyrophosphate ', ' 4-methyl-2-oxopentanooate ', ' pyrophosphate', ' L-tyrosine ', 'acetyl phosphate ', 'D-glucose 6-phosphate ', ' 15-oxo-prostaglandin E2 ', 'phosphorylcholine ', ' thiopyrophosphate ', 'D-fructose 1,6-bisphosphate ', ' D-sedoheptulose ', ' methyl viologen ', 'glycerol ', ' tRNA-Tyr ', ' diphosphate', ' acetyl phosphate ', ' H2O ', 'glycolate ', " 4'-methylacetanilide ", ' pyrophosphate ', ' D-glyceraldehyde 3-phosphate ', ' cyclooctanol ', 'ADP ', ' cis-hex-2-enoyl-CoA ', 'L-phenylalanine ', 'cycloheptanone', 'S-methylmethionine ', ' 2 ammonia ', ' maleate ', " 4'-chloroaniline", ' heteronicotinathiamine', ' desamino NAD ', ' dihydroxyacetone ', ' nicotinamide ', 'L-glutamine ', ' propionaldehyde', ' D-glycine', ' palmitic acid', '5-deoxypyridoxamine ', ' L-glutamate ', 'L-leucine ', ' 8 H2O ', ' cis-aconitate ', ' ADP-N1-oxide ', ' 1-butanol', " 4'-acetylacetanalide ", ' D-glyceraldeh

In [33]:
print(len(unique_mets_list)) # the list of metabolites before whitespace stripping

259


In [35]:
# convert the metabolite list into df to inspect further for mistakes
unique_mets_df = pd.DataFrame(unique_mets_list, columns=['Metabolites'])

In [38]:
unique_mets_df.to_csv('/home/jackmcgoldrick/openTECR/results/list_metabolites_tobe_mapped.csv', index=False)

#### Stripping Whitespace from Each Compound, For correctness in SPARQL Query

In [41]:
############## stripping whitespace to exclude unwanted duplicate strings

# Initialising a set to store unique, whitespace-stripped metabolite names
unique_mets_stripped = set()

# loop through lists in metabolites
for mets in matched_data['metabolites']:
    if isinstance(mets, list):
        # add to the new set; unique_mets from the earlier cell still holds unstripped names
        unique_mets_stripped.update(m.strip() for m in mets)

# convert set back to a list of unique metabolites
unique_mets_striiped_list = list(unique_mets_stripped)

print(unique_mets_striiped_list)

['', 'D-sedoheptulose', 'L-glutamate', 'carbon monoxide', 'phospholysozyme', 'cyclobutanol', 'D-rhamnulose', 'GTP', '-3-hydroxyhexanoyl-CoA', '2-oxo-4-methiolbutyrate', 'ADP-N1-oxide', 'heteroanilithiamine', 'cyclooctaamylose', '3-phospho-D-glyceroyl phosphate', 'L-O-phosphoserine', 'ethanol', 'cycloheptanone', 'L-homocysteine', 'L-mannose', '3-hydroxypyridine-4-aldehyde', 'NADH', 'L-glutamine', 'norpyridoxamine', ')-2-octanol', '2-oxoglutarate', 'O2', 'dihydroxyacetone', '6 D-glucose', "5'-phosphate", 'butyl acetate', '2-phospho-D-glycerate', 'prostaglandin E2', 'cyclohexanol', '-(', '-2-heptanol', '8 D-glucose', 'NADPH', 'GDP', 'cyclobutanone', 'cyclohexanone', 'acetone', 'phosphorylcholine', 'ethanolamine', '2-propanol', 'butyryl-CoA', 'L-tyrosine', '2-oxoglutaramate', '7,8-dihydrofolate', 'thiopyrophosphate', '5,6,7,8-tetrahydrofolate', 'phosphocreatine', 'UMP', '2-oxoisocaproate', '-malate', '2-dehydro-3-deoxy-D-arabino-heptonate 7-phosphate', '4-methyl-5--thiazole', 'L-butyrylcar

In [42]:
print(len(unique_mets_striiped_list))

222


In [44]:
stripped_mets_df = pd.DataFrame(unique_mets_striiped_list, columns=(['Stripped Metabolites']))

In [None]:
# converting to csv to analyse outputs easily
stripped_mets_df.to_csv("/home/jackmcgoldrick/openTECR/results/stripped_mets_tobe_mapped.csv", index=False)

### Cleaning problematic mets, e.g. "8 H2O" and "6 glucophosphate" etc.

In [49]:
# rather than removing these entries, strip the leading digits; duplicate entries are discarded afterwards
## done to prevent possible loss of data
stripped_mets_df['Metabolites'] = stripped_mets_df['Stripped Metabolites'].str.replace(r'^\d+\s', '', regex=True)
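
The pattern `^\d+\s` only matches digits followed by whitespace at the start of the name, so locants that are part of a name are left alone. A small sketch of the same substitution on names taken from the output above (`drop_coefficient` is a hypothetical helper):

```python
import re

# Same pattern as in the cell above: leading digits followed by one space.
LEADING_COEFFICIENT = re.compile(r"^\d+\s")


def drop_coefficient(name):
    """Remove a leading stoichiometric coefficient such as the "8 " in "8 H2O"."""
    return LEADING_COEFFICIENT.sub("", name)


examples = ["8 H2O", "6 D-glucose", "8 D-glucose", "2-oxoglutarate", "5'-phosphate"]
cleaned = [drop_coefficient(name) for name in examples]
```

"6 D-glucose" and "8 D-glucose" both become "D-glucose", which is why duplicates have to be dropped after this step.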

In [50]:
stripped_mets_df

Unnamed: 0,Stripped Metabolites,Metabolites
0,,
1,D-sedoheptulose,D-sedoheptulose
2,L-glutamate,L-glutamate
3,carbon monoxide,carbon monoxide
4,phospholysozyme,phospholysozyme
...,...,...
217,4'-cyanoacetanilide,4'-cyanoacetanilide
218,5-deoxypyridoxamine,5-deoxypyridoxamine
219,cyclooctanone,cyclooctanone
220,benzyl acetate,benzyl acetate


In [51]:
# update the corresponding csv file
stripped_mets_df.to_csv("/home/jackmcgoldrick/openTECR/results/stripped_mets_tobe_mapped.csv", index=False)

In [52]:
# Drop duplicate metabolite names
unique_cleaned_mets = stripped_mets_df.drop_duplicates(subset=['Metabolites'], keep='first')

# Display the resulting DataFrame
print(unique_cleaned_mets)

    Stripped Metabolites          Metabolites
0                                            
1        D-sedoheptulose      D-sedoheptulose
2            L-glutamate          L-glutamate
3        carbon monoxide      carbon monoxide
4        phospholysozyme      phospholysozyme
..                   ...                  ...
217  4'-cyanoacetanilide  4'-cyanoacetanilide
218  5-deoxypyridoxamine  5-deoxypyridoxamine
219        cyclooctanone        cyclooctanone
220       benzyl acetate       benzyl acetate
221          amoxicillin          amoxicillin

[212 rows x 2 columns]


In [53]:
# save to csv for manual corrections of hard to correct mets
unique_cleaned_mets.to_csv("/home/jackmcgoldrick/openTECR/results/mets_toBe_mapped_Final.csv", index=False)

Some of the problematic entries were cleaned manually. Some metabolites may still be unaccounted for, but 214 mets is a good start and covers the majority of reactions.

- 214 comes from the addition of two metabolites which were uncovered via manual inspection

- When inspecting **mets_toBe_mapped_Final.csv**, note that the "Stripped Metabolites" column was removed manually as it was no longer needed
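
The manual pass could be narrowed by flagging names that look like fragments. The rules below are rough heuristics chosen for illustration, not the criteria actually used during the manual clean-up, and they will miss some cases (for example `5'-phosphate`). `needs_review` is a hypothetical helper.

```python
def needs_review(name):
    """Return True when a name looks like a broken fragment."""
    if not name:
        return True  # empty string left over from the extraction
    if name[0] in "-)," or name[-1] in "-(,":
        return True  # starts or ends where a name was cut
    if "--" in name:
        return True  # something was removed from the middle
    return name.count("(") != name.count(")")  # unbalanced parentheses


# Names taken from the stripped list printed earlier.
names = [
    "", "-(", ")-2-octanol", "-malate", "4-methyl-5--thiazole",
    "L-glutamate", "2-dehydro-3-deoxy-D-arabino-heptonate 7-phosphate",
]
flagged = [name for name in names if needs_review(name)]
```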