<a href="https://colab.research.google.com/github/patlewig/aim/blob/master/notebooks/SMARTS_to_CSRML_label_maker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Helper script to speed up CSRML creation

This script takes the CSRML generated by the original chemotype editor and converts it into a more organized format.

The chemotype editor took SMARTS as input from an Excel file.

#### QAPP ID: I-CCED-0032994-QP-1-0
#### Author: Matthew Adams
#### Principal Investigator: Grace Patlewicz
#### Last Modified May 05 2022

## Purpose

This lets us add labels and titles to any SMARTS-to-CSRML conversion by changing the original ID from this:

```
<subgraph id="r00001">
            <molecule id="m1">

```
To this:

```
<subgraph id = "0">
            <label>"‐CH3 [aliphatic carbon]"</label>
            <title>"‐CH3 [aliphatic carbon]"</title>
            <comment>"‐CH3 [aliphatic carbon]"</comment>
            <molecule id="m1">
```





## Workflow

1. Upload AIM Fragments Excel File
2. Use the SMARTS to CSRML tool in ChemoTyper to generate CSRML
3. Create an XML file containing newly converted AIM structures
4. Convert Excel to Pandas dataframe
5. Iterate over Pandas rows to replace XML lines
6. Save and Download new XML file
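
Steps 4 and 5 can be sketched on toy data as follows. This is a minimal illustration, not the notebook's exact code: the `rows` list stands in for the rows of the Pandas dataframe, the `csrml` string stands in for the converter output, and the `relabel` helper is a hypothetical name introduced here.

```python
import re

# Hypothetical stand-in for the dataframe rows (steps 1 and 4).
rows = [
    {"Fragment": "-Hg- [mercury]", "Id": "310C", "Subgraph": "r00001"},
    {"Fragment": "[Pb] (Lead)", "Id": "311C", "Subgraph": "r00002"},
]

# Hypothetical stand-in for the CSRML produced by the converter (steps 2 and 3).
csrml = (
    '<subgraph id="r00001">\n'
    '    <molecule id="m1"/>\n'
    '</subgraph>\n'
    '<subgraph id="r00002">\n'
    '    <molecule id="m1"/>\n'
    '</subgraph>\n'
)


def relabel(csrml_text, table):
    """Swap each placeholder subgraph tag for one carrying the fragment Id and name."""
    for row in table:
        name = str(row["Fragment"])
        header = "\n".join(
            [
                '<subgraph id = "{}">'.format(row["Id"]),
                '    <label>"{}"</label>'.format(name),
                '    <title>"{}"</title>'.format(name),
                '    <comment>"{}"</comment>'.format(name),
            ]
        )
        pattern = '<subgraph id="{}">'.format(re.escape(str(row["Subgraph"])))
        # A function replacement keeps backslashes in names from being read as escapes.
        csrml_text = re.sub(pattern, lambda _match, h=header: h, csrml_text)
    return csrml_text


relabelled = relabel(csrml, rows)
print(relabelled)
```

With a real dataframe, the same loop body would run over `df.iterrows()` instead of a list of dictionaries.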


In [None]:
import pandas as pd
import os
import re
os.getcwd()

# I originally tried to use an XML parser instead of regex, but it did not seem able to match by text pattern, only by <tags> and their attributes

# ChemoTyper does not like the "<" character in fragment names


'/content'
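
For comparison, a parser-based version of the relabelling might look like the sketch below. It is not what this notebook does: it uses the standard library's `xml.etree.ElementTree`, which selects elements by tag and attribute rather than by text pattern. The namespace URI is taken from the CSRML header; the `sample` document and the `names` lookup table are hypothetical, and whether ChemoTyper accepts ElementTree's serialization has not been checked here.

```python
import xml.etree.ElementTree as ET

NS = "http://www.molecular-networks.com/schema/csrml"  # namespace from the CSRML header
ET.register_namespace("", NS)  # keep the default namespace unprefixed on output

# Hypothetical, heavily trimmed CSRML document.
sample = (
    '<csrml xmlns="{ns}">'
    '<subgraph id="r00001"><molecule id="m1"/></subgraph>'
    "</csrml>"
).format(ns=NS)

# Hypothetical lookup table: placeholder id -> (new id, fragment name).
names = {"r00001": ("310C", "-Hg- [mercury]")}

root = ET.fromstring(sample)
# Copy the matches into a list so the tree is not modified while it is being walked.
for subgraph in list(root.iter("{%s}subgraph" % NS)):
    old_id = subgraph.get("id")
    if old_id not in names:
        continue
    new_id, fragment = names[old_id]
    subgraph.set("id", new_id)
    # Insert label, title and comment ahead of the existing children, in that order.
    for position, tag in enumerate(("label", "title", "comment")):
        child = ET.Element("{%s}%s" % (NS, tag))
        child.text = '"{}"'.format(fragment)
        subgraph.insert(position, child)

result = ET.tostring(root, encoding="unicode")
print(result)
```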

In [None]:
# Load the Excel sheet

df = pd.read_excel(
    'AppendixB_AIM_465_fragments_v060718_missing.xlsx',
    sheet_name='AIM_Fragments_r0_011122',
    header=1,
    usecols=list(range(0, 6)),
)
df

Unnamed: 0,Fragment,Id,Modified Name,General Comments,SMARTS,Subgraph
0,‐Hg‐ [mercury],310C,,,[Hg],r00001
1,[Pb] (Lead),311C,,,[Pb],r00002
2,[As] (Arsenic),312C,,,[As],r00003
3,[Ge] (Germanium),313C,,,[Ge],r00004
4,Tin [Sn],314C,,,[Sn],r00005
5,Tin [Sn] { oxygen attach },315C,,"R’, R”, R’’’ Can be H;\nR’’’’ Cannot be H",[SnX4](C)(C)(C)[O][!#1],r00006
6,Tin [Sn] { oxygen and aromatic attach },316C,,Must have at least one aromatic attachment.,[cR][Snv2]O[!#1],r00007
7,Tin [Sn] { halogen or ‐OH attach },317C,,"R’ = Hydroxy, Halogen; R”, R’’’, R’’’’ can be H","[#6][Snv2,Snv4][Mt]",r00008
8,Aluminum [Al],318C,,,[Al],r00009
9,Gold [Au]=P { Phosphorus attach },319C,,R’s can be H,[Au]=P,r00010


In [None]:
# Show what we want inside our file versus what the converter produces
for index, row in df.head(2).iterrows():
    original_subgraph = '<subgraph id="' + str(row['Subgraph']) + '">'
    subgraph = '<subgraph id = "' + str(row['Id']) + '">'
    label = '     <label>"' + row['Fragment'] + '"</label>'
    title = '     <title>"' + row['Fragment'] + '"</title>'
    comment = '     <comment>"' + row['Fragment'] + '"</comment>'
    #print(original_subgraph)
    print(subgraph + '\n' + label + '\n' + title + '\n' + comment)
  

<subgraph id = "310C">
     <label>"‐Hg‐ [mercury]"</label>
     <title>"‐Hg‐ [mercury]"</title>
     <comment>"‐Hg‐ [mercury]"</comment>
<subgraph id = "311C">
     <label>"[Pb] (Lead)"</label>
     <title>"[Pb] (Lead)"</title>
     <comment>"[Pb] (Lead)"</comment>


In [None]:
# Open and read our XML file
with open("AIM_V1.10_miss.xml", "r") as myfile:
    data = myfile.read()
print(data)

<?xml version='1.0' encoding='utf-8'?>
<csrml xmlns="http://www.molecular-networks.com/schema/csrml" id="AIM_v1_0" csrmlVersion="2">

	<title>
		AIM Chemotypes Version 1.0
	</title>
	<description>
		$Id: AIM_V1.0.xml 711 2022-01-19 madams $
		@manualVersion: v.1.0
		@lastChangedBy: $Author: madams $
		@revision:      $Revision: 0 $
	</description>

	<classes id="AIM_v1_0">
		<title>AIM Chemotypes Version 1.0</title>
		<comment>symphony_display</comment>
		<class id="class_0001">
			<label>atom</label>
			<class id="class_0002">
				<label>element</label>
			</class>
                </class>
		<class id="class_0003">
		        <label>bond</label>
			<class id="class_0004">
				<label>C#N</label>
				
			</class>
			<class id="class_0005">
				<label>C(~Z)~C~Q</label>
				
			</class>
			<class id="class_0006">
				<label>C(=O)N</label>
				
			</class>
			<class id="class_0007">
				<label>C(=O)O</label>
			</class>
			<class id="class_0008">
				<label>C=N</label>
			</class>
			<class

In [None]:
# Replace the text patterns with our desired pattern inside of our XML file
for index, row in df.iterrows():
    subgraph = '<subgraph id = "' + str(row['Id']) + '">'
    label = '            <label>"' + str(row['Fragment']) + '"</label>'
    title = '            <title>"' + str(row['Fragment']) + '"</title>'
    comment = '            <comment>"' + str(row['Fragment']) + '"</comment>'
    replacement = subgraph + '\n' + label + '\n' + title + '\n' + comment
    # Pass a function as the replacement so backslashes in fragment names are kept literally
    data = re.sub(r'<subgraph id="\b{}\b">'.format(row['Subgraph']), lambda m: replacement, data)

# Strip characters that ChemoTyper rejects from quoted names.
# Note: inside [...] the "|" is a literal, so "<", "|" and "&" are all removed.
#print(re.findall(r'"(.*[<|&].*)"', data))
for match in re.findall(r'"(.*[<|&].*)"', data):
    match_sub = re.sub(r"[<|&]", "", match)
    #match_sub = re.sub(r"&", "and", match)
    data = data.replace(match, match_sub)

# Replace Mt with X (Mt is recognized by the converter but unused; X is not recognized but represents halogens)
data = re.sub("Mt", "X", data)

# Fix the elementList and bondList <value> problem: put each value on the same line as its <value> tag
strings = re.findall(r'<value>\n  *[a-zA-Z]{1,6}\n  *', data)

#print(strings)
for string in strings:
    string = "".join(string.split())
    #print(string[7:])
    # string[7:] is the text after the 7-character "<value>" prefix
    data = re.sub(r'<value>\n  *\b{}\b\n  *'.format(string[7:]), string, data)
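
To make the `<value>` fix concrete, the sketch below applies the same idea to a toy string in a single pass, using a capture group instead of the find-then-replace loop. The layout of `snippet` is an assumption inferred from the regular expression above, not copied from the real converter output.

```python
import re

# Hypothetical fragment mimicking how the converter lays out <value> elements.
snippet = (
    "<elementList>\n"
    "    <value>\n"
    "        Cl\n"
    "    </value>\n"
    "</elementList>\n"
)

# Drop the newline and indentation on both sides of the value text.
collapsed = re.sub(r"<value>\n +([a-zA-Z]{1,6})\n +", r"<value>\1", snippet)
print(collapsed)
```

After the substitution, the element reads `<value>Cl</value>` on a single line.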


In [None]:
# Write our changes to a new XML file
with open("AIM_V1.10_missing.xml", "w") as myfile:
    n_chars = myfile.write(data)
n_chars

73735
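
Because the edits are made with regular expressions, it can be worth checking that the result still parses as XML. The sketch below is one possible check using the standard library's `xml.etree.ElementTree`. The `is_well_formed` helper and both sample strings are hypothetical, and the check only covers XML well-formedness, not validity against the CSRML schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical snippets: a stray "<" inside a name breaks parsing.
good = '<subgraph id="310C"><label>"-Hg- [mercury]"</label></subgraph>'
bad = '<subgraph id="310C"><label>"C < N"</label></subgraph>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

In the notebook, the same helper could presumably be called on `data` before the file is written.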