# Finding Your Reaction

## <u>Authors</u>: Laura GOMEZ, Eva GULEAC, Lilou BUFFET
## <u>Date</u>: 24/05/2024

---

## Table of Contents
1. [Introduction](#introduction)
2. [Objectives](#Objectives)
3. [Demonstration and Results](#Demonstration-and-Results)
4. [Functionalities](#Functionalities)
5. [Database](#Database)
6. [Discussion](#discussion)
7. [Conclusion](#conclusion)


## Introduction:

In the world of chemistry, finding the most efficient reaction pathway to a given product can significantly impact both research outcomes and industrial applications, and it can save a chemist considerable time and resources. Imagine being a chemist a century ago, trying to synthesise a molecule without knowing exactly which reaction to use. You would have to produce your desired molecule through trial and error, trying every reaction you could think of. While that can be part of the beauty of the art of chemistry, today's world is focused on efficiency, and this is what this code is for.

## Objectives:

We wanted to write a program that outputs the formation reaction of a product entered by the user, within the limits of our database.<br>
If the product can be formed by several reactions, the one with the highest yield is returned.<br>
If the given product is not in the database, the program offers to look for an isomer that is present in the database and to return its formation reaction, so as not to leave the user empty-handed.<br>
In addition, the reaction is displayed as a 2-dimensional image, and the product is plotted in a 3-dimensional graph.<br>
If the user needs help choosing a product, the package suggests random products from the database to try.<br>

## Demonstration and Results:
For example, if we enter "cyanide" when the program prompts us:<br>
<img src="attachment:78991813-c708-4d1a-be7c-97c9c16f82e5.png" alt="Image" width="600" height="500"><br>
The reaction information is displayed as follows:<br>
<img src="attachment:25651f50-81f4-49f5-82d4-269cf9ac40a9.png" alt="Image" width="500" height="500"><br>
The reaction in 2D:<br>
<img src="attachment:580ad6c3-639b-430b-af9c-a1249d42e83c.png" alt="Image" width="900" height="500"> <br>
The 3D representation of the cyanide:<br>
<img src="attachment:24d0f4a8-8524-43fc-b4a6-8f701f927978.png" alt="Image" width="500" height="500"> <br>

## Functionalities

This part describes the functions of the code, together with their arguments and return values.

The function `remove_atom_mapping` removes the atom mapping from a given SMILES notation; it is then applied to our database.
>The argument taken by the function is smiles (str), the SMILES-like notation with atom mapping numbers.<br>
>It returns smiles_without_mapping (str), the SMILES notation without atom mapping numbers.

Please find an example of its use below.

In [11]:
from functions import remove_atom_mapping
print(remove_atom_mapping("C[CH2:1][OH:2]"))

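As an illustration of what stripping atom mapping involves, here is a minimal sketch based on a regular expression. It is not the package's own implementation: the helper name `strip_atom_mapping` and the choice to keep the square brackets are assumptions made for this example.

```python
import re

# Hypothetical helper for illustration; not the package's remove_atom_mapping.
# An atom map is the ":<number>" suffix inside a bracket atom, e.g. [CH2:1].
_ATOM_MAP = re.compile(r":\d+(?=\])")


def strip_atom_mapping(smiles: str) -> str:
    """Return the SMILES string with every ':<number>' atom map removed."""
    return _ATOM_MAP.sub("", smiles)


print(strip_atom_mapping("C[CH2:1][OH:2]"))  # C[CH2][OH]
```

Because the pattern only touches a colon followed by digits right before a closing bracket, it can be applied to a full reaction SMILES (reactants, agents and products) in one pass.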

The function `remove_percent_symbol` removes the percentage symbol from the values in the yield column, so that the yields of the reactions in our database can be compared.
>The argument is value (str), the percentage value with the '%' symbol.<br>
>It returns value_without_percent (str), the percentage value without the '%' symbol.<br>

In [12]:
from functions import remove_percent_symbol
remove_percent_symbol("15%")

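The reason for stripping the symbol is that yields stored as text compare alphabetically rather than numerically. The sketch below shows this and then picks the highest-yield entry among a few candidates. The helper names and the toy yield values are made up for the example; the package's own code may differ.

```python
def strip_percent(value: str) -> str:
    """Hypothetical stand-in for remove_percent_symbol: drop the '%' sign."""
    return value.replace("%", "").strip()


def best_yield(candidates):
    """Return the (label, yield) pair with the highest numeric yield."""
    return max(candidates, key=lambda pair: float(strip_percent(pair[1])))


# Toy data, invented for illustration only.
candidates = [("reaction A", "9%"), ("reaction B", "15%"), ("reaction C", "7.5%")]

print("9%" > "15%")            # True: text comparison goes character by character
print(best_yield(candidates))  # ('reaction B', '15%')
```

Converting to `float` only at comparison time keeps the stored value unchanged, which is one possible design; converting the whole column once would avoid repeated parsing on a large database.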

To find constitutional isomers, the function `clean_string` deletes all brackets and parentheses, as well as the characters =, + and -.
This function is applied to the database to build the isomer database and, later on, to give the molecule's SMILES if the user asks for it.
>**Arguments:**
    value (str), a string that may contain special characters that need to be removed.<br>
>**Returns:**
    value_without_caracter (str), the string with the characters [, ], (, ), +, -, and # removed.

In [13]:
from functions import clean_string
clean_string("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")

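A character filter of this kind can be sketched with `str.translate`. This is an illustration rather than the package's code: the helper name is hypothetical, and since the description mentions `=` in one place and `#` in another, removing both is an assumption.

```python
# Characters to drop. Removing both '=' and '#' is an assumption made for
# this sketch, because the description lists them in different places.
_REMOVE = str.maketrans("", "", "[]()=+-#")


def clean_smiles_string(value: str) -> str:
    """Hypothetical stand-in for clean_string: delete the listed characters."""
    return value.translate(_REMOVE)


print(clean_smiles_string("C(=O)[O-]"))  # COO
```

What remains is the sequence of atom symbols and ring-closure digits, which is the raw material for the isomer search.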

To get a random product from the database, the function `main()` has been created.
>**Usage**:<br>
    - Press Enter: To get a random product from the DataFrame.<br>
    - Type 'exit': To quit the program.<br>
    - After displaying a random product, type 'yes' to continue or 'no' to exit.<br>

In [14]:
from functions import main
main()

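The interaction described above can be sketched as a small loop. This is an illustration, not the package's `main()`: the function name, the prompts, and the injectable `read` and `show` parameters (added so the loop can be exercised without a keyboard) are assumptions.

```python
import random


def random_product_loop(products, read=input, show=print):
    """Show random products until the user types 'exit' or answers 'no'."""
    while True:
        answer = read("Press Enter for a random product, or type 'exit': ")
        if answer.strip().lower() == "exit":
            break
        show(random.choice(products))
        if read("Another one? (yes/no): ").strip().lower() != "yes":
            break


# Example call (interactive), with a made-up product list:
# random_product_loop(["CCO", "CC(=O)O", "c1ccccc1"])
```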

To check whether an input is in SMILES notation, the `is_smiles` function has been created.
>**Arguments:** smiles (str), the string to check.<br>
>**Returns:** is_valid (bool), True if the string is a valid SMILES notation, False otherwise.

In [15]:
from functions import is_smiles
print(is_smiles("2-Acetoxybenzoic acid"))
print(is_smiles("CC(=O)OC1=CC=CC=C1C(=O)O"))


The `name_to_smiles` function converts an IUPAC name into a SMILES string so that the molecule can be searched for in the database. It relies on the PubChem database, accessed through PubChemPy.
 > **Arguments:** molecule_name (str), the name of the molecule.<br>
   **Returns:** smiles (str), the SMILES notation of the molecule, or None if retrieval fails.

This function can fail if the molecule isn't in the PubChem database.

In [16]:
from functions import name_to_smiles
name_to_smiles("1,3,7-Trimethylxanthine")


To find the rows of the database that contain the required product, we use the `find_molecule_rows` function.
>**Arguments:**<br>
>- dataFrame (pd.DataFrame): The DataFrame to search.<br>
>- string_input_mol (str): The molecule to search for.<br>
>- start_col (int): The starting column index for the search.<br>
>- end_col (int): The ending column index for the search. If None, searches until the last column.<br>

>**Returns:**<br>
>- List[int]: A list of row indices where the molecule is found.

In [7]:
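from functions import find_molecule_rows
# Assumption: dataFrame is the reaction database already loaded as a pandas DataFrame
find_molecule_rows(dataFrame, "[Br-]", start_col=0)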

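The search logic can be illustrated without pandas. The sketch below works on a plain list of rows and assumes that a molecule is "found" when a cell equals it exactly and that `end_col` is exclusive; the real `find_molecule_rows` operates on a DataFrame and may match differently. The function name and the toy rows are invented for the example.

```python
from typing import List, Optional, Sequence


def find_rows(rows: Sequence[Sequence[str]], molecule: str,
              start_col: int = 0, end_col: Optional[int] = None) -> List[int]:
    """Return indices of rows with a cell equal to molecule in [start_col, end_col)."""
    return [i for i, row in enumerate(rows) if molecule in row[start_col:end_col]]


# Toy table: column 0 is a label, columns 1+ hold molecules (made-up data).
rows = [
    ["r1", "CCBr", "[Br-]"],
    ["r2", "CCO", "O"],
    ["r3", "[Br-]", "CC"],
]
print(find_rows(rows, "[Br-]", start_col=1))  # [0, 2]
```

Restricting the search to a column range is what lets the same routine look only at product columns and ignore reactants.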

To find all the possible isomers of a molecule, we generate a list of all the possible permutations of its characters, using the `generate_permutations` function.
>**Arguments:**
    - input_string (str): The input string for which permutations are to be generated.<br>
>**Returns:**
    - list of str: A list containing all possible permutations of the characters in the input string.


In [8]:
from functions import generate_permutations
generate_permutations("CCO")


This list is then passed to the `is_smiles` function so that only the valid configurations are kept.
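Generating every ordering of a string can be sketched with `itertools.permutations`. The helper below is hypothetical and removes duplicate orderings, which the package's `generate_permutations` may or may not do. The second print shows why this step becomes costly: the number of orderings grows factorially with the length of the string.

```python
from itertools import permutations
from math import factorial


def unique_permutations(text: str):
    """Hypothetical stand-in for generate_permutations, with duplicates removed."""
    return sorted({"".join(p) for p in permutations(text)})


print(unique_permutations("CCO"))  # ['CCO', 'COC', 'OCC']
print(factorial(12))               # 479001600 orderings for a 12-character string
```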

To give more information about the molecule to be formed, the `get_molecule_name` function converts the SMILES entered by the user into the IUPAC name.
> **Arguments:** smiles (str), SMILES representation of the molecule.<br>
  **Returns:** name (str), Common name of the molecule.

In [9]:
from functions import get_molecule_name
get_molecule_name("C1=CC=CC=C1")


There is also a function, `get_molecular_weight`, that gives the molecular weight of the molecule that the user has entered.
>**Arguments:** smiles (str): SMILES string of the molecule.<br>
 **Returns:** float, Molecular weight of the molecule.

In [10]:
from functions import get_molecular_weight
get_molecular_weight("C1=CC=C(C=C1)C(=O)O")


To obtain a 3D visualisation of the molecule, we have the function `plot_molecule_3D`.
> **Arguments:**
     smiles (str), SMILES representation of the molecule.<br>
   **Returns:** 3D plot image

In [None]:
from functions import plot_molecule_3D
plot_molecule_3D("C=C")

## Database 
To use the database that serves as the reference for this code, please refer to the README of the project for its installation.
If you want to swap the database for another one, replace "1976_Sep2016_USPTOgrants_smiles.rsmi" in the code with the name of your file, and check that the titles of your columns match the ones that we use. The part of the code that you should also change is the following:

In [None]:
import pandas as pd
from functions import remove_atom_mapping
# Create a DataFrame where the reaction is in one column
dataFrameImage = pd.read_csv("your_file.rsmi", delimiter="\t", low_memory=False)
# Columns that you want to delete from your DataFrame, if any
columns_to_delete = ["PatentNumber", "ParagraphNum", "Year", "TextMinedYield"]
dataFrameImage.drop(columns=columns_to_delete, inplace=True)
# Strip the atom-mapping numbers from every reaction SMILES
dataFrameImage["ReactionSmiles"] = dataFrameImage["ReactionSmiles"].apply(remove_atom_mapping)
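Before swapping in another file, it may help to check its header against the column names used in the cell above. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the sample data are assumptions made for the example.

```python
import csv
import io

# Column names taken from the cell above; adjust to your own file.
REQUIRED_COLUMNS = ["ReactionSmiles", "PatentNumber", "ParagraphNum", "Year", "TextMinedYield"]


def missing_columns(handle, required=REQUIRED_COLUMNS):
    """Read the header of a tab-separated file and list the required columns it lacks."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]


# Demonstration on an in-memory file with a made-up, incomplete header.
sample = io.StringIO("ReactionSmiles\tYear\tTextMinedYield\nCCO>>CC=O\t2001\t80%\n")
print(missing_columns(sample))  # ['PatentNumber', 'ParagraphNum']

# With a real file: with open("your_file.rsmi", newline="") as f: missing_columns(f)
```

An empty list means every expected column is present; any name returned would otherwise make the `drop` call or the `ReactionSmiles` lookup fail.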

And that's it! You should now be able to use our code with your own database.

# Discussion:

## Advantages: 

-With slight modifications, the package can be used with **various databases**. Our conversion to SMILES without atom mapping can be applied easily.<br>

-The **time spent searching** for the formation reaction of the needed molecule is, if the molecule is contained in our database, drastically reduced, and the search is far less tedious than looking anywhere else.<br>

-The database used in the project covers all reactions patented in the **USA between 1976 and 2016**. Therefore it searches, after some selection, through more than **700 000 reactions**.<br>

-The code is interactive and can be **used by everyone**, as it accepts various inputs such as the common name or the SMILES notation of the molecule.<br>

-The input doesn't need to be in SMILES. Common molecule names are converted by the package, since chemists use that language only in programming.<br>

-If the molecule isn’t in the database, the user still has the option to search for a possible **isomer formation reaction** instead of the initial molecule.<br>

-To test the code, or if the user doesn’t know which molecule to form, they can ask for a **random molecule’s SMILES** from the database.<br>

-**Visualizations of the reaction and the product** make the information more digestible and readable than the SMILES line notation alone.<br>

-There is a **progress bar** at a few steps of the code to show the progress of the search in the database.


## Limitations and possible improvements

-The library used in the `name_to_smiles` function is PubChemPy. Therefore, if the **input** that needs to be converted to SMILES **isn't in the PubChem database**, the code won't work, even if a reaction exists in our database.<br>

-The initial data has over 1 million reactions, but because we chose to remove all reactions that didn't report their yields, a substantial amount of **information is lost**. However, more than a hundred thousand reactions remain.<br>

-The code has a fairly **long running time**, as the database is large.

-The 3D plot could be made interactive.

## Challenges faced:

One challenge was rooted in the size of the database. Indeed, some functions each took a few minutes to run, and some caused the kernel to die before completing the task. This made writing each function tedious. <br>
To tackle this problem, and because the size of the data could not be reduced, some functions had to be seriously optimized for everything to run smoothly.<br>
When searching for isomers, it was a difficult task to include all the possible permutations of the molecule.


## Conclusion

So there you have it: a complete tool that every experimental chemist would wish to have, even though some aspects could still be improved.