<div align="center">

# **💥MolStorage💥**

</div>

<div align="center">

### A simple tool to help students analyze and understand the safety of different molecules and their proper storage

</div>

## **1. Introduction**
### 1.1. Motivation


<div style="text-align: justify; text-justify: inter-word; width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
    Chemistry students are often faced with the recurring challenge of gathering critical information about chemical compounds, such as safety data sheets (SDS), hazard pictograms, and specific properties. This process is not only tedious and time-consuming but also essential for ensuring proper handling and safety in both academic and laboratory settings.
    <br><br>
    We therefore developed an algorithm that organizes chemical products according to the storage guidelines provided by EPFL, so that users no longer have to look up the characteristics of each product by hand. The tool also identifies and displays the hazard categories associated with each substance.
</div>

### 1.2. Theory
<div style="text-align: justify; text-justify: inter-word; width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
    Our first reference is from the EPFL website (<a href="https://www.epfl.ch/campus/security-safety/en/lab-safety/hazards/chemical-hazards/chemicals-storage/">https://www.epfl.ch/campus/security-safety/en/lab-safety/hazards/chemical-hazards/chemicals-storage/</a>), which outlines whether certain products can be stored together based on their hazard pictograms. The table provided offers the necessary guidelines to ensure proper and safe storage. The relevant criteria are illustrated in the figure below.
    <br><br>
    <div style="text-align: center;">
        <img src="../assets/security_table_english.png" alt="Illustration" style="max-width: 80%; height: auto;">
    </div>
    
</div>

<div style="text-align: justify; text-justify: inter-word; width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
    By looking at this table, we can see that in order to properly store the products, we need to identify their pictograms as well as determine whether they are acids or bases.
    <br><br>
The second source, also from EPFL and shown below, details the proper storage of a product based on its pictograms (<a href="https://www.epfl.ch/campus/security-safety/wp-content/uploads/2024/01/Chemicals-Storage-flowchart_2024.pdf">EPFL Chemicals Storage Flowchart 2024</a>).
 <br><br>
    <div style="text-align: center;">
        <img src="../assets/Security_flowchart.jpg" alt="Illustration" style="max-width: 80%; height: auto;">
    </div>
</div>

## **2. Functionalities**
### 2.1. Definition of functions available in the package

<div style="text-align: justify; text-justify: inter-word; width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
Our package molstorage contains 8 main functions. <br><br>
1. <code>get_compound_safety_data</code> <br><br>
Takes a molecule's common name as a <code>str</code> input and returns a <code>Tuple[str, List[str], List[str]]</code> containing the PubChem CID (ID), GHS pictograms, and hazard statements retrieved from PubChem’s REST API. <br><br>
2. <code>get_name_and_smiles</code> <br><br>
Takes a PubChem compound ID (CID) as a <code>str</code> input and returns a <code>Tuple[str, str, str]</code> containing the compound’s Record Title (generic name), IUPAC name, and SMILES string, using the PubChemPy library. <br><br>
3. <code>classify_acid_base</code> <br><br>
Takes four inputs: <code>name: str, iupac_name: str, smiles: str, ghs_statements: List[str]</code>. <br><br>
Where: <p style="text-indent: 2em;">- <strong>name</strong> is the generic name of the compound</p>
<p style="text-indent: 2em;">- <strong>iupac_name</strong> is the IUPAC name of the compound</p>
<p style="text-indent: 2em;">- <strong>smiles</strong> is the SMILES string representing the compound’s chemical structure</p>
<p style="text-indent: 2em;">- <strong>ghs_statements</strong> is the list of GHS hazard statements related to the compound</p>
The function returns a classification as a <code>Union[str, Tuple[str, ...]]</code> such as "acid", "base", "neutral", or "amphoteric", based on the compound’s chemical features and names. <br><br>
4. <code>get_mp_bp</code> <br><br>
Takes a compound name as a <code>str</code> input and returns a <code>Tuple[Optional[float], Optional[float], Optional[float], Optional[float]]</code> representing the average melting and boiling points in Celsius and Fahrenheit, respectively, extracted from PubChem data. Returns <code>None</code> for any temperature that cannot be found. <br><br>
5. <code>compound_state</code> <br><br>
Takes four optional floating-point numbers representing melting and boiling points in Celsius and Fahrenheit (<code>Optional[float]</code>) and returns a <code>str</code> indicating the predicted physical state ("solid", "liquid", "gas", or "unknown") at room temperature (20°C / 68°F). <br><br>
6. <code>prioritize_pictograms</code> <br><br>
Takes a <code>List[str]</code> of GHS pictogram names and returns a <code>List[str]</code> sorted according to a predefined hazard severity priority, where lower numbers indicate more severe hazards. Unknown pictograms are placed at the end of the list. <br><br>
7. <code>is_chemically_compatible</code> <br><br>
Takes multiple inputs: <code>existing_pictograms: List[str], new_pictograms: List[str], existing_acid_base_class: str, new_acid_base_class: str, existing_state: str, new_state: str, group_name: str</code>. <br><br>
Where: <p style="text-indent: 2em;">- <strong>existing_pictograms</strong> are the GHS pictograms for the existing chemical</p>
<p style="text-indent: 2em;">- <strong>new_pictograms</strong> are the GHS pictograms for the new chemical</p>
<p style="text-indent: 2em;">- <strong>existing_acid_base_class</strong> is the acid/base classification of the existing chemical</p>
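<p style="text-indent: 2em;">- <strong>new_acid_base_class</strong> is the acid/base classification of the new chemical</p>
<p style="text-indent: 2em;">- <strong>existing_state</strong> is the physical state of the existing chemical</p>
<p style="text-indent: 2em;">- <strong>new_state</strong> is the physical state of the new chemical</p>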
<p style="text-indent: 2em;">- <strong>group_name</strong> is the storage group used for applying group-specific compatibility rules</p>
The function returns a <code>bool</code> indicating whether the two chemicals are compatible for storage based on their pictograms, acid/base classifications, physical states, and group. Returns <code>True</code> if compatible, <code>False</code> otherwise. <br><br>
8. <code>chemsort_multiple_order_3</code>
<br><br>The function assigns each compound to a predefined group if the compound is compatible with that group and with the compounds it already contains. If a compound is not compatible with any predefined group, it is either added to a matching custom group or placed in a newly created custom group. <br><br>  
This function takes two inputs: <code>compounds</code> (<code>List[Dict[str, Any]]</code>) and <code>storage_groups</code> (<code>Dict[str, Dict[str, List[Dict[str, Any]]]]</code>).<br><br>
Where:
<p style="text-indent: 2em;">- <strong>compounds</strong> are all the compounds to sort into their proper storage. Each compound is represented by a dictionary with the keys <code>name</code>, <code>sorted_pictograms</code>, <code>hazard_statements</code>, <code>acid_base_class</code> and <code>state_room_temp</code>, which are needed to determine incompatibilities between compounds.</p>
<p style="text-indent: 2em;">- <strong>storage_groups</strong> is a dictionary representing the existing storage groups. Each key is a group name, mapping to another dictionary with the keys 'solid', 'liquid', and 'gas', each containing a list of compatible compounds. The dictionary is empty the first time the function is called.</p>
The function returns a <code>Dict[str, Dict[str, List[Dict[str, Any]]]]</code>, the updated dictionary of storage groups with the compounds sorted into the appropriate categories.
</div>
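
As an illustration of the room-temperature check described for `compound_state`, here is a minimal sketch. It is not the package's implementation: the function name `sketch_compound_state`, the argument order (melting point in °C, boiling point in °C, then the same two in °F) and the fallback to Fahrenheit values are assumptions made for this example.

```python
from typing import Optional

ROOM_TEMP_C = 20.0  # room temperature used in the description above (20°C / 68°F)


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def sketch_compound_state(
    mp_c: Optional[float] = None,
    bp_c: Optional[float] = None,
    mp_f: Optional[float] = None,
    bp_f: Optional[float] = None,
) -> str:
    """Guess the physical state at room temperature (hypothetical helper)."""
    # Assumption: when a Celsius value is missing, use the Fahrenheit one instead.
    if mp_c is None and mp_f is not None:
        mp_c = fahrenheit_to_celsius(mp_f)
    if bp_c is None and bp_f is not None:
        bp_c = fahrenheit_to_celsius(bp_f)

    # Room temperature is below the melting point: the compound is a solid.
    if mp_c is not None and ROOM_TEMP_C < mp_c:
        return "solid"
    # Room temperature is at or above the boiling point: the compound is a gas.
    if bp_c is not None and ROOM_TEMP_C >= bp_c:
        return "gas"
    # Both points are known and room temperature lies between them.
    if mp_c is not None and bp_c is not None:
        return "liquid"
    # Not enough data to decide.
    return "unknown"
```

With only one of the two temperatures available, the sketch returns "unknown" unless that single value already settles the question (for example, a melting point above 20°C).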

### 2.2. Results
<div style="text-align: justify; text-justify: inter-word; width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
    Results are shown in a Streamlit web app, which makes it easier to see where each compound is stored.
</div>

<style>
  .justified-text {
    text-align: justify;
    text-justify: inter-word;
    width: 95%;
    margin-left: 0%;
    margin-right: 0;
    font-size: 15px;
    line-height: 1.6;
  }
</style>

<div class="justified-text">

  <h2><b>3. Limitations</b></h2>

  <h4><b>PubChem Registration of Compounds</b></h4>  

  <p>When the name of a compound is searched on PubChem, the result is the pure compound and not a solution. This is a problem because many chemicals in labs are kept as prepared solutions, and the pictograms of a solution may differ from those of the pure compound, especially at low concentration. For example, searching for hydrochloric acid does not return the aqueous solution but the gas itself (hydrogen chloride).</p>

  <p>This could have been improved by searching for the molecule's name on a supplier website such as Sigma-Aldrich or Fisher Scientific and retrieving the pictograms or additional information from there. However, this proved tedious, as the relevant information is often scattered across various parts of the website. Given that the method using the PubChem REST API was already completed and worked well for many compounds, it was not replaced. Still, this limitation should be kept in mind when using our <code>molstorage</code> package.</p>  

  <h4><b>Specific Incompatibilities</b></h4>  

  <p>In the EPFL guidelines for the storage of hazardous chemicals, it is recommended to consult sections 7 and 10 in the Safety Data Sheet (SDS) of a compound, as these include specific storage instructions—such as suitable storage materials or incompatibilities with other chemicals.</p>

  <p>A script was developed to fetch the SDS PDF for a compound from the Fisher Scientific website, which has a dedicated SDS search page. This made it possible to extract sections 7 and 10 from the PDF, as the format of the Fisher SDS documents is consistent.</p>
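
  <p>The section-extraction step can be sketched as follows. This is not the script mentioned above: it assumes the PDF has already been converted to plain text, that the text starts at section 1, and that each section heading starts a line in the form <code>7. Handling and storage</code> (optionally preceded by the word "SECTION"). The function name <code>extract_sds_sections</code> is hypothetical.</p>

```python
import re
from typing import Dict, Iterable

# Assumed heading layout: optional "SECTION", a 1-2 digit number, "." or ":", then a title.
HEADING = re.compile(
    r"^[ \t]*(?:SECTION[ \t]+)?(\d{1,2})[.:][ \t]+\S[^\n]*$",
    re.MULTILINE | re.IGNORECASE,
)


def extract_sds_sections(text: str, wanted: Iterable[int] = (7, 10)) -> Dict[int, str]:
    """Return the body text of the requested SDS sections (hypothetical helper)."""
    wanted = set(wanted)
    found = []  # (section number, start of heading line, end of heading line)
    expected = 1
    for match in HEADING.finditer(text):
        # Accept headings only in the order 1, 2, 3, ... so that numbered
        # list items inside a section are not mistaken for headings.
        if int(match.group(1)) == expected:
            found.append((expected, match.start(), match.end()))
            expected += 1

    sections: Dict[int, str] = {}
    for index, (number, _heading_start, body_start) in enumerate(found):
        # A section body runs until the next accepted heading or the end of the text.
        body_end = found[index + 1][1] if index + 1 < len(found) else len(text)
        if number in wanted:
            sections[number] = text[body_start:body_end].strip()
    return sections
```

  <p>Because the sketch relies on one heading layout, it illustrates why a single function does not transfer to SDS documents from other providers.</p>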

  <p>However, due to the limitations of PubChem's compound registration, the retrieved PDF might not correspond to the exact compound of interest. Moreover, the PDF was identified through the first search result for a compound's name, which may not always be accurate. This could have been improved by refining the search process or expanding it to include multiple sources—not just Fisher Scientific. But since SDS formats vary between sites, a simple function alone would not suffice to extract consistent information across providers.</p>

  <p>Another improvement could have been to use machine learning for better SDS interpretation, but due to time constraints, we focused on extracting pictograms, acid/base information, and hazard statements—which proved sufficient in most cases to sort hazardous chemicals effectively.</p>

</div>


<div class="justified-text">

  <h2><b>4. Problems and Challenges</b></h2>

  <h4><b>Database Availability</b></h4>  

  <p>Pictograms and safety hazard statements were not available in any existing database. Although it was possible to download databases for certain categories, such as "corrosive" compounds from PubChem, the corresponding hazard statements were missing or difficult to extract.</p>

  <h4><b>No Direct Database Access</b></h4>  

  <p>By inspecting the HTML code of PubChem’s website, it seemed feasible to extract safety pictogram names and hazard statements using the <code>BeautifulSoup</code> package. However, this approach failed because the pictogram images are loaded dynamically via JavaScript on the compound pages, and therefore do not appear in the static HTML that BeautifulSoup parses.</p>

  <p>To overcome this, the <code>selenium</code> package (<a href="https://pypi.org/project/selenium/" target="_blank">https://pypi.org/project/selenium/</a>) was tested, as it can control a web browser (e.g., Google Chrome) and scrape dynamic JavaScript-loaded content. Although Selenium worked well for retrieving information for a single compound, it proved too slow when processing multiple chemicals because it required fully loading each PubChem page in a browser, taking several minutes per compound—an unacceptable delay for large datasets.</p>

  <p>Ultimately, the PubChem PUG-View REST API (<a href="https://pubchem.ncbi.nlm.nih.gov/docs/pug-view" target="_blank">https://pubchem.ncbi.nlm.nih.gov/docs/pug-view</a>) was used instead. Initially, it was believed that this API did not contain the needed data, but after thorough analysis of its structure, the locations of the pictograms and hazard statements were successfully identified. This method was significantly faster, even when tested on a dozen compounds, and thus was kept.</p>
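
  <p>A minimal sketch of this approach is shown below, using only the Python standard library. It is not the package's code. The field names it searches for (<code>Information</code>, <code>Pictogram(s)</code>, <code>GHS Hazard Statements</code>, <code>StringWithMarkup</code>, <code>Markup</code>, <code>Extra</code>) reflect our reading of the PUG-View JSON layout and should be treated as assumptions, as should the helper names <code>parse_ghs</code> and <code>fetch_ghs</code>.</p>

```python
import json
from typing import Any, Dict, Iterator, List, Tuple
from urllib.request import urlopen

# PUG-View endpoint, restricted to the GHS classification heading.
PUG_VIEW_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"
    "{cid}/JSON?heading=GHS+Classification"
)


def iter_information(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict stored in any "Information" list of a nested record."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Information" and isinstance(value, list):
                yield from (item for item in value if isinstance(item, dict))
            else:
                yield from iter_information(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_information(item)


def parse_ghs(record: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Collect pictogram names and hazard statements, without duplicates."""
    pictograms: List[str] = []
    statements: List[str] = []
    for info in iter_information(record):
        strings = info.get("Value", {}).get("StringWithMarkup", [])
        if info.get("Name") == "Pictogram(s)":
            # Assumption: each pictogram name is stored in a markup "Extra" field.
            for entry in strings:
                for markup in entry.get("Markup", []):
                    label = markup.get("Extra")
                    if label and label not in pictograms:
                        pictograms.append(label)
        elif info.get("Name") == "GHS Hazard Statements":
            for entry in strings:
                text = entry.get("String", "")
                if text and text not in statements:
                    statements.append(text)
    return pictograms, statements


def fetch_ghs(cid: str) -> Tuple[List[str], List[str]]:
    """Download the record for one CID and parse it (requires network access)."""
    with urlopen(PUG_VIEW_URL.format(cid=cid), timeout=30) as response:
        return parse_ghs(json.load(response))
```

  <p>Searching the record recursively, instead of following a fixed path of section indices, keeps the sketch independent of the exact nesting depth of the GHS section.</p>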

  <p>Additionally, the <code>pubchempy</code> package (<a href="https://pubchempy.readthedocs.io/en/latest/guide/introduction.html" target="_blank">https://pubchempy.readthedocs.io/en/latest/guide/introduction.html</a>) was used to retrieve each compound’s generic name, IUPAC name, and SMILES notation.</p>

  <h4><b>Hazardous Chemicals Sorting</b></h4>

  <p>The sorting of hazardous chemicals followed EPFL’s safety directives (<a href="https://www.epfl.ch/campus/security-safety/wp-content/uploads/2024/01/Chemicals-Storage-flowchart_2024.pdf" target="_blank">EPFL Chemicals Storage Flowchart 2024</a>), which specify how to store chemicals based on their safety pictograms and hazard statements.</p>

  <p>Incompatibilities between pictograms were considered according to a referenced diagram, which also recommends separating liquids from solids and storing explosive compounds or compressed gases separately, sometimes in isolation from other chemicals of the same class (e.g., oxygen storage).</p>

  <p>The pictograms have a defined priority order; this was incorporated in the code by sorting each compound’s pictograms from highest to lowest priority before sorting the chemicals accordingly. Moreover, acids and bases were always separated due to the risk of violent reactions.</p>
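
  <p>The priority sort can be sketched in a few lines. The ranking in <code>ILLUSTRATIVE_PRIORITY</code> is a placeholder chosen for this example and is not the order used by EPFL or by the package; the function name <code>sort_by_priority</code> is also hypothetical.</p>

```python
from typing import Dict, List

# Placeholder ranking (1 = highest priority). The real order comes from the
# EPFL guidelines and is NOT reproduced here.
ILLUSTRATIVE_PRIORITY: Dict[str, int] = {
    "Explosive": 1,
    "Compressed Gas": 2,
    "Flammable": 3,
    "Oxidizer": 4,
    "Corrosive": 5,
    "Acute Toxic": 6,
    "Health Hazard": 7,
    "Irritant": 8,
    "Environmental Hazard": 9,
}


def sort_by_priority(pictograms: List[str]) -> List[str]:
    """Sort pictogram names from highest to lowest priority.

    Names missing from the table get an infinite rank, so they end up last;
    sorted() is stable, so they keep their original relative order.
    """
    return sorted(
        pictograms,
        key=lambda name: ILLUSTRATIVE_PRIORITY.get(name, float("inf")),
    )


print(sort_by_priority(["Irritant", "Mystery", "Flammable", "Corrosive"]))
# prints ['Flammable', 'Corrosive', 'Irritant', 'Mystery']
```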

  <h4><b>Storage Categories and Complex Cases</b></h4>

  <p>The EPFL storage categories (such as no pictograms or exclamation point, hazardous to the environment, acute toxicity, CMR/STOT, toxicity category 2/3, corrosive category 1, irritant, pyrophoric, flammable, oxidizer) become insufficient when handling chemicals with multiple hazard pictograms, especially if they include conflicting hazards.</p>

  <p>For instance, triethylamine is flammable, corrosive, and acutely toxic. As a base, it cannot be stored with acids, and because it is flammable, it cannot be stored with corrosive bases either, so none of the standard categories fits.</p>

  <p>To resolve this, a new storage group (custom storage) is created each time a chemical has conflicting pictograms, which ensures that no pictogram incompatibilities are present within a storage group.</p>
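
  <p>The fallback to custom groups can be sketched as below. This is a simplified illustration rather than the logic of <code>chemsort_multiple_order_3</code>: it treats all groups alike instead of trying predefined groups first, and the single incompatible pair in <code>INCOMPATIBLE_PAIRS</code>, the <code>custom_N</code> naming and the helper names are assumptions. Only the dictionary keys and the acid/base separation rule come from the description above.</p>

```python
from typing import Any, Dict, List

Compound = Dict[str, Any]
Groups = Dict[str, Dict[str, List[Compound]]]

# Illustrative rule only; the real rules come from the EPFL compatibility table.
INCOMPATIBLE_PAIRS = {frozenset({"Flammable", "Oxidizer"})}


def compatible(a: Compound, b: Compound) -> bool:
    """Return True if two compounds may share a group under the sketch rules."""
    # Acids and bases are always separated.
    if {a["acid_base_class"], b["acid_base_class"]} == {"acid", "base"}:
        return False
    # No pictogram of one compound may clash with a pictogram of the other.
    return not any(
        frozenset({p, q}) in INCOMPATIBLE_PAIRS
        for p in a["sorted_pictograms"]
        for q in b["sorted_pictograms"]
    )


def place(compound: Compound, groups: Groups) -> str:
    """Add the compound to the first compatible group, else to a new custom group."""
    state = compound["state_room_temp"]
    for name, shelves in groups.items():
        members = [m for shelf in shelves.values() for m in shelf]
        if all(compatible(compound, m) for m in members):
            # setdefault also covers states such as "unknown" that have no shelf yet.
            shelves.setdefault(state, []).append(compound)
            return name
    # No existing group fits: create the next custom group.
    count = sum(1 for n in groups if n.startswith("custom_"))
    name = f"custom_{count + 1}"
    groups[name] = {"solid": [], "liquid": [], "gas": []}
    groups[name].setdefault(state, []).append(compound)
    return name
```

  <p>Checking a new compound against every member of a group, rather than against a group label, is what guarantees that a custom group never contains a pair that the rules reject.</p>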

</div>

