<div align="center">

# <span style="font-size:1.1em; font-weight:900;"> ChemStorM </span>

<p style="font-size:1.5em; font-weight:500; margin-top:1.2em; margin-bottom:2.2em; text-align:center;">Chemical Storage Manager - storage of hazardous substances, automated and secure.</p>
</div>

## **I. Introduction**

### 1.1 Motivation

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
As chemistry students, we are constantly responsible for understanding the properties and hazards of the compounds we use, particularly when returning them to storage safely. However, determining storage compatibility requires research and analysis that is both tedious and time-consuming, and it demands careful attention to detail. This task is essential for maintaining safe laboratory conditions, but when performed manually, it often relies on subjective judgments and sometimes on incomplete information. All of us encountered this challenge repeatedly during our first year at EPFL. In our eyes, it was a process ideal for automation.
</div>

<br>

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
We therefore developed an algorithm that could correctly organize various chemical products into storage compartments, based on their respective characteristics and the EPFL storage guidelines. Our aim was to save users from spending a long time searching and cross-checking the different characteristics of each product. To facilitate this, the tool also identifies and displays the different hazard categories associated with each substance.
</div>

### 1.2 Chemical framework

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
To ensure correct results, our algorithm needed to be based on reliable references. For this, two inputs were necessary: the characteristics of the desired products and the rules by which to sort them. The former were extracted directly from the PubChem PUG-View REST API, with complementary information from <code>pubchempy</code>. The latter relied on guidelines provided by EPFL. Indeed, in the "Chemical Storage - Safety, Prevention and Health" section on the EPFL website, the OHS states several rules for safe storage:

<br>

<br>

>- Separate liquid from solid chemicals.
>- Organize your storage according to the GHS pictograms: check the Hazardous chemicals storage workflow (see Useful documents below).
>- Respect the incompatibilities: check SDS section 7 and 10 and compare with the chemical incompatibilities table below.

We decided to use these as a basis for our tool's decision-making.
</div>
<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
The first reference provided by the website consists of an incompatibility table, depicted below. It shows two main criteria for the separation of compounds: their GHS pictograms and their acid/base class. 

(<a href="https://www.epfl.ch/campus/security-safety/en/lab-safety/hazards/chemical-hazards/chemicals-storage/">https://www.epfl.ch/campus/security-safety/en/lab-safety/hazards/chemical-hazards/chemicals-storage/</a>)
</div>
<br><br>

![Illustration](../assets/security_table_english.png)
<br>
<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
The second source, also from EPFL, uses the compound's pictograms and hazard statements to assign it to its proper storage group. These storage groups serve as the default set of compartments: the algorithm iterates over them and checks each compound's compatibility with their contents. (<a href="https://www.epfl.ch/campus/security-safety/wp-content/uploads/2024/01/Chemicals-Storage-flowchart_2024.pdf">EPFL Chemicals Storage Flowchart 2024</a>)
</div>
<br><br>

![Illustration](../assets/Security_flowchart.jpg)


## **II. Our Code**

### 2.1 Functions overview

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
Our package `chemstorm` contains 9 main functions for chemical compound management, safety analysis, and storage organization.
</div>

---

#### 1. `get_compound_safety_data`
**Purpose**: Retrieves safety data for a chemical compound from PubChem.

**Input**:
- `name: str` - The molecule's common name

**Output**:
- `Tuple[str, List[str], List[str]]` - Contains:
  - PubChem CID (ID)
  - GHS pictograms
  - Hazard statements

**Description**: Connects to PubChem's REST API to fetch safety information for the specified compound.
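
As a rough illustration of the parsing step, the sketch below walks a PUG-View style JSON record and collects pictogram labels and hazard statements. The helper names (`find_sections`, `extract_ghs`), the heading and field names, and the abbreviated hand-written sample record are assumptions based on the general layout of PUG-View responses. It is not the package's actual implementation, and the field names should be checked against a real response.

```python
from typing import Any, Dict, List, Tuple


def find_sections(node: Dict[str, Any], heading: str) -> List[Dict[str, Any]]:
    """Recursively collect every section whose TOCHeading equals `heading`."""
    found = []
    if node.get("TOCHeading") == heading:
        found.append(node)
    for child in node.get("Section", []):
        found.extend(find_sections(child, heading))
    return found


def extract_ghs(record: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (pictogram labels, hazard statements) without duplicates."""
    pictograms: List[str] = []
    statements: List[str] = []
    for section in find_sections(record.get("Record", {}), "GHS Classification"):
        for info in section.get("Information", []):
            items = info.get("Value", {}).get("StringWithMarkup", [])
            if info.get("Name") == "Pictogram(s)":
                # Assumed layout: each icon carries its label in "Extra".
                for item in items:
                    for markup in item.get("Markup", []):
                        label = markup.get("Extra")
                        if label and label not in pictograms:
                            pictograms.append(label)
            elif info.get("Name") == "GHS Hazard Statements":
                for item in items:
                    text = item.get("String", "").strip()
                    if text and text not in statements:
                        statements.append(text)
    return pictograms, statements


# Hand-written, heavily abbreviated sample (illustrative, not real API output).
SAMPLE_RECORD = {
    "Record": {
        "Section": [{
            "TOCHeading": "Safety and Hazards",
            "Section": [{
                "TOCHeading": "Hazards Identification",
                "Section": [{
                    "TOCHeading": "GHS Classification",
                    "Information": [
                        {"Name": "Pictogram(s)",
                         "Value": {"StringWithMarkup": [{
                             "String": " ",
                             "Markup": [{"Extra": "Flammable"},
                                        {"Extra": "Irritant"}]}]}},
                        {"Name": "GHS Hazard Statements",
                         "Value": {"StringWithMarkup": [
                             {"String": "H225: Highly Flammable liquid and vapor"}]}},
                    ],
                }],
            }],
        }],
    }
}

pictograms, statements = extract_ghs(SAMPLE_RECORD)
```

Searching by heading rather than by fixed list index keeps the sketch tolerant of sections that appear in a different order from one compound to the next.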

---

#### 2. `get_name_and_smiles`
**Purpose**: Retrieves identification information for a chemical compound.

**Input**:
- `cid: str` - PubChem compound ID (CID)

**Output**:
- `Tuple[str, str, str]` - Contains:
  - Record Title (generic name)
  - IUPAC name
  - SMILES string

**Description**: Uses the PubChemPy library to retrieve standardized chemical identifiers.

---

#### 3. `classify_acid_base`
**Purpose**: Determines the acid/base classification of a compound.

**Input**:
- `name: str` - The generic name of the compound
- `iupac_name: str` - The IUPAC name of the compound
- `smiles: str` - SMILES string representing the compound's chemical structure
- `ghs_statements: List[str]` - List of GHS hazard statements related to the compound

**Output**:
- `Union[str, Tuple[str, ...]]` - Classification such as "acid", "base", "neutral", or "amphoteric"

**Description**: Analyzes compound properties to determine its acid/base behavior for compatibility assessment.

---

#### 4. `get_mp_bp`
**Purpose**: Retrieves melting and boiling points for a compound.

**Input**:
- `name: str` - The compound's name

**Output**:
- `Tuple[Optional[float], Optional[float], Optional[float], Optional[float]]` - Contains:
  - Melting point (°C)
  - Boiling point (°C)
  - Melting point (°F)
  - Boiling point (°F)

**Description**: Extracts temperature data from PubChem, returning `None` for any values that cannot be found.
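
To make the `None` handling concrete, here is a minimal sketch of how a temperature could be read from a free-text property string. The function name `parse_temperature`, the regular expression, and the example strings are assumptions for illustration; the formats actually returned by PubChem vary and the package may parse them differently.

```python
import re
from typing import Optional, Tuple

# Matches values such as "-114.1 °C" or "173.3°F" (assumed string formats).
_TEMP_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*([CF])")


def parse_temperature(text: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (celsius, fahrenheit); a unit that is not found stays None."""
    celsius: Optional[float] = None
    fahrenheit: Optional[float] = None
    for value, unit in _TEMP_PATTERN.findall(text):
        if unit == "C" and celsius is None:
            celsius = float(value)
        elif unit == "F" and fahrenheit is None:
            fahrenheit = float(value)
    return celsius, fahrenheit


mp_c, mp_f = parse_temperature("-114.1 °C")
bp_c, bp_f = parse_temperature("173.3 °F at 760 mmHg")
```

Only the first value per unit is kept, and ranges such as "10-12 °C" are not handled by this sketch.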

---

#### 5. `compound_state`
**Purpose**: Predicts the physical state of a compound at room temperature.

**Input**:
- `mp_c: Optional[float]` - Melting point in Celsius
- `bp_c: Optional[float]` - Boiling point in Celsius
- `mp_f: Optional[float]` - Melting point in Fahrenheit
- `bp_f: Optional[float]` - Boiling point in Fahrenheit

**Output**:
- `str` - Physical state ("solid", "liquid", "gas", or "unknown")

**Description**: Determines compound state at room temperature (20°C / 68°F) based on melting and boiling points.
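
The decision logic can be sketched as follows, assuming that Fahrenheit values serve only as a fallback when the Celsius value is missing. The name `state_at_room_temp` and the exact tie-breaking at 20 °C are illustrative assumptions, not the package's confirmed behavior.

```python
from typing import Optional

ROOM_TEMP_C = 20.0


def f_to_c(temp_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def state_at_room_temp(
    mp_c: Optional[float],
    bp_c: Optional[float],
    mp_f: Optional[float] = None,
    bp_f: Optional[float] = None,
) -> str:
    """Guess the physical state at 20 °C from melting and boiling points."""
    # Fall back to the Fahrenheit value when the Celsius one is missing.
    if mp_c is None and mp_f is not None:
        mp_c = f_to_c(mp_f)
    if bp_c is None and bp_f is not None:
        bp_c = f_to_c(bp_f)

    if mp_c is not None and mp_c > ROOM_TEMP_C:
        return "solid"    # has not melted yet at room temperature
    if bp_c is not None and bp_c < ROOM_TEMP_C:
        return "gas"      # already boiled off at room temperature
    if mp_c is not None and bp_c is not None:
        return "liquid"   # melted, but not yet boiling
    return "unknown"      # not enough data to decide
```

With this ordering, a single known value is enough to conclude "solid" or "gas", while "liquid" requires both points to be known.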

---

#### 6. `prioritize_pictograms`
**Purpose**: Sorts GHS pictograms by hazard severity.

**Input**:
- `pictograms: List[str]` - List of GHS pictogram names

**Output**:
- `List[str]` - Sorted list of pictograms

**Description**: Orders pictograms according to predefined hazard severity priority (lower numbers = more severe hazards).
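
A minimal sketch of such a priority sort is shown below. The `PRIORITY` table and its ordering are hypothetical placeholders chosen for illustration; the actual ranking used by the package follows the EPFL flowchart and may differ.

```python
from typing import Dict, List

# Hypothetical priority table (assumption): lower number = more severe hazard.
PRIORITY: Dict[str, int] = {
    "Explosive": 1,
    "Flammable": 2,
    "Oxidizer": 3,
    "Compressed Gas": 4,
    "Acute Toxic": 5,
    "Corrosive": 6,
    "Health Hazard": 7,
    "Environmental Hazard": 8,
    "Irritant": 9,
}


def sort_pictograms(pictograms: List[str]) -> List[str]:
    """Return the pictograms ordered by priority; unknown names go last."""
    lowest = len(PRIORITY) + 1
    return sorted(pictograms, key=lambda name: PRIORITY.get(name, lowest))


ordered = sort_pictograms(["Irritant", "Flammable", "Corrosive"])
```

Because Python's `sorted` is stable, pictograms with the same (or no) priority keep their input order.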

---

#### 7. `is_chemically_compatible`
**Purpose**: Determines whether two chemicals can be safely stored together.

**Input**:
- `existing_pictograms: List[str]` - GHS pictograms for the existing chemical
- `new_pictograms: List[str]` - GHS pictograms for the new chemical
- `existing_acid_base_class: str` - Acid/base classification of the existing chemical
- `new_acid_base_class: str` - Acid/base classification of the new chemical
- `existing_state: str` - Physical state of the existing chemical
- `new_state: str` - Physical state of the new chemical
- `group_name: str` - Storage group used for applying group-specific compatibility rules

**Output**:
- `bool` - `True` if compatible, `False` otherwise

**Description**: Evaluates compatibility based on pictograms, acid/base classifications, physical states, and storage group.
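
The kind of check involved can be sketched with three simplified rules: solids are kept apart from liquids, acids are kept apart from bases, and certain pictogram pairs may not share a compartment. The function name `compatible`, the single entry in `INCOMPATIBLE_PAIRS`, and the omission of the `group_name` argument are assumptions made to keep the example short; the real rules come from the EPFL incompatibility table.

```python
from typing import FrozenSet, List, Set

# Hypothetical excerpt of an incompatibility table (assumption).
INCOMPATIBLE_PAIRS: Set[FrozenSet[str]] = {
    frozenset({"Flammable", "Oxidizer"}),
}


def compatible(
    existing_pictograms: List[str],
    new_pictograms: List[str],
    existing_acid_base_class: str,
    new_acid_base_class: str,
    existing_state: str,
    new_state: str,
) -> bool:
    """Return True if no simplified rule forbids storing the two together."""
    # Rule 1: separate liquid from solid chemicals.
    if {existing_state, new_state} == {"solid", "liquid"}:
        return False
    # Rule 2: never store an acid with a base.
    if {existing_acid_base_class, new_acid_base_class} == {"acid", "base"}:
        return False
    # Rule 3: reject any pictogram pair listed as incompatible.
    for old in existing_pictograms:
        for new in new_pictograms:
            if frozenset({old, new}) in INCOMPATIBLE_PAIRS:
                return False
    return True
```

Storing each pair as a `frozenset` makes the lookup symmetric, so the order in which the two chemicals are passed does not matter.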

---
#### 8. Three functions to initialize storage groups

#### a. `default_group()`

**Purpose**: To create a consistent structure for storing chemicals based on their physical state within a general storage group.

**Input**: None

**Output**:
- `Dict[str, List]` - dictionary with keys for physical states (`solid`, `liquid`, `unknown`, `gas`)

**Description**: Returns a dictionary with keys representing physical states ("solid", "liquid", "unknown", "gas"). Each key maps to an empty list. This serves as a template for organizing chemicals by their physical state within a storage group.
#### b. `default_group_gas()`

**Purpose**: To emphasize gas-phase substances in storage groups where gas categorization is prioritized, such as for compressed gases.

**Input**: None

**Output**: 
- `Dict[str, List]` - dictionary identical in structure to `default_group()`, but with the `gas` key listed first

**Description**:
Similar to `default_group()`, but the order of keys starts with "gas" instead of "solid". Functionally equivalent, but intended to emphasize gases in certain categories like "compressed_gas".
#### c. `initialize_storage_groups()`

**Purpose**: To define and initialize all chemical compatibility and hazard-based storage categories used in a chemical management system, allowing for safe sorting and retrieval of chemical data.

**Input**: None

**Output**: 
- `Dict[str, Dict[str, List[Dict[str, Any]]]]` - dictionary of predefined chemical storage categories

**Description**:
Initializes a dictionary of predefined chemical storage categories. Each category is mapped to a default group (organized by physical state), allowing for structured classification and storage of chemicals based on their properties and hazards.
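
The three initializers could look roughly like the sketch below. The key order of each template and the `compressed_gas` category come from the descriptions above; the other names in `GROUP_NAMES` are hypothetical placeholders, since the full list of categories is not reproduced here.

```python
from typing import Any, Dict, List

Group = Dict[str, List[Dict[str, Any]]]


def default_group() -> Group:
    """Empty template with one compartment per physical state."""
    return {"solid": [], "liquid": [], "unknown": [], "gas": []}


def default_group_gas() -> Group:
    """Same template, but with the gas compartment listed first."""
    return {"gas": [], "solid": [], "liquid": [], "unknown": []}


# Hypothetical category names (assumption), apart from "compressed_gas".
GROUP_NAMES = ["none", "flammable", "oxidizer", "corrosive"]


def initialize_storage_groups() -> Dict[str, Group]:
    """Build a fresh, empty set of storage groups."""
    groups = {name: default_group() for name in GROUP_NAMES}
    groups["compressed_gas"] = default_group_gas()
    return groups


groups = initialize_storage_groups()
groups["flammable"]["liquid"].append({"name": "example compound"})
```

Calling `default_group()` once per category matters: every group receives its own lists, so adding a compound to one group cannot leak into another.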

---
#### 9. `chemsort_multiple_order`
**Purpose**: Sorts compounds into compatible storage groups.

**Input**:
- `compounds: List[Dict[str, Any]]` - List of compounds to sort, each represented as a dictionary with keys:
  - `name`
  - `sorted_pictograms`
  - `hazard_statements`
  - `acid_base_class`
  - `state_room_temp`
- `storage_groups: Dict[str, Dict[str, List[Dict[str, Any]]]]` - Existing storage groups dictionary

**Output**:
- `Dict[str, Dict[str, List[Dict[str, Any]]]]` - Updated dictionary of storage groups with sorted compounds

**Description**: Processes compounds by assigning them to predefined groups if compatible, or creates new custom groups as needed. Each storage group contains sub-categories for solids, liquids, and gases.
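
The placement strategy can be sketched as a greedy loop: try the compound's target group, then any existing custom group, and open a new custom group only if none of them accepts it. The names `place_compounds`, `target_group`, `fits`, and the `custom_N` naming scheme are assumptions for illustration, and the compatibility rule is reduced to acid/base separation; the real function applies the full set of checks.

```python
from typing import Any, Callable, Dict, List

Compound = Dict[str, Any]
Groups = Dict[str, Dict[str, List[Compound]]]


def empty_group() -> Dict[str, List[Compound]]:
    return {"solid": [], "liquid": [], "unknown": [], "gas": []}


def place_compounds(
    compounds: List[Compound],
    groups: Groups,
    target_group: Callable[[Compound], str],
    fits: Callable[[Compound, Compound], bool],
) -> Groups:
    """Greedily place each compound, opening a custom group on conflict."""
    for compound in compounds:
        state = compound["state_room_temp"]
        custom_names = [name for name in groups if name.startswith("custom_")]
        for name in [target_group(compound)] + custom_names:
            slot = groups.setdefault(name, empty_group())[state]
            if all(fits(other, compound) for other in slot):
                slot.append(compound)
                break
        else:
            # No existing group accepted the compound: create a custom one.
            new_name = f"custom_{len(custom_names) + 1}"
            groups[new_name] = empty_group()
            groups[new_name][state].append(compound)
    return groups


def by_first_pictogram(compound: Compound) -> str:
    pictograms = compound["sorted_pictograms"]
    return pictograms[0].lower() if pictograms else "none"


def no_acid_with_base(a: Compound, b: Compound) -> bool:
    return {a["acid_base_class"], b["acid_base_class"]} != {"acid", "base"}


def make(name: str, acid_base: str) -> Compound:
    return {"name": name, "sorted_pictograms": ["Corrosive"],
            "hazard_statements": [], "acid_base_class": acid_base,
            "state_room_temp": "liquid"}


result = place_compounds(
    [make("compound A", "acid"), make("compound B", "base"),
     make("compound C", "acid")],
    {}, by_first_pictogram, no_acid_with_base,
)
```

In this toy run the two acids share the `corrosive` group, while the base is moved to `custom_1` because it conflicts with the acid already stored there.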

### 2.2 Streamlit Web Interface

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
Results are shown in a Streamlit web app, which makes it easier to see where each compound is stored. The interface is shown below.
<br><br>

![Illustration](../assets/chemstorm_app_start.png)

To analyze a single molecule, enter its name or SMILES notation in the search bar and click the <code>Store Compounds</code> button. To analyze several molecules, press Enter after each entry to add it to the list, then click <code>Store Compounds</code> once all target compounds have been added. The results are shown in the screenshot below. All products are sorted into storage categories according to the criteria presented earlier.
<br><br>

![Illustration](../assets/chemstorm_app_storage.png)

Clicking the <code>Details</code> button of a given compound displays its information in the <code>Compound Details</code> tab, where various data and hazard warnings are shown.
<br><br>

![Illustration](../assets/chemstorm_app_details.png)
</div>


## **III. Limitations of the tool**

### Limited to pure compounds

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
A name search on PubChem returns information for the compound in its pure state. This is a problem because many laboratory chemicals are stored as prepared solutions, and the pictograms of a solution may differ from those of the pure compound, especially at low concentration. For example, searching for hydrochloric acid does not return the aqueous solution but the gas itself, chlorane. This could be improved by looking the compound up on supplier websites such as Sigma-Aldrich or Fisher Scientific and checking the pictograms and additional specifications listed there. This proved tedious, because the information is scattered across specific places on those websites, and the PubChem REST API method was already finished and worked well for many compounds. It was therefore not replaced, but this limitation must be kept in mind when using our <code>chemstorm</code> package.
</div>


### Omits SDS information

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
The EPFL guidelines for the storage of hazardous chemicals instruct users to check sections 7 and 10 of a compound's Safety Data Sheet (SDS). These sections may include specific storage indications, such as the material the storage compartment should be made of or the chemicals with which the compound must not be stored. Code was written to fetch a compound's SDS PDF from the Fisher Scientific website, which has a dedicated SDS search page, and to retrieve sections 7 and 10 from it. This made it possible to extract the information and juxtapose the various statements (hazardous reactions, hazardous side products…) for two input compounds. However, because this information differed for each product, several complications (elaborated in Section IV) made it impossible to integrate the function reliably into the final code. One way to resolve this would be machine learning: a model trained on a large number of SDSs could learn to judge the relevance and priority of each statement. <br><br>
</div> 

## **IV. Issues faced**

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
Three main issues were encountered during the coding process: finding a database that was both complete and efficient, storing special cases of compounds with overlapping pictograms, and integrating SDS information into the final code.<br>
</div>

### Finding an ideal database

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
First, pictograms and hazard statements were not readily available in any downloadable database. Although it was possible to download datasets for certain categories, such as "corrosive" compounds from PubChem, the corresponding hazard statements were missing or difficult to extract. Inspecting the HTML code of PubChem’s website suggested that pictogram names and hazard statements could be extracted with the <code>BeautifulSoup</code> package. However, this approach failed because the pictogram images are loaded dynamically via JavaScript on the compound pages, and therefore do not appear in the static HTML that BeautifulSoup parses.
<br><br>
To overcome this, the <code>selenium</code> package (<a href="https://pypi.org/project/selenium/">https://pypi.org/project/selenium/</a>) was tested, as it can control a web browser (e.g., Google Chrome) and scrape content loaded dynamically by JavaScript. Although Selenium worked well for retrieving information for a single compound, it proved too slow for multiple chemicals: each PubChem page had to be fully loaded in a browser, which took several minutes per compound, an unacceptable delay for large datasets.
<br><br>
Ultimately, the PubChem PUG-View REST API (<a href="https://pubchem.ncbi.nlm.nih.gov/docs/pug-view">https://pubchem.ncbi.nlm.nih.gov/docs/pug-view</a>) was used instead. Initially, it was believed that this API did not contain the needed data, but after thorough analysis of its structure, the locations of the pictograms and hazard statements were successfully identified. This method was significantly faster, even when tested on a dozen compounds, and thus was kept. Additionally, the <code>pubchempy</code> package (<a href="https://pubchempy.readthedocs.io/en/latest/guide/introduction.html">https://pubchempy.readthedocs.io/en/latest/guide/introduction.html</a>) was used to retrieve each compound’s generic name, IUPAC name, and SMILES notation.
</div>
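
For orientation, the requests involved can be sketched as two URL builders: one resolving a name to a CID through PUG-REST, and one fetching the annotated record through PUG-View. The function names are illustrative, and the optional `heading` filter is included on the assumption that it behaves as described in the PUG-View documentation; the URL patterns should be verified there before use.

```python
from typing import Optional
from urllib.parse import quote, urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def cid_lookup_url(name: str) -> str:
    """URL that resolves a compound name to its PubChem CID(s)."""
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


def pug_view_url(cid: int, heading: Optional[str] = None) -> str:
    """URL of the PUG-View record, optionally restricted to one heading."""
    url = f"{PUG_VIEW}/data/compound/{cid}/JSON"
    if heading:
        # Assumed filter: limits the response to the named section.
        url += "?" + urlencode({"heading": heading})
    return url


lookup = cid_lookup_url("acetic acid")
record = pug_view_url(702, heading="GHS Classification")
```

Each URL returns plain JSON in a single HTTP request, so no browser has to render the page, in contrast to the Selenium approach described above.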

### Treating complex cases

<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
Secondly, as testing of our algorithm progressed, we noticed that some compounds were placed incorrectly. These were mainly compounds with a large number of pictograms, a situation the code initially overlooked.
<br><br>
The pictograms are treated in a defined priority order; this was incorporated in the code by sorting each compound’s pictograms from highest to lowest priority before sorting the chemicals accordingly (function 6, <code>prioritize_pictograms</code>). Moreover, acids and bases were always separated because of the risk of violent reactions. However, the EPFL storage categories (such as no pictograms or exclamation point, hazardous to the environment, acute toxicity, CMR/STOT, toxicity category 2/3, corrosive category 1, irritant, pyrophoric, flammable, oxidizer) become insufficient for chemicals with multiple hazard pictograms, especially when these represent conflicting hazards. For instance, triethylamine is flammable, corrosive, and acutely toxic: as a base it cannot be stored with acids, and because of its flammability it cannot be stored with the corrosive bases either, which creates a storage conflict.
<br><br>
To resolve this, a new storage group (custom storage) is created each time a chemical has conflicting pictograms, ensuring that no pictogram incompatibilities are present within a storage group.
</div>

### Integrating SDS information
<div style="text-align: justify; text-justify: inter-word;width: 95%; margin-left: 0%; margin-right: 0; font-size: 15px; line-height: 1.6;">
Lastly, processing and integrating the information from the SDSs proved complicated. Since this information does not follow a predefined format (as pictograms and hazard statements do) and differs for each product, each SDS had to be filtered individually. Code was initially written to attempt this: it was able to filter out redundant information, eliminate errors contained in the online file, and organise the remaining information into subsections, but it failed to draw clear conclusions from comparing this information between compounds. As mentioned, the algorithm could be extended with machine learning and trained to recognise more nuanced and variable incompatibilities.
</div>