<div align="center">

# <span style="font-size:1.1em; font-weight:900;">✨🧪 ChemStorM – Chemical Storage Manager 🧪✨</span>
<div align="center">
<p style="font-size:1.5em; font-weight:500; margin-top:1.2em; margin-bottom:2.2em;">A simple tool to help students analyze and understand the safety of different molecules and their proper storage.</p>

</div>
</div>

## 1. Introduction

### Motivations

Chemistry students are often faced with the recurring challenge of gathering critical information about chemical compounds, such as safety data sheets (SDS), hazard pictograms, and specific properties. This process is not only tedious and time-consuming but also essential for ensuring proper handling and safety in both academic and laboratory settings.

We therefore developed an algorithm that sorts chemical products according to the storage guidelines provided by EPFL. It automates the process and spares users from searching for the characteristics of each product one by one. The tool also identifies and displays the hazard categories associated with each substance.

Our first reference is from the EPFL website ([https://www.epfl.ch/campus/security-safety/en/lab-safety/hazards/chemical-hazards/chemicals-storage/](https://www.epfl.ch/campus/security-safety/en/lab-safety/hazards/chemical-hazards/chemicals-storage/)), which outlines whether certain products can be stored together based on their hazard pictograms. The table provided offers the necessary guidelines to ensure proper and safe storage. The relevant criteria are illustrated in the figure below.

![Illustration](../assets/security_table_english.png)

By looking at this table, we can see that in order to properly store the products, we need to identify their pictograms as well as determine whether they are acids or bases.

The second source, also from EPFL and shown below, details how a product should be stored based on its pictograms ([EPFL Chemicals Storage Flowchart 2024](https://www.epfl.ch/campus/security-safety/wp-content/uploads/2024/01/Chemicals-Storage-flowchart_2024.pdf)).

![Illustration](../assets/Security_flowchart.jpg)

## 2. Functionalities

### Functions Overview

Our package `chemstorm` contains 8 main functions for chemical compound management, safety analysis, and storage organization.

---

#### 1. `get_compound_safety_data`
**Purpose**: Retrieves safety data for a chemical compound from PubChem.

**Input**:
- `name: str` - The molecule's common name

**Output**:
- `Tuple[str, List[str], List[str]]` - Contains:
  - PubChem CID (ID)
  - GHS pictograms
  - Hazard statements

**Description**: Connects to PubChem's REST API to fetch safety information for the specified compound.
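
As a rough illustration of this step, the sketch below parses a PUG-View style JSON record into pictogram names and hazard statements. It is not the package's actual implementation: the helper names (`iter_information`, `parse_ghs`), the `heading=GHS+Classification` query parameter, and the JSON field names (`Information`, `Name`, `StringWithMarkup`, `Markup`, `Extra`) reflect our understanding of the PUG-View response format and should be treated as assumptions to check against a real response. The sample record is hand-written for the example, so no network access is needed.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Assumed URL pattern for the GHS section of a compound record (needs network).
PUG_VIEW_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"
    "{cid}/JSON?heading=GHS+Classification"
)


def iter_information(section: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every 'Information' entry in a section and its nested subsections."""
    for info in section.get("Information", []):
        yield info
    for sub in section.get("Section", []):
        yield from iter_information(sub)


def parse_ghs(record_json: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Extract pictogram names and hazard statements from a PUG-View style record."""
    pictograms: List[str] = []
    statements: List[str] = []
    for top in record_json.get("Record", {}).get("Section", []):
        for info in iter_information(top):
            name = info.get("Name", "")
            for swm in info.get("Value", {}).get("StringWithMarkup", []):
                if name == "Pictogram(s)":
                    # Assumption: each icon's label is stored in the 'Extra' field.
                    for markup in swm.get("Markup", []):
                        label = markup.get("Extra")
                        if label and label not in pictograms:
                            pictograms.append(label)
                elif name == "GHS Hazard Statements":
                    text = swm.get("String")
                    if text and text not in statements:
                        statements.append(text)
    return pictograms, statements


# Hand-written sample mimicking the assumed structure (not real PubChem output).
SAMPLE_RECORD = {
    "Record": {
        "Section": [{
            "TOCHeading": "Safety and Hazards",
            "Section": [{
                "TOCHeading": "GHS Classification",
                "Information": [
                    {"Name": "Pictogram(s)",
                     "Value": {"StringWithMarkup": [{"String": " ", "Markup": [
                         {"Type": "Icon", "Extra": "Flammable"},
                         {"Type": "Icon", "Extra": "Irritant"}]}]}},
                    {"Name": "GHS Hazard Statements",
                     "Value": {"StringWithMarkup": [
                         {"String": "H225: Highly flammable liquid and vapor"}]}},
                ],
            }],
        }],
    }
}

pictograms, statements = parse_ghs(SAMPLE_RECORD)
print(pictograms, statements)
```

In practice the JSON would come from an HTTP GET on `PUG_VIEW_URL.format(cid=...)`; keeping the parsing separate from the request makes it testable offline.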

---

#### 2. `get_name_and_smiles`
**Purpose**: Retrieves identification information for a chemical compound.

**Input**:
- `cid: str` - PubChem compound ID (CID)

**Output**:
- `Tuple[str, str, str]` - Contains:
  - Record Title (generic name)
  - IUPAC name
  - SMILES string

**Description**: Uses the PubChemPy library to retrieve standardized chemical identifiers.

---

#### 3. `classify_acid_base`
**Purpose**: Determines the acid/base classification of a compound.

**Input**:
- `name: str` - The generic name of the compound
- `iupac_name: str` - The IUPAC name of the compound
- `smiles: str` - SMILES string representing the compound's chemical structure
- `ghs_statements: List[str]` - List of GHS hazard statements related to the compound

**Output**:
- `Union[str, Tuple[str, ...]]` - Classification such as "acid", "base", "neutral", or "amphoteric"

**Description**: Analyzes compound properties to determine its acid/base behavior for compatibility assessment.

---

#### 4. `get_mp_bp`
**Purpose**: Retrieves melting and boiling points for a compound.

**Input**:
- `name: str` - The compound's name

**Output**:
- `Tuple[Optional[float], Optional[float], Optional[float], Optional[float]]` - Contains:
  - Melting point (°C)
  - Boiling point (°C)
  - Melting point (°F)
  - Boiling point (°F)

**Description**: Extracts temperature data from PubChem, returning `None` for any values that cannot be found.

---

#### 5. `compound_state`
**Purpose**: Predicts the physical state of a compound at room temperature.

**Input**:
- `mp_c: Optional[float]` - Melting point in Celsius
- `bp_c: Optional[float]` - Boiling point in Celsius
- `mp_f: Optional[float]` - Melting point in Fahrenheit
- `bp_f: Optional[float]` - Boiling point in Fahrenheit

**Output**:
- `str` - Physical state ("solid", "liquid", "gas", or "unknown")

**Description**: Determines compound state at room temperature (20°C / 68°F) based on melting and boiling points.
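
The decision described above can be sketched as follows. This is a minimal illustration, not the package's code: the function names are hypothetical, and returning `"unknown"` whenever the available values do not settle the state is our own assumption about a reasonable fallback.

```python
from typing import Optional

ROOM_TEMP_C = 20.0  # room temperature used for the comparison (68 °F)


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def state_at_room_temp(
    mp_c: Optional[float],
    bp_c: Optional[float],
    mp_f: Optional[float] = None,
    bp_f: Optional[float] = None,
) -> str:
    """Return 'solid', 'liquid', 'gas', or 'unknown' at 20 °C."""
    # Fall back to the Fahrenheit values when the Celsius ones are missing.
    if mp_c is None and mp_f is not None:
        mp_c = fahrenheit_to_celsius(mp_f)
    if bp_c is None and bp_f is not None:
        bp_c = fahrenheit_to_celsius(bp_f)

    if mp_c is not None and mp_c > ROOM_TEMP_C:
        return "solid"   # still below its melting point
    if bp_c is not None and bp_c <= ROOM_TEMP_C:
        return "gas"     # already above its boiling point
    if mp_c is not None and bp_c is not None:
        return "liquid"  # melting point <= 20 °C < boiling point
    return "unknown"     # not enough data to decide


print(state_at_room_temp(0.0, 100.0))             # water
print(state_at_room_temp(None, None, 32.0, 212.0))  # water, Fahrenheit only
```

Both calls print `liquid`. A compound with only a melting point below 20 °C is reported as `unknown` here, since it could be either a liquid or a gas.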

---

#### 6. `prioritize_pictograms`
**Purpose**: Sorts GHS pictograms by hazard severity.

**Input**:
- `pictograms: List[str]` - List of GHS pictogram names

**Output**:
- `List[str]` - Sorted list of pictograms

**Description**: Orders pictograms according to predefined hazard severity priority (lower numbers = more severe hazards).
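
A priority sort of this kind can be sketched with a lookup table. The ranking below is purely illustrative: the actual order used by `prioritize_pictograms` is not reproduced here, and both the table `PICTOGRAM_PRIORITY` and the function name `sort_by_priority` are hypothetical.

```python
from typing import Dict, List

# Hypothetical ranking (1 = most severe); replace with the order actually used.
PICTOGRAM_PRIORITY: Dict[str, int] = {
    "Explosive": 1,
    "Compressed Gas": 2,
    "Flammable": 3,
    "Oxidizer": 4,
    "Corrosive": 5,
    "Acute Toxic": 6,
    "Health Hazard": 7,
    "Environmental Hazard": 8,
    "Irritant": 9,
}


def sort_by_priority(pictograms: List[str]) -> List[str]:
    """Return the pictograms ordered from most to least severe."""
    # Names missing from the table are ranked last instead of raising an error.
    unknown_rank = len(PICTOGRAM_PRIORITY) + 1
    return sorted(pictograms, key=lambda p: PICTOGRAM_PRIORITY.get(p, unknown_rank))


print(sort_by_priority(["Irritant", "Flammable", "Corrosive"]))
```

With this illustrative table the call prints `['Flammable', 'Corrosive', 'Irritant']`. Because `sorted` is stable, pictograms with the same rank keep their input order.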

---

#### 7. `is_chemically_compatible`
**Purpose**: Determines whether two chemicals can be safely stored together.

**Input**:
- `existing_pictograms: List[str]` - GHS pictograms for the existing chemical
- `new_pictograms: List[str]` - GHS pictograms for the new chemical
- `existing_acid_base_class: str` - Acid/base classification of the existing chemical
- `new_acid_base_class: str` - Acid/base classification of the new chemical
- `existing_state: str` - Physical state of the existing chemical
- `new_state: str` - Physical state of the new chemical
- `group_name: str` - Storage group used for applying group-specific compatibility rules

**Output**:
- `bool` - `True` if compatible, `False` otherwise

**Description**: Evaluates compatibility based on pictograms, acid/base classifications, physical states, and storage group.
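
To make the kind of rules involved concrete, here is a simplified sketch. It is not the real `is_chemically_compatible`: the function name `can_share_storage` is hypothetical, the `group_name` rules are left out, and `INCOMPATIBLE_PAIRS` contains a single placeholder pair rather than the full EPFL compatibility table.

```python
from typing import FrozenSet, List, Set

# Placeholder data: fill in from the EPFL compatibility table.
INCOMPATIBLE_PAIRS: Set[FrozenSet[str]] = {
    frozenset({"Flammable", "Oxidizer"}),
}


def can_share_storage(
    existing_pictograms: List[str],
    new_pictograms: List[str],
    existing_acid_base_class: str,
    new_acid_base_class: str,
    existing_state: str,
    new_state: str,
) -> bool:
    """Return True if no rule in this simplified sketch forbids storing both together."""
    # Rule 1: acids and bases are always kept apart.
    if {existing_acid_base_class, new_acid_base_class} == {"acid", "base"}:
        return False
    # Rule 2 (assumption): compounds in different physical states are separated.
    if existing_state != new_state:
        return False
    # Rule 3: any incompatible pictogram pair across the two compounds is a conflict.
    for existing in existing_pictograms:
        for new in new_pictograms:
            if frozenset({existing, new}) in INCOMPATIBLE_PAIRS:
                return False
    return True


print(can_share_storage(["Flammable"], ["Oxidizer"], "neutral", "neutral", "liquid", "liquid"))
print(can_share_storage(["Corrosive"], ["Corrosive"], "acid", "acid", "liquid", "liquid"))
```

The first call prints `False` (flammable next to oxidizer) and the second prints `True`. Storing pairs as `frozenset` makes the lookup independent of argument order.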

---

#### 8. `chemsort_multiple_order_3`
**Purpose**: Sorts compounds into compatible storage groups.

**Input**:
- `compounds: List[Dict[str, Any]]` - List of compounds to sort, each represented as a dictionary with keys:
  - `name`
  - `sorted_pictograms`
  - `hazard_statements`
  - `acid_base_class`
  - `state_room_temp`
- `storage_groups: Dict[str, Dict[str, List[Dict[str, Any]]]]` - Existing storage groups dictionary

**Output**:
- `Dict[str, Dict[str, List[Dict[str, Any]]]]` - Updated dictionary of storage groups with sorted compounds

**Description**: Processes compounds by assigning them to predefined groups if compatible, or creates new custom groups as needed. Each storage group contains sub-categories for solids, liquids, and gases.
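
The nested dictionary and the fallback to custom groups can be sketched as below. This is a simplified first-fit placement under our own assumptions, not the logic of `chemsort_multiple_order_3`: the helpers `new_group`, `place_compound`, and `acids_apart_from_bases`, the `custom_N` naming, and the toy compatibility rule are all hypothetical, and the real function applies group-specific rules that are omitted here.

```python
from typing import Any, Callable, Dict, List

Compound = Dict[str, Any]
StorageGroups = Dict[str, Dict[str, List[Compound]]]


def new_group() -> Dict[str, List[Compound]]:
    """Create an empty storage group with one shelf per physical state."""
    return {"solid": [], "liquid": [], "gas": []}


def place_compound(
    compound: Compound,
    storage_groups: StorageGroups,
    is_compatible: Callable[[Compound, Compound], bool],
) -> str:
    """Put the compound in the first compatible group, else in a new custom group."""
    state = compound["state_room_temp"]
    for group_name, shelves in storage_groups.items():
        residents = [c for shelf in shelves.values() for c in shelf]
        if all(is_compatible(resident, compound) for resident in residents):
            shelves.setdefault(state, []).append(compound)
            return group_name
    # No existing group fits: create a custom group with an unused name.
    index = 1
    while f"custom_{index}" in storage_groups:
        index += 1
    custom_name = f"custom_{index}"
    storage_groups[custom_name] = new_group()
    storage_groups[custom_name].setdefault(state, []).append(compound)
    return custom_name


def acids_apart_from_bases(a: Compound, b: Compound) -> bool:
    """Toy rule for the demo: only acid/base pairs are incompatible."""
    return {a["acid_base_class"], b["acid_base_class"]} != {"acid", "base"}


groups: StorageGroups = {"corrosive": new_group()}
compounds = [
    {"name": "acetic acid", "acid_base_class": "acid", "state_room_temp": "liquid"},
    {"name": "sodium hydroxide", "acid_base_class": "base", "state_room_temp": "solid"},
    {"name": "sulfuric acid", "acid_base_class": "acid", "state_room_temp": "liquid"},
]
placed = [place_compound(c, groups, acids_apart_from_bases) for c in compounds]
print(placed)
```

The demo prints `['corrosive', 'custom_1', 'corrosive']`: the base cannot join the acid, so it gets its own custom group, while the second acid joins the first.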

### Streamlit Web Interface

Results are displayed in a Streamlit web app, which makes it easier to see where each compound is stored.

## 3. Limitations

### PubChem Compounds Registration

When a compound name is searched on PubChem, the result is the pure compound, not a solution of it. This is a problem because many laboratory chemicals come as ready-made solutions, and the pictograms of a solution can differ from those of the pure compound, especially at low concentration. For example, searching for hydrochloric acid returns the gas itself rather than a solution of hydrochloric acid.

This could be improved by searching for the molecule's name on a supplier website such as Sigma-Aldrich or Fisher Scientific and retrieving the pictograms and additional information from there. That proved tedious, because the information is spread across specific places on those websites, and the PubChem REST API method was already finished and worked well for many compounds. It was therefore not replaced, but this limitation must be kept in mind when using our `chemstorm` package.

### Specific Incompatibilities

The EPFL guidelines for the storage of hazardous chemicals recommend checking sections 7 and 10 of a compound's SDS (Safety Data Sheet). These sections give specific storage indications, such as the material the storage should be made of and the chemicals the compound must not be stored with. We wrote code that fetches the SDS PDF of a compound from the Fisher Scientific website, which has a dedicated SDS search page. Because these PDFs always share the same format, sections 7 and 10 could be extracted.
However, because of the way PubChem registers compounds, the retrieved PDF was not always the correct one. The PDF was taken from the first link found for a compound's name, which did not necessarily correspond exactly to that compound. This could be improved by searching for the specific PDF of the chemical, possibly on several websites rather than only Fisher Scientific, but since the SDS PDF format varies, a single simple function would not be enough to retrieve the relevant information. Another improvement would be to use machine learning, but we did not have time for it, and the pictograms, acid/base classification, and hazard statements proved sufficient in most cases to sort hazardous chemicals.

## 4. Problems and Challenges

### No Direct Database Access

Pictograms and safety hazard statements were not available in any existing database. Although it was possible to download databases for certain categories, such as "corrosive" compounds from PubChem, the corresponding hazard statements were missing or difficult to extract.

By inspecting the HTML code of PubChem’s website, it seemed feasible to extract safety pictogram names and hazard statements using the `BeautifulSoup` package. However, this approach failed because the pictogram images are loaded dynamically via JavaScript on the compound pages, and therefore do not appear in the static HTML that BeautifulSoup parses.

To overcome this, the `selenium` package ([https://pypi.org/project/selenium/](https://pypi.org/project/selenium/)) was tested, as it can control a web browser (e.g., Google Chrome) and scrape dynamic JavaScript-loaded content. Although Selenium worked well for retrieving information for a single compound, it proved too slow when processing multiple chemicals because it required fully loading each PubChem page in a browser. This took several minutes per compound, an unacceptable delay for large datasets.

Ultimately, the PubChem PUG-View REST API ([https://pubchem.ncbi.nlm.nih.gov/docs/pug-view](https://pubchem.ncbi.nlm.nih.gov/docs/pug-view)) was used instead. Initially, it was believed that this API did not contain the needed data, but after thorough analysis of its structure, the locations of the pictograms and hazard statements were successfully identified. This method was significantly faster, even when tested on a dozen compounds, and thus was kept.

Additionally, the `pubchempy` package ([https://pubchempy.readthedocs.io/en/latest/guide/introduction.html](https://pubchempy.readthedocs.io/en/latest/guide/introduction.html)) was used to retrieve each compound’s generic name, IUPAC name, and SMILES notation.

### Storage Categories and Complex Cases

The sorting of hazardous chemicals followed EPFL’s safety directives ([EPFL Chemicals Storage Flowchart 2024](https://www.epfl.ch/campus/security-safety/wp-content/uploads/2024/01/Chemicals-Storage-flowchart_2024.pdf)), which specify how to store chemicals based on their safety pictograms and hazard statements.

Incompatibilities between pictograms were considered according to a referenced diagram, which also recommends separating liquids from solids and storing explosive compounds or compressed gases separately, sometimes in isolation from other chemicals of the same class (e.g., oxygen storage).

The pictograms have a defined priority order; this was incorporated in the code by sorting each compound’s pictograms from highest to lowest priority before sorting the chemicals accordingly. Moreover, acids and bases were always separated due to the risk of violent reactions.

The EPFL storage categories (such as no pictograms or exclamation point, hazardous to the environment, acute toxicity, CMR/STOT, toxicity category 2/3, corrosive category 1, irritant, pyrophoric, flammable, oxidizer) become insufficient when handling chemicals with multiple hazard pictograms, especially if they include conflicting hazards.

For instance, triethylamine is flammable, corrosive, and acutely toxic. As a base, it must not be stored with acids, and because it is flammable, it cannot be stored with corrosive bases either, which creates a storage conflict.

To resolve this, a new custom storage group is created each time a chemical has conflicting pictograms, so that no pictogram incompatibilities are present within a storage group.