# LlaSMol: Advancing Large Language Models for Chemistry

* Repository: [LlaSMol](https://github.com/OSU-NLP-Group/LLM4Chem)
* Pre-print: [arXiv:2402.09391](https://arxiv.org/abs/2402.09391)
* Hugging Face Hub Collection: [LlaSMol Collection](https://huggingface.co/collections/osunlp/llasmol-6778699822ca8585edcc1c5d)

## Clone the GitHub Repository

In [None]:
!git clone https://github.com/OSU-NLP-Group/LLM4Chem.git

In [None]:
%cd LLM4Chem

In [None]:
# import the necessary modules
from generation import LlaSMolGeneration

# initialize the generator model
generator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')

## Name Conversion

### SMILES to IUPAC

In [None]:
generator.generate('Can you tell me the IUPAC name of <SMILES> c1ccccc1C(=O)O </SMILES> ?')

### SMILES to Molecular Formula

In [None]:
generator.generate('What is the molecular formula of <SMILES> c1ccccc1C(=O)O </SMILES>?')

### IUPAC to SMILES

In [None]:
generator.generate('Could you provide the SMILES for <IUPAC> benzoic acid </IUPAC>?')

### IUPAC to Molecular Formula

In [None]:
generator.generate('Could you please give me the molecular formula for <IUPAC> benzoic acid </IUPAC>?')
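The prompts in this section wrap each molecule in a tag pair naming its representation (`<SMILES> ... </SMILES>`, `<IUPAC> ... </IUPAC>`). A minimal sketch of a helper that builds such prompts is shown here; the function name `tag` is hypothetical and is not part of the LLM4Chem repository.

```python
def tag(kind: str, value: str) -> str:
    """Wrap a molecule string in a tag pair, e.g. <SMILES> ... </SMILES>."""
    # `tag` is an illustrative helper, not a function from the repository.
    return f"<{kind}> {value.strip()} </{kind}>"


smiles = "c1ccccc1C(=O)O"

# Rebuild two of the prompts used above from their parts.
prompt_iupac = f"Can you tell me the IUPAC name of {tag('SMILES', smiles)} ?"
prompt_smiles = f"Could you provide the SMILES for {tag('IUPAC', 'benzoic acid')}?"

print(prompt_iupac)
print(prompt_smiles)
```

The resulting strings can be passed to `generator.generate` in the same way as the hand-written prompts.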

## Molecule Description

### Molecule Captioning

In [None]:
generator.generate('Describe this molecule: <SMILES> c1ccccc1C(=O)O </SMILES>')

### Molecule Generation

In [None]:
generator.generate('Give me a molecule based on the provided conditions: The molecule is composed of a benzene ring with a carboxylic acid group attached to it.')

## Property Prediction

### ESOL

In [None]:
generator.generate('How soluble is <SMILES> c1ccccc1C(=O)O </SMILES> in water?')

### LIPO

In [None]:
generator.generate('Predict the octanol/water distribution coefficient logD under pH 7.4 for <SMILES> c1ccccc1C(=O)O </SMILES>.')

### BBBP

In [None]:
generator.generate('Is blood-brain barrier permeability (BBBP) a property of <SMILES> c1ccccc1C(=O)O </SMILES>?')

### ClinTox

In [None]:
generator.generate('Is <SMILES> c1ccccc1C(=O)O </SMILES> toxic to humans?')

### HIV

In [None]:
generator.generate('Can <SMILES> c1ccccc1C(=O)O </SMILES> inhibit HIV replication?')

### SIDER

In [None]:
generator.generate('Are there any known side effects of <SMILES> c1ccccc1C(=O)O </SMILES> affecting the human heart?')

## Chemical Reaction

### Forward Synthesis

In [None]:
generator.generate('<SMILES> c1ccccc1C(=O)O.CCO </SMILES> Based on the reactants and reagents provided, can you suggest a possible product?')

### Retrosynthesis

In [None]:
generator.generate('Suggest possible reactants that could have been used to synthesize <SMILES> CCOC(=O)C1=CC=CC=C1 </SMILES>.')
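In SMILES, a `.` separates disconnected components, which is how the forward-synthesis prompt above lists benzoic acid and ethanol together. As a rough sketch, the helpers below join a list of reactants into one such input and split a multi-component SMILES back into its parts. Both function names are hypothetical, and the split is a plain string operation that does not validate the SMILES.

```python
from typing import List


def forward_synthesis_prompt(reactants: List[str]) -> str:
    """Join reactant SMILES with '.' and build a forward-synthesis prompt."""
    joined = ".".join(r.strip() for r in reactants)
    return (
        f"<SMILES> {joined} </SMILES> Based on the reactants and reagents "
        "provided, can you suggest a possible product?"
    )


def split_components(smiles: str) -> List[str]:
    """Split a multi-component SMILES string on '.', dropping empty parts."""
    return [part for part in smiles.strip().split(".") if part]


prompt = forward_synthesis_prompt(["c1ccccc1C(=O)O", "CCO"])
print(prompt)
print(split_components("c1ccccc1C(=O)O.CCO"))
```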

## Off-Domain Prediction

### Molecular Formula to SMILES

In [None]:
generator.generate('Could you provide the SMILES for <MOLFORMULA> C7H6O2 </MOLFORMULA> ?')

### Molecular Formula to IUPAC

In [None]:
generator.generate('Could you provide the IUPAC name for <MOLFORMULA> C7H6O2 </MOLFORMULA> ?')
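A molecular formula such as C7H6O2 fixes only the atom counts, so many different structures share it. When checking a formula returned by the model, it can help to compare atom counts rather than raw strings. The parser below is a minimal sketch with hypothetical function names; it assumes flat formulas only, with no parentheses, charges, or isotope labels.

```python
import re
from collections import Counter

# One element symbol (capital letter, optional lowercase) plus an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat molecular formula such as 'C7H6O2'."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def same_formula(a: str, b: str) -> bool:
    """Return True when two formulas contain the same atoms, in any order."""
    return parse_formula(a) == parse_formula(b)


print(dict(parse_formula("C7H6O2")))
print(same_formula("C7H6O2", "C7O2H6"))
```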

## Hugging Face Hub Usage

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from peft import PeftModelForCausalLM
import torch
import os
import sys

In [None]:
# set the Hub path variables to the base and fine-tuned models
base_model = "mistralai/Mistral-7B-v0.1"
model_name = "osunlp/LlaSMol-Mistral-7B"

# load the base model from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=os.environ.get("HF_TOKEN")
)

# load the fine-tuned adapter weights on top of the base model
model = PeftModelForCausalLM.from_pretrained(
    model,
    model_name,
    torch_dtype=torch.bfloat16,
    token=os.environ.get("HF_TOKEN")
)

# fetch the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(base_model)

In [None]:
# setup tokenizer special tokens
tokenizer.padding_side = 'left'
tokenizer.pad_token = '<pad>'
tokenizer.sep_token = '<unk>'
tokenizer.cls_token = '<unk>'
tokenizer.mask_token = '<unk>'

# make sure the model's config uses the tokenizer's special tokens
model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id

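# switch to evaluation mode, then compile the model for performance
model.eval()
# compare the major version numerically; comparing version strings is unreliable
if int(torch.__version__.split(".")[0]) >= 2 and sys.platform != "win32":
    model = torch.compile(model)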

In [None]:
# use the utility functions from the cloned LLM4Chem repository for chat and prompt generation
from utils.chat_generation import generate_chat
from utils.general_prompter import GeneralPrompter, get_chat_content

# sample input text from the user
input_text = "Could you provide the SMILES for <IUPAC> benzoic acid </IUPAC>?"

# create the prompt and the chat between user and the assistant
prompter = GeneralPrompter(get_chat_content)
chat = generate_chat(input_text, output_text=None)
print(chat)

# generate the full prompt for the model generation
full_prompt = prompter.generate_prompt(chat)
print(full_prompt)

In [None]:
# encode the full prompt into input IDs and move them to the model's device
input_ids = tokenizer.encode(full_prompt, return_tensors="pt", padding=True, add_special_tokens=False).to(model.device)
input_ids

In [None]:
# test the tokenizer by decoding the input IDs
tokenizer.decode(input_ids[0])

In [None]:
generation_settings = {}

generation_config = GenerationConfig(
    pad_token_id=model.config.pad_token_id,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    **generation_settings,
)

model.eval()
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
    )
s = generation_output.sequences
output = tokenizer.batch_decode(s, skip_special_tokens=False)

output_text = []
for output_item in output:
    text = prompter.get_response(output_item)
    output_text.append(text)

print(output_text)
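If the response wraps its answer in the same tags used in the prompts (an assumption here, not something this notebook verifies), the answer can be pulled out with a small regular expression. The helper name `extract_tagged` and the example string below are hypothetical; the example is not actual model output.

```python
import re
from typing import List


def extract_tagged(text: str, tag: str) -> List[str]:
    """Return every span found between <tag> and </tag>, whitespace-trimmed."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}>\s*(.*?)\s*</{name}>", re.DOTALL)
    return pattern.findall(text)


# Hypothetical response text, used only to illustrate the extraction.
example = "Sure. <SMILES> O=C(O)c1ccccc1 </SMILES>"
print(extract_tagged(example, "SMILES"))
```

Applied to the notebook's results, this would be called on each entry of `output_text`. An empty list means no tagged answer was found.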