# The SMolInstruct dataset

* Repository: [LlaSMol](https://github.com/OSU-NLP-Group/LLM4Chem)
* Pre-print: [arXiv:2402.09391](https://arxiv.org/abs/2402.09391)
* Hugging Face Hub: [SMolInstruct Dataset](https://huggingface.co/datasets/osunlp/SMolInstruct)

## Loading the Dataset from Hugging Face Hub

In [None]:
# import the necessary modules
from datasets import load_dataset

In [None]:
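# Load the dataset. No `split` argument is passed, so the result is a
# DatasetDict keyed by split name, which the cells below index with ["train"].
dataset = load_dataset(path='osunlp/SMolInstruct',
                       trust_remote_code=True,
                       cache_dir="./tmp")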

In [None]:
# dataset structure
dataset

In [None]:
# inspect the data samples
dataset["train"][0]
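Samples can contain long instruction strings, so a shortened preview is often easier to read than the raw dictionary. The following is a minimal sketch, not part of the dataset's API: `summarize_sample` is a hypothetical helper, and the fields in `mock_sample` are placeholders standing in for a real sample.

```python
def summarize_sample(sample, max_chars=40):
    """Return {field: preview}, truncating long values so they stay readable."""
    summary = {}
    for key, value in sample.items():
        text = str(value)
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        summary[key] = text
    return summary


# Hypothetical stand-in for one sample; the field names are assumptions.
mock_sample = {
    "input": "x" * 100,
    "output": "CCO",
    "task": "forward_synthesis",
}

for field, preview in summarize_sample(mock_sample).items():
    print(f"{field}: {preview}")
```

With the real dataset loaded as a `DatasetDict`, passing `dataset["train"][0]` to the helper should work the same way, since indexing a split returns a plain dictionary.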

## Using SELFIES

In [None]:
# using SELFIES
dataset_selfies = load_dataset(path='osunlp/SMolInstruct',
                               trust_remote_code=True,
                               cache_dir="./tmp",
                               use_selfies=True)

# inspecting a sample with SELFIES
dataset_selfies["train"][0]

## Selecting Specific Tasks

In [None]:
# setting the task tuple
all_tasks = (
  'forward_synthesis',
  'retrosynthesis',
  'molecule_captioning',
  'molecule_generation',
  'name_conversion-i2f',
  'name_conversion-i2s',
  'name_conversion-s2f',
  'name_conversion-s2i',
  'property_prediction-esol',
  'property_prediction-lipo',
  'property_prediction-bbbp',
  'property_prediction-clintox',
  'property_prediction-hiv',
  'property_prediction-sider',
)

# selecting specific tasks
task_dataset = load_dataset(path='osunlp/SMolInstruct',
                           trust_remote_code=True,
                           cache_dir="./tmp",
                           tasks=all_tasks[:2])  # example with first two tasks

In [None]:
# shuffle the training split and inspect one random sample
task_dataset["train"].shuffle()[0]
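To check that only the requested tasks were loaded, one option is to count samples per task. The sketch below runs on mock rows so it works without downloading anything. It assumes each sample carries a `task` field holding the task name. `count_tasks` is a hypothetical helper, not something the dataset provides.

```python
from collections import Counter

# Hypothetical stand-in rows; the "task" field name is an assumption
# about the sample schema.
mock_samples = [
    {"task": "forward_synthesis", "input": "placeholder", "output": "placeholder"},
    {"task": "retrosynthesis", "input": "placeholder", "output": "placeholder"},
    {"task": "forward_synthesis", "input": "placeholder", "output": "placeholder"},
]


def count_tasks(samples):
    """Count how many samples belong to each task."""
    return Counter(sample["task"] for sample in samples)


task_counts = count_tasks(mock_samples)
print(task_counts)
```

If the assumption about the `task` field holds, `count_tasks(task_dataset["train"])` should report only the two tasks selected above, because iterating over a split yields one dictionary per sample.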

## Removing Task-Specific Tags

In [None]:
# removing task-specific tags from the inputs
dataset_without_tags = load_dataset(path='osunlp/SMolInstruct',
                                   trust_remote_code=True,
                                   cache_dir="./tmp",
                                   insert_core_tags=False)

# inspecting a sample without task-specific tags
dataset_without_tags["train"][0]
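For text that already contains tags, a similar effect can be approximated after loading. This is a rough sketch that assumes the tags look like `<SMILES> ... </SMILES>`, with uppercase names in angle brackets. `strip_core_tags` is a hypothetical helper and is not how the dataset loader implements `insert_core_tags=False`.

```python
import re

# Assumed tag shape: <NAME> or </NAME> with an uppercase name, e.g. <SMILES>.
CORE_TAG_PATTERN = re.compile(r"\s*</?[A-Z]+>\s*")


def strip_core_tags(text):
    """Replace each tag (and the whitespace around it) with a single space."""
    return CORE_TAG_PATTERN.sub(" ", text).strip()


tagged = "Product of <SMILES> CCO </SMILES> ?"
print(strip_core_tags(tagged))
```

Passing `insert_core_tags=False` at load time remains the more reliable route, because the regular expression only guesses at the tag format.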