In [None]:
# Auto-setup when running on Google Colab
import os
if 'google.colab' in str(get_ipython()) and not os.path.exists('/content/LLM4SD'):
    !git clone -q https://github.com/mamintoosi-papers-codes/LLM4SD.git /content/LLM4SD
    # !pip --quiet install -r /content/LLM4SD/requirements.txt
    %cd /content/LLM4SD
    !curl -fsSL https://ollama.com/install.sh | sh  # Install Ollama
    !ollama pull falcon:7b  # Download the model
    !ollama run falcon:7b "Why is the sky blue?"  # Smoke-test the model
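
Once the model is pulled, it can also be queried from Python instead of the `ollama run` command line. The following is a minimal sketch, assuming an Ollama server is listening on its default local port (11434). The helper names `build_generate_payload` and `query_ollama` are hypothetical and are not taken from the repository's `*_local.py` scripts, whose implementation is not shown here.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumption: default port).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}


def query_ollama(model: str, prompt: str, url: str = OLLAMA_URL,
                 timeout: float = 120.0) -> str:
    """Send one prompt to a running Ollama server and return the generated text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # Supplying `data` makes urllib issue a POST request.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With the server running, `query_ollama("falcon:7b", "Why is the sky blue?")` would return the model's answer as a single string.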

In [14]:
# !pip install rdkit
# !pip install mordred

In [1]:
# **Step 1:** Generate prior knowledge and data knowledge prompt files
print("Processing Step 1: Generating Prompt files...")
%run create_prompt.py --task synthesize

Processing Step 1: Generating Prompt files...


In [2]:
%run create_prompt.py --task inference

In [1]:
# **Step 2:** Knowledge synthesis for BBBP dataset
print("Processing Step 2: LLM for Scientific Synthesis")
%run synthesize.py --dataset bbbp --subtask "" --model falcon-7b --output_folder "synthesize_model_response"

Processing Step 2: LLM for Scientific Synthesis







Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Extracting bbbp_small dataset prior knowledge prompt ....
Assumed you are an experienced chemist. Please come up with 20 rules that you think are very important to predict if a molecule is blood brain barrier permeable (BBBP). Each rule is either about the structure or property of molecules. Each rule starts with 'Calculate....' and don't explain, be short and within 5 words.

1. Calculate log P: Logarithm of partition coefficient between blood and brain.
2. Calculate log D: Logarithm of partition coefficient between drug and blood.
3. Calculate log P: Logarithm of partition coefficient between blood and brain.
4. Calculate log P: Logarithm of partition coefficient betwe
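
The response above repeats the same rule several times ("Calculate log P" appears at positions 1, 3, and 4). A minimal sketch of how such repeats could be dropped before the rules are passed to later steps, assuming each rule sits on its own numbered line; `unique_rules` is a hypothetical helper, not part of the repository:

```python
import re

# Matches lines such as "3. Calculate log P: ..." or "3) Calculate log P: ...".
RULE_LINE = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")


def unique_rules(response: str) -> list[str]:
    """Return numbered rules in first-seen order, skipping exact repeats."""
    seen = set()
    rules = []
    for line in response.splitlines():
        match = RULE_LINE.match(line)
        if not match:
            continue  # not a numbered rule line
        text = match.group(1)
        # Compare case-insensitively and ignore differences in spacing.
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            rules.append(text)
    return rules
```

This only removes literal repeats. Rules that are worded differently but mean the same thing would still need a semantic comparison.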

In [4]:
# **Step 2 (local):** Knowledge synthesis for BBBP dataset via Ollama (synthesize_local.py)
print("Processing Step 2: LLM for Scientific Synthesis")
%run synthesize_local.py --dataset bbbp --subtask "" --model falcon-7b --output_folder "synthesize_model_response"

Processing Step 2: LLM for Scientific Synthesis
Extracting bbbp_small dataset prior knowledge prompt ....
Querying Ollama Model: falcon-7b
1. Identify : Calculate log P.
2. Assess polar/nonpolar nature : Log P (polar).
3. Estimate bond lengths : Log V (nonpolar).
4. Evaluate aromaticity : Estimate number of electron groups around (aromatic).
5. Assess symmetry : Check for symmetrical about axis (symmetry).
6. Calculate electronegativity : Estimate of chemical bond strength.
7. Estimate electrostatic potential : Calculate electrostatic potential of a molecule.
8. Assess steric hindrance : Estimate steric hindrance of a molecule.
9. Estimate steric hindrance : Log S (non-polar).
10. Assess symmetry : Check for symmetrical about axis (symmetry).
11. Calculate electronegativity : Estimate of chemical bond strength.
12. Assess steric hindrance : Estimate steric hindrance of a molecule.
13. Estimate steric hindrance : Check for symmetry about axis.
14. Calculate electronegativity : Estimate 

In [5]:
# **Step 3:** Knowledge inference for BBBP dataset
print("Processing Step 3: LLM for Scientific Inference")
# %run inference_local.py --dataset bbbp --subtask "" --model falcon-7b --list_num 30 --output_folder "inference_model_response"
%run inference_local.py --dataset bbbp --model falcon-7b --output_folder "inference_model_response"


Processing Step 3: LLM for Scientific Inference
Extracting bbbp dataset inference prompt...
Starting inference using Ollama's local model...
Querying Ollama Model: falcon-7b
1) If the structure of the molecule contains a BBBP pair, it is more likely to be BBBP. 
2) If the first atom in the molecule is a BBBP atom, it is more likely to be BBBP. 
3) If the number of BBBP atoms in the molecule is greater than the number of non-BBBP atoms, it is more likely to be BBBP.
User 
✅ Inference completed! Response saved to: inference_model_response\falcon-7b\bbbp\falcon-7b_inference_response.txt
Inference/Time elapsed: 1.135082721710205 seconds


In [6]:
# **Step 4:** Summarize inference rules with a local model (the GPT-4 variant is kept commented out below)
print("Processing Step 4: Summarizing Rules")
# API_KEY = ""  # Set your OpenAI API Key before running
# %run summarize_rules.py --input_model_folder falcon-7b --dataset bbbp --subtask "" --list_num 30 --api_key $API_KEY --output_folder "summarized_inference_rules"

%run summarize_rules_local.py --input_model_folder falcon-7b --dataset bbbp --output_folder "summarized_inference_rules"


Processing Step 4: Summarizing Rules
🔹 Step 4: Summarizing Inference Rules Locally using Gemma 3:27B...
✅ Loaded 4 inferred rules from inference_model_response\falcon-7b\bbbp\falcon-7b_inference_response.txt
✅ Summarized rules:
## Concise Summary of BBBP Inference Rules:

These rules suggest a molecule is more likely to be BBBP (likely referring to a specific property or classification) based on its atomic composition and structure. The key indicators are:

* **BBBP Pair Presence:** The existence of a BBBP pair within the molecular structure increases likelihood.
* **BBBP Atom Dominance:** A higher proportion of BBBP atoms compared to non-BBBP atoms increases likelihood.
* **Initiating BBBP Atom:** The presence of a BBBP atom as the first atom in the molecule increases likelihood.



**Note:** The term "BBBP" remains undefined without further context. This summary assumes it represents a specific molecular property or classification being predicted. These rules essentially describe com
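
The loader reports 4 inferred rules, while the Step 3 response contains three numbered rules followed by a stray `User` line. One possible explanation, offered only as a guess because the loader's code is not shown, is that every non-empty line is counted as a rule. A minimal sketch of a stricter split, using a hypothetical helper `partition_response`:

```python
import re

# A rule line starts with a number followed by "." or ")" and some text.
NUMBERED = re.compile(r"^\s*\d+[.)]\s+\S")


def partition_response(response: str) -> tuple[list[str], list[str]]:
    """Split a model response into numbered rule lines and everything else."""
    rules, leftovers = [], []
    for line in response.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines entirely
        (rules if NUMBERED.match(stripped) else leftovers).append(stripped)
    return rules, leftovers
```

Returning the leftovers separately makes it possible to log them instead of silently passing them to the summarizer.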

In [None]:
# **Step 5:** Interpretable model training and evaluation
print("Processing Step 5: Interpretable Model Training and Evaluation")
# %run code_gen_and_eval.py --dataset bbbp --subtask "" --model falcon-7b --knowledge_type "synthesize" --api_key $API_KEY --output_dir "llm4sd_results" --code_gen_folder "llm4sd_code_generation"
%run code_gen_and_eval_local.py --dataset bbbp --subtask "" --model falcon-7b --knowledge_type "synthesize" --output_dir "llm4sd_results" --code_gen_folder "llm4sd_code_generation"

# Output observed on an earlier run:
# [20:18:10] Molecule does not have explicit Hs. Consider calling AddHs()

In [None]:
# %run code_gen_and_eval.py --dataset bbbp --subtask "" --model falcon-7b --knowledge_type "inference" --list_num 30 --api_key $API_KEY --output_dir "llm4sd_results" --code_gen_folder "llm4sd_code_generation"
%run code_gen_and_eval_local.py --dataset bbbp --subtask "" --model falcon-7b --knowledge_type "inference" --output_dir "llm4sd_results" --code_gen_folder "llm4sd_code_generation"

# Output observed on an earlier run:
# [18:02:28] WARNING: not removing hydrogen atom without neighbors
# Error in function rule345678_num_carbons: module 'rdkit.Chem.Descriptors' has no attribute 'NumCarbonAtoms'
# Attempting to rectify the code...

In [2]:
%run code_gen_and_eval_local.py --dataset bbbp --subtask "" --model falcon-7b --knowledge_type "all" --output_dir "llm4sd_results" --code_gen_folder "llm4sd_code_generation"
print("Processing completed for BBBP dataset.")

SyntaxError: unterminated string literal (detected at line 133) (<string>, line 133)

Processing completed for BBBP dataset.
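
Both failures seen in Step 5 (a call to a nonexistent `Descriptors` attribute and an unterminated string literal) can be detected before the generated code is executed. The following is a minimal sketch of such a pre-check, assuming the generated code is available as a source string. `check_generated_code` is a hypothetical helper and not how `code_gen_and_eval_local.py` is known to work.

```python
import ast
import math  # stand-in module for the usage example below


def check_generated_code(source: str, alias: str, module) -> list[str]:
    """Return a list of problems found in `source` without executing it.

    `alias` is the name the generated code uses for `module`,
    for example "Descriptors" for rdkit.Chem.Descriptors.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as error:
        return [f"syntax error at line {error.lineno}: {error.msg}"]
    problems = []
    for node in ast.walk(tree):
        # Look for expressions of the form `alias.some_name`.
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == alias
            and not hasattr(module, node.attr)
        ):
            problems.append(f"{alias}.{node.attr} does not exist")
    return problems


# Usage with a stand-in module: one valid call, one invalid call.
print(check_generated_code("x = math.sqrt(4)", "math", math))
print(check_generated_code("x = math.num_carbons(4)", "math", math))
```

For the RDKit case, the same check could be run with `"Descriptors"` and the imported `rdkit.Chem.Descriptors` module. A carbon count itself can be obtained from a molecule's atoms, for example `sum(atom.GetAtomicNum() == 6 for atom in mol.GetAtoms())`.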
