# Documentation for Plant Compound Scraper

## Overview
This Python script automates the extraction of natural products from the **COCONUT** database and retrieves molecular properties from **PubChem**. The program uses **Selenium** for web scraping and **requests** for API calls. Extracted data is saved to a CSV file.

---

## Dependencies
Ensure you have the following libraries installed before running the script:
```bash
pip install selenium webdriver-manager pandas tqdm requests
```

---

## Features
- Scrapes **COCONUT** for natural products related to user-provided plant names.
- Fetches molecular properties from **PubChem**, including:
  - Molecular Formula
  - Molecular Weight
  - IUPAC Name
  - Canonical SMILES
  - Compound CID
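- Processes compounds one at a time in the current version of the script; **ThreadPoolExecutor** is not used yet, although the PubChem lookups could be parallelized within the rate limits.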
- Saves extracted data into a CSV file for easy access.

---

## Best Practices to Ensure Optimal Results
1. **Ensure Correct Spelling of Plant Names**: Incorrect spellings can lead to no results or incorrect data.
2. **Run in a Stable Network Environment**: The script relies on API calls and web scraping, which need stable internet connectivity.
3. **Use a Headless Browser for Performance**: The script is optimized for Google Colab with `--headless` mode enabled.
4. **Avoid Overloading Servers**: Excessive requests may lead to temporary bans. Keep `time.sleep(2)` to prevent this.
5. **Verify Data Integrity**: Check the CSV output for missing or incorrect values before proceeding with analysis.
6. **Keep ChromeDriver Updated**: Use `webdriver-manager` to always fetch the latest ChromeDriver version.
7. **Monitor Errors and Logs**: The script prints error messages for debugging in case of failures.

---

## Script Breakdown
### 1. `setup_driver()`
Initializes the **Selenium WebDriver** for scraping data from **COCONUT**.

### 2. `scrape_coconut_database(plant_name: str) -> List[str]`
Searches for compounds related to the input plant in the **COCONUT** database and extracts their names.
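
As a small illustration of the search step, the query URL can be built with standard URL encoding instead of manual space replacement. This is only a sketch: the helper name `build_coconut_search_url` is hypothetical, and the base URL is the one used in the script's code.

```python
from urllib.parse import urlencode

# Base URL as it appears in the script; the helper itself is a hypothetical name.
COCONUT_SEARCH_URL = "https://coconut.naturalproducts.net/search"


def build_coconut_search_url(plant_name: str) -> str:
    """Return the COCONUT search URL for one plant name."""
    query = " ".join(plant_name.split())  # collapse stray whitespace
    # urlencode escapes spaces as '+' and reserved characters as %XX
    return f"{COCONUT_SEARCH_URL}?{urlencode({'q': query})}"


print(build_coconut_search_url("Aloe  vera"))
# https://coconut.naturalproducts.net/search?q=Aloe+vera
```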

### 3. `fetch_pubchem_properties(compound_name: str) -> Tuple[str, str, str, str]`
Queries **PubChem** for molecular properties using the **PUG REST API**.
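
The request and the parsing can be separated so that each part is easy to check without a network connection. The sketch below builds the property URL and reads the first entry of a response-shaped dictionary. The helper names (`build_property_url`, `parse_properties`) are hypothetical, and `sample` is a hand-written payload that only mirrors the shape the script expects; it is not a recorded PubChem response.

```python
from typing import Any, Dict, Tuple
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PROPERTIES = ("MolecularFormula", "MolecularWeight", "IUPACName", "CanonicalSMILES")


def build_property_url(compound_name: str) -> str:
    """Build the by-name property URL, percent-encoding the compound name."""
    name = quote(compound_name.strip(), safe="")
    return f"{PUG_REST_BASE}/compound/name/{name}/property/{','.join(PROPERTIES)}/JSON"


def parse_properties(payload: Dict[str, Any]) -> Tuple[str, ...]:
    """Return the requested properties as strings, using 'N/A' for anything missing."""
    rows = payload.get("PropertyTable", {}).get("Properties") or [{}]
    first = rows[0]
    return tuple(str(first.get(key, "N/A")) for key in PROPERTIES)


# Hand-written sample in the expected shape (assumption, not real API output)
sample = {"PropertyTable": {"Properties": [{"MolecularFormula": "C2H6O", "MolecularWeight": "46.07"}]}}
print(build_property_url("ethyl alcohol"))
print(parse_properties(sample))  # ('C2H6O', '46.07', 'N/A', 'N/A')
```

Using `or [{}]` also covers an empty `Properties` list, which would otherwise raise an `IndexError` on `[0]`.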

### 4. `fetch_pubchem_cid(compound_name: str) -> str`
Retrieves the **Compound ID (CID)** for a given compound from **PubChem**.

### 5. `save_results_to_csv(data: Dict[str, List[str]], filename: str)`
Saves the extracted data into a **CSV file**.

### 6. `process_compound(plant: str, compound: str, extracted_data: Dict[str, List[str]])`
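Looks up one compound on **PubChem** and appends its properties and CID to `extracted_data`. In the current script, `main()` calls it sequentially for each compound; **ThreadPoolExecutor** is not used.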
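
If parallel lookups are wanted, a bounded thread pool is one way to do it. The sketch below is an assumption about how that could look, not the script's actual code: `lookup_all` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for a network call such as `fetch_pubchem_cid`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def lookup_all(compounds: List[str], lookup: Callable[[str], str], max_workers: int = 4) -> Dict[str, str]:
    """Run `lookup` for every compound on a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lookup, compounds))  # map() preserves input order
    return dict(zip(compounds, results))


def fake_lookup(name: str) -> str:
    # Stand-in for a real network call (hypothetical)
    return f"CID-for-{name}"


print(lookup_all(["Nimbolide", "Triclosan"], fake_lookup))
# {'Nimbolide': 'CID-for-Nimbolide', 'Triclosan': 'CID-for-Triclosan'}
```

Each worker returns its own result instead of appending to a shared dictionary, which avoids rows getting interleaved across threads. Keeping `max_workers` small also helps stay within the request limits.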

### 7. `main()`
Handles user input, initiates scraping, and saves data.

---

## Running the Script
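Run the notebook cells in Google Colab. The script imports `google.colab.files`, so running it as a standalone file requires removing that import and the CSV-upload branch first:
```bash
python script.py
```
At the prompt, either upload a headerless CSV of plant names or enter the plant names as comma-separated values.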

---

## Output
A **CSV file (`natural_products.csv`)** containing extracted data with the following columns:
- **Plant**: Name of the input plant.
- **Compound**: Extracted compound name.
- **CID**: PubChem Compound ID.
- **MolecularFormula**: Chemical formula.
- **MolecularWeight**: Molecular weight.
- **IUPACName**: Standard chemical name.
- **CanonicalSMILES**: Machine-readable molecular structure.
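
Before analysis, it can help to count how many values per column are blank or `N/A`. The following is a minimal sketch using only the standard library; `count_missing` is a hypothetical helper, and the sample rows are invented placeholders rather than real script output.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable


def count_missing(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count blank or 'N/A' values per column."""
    missing: Counter = Counter()
    for row in rows:
        for column, value in row.items():
            if column is None:  # extra unnamed fields in a malformed row
                continue
            if value is None or value.strip() in ("", "N/A"):
                missing[column] += 1
    return missing


# Invented placeholder rows in the output format (not real results)
sample_csv = io.StringIO(
    "Plant,Compound,CID,MolecularFormula\n"
    "example plant,compound-a,702,C2H6O\n"
    "example plant,compound-b,N/A,N/A\n"
)
print(dict(count_missing(csv.DictReader(sample_csv))))
# {'CID': 1, 'MolecularFormula': 1}
```

For the real file, the same function can be fed `csv.DictReader(open("natural_products.csv", newline="", encoding="utf-8"))`.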

---

## Error Handling
- If a compound is **not found** on PubChem, `N/A` is recorded.
- **Exceptions** during scraping or API calls are logged with appropriate messages.
- The script **skips** plants with no identified compounds.

---

## Conclusion
This script is an efficient way to gather phytochemical data for research. Follow the best practices to maximize accuracy and performance.



# Instruction
The usage limits below for NCBI's PubChem API are the ones noted for this assignment; confirm them against PubChem's current usage policy before relying on them.
- **Unauthenticated requests**: maximum 5 requests per second and 10,000 requests per day.
- **Authenticated requests (with API key)**: maximum 10 requests per second and 100,000 requests per day.
- **Concurrent connections**: limited to 5 simultaneous connections per IP.

Do not exceed these limits; otherwise your IP may be blocked and your requests will return no results.
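
One simple way to stay under a per-second limit is to enforce a minimum gap between consecutive requests. The class below is a sketch under that assumption; `Throttle` is a hypothetical helper, it is meant for single-threaded use, and the loop only prints instead of calling PubChem.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls (single-threaded limiter)."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (self._clock() - self._last))
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay


throttle = Throttle(max_per_second=4)  # deliberately below 5 requests per second
for name in ["compound-a", "compound-b", "compound-c"]:
    throttle.wait()
    print("would request", name)  # a real call such as requests.get(...) would go here
```

The `clock` and `sleep` parameters exist only so the limiter can be tested without real waiting.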

# Plant/Herb Names from the Assignments

Aloe vera, Phyllanthus emblica, Murraya koenigii, Cinnamomum camphora, Cocos nucifera, Eclipta prostrata, Hibiscus rosa-sinensis, Lawsonia inermis, Azadirachta indica, Trigonella foenum-graecum, Salvia officinalis, Achyranthes aspera, Allium cepa, Vitis vinifera, Nardostachys jatamansi, Rosmarinus officinalis, Thymus vulgaris, Ocimum tenuiflorum, Allium sativum, Serenoa repens, Panax ginseng, Urtica dioica, Ricinus communis, Simmondsia chinensis, Arnica montana, Capsicum annuum, Nigella sativa, Acacia concinna, Moringa oleifera, Terminalia bellirica, Withania somnifera, Polygonum multiflorum, Angelica sinensis, Lycium barbarum, Ganoderma lucidum, Schisandra chinensis, Carthamus tinctorius, Cinnamomum verum, Angelica archangelica, Bacopa monnieri, Brassica juncea, Arctium lappa, Catharanthus roseus, Capsicum frutescens, Curcuma longa, Cymbopogon citratus, Cyperus rotundus, Equisetum arvense, Linum usitatissimum, Zingiber officinale, Glycyrrhiza glabra, Lavandula angustifolia, Citrus limon, Melaleuca alternifolia, Mentha piperita, Piper nigrum, Cucurbita pepo, Oryza sativa, Santalum album, Sapindus mukorossi, Scutellaria baicalensis, Sesamum indicum, Syzygium aromaticum, Thuja occidentalis

# More Plants

Aloe vera, Phyllanthus emblica, Murraya koenigii, Cinnamomum camphora, Cocos nucifera, Eclipta prostrata, Hibiscus rosa-sinensis, Lawsonia inermis, Azadirachta indica, Trigonella foenum-graecum, Salvia officinalis, Achyranthes aspera, Allium cepa, Vitis vinifera, Nardostachys jatamansi, Rosmarinus officinalis, Thymus vulgaris, Ocimum tenuiflorum, Allium sativum, Serenoa repens, Panax ginseng, Urtica dioica, Ricinus communis, Simmondsia chinensis, Arnica montana, Capsicum annuum, Nigella sativa, Acacia concinna, Moringa oleifera, Terminalia bellirica, Withania somnifera, Polygonum multiflorum, Angelica sinensis, Lycium barbarum, Ganoderma lucidum, Schisandra chinensis, Carthamus tinctorius, Cinnamomum verum, Angelica archangelica, Bacopa monnieri, Brassica juncea, Arctium lappa, Catharanthus roseus, Capsicum frutescens, Curcuma longa, Cymbopogon citratus, Cyperus rotundus, Equisetum arvense, Linum usitatissimum, Zingiber officinale, Glycyrrhiza glabra, Lavandula angustifolia, Citrus limon, Melaleuca alternifolia, Mentha piperita, Piper nigrum, Cucurbita pepo, Oryza sativa, Santalum album, Sapindus mukorossi, Scutellaria baicalensis, Sesamum indicum, Syzygium aromaticum, Thuja occidentalis, Achyranthes aspera, Bacopa monnieri, Ginkgo biloba, Camellia sinensis, Fucus vesiculosus, Polygonum cuspidatum, Foeniculum vulgare, Astragalus membranaceus, Apium graveolens, Celosia argentea, Angelica dahurica, Rheum palmatum, Paeonia lactiflora, Perilla frutescens, Prunus persica, Prunus mume.

# Some More Plants from Patanjali Keshkanti Oil Composition

## Common Name

Aloe Vera,Indian Gooseberry (Amla),Curry Leaves,Camphor,Coconut Oil,Bhringraj,Hibiscus,Henna,Neem,Fenugreek,Sage,Prickly Chaff Flower,Onion,Grape Seed Extract,Spikenard,Rosemary,Thyme,Holy Basil (Tulsi),Garlic,Saw Palmetto,Ginseng,Stinging Nettle,Castor Oil,Jojoba Oil,Arnica,Cayenne Pepper,Black Seed,Shikakai,Moringa,Baheda,Ashwagandha,Fo-Ti,Dong Quai,Goji Berry,Reishi Mushroom,Schisandra,Safflower,True Cinnamon,Angelica Root,Brahmi,Mustard,Burdock Root,Madagascar Periwinkle,Chili Pepper,Turmeric,Lemongrass,Nutgrass,Horsetail,Flaxseed,Ginger,Licorice Root,Lavender,Lemon,Tea Tree Oil,Peppermint,Black Pepper,Pumpkin Seed,Rice Extract,Sandalwood,Soapnut,Baikal Skullcap,Sesame,Clove,White Cedar,Ginkgo,Green Tea,Bladderwrack,Japanese Knotweed,Fennel,Astragalus,Celery,Cockscomb,Dahurian Angelica,Rhubarb,Chinese Peony,Perilla,Peach,Japanese Apricot,Bhringraj,Curry,Kalonji,Nimbu (Lemon),Coconut,Gurhal (Hibiscus),Ghrit Kumari (Aloe Vera),Vijaya (Cannabis),Manjistha,Amla (Indian Gooseberry),Malkangani,Vat Jata (Banyan Tree),Methi (Fenugreek),Henna,Shikakai,Brahmi,Castor,Rose hip,Neem,Mandookparni (Gotu Kola),Ratanjot,Charila (Stone Flower),Reetha (Soapnut),Jatamansi (Spikenard),Laal Chandan (Red Sandalwood),Devdaru (Himalayan Cedar),Cajuput

## Scientific Name

Aloe vera,Phyllanthus emblica,Murraya koenigii,Cinnamomum camphora,Cocos nucifera,Eclipta prostrata,Hibiscus rosa-sinensis,Lawsonia inermis,Azadirachta indica,Trigonella foenum-graecum,Salvia officinalis,Achyranthes aspera,Allium cepa,Vitis vinifera,Nardostachys jatamansi,Rosmarinus officinalis,Thymus vulgaris,Ocimum tenuiflorum,Allium sativum,Serenoa repens,Panax ginseng,Urtica dioica,Ricinus communis,Simmondsia chinensis,Arnica montana,Capsicum annuum,Nigella sativa,Acacia concinna,Moringa oleifera,Terminalia bellirica,Withania somnifera,Polygonum multiflorum,Angelica sinensis,Lycium barbarum,Ganoderma lucidum,Schisandra chinensis,Carthamus tinctorius,Cinnamomum verum,Angelica archangelica,Bacopa monnieri,Brassica juncea,Arctium lappa,Catharanthus roseus,Capsicum frutescens,Curcuma longa,Cymbopogon citratus,Cyperus rotundus,Equisetum arvense,Linum usitatissimum,Zingiber officinale,Glycyrrhiza glabra,Lavandula angustifolia,Citrus limon,Melaleuca alternifolia,Mentha piperita,Piper nigrum,Cucurbita pepo,Oryza sativa,Santalum album,Sapindus mukorossi,Scutellaria baicalensis,Sesamum indicum,Syzygium aromaticum,Thuja occidentalis,Ginkgo biloba,Camellia sinensis,Fucus vesiculosus,Polygonum cuspidatum,Foeniculum vulgare,Astragalus membranaceus,Apium graveolens,Celosia argentea,Angelica dahurica,Rheum palmatum,Paeonia lactiflora,Perilla frutescens,Prunus persica,Prunus mume,Eclipta alba,Murraya koenigii,Nigella sativa,Citrus limon,Cocos nucifera,Hibiscus rosa-sinensis,Aloe barbadensis,Cannabis sativa,Rubia cordifolia,Emblica officinalis,Celastrus paniculatus,Ficus benghalensis,Trigonella foenum-graecum,Lawsonia inermis,Acacia concinna,Bacopa monnieri,Ricinus communis,Rosa moschata,Azadirachta indica,Centella asiatica,Geranium wallichianum,Parmelia perlata,Sapindus trifoliatus,Nardostachys jatamansi,Pterocarpus santalinus,Cedrus deodara,Melaleuca leucadendron
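
The lists above repeat several species, so each repeated name would be scraped more than once. A minimal sketch of an order-preserving, case-insensitive de-duplication step follows; `unique_plant_names` is a hypothetical helper and is not part of the script.

```python
from typing import List


def unique_plant_names(raw: str) -> List[str]:
    """Split a comma-separated string and drop repeated names, keeping first-seen order."""
    seen = set()
    unique: List[str] = []
    for part in raw.split(","):
        # Trim whitespace, a trailing period, and repeated inner spaces
        name = " ".join(part.strip().rstrip(".").split())
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


print(unique_plant_names("Aloe vera, Bacopa monnieri,Aloe vera, Thuja occidentalis, bacopa monnieri."))
# ['Aloe vera', 'Bacopa monnieri', 'Thuja occidentalis']
```

This only removes exact repeats. Botanical synonyms such as *Eclipta alba* and *Eclipta prostrata* are different strings, so reconciling them would need a separate mapping.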

In [None]:
#Required Libraries
# !apt-get update
# !apt-get install -y chromium-chromedriver
!pip install selenium webdriver-manager


In [None]:
import os
import time
import pandas as pd
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from typing import List, Dict, Tuple
from google.colab import files  # Google Colab file upload

# 🚀 Setup Selenium WebDriver for headless browsing
def setup_driver():
    """🌐 Setup Selenium WebDriver for scraping."""
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920,1080")

    service = Service()
    return webdriver.Chrome(service=service, options=chrome_options)

# 🔎 Scrape compound names from the COCONUT database
def scrape_coconut_database(plant_name: str) -> List[str]:
    """Scrapes compound names from the COCONUT database."""
    search_url = f"https://coconut.naturalproducts.net/search?q={plant_name.replace(' ', '+')}"
    print(f"\n🔍 Searching compounds for {plant_name} in the 🥥 COCONUT Database...")

    driver = setup_driver()
    try:
        driver.get(search_url)
        time.sleep(2)  # Allow page to load
        compounds = [elem.text.strip() for elem in driver.find_elements(By.XPATH, "//h3[contains(@class, 'text-gray-900')]") if elem.text.strip()]
    except Exception as e:
        print(f"❌ Error extracting compounds for {plant_name}: {e}")
        compounds = []
    finally:
        driver.quit()  # Always release the browser, even if loading or scraping fails
    print(f"✅ Found {len(compounds)} compounds: {compounds}" if compounds else f"⚠️ No compounds found for {plant_name}.")
    return compounds

# ⚛️ Fetch molecular properties from PubChem
def fetch_pubchem_properties(compound_name: str) -> Tuple[str, str, str, str]:
    """Retrieves molecular properties from PubChem."""
    # Percent-encode the name so reserved characters stay inside one URL path segment
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{requests.utils.quote(compound_name, safe='')}/property/MolecularFormula,MolecularWeight,IUPACName,CanonicalSMILES/JSON"
    try:
        response = requests.get(url, timeout=30)  # Do not hang forever on a stalled connection
        response.raise_for_status()
        data = response.json()
        props = data.get("PropertyTable", {}).get("Properties", [{}])[0]
        return (
            props.get("MolecularFormula", "N/A"),
            str(props.get("MolecularWeight", "N/A")),
            props.get("IUPACName", "N/A"),
            props.get("CanonicalSMILES", "N/A"),
        )
    except (requests.RequestException, KeyError, IndexError, ValueError) as e:
        # IndexError covers an empty Properties list; ValueError covers a non-JSON body
        print(f"⚠️ Error retrieving PubChem data for {compound_name}: {e}")
        return "N/A", "N/A", "N/A", "N/A"

# 🆔 Fetch the CID (Compound ID) from PubChem
def fetch_pubchem_cid(compound_name: str) -> str:
    """Fetches the CID (Compound ID) from PubChem."""
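    # Percent-encode the name so reserved characters stay inside one URL path segment
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{requests.utils.quote(compound_name, safe='')}/cids/JSON"
    try:
        response = requests.get(url, timeout=30)  # Do not hang forever on a stalled connection
        response.raise_for_status()
        data = response.json()
        return str(data.get("IdentifierList", {}).get("CID", ["N/A"])[0])
    except (requests.RequestException, KeyError, IndexError, ValueError) as e:
        # IndexError covers an empty CID list; ValueError covers a non-JSON body
        print(f"⚠️ Error fetching CID for {compound_name}: {e}")
        return "N/A"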

# 💾 Save extracted data to a CSV file
def save_results_to_csv(data: Dict[str, List[str]], filename: str = "natural_products.csv") -> None:
    """Saves extracted compound data to a CSV file."""
    df = pd.DataFrame(data)
    try:
        df.to_csv(filename, index=False, encoding="utf-8")
        print(f"\n✅ Data successfully saved to `{filename}`!")
    except Exception as e:
        print(f"\n❌ Error saving file: {e}")

# 📂 File Upload & Manual Input Handling (ONLY CSV ACCEPTED)
def get_plant_names():
    """Allows users to upload a CSV file or manually enter plant names."""
    print("\n📢 INSTRUCTIONS:")
    print("✔️ Please upload a CSV file containing valid compound names.")
    print("✔️ The file should have no headers and only contain compound names separated by commas.")
    print("✔️ If you don’t have a file, you can enter plant names manually.")
    print("❌ Only CSV files are accepted!")

    choice = input("\n📌 Do you want to upload a CSV file? (yes/no): ").strip().lower()

    if choice == "yes":
        print("📤 Please upload your CSV file...")
        uploaded = files.upload()  # Opens file upload dialog

        if uploaded:
            filename = list(uploaded.keys())[0]  # Get the uploaded file name

            # 🚨 Reject non-CSV files
            if not filename.endswith(".csv"):
                print(f"\n❌ Error: `{filename}` is not a CSV file. Please upload a valid `.csv` file!")
                return get_plant_names()  # Prompt user again

            # ✅ Read CSV (without header)
            try:
                df = pd.read_csv(filename, header=None)
                # Flatten every cell so names may be one per row or comma-separated on a row
                cells = df.to_numpy().ravel()
                plants = [str(p).strip().lower() for p in cells if pd.notna(p) and str(p).strip()]
                print(f"✅ Successfully loaded `{len(plants)}` plant names from `{filename}`.")
                return plants
            except Exception as e:
                print(f"\n❌ Error reading `{filename}`: {e}. Please upload a valid CSV file.")
                return get_plant_names()  # Prompt user again
        else:
            print("⚠️ No file uploaded. Switching to manual input...")

    # 📌 Manual Input Fallback
    plants = input("🌿 Enter plant/herb names (comma-separated): ").split(",")
    return [p.strip().lower() for p in plants if p.strip()]

# 🔄 Process each compound sequentially
def process_compound(plant: str, compound: str, extracted_data: Dict[str, List[str]]):
    """Processes each compound and retrieves molecular data."""
    formula, weight, iupac, smiles = fetch_pubchem_properties(compound)
    extracted_data["Plant"].append(plant)
    extracted_data["Compound"].append(compound)
    extracted_data["CID"].append(fetch_pubchem_cid(compound))
    extracted_data["MolecularFormula"].append(formula)
    extracted_data["MolecularWeight"].append(weight)
    extracted_data["IUPACName"].append(iupac)
    extracted_data["CanonicalSMILES"].append(smiles)

# 🚀 Main Function
def main() -> None:
    """Main function to orchestrate data collection."""
    plant_names = get_plant_names()

    if not plant_names:
        print("⚠️ No plant names provided. Exiting...")
        return

    print(f"\n🌱 Searching for {len(plant_names)} plants...")

    # 📊 Initialize data storage
    extracted_data: Dict[str, List[str]] = {
        "Plant": [], "Compound": [], "CID": [], "MolecularFormula": [], "MolecularWeight": [], "IUPACName": [], "CanonicalSMILES": []
    }

    for plant in plant_names:
        compounds = scrape_coconut_database(plant)
        if not compounds:
            continue  # Skip if no compounds found

        for compound in compounds:
            print(f"🔄 Processing {compound} from {plant}...")
            process_compound(plant, compound, extracted_data)

    if extracted_data["Plant"]:
        save_results_to_csv(extracted_data)
    else:
        print("\n⚠️ No data extracted!")

if __name__ == "__main__":
    main()



📢 INSTRUCTIONS:
✔️ Please upload a CSV file containing valid compound names.
✔️ The file should have no headers and only contain compound names separated by commas.
✔️ If you don’t have a file, you can enter plant names manually.
❌ Only CSV files are accepted!

📌 Do you want to upload a CSV file? (yes/no): 
🌿 Enter plant/herb names (comma-separated): Aloe vera, Phyllanthus emblica, Murraya koenigii, Cinnamomum camphora, Cocos nucifera, Eclipta prostrata, Hibiscus rosa-sinensis, Lawsonia inermis, Azadirachta indica, Trigonella foenum-graecum, Salvia officinalis, Achyranthes aspera, Allium cepa, Vitis vinifera, Nardostachys jatamansi, Rosmarinus officinalis, Thymus vulgaris, Ocimum tenuiflorum, Allium sativum, Serenoa repens, Panax ginseng, Urtica dioica, Ricinus communis, Simmondsia chinensis, Arnica montana, Capsicum annuum, Nigella sativa, Acacia concinna, Moringa oleifera, Terminalia bellirica, Withania somnifera, Polygonum multiflorum, Angelica sinensis, Lycium barbarum, Ganoderma 

In [None]:
import os
import time
import pandas as pd
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from typing import List, Dict, Tuple
from google.colab import files  # Google Colab file upload

# 🚀 Setup Selenium WebDriver for headless browsing
def setup_driver():
    """🌐 Setup Selenium WebDriver for scraping."""
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920,1080")

    service = Service()
    return webdriver.Chrome(service=service, options=chrome_options)

# 🔎 Scrape compound names from the COCONUT database
def scrape_coconut_database(plant_name: str) -> List[str]:
    """Scrapes compound names from the COCONUT database."""
    search_url = f"https://coconut.naturalproducts.net/search?q={plant_name.replace(' ', '+')}"
    print(f"\n🔍 Searching compounds for {plant_name} in the 🥥 COCONUT Database...")

    driver = setup_driver()
    try:
        driver.get(search_url)
        time.sleep(2)  # Allow page to load
        compounds = [elem.text.strip() for elem in driver.find_elements(By.XPATH, "//h3[contains(@class, 'text-gray-900')]") if elem.text.strip()]
    except Exception as e:
        print(f"❌ Error extracting compounds for {plant_name}: {e}")
        compounds = []
    finally:
        driver.quit()  # Always release the browser, even if loading or scraping fails
    print(f"✅ Found {len(compounds)} compounds: {compounds}" if compounds else f"⚠️ No compounds found for {plant_name}.")
    return compounds

# ⚛️ Fetch molecular properties from PubChem
def fetch_pubchem_properties(compound_name: str) -> Tuple[str, str, str, str]:
    """Retrieves molecular properties from PubChem."""
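    # Percent-encode the name so reserved characters stay inside one URL path segment
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{requests.utils.quote(compound_name, safe='')}/property/MolecularFormula,MolecularWeight,IUPACName,CanonicalSMILES/JSON"
    time.sleep(1.5)  # Pause before each request to stay well under the rate limit
    try:
        response = requests.get(url, timeout=30)  # Do not hang forever on a stalled connection
        response.raise_for_status()
        data = response.json()
        props = data.get("PropertyTable", {}).get("Properties", [{}])[0]
        return (
            props.get("MolecularFormula", "N/A"),
            str(props.get("MolecularWeight", "N/A")),
            props.get("IUPACName", "N/A"),
            props.get("CanonicalSMILES", "N/A"),
        )
    except (requests.RequestException, KeyError, IndexError, ValueError) as e:
        # IndexError covers an empty Properties list; ValueError covers a non-JSON body
        print(f"⚠️ Error retrieving PubChem data for {compound_name}: {e}")
        return "N/A", "N/A", "N/A", "N/A"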

# 🆔 Fetch the CID (Compound ID) from PubChem
def fetch_pubchem_cid(compound_name: str) -> str:
    """Fetches the CID (Compound ID) from PubChem."""
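    # Percent-encode the name so reserved characters stay inside one URL path segment
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{requests.utils.quote(compound_name, safe='')}/cids/JSON"
    try:
        response = requests.get(url, timeout=30)  # Do not hang forever on a stalled connection
        response.raise_for_status()
        data = response.json()
        return str(data.get("IdentifierList", {}).get("CID", ["N/A"])[0])
    except (requests.RequestException, KeyError, IndexError, ValueError) as e:
        # IndexError covers an empty CID list; ValueError covers a non-JSON body
        print(f"⚠️ Error fetching CID for {compound_name}: {e}")
        return "N/A"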

# 💾 Save extracted data to a CSV file
def save_results_to_csv(data: Dict[str, List[str]], filename: str = "natural_products.csv") -> None:
    """Saves extracted compound data to a CSV file."""
    df = pd.DataFrame(data)
    try:
        df.to_csv(filename, index=False, encoding="utf-8")
        print(f"\n✅ Data successfully saved to `{filename}`!")
    except Exception as e:
        print(f"\n❌ Error saving file: {e}")

# 📂 File Upload & Manual Input Handling (ONLY CSV ACCEPTED)
def get_plant_names():
    """Allows users to upload a CSV file or manually enter plant names."""
    print("\n📢 INSTRUCTIONS:")
    print("✔️ Please upload a CSV file containing valid compound names.")
    print("✔️ The file should have no headers and only contain compound names separated by commas.")
    print("✔️ If you don’t have a file, you can enter plant names manually.")
    print("❌ Only CSV files are accepted!")

    choice = input("\n📌 Do you want to upload a CSV file? (yes/no): ").strip().lower()

    if choice == "yes":
        print("📤 Please upload your CSV file...")
        uploaded = files.upload()  # Opens file upload dialog

        if uploaded:
            filename = list(uploaded.keys())[0]  # Get the uploaded file name

            # 🚨 Reject non-CSV files
            if not filename.endswith(".csv"):
                print(f"\n❌ Error: `{filename}` is not a CSV file. Please upload a valid `.csv` file!")
                return get_plant_names()  # Prompt user again

            # ✅ Read CSV (without header)
            try:
                df = pd.read_csv(filename, header=None)
                # Flatten every cell so names may be one per row or comma-separated on a row
                cells = df.to_numpy().ravel()
                plants = [str(p).strip().lower() for p in cells if pd.notna(p) and str(p).strip()]
                print(f"✅ Successfully loaded `{len(plants)}` plant names from `{filename}`.")
                return plants
            except Exception as e:
                print(f"\n❌ Error reading `{filename}`: {e}. Please upload a valid CSV file.")
                return get_plant_names()  # Prompt user again
        else:
            print("⚠️ No file uploaded. Switching to manual input...")

    # 📌 Manual Input Fallback
    plants = input("🌿 Enter plant/herb names (comma-separated): ").split(",")
    return [p.strip().lower() for p in plants if p.strip()]

# 🔄 Process each compound sequentially
def process_compound(plant: str, compound: str, extracted_data: Dict[str, List[str]]):
    """Processes each compound and retrieves molecular data."""
    formula, weight, iupac, smiles = fetch_pubchem_properties(compound)
    extracted_data["Plant"].append(plant)
    extracted_data["Compound"].append(compound)
    extracted_data["CID"].append(fetch_pubchem_cid(compound))
    extracted_data["MolecularFormula"].append(formula)
    extracted_data["MolecularWeight"].append(weight)
    extracted_data["IUPACName"].append(iupac)
    extracted_data["CanonicalSMILES"].append(smiles)

# 🚀 Main Function
def main() -> None:
    """Main function to orchestrate data collection."""
    plant_names = get_plant_names()

    if not plant_names:
        print("⚠️ No plant names provided. Exiting...")
        return

    print(f"\n🌱 Searching for {len(plant_names)} plants...")

    # 📊 Initialize data storage
    extracted_data: Dict[str, List[str]] = {
        "Plant": [], "Compound": [], "CID": [], "MolecularFormula": [], "MolecularWeight": [], "IUPACName": [], "CanonicalSMILES": []
    }

    for plant in plant_names:
        compounds = scrape_coconut_database(plant)

        # 🚨 Skip processing if the only result is "No results"
        if not compounds or compounds == ["No results"]:
            print(f"⚠️ Skipping PubChem search for {plant} as no valid compounds were found.")
            continue  # Skip this plant

        for compound in compounds:
            print(f"🔄 Processing {compound} from {plant}...")
            process_compound(plant, compound, extracted_data)

    if extracted_data["Plant"]:
        save_results_to_csv(extracted_data)
    else:
        print("\n⚠️ No data extracted!")

if __name__ == "__main__":
    main()



📢 INSTRUCTIONS:
✔️ Please upload a CSV file containing valid compound names.
✔️ The file should have no headers and only contain compound names separated by commas.
✔️ If you don’t have a file, you can enter plant names manually.
❌ Only CSV files are accepted!

📌 Do you want to upload a CSV file? (yes/no): no
🌿 Enter plant/herb names (comma-separated): Aloe vera,Phyllanthus emblica,Murraya koenigii,Cinnamomum camphora,Cocos nucifera,Eclipta prostrata,Hibiscus rosa-sinensis,Lawsonia inermis,Azadirachta indica,Trigonella foenum-graecum,Salvia officinalis,Achyranthes aspera,Allium cepa,Vitis vinifera,Nardostachys jatamansi,Rosmarinus officinalis,Thymus vulgaris,Ocimum tenuiflorum,Allium sativum,Serenoa repens,Panax ginseng,Urtica dioica,Ricinus communis,Simmondsia chinensis,Arnica montana,Capsicum annuum,Nigella sativa,Acacia concinna,Moringa oleifera,Terminalia bellirica,Withania somnifera,Polygonum multiflorum,Angelica sinensis,Lycium barbarum,Ganoderma lucidum,Schisandra chinensis,Car

In [None]:
import pandas as pd

# File paths
file_paths = [
    "/content/natural_products.csv",
    "/content/natural_products(1).csv",
    "/content/natural_products(2).csv",
    "/content/natural_products(3).csv",
]

# Load and concatenate all CSV files
dfs = [pd.read_csv(file) for file in file_paths]
combined_df = pd.concat(dfs, ignore_index=True)

# Drop duplicates based on "Compound" column (assuming it defines uniqueness)
unique_df = combined_df.drop_duplicates(subset=["Compound"])

# Save the unique results to a new file
unique_file_path = "/content/unique_compounds.csv"
unique_df.to_csv(unique_file_path, index=False)

# Display number of unique records
len(unique_df), unique_file_path

(282, '/content/unique_compounds.csv')

In [None]:
# Select only the "Compound" and "CanonicalSMILES" columns
selected_columns = unique_df[["Compound", "CanonicalSMILES"]]

# Display the first few rows
selected_columns.head()


Unnamed: 0,Compound,CanonicalSMILES
0,"4-Chloro-3,5-Dimethylphenol",CC1=CC(=CC(=C1Cl)C)O
1,Isopropyl Alcohol,CC(C)O
2,Benzethonium Chloride,CC(C)(C)CC(C)(C)C1=CC=C(C=C1)OCCOCC[N+](C)(C)C...
3,Triclosan,C1=CC(=C(C=C1Cl)O)OC2=C(C=C(C=C2)Cl)Cl
4,Nimbolide,CC1=C2C(CC1C3=COC=C3)OC4C2(C(C5(C6C4OC(=O)C6(C...


In [None]:
# Load the newly uploaded file
file_path = "/content/Alopecia Natural Product Compounds updated - uniqueresult.csv"
df = pd.read_csv(file_path)

# Remove duplicate rows based on all columns to keep only unique records
unique_df = df.drop_duplicates()

# Save the unique dataset
unique_file_path = "/content/unique_results.csv"
unique_df.to_csv(unique_file_path, index=False)

# Display the number of unique records and the output file path
len(unique_df), unique_file_path


(380, '/content/unique_results.csv')

In [None]:
# Drop all rows that contain any blank (NaN) values
cleaned_df = unique_df.dropna()

# Save the cleaned dataset without blanks
cleaned_file_path = "/content/unique_results.csv"
cleaned_df.to_csv(cleaned_file_path, index=False)

# Display the number of records after removing blanks
len(cleaned_df), cleaned_file_path


(328, '/content/unique_results.csv')

In [None]:
import pandas as pd

# Load the CSV file from Google Colab (Uncomment the below lines if using Colab)
# from google.colab import files
# uploaded = files.upload()
# file_name = list(uploaded.keys())[0]  # Get the uploaded file name

file_path = "/content/alldataa.csv"  # Update this path for Google Colab

# Read the CSV file with error handling
df = pd.read_csv(file_path, on_bad_lines='skip')

# Check column names to ensure correct selection
print("Columns:", df.columns)

# Ensure column names are correct (Adjust if needed)
ligand_col = "ligand"  # Change if the ligand column has a different name
gscore_col = "GScore"  # Change if the GScore column has a different name

# Convert GScore column to numeric in case of formatting issues
df[gscore_col] = pd.to_numeric(df[gscore_col], errors='coerce')

# Filter ligands where GScore < -11 (since scores are negative)
filtered_df = df[df[gscore_col] < -11]

# Keep only unique (ligand, GScore) pairs; a ligand with several scores still appears more than once
unique_ligands = filtered_df[[ligand_col, gscore_col]].drop_duplicates()

# Display the results
print(unique_ligands)

# Save filtered results to a new CSV file
unique_ligands.to_csv("filtered_ligands.csv", index=False)




Columns: Index(['ligand', 'GScore', 'DockScore', 'LipophilicEvdW', 'PhobEn', 'PhobEnHB',
       'PhobEnPairHB', 'HBond', 'Electro', 'Sitemap', 'PiCat', 'ClBr', 'LowMW',
       'Penalties', 'HBPenal', 'ExposPenal', 'RotPenal', 'EpikStatePenalty',
       'Zpotr', 'Similarity', 'Activity'],
      dtype='object')
                                ligand  GScore
0                         Angoroside C  -16.11
1                   Gingerglycolipid A  -16.03
2                           Perillanin  -16.01
3                       Angoroside C-2  -15.97
4     Secoisolariciresinol Diglucoside  -15.76
...                                ...     ...
999                     Isobiflorin-30  -11.07
1008                    Isobiflorin-31  -11.05
1043                    Thymectacin-25  -11.06
1046                   Cinnamtannin B2  -11.19
1060                     Perillanin-32  -11.06

[539 rows x 2 columns]
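
Because `drop_duplicates` on both columns keeps every distinct (ligand, GScore) pair, the 539 rows may include the same ligand name more than once. If one row per ligand is wanted, keeping the lowest (most negative) score is one option. The sketch below uses plain Python with invented placeholder rows; `best_score_per_ligand` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple


def best_score_per_ligand(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Keep the lowest GScore seen for each ligand name."""
    best: Dict[str, float] = {}
    for ligand, score in rows:
        if ligand not in best or score < best[ligand]:
            best[ligand] = score
    return best


# Invented placeholder rows (not taken from the docking results)
rows = [("ligand-a", -11.2), ("ligand-b", -15.9), ("ligand-a", -12.4)]
print(best_score_per_ligand(rows))
# {'ligand-a': -12.4, 'ligand-b': -15.9}
```

With pandas, the equivalent would be `filtered_df.groupby(ligand_col)[gscore_col].min()`. Names that differ only by a numeric suffix are treated as different ligands here.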
