<a href="https://colab.research.google.com/github/mack-h/hello-world/blob/master/antibody_finder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automating peptide/immunogen analyses for antibody eligibility using **Selenium**
by mack hepker

June 2024


## Introduction

Selecting an antibody for your immunohistochemistry (IHC) or immunofluorescence (IF) study can be overwhelming -- especially when considering all of the factors that play into an antibody's affinity for binding, including variable chemical properties of the tissue sample, the protein of interest, and your protocol reagents.

These factors may be overlooked when selecting an antibody, especially when trying to account for the diversity of antibodies on the market for an individual protein of interest, each with their own properties to toss into the ring. This likely contributes to the reputation of IHC/IF for being fickle, especially in longer-fix tissue (e.g. human, non-human primate, and other species of intrigue outside the realm of highly controlled lab animal models). Alternatively, it may feel prudent to avoid using a given antibody unless it has been previously validated in similar tissue, either through word of mouth or publication.

Antibodies are often validated by the vendor with basic and clinical research in mind, wherein laboratory animal tissue is quickly processed after death. Typically, there is little further testing or acknowledgment of an antibody's binding efficacy in tissue put through variable post-mortem processing conditions, which can (and very much do) impact **1)** the structure of the protein of interest and **2)** the accessibility of the protein of interest, both of which are inherent to the antibody's ability to bind to it.

Most antibodies are made against human protein sequences but are intended for use in rapid-fix rodent tissue, which undergoes minimal processing that would distort its molecular composition. The immunogen / peptide sequence is therefore generally chosen to be homologous between human and the animal model (e.g. mouse).

Often, the immunogen is a *subset* of the peptide of interest, either because that subset is homologous between species (whereas other sections of the peptide are not); because that particular domain is more amenable to antibody binding (e.g. extracellular, and therefore exposed); or because it is more hydrophilic in its amino acid composition (turning it outward towards the antibody solution, where it can be accessed). Immunogens corresponding to the entirety of a peptide sequence may be less IHC-friendly -- smaller peptides better permeate cell membranes and wayward fixative cross-links.


**It is important, and relatively simple, to take steps to best ensure that an antibody will similarly work with your method and range of samples.**

* This includes ensuring the immunogen -- derived from human, mouse, or rat protein sequences -- is homologous across all of your species of interest. It also tells you whether a marker can be used to evaluate protein expression in *some*, but not other, species of interest: low (< 65%) homology inhibits or precludes the antibody's ability to bind the protein in that species, whether or not the protein is present.

* This also includes obtaining information about the cellular location of the peptide immunogen. Nuclear and intracellular domains may benefit more from increased membrane-permeabilization measures in your protocol; transmembrane receptors may benefit from less.

* Finally, this includes obtaining essential properties of the immunogen peptide sequence, such as hydrophobicity/hydrophilicity ratio, isoelectric point, and net charge at pH 7.0.

Here, we automate these steps using Selenium, a popular browser-automation and web-scraping tool available as a Python package. Selenium can be run from the command line, a Google Colab notebook (such as this one) or Jupyter notebook, a Python IDE, ChatGPT, or any other interface where Python is installed and accessible. We will take a peptide sequence (amino acids) and the desired parameters as input, and append the results to a document for your reference when making your purchasing and protocol decisions. These steps may also be conducted manually, or separately, given a peptide sequence.

## Outline

1) Obtain immunogen sequence.

2) Check for % homology across species of interest (e.g. using NCBI Protein BLAST).
* Greater than 75% homology with the target protein is ideal to avoid false negative staining in your species of interest. Less than 50% homology with any *other* protein is needed to rule out cross-reactivity (this is especially important to look out for, as your peptide sequence may be somewhat similar to pieces of other proteins, resulting in false positive staining).

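3) Obtain properties of the immunogen peptide sequence (e.g. hydrophobicity/hydrophilicity ratio, isoelectric point, and net charge at pH 7.0).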
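
To make the homology thresholds in step 2 concrete, here is a minimal sketch of how percent identity could be computed for two already-aligned sequences and read against those cutoffs. The function names `percent_identity` and `interpret_homology` are hypothetical, and gap columns are simply skipped, which is a simplification; the percent identity reported by BLAST itself should be treated as the reference value.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have the same length.")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns (simplifying assumption)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def interpret_homology(pct, on_target=True):
    """Read a percent identity against the cutoffs given in the outline."""
    if on_target:
        if pct > 75:
            return "ideal (> 75%): false negative staining is less likely"
        return "below the 75% ideal: risk of false negative staining"
    if pct < 50:
        return "low (< 50%): cross-reactivity is unlikely"
    return "50% or higher: possible cross-reactivity (false positive staining)"


# Toy aligned sequences, made up for illustration only
pct = percent_identity("MARRAGGARM", "MARRSGGARM")
print(pct, "->", interpret_homology(pct))
```

The `on_target` flag separates the two questions in step 2: whether the antibody should bind the protein in your species, and whether it might also bind an unrelated protein.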




>In progress: scraping immunogen sequences directly from commercial antibody webpages.


Let's say we're interested in this Neuromedin-B (NMB) antibody from ProteinTech, who publish their full immunogen sequences:

https://www.ptglab.com/products/NMB-Fusion-Protein-ag0842.htm

**Immunogen sequence:**
MARRAGGARMFGSLLLFALLAAGVAPLSWDLPEPRSRASKIRVHSRGNLWATGHFMGKKSLEPSSPSPLGTATHTSLRDQRLQLSHDLLGILLLKKALGVSLSRPAPQIQEAAGTNTAEMTPIMGQTQQRGLDCAHPGKVLNGTLLMAPSGCKS


Install Selenium

In [None]:
!apt-get update # Standard procedure
#!apt-get install -y chromium-chromedriver # Connects Selenium to Google Chrome browser
#!apt-get install -y xvfb # Dependency for virtual display (necessary for Colab environment)
!pip install selenium # Python webscraping package used here (alternative: beautifulsoup)
!pip install biopython # called as Bio in webscraping block, accesses Entrez protein database

# Install virtual display (for Colab, as there is no built-in display)
# !pip install pyvirtualdisplay

Get:1 https://dl.google.com/linux/chrome/deb stable InRelease [1,825 B]
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Get:7 https://dl.google.com/linux/chrome/deb

In [None]:
%pip install -q google-colab-selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
import re
from Bio import Entrez
from google.colab import files
import io
import time
import requests

In [None]:
import google_colab_selenium as gs

In [None]:
driver = gs.Chrome()


Run the test below to confirm that the driver is working: it should load https://www.google.com and print the page title, `Google`.

In [None]:
driver.get('https://www.google.com')
print(driver.title)
driver.quit()

Google


*Get Chromium and Chromedriver (necessary for navigating the web through Selenium, and a common source of errors, discrepancies, and incompatibilities). Because Colab does not support installing packages with **snap**, this must be done manually, as below.*

*Run this code if you need to install new chromedrivers, to make sure the
defunct files are cleaned out and uncallable.* **This will not be necessary if google-colab-selenium is working.**

*Install compatible versions of Chrome and Chromedriver to run together and in Colab. This may change with time.*

In [None]:
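# Below checks the expected Chrome path and, if it is missing, searches PATH for another install. We found one!

import os
import shutil

# Check Google Chrome path: expected location first, then anything named google-chrome on PATH
expected_path = "/opt/google/chrome/google-chrome"
chrome_path = expected_path if os.path.exists(expected_path) else shutil.which("google-chrome")
if chrome_path:
    print(f"Google Chrome path: {chrome_path}")
else:
    print("Google Chrome not found in the expected locations.")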

Google Chrome path: /opt/google/chrome/google-chrome


(Deprecated block) Compatible Chromedriver installation

Further configure the ChromeDriver for our setup.

WORKING PARTS BELOW

In [None]:
"""
# Import libraries and set up virtual display
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
from google.colab import files
from Bio import Entrez
!apt-get install -y wget unzip xvfb

# Set up virtual display
display = Display(visible=0, size=(1024, 768))
display.start()
"""

<pyvirtualdisplay.display.Display at 0x7da11f485b70>

In [None]:
# Configure Selenium to use the headless Chrome browser:

#options = webdriver.ChromeOptions()
#options.add_argument('--headless')
#options.add_argument('--no-sandbox')
#options.add_argument('--disable-dev-shm-usage')
#options.add_argument('--disable-gpu')
#options.add_argument('--window-size=1024x768')
#options.add_argument('--disable-infobars')
#options.add_argument('--disable-extensions')
#options.add_argument('--remote-debugging-port=9222')

#driver = webdriver.Chrome(options=options)



Use NCBI BLAST to check for percent homology across species of interest

From the main BLAST webpage, click 'Protein BLAST'.

The Protein BLAST page defaults to the 'blastp' tab, used for protein queries -- this is what we always want for this purpose.
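
For comparison, a blastp search can also be submitted without a browser through Biopython's `NCBIWWW.qblast`. The following is only a sketch of that alternative, not the Selenium workflow used in this notebook: the helper names `build_entrez_query` and `run_blastp` are hypothetical, the `nr` database is an assumed default, and it presumes a Biopython version that provides `Bio.Blast.NCBIWWW`. The search itself contacts NCBI and can take several minutes, so only the query-building helper is exercised here.

```python
def build_entrez_query(taxon_ids):
    """Build an Entrez organism filter from NCBI taxon IDs."""
    return " OR ".join(f"txid{taxon_id}[ORGN]" for taxon_id in taxon_ids)


def run_blastp(sequence, taxon_ids=None, database="nr"):
    """Submit a blastp search to NCBI and return a handle to the XML results."""
    # Imported lazily so build_entrez_query can be used without Biopython installed
    from Bio.Blast import NCBIWWW

    kwargs = {}
    if taxon_ids:
        # Restrict the search to the species of interest
        kwargs["entrez_query"] = build_entrez_query(taxon_ids)
    return NCBIWWW.qblast("blastp", database, sequence, **kwargs)


# Taxon IDs for human, chimpanzee, and bonobo, as returned by Entrez in this notebook
print(build_entrez_query(["9606", "9598", "9597"]))
# txid9606[ORGN] OR txid9598[ORGN] OR txid9597[ORGN]
```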

# Using Selenium to batch this process for multiple immunogen sequences

**Note:** Be sure you have run the update, selenium, and driver installations in the code box above, or this won't work. If there is a green checkmark to the left of the code boxes, you're good!



Combo of ChatGPT and forum hints for installing the chrome driver.

Get updated Chromium packages that work with Colab. Also, maybe this would all work in Google chrome?

From here: https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com

Below: ChatGPT doing everything.

Configure your Entrez email (so that NCBI can contact you if needed). This is a necessary step.

In [None]:
# Configure Entrez email
Entrez.email = "mack.hepker@gmail.com" ## YOUR EMAIL HERE

Upload your text file. It should have the following template.

---

SEQUENCES:


SPECIES:

---

**Example:**

SEQUENCES:

NMB [1-60]

SPECIES:

Human

Chimpanzee

Bonobo

**Run the code below to be prompted to upload your file.**



Preloaded data (fixtures), test folder, etc.

In [None]:
# Create the fixture folders relative to the current working directory (/content in Colab)
!mkdir -p test/test_data

In [None]:
# Preloaded data
test_filepath = 'test/test_data/selenium blast template 2.txt'


In [None]:
import requests

def download_file(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

def get_taxon_id(species_name):
    # Use Entrez ESearch to search for the taxon ID
    search_handle = Entrez.esearch(db="taxonomy", term=species_name)
    search_results = Entrez.read(search_handle)
    search_handle.close()
    if search_results["IdList"]:
        return search_results["IdList"][0]
    else:
        return None

def read_and_parse_file():
    # Upload the file
    uploaded = files.upload()
    filename = list(uploaded.keys())[0]

    # Read the file
    with open(filename, 'r') as file:
        content = file.read()

    # Remove text between triple backticks (```)
    content = re.sub(r'```.*?```', '', content, flags=re.DOTALL)

    # Parse the file
    sections = content.split("SPECIES:")
    sequences_section = sections[0].split("SEQUENCES:")[1].strip()
    species_section = sections[1].strip()

    # Extract sequences and their ranges, ignoring numbering
    sequences = []
    for line in sequences_section.split("\n"):
        print(f'Parsed line: {line}')
        line = re.sub(r'^\d+\.\s*', '', line)  # Remove numbering
        if line.strip():  # Check if the line is not empty
            match = re.match(r'(.+?)\s*\[(\d+)-(\d+)\]', line)
            if match:
                protein = match.group(1).strip()
                start = int(match.group(2))
                end = int(match.group(3))
                sequences.append({"protein": protein, "start": start, "end": end})
            else:
                sequences.append({"protein": line.strip(), "start": None, "end": None})

    # Debugging: Print the sequences list
    print(f"Sequences extracted: {sequences}")

    # Extract species, ignoring numbering and setting to None if no valid species
    species = re.split(r',|\n', species_section)
    species = [re.sub(r'^\d+\.\s*', '', s.strip()) for s in species if s.strip()]

    # Filter out empty species entries and get taxon IDs
    taxon_ids = [get_taxon_id(s) for s in species]
    taxon_ids = [t for t in taxon_ids if t is not None]

    # Set species to None if the list is empty
    if not taxon_ids:
        taxon_ids = None

    # Print received taxon IDs
    if taxon_ids is None:
        print('No valid species provided.')
    else:
        for taxon_id in taxon_ids:
            print(f'Received taxon ID: {taxon_id}')

    return sequences, taxon_ids
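
The parsing step can also be tried without the Colab upload prompt or any Entrez lookup. Below is a minimal sketch that applies the same regular expressions as `read_and_parse_file` to a plain string and returns species names instead of taxon IDs. The function name `parse_template_text` is hypothetical and is not used elsewhere in this notebook.

```python
import re


def parse_template_text(content):
    """Parse the SEQUENCES/SPECIES template from a string."""
    if "SEQUENCES:" not in content or "SPECIES:" not in content:
        raise ValueError("Template must contain 'SEQUENCES:' and 'SPECIES:' headers.")

    before_species, _, species_section = content.partition("SPECIES:")
    sequences_section = before_species.split("SEQUENCES:", 1)[1]

    sequences = []
    for line in sequences_section.splitlines():
        line = re.sub(r'^\d+\.\s*', '', line.strip())  # remove list numbering
        if not line:
            continue
        match = re.match(r'(.+?)\s*\[(\d+)-(\d+)\]', line)
        if match:
            sequences.append({
                "protein": match.group(1).strip(),
                "start": int(match.group(2)),
                "end": int(match.group(3)),
            })
        else:
            sequences.append({"protein": line, "start": None, "end": None})

    species = [
        re.sub(r'^\d+\.\s*', '', s.strip())
        for s in re.split(r',|\n', species_section)
        if s.strip()
    ]
    return sequences, species


example = """SEQUENCES:

NMB [1-60]

SPECIES:

Human

Chimpanzee

Bonobo
"""
print(parse_template_text(example))
```

Keeping the parser separate from the upload and the taxonomy lookup makes it easier to check a template before any network calls are made.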

Get the protein ID from Entrez based on the full name, gene name, GenBank accession number, UniProt ID, partial sequence, or full sequence, with a range optionally specified.

In [None]:
from Bio import Entrez, SeqIO
import re
import io

# Configure Entrez email (replace with your own email)
Entrez.email = "your_email@example.com"

# Function to look up protein ID and fetch FASTA sequence with debugging and detailed results handling
def get_protein_id(protein, species):
    if not protein or not species:
        return None

    search_term = f"{protein} AND {species}[Organism]"
    print(f"Searching for protein: {protein} in species: {species}")

    # Perform the search in the Entrez protein database
    search_handle = Entrez.esearch(db="protein", term=search_term)
    search_results = Entrez.read(search_handle)
    search_handle.close()

    print(f"Search results: {search_results}")

    # Check if any IDs were found
    if search_results["IdList"]:
        protein_id = search_results["IdList"][0]

        # Fetch detailed information about the first result to get the accession number and FASTA sequence
        fetch_handle = Entrez.efetch(db="protein", id=protein_id, rettype="fasta", retmode="text")
        fetch_results = fetch_handle.read()
        fetch_handle.close()

        print(f"Fetch results: {fetch_results}")

        # Extract the accession number and FASTA sequence
        fasta_io = io.StringIO(fetch_results)
        fasta_record = SeqIO.read(fasta_io, "fasta")
        accession = fasta_record.id
        fasta_sequence = str(fasta_record.seq)

        # Report what was found before returning (these prints were previously unreachable)
        print(f'Accession: {accession}')
        print(f'FASTA Sequence: {fasta_sequence}')

        return {
            "protein_id": accession,
            "fasta": fetch_results,
            "fasta_sequence": fasta_sequence
        }

    else:
        return None

# Function to determine if the input is a peptide sequence
def is_peptide_sequence(sequence):
    valid_amino_acids = set("ARNDCEQGHILKMFPSTWYV")
    return len(sequence) > 5 and all(char in valid_amino_acids for char in sequence) and sequence.isalpha()

# Process sequences
# print('Sequences to analyze:')
# for seq in sequences:
#    protein = seq["protein"]
#    if protein and is_peptide_sequence(protein):  # If it is a peptide sequence
#        seq["protein_id"] = protein
#        print(f'Received {seq["protein"]}, range {seq["start"]}-{seq["end"]}')
#    else:
#        protein_id = get_protein_id(protein)
#        seq["protein_id"] = protein_id if protein_id else protein


# Process sequences
def process_sequences(sequences, species_of_interest):
    print('Sequences to analyze:')
    for seq in sequences:
        protein = seq["protein"]
        print(f"Processing sequence: {protein}")
        if protein and is_peptide_sequence(protein):  # If it is a peptide sequence
            seq["protein_id"] = protein
            seq["fasta"] = f">{protein}\n{protein}"
            seq["fasta_sequence"] = protein
            print(f'Using raw sequence as query, Range {seq["start"]}-{seq["end"]}')
        else:
            protein_info = get_protein_id(protein, species_of_interest)
            if protein_info:
                seq["protein_id"] = protein_info["protein_id"]
                seq["fasta"] = protein_info["fasta"]
                seq["fasta_sequence"] = protein_info["fasta_sequence"]
                print(f'{protein} reconfigured to {seq["protein_id"]} based on Entrez results')
            else:
                seq["protein_id"] = protein
                seq["fasta"] = None
                seq["fasta_sequence"] = None
"""
def process_sequences(sequences):
    print('Sequences to analyze:')
    for seq in sequences:
        protein = seq["protein"]
        print(f"Processing sequence: {protein}")
        if protein and is_peptide_sequence(protein):  # If it is a peptide sequence
            seq["protein_id"] = protein
            print(f'Using raw sequence as query, Range {seq["start"]}-{seq["end"]}')
        else:
            protein_id = get_protein_id(protein)
            seq["protein_id"] = protein_id if protein_id else protein
        start = seq["start"]
        end = seq["end"]
        print(f'{protein} reconfigured to {seq["protein_id"]} based on Entrez results')
"""

# Test the function with "NMB"
# protein_name = "NMB"
# protein_id = get_protein_id(protein_name)
# print(f"Protein ID for {protein_name}: {protein_id}")



In [None]:
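# Function to format BLAST input
def format_blast_input(sequences):
    blast_input = ""
    for seq in sequences:
        fasta = seq.get("fasta")
        if fasta:
            # Full FASTA record (header line plus sequence)
            blast_input += f"{fasta}\n"
        else:
            # No sequence available: pass the bare identifier (e.g. an accession number),
            # because a ">" header line with no sequence under it is not a valid FASTA record
            blast_input += f"{seq['protein_id']}\n"
    return blast_input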

#blast_input = format_blast_input(sequences)
#print("Formatted BLAST Input:")
#print(blast_input)
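
When the query is a raw peptide sequence, the `[start-end]` range from the template can also be applied directly in Python before the query is built. The following is a minimal sketch, assuming the range is 1-based and inclusive; `slice_immunogen` and `to_fasta` are hypothetical helpers and are not called by the pipeline in this notebook.

```python
def slice_immunogen(sequence, start=None, end=None):
    """Return residues start..end (1-based, inclusive); the full sequence if no range is given."""
    if start is None or end is None:
        return sequence
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(f"Range {start}-{end} is outside the sequence (length {len(sequence)}).")
    return sequence[start - 1:end]


def to_fasta(name, sequence, width=60):
    """Format a FASTA record, wrapping the sequence at `width` residues per line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# ProteinTech NMB immunogen sequence quoted earlier in this notebook
nmb = ("MARRAGGARMFGSLLLFALLAAGVAPLSWDLPEPRSRASKIRVHSRGNLWATGHFMGKKS"
       "LEPSSPSPLGTATHTSLRDQRLQLSHDLLGILLLKKALGVSLSRPAPQIQEAAGTNTAEM"
       "TPIMGQTQQRGLDCAHPGKVLNGTLLMAPSGCKS")

# Equivalent to the template entry "NMB [1-60]"
print(to_fasta("NMB_1-60", slice_immunogen(nmb, 1, 60)))
```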

Run BLAST and download the results

In [None]:
def upload_and_download_blast(blast_input, sequences, species, download_dir='/content'):
    # Create download directory if it doesn't exist
    os.makedirs(download_dir, exist_ok=True)
    print(f"Download directory: {download_dir}")

    # Prompt user for download format
    download_format = input("Enter download format ('Text' or 'CSV'): ").strip().lower()
    if download_format not in ["text", "csv"]:
        print("Invalid format selected. Defaulting to 'Text'.")
        download_format = "text"

    # Prompt user for filename
    filename = input("Enter the desired filename (without extension): ").strip()

    try:
        print("Initializing Chromedriver")
        driver.get('https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome')
        print("Opened NCBI BLAST page.")

        # Wait for the text area to be present
        textarea = WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'textarea#seq'))
        )
        textarea.clear()
        textarea.send_keys(blast_input)
        print("Input BLAST query.")

        if species:
            print(f"Total species to add: {len(species)}")
            # Click 'Add organism' button for each species beyond the first one
            for i in range(len(species) - 1):
                try:
                    add_button = WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.XPATH, '//input[@id="addOrg"]'))
                    )
                    add_button.click()
                    print(f"Clicked 'Add organism' button {i + 1} times.")
                except Exception as e:
                    print(f"Error clicking 'Add organism' button: {e}")

            # Add each species to the respective input fields

            species_added = False
            for i, taxon_id in enumerate(species):
                print(f"Adding organism with taxon ID {taxon_id}")

                try:
                    organism_input_id = f'qorganism{i}' if i > 0 else 'qorganism'
                    print(f"Attempting to add organism with taxon ID {taxon_id} to input field {organism_input_id}")
                    organism_input = WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR, f'input#{organism_input_id}'))
                    )
                    organism_input.clear()
                    organism_input.send_keys(taxon_id)
                    print(f"Typed taxon ID {taxon_id} into input field {organism_input_id}")

                    # Wait for the dropdown to update
                    print("Waiting for dropdown to update...")
                    dropdown_option = WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.XPATH, f'//li[contains(@valueid, "{taxon_id}")]'))
                    )
                    dropdown_option.click()
                    print(f"Selected dropdown option for taxon ID {taxon_id}")

                    species_added = True
                except Exception as e:
                    print(f"Error adding organism with taxon ID {taxon_id}: {e}")

            if not species_added:
                print("No species were successfully added. Proceeding without species restriction.")
        else:
            print("No species provided. Proceeding without species restriction.")

        # Set query subranges
        for i, seq in enumerate(sequences):
            if seq["start"] is not None and seq["end"] is not None:
                try:
                    query_from = WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.NAME, f'QUERY_FROM'))
                    )
                    query_to = WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.NAME, f'QUERY_TO'))
                    )
                    query_from.clear()
                    query_from.send_keys(str(seq["start"]))
                    query_to.clear()
                    query_to.send_keys(str(seq["end"]))
                    print(f"Set query range for sequence {i + 1}: {seq['start']}-{seq['end']}.")
                except Exception as e:
                    print(f"Error setting range for sequence {i + 1}: {e}")

        # Click the BLAST button to submit the query
        try:
            blast_button = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//input[@value="BLAST"]'))
            )
            blast_button.click()
            print("Clicked the BLAST button.")
        except Exception as e:
            print(f"Error clicking the BLAST button: {e}")

        # Wait for the results page to load completely
        try:
            WebDriverWait(driver, 180).until(
                EC.presence_of_element_located((By.ID, 'btnDwnld'))
            )
            print("Results page loaded.")
        except Exception as e:
            print(f"Error waiting for the results page: {e}")

        # Click the download button
        try:
            download_button = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, 'btnDwnld'))
            )
            download_button.click()
            print("Clicked the download button.")
        except Exception as e:
            print(f"Error clicking the download button: {e}")

        # Get the download URL for the selected format
        download_url = None  # stays None if the format option cannot be located
        try:
            if download_format == "text":
                format_option = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.ID, 'dwText'))
                )
            else:  # CSV
                format_option = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.ID, 'dwDescrCsv'))
                )

            download_url = format_option.get_attribute('url')
            print(f"Download URL for {download_format} format:", download_url)
        except Exception as e:
            print(f"Error selecting '{download_format}' format: {e}")

        # Download the file using the URL (skipped if no URL was found above)
        if download_url:
            try:
                # Use conventional file extensions: .txt for text, .csv for CSV
                extension = "txt" if download_format == "text" else "csv"
                local_filename = os.path.join(download_dir, f"{filename}.{extension}")
                download_file(f"https://blast.ncbi.nlm.nih.gov/{download_url}", local_filename)
                print(f"BLAST results downloaded to {local_filename}")

                # Open the downloaded file
                with open(local_filename, 'r') as file:
                    print(file.read())
            except Exception as e:
                print(f"Error downloading the file: {e}")
        else:
            print("No download URL available; skipping download.")

    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        driver.quit()

# Ensure the WebDriver instance is initialized correctly
print("Reinitializing Chromedriver...")

try:
    driver = gs.Chrome()  # Reinitialize the WebDriver instance
    print("Chromedriver reinitialized successfully.")
except Exception as e:
    print(f"Failed to initialize Chromedriver: {e}")


Reinitializing Chromedriver...



Chromedriver reinitialized successfully.


Mount your Google Drive for the download folder.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



In [None]:
# Main execution
print("Reading and parsing file...")
sequences, species = read_and_parse_file()
species_of_interest = input("Enter the species of your protein of interest (e.g., human, mouse, or your model species): ").strip()
print("Processing sequences...")
process_sequences(sequences, species_of_interest)
blast_input = format_blast_input(sequences)
print("Formatted BLAST Input:")
print(blast_input)

# Call the function to upload and download BLAST
upload_and_download_blast(blast_input, sequences, species)

Reading and parsing file...


Saving selenium blast template 2.txt to selenium blast template 2 (9).txt
Parsed line: Neuromedin B [1-121]
Sequences extracted: [{'protein': 'Neuromedin B', 'start': 1, 'end': 121}]
Received taxon ID: 9606
Received taxon ID: 9598
Received taxon ID: 9597
Enter the species of your protein of interest (e.g., human, mouse, or your model species): human
Processing sequences...
Sequences to analyze:
Processing sequence: Neuromedin B
Searching for protein: Neuromedin B in species: human
Search results: {'Count': '58', 'RetMax': '20', 'RetStart': '0', 'IdList': ['1024249356', '45505145', '45505143', '7669548', '308153451', '281185514', '212286370', '128364', '2689118640', '1024249354', '161484640', '295849290', '7019577', '2670404010', '2462544319', '2462498614', '1034590777', '2559883241', '2447595853', '2227338031'], 'TranslationSet': [{'From': 'Neuromedin B', 'To': 'Neuromedin B[Protein Name] OR (Neuromedin[All Fields] AND B[All Fields])'}, {'From': 'human[Organism]', 'To': '"Homo sapiens"

Use ExPASy to get the isoelectric point (pI):

https://web.expasy.org/compute_pi/

Use Peptide 2.0 to obtain the proportions of hydrophobic, basic, acidic, and neutral residues:
https://peptide2.com/N_peptide_hydrophobicity_hydrophilicity.php
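
As a rough local cross-check of these web tools, residue composition and net charge at pH 7.0 can be estimated in plain Python. The sketch below rests on loud assumptions: the residue groupings are one common convention and may not match the grouping used by the site above, the pKa values are one commonly used approximate set (published tables differ), and the function names `residue_composition` and `net_charge` are hypothetical. Treat the numbers as estimates, not as replacements for the tools linked above.

```python
# Assumed residue groupings (conventions vary between tools)
HYDROPHOBIC = set("AFILMPVW")
ACIDIC = set("DE")
BASIC = set("RHK")

# Assumed approximate pKa values for ionizable groups
PKA_POSITIVE = {"N_term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"C_term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def residue_composition(sequence):
    """Percent of hydrophobic, acidic, basic, and neutral residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("Sequence is empty.")
    counts = {"hydrophobic": 0, "acidic": 0, "basic": 0, "neutral": 0}
    for aa in seq:
        if aa in HYDROPHOBIC:
            counts["hydrophobic"] += 1
        elif aa in ACIDIC:
            counts["acidic"] += 1
        elif aa in BASIC:
            counts["basic"] += 1
        else:
            counts["neutral"] += 1
    return {group: 100.0 * n / len(seq) for group, n in counts.items()}


def net_charge(sequence, pH=7.0):
    """Estimate net charge with the Henderson-Hasselbalch equation."""
    seq = sequence.upper()
    # Free termini: protonated N-terminus (+), deprotonated C-terminus (-)
    charge = 1.0 / (1.0 + 10 ** (pH - PKA_POSITIVE["N_term"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["C_term"] - pH))
    for aa in seq:
        if aa in ("K", "R", "H"):
            charge += 1.0 / (1.0 + 10 ** (pH - PKA_POSITIVE[aa]))
        elif aa in ("D", "E", "C", "Y"):
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - pH))
    return charge


peptide = "MARRAGGARMFGSLLLFALL"  # first 20 residues of the NMB immunogen above
print(residue_composition(peptide))
print(f"Estimated net charge at pH 7.0: {net_charge(peptide):.2f}")
```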

# Example antibody search: pulling the immunogen from Abcam's website

1) Navigate to the main site URL.

2) Enter search terms into the search bar.

3) Click the search button.

4) Retrieve the results and navigate to the first result.

In [None]:
%pip install -q google-colab-selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
import re
from Bio import Entrez
from google.colab import files
import io
import time
import requests
import google_colab_selenium as gs
from selenium.webdriver.common.keys import Keys

driver = gs.Chrome()

def search_abcam(driver, search_terms):
    driver.get("https://www.abcam.com/")
    search_bar = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "searchfieldtop"))
    )
    search_bar.clear()
    search_bar.send_keys(search_terms)
    search_bar.send_keys(Keys.RETURN)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//a[contains(@href, "products/primary-antibodies")]'))
    )
    print("Navigated to search results page.")

def click_first_result(driver):
    try:
        first_product_link = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//a[contains(@href, "products/primary-antibodies")]'))
        )
        first_product_link.click()
        print("Clicked on the first product link and navigated to the product page.")
    except Exception as e:
        print(f"Error clicking the first product link: {e}")
        return None

def extract_immunogen(driver):
    try:
        immunogen_header = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//h3[text()="Immunogen"]'))
        )
        immunogen_text_element = immunogen_header.find_element(By.XPATH, "./following-sibling::div")
        immunogen_text = immunogen_text_element.text
        print(f"Immunogen: {immunogen_text}")
        return immunogen_text
    except Exception as e:
        print(f"Error extracting immunogen: {e}")
        return None

# Add more website functions here similar to the above for Abcam

# Main function
def main():
    # Prompt user for search terms
    search_terms = input("Enter the search term for the antibody: ")

    # Perform search on Abcam and extract immunogen sequence
    search_abcam(driver, search_terms)
    click_first_result(driver)
    immunogen = extract_immunogen(driver)

    # Close the WebDriver
    driver.quit()

    if immunogen:
        print("Successfully extracted immunogen sequence.")
        # Here you can call the Flexible BLAST program with the immunogen sequence
    else:
        print("Failed to extract immunogen sequence.")

# Run the main function
if __name__ == "__main__":
    main()



Enter the search term for the antibody: oxytocin receptor human monoclonal
Navigated to search results page.
Clicked on the first product link and navigated to the product page.
Immunogen: Synthetic peptide. This information is proprietary to Abcam and/or its suppliers.
Successfully extracted immunogen sequence.
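
The run above reports success even though Abcam returned a proprietary notice rather than a sequence, because `main()` only checks that some text came back. A minimal sketch of a stricter check follows, assuming the sequence (when present) appears as an uninterrupted run of one-letter amino acid codes. The function name `find_peptide_sequences` and the 8-residue threshold are hypothetical choices for illustration, not part of the original notebook.

```python
import re

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical threshold: shorter runs are too easily confused with acronyms.
MIN_LENGTH = 8

def find_peptide_sequences(text, min_length=MIN_LENGTH):
    """Return every run of amino acid letters at least `min_length` long."""
    pattern = rf"\b[{AMINO_ACIDS}]{{{min_length},}}\b"
    return re.findall(pattern, text)

# The text returned in the run above contains no sequence.
notice = "Synthetic peptide. This information is proprietary to Abcam and/or its suppliers."
print(find_peptide_sequences(notice))

# An illustrative string (not taken from any vendor page) that does contain one.
example = "Synthetic peptide: CYIQNCPLG conjugated to KLH"
print(find_peptide_sequences(example))
```

An empty list would then be a reason to print the failure message instead of passing the text on to BLAST.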


Using BeautifulSoup and Selenium to get antibody vendor names and URLs

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
import google_colab_selenium as gs

driver = gs.Chrome()

def get_website_title(url):
    try:
        driver.get(url)
        time.sleep(2)  # Wait for the page to load
        return driver.title
    except Exception as e:
        print(f"Error retrieving title for {url}: {e}")
        return None

def google_search(term, num_results=50):
    search_url = "https://www.google.com/search"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Pass the query through `params` so requests URL-encodes spaces and special characters
    response = requests.get(
        search_url,
        params={'q': term, 'num': num_results},
        headers=headers,
        timeout=10,
    )
    soup = BeautifulSoup(response.text, 'html.parser')

    results = []
    # Each organic result sits in a <div class="g">; this class name may change
    search_results = soup.find_all('div', class_='g')

    print(f"Found {len(search_results)} results on the first page.")

    for result in search_results:
        try:
            title = result.find('h3').text
            link = result.find('a')['href']
            domain = link.split('/')[2]  # "https://host/path" -> "host"
            results.append((title, domain))
            if len(results) >= num_results:
                break
        except (AttributeError, TypeError, KeyError, IndexError):
            # Skip blocks without a title, a link, or an absolute URL
            continue

    return results

def save_to_csv(results, filename='antibody_vendors.csv'):
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Website Title', 'Website URL'])
        for title, url in results:
            writer.writerow([title, url])
    print(f"Results saved to {filename}")

def main():
    # Google search part
    term = input("Enter the search term: ")
    results = google_search(term)
    if results:
        print(f"Total results found: {len(results)}")
        save_to_csv(results)
        df = pd.read_csv('antibody_vendors.csv')
        print(df)
    else:
        print("No results found.")

if __name__ == "__main__":
    main()

Enter the search term: antibody products
Found 49 results on the first page.
Total results found: 49
Results saved to antibody_vendors.csv
                                        Website Title  \
0                            Antibody Products – MHIR   
1                 Discover a Wide Range of Antibodies   
2                      Antibody Products for Research   
3          Antibodies | Thermo Fisher Scientific - US   
4   Antibodies - Primary, Secondary & Recombinant ...   
5                                   ANTIBODY PRODUCTS   
6               Anti-SARS-CoV-2 Monoclonal Antibodies   
7   Abcam: Antibodies, Proteins, Kits and Reagents...   
8   Antibodies, ELISA Kits & Proteins for Life Sci...   
9   Antibody therapeutics approved or in regulator...   
10                       Monoclonal Antibodies (MABs)   
11                     COVID-19 Monoclonal Antibodies   
12                     All Antibody Products - PROGEN   
13  A systematic review of commercial high concent...   
14    
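
The `google_search` function above takes the domain as `link.split('/')[2]`, which works for absolute URLs but silently returns an empty string for a relative link such as `/url?q=https://...`. A small sketch of a stricter alternative using the standard library's `urlparse` follows; the helper name `extract_domain` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse

def extract_domain(link):
    """Return the host of an absolute http(s) URL, or None for anything else."""
    parsed = urlparse(link)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return parsed.netloc

print(extract_domain("https://www.abcam.com/products/primary-antibodies"))
print(extract_domain("/search?q=antibody"))
```

Returning `None` for non-absolute links makes it easy to skip those rows before they are written to the CSV.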

Using Selenium only (works less reliably)

In [None]:
"""
import time
import csv
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import google_colab_selenium as gs

def google_search(term, num_results=50):
    driver = gs.Chrome()
    search_url = "https://www.google.com"
    driver.get(search_url)

    search_box = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.NAME, "q"))
    )
    search_box.clear()
    search_box.send_keys(term)
    search_box.send_keys(Keys.RETURN)

    results = []
    pages_scanned = 0

    while len(results) < num_results:
        time.sleep(2)  # Wait for the page to load
        search_results = driver.find_elements(By.CSS_SELECTOR, 'div.g')

        print(f"Page {pages_scanned + 1}: Found {len(search_results)} results.")

        for result in search_results:
            try:
                title = result.find_element(By.TAG_NAME, 'h3').text
                link = result.find_element(By.TAG_NAME, 'a').get_attribute('href')
                domain = link.split('/')[2]
                results.append((title, domain))
                if len(results) >= num_results:
                    break
            except Exception as e:
                continue

        try:
            next_button = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "pnnext"))
            )
            next_button.click()
        except Exception as e:
            print("No more pages or error navigating to the next page.")
            break

        pages_scanned += 1
        print(f"Total results so far: {len(results)}")

    driver.quit()
    return results, pages_scanned

def save_to_csv(results, filename='antibody_vendors.csv'):
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Title', 'URL'])
        for title, url in results:
            writer.writerow([title, url])
    print(f"Results saved to {filename}")

def main():
    term = input("Enter the search term: ")
    results, pages_scanned = google_search(term)
    if results:
        print(f"Total pages scanned: {pages_scanned}")
        print(f"Total results found: {len(results)}")
        save_to_csv(results)
        df = pd.read_csv('antibody_vendors.csv')
        print(df)
    else:
        print("No results found.")

if __name__ == "__main__":
    main()
"""


Enter the search term: neuromedin b polyclonal antibody


Page 1: Found 14 results.
No more pages or error navigating to the next page.
Total pages scanned: 0
Total results found: 14
Results saved to antibody_vendors.csv
                                                Title  \
0                  neuromedin B antibody (10888-1-AP)   
1       neuromedin B Polyclonal Antibody (10888-1-AP)   
2           Neuromedin B receptor Polyclonal Antibody   
3                                                 NaN   
4                                                 NaN   
5                                                 NaN   
6                                                 NaN   
7   Antibody: NMB / Neuromedin B Rabbit anti-Human...   
8                   Neuromedin B antibody (AA 25-121)   
9               Anti-Neuromedin B antibody (ab191499)   
10  Neuromedin B Rabbit anti-Human, Polyclonal, No...   
11    Neuromedin BR/NMBR Antibody - BSA Free (NLS825)   
12  Assessment of neuromedin B polyclonal antibodi...   
13                Anti-Neuromedin B Ant

Getting supplier domains from BioCompare

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def google_search_first_result(term):
    search_url = "https://www.google.com/search"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Pass the query through `params` so supplier names with spaces or '&' are URL-encoded
    response = requests.get(
        search_url,
        params={'q': f"{term} antibody"},
        headers=headers,
        timeout=10,
    )
    soup = BeautifulSoup(response.text, 'html.parser')

    search_results = soup.find_all('div', class_='g')

    results = []
    for result in search_results:
        try:
            title = result.find('h3').text
            link = result.find('a')['href']
            domain = link.split('/')[2]  # "https://host/path" -> "host"
            results.append((title, domain))
            break  # Get only the first result
        except (AttributeError, TypeError, KeyError, IndexError):
            # Skip blocks without a title, a link, or an absolute URL
            continue

    return results

def scrape_suppliers():
    # Load the local HTML file
    with open('/mnt/data/antibodies-online source code.txt', 'r') as file:
        html_content = file.read()

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    suppliers = []

    # Iterate through each span with class "h5" and extract supplier names
    for span in soup.find_all('span', class_='h5'):
        button = span.find('button', {'name': 'suppler-submit'})
        if button:
            name = button.text.strip()
            suppliers.append(name)

    print(f"Extracted {len(suppliers)} suppliers")

    supplier_urls = []
    for supplier in suppliers:
        print(f"Searching for supplier: {supplier}")
        search_results = google_search_first_result(supplier)
        if search_results:
            supplier_urls.append((supplier, search_results[0][1]))
        else:
            supplier_urls.append((supplier, "No URL found"))
        time.sleep(2)  # Pause between requests to reduce the chance of being blocked

    # Save to CSV
    df = pd.DataFrame(supplier_urls, columns=["Supplier", "URL"])
    df.to_csv('/mnt/data/antibodies-online-suppliers-list.csv', index=False)
    print("Suppliers' names and URLs have been saved to 'antibodies-online-suppliers-list.csv'.")

    # Display the collected names and URLs
    print(df)

# Running the updated function
scrape_suppliers()


Working off the first rows of antibody_vendors.csv to standardize element search across websites
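
One thing to keep in mind when standardizing the search: XPath 1.0's `contains()` is case-sensitive, so a placeholder such as "Search products" would not match the lowercase `"search"` patterns used in `find_search_bar` below. A possible workaround is to lower-case the attribute with `translate()` before comparing. The sketch below only builds the XPath string and has not been tested against the vendor sites; the helper names `ci_contains` and `search_input_xpath` are hypothetical.

```python
# XPath 1.0 has no lower-case() function, so translate() is the usual workaround.
_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
_LOWER = "abcdefghijklmnopqrstuvwxyz"

def ci_contains(attribute, needle):
    """Build an XPath predicate matching `needle` in `@attribute`, ignoring case."""
    return (
        f'contains(translate(@{attribute}, "{_UPPER}", "{_LOWER}"), '
        f'"{needle.lower()}")'
    )

def search_input_xpath(
    attributes=("placeholder", "class", "name", "id", "title", "aria-label"),
    needles=("search", "keyword"),
):
    """Combine the predicates into one XPath selecting candidate <input> elements."""
    predicates = [ci_contains(a, n) for a in attributes for n in needles]
    return "//input[" + " or ".join(predicates) + "]"

print(ci_contains("placeholder", "Search"))
```

The resulting string could be passed to `driver.find_elements(By.XPATH, ...)`. Note that a single combined expression drops the attribute-by-attribute priority order that `find_search_bar` uses, so it is a trade-off rather than a drop-in replacement.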

In [None]:
%pip install -q google-colab-selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException, NoSuchElementException
import os
import re
from Bio import Entrez
from google.colab import files
import io
import time
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd

import google_colab_selenium as gs
from selenium.webdriver.common.keys import Keys

driver = gs.Chrome()

def find_search_bar(driver):
    search_bar = None
    # Search with various common attributes and broader matching
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[contains(@placeholder, "search") or contains(@placeholder, "keyword")]')
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[contains(@class, "search") or contains(@class, "keyword")]')
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[@type="text" or @type="search"]')
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[contains(@name, "search") or contains(@name, "keyword")]')
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[contains(@id, "search") or contains(@id, "keyword")]')
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[contains(@title, "search") or contains(@title, "keyword")]')
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[contains(@aria-label, "search") or contains(@aria-label, "keyword")]')
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[contains(@role, "search") or contains(@role, "keyword")]')

    # Specific search for 'keywords' as seen in the R&D Systems HTML
    search_bar = search_bar or driver.find_elements(By.XPATH, '//input[@name="keywords"]')

    return search_bar[0] if search_bar else None


def search_website(driver, url, search_terms):
    driver.get(url)
    search_bar = find_search_bar(driver)
    if search_bar:
        try:
            search_bar.clear()
            search_bar.send_keys(search_terms)
            search_bar.send_keys(Keys.RETURN)
            print(f"Navigated to search results page for {url}.")
            return driver.current_url
        except ElementNotInteractableException:
            print(f"Search bar not interactable on {url}.")
            return None
    else:
        print(f"No search bar found on {url}.")
        return None

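def click_first_result(driver):
    """Click the first product link on the results page; return True on success."""
    # NOTE: this href pattern comes from the Abcam functions above; other vendors'
    # product links will likely not match it, in which case the wait times out.
    try:
        first_product_link = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//a[contains(@href, "products/primary-antibodies")]'))
        )
        first_product_link.click()
        print("Clicked on the first product link and navigated to the product page.")
        return True
    except Exception as e:
        print(f"Error clicking the first product link: {e}")
        return False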

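def extract_immunogen(driver):
    keywords = ['Immunogen', 'Antigen', 'Epitope', 'Sequence', 'Peptide']
    for keyword in keywords:
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, f'//*[contains(text(), "{keyword}")]'))
            )
            immunogen_text = element.text.strip()
            # If the match is only the label itself (e.g. "Immunogen") or empty,
            # read the next sibling element instead, as the Abcam version does
            if immunogen_text in ('', keyword):
                sibling = element.find_element(By.XPATH, './following-sibling::*[1]')
                immunogen_text = sibling.text.strip()
            print(f"{keyword}: {immunogen_text}")
            return immunogen_text
        except Exception:
            # Keyword not found (or no sibling to read); try the next keyword
            continue
    print("No immunogen element found.")
    return "No immunogen element found."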

def main():
    search_terms = "neuromedin b"
    input_file = 'antibody_vendors.csv'
    output_file = '/content/antibody_vendors_output.csv'

    with open(input_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        results = []
        for i, row in enumerate(reader):
            if i >= 6:
                break
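            # Accept either header style written by save_to_csv ('URL' or 'Website URL')
            url = f"https://{row.get('URL') or row.get('Website URL')}"
            title = row.get('Title') or row.get('Website Title')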
            search_results_url = search_website(driver, url, search_terms)
            if search_results_url:
                click_first_result(driver)
                immunogen = extract_immunogen(driver)
                results.append({
                    'Title': title,
                    'URL': url,
                    'Search Results URL': search_results_url,
                    'Immunogen HTML': immunogen
                })
            time.sleep(2)  # Delay to avoid being blocked by websites

    df = pd.DataFrame(results)
    df.to_csv(output_file, index=False)
    print(f"Results saved to {output_file}")
    print(df)

if __name__ == "__main__":
    main()


Navigated to search results page for https://www.abcam.com.
Clicked on the first product link and navigated to the product page.
Immunogen: Immunogen
No search bar found on https://www.rndsystems.com.
Search bar not interactable on https://www.antibodies.com.
Navigated to search results page for https://abclonal.com.
Error clicking the first product link: Message: 
Stacktrace: (19 frames, all <unknown>, omitted)

Antigen: 
Search bar not interactable on https://antibodyresearch.com.
Navigat