# Build Testable PGx Genes

### Objective:
The script scrapes https://ddrx.pharmgkb.org/genotypes and creates a CSV file for each PGx gene that requires clinical testing, organized by therapeutic area.

### Data Source:

#### PharmGKB Website: https://ddrx.pharmgkb.org/genotypes

PharmGKB provides comprehensive information about drug-gene interactions, including clinical guidelines and recommendations for testing specific genes.

#### Warning: This script only works for genes that require a combination of genotypes.

Genes that require a combination of genotypes: ABCG2, CACNA1S, CFTR, CYP2C19, CYP2D6, CYP3A4, CYP3A5, DPYD, IFNL3, NUDT15, RYR1, SLCO1B1, TPMT, UGT1A1, VKORC1

Gene with only one genotype: MT-RNR1

Genes with one or two genotypes: HLA-B, G6PD
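
Rather than maintaining these lists by hand, the split between two-dropdown and one-dropdown genes could be derived from the scraped dropdown data. The sketch below assumes a dictionary shaped like the `gene_data` built later in this notebook. `split_by_dropdown_count` is a hypothetical helper, not part of the original script, and the MT-RNR1 option value is a placeholder. It cannot tell apart genes such as HLA-B or G6PD, where the second genotype is optional.

```python
def split_by_dropdown_count(gene_data):
    """Split genes into those with a populated second dropdown and those without."""
    two_dropdown, one_dropdown = [], []
    for gene, dropdowns in gene_data.items():
        # A gene counts as two-dropdown only if its second dropdown has a non-empty option.
        has_second = any(dropdowns.get('second_dropdown', []))
        (two_dropdown if has_second else one_dropdown).append(gene)
    return two_dropdown, one_dropdown


# Toy data in the same shape as gene_data (the MT-RNR1 value is a placeholder)
example = {
    'ABCG2': {'first_dropdown': ['', 'rs2231142 reference (G)', 'rs2231142 variant (T)'],
              'second_dropdown': ['', 'rs2231142 reference (G)', 'rs2231142 variant (T)']},
    'MT-RNR1': {'first_dropdown': ['', 'placeholder genotype'], 'second_dropdown': []},
}
two, one = split_by_dropdown_count(example)
print(two, one)
```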

#### Requirements to get the CSV file for each gene:

1. Change gene_name
2. Change the CSV file name (e.g. gene_name.csv)


## Importing libraries

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import csv
from bs4 import BeautifulSoup
from collections import defaultdict



In [2]:
pip install beautifulsoup4



In [4]:
pip install selenium


## Starting ChromeDriver

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

driver_path = '/usr/local/bin/chromedriver'

# Initialize Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--remote-debugging-port=9222")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--user-data-dir=/tmp/chrome_profile")  # Temporary Chrome profile

# Initialize WebDriver with Service and options
service = Service(driver_path)
driver = webdriver.Chrome(service=service, options=chrome_options)

# Example: Open a page
driver.get("https://ddrx.pharmgkb.org/genotypes")


## Getting the list of gene names listed on https://ddrx.pharmgkb.org/genotypes

In [4]:
# Initialize an empty list to store gene names
gene_names = []

# Locate the elements that carry all three classes: fs-6, mb-1 and fw-bold
elements = driver.find_elements(By.CSS_SELECTOR, ".fs-6.mb-1.fw-bold")

# Iterate over each element and extract the text (gene name)
for elem in elements:
    gene_names.append(elem.text.strip())

# Output the list of gene names
gene_names

['ABCG2',
 'CACNA1S',
 'CFTR',
 'CYP2B6',
 'CYP2C19',
 'CYP2C9',
 'CYP2D6',
 'CYP3A4',
 'CYP3A5',
 'DPYD',
 'G6PD',
 'HLA-A',
 'HLA-B',
 'IFNL3',
 'MT-RNR1',
 'NUDT15',
 'RYR1',
 'SLCO1B1',
 'TPMT',
 'UGT1A1',
 'VKORC1']

## Getting the gene and dropdown values

MT-RNR1 requires only one genotype, so it has no second dropdown.

In [5]:
from selenium.common.exceptions import NoSuchElementException

# Initialize a dictionary to store gene values
gene_data = {}

# Find all gene elements (those carrying the classes fs-6, mb-1 and fw-bold)
gene_elements = driver.find_elements(By.CSS_SELECTOR, '.fs-6.mb-1.fw-bold')

# Iterate through each gene
for gene_element in gene_elements:
    gene_name = gene_element.text.strip()

    # Extract the values from the first dropdown (its id is the gene name)
    first_dropdown = driver.find_element(By.ID, gene_name)
    first_options = [option.get_attribute('value') for option in first_dropdown.find_elements(By.TAG_NAME, 'option')]

    # Default to no second-dropdown options
    second_options = []

    # The second dropdown, if present, has the id "<gene_name>2"
    try:
        second_dropdown = driver.find_element(By.ID, f"{gene_name}2")
        second_options = [option.get_attribute('value') for option in second_dropdown.find_elements(By.TAG_NAME, 'option')]
    except NoSuchElementException:
        print(f"No second dropdown for {gene_name}, handling exception.")
        
    # Store the values in the dictionary
    gene_data[gene_name] = {
        'first_dropdown': first_options,
        'second_dropdown': second_options
    }

# Print the results
gene_data

No second dropdown for MT-RNR1, handling exception.


{'ABCG2': {'first_dropdown': ['',
   'rs2231142 reference (G)',
   'rs2231142 variant (T)'],
  'second_dropdown': ['', 'rs2231142 reference (G)', 'rs2231142 variant (T)']},
 'CACNA1S': {'first_dropdown': ['', 'Reference', 'c.520C>T', 'c.3257G>A'],
  'second_dropdown': ['', 'Reference', 'c.520C>T', 'c.3257G>A']},
 'CFTR': {'first_dropdown': ['',
   '711+3A->G',
   '2789+5G->A',
   '3272-26A->G',
   '3849+10kbC->T',
   'A455E',
   'A1067T',
   'D110E',
   'D110H',
   'D579G',
   'D1152H',
   'D1270N',
   'E56K',
   'E193K',
   'E831X',
   'F1052V',
   'F1074L',
   'G178R',
   'G551D',
   'G551S',
   'G1069R',
   'G1244E',
   'G1349D',
   'K1060T',
   'L206W',
   'P67L',
   'R74W',
   'R117C',
   'R117H',
   'R347H',
   'R352Q',
   'R1070Q',
   'R1070W',
   'S549N',
   'S549R(A>C)',
   'S549R(T>G)',
   'S945L',
   'S977F',
   'S1251N',
   'S1255P',
   'ivacaftor non-responsive CFTR sequence'],
  'second_dropdown': ['',
   '711+3A->G',
   '2789+5G->A',
   '3272-26A->G',
   '3849+10kbC->T',

## Making combinations of genotypes for each gene

In [6]:
combinations = {}

for key, dropdowns in gene_data.items():
    first_options = [option for option in dropdowns['first_dropdown'] if option and 'Reference' not in option]
    second_options = [option for option in dropdowns['second_dropdown'] if option and 'Reference' not in option]

    seen = set()
    unique_combinations = []
    
    for first in first_options:
        for second in second_options:
            # Create a sorted tuple to treat (first, second) and (second, first) as the same
            combo = tuple(sorted((first, second)))
            if combo not in seen:
                seen.add(combo)
                unique_combinations.append(combo)
    
    combinations[key] = unique_combinations

combinations

{'ABCG2': [('rs2231142 reference (G)', 'rs2231142 reference (G)'),
  ('rs2231142 reference (G)', 'rs2231142 variant (T)'),
  ('rs2231142 variant (T)', 'rs2231142 variant (T)')],
 'CACNA1S': [('c.520C>T', 'c.520C>T'),
  ('c.3257G>A', 'c.520C>T'),
  ('c.3257G>A', 'c.3257G>A')],
 'CFTR': [('711+3A->G', '711+3A->G'),
  ('2789+5G->A', '711+3A->G'),
  ('3272-26A->G', '711+3A->G'),
  ('3849+10kbC->T', '711+3A->G'),
  ('711+3A->G', 'A455E'),
  ('711+3A->G', 'A1067T'),
  ('711+3A->G', 'D110E'),
  ('711+3A->G', 'D110H'),
  ('711+3A->G', 'D579G'),
  ('711+3A->G', 'D1152H'),
  ('711+3A->G', 'D1270N'),
  ('711+3A->G', 'E56K'),
  ('711+3A->G', 'E193K'),
  ('711+3A->G', 'E831X'),
  ('711+3A->G', 'F1052V'),
  ('711+3A->G', 'F1074L'),
  ('711+3A->G', 'G178R'),
  ('711+3A->G', 'G551D'),
  ('711+3A->G', 'G551S'),
  ('711+3A->G', 'G1069R'),
  ('711+3A->G', 'G1244E'),
  ('711+3A->G', 'G1349D'),
  ('711+3A->G', 'K1060T'),
  ('711+3A->G', 'L206W'),
  ('711+3A->G', 'P67L'),
  ('711+3A->G', 'R74W'),
  ('711+3A

## Count genotype combinations for each gene

In [7]:
# Get gene names and count of all combinations
combination_counts = {gene: len(combos) for gene, combos in combinations.items()}

# Sort the counts in ascending order
sorted_combination_counts = dict(sorted(combination_counts.items(), key=lambda item: item[1]))

print("Gene Name and Count of Combinations (Sorted):", sorted_combination_counts)

Gene Name and Count of Combinations (Sorted): {'MT-RNR1': 0, 'ABCG2': 3, 'CACNA1S': 3, 'HLA-A': 3, 'IFNL3': 3, 'VKORC1': 3, 'HLA-B': 15, 'CYP3A5': 21, 'NUDT15': 45, 'UGT1A1': 45, 'CYP3A4': 78, 'CYP2C19': 666, 'CFTR': 820, 'TPMT': 946, 'SLCO1B1': 990, 'CYP2B6': 1176, 'CYP2C9': 2850, 'DPYD': 3486, 'CYP2D6': 15051, 'G6PD': 17578, 'RYR1': 57630}
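
These counts follow a simple pattern: when both dropdowns offer the same n options, the number of unordered pairs (same-allele pairs included) is n(n+1)/2. CFTR's 40 non-empty options give 40 × 41 / 2 = 820, matching the count above. As a minimal sketch, the nested loop with the `seen` set could be replaced by the standard library's `itertools.combinations_with_replacement`. `unordered_pairs` is a hypothetical helper name, and the sketch assumes the option list contains no duplicates.

```python
from itertools import combinations_with_replacement


def unordered_pairs(options):
    """Return every unordered pair, including same-allele pairs, from a list of options."""
    # Sorting first makes each pair come out in the same order as tuple(sorted(...)).
    return list(combinations_with_replacement(sorted(options), 2))


options = ['c.520C>T', 'c.3257G>A']  # the two non-reference CACNA1S options
pairs = unordered_pairs(options)
print(pairs)

# The count matches n * (n + 1) / 2
print(len(pairs) == len(options) * (len(options) + 1) // 2)
```

The pairs come out in a different order than the nested loop produces, which matters only if the scraping order is important.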


## Get the genotype combinations for the selected gene

In [8]:
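gene_name = 'RYR1'
# Retrieve the combinations for gene_name.
# Note: this rebinds `combinations` from the dict of all genes to the list for one gene,
# so re-run the combinations cell before running this cell again for another gene.
combinations = combinations.get(gene_name, [])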

# Print the combinations
combinations

[('c.38T>G', 'c.38T>G'),
 ('c.38T>G', 'c.51_53del'),
 ('c.38T>G', 'c.97A>G'),
 ('c.103T>C', 'c.38T>G'),
 ('c.119G>C', 'c.38T>G'),
 ('c.130C>T', 'c.38T>G'),
 ('c.131G>A', 'c.38T>G'),
 ('c.152C>A', 'c.38T>G'),
 ('c.178G>A', 'c.38T>G'),
 ('c.178G>T', 'c.38T>G'),
 ('c.190T>C', 'c.38T>G'),
 ('c.212C>A', 'c.38T>G'),
 ('c.251C>T', 'c.38T>G'),
 ('c.38T>G', 'c.418G>A'),
 ('c.38T>G', 'c.455C>A'),
 ('c.38T>G', 'c.463C>A'),
 ('c.38T>G', 'c.467G>A'),
 ('c.38T>G', 'c.479A>G'),
 ('c.38T>G', 'c.487C>T'),
 ('c.38T>G', 'c.488G>A'),
 ('c.38T>G', 'c.488G>T'),
 ('c.38T>G', 'c.493G>A'),
 ('c.38T>G', 'c.496G>A'),
 ('c.38T>G', 'c.497A>G'),
 ('c.38T>G', 'c.526G>A'),
 ('c.38T>G', 'c.528G>T'),
 ('c.38T>G', 'c.529C>T'),
 ('c.38T>G', 'c.533A>C'),
 ('c.38T>G', 'c.533A>G'),
 ('c.38T>G', 'c.625G>A'),
 ('c.38T>G', 'c.641C>T'),
 ('c.38T>G', 'c.652G>A'),
 ('c.38T>G', 'c.677T>A'),
 ('c.38T>G', 'c.680A>T'),
 ('c.38T>G', 'c.742G>A'),
 ('c.38T>G', 'c.742G>C'),
 ('c.38T>G', 'c.946C>T'),
 ('c.38T>G', 'c.947G>T'),
 ('c.38T>G',

## Scrape the drugs that have clinical actions

In [10]:
import csv
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Define the scrape_data function
def scrape_data(gene_name, first_genotype, second_genotype):
    # Initialize a dictionary to store aggregated results
    aggregated_data = defaultdict(lambda: {'Therapeutic Area(s)': set(), 'Drug(s)': set()})
    
    try:
        # Wait for drug cards to load
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "drugCard")))
    except Exception as e:
        print(f"Error waiting for drug cards to load: {e}")
        return

    # Find the section that contains the drug information
    drugs_section = driver.find_elements(By.CLASS_NAME, "drugCard")
    
    # Debugging step: Log the number of drug cards found
    print(f"Found {len(drugs_section)} drug cards.")

    for drug_card in drugs_section:
        try:
            # Check if the 'mb-4' class exists inside the drugCard
            try:
                drug_card.find_element(By.CLASS_NAME, 'mb-4')
            except Exception:  # avoid a bare except so a kernel interrupt is not swallowed
                print("No mb-4 div found, skipping drug.")
                continue

            # Get the drug name
            try:
                drug_name_element = drug_card.find_element(By.TAG_NAME, 'a')
                drug_name = drug_name_element.text.strip()
            except Exception as e:
                print(f"Error finding drug name: {e}")
                continue

            # Check if there's a div with 'text-muted' class (to skip if found)
            try:
                pgx_action_element = drug_card.find_element(By.CLASS_NAME, 'text-muted')
                pgx_action = pgx_action_element.text.strip()
                print(f"Skipping drug {drug_name} due to PGx clinical action: {pgx_action}")
                continue  # Skip this drug
            except Exception:  # avoid a bare except so a kernel interrupt is not swallowed
                pgx_action = ""  # If no 'text-muted' class, proceed


            # Find the therapeutic area by navigating to the drug's <a> tag and then to the previous <dt> element
            try:
                # Find the 'a' tag that matches the drug_name
                drug_link = driver.find_element(By.XPATH, f"//a[contains(text(), '{drug_name}') and contains(@class, 'link-primary')]")
                
                # Navigate to the previous 'dt' element to get the therapeutic area
                therapeutic_area_element = drug_link.find_element(By.XPATH, "preceding::dt[contains(@class, 'text-start')][1]//span")
                therapeutic_area = therapeutic_area_element.text.strip()
            except Exception as e:
                print(f"Error finding therapeutic area for {drug_name}: {e}")
                therapeutic_area = "Unknown"  # Handle the missing therapeutic area gracefully

            
            # Aggregate therapeutic areas and drugs for each genotype
            key = f"{first_genotype}/{second_genotype}"
            aggregated_data[key]['Therapeutic Area(s)'].add(therapeutic_area)
            aggregated_data[key]['Drug(s)'].add(drug_name)
            
        
        except Exception as e:
            # Log any errors in scraping
            print(f"Error processing drug card: {e}")

    # Prepare data for storing in CSV
    csv_data = []
    for genotype, info in aggregated_data.items():
        csv_data.append({
            'Gene': gene_name,
            'Genotype(s)': genotype,
            'Therapeutic Area(s)': ', '.join(info['Therapeutic Area(s)']),
            'Drug(s)': ', '.join(info['Drug(s)'])
        })

    # Print the aggregated data to the console
    print(csv_data)
    
    # Store the aggregated data to the CSV file
    if csv_data:
        store_data_to_csv(csv_data)
    else:
        print("No data scraped.")
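
Because the therapeutic areas and drugs are collected in sets, the order in which `', '.join(...)` emits them can differ from run to run, which makes CSV files from separate runs harder to compare. A minimal sketch of a deterministic alternative is shown below. `build_row` is a hypothetical helper, not part of the original script, and the sample values are taken from the RYR1 output in this notebook.

```python
def build_row(gene_name, genotype, areas, drugs):
    """Build one CSV row with set contents sorted so repeated runs give identical text."""
    return {
        'Gene': gene_name,
        'Genotype(s)': genotype,
        'Therapeutic Area(s)': ', '.join(sorted(areas)),
        'Drug(s)': ', '.join(sorted(drugs)),
    }


row = build_row(
    'RYR1',
    'c.38T>G/c.38T>G',
    {'NERVOUS SYSTEM DRUGS', 'MUSCULO-SKELETAL SYSTEM DRUGS'},
    {'sevoflurane', 'halothane', 'succinylcholine'},
)
print(row['Therapeutic Area(s)'])
print(row['Drug(s)'])
```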



## Store data records to CSV

In [12]:
# Function to store aggregated data in CSV
def store_data_to_csv(data, filename='RYR1.csv'):
    # Define the column headers
    fieldnames = ['Gene', 'Genotype(s)', 'Therapeutic Area(s)', 'Drug(s)']

    # Open the CSV file in append mode
    with open(filename, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)

        # Write the header only if the file is empty
        file.seek(0, 2)  # Move the pointer to the end of the file
        if file.tell() == 0:
            writer.writeheader()

        # Write the aggregated data to the CSV file
        writer.writerows(data)

    print(f"Data entries written to {filename}.")
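
The default file name above is hard-coded to one gene, which is why the file name has to be changed by hand for every run. One possible variation, sketched below under the assumption that each gene gets a file named after it, derives the path from the gene name and writes the header only when the file is new or empty. `append_rows`, `FIELDNAMES` and `out_dir` are hypothetical names, and the sample row reuses the HLA-B values from this notebook's output.

```python
import csv
import os
import tempfile

FIELDNAMES = ['Gene', 'Genotype(s)', 'Therapeutic Area(s)', 'Drug(s)']


def append_rows(rows, gene_name, out_dir='.'):
    """Append rows to <gene_name>.csv, writing the header only for a new or empty file."""
    path = os.path.join(out_dir, f"{gene_name}.csv")
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode='a', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)
    return path


# Usage with a throwaway directory: two appends, one header
demo_dir = tempfile.mkdtemp()
row = {'Gene': 'HLA-B', 'Genotype(s)': '*58:01',
       'Therapeutic Area(s)': 'MUSCULO-SKELETAL SYSTEM DRUGS', 'Drug(s)': 'allopurinol'}
path = append_rows([row], 'HLA-B', demo_dir)
append_rows([row], 'HLA-B', demo_dir)
with open(path, newline='', encoding='utf-8') as handle:
    lines = handle.read().splitlines()
print(len(lines))
```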

## Main Function

In [None]:
# Initialize WebDriverWait
wait = WebDriverWait(driver, 10)

# Process each combination
for first, second in combinations:
    # Select the first dropdown
    first_dropdown = Select(driver.find_element(By.ID, gene_name))
    first_dropdown.select_by_value(first)  # Use first value from tuple

    # Select the second dropdown
    second_dropdown = Select(driver.find_element(By.ID, f"{gene_name}2"))
    second_dropdown.select_by_value(second)  # Use second value from tuple
    
    # Click the "View Recommendations" button
    view_recommendations_button = wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//button[contains(@class, 'btn-primary') and contains(text(), 'View recommendations')]")
    ))
    view_recommendations_button.click()
    
    # Scrape the data
    scrape_data(gene_name, first, second)

    # Optionally, go back to the previous page or reset the dropdowns
    driver.back()
    wait.until(EC.element_to_be_clickable((By.ID, gene_name)))  # Wait until the first dropdown is clickable

# Close the driver
driver.quit()

Found 7 drug cards.
[{'Gene': 'RYR1', 'Genotype(s)': 'c.38T>G/c.38T>G', 'Therapeutic Area(s)': 'NERVOUS SYSTEM DRUGS, MUSCULO-SKELETAL SYSTEM DRUGS', 'Drug(s)': 'sevoflurane, isoflurane, methoxyflurane, desflurane, enflurane, halothane, succinylcholine'}]
Data entries written to RYR1.csv.
Found 7 drug cards.
[{'Gene': 'RYR1', 'Genotype(s)': 'c.38T>G/c.51_53del', 'Therapeutic Area(s)': 'NERVOUS SYSTEM DRUGS, MUSCULO-SKELETAL SYSTEM DRUGS', 'Drug(s)': 'sevoflurane, isoflurane, methoxyflurane, desflurane, enflurane, halothane, succinylcholine'}]
Data entries written to RYR1.csv.
Found 7 drug cards.
[{'Gene': 'RYR1', 'Genotype(s)': 'c.38T>G/c.97A>G', 'Therapeutic Area(s)': 'NERVOUS SYSTEM DRUGS, MUSCULO-SKELETAL SYSTEM DRUGS', 'Drug(s)': 'sevoflurane, isoflurane, methoxyflurane, desflurane, enflurane, halothane, succinylcholine'}]
Data entries written to RYR1.csv.
Found 7 drug cards.
[{'Gene': 'RYR1', 'Genotype(s)': 'c.103T>C/c.38T>G', 'Therapeutic Area(s)': 'NERVOUS SYSTEM DRUGS, MUSCULO-

## One Dropdown Gene

In [68]:
# Locate the dropdown for the gene (HLA-B here) using its id attribute
dropdown = Select(driver.find_element(By.ID, "HLA-B"))

# Extract all options, skipping the '--' placeholder and 'Reference'
options = [option.text for option in dropdown.options if option.text not in ['--', 'Reference']]

# Print the filtered dropdown options
for option in options:
    print(option)

*15:02
*15:11
*57:01
*58:01
Other


In [69]:
import csv
from collections import defaultdict
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Define the scrape_data function for a gene queried with a single genotype
def scrape_data(gene_name, genotype):
    # Initialize a dictionary to store aggregated results
    aggregated_data = defaultdict(lambda: {'Therapeutic Area(s)': set(), 'Drug(s)': set()})
    
    try:
        # Wait for drug cards to load
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "drugCard")))
    except Exception as e:
        print(f"Error waiting for drug cards to load: {e}")
        return

    # Find the section that contains the drug information
    drugs_section = driver.find_elements(By.CLASS_NAME, "drugCard")
    
    # Debugging step: Log the number of drug cards found
    print(f"Found {len(drugs_section)} drug cards.")

    for drug_card in drugs_section:
        try:
            # Check if the 'mb-4' class exists inside the drugCard
            try:
                drug_card.find_element(By.CLASS_NAME, 'mb-4')
            except Exception:  # avoid a bare except so a kernel interrupt is not swallowed
                print("No mb-4 div found, skipping drug.")
                continue

            # Get the drug name
            try:
                drug_name_element = drug_card.find_element(By.TAG_NAME, 'a')
                drug_name = drug_name_element.text.strip()
            except Exception as e:
                print(f"Error finding drug name: {e}")
                continue

            # Check if there's a div with 'text-muted' class (to skip if found)
            try:
                pgx_action_element = drug_card.find_element(By.CLASS_NAME, 'text-muted')
                pgx_action = pgx_action_element.text.strip()
                print(f"Skipping drug {drug_name} due to PGx clinical action: {pgx_action}")
                continue  # Skip this drug
            except Exception:  # avoid a bare except so a kernel interrupt is not swallowed
                pgx_action = ""  # If no 'text-muted' class, proceed

            # Find the therapeutic area by navigating to the drug's <a> tag and then to the previous <dt> element
            try:
                # Find the 'a' tag that matches the drug_name
                drug_link = driver.find_element(By.XPATH, f"//a[contains(text(), '{drug_name}') and contains(@class, 'link-primary')]")
                
                # Navigate to the previous 'dt' element to get the therapeutic area
                therapeutic_area_element = drug_link.find_element(By.XPATH, "preceding::dt[contains(@class, 'text-start')][1]//span")
                therapeutic_area = therapeutic_area_element.text.strip()
            except Exception as e:
                print(f"Error finding therapeutic area for {drug_name}: {e}")
                therapeutic_area = "Unknown"  # Handle the missing therapeutic area gracefully

            # Aggregate therapeutic areas and (drug, therapeutic area) pairs for the single genotype
            aggregated_data[genotype]['Therapeutic Area(s)'].add(therapeutic_area)
            aggregated_data[genotype]['Drug(s)'].add((drug_name, therapeutic_area))
        
        except Exception as e:
            # Log any errors in scraping
            print(f"Error processing drug card: {e}")

    # Prepare data for storing in CSV
    csv_data = []
    for genotype, info in aggregated_data.items():
        # Each drug is stored with its own therapeutic area, so the label always matches
        # the drug (zipping two independent sets would pair them arbitrarily and drop extras)
        drug_with_therapeutic_area = [
            f"{drug}({therapeutic_area})"
            for drug, therapeutic_area in sorted(info['Drug(s)'])
        ]

        csv_data.append({
            'Gene': gene_name,
            'Genotype(s)': genotype,
            'Therapeutic Area(s)': ', '.join(sorted(info['Therapeutic Area(s)'])),
            'Drug(s)': ', '.join(drug_with_therapeutic_area)
        })

    # Print the aggregated data to the console
    print(csv_data)
    
    # Store the aggregated data to the CSV file
    if csv_data:
        store_data_to_csv(csv_data)
    else:
        print("No data scraped.")


In [71]:
# Initialize WebDriverWait
wait = WebDriverWait(driver, 10)

gene_name='HLA-B'

# Find the dropdown for gene
dropdown = Select(driver.find_element(By.ID, gene_name))

# Iterate through each genotype option and process it
for genotype in options:  # This is the list of filtered genotype options
    # Select the genotype by its visible text, since `options` was built from option.text
    dropdown.select_by_visible_text(genotype)

    # Click the "View Recommendations" button
    view_recommendations_button = wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//button[contains(@class, 'btn-primary') and contains(text(), 'View recommendations')]")
    ))
    view_recommendations_button.click()
    
    # Scrape the data for the selected genotype
    scrape_data(gene_name=gene_name, genotype=genotype)

    # Optionally, go back to the previous page or reset the dropdown
    driver.back()
    wait.until(EC.element_to_be_clickable((By.ID, gene_name)))  # Wait until the dropdown is clickable again
    dropdown = Select(driver.find_element(By.ID, gene_name))  # Re-initialize the dropdown to ensure it's fresh

# Close the driver after processing all genotypes
driver.quit()

Found 11 drug cards.
Skipping drug abacavir due to PGx clinical action: No PGx clinical action has been found for the genotypes specified.
For more information, read the following
CPIC Guideline Annotation
DPWG Guideline Annotation
FDA Label Annotation
FDA Table of PGx Associations
Skipping drug allopurinol due to PGx clinical action: No PGx clinical action has been found for the genotypes specified.
For more information, read the following
CPIC Guideline Annotation
DPWG Guideline Annotation
FDA Label Annotation
FDA Table of PGx Associations
Skipping drug cabotegravir / rilpivirine due to PGx clinical action: No PGx clinical action has been found for the genotypes specified.
For more information, read the following
FDA Label Annotation
Skipping drug flucloxacillin due to PGx clinical action: No PGx clinical action has been found for the genotypes specified.
For more information, read the following
DPWG Guideline Annotation
Skipping drug pazopanib due to PGx clinical action: No PGx clin

Skipping drug ribavirin due to PGx clinical action: No PGx clinical action has been found for the genotypes specified.
For more information, read the following
DPWG Guideline Annotation
[{'Gene': 'HLA-B', 'Genotype(s)': '*58:01', 'Therapeutic Area(s)': 'MUSCULO-SKELETAL SYSTEM DRUGS', 'Drug(s)': 'allopurinol'}]
Data entries written to HLA-B.csv.
Found 11 drug cards.
Skipping drug abacavir due to PGx clinical action: No PGx clinical action has been found for the genotypes specified.
For more information, read the following
CPIC Guideline Annotation
DPWG Guideline Annotation
FDA Label Annotation
FDA Table of PGx Associations
Skipping drug allopurinol due to PGx clinical action: No PGx clinical action has been found for the genotypes specified.
For more information, read the following
CPIC Guideline Annotation
DPWG Guideline Annotation
FDA Label Annotation
FDA Table of PGx Associations
Skipping drug cabotegravir / rilpivirine due to PGx clinical action: No PGx clinical action has been fou

## Resume scraping from a specific genotype combination of the gene

In [50]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
from collections import defaultdict


# Define the combination to start with
start_combination = ('c.11126C>T','c.178G>T')
start_found = False  # Flag to track when to start processing

# Example gene name
gene_name = 'RYR1'

# Define the scrape_data function (from the original script)
def scrape_data(gene_name, first_genotype, second_genotype):
    aggregated_data = defaultdict(lambda: {'Therapeutic Area(s)': set(), 'Drug(s)': set()})

    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "drugCard")))
    except Exception as e:
        print(f"Error waiting for drug cards to load: {e}")
        return

    drugs_section = driver.find_elements(By.CLASS_NAME, "drugCard")
    print(f"Found {len(drugs_section)} drug cards.")

    for drug_card in drugs_section:
        try:
            try:
                drug_card.find_element(By.CLASS_NAME, 'mb-4')
            except Exception:  # avoid a bare except so a kernel interrupt is not swallowed
                print("No mb-4 div found, skipping drug.")
                continue

            try:
                drug_name_element = drug_card.find_element(By.TAG_NAME, 'a')
                drug_name = drug_name_element.text.strip()
            except Exception as e:
                print(f"Error finding drug name: {e}")
                continue

            try:
                pgx_action_element = drug_card.find_element(By.CLASS_NAME, 'text-muted')
                pgx_action = pgx_action_element.text.strip()
                print(f"Skipping drug {drug_name} due to PGx clinical action: {pgx_action}")
                continue
            except Exception:  # avoid a bare except so a kernel interrupt is not swallowed
                pgx_action = ""

            try:
                drug_link = driver.find_element(By.XPATH, f"//a[contains(text(), '{drug_name}') and contains(@class, 'link-primary')]")
                therapeutic_area_element = drug_link.find_element(By.XPATH, "preceding::dt[contains(@class, 'text-start')][1]//span")
                therapeutic_area = therapeutic_area_element.text.strip()
            except Exception as e:
                print(f"Error finding therapeutic area for {drug_name}: {e}")
                therapeutic_area = "Unknown"

            key = f"{first_genotype}/{second_genotype}"
            aggregated_data[key]['Therapeutic Area(s)'].add(therapeutic_area)
            aggregated_data[key]['Drug(s)'].add(drug_name)
        
        except Exception as e:
            print(f"Error processing drug card: {e}")

    # Prepare data for storing in CSV
    csv_data = []
    for genotype, info in aggregated_data.items():
        csv_data.append({
            'Gene': gene_name,
            'Genotype(s)': genotype,
            'Therapeutic Area(s)': ', '.join(info['Therapeutic Area(s)']),
            'Drug(s)': ', '.join(info['Drug(s)'])
        })

    print(csv_data)

    if csv_data:
        store_data_to_csv(csv_data)
    else:
        print("No data scraped.")

def store_data_to_csv(data, filename='RYR1.csv'):
    fieldnames = ['Gene', 'Genotype(s)', 'Therapeutic Area(s)', 'Drug(s)']
    with open(filename, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        file.seek(0, 2)
        if file.tell() == 0:
            writer.writeheader()
        writer.writerows(data)

    print(f"Data entries written to {filename}.")

# Initialize WebDriverWait
wait = WebDriverWait(driver, 10)

# Process each combination
for first, second in combinations:
    # Start only after reaching the desired combination
    if not start_found:
        if (first, second) == start_combination:
            start_found = True
        else:
            continue

    # Select the first dropdown
    first_dropdown = Select(driver.find_element(By.ID, gene_name))
    first_dropdown.select_by_value(first)

    # Select the second dropdown
    second_dropdown = Select(driver.find_element(By.ID, f"{gene_name}2"))
    second_dropdown.select_by_value(second)
    
    # Click the "View Recommendations" button
    view_recommendations_button = wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//button[contains(@class, 'btn-primary') and contains(text(), 'View recommendations')]")
    ))
    view_recommendations_button.click()
    
    # Scrape the data
    scrape_data(gene_name, first, second)

    # Optionally, go back to the previous page or reset the dropdowns
    driver.back()
    wait.until(EC.element_to_be_clickable((By.ID, gene_name)))  # Wait until the first dropdown is clickable

# Close the driver
driver.quit()


Found 7 drug cards.
[{'Gene': 'RYR1', 'Genotype(s)': 'c.11126C>T/c.178G>T', 'Therapeutic Area(s)': 'MUSCULO-SKELETAL SYSTEM DRUGS, NERVOUS SYSTEM DRUGS', 'Drug(s)': 'succinylcholine(MUSCULO-SKELETAL SYSTEM DRUGS), halothane(NERVOUS SYSTEM DRUGS)'}]
Data entries written to RYR1.csv.
Found 7 drug cards.
[{'Gene': 'RYR1', 'Genotype(s)': 'c.11132C>T/c.178G>T', 'Therapeutic Area(s)': 'MUSCULO-SKELETAL SYSTEM DRUGS, NERVOUS SYSTEM DRUGS', 'Drug(s)': 'succinylcholine(MUSCULO-SKELETAL SYSTEM DRUGS), halothane(NERVOUS SYSTEM DRUGS)'}]
Data entries written to RYR1.csv.
Found 7 drug cards.
[{'Gene': 'RYR1', 'Genotype(s)': 'c.11266C>G/c.178G>T', 'Therapeutic Area(s)': 'MUSCULO-SKELETAL SYSTEM DRUGS, NERVOUS SYSTEM DRUGS', 'Drug(s)': 'succinylcholine(MUSCULO-SKELETAL SYSTEM DRUGS), halothane(NERVOUS SYSTEM DRUGS)'}]
Data entries written to RYR1.csv.
Found 7 drug cards.
[{'Gene': 'RYR1', 'Genotype(s)': 'c.11314C>T/c.178G>T', 'Therapeutic Area(s)': 'MUSCULO-SKELETAL SYSTEM DRUGS, NERVOUS SYSTEM DRUGS

WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: chrome=129.0.6668.89)
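
A crash like this one means the long run has to be restarted by hand-picking `start_combination`. As an alternative sketch, the restart point could be inferred from the CSV itself by skipping every combination whose `first/second` key already appears in the `Genotype(s)` column. `load_done_genotypes` and `pending_combinations` are hypothetical helpers, not part of the original script. One caveat: a combination that produced no actionable drugs writes no row, so it would be visited again on restart.

```python
import csv
import os
import tempfile


def load_done_genotypes(csv_path):
    """Return the set of 'Genotype(s)' values already in the CSV (empty if the file is missing)."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline='', encoding='utf-8') as handle:
        return {row['Genotype(s)'] for row in csv.DictReader(handle)}


def pending_combinations(all_combinations, csv_path):
    """Keep only the combinations whose 'first/second' key is not in the CSV yet."""
    done = load_done_genotypes(csv_path)
    return [(a, b) for a, b in all_combinations if f"{a}/{b}" not in done]


# Demo: a CSV that already holds the first of two combinations
demo_path = os.path.join(tempfile.mkdtemp(), 'RYR1.csv')
with open(demo_path, 'w', newline='', encoding='utf-8') as handle:
    writer = csv.DictWriter(
        handle, fieldnames=['Gene', 'Genotype(s)', 'Therapeutic Area(s)', 'Drug(s)'])
    writer.writeheader()
    writer.writerow({'Gene': 'RYR1', 'Genotype(s)': 'c.38T>G/c.38T>G',
                     'Therapeutic Area(s)': 'NERVOUS SYSTEM DRUGS', 'Drug(s)': 'halothane'})

combos = [('c.38T>G', 'c.38T>G'), ('c.38T>G', 'c.51_53del')]
pending = pending_combinations(combos, demo_path)
print(pending)
```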


## Combine the CSVs of all genes

In [128]:
import os
import pandas as pd

# Folder where CSV files are stored
folder_path = '/Users/meghajotangiya/Desktop/EPICS/OneDrug'

# List to hold the file paths and their sizes
file_info_list = []

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(folder_path, filename)
        # Get the file size
        file_size = os.path.getsize(file_path)
        # Append the file path and size as a tuple to the list
        file_info_list.append((file_path, file_size))

# Sort the files based on their size in ascending order
file_info_list.sort(key=lambda x: x[1])

# List to hold the data from all CSV files
csv_list = []

# Read and combine CSVs in ascending order of file size
for file_info in file_info_list:
    file_path = file_info[0]
    try:
        csv_data = pd.read_csv(file_path)
        csv_list.append(csv_data)
    except pd.errors.ParserError as e:
        print(f"Error reading {file_path}: {e}")
        continue  # Skip the problematic file and continue with the next

# Combine all valid CSVs in the list into a single DataFrame
if csv_list:
    combined_csv = pd.concat(csv_list, ignore_index=True)
    # Output to a single CSV file
    combined_csv.to_csv('Testable_PGx_Genes.csv', index=False)
    print("CSV files combined successfully into 'Testable_PGx_Genes.csv'.")
else:
    print("No valid CSV files were combined.")



CSV files combined successfully into 'Testable_PGx_Genes.csv'.
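
If the combined file is ever written into the same folder as the per-gene files, a second run would read its own earlier output and duplicate every row. The sketch below shows one way to guard against that, using only the standard library instead of pandas. `combine_csvs` is a hypothetical helper, the demo rows are placeholders, and the sketch assumes every per-gene file shares the same header.

```python
import csv
import tempfile
from pathlib import Path


def combine_csvs(folder, output_name='Testable_PGx_Genes.csv'):
    """Merge all CSVs in folder (smallest first) into output_name, skipping the output itself."""
    folder = Path(folder)
    sources = sorted(
        (p for p in folder.glob('*.csv') if p.name != output_name),
        key=lambda p: p.stat().st_size,
    )
    rows, fieldnames = [], None
    for path in sources:
        with path.open(newline='', encoding='utf-8') as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames  # stays None for an empty file
            rows.extend(reader)
    if fieldnames is None:
        return None  # nothing to combine
    output_path = folder / output_name
    with output_path.open('w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return output_path


# Demo with two placeholder per-gene files in a throwaway folder
demo = Path(tempfile.mkdtemp())
header = ['Gene', 'Genotype(s)', 'Therapeutic Area(s)', 'Drug(s)']
for gene, genotype in [('HLA-B', '*58:01'), ('RYR1', 'c.38T>G/c.38T>G')]:
    with (demo / f'{gene}.csv').open('w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerow([gene, genotype, 'EXAMPLE AREA', 'example drug'])

combine_csvs(demo)
out = combine_csvs(demo)  # the second run does not re-read its own output
with out.open(newline='', encoding='utf-8') as handle:
    combined = list(csv.DictReader(handle))
print(len(combined))
```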
