### Objective 

The code below collects the following information for each exhibitor listed in the MEDICA trade fair directory at https://www.medica-tradefair.com/vis/v1/en/search?ticket=g_u_e_s_t&_query=&f_type=profile:

· Company name

· Description (if any)

· Country

· Website (if any)

The directory crawl found 5,793 company entries; details were scraped for about 3,930 of them.

### Virtual Python environment with conda

The existing conda environment case-study was reused, with a few additional libraries installed, such as selenium and webdriver_manager.
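
Before running the notebook, it can help to confirm that the environment actually provides the packages the code imports. The sketch below is one way to check this. The helper name `missing_packages` is hypothetical, and `openpyxl` is listed on the assumption that pandas uses it to write the `.xlsx` files.

```python
import importlib.util

# Import names used in this notebook. "openpyxl" is an assumption: it is a
# common engine that pandas relies on when writing .xlsx files.
REQUIRED = ["requests", "bs4", "pandas", "selenium", "webdriver_manager", "openpyxl"]


def missing_packages(names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

Any name reported as missing can then be installed into the environment before the scraping cells are run.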



In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

In [None]:

# Base URL
base_url = "https://www.medica-tradefair.com"

# Starting page for alphabet navigation
landing_page_url = "https://www.medica-tradefair.com/vis/v1/en/directory/b?oid=80398&lang=2"

# Send a GET request to fetch the alphabet page; stop early on HTTP errors
response = requests.get(landing_page_url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
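
The scraped `href` values are later joined to `base_url` by plain string concatenation, which works as long as every link is a root-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The helper name `to_absolute` and the example paths are illustrative only.

```python
from urllib.parse import urljoin

base_url = "https://www.medica-tradefair.com"


def to_absolute(href, base=base_url):
    """Resolve a scraped href against the site root."""
    return urljoin(base, href)


# Hypothetical hrefs, used only to show the behaviour
print(to_absolute("/vis/v1/en/directory/a"))
print(to_absolute("https://example.com/profile"))
```

A root-relative path is attached to the site root, while an absolute URL is returned unchanged instead of being glued onto `base_url`.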


In [None]:

# Find the div containing alphabet links
alphabet_div = soup.find("div", class_="list-tag-labels no-print")

# Initialize a list to store all company links across all letters
all_company_links = []

# Iterate over each alphabet link, A-Z and "Other"
for link_tag in alphabet_div.find_all("a"):
    letter_href = link_tag.get("href")
    letter_url = base_url + letter_href
    letter = link_tag.get_text(strip=True)

    # Send request to the specific letter's page
    letter_response = requests.get(letter_url, timeout=30)
    letter_soup = BeautifulSoup(letter_response.content, 'html.parser')
    
    # Walk through every results page for the current letter
    while True:
        # Extract only the divs with class "exh-table-col exh-table-col--name"
        companies_section = letter_soup.find_all("div", class_="exh-table-col exh-table-col--name")

        # Find all individual company links within these divs
        for company_div in companies_section:
            company_tag = company_div.find("a", href=True)
            if company_tag:
                # Extract the company name and URL
                company_name = company_tag.find("h2", class_="exh-table-item__name").text.strip()
                company_href = company_tag.get("href")
                company_url = base_url + company_href
                
                # Append to list
                all_company_links.append({
                    "Letter": letter,
                    "Company Name": company_name,
                    "Company URL": company_url
                })
        # Check for pagination (next page link)
        next_page_tag = letter_soup.find("a", class_="pagination__btn--next")
        if next_page_tag and "href" in next_page_tag.attrs:
            # Short delay before each follow-up request to prevent rapid requests
            time.sleep(1)
            next_page_url = base_url + next_page_tag["href"]
            letter_response = requests.get(next_page_url, timeout=30)
            letter_soup = BeautifulSoup(letter_response.content, 'html.parser')
        else:
            break
print("Total companies found:", len(all_company_links))


Total companies found: 5793
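
The count above is the number of rows collected, which is not necessarily the number of distinct companies. Assuming a company could be listed under more than one letter (the notebook does not confirm this), a quick check such as the following sketch would reveal duplicates before the slower Selenium step. The function name `dedupe_by_url` and the sample records are hypothetical.

```python
def dedupe_by_url(records):
    """Keep the first record for each "Company URL", preserving order."""
    seen = set()
    unique = []
    for record in records:
        url = record["Company URL"]
        if url not in seen:
            seen.add(url)
            unique.append(record)
    return unique


# Made-up sample records in the same shape as all_company_links
sample = [
    {"Letter": "A", "Company Name": "Acme Medical", "Company URL": "https://example.com/a"},
    {"Letter": "O", "Company Name": "Acme Medical", "Company URL": "https://example.com/a"},
    {"Letter": "B", "Company Name": "Beta Devices", "Company URL": "https://example.com/b"},
]
unique_links = dedupe_by_url(sample)
print(len(sample), "rows,", len(unique_links), "unique URLs")
```

If the two numbers match on the real data, no deduplication is needed and the list can be used as it is.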


### Using Selenium to click the "Company data" button and extract the required information from the pop-up

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

In [None]:
# Initialize the WebDriver
options = webdriver.ChromeOptions()
options.add_argument("--disable-popup-blocking")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
prefs = {
    "profile.managed_default_content_settings.images": 2,
    "profile.default_content_setting_values.media_stream": 2
}
options.add_experimental_option("prefs", prefs)
service = Service(r'C:\Users\prade\Downloads\chromedriver-win64\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=options)
driver.implicitly_wait(5)  


In [None]:

# Initialize list for collected data
all_company_data = []

# Apply the page-load timeout once, before any page is requested
driver.set_page_load_timeout(180)

# Loop through each company URL in all_company_links
for idx, company_data in enumerate(all_company_links):
    url = company_data['Company URL']
    try:
        driver.get(url)
        # Try to find and click the "Company data" button to get additional information
        try:
            # Wait for the "Company data" button to be clickable
            company_data_button = WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.XPATH, "//button[.//span[text()='Company data']]"))
            )
            # Attempting to click the button using JavaScript
            driver.execute_script("arguments[0].click();", company_data_button)

            # Wait for pop-up 
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "profile-title"))
            )

            # Extract description if available (single lookup; an empty list means absent)
            description_elements = driver.find_elements(By.CSS_SELECTOR, ".profile-details-text")
            description = description_elements[0].text if description_elements else "N/A"

            # Extract country if available
            country_elements = driver.find_elements(By.CSS_SELECTOR, ".address-country")
            country = country_elements[0].text if country_elements else "N/A"

            # Extract website link if available
            website = ""
            try:
                website = driver.find_element(By.CSS_SELECTOR, ".exh-contact__links a").get_attribute("href")
            except Exception:
                pass  
            
        except Exception as e:
            print(f"Warning: Unable to retrieve additional data for {company_data['Company Name']} from the pop-up: {e}")
            description = "N/A"
            country = "N/A"
            website = ""

        # Add extracted data to the list
        all_company_data.append({
            "Company Name": company_data['Company Name'],
            "Description": description,
            "Country": country,
            "Website": website
        })
        # Save every 50 records to reduce data loss risk
        if (idx + 1) % 50 == 0:
            pd.DataFrame(all_company_data).to_excel("medica_exhibitor_partial.xlsx", index=False)
            print(f"Saved {idx + 1} records as a partial backup.")

    except Exception as e:
        print(f"Error retrieving data for {url}: {e}")

# Close the driver and save the final data
driver.quit()

# Save all collected data to an Excel file
df = pd.DataFrame(all_company_data)
df.to_excel("medica_exhibitor.xlsx", index=False)
print("Data collection complete.")


Saved 50 records as a partial backup.
Saved 100 records as a partial backup.
Stacktrace:
	GetHandleVerifier [0x00007FF6F57E3AF5+28005]
	(No symbol) [0x00007FF6F57483F0]
	(No symbol) [0x00007FF6F55E580A]
	(No symbol) [0x00007FF6F5635A3E]
	(No symbol) [0x00007FF6F5635D2C]
	(No symbol) [0x00007FF6F567EA97]
	(No symbol) [0x00007FF6F565BA7F]
	(No symbol) [0x00007FF6F567B8B3]
	(No symbol) [0x00007FF6F565B7E3]
	(No symbol) [0x00007FF6F56275C8]
	(No symbol) [0x00007FF6F5628731]
	GetHandleVerifier [0x00007FF6F5AD646D+3118813]
	GetHandleVerifier [0x00007FF6F5B26CC0+3448624]
	GetHandleVerifier [0x00007FF6F5B1CF3D+3408301]
	GetHandleVerifier [0x00007FF6F58AA44B+841403]
	(No symbol) [0x00007FF6F575344F]
	(No symbol) [0x00007FF6F574F4C4]
	(No symbol) [0x00007FF6F574F65D]
	(No symbol) [0x00007FF6F573EBB9]
	BaseThreadInitThunk [0x00007FF9D534259D+29]
	RtlUserThreadStart [0x00007FF9D762AF38+40]

Saved 150 records as a partial backup.
Stacktrace:
	GetHandleVerifier [0x00007FF6F57E3AF5+28005]
	(No symbol
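
The stack traces above indicate that some page loads or element look-ups failed during the run. One common mitigation, which is not part of the original notebook, is to retry a failing step a few times with an increasing delay. The sketch below is a minimal version of that idea; `retry_with_backoff`, its parameters, and the `flaky_load` stand-in are hypothetical names.

```python
import time


def retry_with_backoff(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action(); on an exception, wait and retry up to `attempts` times.

    The delay doubles after each failure. The last exception is re-raised
    if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Stand-in for something like `lambda: driver.get(url)`: fails twice, then succeeds
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "loaded"


# Record the requested delays instead of actually sleeping
delays = []
result = retry_with_backoff(flaky_load, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In the scraping loop, the wrapped action could be the page load for a single company, so that a transient connectivity problem does not immediately cost that record.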

### Summary
The approach was to first collect the URLs of all company pages that contain the required information, covering the letters A-Z and "Other". The script then loops over these records, opens each URL, and clicks the "Company data" button, which opens a pop-up containing the information to scrape. The collected data is written to an Excel file after every 50 records to limit data loss, and the complete results are saved to a final Excel file as instructed. Due to time constraints and connectivity issues, 3,930 of the 5,793 records were collected.

### References

https://www.geeksforgeeks.org/interacting-with-webpage-selenium-python/?ref=lbp

https://www.testim.io/blog/selenium-click-button/

https://www.geeksforgeeks.org/set_page_load_timeout-driver-method-selenium-python/

https://stackoverflow.com/questions/71722222/how-to-pull-the-href-information-from-specific-class-using-selenium-and-python