### Objective 

The code below scrapes the trade fair exhibitor list at https://london.vetshow.com/exhibitor-list, which returned 428 companies in total, and collects the following fields for each company:

- Company Name
- Description (if any)
- Stand (if any)

### Virtual Python environment with conda

Created a new virtual environment named case-study with Python 3.9 using the command below. A virtual environment keeps the packages installed for this project separate from those of other projects.

$ conda create -y python=3.9 --name case-study

Created a requirements.txt file in the project root listing the installed packages, which makes it easy to see which packages are required and which versions they need. Install them with:

$ pip install -r requirements.txt
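
As a quick sanity check after installation, the sketch below reports which modules cannot be found in the active environment. The helper name `missing_packages` is hypothetical, and the module list is an assumption based on the imports used in this notebook (`openpyxl` is assumed because pandas commonly relies on it to write `.xlsx` files).

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in the active environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed from this notebook; beautifulsoup4 is imported as "bs4".
required = ["requests", "bs4", "pandas", "openpyxl"]
print("Missing packages:", missing_packages(required))
```

An empty list means every module in `required` is importable from the current environment.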

In [4]:
# Import necessary libraries 
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

In [5]:
# Base URL for the exhibitor list and company pages
main_base_url = "https://london.vetshow.com/exhibitor-list?page="
company_base_url = "https://london.vetshow.com/"

# Headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
}

# Step 1: Fetch the first page to determine the total number of pages
response = requests.get(main_base_url + "1", headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
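
The request above has no timeout and does not check the HTTP status, so a slow or failing response could go unnoticed. The sketch below shows one possible retry wrapper. The name `fetch_with_retry` and its parameters are hypothetical; it accepts any callable, so it could wrap something like `lambda u: requests.get(u, headers=headers, timeout=10)`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times, sleeping `delay` seconds between attempts.

    `fetch` is any callable that returns a response or raises an exception on failure.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_error
```

When used with `requests`, the wrapped callable would typically also call `response.raise_for_status()` so that HTTP error codes count as failures and trigger a retry.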


In [6]:
# Find the pagination section and get the last page number
pagination = soup.find('ul', class_="pagination__list")
last_page_link = pagination.find_all('a', element="pages")[-1]
total_pages = int(last_page_link.get('data-page'))

print(f"Total number of pages found: {total_pages}")


Total number of pages found: 44


In [7]:
# Create an empty list to hold each company's data across all pages
companies_data = []

# Step 2: Loop through each page to gather company data and the links to the detail pages
for page in range(1, total_pages + 1):
    url = f"{main_base_url}{page}"
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Locate each company on the current page
    company_list = soup.find_all('li', class_="m-exhibitors-list__items__item")

    for company in company_list:
        # Extract Company Name
        company_name = company.find("h2", class_="m-exhibitors-list__items__item__header__title").get_text(strip=True)

        # Extract Description (truncated on main page)
        description = company.find("div", class_="m-exhibitors-list__items__item__body__description")
        description_text = description.get_text(strip=True) if description else None

        # Extract Stand (if available)
        stand = company.find("div", class_="m-exhibitors-list__items__item__header__meta__stand")
        stand_text = stand.get_text(strip=True).replace("Stand:", "").strip() if stand else None
        if description_text is not None:
            # Extract the link to the company's own page
            company_relative_link = company.find("a", class_="js-librarylink-entry").get("href")
            company_full_url = company_base_url + company_relative_link

            # The list page truncates descriptions, so fetch the full text from the company page
            company_response = requests.get(company_full_url, headers=headers)
            company_soup = BeautifulSoup(company_response.text, 'html.parser')

            # Locate the full description; fall back to the truncated one if it is missing
            full_description = company_soup.find("div", class_="m-exhibitor-entry__item__body__description")
            description_text = full_description.get_text(strip=True) if full_description else description_text

        # Append all collected data for each company
        companies_data.append({
            "Company Name": company_name,
            "Description": description_text,
            "Stand": stand_text
        })

        # Adding a delay to avoid overwhelming the server
        time.sleep(1)
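
Joining the base URL and the relative link by plain string concatenation depends on how the slashes happen to line up. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles both cases. The relative paths below are made-up examples, not links taken from the site.

```python
from urllib.parse import urljoin

company_base_url = "https://london.vetshow.com/"

# Hypothetical relative links, with and without a leading slash
relative_links = ["exhibitors/example-company", "/exhibitors/example-company"]

# urljoin resolves each link against the base URL without doubling the slash
joined = [urljoin(company_base_url, link) for link in relative_links]
print(joined)
```

Both entries resolve to `https://london.vetshow.com/exhibitors/example-company`, whereas concatenation would produce a double slash for the second one.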


In [8]:
# Step 3: Save the data to an Excel file
df = pd.DataFrame(companies_data)
output_file = "LondonVetShow_Exhibitors_Full_Description.xlsx"
df.to_excel(output_file, index=False)

print(f"Data scraped successfully from {total_pages} pages and saved to {output_file}")


Data scraped successfully from 44 pages and saved to LondonVetShow_Exhibitors_Full_Description.xlsx


Some companies appear more than once with different stands. The next step removes these duplicate rows and combines the stand values of each company into a single row.

In [9]:
# Load the Excel file with the scraped data
input_file = "LondonVetShow_Exhibitors_Full_Description.xlsx"
df = pd.read_excel(input_file)

# Group by Company Name and merge the Stand values of duplicates into a comma-separated string
merged_df = df.groupby('Company Name', as_index=False).agg({
    'Description': 'first',  # Take the first non-null description
    'Stand': lambda x: ', '.join(x.dropna().unique())  # Join unique stand values with commas
})

# Save the file
output_file = "LondonVetShow_Exhibitors.xlsx"
merged_df.to_excel(output_file, index=False)

print(f"Duplicates merged and data saved to {output_file}")

Duplicates merged and data saved to LondonVetShow_Exhibitors.xlsx
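
To make the grouping logic explicit, the sketch below performs the same kind of merge with plain Python dictionaries on made-up rows. It is an illustration only, not a replacement for the pandas code; the function name `merge_duplicates` and the sample data are hypothetical. One difference to note: this version keeps companies in first-seen order, while `groupby` sorts by company name by default.

```python
def merge_duplicates(rows):
    """Merge rows sharing a Company Name: keep the first non-empty description
    and join the unique stand values with commas."""
    merged = {}
    for row in rows:
        name = row["Company Name"]
        entry = merged.setdefault(
            name, {"Company Name": name, "Description": None, "Stands": []}
        )
        # Keep the first description that is not missing
        if entry["Description"] is None and row["Description"] is not None:
            entry["Description"] = row["Description"]
        # Collect each stand once, skipping missing values
        stand = row["Stand"]
        if stand is not None and stand not in entry["Stands"]:
            entry["Stands"].append(stand)
    return [
        {
            "Company Name": e["Company Name"],
            "Description": e["Description"],
            "Stand": ", ".join(e["Stands"]),
        }
        for e in merged.values()
    ]


# Made-up sample rows
rows = [
    {"Company Name": "Acme Vet", "Description": None, "Stand": "A10"},
    {"Company Name": "Acme Vet", "Description": "Supplies", "Stand": "B20"},
    {"Company Name": "Beta Pets", "Description": "Food", "Stand": None},
]
print(merge_duplicates(rows))
```

Here the two "Acme Vet" rows collapse into one row with the stand value "A10, B20", and a company without any stand ends up with an empty string.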


### Summary
The approach was to determine the total number of pages, then iterate through each page to extract the relevant company details, and finally save the collected data to an Excel file. Where a company had a description on the list page, the script also visited that company's own page to collect the complete description. Since some company names appeared more than once with different stands, I added logic to group the rows by company name and combine their stand values.

### Reference

https://www.geeksforgeeks.org/how-to-scrape-multiple-pages-of-a-website-using-python/

https://www.geeksforgeeks.org/http-headers-user-agent/