# Step 1: Scraping and Cleaning

In this notebook we build the base of our database by scraping Austrian startup listings from the following site:

https://www.eu-startups.com/

The following code will extract the information for each of the startups based on the listing example below:

In [1]:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import pandas as pd
import time
import random
import re
import json

In [2]:
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=openai_api_key)

# 1.1 Scraping the listings

## Example Listing: 

### Rendered Listing:

<div style="text-align: center;">
  <img src="../images/startupseu_listing.png" alt="Listing Example" width="80%">
</div>

### HTML Structure:

```html
<div id="wpbdp-listing-310342" class="wpbdp-listing">
  <div class="listing-title">
    <h3><a href="https://www.eu-startups.com/directory/digna-gmbh/">digna GmbH</a></h3>
  </div>
  
  <div class="excerpt-content">
    <div class="listing-thumbnail">
      <img src="https://www.eu-startups.com/wp-content/uploads/2024/08/digna-logo.png" alt="digna - data quality solution">
    </div>
    
    <div class="listing-details">
      <div class="wpbdp-field-display wpbdp-field-business_name">
        <span class="field-label">Business Name:</span>
        <div class="value"><a href="#">digna GmbH</a></div>
      </div>
      
      <div class="wpbdp-field-display wpbdp-field-category">
        <span class="field-label">Category:</span>
        <div class="value"><a href="#">Austria</a></div>
      </div>
      
      <div class="wpbdp-field-display wpbdp-field-based_in">
        <span class="field-label">Based in:</span>
        <div class="value">Vienna</div>
      </div>
      
      <div class="wpbdp-field-display wpbdp-field-tags">
        <span class="field-label">Tags:</span>
        <div class="value">data quality, data observability tool, data quality solution, data warehouses, data centers</div>
      </div>
      
      <div class="wpbdp-field-display wpbdp-field-founded">
        <span class="field-label">Founded:</span>
        <div class="value">2020</div>
      </div>
    </div>
  </div>
</div>
```

### Extracted Fields:

| Field | Value | Source |
|-------|-------|--------|
| **name** | digna GmbH | `<a>` tag in `.listing-title` |
| **link** | https://www.eu-startups.com/directory/digna-gmbh/ | `href` attribute in `.listing-title > a` |
| **category** | Austria | `.wpbdp-field-category .value` |
| **based_in** | Vienna | `.wpbdp-field-based_in .value` |
| **tags** | data quality, data observability tool, ... | `.wpbdp-field-tags .value` |
| **founded** | 2020 | `.wpbdp-field-founded .value` |
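
As a rough illustration of the record shape in the table above, the fields could be modelled as a small dataclass. `Listing` and `tag_list` are hypothetical names introduced here for illustration; the notebook itself stores each listing as a plain dictionary and keeps `tags` as one raw string.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Listing:
    """One scraped directory entry (hypothetical helper, not used by the notebook)."""
    name: Optional[str] = None
    link: Optional[str] = None
    category: Optional[str] = None
    based_in: Optional[str] = None
    tags: Optional[str] = None      # raw comma-separated string, as shown on the site
    founded: Optional[str] = None   # kept as text; the scraper does not cast it

    def tag_list(self) -> List[str]:
        """Split the raw tags string on commas and trim surrounding whitespace."""
        if not self.tags:
            return []
        return [t.strip() for t in self.tags.split(",") if t.strip()]


# Values copied from the example listing above.
example = Listing(
    name="digna GmbH",
    link="https://www.eu-startups.com/directory/digna-gmbh/",
    category="Austria",
    based_in="Vienna",
    tags="data quality, data observability tool, data quality solution, data warehouses, data centers",
    founded="2020",
)
print(example.tag_list())
```

Every field defaults to `None`, which mirrors how the scraper treats fields that are missing on a listing.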


## Scraping initial listings

### Scraping function

This is the main scraping function used to extract all listings from the directory. It works as follows:

1. Fetch the page with the requests library and parse it with BeautifulSoup, the scraping tool.

2. Find all the matching containers, each representing a startup.

3. Extract the information from each listing, following the HTML structure shown above.

In [5]:
def extract_listings_from_page(url, headers=None):

    # 1. HTTP Get request to fetch the page content
    response = requests.get(url, headers=headers, timeout=30)  # timeout value is an assumption; it prevents a stalled request from hanging forever
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')
    
    # 2. Finding all the listing containers matching our pattern "wpbdp-listing-..."
    listings = soup.find_all('div', id=lambda value: value and value.startswith("wpbdp-listing-"))
    data = []

    for listing in listings:
        # Initialize a dictionary for this listing
        listing_data = {
            'name': None,
            'link': None,
            'category': None,
            'based_in': None,
            'tags': None,
            'founded': None
        }

        # 3. Extract data from each listing
        
        # 3.1 Extract name and link from the listing-title block
        title_div = listing.find('div', class_='listing-title')
        if title_div:
            a_tag = title_div.find('a')
            if a_tag:
                listing_data['name'] = a_tag.get_text(strip=True)
                listing_data['link'] = a_tag.get('href')
        
        # 3.2 Extract details from the listing-details block
        details_div = listing.find('div', class_='listing-details')
        if details_div:
            field_divs = details_div.find_all('div', class_=lambda x: x and "wpbdp-field-display" in x)
            for field in field_divs:
                label_span = field.find('span', class_='field-label')
                if not label_span:
                    continue
                # Normalize the label for matching
                label = label_span.get_text(strip=True).rstrip(':').lower()
                
                value_div = field.find('div', class_='value')
                value = value_div.get_text(strip=True) if value_div else None

                if label == 'business name':
                    continue
                elif label == 'category':
                    listing_data['category'] = value
                elif label == 'based in':
                    listing_data['based_in'] = value
                elif label == 'tags':
                    listing_data['tags'] = value
                elif label == 'founded':
                    listing_data['founded'] = value
        
        data.append(listing_data)
    
    return data
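
The label matching in step 3.2 can also be written as a lookup table instead of an `if`/`elif` chain. The sketch below is one possible variant and not what the notebook runs; `LABEL_TO_KEY`, `normalize_label` and `assign_field` are names made up for this example.

```python
# Map normalized on-page labels to the keys used in listing_data.
# "business name" is deliberately absent: the name comes from the title block.
LABEL_TO_KEY = {
    "category": "category",
    "based in": "based_in",
    "tags": "tags",
    "founded": "founded",
}


def normalize_label(raw_label):
    """Mirror the notebook's normalization: trim, drop the trailing colon, lowercase."""
    return raw_label.strip().rstrip(":").lower()


def assign_field(listing_data, raw_label, value):
    """Store value under the mapped key; silently ignore labels we do not track."""
    key = LABEL_TO_KEY.get(normalize_label(raw_label))
    if key is not None:
        listing_data[key] = value
    return listing_data


record = {"category": None, "based_in": None, "tags": None, "founded": None}
assign_field(record, "Based in:", "Vienna")
assign_field(record, "Business Name:", "digna GmbH")  # not in the table, so ignored
print(record)
```

With this layout, tracking an additional field only requires one new entry in the dictionary.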

### Scraping Process

In [6]:
# Define headers with a custom User-Agent that identifies the scraper and gives a contact URL.
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0; +http://example.com/contact)'
}

base_url = "https://www.eu-startups.com/directory/wpbdp_category/austrian-startups/"


Here is an example of what scraping a single page returns, before we run the full loop:

In [10]:
example_url = f"{base_url}page/50/"

example_page_listings = extract_listings_from_page(example_url, headers=headers)

pd.DataFrame.from_dict([example_page_listings[0]])

Unnamed: 0,name,link,category,based_in,tags,founded
0,FetoLife Science,https://www.eu-startups.com/directory/fetolife...,Austria,Vienna,"Apps, Health Care, Medical, mHealth, Personal ...",2019


In [None]:
all_listings = []

# Loop through all of the pages and scrape the data
for page in tqdm(range(1, 70), desc="Scraping pages"):
    # For page 1, use the base URL
    if page == 1:
        url = base_url
    # Construct the URL for other pages
    else:
        url = f"{base_url}page/{page}/"
    
    try:
        page_listings = extract_listings_from_page(url, headers=headers)
        all_listings.extend(page_listings)
    except Exception as e:
        print(f"Error scraping {url}: {e}")
    
    # Rate limiting: sleep for a random time between 1 and 3 seconds
    delay = random.uniform(1, 3)
    time.sleep(delay)

print("Total listings extracted:", len(all_listings))
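
The loop above logs an error and moves on, so a page that fails once is not scraped again. One way to soften that is a small retry wrapper with exponential backoff. This is a minimal sketch: `fetch_with_retries`, its parameters, and the fake `flaky_fetch` are assumptions made for the example, not part of the notebook.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt and try again.

    `fetch` is any callable that raises on failure, for example a lambda that
    wraps extract_listings_from_page. The last error is re-raised when all
    attempts are used up.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake fetcher that fails twice and then succeeds.
calls = {"n": 0}
waits = []  # collects the delays instead of actually sleeping


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return [{"name": "example", "link": url}]


demo_result = fetch_with_retries(flaky_fetch, "https://example.com/page/2/", sleep=waits.append)
print(demo_result, waits)
```

Passing `sleep` as a parameter keeps the demonstration fast; in the scraping loop the default `time.sleep` would be used.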

## Preprocessing

To make the later steps easier, we preprocess the scraped data as follows:

1. Convert it to a pandas DataFrame and do some basic cleaning

2. Remove all listings that are not in Austria, using the OpenAI API

### Pandas Dataframe and Cleaning

In [6]:
df_listings = pd.read_csv("./data/eustartup_listings.csv")

In [7]:
df_listings = pd.DataFrame.from_dict(all_listings)

# Remove duplicates based on 'link' and 'name' columns
df_listings.drop_duplicates(subset=['link'], keep='first', inplace=True)
df_listings.drop_duplicates(subset=['name'], keep='first', inplace=True)

# Clean unusual line terminators found in the listings
def clean_unusual_terminators(text):
    if isinstance(text, str):
        return text.replace('\u2028', ' ').replace('\u2029', ' ')
    return text

df_listings = df_listings.map(clean_unusual_terminators)
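
To show why these characters are worth removing: Python's `str.splitlines()` treats U+2028 and U+2029 as line boundaries, so a value containing one can be split in two by line-based tooling. The sample string below is made up, and the helper is repeated only so the snippet runs on its own.

```python
def clean_unusual_terminators(text):
    # Same logic as the notebook's helper, repeated so the example is self-contained.
    if isinstance(text, str):
        return text.replace("\u2028", " ").replace("\u2029", " ")
    return text


raw = "AI tutoring\u2028for schools"   # made-up value containing U+2028 (line separator)
print(raw.splitlines())                 # the value is split into two "lines"

cleaned = clean_unusual_terminators(raw)
print(cleaned.splitlines())             # a single line after cleaning

print(clean_unusual_terminators(2020))  # non-string values pass through unchanged
```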

### Location filtering using OpenAI API

Here we use the OpenAI API (the `gpt-4o` model) to filter out non-Austrian startups based on the `based_in` column. We also add the postal code and the region for later steps.

In [8]:
def classify_cities_openai(cities_list):
    
    prompt = f"""Given this list of city/location names from a startup database, please:
1. Identify which are Austrian locations (cities, towns, or regions in Austria)
2. Standardize the names to proper English spelling
3. Add the main postal code (PLZ) for each Austrian location
4. Add the Austrian federal state (Bundesland) for each location
5. Return a JSON object with two keys:
   - "austrian": a dict mapping original name -> {{"standardized": "English name", "postal_code": "PLZ", "region": "Bundesland"}}
   - "non_austrian": a list of locations not in Austria

Austrian federal states are: Vienna, Lower Austria, Upper Austria, Styria, Tyrol, Carinthia, Salzburg, Vorarlberg, Burgenland.
For regions or areas without a specific postal code, use the postal code of the main city in that region.

Cities to classify:
{json.dumps(cities_list)}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    
    content = response.choices[0].message.content
    return json.loads(content)

In [9]:
# Get unique cities
unique_cities = df_listings['based_in'].dropna().unique().tolist()

# Classify with LLM and standardize
result = classify_cities_openai(unique_cities)

print("Austrian locations:", len(result['austrian']))
print("Non-Austrian locations:", len(result['non_austrian']))

Austrian locations: 100
Non-Austrian locations: 24
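
Because the classification comes from a language model, it can be worth checking the response before relying on it. The sketch below checks that every input location was placed in a group and that each Austrian entry has the three requested fields. `check_classification` and the sample response are hypothetical and only illustrate the idea.

```python
REQUIRED_FIELDS = ("standardized", "postal_code", "region")


def check_classification(result, cities):
    """Return a list of human-readable problems found in the model's answer."""
    problems = []
    austrian = result.get("austrian", {})
    non_austrian = result.get("non_austrian", [])

    # Every input location should appear in exactly one of the two groups.
    seen = set(austrian) | set(non_austrian)
    for city in cities:
        if city not in seen:
            problems.append(f"unclassified: {city}")
    for city in sorted(set(austrian) & set(non_austrian)):
        problems.append(f"in both groups: {city}")

    # Every Austrian entry should carry the three requested fields.
    for city, info in austrian.items():
        for field in REQUIRED_FIELDS:
            if not isinstance(info, dict) or not info.get(field):
                problems.append(f"missing {field}: {city}")
    return problems


# Made-up response used only to exercise the checks.
sample = {
    "austrian": {
        "Wien": {"standardized": "Vienna", "postal_code": "1010", "region": "Vienna"},
        "Linz": {"standardized": "Linz", "postal_code": "4020"},
    },
    "non_austrian": ["Berlin"],
}
print(check_classification(sample, ["Wien", "Linz", "Berlin", "Graz"]))
```

An empty list means the response is structurally complete. It does not confirm that the postal codes or regions are factually correct.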


Now we create a mapping and apply it to the dataframe:

In [10]:
# Create mappings
austrian_mapping = result['austrian']

# Create separate mappings for standardized names, postal codes, and regions
name_mapping = {k: v['standardized'] for k, v in austrian_mapping.items()}
postal_mapping = {k: v['postal_code'] for k, v in austrian_mapping.items()}
region_mapping = {k: v['region'] for k, v in austrian_mapping.items()}

# Filter to only Austrian locations; .copy() avoids a SettingWithCopyWarning on the column assignments below
df_listings = df_listings[df_listings['based_in'].isin(austrian_mapping.keys())].copy()

# Add postal code and region columns, then standardize city names
df_listings['postal_code'] = df_listings['based_in'].map(postal_mapping)
df_listings['region'] = df_listings['based_in'].map(region_mapping)
df_listings['based_in'] = df_listings['based_in'].map(name_mapping)
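
To make the filter-and-map step concrete, here is the same logic on plain dictionaries. The `demo_` names and rows are made up for the example. Note that `based_in` is overwritten last, because the postal code and region lookups are keyed by the original, unstandardized name.

```python
# Made-up stand-in for result['austrian'].
demo_mapping = {
    "Wien": {"standardized": "Vienna", "postal_code": "1010", "region": "Vienna"},
    "Pasching": {"standardized": "Pasching", "postal_code": "4061", "region": "Upper Austria"},
}

# Made-up stand-in for a few scraped rows.
demo_rows = [
    {"name": "A", "based_in": "Wien"},
    {"name": "B", "based_in": "Berlin"},     # not in the mapping, so it is dropped
    {"name": "C", "based_in": "Pasching"},
]

enriched = []
for row in demo_rows:
    info = demo_mapping.get(row["based_in"])
    if info is None:
        continue  # same effect as the isin() filter
    enriched.append({
        **row,
        "postal_code": info["postal_code"],
        "region": info["region"],
        "based_in": info["standardized"],  # overwrite last, as in the notebook
    })

for row in enriched:
    print(row)
```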

In [12]:
df_listings.reset_index(drop=True, inplace=True)
df_listings.head(3)

Unnamed: 0,name,link,category,based_in,tags,founded,postal_code,region
0,Yumi42 – Growth Made Easy,https://www.eu-startups.com/directory/yumi42-g...,Austria,Vienna,"coaching, marketplace, transparent",2025,1010,Vienna
1,Krajete,https://www.eu-startups.com/directory/krajete/,Austria,Pasching,"carbon capture, nox control, nox removal, Clea...",2025,4061,Upper Austria
2,Submit Ninja,https://www.eu-startups.com/directory/submit-n...,Austria,Vienna,"Marketing, Voice of customer, Customer Feedbac...",2025,1010,Vienna


In [13]:
df_listings.to_csv("./data/1_raw_listings.csv", index=False)
df_listings = pd.read_csv("./data/1_raw_listings.csv")

## 1.2 Detailed Listing Scraping

Now let's scrape the detail pages for the startups we collected above. Here is what an example page looks like:

<div style="text-align: center;">
  <img src="../images/startupseu_detailed_listing.png" alt="Detailed Listing" width="50%">
</div>

### Function

In [None]:
def extract_listing_details(url, headers=None):

    details = {
        'business_name': None,
        'logo_link': None,
        'long_business_description': None,
        'business_description': None,
        'total_funding': None,
        'website': None,
        'company_status': None,
        'social_links': []
    }
    
    try:
        response = requests.get(url, headers=headers, timeout=30)  # timeout value is an assumption; it prevents a stalled request from hanging forever
        response.raise_for_status()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return details

    soup = BeautifulSoup(response.content, 'html.parser')
    
    # 1. Extract business name either from header or listing title
    header = soup.find('div', class_='td-page-header')
    if header:
        h1 = header.find('h1', class_='entry-title td-page-title')
        if h1:
            span = h1.find('span')
            if span:
                details['business_name'] = span.get_text(strip=True)
    if not details['business_name']:
        listing_title = soup.find('div', class_='listing-title')
        if listing_title:
            h2 = listing_title.find('h2')
            if h2:
                details['business_name'] = h2.get_text(strip=True)
    
    # 2. Get logo link from the thumbnail section
    thumbnail_div = soup.find('div', class_='listing-thumbnail')
    if thumbnail_div:
        a_tag = thumbnail_div.find('a')
        if a_tag:
            img_tag = a_tag.find('img')
            if img_tag and img_tag.get('src'):
                details['logo_link'] = img_tag.get('src')
    
    # 3. Helper function to extract field by its identifier
    def extract_field(field_identifier):
        field_div = soup.find('div', class_=lambda x: x and field_identifier in x)
        if field_div:
            value_div = field_div.find('div', class_='value')
            if value_div:
                return value_div.get_text(" ", strip=True)
        return None

    # 3.1 Long Business Description
    details['long_business_description'] = extract_field('wpbdp-field-long_business_description')

    # 3.2 Business Description
    details['business_description'] = extract_field('wpbdp-field-business_description')
    
    # 3.3 Total Funding
    details['total_funding'] = extract_field('wpbdp-field-total_funding')

    # 3.4 Website
    details['website'] = extract_field('wpbdp-field-website')
    
    # 3.5 Company Status
    details['company_status'] = extract_field('wpbdp-field-company_status')
    
    # 4. Social Links 
    social_container = soup.find('div', class_='social-fields')
    if social_container:
        anchor_tags = social_container.find_all('a')
        for a in anchor_tags:
            href = a.get('href')
            if href:
                details['social_links'].append(href)
    
    return details

In [None]:
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0; +http://example.com/contact)'
}

detailed_rows = []

# Iterate over each row in our startup dataframe
for idx, row in tqdm(df_listings.iterrows(), total=len(df_listings), desc="Scraping details"):
    listing_url = row['link']
    
    # Extract additional details via our function
    details = extract_listing_details(listing_url, headers=headers)
    
    # Add the extracted details to the row
    combined_row = row.to_dict()
    combined_row.update(details)
    detailed_rows.append(combined_row)
    
    delay = random.uniform(2, 3)
    time.sleep(delay)

df_detailed_listings = pd.DataFrame(detailed_rows)

Scraping details from: https://www.eu-startups.com/directory/avdain/
Sleeping for 2.14 seconds...
Scraping details from: https://www.eu-startups.com/directory/pdf-to-brainrot/
Error fetching https://www.eu-startups.com/directory/pdf-to-brainrot/: 404 Client Error: Not Found for url: https://www.eu-startups.com/directory/pdf-to-brainrot/
Sleeping for 2.86 seconds...
Scraping details from: https://www.eu-startups.com/directory/softgen/
Sleeping for 2.54 seconds...
Scraping details from: https://www.eu-startups.com/directory/popper-power-gmbh/
Sleeping for 2.11 seconds...
Scraping details from: https://www.eu-startups.com/directory/setter-ai/
Sleeping for 2.98 seconds...
Scraping details from: https://www.eu-startups.com/directory/surveysensum/
Sleeping for 2.21 seconds...
Scraping details from: https://www.eu-startups.com/directory/artypa/
Sleeping for 2.76 seconds...
Scraping details from: https://www.eu-startups.com/directory/share-your-party/
Sleeping for 2.01 seconds...
Scraping deta

In [None]:
df_dl = pd.DataFrame(detailed_rows)

df_dl['social_links'] = df_dl['social_links'].apply(lambda x: tuple(x) if isinstance(x, list) else x)

df_dl.drop_duplicates(keep='first', inplace=True)

df_dl.drop_duplicates(subset=['link'], keep='first', inplace=True)

df_dl.reset_index(drop=True, inplace=True)

# DataFrame.map replaces the deprecated applymap (pandas >= 2.1), matching the earlier cleaning cell
df_dl = df_dl.map(clean_unusual_terminators)

#df_dl.to_csv("data/eustartup_listings.csv", index=False)

df_detailed_listings = pd.read_csv("data/eustartup_listings.csv")

df_detailed_listings

Unnamed: 0,name,link,category,based_in,tags,founded,business_name,logo_link,long_business_description,business_description,total_funding,website,company_status,social_links
0,Avdain,https://www.eu-startups.com/directory/avdain/,Austria,Vienna,"Company, Startup, One Person",2020.0,Avdain,https://www.eu-startups.com/wp-content/uploads...,Avdain is a enterprise that embodies a fusion ...,Avdain is an technology company founded and so...,No funding announced yet,avdain.com,Active,()
1,PDF To Brainrot,https://www.eu-startups.com/directory/pdf-to-b...,Austria,"245 Wo Lung Street, Fanling, North District, H...","AI,Video,Learning",2024.0,,,,,,,,()
2,Softgen,https://www.eu-startups.com/directory/softgen/,Austria,Austria,"AI code assistant, full stack developer",2024.0,Softgen,https://www.eu-startups.com/wp-content/uploads...,"Beyond a starter: Your complete project, ready...",Softgen is your AI Web App Developer. Describe...,No funding announced yet,https://softgen.ai/,Active,()
3,Popper Power GmbH,https://www.eu-startups.com/directory/popper-p...,Austria,Vienna,"EV, Battery, BESS, Charging",2022.0,Popper Power GmbH,https://www.eu-startups.com/wp-content/uploads...,,Popper Power GmbH develops advanced energy sto...,Between €500K-€ 1 million,www.popperpower.com,Active,"('https://www.linkedin.com/company/86313916',)"
4,Setter AI,https://www.eu-startups.com/directory/setter-ai/,Austria,Wien,"AI, AI Agents, Sales & Marketing, AI SaaS, AI ...",2024.0,Setter AI,https://www.eu-startups.com/wp-content/uploads...,Speed matters when you want more sales. That’s...,Easy-to-use AI appointment setter for WhatsApp...,Between €1-€100K,https://www.trysetter.com,Active,()
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697,Runtastic,https://www.eu-startups.com/directory/runtastic/,Austria,Linz,"Activity Tracker, Fitness app, Healthcare",2009.0,Runtastic,https://www.eu-startups.com/wp-content/uploads...,,runtastic helps you to track your fitness acti...,,http://www.runtastic.com,,()
698,toolani,https://www.eu-startups.com/directory/toolani/,Austria,Vienna,"Online Calls Software, Connectivity, Telecom",2008.0,toolani,https://www.eu-startups.com/wp-content/uploads...,,toolani offers cheap international calling to ...,,https://www.toolani.com,,()
699,Matchoffice Österreich,https://www.eu-startups.com/directory/matchoff...,Austria,Wien,"Office rentals, Business centres",2008.0,Matchoffice Österreich,https://www.eu-startups.com/wp-content/uploads...,Thanks to our many years of experience with of...,MatchOffice is a recognised player for the pla...,No funding announced yet,https://www.matchoffice.at,Active,()
700,Kununu,https://www.eu-startups.com/directory/kununu/,Austria,Vienna,"Anonymous Feedback, Companies Rating, Reviews ...",2007.0,Kununu,https://www.eu-startups.com/wp-content/uploads...,,The Austria based company kununu offers a plat...,,http://www.kununu.com,,()
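
A note on the `social_links` conversion in the last cell: deduplication relies on hashing values, and Python lists are not hashable, which is presumably why the lists are turned into tuples before `drop_duplicates` is called. The sketch below shows the same idea without pandas, using made-up rows.

```python
# Made-up rows: the first two are exact duplicates.
rows = [
    {"link": "https://example.com/a", "social_links": ["https://linkedin.example/a"]},
    {"link": "https://example.com/a", "social_links": ["https://linkedin.example/a"]},
    {"link": "https://example.com/b", "social_links": []},
]

# Lists cannot be hashed, so they cannot be used as set members or dict keys.
try:
    hash(rows[0]["social_links"])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# After converting the list to a tuple, the row key can be stored in a set.
seen = set()
unique_rows = []
for row in rows:
    key = (row["link"], tuple(row["social_links"]))
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(list_is_hashable, len(unique_rows))
```

This also explains why the saved CSV shows `()` for listings without social links: an empty list becomes an empty tuple.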
