# 1. Overview

### 1.1. Introduction to Kununu

In this notebook, I'll introduce you to Kununu, an employer review platform similar to Glassdoor, but focused on German-speaking countries (Germany, Switzerland and Austria).

If you've ever applied for a job, you might relate to the curiosity of knowing what it's truly like to work at that company. Many job applicants prefer to independently investigate potential employers, especially if they don’t know anyone within the company.  This is where Kununu comes into play.

Kununu provides individuals with valuable employee insights and ratings for companies they are interested in, helping them make informed career decisions. Meanwhile, companies benefit from receiving anonymous yet reliable feedback, which highlights their strengths and identifies areas needing improvement.

<div class="alert alert-block alert-success">

Personally, I find Kununu to be extremely insightful. From what I've seen, companies tend to emphasize their attractive values and benefits, but these employer reviews reveal the most truthful opinions. That's why I make sure to read them thoroughly before any applications and interviews.
</div>

<div class="alert alert-block alert-info">

If data quality is your concern, know that they are very committed to ensuring fairness and maintaining high standards in reviews, with measures in place to prevent fake and purchased feedback while protecting user personal data.

For further information  see the links below:

* Policy: https://inside.kununu.com/kununu-richtlinien/

* Data processing: https://inside.kununu.com/wie-kununu-daten-verarbeitet/
</div>

The platform is unfortunately only in German, but with current technology, you can easily translate the page directly into English using your browser. Sometimes, you can also find reviews in English.

### 1.2. Page and data structure

Before diving into details on how to access their data, let's start with a brief summary of the site and data structure. 

The entire website is designed so that each company has an overview page containing information about the company and summarized statistics. Its URL has the structure:

`https://www.kununu.com/<country abbreviation: de, at, or ch>/<company name>`

For example:

`https://www.kununu.com/de/statista`


Below this, there are several subpages:

* Reviews (**~/kommentare** - from employees or **~/bewerbung** - from applicants): This section allows users to filter reviews based on position, department, location, and other criteria, as well as sort them by newest, oldest, most relevant, etc. Each review includes various categories where the reviewer can rate from 1 to 5 stars and add textual feedback. The reviewers are not only employees but also previous applicants, allowing for their filtering as well.

* Salaries(**~/gehalt**): Provides insights into salary ranges for different positions within the company.

* Job posts (**~/jobs**): Although vacant jobs are listed here for the company, in my experience, it may not be the most up-to-date or comprehensive source for job listings. Other platforms such as Stepstone, Indeed, and LinkedIn typically provide more current and thorough job postings.

* Company culture (**~/kultur**): Focuses on insights into the company's culture, based on employee experiences. For more details on how they calculate these cultural metrics, visit [Kununu Culture Compass Methodology](https://news.kununu.com/presseinformation/kununu-kulturkompass-methodische-grundlagen/).

Some company has the News section too (**~/news**): This page is dedicated to updates and announcements related to the company.

If you don't have a specific company to investigate and just want to explore options, you can navigate to their general pages, which are not attached to any particular company (see below).

<img src="images/Kununu_general.png" width="900"/>

<div class="alert alert-block alert-success">

I usually scroll through the review section to read about companies of interest, but I often find it challenging to summarize all the information I need and search for very specific details, e.g. whether the company has a works committee. Furthermore, reviews are already categorized but at quite a high level, so there is plenty of room for more advanced analysis, such as text mining (e.g., topic detection, more ideas in part 3). 

Let's now take a look at possible ways to retrieve the data.
</div>

### 1.3. Options to access data

| Method                | Description and links                                                                                                | Potential peculiarities/problems                                          |
|-----------------------|---------------------------------------------------------------------------------------                               |---------------------------------------------------------------------------|
| **APIs**              | Kununu does not appear to offer a public API directly for developers on its website.                                 | -                                                                         | 
| **Python Packages**   | Based on my research, there are currently no dedicated Python packages specifically for accessing Kununu data either.| -                                                                         |
| **Web Scraping**      | Paid options: [Apify's Kununu.com Companies Scraper API](https://apify.com/lexis-solutions/kununu-scraper/api)<br>Free options: write your own script or use code from open source projects, e.g.: <br>- https://github.com/TheWoops/Web-Scraping, <br>- https://github.com/CAlamosV/Webscraping-Kununu          | - Legal and ethical considerations - ensure compliance with Kununu's terms of service and data usage policies <br>- Potential changes in website structure can break scraping scripts.

This approach is more oriented towards employees and interested individuals because if a company has an Employer Branding Profile, they receive in-depth analysis insights for their account. Read more [here](https://arbeitgeberportal.kununu.com/produkte/kostenloses-arbeitgeberprofil/).


Based on this initial research, I'd recommend web scraping while keeping some common rules for this option in mind, such as:
* Rate limit: Be mindful of making too many requests in a short period to avoid IP blocks.
* Privacy and legal considerations: As all reviews are anonymous, there is no significant risk concerning storing and processing personal information. However, always consult the terms of service to ensure compliance.

<div class="alert alert-block alert-warning">
<b>Further considerations</b>

* It's also important to remember that these reviews only capture part of the whole picture, as not every employee/applicant shares their experiences. Often, individuals who had negative experiences are more likely to post reviews. Utilize this data to support better, data-driven decision-making for employees and companies, but avoid using it as the sole metric for evaluating company performance or culture.

* Furthermore, be mindful that reviews from the distant past may not reflect the current situation. It is recommended to concentrate on reviews from the previous 1 or 2 years.
</div>

# 2. Mini Tutorial

### Code for scraping all written reviews (either from employees or applicants) for a certain company in Kununu 
*(modified from the original file **scrape_reviews.ipynb** of [Webscraping-Kununu](https://github.com/CAlamosV/Webscraping-Kununu) (GitHub) for the purpose of my submission)*

##### > Workflow
The main steps involve installing Python packages, defining CSS classes based on page structure, and writing functions to scrape the data:
1. **Parsing Individual Review Blocks (`parse_review_block`):**
   Extracts structured data from a single review, including subcategories and text of reviews.

2. **Fetching Reviews (`get_all_reviews_for_url`):**
   Scrapes all reviews (either from employees or applicants) for a company across multiple pages, iterating through review pages until the 'Mehr Bewertungen lesen' (Read more reviews) button is no longer visible.

##### > Output
- **Structured Review Data:** A JSON file containing the scraped reviews.

##### > Notes
- The scraper relies on specific HTML structures defined in the `CSS_CLASSES` dictionary. Changes to Kununu’s website may require updates to these selectors.
- Kununu's website uses JavaScript to load content dynamically, meaning simple HTTP requests won't capture all the data. The original project used Scraping Bee to address this issue. I replaced it with Selenium, a free alternative that we learned in the lecture.

In [1]:
# !pip install beautifulsoup4 selenium
import json
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import math
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException, NoSuchElementException

In [2]:
# Define a dictionary to hold the needed CSS class names
# This may need to be updated if the website changes
CSS_CLASSES = {
    "overall_score": ".index__score__BktQY",
    "title": "h3.index__title__xakS9.h3-semibold",
    "date": "time[datetime]",
    "recommendation_block": ".index__recommendationBlock__2zhEJ",
    "employment_info": ".index__employmentInfoBlock__wuOtj",
    "factor": ".index__factor__Mo6xW",
    "factor_title": ".index__title__Rq0Po",
    "factor_text": ".index__plainText__JgbHE",
    "review_block": ".index__reviewBlock__I8pdb",
    "total_reviews": "index__totalReviews__aUzS6 p-small-semibold"
}

In [3]:
def parse_review_block(block, kn_url):
    review = {}
    review['kn_url'] = kn_url
    # Overall score
    score_el = block.select_one(CSS_CLASSES["overall_score"])
    review['overall_score'] = float(score_el.text.replace(',', '.')) if score_el else None
    
    # Title
    title_el = block.select_one(CSS_CLASSES["title"])
    review['title'] = title_el.text.strip() if title_el else None
    
    # Date
    date_el = block.select_one(CSS_CLASSES["date"])
    if date_el:
        date_str = date_el.get('datetime', '')
        date_parts = date_str.split('T')[0].split('-')
        review['year'] = int(date_parts[0])
        review['month'] = int(date_parts[1])
    else:
        review['year'] = None
        review['month'] = None
    
    # Recommendation
    rec_el = block.select_one(CSS_CLASSES["recommendation_block"])
    if rec_el:
        recommendation_text = rec_el.text.strip().lower()
    else:
        review['recommended'] = None

    # Employee Type and Position
    emp_info_el = block.select_one(CSS_CLASSES["employment_info"])
    if emp_info_el:
        emp_info_text = emp_info_el.text.strip()
        emp_type_el = emp_info_el.select_one('b')
        review['employee_type'] = emp_type_el.text.strip() if emp_type_el else None
        try:
            position_text = emp_info_el.text.replace(review['employee_type'], '', 1).strip()
            review['position'] = position_text if position_text else None
        except:
            review['position'] = None
    else:
        review['employee_type'] = None
        review['position'] = None

    # Subcategories
    review['subcategories'] = []
    factors = block.select(CSS_CLASSES["factor"])
    for f in factors:
        cat_title_el = f.select_one(CSS_CLASSES["factor_title"])
        if not cat_title_el:
            continue
        cat_title = cat_title_el.text.strip()
        text_el = f.select_one(CSS_CLASSES["factor_text"])
        cat_text = text_el.text.strip() if text_el else None
        review['subcategories'].append({cat_title: cat_text})

    return review


In [4]:
def get_all_reviews_for_url(kn_url, save_path="data/scraped_reviews.json"):

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920,1080")
    
    driver = webdriver.Chrome(options=chrome_options)
    
    try:
        driver.get(kn_url)
        
        reviews = []
        more_reviews_available = True
        
        while more_reviews_available:
            
            # Parse the current page content
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            review_blocks = soup.select(CSS_CLASSES["review_block"])
            
            # Process the currently visible reviews
            for block in review_blocks:
                parsed = parse_review_block(block, kn_url)
                if parsed.get('title'):
                    reviews.append(parsed)
            
            print(f"Collected {len(reviews)} reviews so far...")
            
            # Check if the "Mehr Bewertungen lesen" button exists
            try:
                load_more_link = soup.select_one("a.index__button__2PFpW")
                if load_more_link and load_more_link.get("href"):
                    next_page_url = load_more_link["href"]
                    full_next_page_url = f"https://www.kununu.com{next_page_url}"
                    print(f"Navigating to next page: {full_next_page_url}")
                    driver.get(full_next_page_url)
            
                    time.sleep(3)
                else:
                    print("No more 'Mehr Bewertungen lesen' button found. All reviews loaded.")
                    more_reviews_available = False
            except Exception as e:
                print(f"Error finding 'Mehr Bewertungen lesen' button: {e}")
                more_reviews_available = False
        
        print(f"Total reviews collected: {len(reviews)}")
        

        results = {kn_url: reviews}
        with open(save_path, "w") as f:
            json.dump(results, f)
        
        return results
    
    finally:
        driver.quit()



In [None]:
# Example: Let's scrape all reviews from employees for Statista
url = "https://www.kununu.com/de/statista/kommentare"
reviews = get_all_reviews_for_url(url)

Collected 10 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/2
Collected 20 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/3
Collected 30 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/4
Collected 40 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/5
Collected 50 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/6
Collected 60 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/7
Collected 70 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/8
Collected 80 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/9
Collected 90 reviews so far...
Navigating to next page: https://www.kununu.com/de/statista/kommentare/10
Collected 100 reviews so far...
Navigating to next page: https:

In [None]:
# Return the first review
reviews['https://www.kununu.com/de/statista/kommentare'][0]

{'kn_url': 'https://www.kununu.com/de/statista/kommentare',
 'overall_score': 2.8,
 'title': 'War mal super, jetzt eher ein sinkendes Schiff',
 'year': 2025,
 'month': 4,
 'employee_type': 'Ex-Angestellte/r oder Arbeiter/in',
 'position': 'Hat bis 2025 im Bereich IT bei Statista gearbeitet.',
 'subcategories': [{'Gut am Arbeitgeber finde ich': '(Bisher noch) sehr viel Flexibilität, was sich aber ab Herbst ändern wirdDas kollegiale MiteinanderDie kulturelle Vielfalt'},
  {'Schlecht am Arbeitgeber finde ich': 'Falsche & gelogene Kommunikation ("es wird keine Layoffs geben"), nur damit dann am laufenden Band Mitarbeitende binnen zwei Wochen "verschwinden", ergo nicht selbst gekündigt haben.Die Bezahlung ist unterirdisch. Uni-Absolventen aus anderen Ländern werden schamlos ausgenutzt und zu Witzgehältern eingestellt, weil man weiß, dass sie gern in Deutschland Fuß fassen wollen. Gleichzeitig bekommen neue Kolleg:innen aus der Beratungswelt lächerlich hohe Gehälter, weil sie sich gut verkau

This demo code serves the purpose of scraping all reviews and ratings from employees of a certain company. However, it can be easily scaled to access reviews from applicants (by using the URL https://www.kununu.com/de/statista/bewerbung instead), as well as other sections such as company profile data, job postings, salaries, and to scrape multiple companies simultaneously. Up to now, I haven't experienced any limitations with this approach.

The full code can be found here: https://github.com/CAlamosV/Webscraping-Kununu. They claimed to have used the code to scrape all 255,002 employer profiles as of December 15th, 2024.

# 3. Use case

As you can see, Kununu consists primarily of user-generated employer reviews, hence a lot of text data. The data quality in terms of completeness and consistency of the review is not secured, it depends a lot on the individual biases and subjectivity. This requires us to handle the data differently than traditional tabular data, by using special NLP techniques. This aspect makes it particularly interesting in the context of our course.

Listed below are the initial ideas for relevant use cases:

| Type           | Description                                                                            |Example of potential use cases                                           |
|----------------|----------------------------------------------------|-------------------------------------------------------|
| Labeled data   | Ratings: Reviews feature 1 to 5-star ratings on various job aspects like work-life balance, salary, and career development| Ratings can serve as sentiment labels (1-2 stars = negative, 3 stars = neutral, 4-5 stars = positive) for sentiment analysis of review text. Or predict overall satisfaction or the likelihood of recommendation.  |
| Text data      | Free text reviews | Topic Modeling: Discover recurring themes and issues.


Regarding the topic of RAG systems, I coincidentally came across an AI summary function on the Kununu page of Check24. Althorugh it's unclear whether this feature is implemented by the company or provided by Kununu, as I can't see it on other companies' pages, this presents an interesting example of RAG functionality.

<img src="images/Kununu_Check24.png" width="1000"/>

Imagine building a RAG system where you could ask questions like, "What do employees say about promotion opportunities at Company Y? or 'What interview questions should I expect for position A at company B?" The system could treat each company as a source, retrieving relevant snippets and generating comprehensive answers based on your query. It could also enable comparisons between several companies and offer deeper analysis of specific topics.

<div class="alert alert-block alert-success">

In summary, Kununu and its data have enormous potential for traditional machine learning as well as more advanced AI applications like RAG. Web scraping has proven to work seamlessly, and I'm personally intrigued to explore this direction further.
</div>