# TrustPilot.com User Reviews

This notebook scrapes the reviews for all US businesses that are listed in the `Car Dealership` category and have received at least one review. Each step cleans up the data before it is stored in a .json file at the end.

These are the steps:
1. Create a list of all businesses in the category with at least 1 review
    - Filter to US businesses
    - Only verified & claimed businesses
    - Results sorted by number of reviews in descending order
2. For each business, parse the HTML via *BeautifulSoup4* and capture the reviews
    - Paginate through each result set as needed
    - Ignore blank or non-English comments
3. Store data in a .json file

This notebook uses BeautifulSoup4 for all the HTML parsing because the web server returns the final HTML without any client-side JavaScript altering the results. This makes scraping much simpler. Were this not the case, a browser-automation solution such as [Selenium](https://github.com/SeleniumHQ/Selenium) would be necessary.

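To illustrate what static parsing looks like, here is a minimal sketch that runs BeautifulSoup over a small hand-written HTML fragment. The markup is hypothetical: only the attribute names mirror the ones this notebook searches for, and the live pages may look different or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment written for this example, not copied from the live site.
SAMPLE_HTML = """
<article data-service-review-card-paper="true">
  <div data-service-review-rating="5"></div>
  <h2 data-service-review-title-typography="true">Great service</h2>
  <p data-service-review-text-typography="true">Quick and friendly.</p>
</article>
<article data-service-review-card-paper="true">
  <div data-service-review-rating="2"></div>
  <h2 data-service-review-title-typography="true">Slow paperwork</h2>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
parsed = []
for card in soup.find_all("article", attrs={"data-service-review-card-paper": "true"}):
    # attrs={...: True} matches any tag that carries the attribute, whatever its value
    rating = card.find("div", attrs={"data-service-review-rating": True})
    title = card.find("h2", attrs={"data-service-review-title-typography": "true"})
    text = card.find("p", attrs={"data-service-review-text-typography": "true"})
    parsed.append({
        "rating": rating["data-service-review-rating"] if rating else None,
        "review_title": title.text.strip() if title else None,
        "review_text": text.text.strip() if text else None,
    })

print(parsed)
```

Because everything needed is already present in the HTML string, no browser or JavaScript engine is involved.
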
### Non-default libraries used
In case your environment does not have these libraries, execute the following:
`pip install beautifulsoup4 requests langdetect`

### Import libraries used in the notebook

In [None]:
import requests
from bs4 import BeautifulSoup
import time
import random
import json
from urllib.parse import urljoin
from langdetect import detect
from langdetect import LangDetectException
import datetime

### Initialize the global variables

In [2]:
start_url = "https://www.trustpilot.com/categories/car_dealer?claimed=true&sort=reviews_count&verified=true" #filtered & sorted
base_url = "https://www.trustpilot.com"
all_reviews = []

### Scrape the entire category
With this function, we start at the provided URL (in this case the car dealership page) and collect every listed business that has at least one review. Since the web results are paginated, the function follows the "next page" link and processes each page in succession. The results are sorted on the server side (via URL parameters) by review count in descending order, so once we hit the first business with 0 reviews we know we can stop. The function treats a business card without a TrustScore image as having no reviews.

In [None]:
def scrape_category_page(start_url: str):
    current_page_url = start_url
    business_urls = []

    while current_page_url:
        response = requests.get(current_page_url)
        soup = BeautifulSoup(response.content, 'html.parser')

        for link in soup.find_all('a', attrs={'data-business-unit-card-link': True}):
            # A card without a TrustScore image is treated as having no reviews
            score_imgs = link.find_all('img', alt=lambda value: value and "TrustScore" in value)
            if not score_imgs:
                # Results are sorted by review count (descending), so we can stop here
                return business_urls
            url = link.get('href')
            url = urljoin(base_url, url)
            if '/review/' in url:
                business_urls.append(url)

        next_page_link = soup.find('a', attrs={'data-pagination-button-next-link': 'true'})
        if next_page_link and next_page_link.get('href'):
            next_page_url = next_page_link.get('href')
            next_page_url = urljoin(base_url, next_page_url)
            current_page_url = next_page_url
        else:
            break

        time.sleep(random.uniform(2, 5))
    
    return business_urls

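The early-stop idea can be tried offline with a small sketch. Everything here is hypothetical: `FAKE_PAGES` stands in for the paginated category listing, `collect_reviewed` is an illustrative helper, and the host is a placeholder. Only the control flow (follow the next link, stop at the first unreviewed card) is meant to match the function above.

```python
from urllib.parse import urljoin

BASE = "https://example.com"  # placeholder host for the sketch

# Hypothetical in-memory "site": path -> (cards, next_href).
# Each card is (href, has_reviews); cards are sorted by review count, descending.
FAKE_PAGES = {
    "/categories/demo": (
        [("/review/a.example", True), ("/review/b.example", True)],
        "/categories/demo?page=2",
    ),
    "/categories/demo?page=2": (
        [("/review/c.example", True), ("/review/d.example", False)],
        "/categories/demo?page=3",
    ),
    "/categories/demo?page=3": ([("/review/e.example", False)], None),
}


def collect_reviewed(start_href, pages):
    """Follow 'next' links, stopping at the first card without reviews."""
    found, visited = [], []
    href = start_href
    while href:
        visited.append(href)
        cards, next_href = pages[href]
        for card_href, has_reviews in cards:
            if not has_reviews:
                # Sorted descending, so every later card is also unreviewed.
                return found, visited
            found.append(urljoin(BASE, card_href))
        href = next_href
    return found, visited


urls, visited = collect_reviewed("/categories/demo", FAKE_PAGES)
print(urls)
print(visited)
```

In this toy data the third page is never requested, which is the saving the descending sort order buys.
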
### Scrape each business' page
Here we navigate to the specific business' page and process the reviews one page at a time. To be considerate of the web server, we wait a random 2-7 seconds between page requests. Note that there are businesses with thousands of pages.

As we loop through each review, we drop empty reviews and any that are not in English. We then return the resulting `dict` for processing by the main body of the notebook.

In [1]:
# Extract business name and reviews from a given URL
def get_business_reviews(url: str):
    reviews_data = []
    business_name = ""  # fallback so the final return never references an unassigned name

    while url: #loop while we have a valid URL to process
        try:
            #get the page and load into BeautifulSoup
            response = requests.get(url) 
            soup = BeautifulSoup(response.content, 'html.parser')

            # This assumes 'business_name' remains constant across pagination
            if not reviews_data:  # Only get the business name on the first page
                busElement = soup.find('div', id='business-unit-title')
                if busElement:
                    business_name = busElement.text.strip()
        
            #iterate through all the reviews for the business
            for review in soup.find_all('article', attrs={'data-service-review-card-paper': 'true'}):
                try:
                    review_rating = review.find('div', attrs={'data-service-review-rating':True})
                    review_title = review.find('h2', attrs={'data-service-review-title-typography': 'true'})
                    review_text = review.find('p', attrs={'data-service-review-text-typography': 'true'})

                    if review_rating and (review_text or review_title): #we only want the data if we have both a rating and either a review or a title
                        rating = review_rating['data-service-review-rating']
                        ttl = review_title.text.strip() if review_title else None
                        txt = review_text.text.strip() if review_text else None
                        if (ttl and detect(ttl)=='en') or (txt and detect(txt)=='en'): #we only want the data if it's in English
                            reviews_data.append({'rating': rating, 'review_title': ttl, 'review_text': txt})
                except LangDetectException:
                    pass #ignore this particular type of error
                except Exception as e:
                    print(f"An error occurred: {e}")
            
            # Find the 'Next' page link and update `url` for the next iteration
            next_page_link = soup.find('a', attrs={'data-pagination-button-next-link': 'true'})
            if next_page_link and 'href' in next_page_link.attrs:
                url = urljoin(base_url, next_page_link['href']) 
            else:
                url = None
        
            time.sleep(random.uniform(2, 7)) #be nice and wait 2-7 seconds
        except requests.exceptions.RequestException as e:
            print(f"Request error: {e}")
            break  # Exit loop on request error
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            break  # Exit loop on any other error
        
    return {'business_name': business_name, 'reviews': reviews_data}

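The filtering rule (a rating plus an English title or body) can be isolated into a small predicate, sketched below under stated assumptions. `toy_detect` and `DetectionError` are hypothetical stand-ins for a real language detector and its failure exception, used only so the example runs without extra libraries. One deliberate variation: each field is checked separately, so a detection failure on the title does not discard a review whose body is English, whereas in the function above such a failure skips the whole review.

```python
class DetectionError(Exception):
    """Hypothetical stand-in for a language detector's failure exception."""


def toy_detect(text):
    """Toy detector for the demo only; real detection needs a proper library."""
    if not any(ch.isalpha() for ch in text):
        raise DetectionError("no letters to analyse")
    return "en" if text.isascii() else "other"


def is_english(text, detect):
    """True if `detect` labels non-empty text as English; failures count as False."""
    if not text:
        return False
    try:
        return detect(text) == "en"
    except DetectionError:
        return False


def keep_review(rating, title, text, detect=toy_detect):
    """Keep a review that has a rating and an English title or English body."""
    if not rating or not (title or text):
        return False
    return is_english(title, detect) or is_english(text, detect)


print(keep_review("5", "Great", None))
print(keep_review("5", "!!!", "Very helpful staff"))
print(keep_review("3", "Très bien", None))
```

Passing the detector in as a parameter keeps the rule testable without network access or third-party packages.
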
### Start of the main body
We start by pulling the list of all businesses with reviews

In [None]:
business_urls = scrape_category_page(start_url)
print(f"Found {len(business_urls)} business review URLs")

We now iterate through each resulting URL but do it sorted by review count in ascending order. This is to ensure that if we encounter any issues in pulling the data, we discover it with small datasets before committing to the multi-thousand page businesses.

We try to be nice and wait 30-60 seconds before we start processing the next business.

In [None]:
for url in reversed(business_urls): #since the webserver returns the results sorted by review count in descending order, we reverse the list
    ts = datetime.datetime.now()
    print(f"{ts}\tNow processing: {url}")

    reviews = get_business_reviews(url) #get the reviews for the business

    try:
        if len(reviews['reviews']) > 0: #only store the data if we got anything
            all_reviews.append(reviews)
            ts = datetime.datetime.now()
            print(f"{ts}\tProcessed {len(reviews['reviews'])} reviews")

    except Exception as e:
        print(f"An error occurred parsing {url}: {e}")

    time.sleep(random.uniform(30, 60))

Now that we're done pulling all the data from the web server, we do some further cleanup: each business name is cut off at the word "Reviews" and the non-breaking space character is removed.

In [None]:
running_total = 0
delim="Reviews"
for rvw in all_reviews:
    txt = rvw['business_name']
    bsnm = txt.split(delim)[0]
    rvw['business_name'] = bsnm.replace('\u00A0', '') # \u00A0 is a non-breaking space that shows up in the scraped title, so we remove it
    print(f"Business Name: {rvw['business_name']}\tReviews: {len(rvw['reviews'])}")
    running_total += len(rvw['reviews'])

print(f"Total reviews: {running_total}")

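The same cleanup can be written as a small reusable helper, sketched here with made-up input strings. `clean_business_name` is a hypothetical name, and the sketch assumes the scraped title ends with a "Reviews ..." suffix. It differs from the cell above in two ways: it splits on the last occurrence of the suffix, so a name that itself contains "Reviews" survives, and it turns non-breaking spaces into regular spaces before trimming instead of deleting them.

```python
def clean_business_name(raw, suffix="Reviews"):
    """Drop a trailing '<suffix> ...' part and normalise non-breaking spaces."""
    head, sep, _ = raw.rpartition(suffix)
    name = head if sep else raw  # suffix not found: keep the whole string
    return name.replace("\u00a0", " ").strip()


# Made-up titles for illustration only
print(clean_business_name("Acme Motors\u00a0Reviews 1,234"))
print(clean_business_name("Car Reviews Hub\u00a0Reviews 87"))
print(clean_business_name("Plain Name"))
```
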
Finally, we save the collected data to a JSON file.

In [None]:
# Save the collected data to a JSON file with UTF-8 encoding
filename = f"trusted_pilot_car_dealerships-{datetime.datetime.now().strftime('%Y%m%d')}.json" # relative path in the current directory, works on any OS
with open(filename, 'w', encoding='utf-8', newline='\n') as file:
    json.dump(all_reviews, file, indent=4, ensure_ascii=False)

print(f"Data has been saved to {filename}")
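
To check that a saved file reads back correctly, a round trip like the following sketch can help. The sample record and file name are made up, and a temporary directory is used so nothing is left on disk. It also shows the effect of `ensure_ascii=False`: non-ASCII characters are written as-is rather than as `\uXXXX` escapes.

```python
import json
import tempfile
from pathlib import Path

# Made-up record with the same shape as the entries in all_reviews
sample = [{
    "business_name": "Café Motors",
    "reviews": [{"rating": "5", "review_title": "Great service", "review_text": None}],
}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reviews-sample.json"
    with path.open("w", encoding="utf-8", newline="\n") as fh:
        json.dump(sample, fh, indent=4, ensure_ascii=False)

    raw = path.read_text(encoding="utf-8")   # file content as written
    with path.open(encoding="utf-8") as fh:
        loaded = json.load(fh)               # parsed back into Python objects

print("Café" in raw)
print(loaded == sample)
```

Note that `None` values are stored as JSON `null` and come back as `None` when loaded.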