## SeleniumBase URL Scraper & reCAPTCHA Solver

This program is a URL scraper built with SeleniumBase that searches Google for company websites. Because Google detects scraping of its search engine quickly and responds with reCAPTCHAs, the scraper is integrated with an automated reCAPTCHA solver that has a >90% success rate.

Read more about the reCAPTCHA solver [here](https://pypi.org/project/selenium-recaptcha-solver/).

The general workflow of this program is as follows:

1. When the scraper starts, a separate Google Chrome window opens. All URL scraping happens in this window, and it can be watched to monitor the progress of the URL retrieval manually.

2. A company name is taken as input. The scraper runs a Google search on this company name in the browser opened in Step 1.

   (a) **<u>If no reCAPTCHA appears</u>**, the URL of the first search result is scraped. Based on manual inspection, Google represents each search result in HTML as an \<a> tag with a jsname attribute of "UWckNb".
   
   (b) **<u>If a reCAPTCHA appears</u>**, the search result element cannot be found, so a NoSuchElementException is raised and the reCAPTCHA solver is activated. Most of the time, the reCAPTCHA is solved on the first try. Once it is solved, the Google search results page for the company name is shown and URL scraping carries on as usual.
   
> - Sometimes the reCAPTCHA solver needs more than one attempt (it has a ~10% failure rate). When it fails, a TimeoutException or RecaptchaException is raised. The program then searches the company name on Google again to load a fresh reCAPTCHA page for the solver to retry. This retry usually succeeds.

> - In rare cases of more severe blocks from Google (see below), the reCAPTCHA cannot be solved even after several retries. The block goes away on its own after ~10 minutes, so the program is built to pause for 10 minutes after every 10th failed attempt.

<table><tr>
<td bgcolor="white"> 
  <p align="center" style="padding: 10px">
    <img src="https://media-hosting.imagekit.io//4fddb2eef3864f98/Empty%20Box.png?Expires=1835776147&Key-Pair-Id=K2ZIVPTIP2VGHC&Signature=CFlwdLIpaoX9ruTZFQ-y4XwbIhk~9XzTh3mWhhgPtqR0NIey2~A73ekpDex0uAsQM1Wzv9pu8NcQ0jxD1btjInRkl-2UzcmBhldnpJ1YJdFlw3OV7FUASY~TZs01TwvCrbJznypuIJvBFFQsx8SmDJ9VrLCAcMKKdB8471sZcfiziK9ozf7KzAgDsbD2ZRkkLelNgcMLLmVqkSrVMBAQ4nn24vRVtLgDUTUT3aj-1VGIRyt40t~9cXV~mKtIZy4z6h3vBSvPof9dXdkCM788DuNt0Qbl9wriZ7IOholFAO8BALZFTiBao5e562tbVtg5ieGmLT8tjR1BpSvuX7P7sQ" width="200">
      <img src="https://media-hosting.imagekit.io//1a8106f24717434e/Screenshot%202025-03-04%20at%207.06.38%E2%80%AFPM%20(1).png?Expires=1835776430&Key-Pair-Id=K2ZIVPTIP2VGHC&Signature=Bpaa4UczusvmAnyw8kkseJd2DaMkCl77AEpqVIfNP5mXRp2imlcVuqIK8O9aluio24aSxeGCDWlZEltlvAOD9bf6914EkKPgSAL~iHXSnf-a0Krn7YtCgqJKKIYqnyMdv8SbyeGlAL91aa9ldA5oxwULN2bksezag-cs-vDqqk2FKa2IQBls5x1suRoCD2iC4L0nGcawdE8aAPiFNEB~rW0z59V~BKp7igpSBoT1OwGYWb4kMB5oURlOgVEF3diAzuYY01GfijZCppW~rQEtIikyaXILJ0Ynn4U98XkxWRiJFhAQoFV9Mgj0FJdqvyUPSZ0OI2gU7rm8HuOQmKOUAw" width="200">
    <br> 
      <em><u>reCAPTCHA solver cannot click the Play Audio Button to solve the reCAPTCHA</u></em>
  </p> 
</td>
<td bgcolor="white"> 
  <p align="center">
    <br><br><br><br><br><br>
    <img src="https://media-hosting.imagekit.io//8fe28f4c89f442ec/No%20reCAPTCHA.png?Expires=1835776519&Key-Pair-Id=K2ZIVPTIP2VGHC&Signature=RpDaUC07KhXM8RAzcwF0Pso~xB8mt2DBvAPX0gCM3VRlsZx-AcXaoLWoxZeYyt-3E5vnkBB-bg37lovi7zdIFRj4Xeej8zrd1N39EsWmgo2SkmSx8mpH4MtoD~Kl-Cr~ABcGYL4omiFXlAwwS9AsbIzK8r86IJ48zKKpejvHaq5105hyXUMqpw-6ywKrb1XcIo6Fn8Jk2QdK~fOttyfAhY89~XJR6PaMer2F9H1hewXFQGEMGg1XXaZjNv2QAx5ecFEzQ7ESjmdDArRuvi0DZ9Hu1vw~0s~gj9GX3HRo34KD7aBd8Xt2YDqdzzHkjFAAgfcyL-38PKxYCXDX4WTWZQ" width="390">
    <br><br><br><br><br><br><br>
    <em><u>Google blocks us but is not providing any reCAPTCHA for us to solve</u></em>
  </p> 
</td>
</tr></table>

> - In total, the scraper makes at most 100 attempts per company, covering both the Google search and any reCAPTCHA it faces. If the website has still not been scraped after the 100th attempt, the scraper stops retrying and records the company's website as an empty string, to be filtered out in the next Cleaning step.
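
The retry rules described above (pause after every 10th failed attempt, give up after the 100th) can be sketched as a small decision function. This is only an illustration that runs without a browser: the names `Action` and `next_action` and the parameters `max_attempts` and `pause_every` are hypothetical and do not appear in the scraper itself.

```python
from enum import Enum


class Action(Enum):
    SOLVE = "solve"      # run the reCAPTCHA solver, then search again
    PAUSE = "pause"      # wait out a harder block, then search again
    GIVE_UP = "give_up"  # record an empty string and move to the next company


def next_action(failed_attempt, max_attempts=100, pause_every=10):
    """Decide what to do after the given failed attempt (counted from 1)."""
    if failed_attempt >= max_attempts:
        return Action.GIVE_UP
    if failed_attempt % pause_every == 0:
        return Action.PAUSE
    return Action.SOLVE


# Failed attempts after which a single company would trigger a ~10-minute pause
pauses = [n for n in range(1, 101) if next_action(n) is Action.PAUSE]
print(pauses)  # [10, 20, 30, 40, 50, 60, 70, 80, 90]
```

Under these assumed defaults, a company that can never be scraped costs nine pauses (about 90 minutes of waiting) before it is recorded as an empty string.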
   
**TL;DR**: The algorithm of this program is summarised in the flow chart below.

![](<https://media-hosting.imagekit.io//88351df6dce74e61/Flowchart.drawio%20(1).png?Expires=1835776640&Key-Pair-Id=K2ZIVPTIP2VGHC&Signature=Kw3IlBGQzd21qlghtW0aKWLPH~779K-N3--IKxHbLb0LXgfz81MTWjn-Y9Kdm9tWgTjKOg7cbC16GQEHPB9hA2rjnMMkSdexAwQHsVSa0iNRtPwc6G7FLxro4ZpPelpxN2PtgH1a42FVYgg98bn~pTliW-DFNwS6UI1PEAT0P9iB6PDEmTPj-11sqi0obLw2R8wQwwk8BZYlM3GAp8pRKDDCz4NHT7hCvuxnwOTisMb9obWlRVIZs39cXbrhKU1EUZaOYD-abDM8PcIwKMNgBjl45nAftYXaAuaPUwEkoIpBbPygG4485XvXcAyZ5wwowzKeh5EXm9iFq9HF2xPEFw__>)
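
To make Step 2 and Step 2(a) concrete without opening a browser, here is a minimal offline sketch. It builds the search URL with `urllib.parse.quote` and picks out the first \<a> tag whose jsname is "UWckNb" using the standard library's `html.parser` instead of Selenium. The helper names (`build_search_url`, `FirstResultParser`, `first_result_url`) and the sample HTML are hypothetical; real Google result pages are far more complex and the jsname value may change over time.

```python
from html.parser import HTMLParser
from urllib.parse import quote


def build_search_url(company_name):
    """Percent-encode the company name into a Google search URL."""
    return f"https://google.com/search?q={quote(company_name)}"


class FirstResultParser(HTMLParser):
    """Record the href of the first <a> tag whose jsname is "UWckNb"."""

    def __init__(self):
        super().__init__()
        self.first_href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_result = tag == "a" and attributes.get("jsname") == "UWckNb"
        if self.first_href is None and is_result:
            self.first_href = attributes.get("href")


def first_result_url(page_html):
    """Return the first search result URL, or None if the page has none."""
    parser = FirstResultParser()
    parser.feed(page_html)
    return parser.first_href


# Hypothetical, heavily simplified results page
sample_html = """
<div><a href="https://accounts.example.com/login">Sign in</a></div>
<div><a jsname="UWckNb" href="https://www.example.com/"><h3>Example Pte Ltd</h3></a></div>
<div><a jsname="UWckNb" href="https://www.example.org/"><h3>Another result</h3></a></div>
"""

print(build_search_url("Ben & Jerry's"))  # https://google.com/search?q=Ben%20%26%20Jerry%27s
print(first_result_url(sample_html))      # https://www.example.com/
print(first_result_url("<html><body>Unusual traffic detected</body></html>"))  # None
```

The last call mimics a block page with no search results. This sketch returns `None` in that case, whereas the Selenium lookup in the scraper raises an exception, which is what triggers the reCAPTCHA solver.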

In [None]:
import pandas as pd
from seleniumbase import Driver
from selenium_recaptcha_solver import RecaptchaSolver
from selenium.webdriver.common.by import By
import urllib.parse
from time import sleep

In [None]:
driver = Driver(uc=True, headless=False)

df = pd.read_excel("Firms with no Websites 1.xlsx")
# Cast FIRM NAME to string: names consisting only of digits would otherwise be read in as numbers
company_names = df["FIRM NAME"].astype("string").to_list()
company_urls = []
count = 1

def get_urls(company_name):
    global count  # running number shown next to each URL, for monitoring only

    for attempt in range(1, 101):
        try:
            google_url = f"https://google.com/search?q={urllib.parse.quote(company_name)}"
            driver.uc_open(google_url)
            # The first <a> tag with a jsname attribute of "UWckNb" is the first search result
            url = driver.find_element("a[jsname='UWckNb']").get_attribute("href")
        except Exception as error:
            # Usually NoSuchElementException: a reCAPTCHA or block page was shown instead of results
            print(type(error).__name__)
        else:
            # Append only after the lookup succeeded, so each company adds exactly one entry
            company_urls.append(url)
            print(f"{count}. {url}")
            count += 1
            break

        if attempt == 100:
            # Out of retries: record an empty string to be filtered out in the Cleaning step
            company_urls.append("")
            print("Max. retries exceeded")
            print(f"{count}. ")
            count += 1
            break

        if attempt % 10 == 0:
            # Likely a more severe block from Google: wait 10 minutes, then search again
            print("Sleeping...")
            sleep(600)
            continue

        try:
            solver = RecaptchaSolver(driver=driver)
            recaptcha_iframe = driver.find_element(By.XPATH, '//iframe[@title = "reCAPTCHA"]')
            solver.click_recaptcha_v2(iframe=recaptcha_iframe)
        except Exception as secondary_error:
            # Solver failed or no reCAPTCHA iframe was found: the next attempt reloads the search page
            print(type(secondary_error).__name__)
            
for name in company_names:
    get_urls(name)

In [None]:
df["URLs"] = company_urls

df.to_excel("Firms with Websites.xlsx", index = False)