# Web Scraping Tutorial

Requirements:

- Selenium
- ChromeDriver


Useful Links:

[Selenium Documentation](https://www.selenium.dev/documentation/)

[XPath Syntax](https://www.w3schools.com/xml/xpath_syntax.asp)

[Regex Syntax](https://www.w3schools.com/python/python_regex.asp)

## Setting up environment

### Installing libraries

In [1]:
%%shell
# Set up for running Selenium in Google Colab (the %%shell magic must be the first line of the cell)
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
CHROME_DRIVER_VERSION=`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P /tmp/
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/
chmod +x /tmp/chromedriver
mv /tmp/chromedriver /usr/local/bin/chromedriver
pip install selenium

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Ign:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:8 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy Release.gpg [793 B]
Hit:10 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [3,108 kB]
Get:14 http://archive.u



### Importing libraries

In [2]:
!pip install chromedriver-autoinstaller
!pip install selenium-stealth

Collecting chromedriver-autoinstaller
  Downloading chromedriver_autoinstaller-0.6.4-py3-none-any.whl.metadata (2.1 kB)
Downloading chromedriver_autoinstaller-0.6.4-py3-none-any.whl (7.6 kB)
Installing collected packages: chromedriver-autoinstaller
Successfully installed chromedriver-autoinstaller-0.6.4
Collecting selenium-stealth
  Downloading selenium_stealth-1.0.6-py3-none-any.whl.metadata (6.4 kB)
Downloading selenium_stealth-1.0.6-py3-none-any.whl (32 kB)
Installing collected packages: selenium-stealth
Successfully installed selenium-stealth-1.0.6


In [3]:
from selenium_stealth import stealth
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
import chromedriver_autoinstaller

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

import pandas as pd
import re

## Data collection

### Simple example - IMDb Movies

In [5]:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # Run Chrome in headless mode (no GUI). Useful for server-side automation.
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.set_capability("browserVersion", "114.0.5735.90")
chrome_options.add_argument("--window-size=2560,1440")
chromedriver_autoinstaller.install()

# Initialize the Chrome WebDriver with the previously defined options.
driver = webdriver.Chrome(options=chrome_options)

# Setup stealth settings to make the browser less detectable by websites (helps to prevent blocking by anti-bot mechanisms).
stealth(driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

# Target URL
url = "https://www.imdb.com/chart/top/"

# Navigate the browser to the target page
driver.get(url)

# List that stores the element wrapping each movie entry
movie_list = driver.find_elements(By.CLASS_NAME, "ipc-metadata-list-summary-item__tc")
print(f'Found possibly {len(movie_list)} movies')

Found possibly 250 movies


In [6]:
# Loop through each movie element and print its outer HTML
for index, movie in enumerate(movie_list):
    print(f"Movie {index + 1} HTML:")
    print(movie.get_attribute("outerHTML"))
    print("----------------------------------------------------")

Movie 1 HTML:
<div class="ipc-metadata-list-summary-item__tc"><span class="ipc-metadata-list-summary-item__t" aria-disabled="false"></span><div class="sc-b189961a-0 iqHBGn cli-children"><div class="ipc-title ipc-title--base ipc-title--title ipc-title-link-no-icon ipc-title--on-textPrimary sc-b189961a-9 bnSrml cli-title"><a href="/title/tt0111161/?ref_=chttp_t_1" class="ipc-title-link-wrapper" tabindex="0"><h3 class="ipc-title__text">1. The Shawshank Redemption</h3></a></div><div class="sc-b189961a-7 btCcOY cli-title-metadata"><span class="sc-b189961a-8 hCbzGp cli-title-metadata-item">1994</span><span class="sc-b189961a-8 hCbzGp cli-title-metadata-item">2h 22m</span><span class="sc-b189961a-8 hCbzGp cli-title-metadata-item">R</span></div><span class="sc-b189961a-1 kcRAsW"><div class="sc-e2dbc1a3-0 jeHPdh sc-b189961a-2 bglYHz cli-ratings-container" data-testid="ratingGroup--container"><span aria-label="IMDb rating: 9.3" class="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb r
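
The outer HTML printed above carries more than the title and year. For instance, the rating appears in an `aria-label` attribute (`IMDb rating: 9.3`). As a minimal sketch, assuming the label always follows that `IMDb rating: <number>` shape, a small helper can pull the number out of the label text. The name `parse_rating` is hypothetical, and in a live session the label string would presumably come from reading the `aria-label` attribute of the rating element.

```python
import re
from typing import Optional

# Assumed label shape, taken from the sample output above: "IMDb rating: 9.3"
RATING_PATTERN = re.compile(r"IMDb rating:\s*(\d+(?:\.\d+)?)")


def parse_rating(aria_label: str) -> Optional[float]:
    """Return the rating as a float, or None if the label has another shape."""
    match = RATING_PATTERN.search(aria_label)
    return float(match.group(1)) if match else None


print(parse_rating("IMDb rating: 9.3"))
print(parse_rating("Not rated"))
```

Returning `None` instead of raising keeps a scraping loop running when one entry has an unexpected label, at the cost of having to check for missing values afterwards.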

Extract the title and year from inside each movie element:

In [7]:
movies_scraped = []

for movie in movie_list:
  movie_element = {}
  movie_element["title"] = movie.find_element(By.CLASS_NAME, "ipc-title__text").text
  movie_element["year"] = movie.find_element(By.XPATH, './/div[2]//span[1]').text
  movies_scraped.append(movie_element)

movies_df = pd.DataFrame(movies_scraped)
movies_df.head(10)

| | title | year |
| --- | --- | --- |
| 0 | 1. The Shawshank Redemption | 1994 |
| 1 | 2. The Godfather | 1972 |
| 2 | 3. The Dark Knight | 2008 |
| 3 | 4. The Godfather Part II | 1974 |
| 4 | 5. 12 Angry Men | 1957 |
| 5 | 6. Schindler's List | 1993 |
| 6 | 7. The Lord of the Rings: The Return of the King | 2003 |
| 7 | 8. Pulp Fiction | 1994 |
| 8 | 9. The Lord of the Rings: The Fellowship of th... | 2001 |
| 9 | 10. The Good, the Bad and the Ugly | 1966 |
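
To see what the relative XPath `.//div[2]//span[1]` selects, the sketch below runs the same expression against a trimmed, hand-written copy of the first movie's HTML. It uses the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The snippet is an assumption: the hashed class names and the ratings block are left out, and real pages may not parse as XML, so this is only an offline illustration, not a replacement for the Selenium lookup.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-trimmed version of one movie element (assumed structure)
SNIPPET = """
<div class="ipc-metadata-list-summary-item__tc">
  <span class="ipc-metadata-list-summary-item__t"></span>
  <div class="cli-children">
    <div class="cli-title">
      <a href="/title/tt0111161/"><h3 class="ipc-title__text">1. The Shawshank Redemption</h3></a>
    </div>
    <div class="cli-title-metadata">
      <span class="cli-title-metadata-item">1994</span>
      <span class="cli-title-metadata-item">2h 22m</span>
      <span class="cli-title-metadata-item">R</span>
    </div>
  </div>
</div>
"""

movie = ET.fromstring(SNIPPET)

# The leading "." keeps the search inside this movie element.
# "div[2]" matches a div that is the second div child of its parent
# (here the metadata row); "span[1]" then takes its first span.
year_span = movie.find(".//div[2]//span[1]")
title_h3 = movie.find(".//h3[@class='ipc-title__text']")

print(title_h3.text)
print(year_span.text)
```

Positional XPath like this depends on the order of sibling elements, so it is likely to break if the page layout changes; a class-based lookup tends to be more robust when a stable class name is available.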


In [8]:
movies_scraped = []

for movie in movie_list:
  movie_element = {}
  movie_element["title"] = movie.find_element(By.CLASS_NAME, "ipc-title__text").text
  movie_element["year"] = movie.find_element(By.XPATH, './/div[2]//span[1]').text

  pattern = r"(\d+)\.\s*(.*)" # Regex pattern: rank number, a dot, optional spaces, then the title

  # Use re.match to split the rank number from the title
  match = re.match(pattern, movie_element["title"])

  if match:
    # Group 1 is the rank number, Group 2 is the title
    movie_element["title"] = match.group(2)
  else:
    # Keep the original text so a title from a previous iteration is never reused
    print(f"No match found for: {movie_element['title']}")

  movies_scraped.append(movie_element)

movies_df = pd.DataFrame(movies_scraped)
movies_df.head(10)

| | title | year |
| --- | --- | --- |
| 0 | The Shawshank Redemption | 1994 |
| 1 | The Godfather | 1972 |
| 2 | The Dark Knight | 2008 |
| 3 | The Godfather Part II | 1974 |
| 4 | 12 Angry Men | 1957 |
| 5 | Schindler's List | 1993 |
| 6 | The Lord of the Rings: The Return of the King | 2003 |
| 7 | Pulp Fiction | 1994 |
| 8 | The Lord of the Rings: The Fellowship of the Ring | 2001 |
| 9 | The Good, the Bad and the Ugly | 1966 |
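
The cell above discards the rank number after stripping it from the title. As a minimal sketch of keeping both pieces, the hypothetical helper `split_rank_and_title` below uses named groups and returns the rank as an integer, falling back to `None` when the text has no `N.` prefix. The sample strings are taken from the output shown earlier; the helper name and return shape are assumptions, not part of the notebook.

```python
import re
from typing import Optional, Tuple

# Same idea as the pattern used above, with named groups for readability
RANKED_TITLE = re.compile(r"(?P<rank>\d+)\.\s*(?P<title>.*)")


def split_rank_and_title(text: str) -> Tuple[Optional[int], str]:
    """Split '5. 12 Angry Men' into (5, '12 Angry Men')."""
    match = RANKED_TITLE.match(text)
    if match is None:
        # No leading rank: return the text unchanged
        return None, text
    return int(match.group("rank")), match.group("title")


for raw in ["1. The Shawshank Redemption", "5. 12 Angry Men", "Pulp Fiction"]:
    print(split_rank_and_title(raw))
```

Because the pattern requires a literal dot after the digits, a title that merely starts with a number (such as `12 Angry Men` without a rank prefix) is left untouched.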
