## Web Scraping Car Damage Images from a Website

***Reminder:*** *Before scraping any website, make sure to check the website policies and follow the rules on how to scrape websites ethically.*
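
One way to act on this reminder is to read the site's `robots.txt` before sending any requests. The sketch below uses Python's standard `urllib.robotparser` on a made-up rule set. The rules and the `example.com` URLs are placeholders for illustration, not the policy of any real site; a real check would load the target site's own file with `set_url(...)` followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; they do not describe any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 10
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a generic crawler ("*") may fetch each placeholder URL.
allowed_search = parser.can_fetch("*", "https://example.com/en/search/?page=1")
allowed_account = parser.can_fetch("*", "https://example.com/account/settings")

# Crawl-delay, if the file declares one, suggests a minimum pause between requests.
delay = parser.crawl_delay("*")

print(allowed_search, allowed_account, delay)
```

A `robots.txt` check does not replace reading the site's terms of use; it only covers what the site publishes for crawlers.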

In [1]:
!pip install selenium undetected_chromedriver beautifulsoup4 pandas pyarrow Pillow requests selenium-wire

### **Import Libraries**

In [2]:
# import undetected_chromedriver as uc
import hashlib, io, requests, pandas as pd
import time
import random

from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver import ChromeOptions
# from seleniumwire import webdriver

from bs4 import BeautifulSoup
from pathlib import Path
from PIL import Image

### **Mount Drive**

In [3]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


### **Launch the WebDriver to open a target URL**
Configure Chrome Options

In [4]:
options = ChromeOptions()
options.add_argument("--headless=new")
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36")
# driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
driver = webdriver.Chrome(options=options)

### **Extracting Page URLs**
Each page URL leads to the photos of one listed car, around 12 photos per car. These include both interior and exterior photos of the car. We only need the exterior photos, so only those will be extracted later on.

**Note**: If you encounter a connection error to the website, rerun the Chrome configuration options cell.

The images come from the AutobidMaster website. URLs should be extracted for the front, rear, and side panels, repeating the same process for each panel.

In [8]:
## empty list to store the url
page_urls = []

## range(1, 30) visits search result pages 1 through 29 (the stop value is excluded)
## increase the stop value to cover more pages
for i in range(1, 30):
  url = f"https://www.autobidmaster.com/en/search/make-toyota,lexus,suzuki/doc-type-c,s/damage-front+end/?page={i}"
  driver.get(url)
  time.sleep(random.randint(1,30))
  elements = driver.find_elements(By.CLASS_NAME, "_1dGf-ymm")

  for item in elements:
    # print(img)
    page_urls.append(item.get_attribute('href'))

driver.quit()

Repeat the process until you have extracted the URLs from all pages.
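
To make the repetition easier to manage, the search URLs can be generated up front. The sketch below is an assumption-laden helper, not part of the original notebook: `build_search_urls` and `BASE_SEARCH_URL` are hypothetical names, and only the `front+end` damage slug is taken from the loop above. The slugs for the rear and side panels are not shown in this notebook and would have to be copied from the site's own filter URLs. Note that `range(1, 30)` in the loop above stops at page 29, whereas this helper treats the last page as inclusive.

```python
# Base of the search URL used in the extraction loop above.
BASE_SEARCH_URL = (
    "https://www.autobidmaster.com/en/search/"
    "make-toyota,lexus,suzuki/doc-type-c,s"
)

def build_search_urls(damage_slug, first_page, last_page):
    """Return search result URLs for first_page..last_page, both inclusive."""
    return [
        f"{BASE_SEARCH_URL}/damage-{damage_slug}/?page={page}"
        for page in range(first_page, last_page + 1)
    ]

# "front+end" is the only slug that appears in this notebook.
front_urls = build_search_urls("front+end", 1, 30)

print(len(front_urls))
print(front_urls[0])
```

Each generated URL can then be passed to `driver.get(...)` in place of the f-string inside the loop.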

In [9]:
## check how many page urls were extracted
len(page_urls)

870

#### Saving Extracted Page URLs

In [None]:
## save the urls to a csv file just in case you need it later on
df = pd.DataFrame(page_urls)
df.to_csv('scraped_page_urls_front.csv')

In [7]:
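## run this code in case the runtime suddenly stops and you lose the list
df = pd.read_csv("scraped_page_urls_front.csv")
df.dropna(inplace=True)
## rebuild the list from the "0" column, the header pandas wrote for the unnamed URL column
page_urls = df["0"].tolist()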

### **Extracting Image URLs**

In this step, we extract the image URLs from each listing page. Extracting a large number of URLs is time-consuming, so you can run the extraction in batches of whatever size you prefer.

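**Note**: The page URL extraction cell ends with `driver.quit()`, so rerun the Chrome configuration options cell to create a new driver before running the cell below. Do the same if you encounter a connection error to the website.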
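
For transient connection errors, an alternative to rerunning cells by hand is to retry the failing call a few times with a growing pause. The helper below is a generic sketch: `with_retries` is a hypothetical name, and it catches every `Exception` for simplicity, which is broader than you would normally want. It will not help once the driver has been closed with `driver.quit()`; in that case a new driver is still needed. The demo uses a simulated failing function instead of a real browser call.

```python
import random
import time

def with_retries(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Between tries, wait base_delay * 2**n seconds plus up to one second of jitter.
    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.random())

# Simulated action that fails twice before succeeding.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"

# Record the waits instead of sleeping so the demo finishes immediately.
waits = []
result = with_retries(flaky, attempts=3, base_delay=0.0, sleep=waits.append)
print(result, calls["count"], len(waits))
```

In the notebook, the call would look like `with_retries(lambda: driver.get(url))`, ideally with the `except` clause narrowed to the connection-related exceptions you actually observe.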

In [14]:
## empty list to store the image urls
image_urls = []

## the slice selects which page links from the prior step to visit in this batch
## page_urls[1:50] covers indexes 1 to 49 and skips index 0
## since running this code is time-consuming, adjust the slice for each batch until you finish all links
for page in page_urls[1:50]:
  url = f"{page}"
  driver.get(url)
  time.sleep(random.randint(1,30))
  image_elements = driver.find_elements(By.CLASS_NAME, "_XTkv-img")

  for img in image_elements:
    # print(img)
    image_urls.append(img.get_attribute('data-src'))

driver.quit()

Repeat the process until you finish extracting all image urls.
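
The batch boundaries can be computed instead of edited by hand for each run. The sketch below assumes a batch size of 50, and `batch_bounds` is a hypothetical helper name; the total of 870 is the number of page URLs reported earlier in this notebook.

```python
def batch_bounds(total, batch_size):
    """Yield (start, stop) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# 870 page URLs split into batches of 50 (the batch size is an assumption).
bounds = list(batch_bounds(870, 50))

print(len(bounds), bounds[0], bounds[-1])
```

Each pair can be used as a slice, for example `page_urls[start:stop]`, and saving the collected image URLs after every batch keeps earlier work safe if the runtime stops.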

In [15]:
## check how many image urls were extracted
len(image_urls)

667

#### Saving Extracted Image URLs

In [16]:
## save the urls to a csv file just in case you need it later on
df_image_urls = pd.DataFrame(image_urls)
df_image_urls.to_csv('scraped_image_urls_front.csv')

In [8]:
## run this code in case runtime suddenly stops and you lose the list
# df_image_urls = pd.read_csv("scraped_image_urls_front.csv")
# df_image_urls.dropna(inplace=True)

### **Cleaning the Image URLs**

In [10]:
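## create a list from the URL column
## the URLs are in the last column, whether the DataFrame is in memory or was reloaded from the CSV
image_urls = df_image_urls.iloc[:, -1].tolist()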

## check how many images we have
len(image_urls)

667

In [12]:
## remove any URLs that contain the substring "_ful.jpg"
## images with the substring "_ful.jpg" do not belong to the same car in the list
## also skip missing values (None or NaN), which are not strings
image_urls = [x for x in image_urls if isinstance(x, str) and "_ful.jpg" not in x]

In [19]:
## checking the list
image_urls

['https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/cdf842f6fcec4ad69009e2e565b60717_thb.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/3203820d21f147758654cad23fc44846_thb.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/03234ad005b64c20aaaa2454812858af_thb.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/d8a76a4ba2824ecb8075ec531ad5d647_thb.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/c413ba1cfea444d3a2bd6f09674069ec_thb.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/40020cd8ea74471a849a43da416c5b57_thb.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/0dc2c1c53b5d41abb374b95a0c1cf310_thb.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/b00ed43f5cef49b5a1d26366e501f617_thb.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/370c9ae1663c4cb3baa1de7cafabcb4f_thb.jpg',
 'https://cs.copart

In [21]:
## the target string serves as an indication of a new set of car images
target_string = "/build/spa/images/"
result_links = []
temp_links = []  # Temporary storage

for link in image_urls:
  if target_string in link:
    if len(temp_links) > 4:  ## skip sets with fewer than 5 links to avoid an IndexError
      result_links.append(temp_links[4])  ## Keep the 5th link, which shows the front damage of the car
    result_links.append(link)
    temp_links = [] # Reset temporary list
  else:
    temp_links.append(link)

result_links.extend(temp_links)  # Add any remaining links
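
The grouping logic above can also be written as a small function, which makes it easier to reuse for the rear and side panels with a different position. The sketch below is a variant rather than a copy of the cell: it returns only the selected link per set, skips sets that are too short, and drops both the marker links and any trailing set that has no marker after it. `pick_per_listing` is a hypothetical name, and the `example.com` links are made-up stand-ins for real image URLs.

```python
def pick_per_listing(links, marker, index):
    """Split links into sets ended by a marker link and keep one link per set.

    A set with fewer than index + 1 links is skipped.
    """
    picked, group = [], []
    for link in links:
        if marker in link:
            if len(group) > index:
                picked.append(group[index])
            group = []  # start collecting the next set
        else:
            group.append(link)
    return picked

# Made-up sample: three sets with 6, 3, and 5 thumbnails, each ended by a marker.
marker_link = "/build/spa/images/placeholder.jpg"
sample = [f"https://example.com/a{i}_thb.jpg" for i in range(6)] + [marker_link]
sample += [f"https://example.com/b{i}_thb.jpg" for i in range(3)] + [marker_link]
sample += [f"https://example.com/c{i}_thb.jpg" for i in range(5)] + [marker_link]

picked = pick_per_listing(sample, "/build/spa/images/", 4)
print(picked)
```

Which position corresponds to which panel has to be confirmed by looking at a few listings; this notebook only establishes that the 5th link is the front view.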

In [22]:
## checking resulting list
print("There are a total of", len(result_links), "items in the list")

result_links

There are a total of 98 items in the list


['https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/c413ba1cfea444d3a2bd6f09674069ec_thb.jpg',
 '/build/spa/images/overlays-registeroverlay-img/thumbnail.bcd9772d.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0125/9abb9e2d3b67446fb8dcd4e4a1b31325_thb.jpg',
 '/build/spa/images/overlays-registeroverlay-img/thumbnail.bcd9772d.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/8cbf198d01434abba0cf3d9907964e82_thb.jpg',
 '/build/spa/images/overlays-registeroverlay-img/thumbnail.bcd9772d.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/8a3e722f1c2b4b54880a97ebe49d62bb_thb.jpg',
 '/build/spa/images/overlays-registeroverlay-img/thumbnail.bcd9772d.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0125/55b0857bccce4b4cbfcbf234e39cdfb5_thb.jpg',
 '/build/spa/images/overlays-registeroverlay-img/thumbnail.bcd9772d.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0125/700ddbce3cbc418db24

In [23]:
## remove the marker images, which are not car photos, from the list
temp_url = [x for x in result_links if "/build/spa/images/" not in x]

## replace each thumbnail URL with its full-size version
new_urls = [url.replace("_thb", "_ful") for url in temp_url]
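
`str.replace` swaps every occurrence of `_thb`, wherever it appears in the URL. A slightly stricter sketch is shown below, assuming the thumbnail marker always sits right before a `.jpg` or `.jpeg` extension at the end of the URL. `to_full_size` is a hypothetical helper, and the `example.com` URLs are placeholders.

```python
import re

# Match "_thb" only when it is directly followed by the extension at the end of the URL.
THUMB_SUFFIX = re.compile(r"_thb(\.jpe?g)$", re.IGNORECASE)

def to_full_size(url):
    """Return url with a trailing _thb.jpg swapped for _ful.jpg, else unchanged."""
    return THUMB_SUFFIX.sub(r"_ful\1", url)

converted = to_full_size("https://example.com/img/abc_thb.jpg")
untouched = to_full_size("https://example.com/_thb/abc.jpg")

print(converted)
print(untouched)
```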

In [24]:
## checking the final list of front car damage
print("There are", len(new_urls), "in the list")

new_urls

There are 49 in the list


['https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/c413ba1cfea444d3a2bd6f09674069ec_ful.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0125/9abb9e2d3b67446fb8dcd4e4a1b31325_ful.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/8cbf198d01434abba0cf3d9907964e82_ful.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0225/8a3e722f1c2b4b54880a97ebe49d62bb_ful.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0125/55b0857bccce4b4cbfcbf234e39cdfb5_ful.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0125/700ddbce3cbc418db24f586845b815f4_ful.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/0125/4fc4ae0321b64cad887d72cc93396c80_ful.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/1224/6b21364744bf46e98ed49dc0ecb139a7_ful.jpg',
 'https://cs.copart.com/v1/AUTH_svc.pdoc00001/ids-c-prod-lpp/1224/55b92fe86f294d91b09619f08a11974b_ful.jpg',
 'https://cs.copart

### **Save Images to Drive Folder**

In [27]:
## set folder path
data_path = "/content/drive/My Drive/Colab Notebooks/Data/Scraped Data/Car Damage/Front"

In [28]:
## Create the folder if it does not exist yet
Path(data_path).mkdir(parents=True, exist_ok=True)

for img in new_urls:
  ## Download the image; the timeout keeps a stalled request from blocking the loop
  response = requests.get(img, timeout=30)
  response.raise_for_status()
  image_content = response.content

  ## Wrap the downloaded bytes in a file-like object
  image_file = io.BytesIO(image_content)

  ## Use Pillow to open the image and convert it to RGB
  image = Image.open(image_file).convert("RGB")

  ## Build the file path inside data_path.
  ## The file name is the first 10 characters of the SHA-1 hex digest of 'image_content'.
  file_path = Path(data_path, hashlib.sha1(image_content).hexdigest()[:10] + ".jpeg")
  ## 95 is the top of Pillow's documented JPEG quality scale
  image.save(file_path, "JPEG", quality=95)
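
Naming each file after a hash of its content has a useful side effect: downloading the same image twice produces the same file name, so the second copy overwrites the first instead of creating a duplicate. The sketch below illustrates this with stand-in byte strings and a temporary folder; `hashed_name` is a hypothetical helper that mirrors the naming expression used in the loop above.

```python
import hashlib
import tempfile
from pathlib import Path

def hashed_name(content: bytes, suffix: str = ".jpeg", length: int = 10) -> str:
    """Return the first `length` hex characters of the SHA-1 digest plus a suffix."""
    return hashlib.sha1(content).hexdigest()[:length] + suffix

# Stand-in byte strings; real input would be the downloaded image bytes.
downloads = [b"image A", b"image B", b"image A"]

with tempfile.TemporaryDirectory() as folder:
    for content in downloads:
        Path(folder, hashed_name(content)).write_bytes(content)
    saved_names = sorted(p.name for p in Path(folder).iterdir())

print(len(downloads), "downloads ->", len(saved_names), "files")
```

Truncating the digest to 10 characters keeps names short at the cost of a small, non-zero chance that two different images share a name.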

Repeat the same process for every car panel.