# MIIE Coding Task
*Author: Jiayan LI*

The task was to scrape a list of image URLs from classmates.com and download the corresponding images. I did this in two main steps: automating the Chrome browser using `Selenium` with `undetected_chromedriver`, and downloading from the image URLs with `urllib.request`.

My initial attempt with the `requests` library returned an HTTP 403 (Forbidden) error, which means the server refused access to the resource I was trying to retrieve. On further investigation, I found that the website's terms of service prohibit automated access to its content by scripts and bots.

In my second attempt using `Selenium`, I encountered several challenges. When using Chrome and Firefox drivers, a Cloudflare verification page appeared, indicating that the website is protected by Cloudflare and requires additional verification. When using Safari, the browser crashed repeatedly, preventing further progress.

In an attempt to bypass the Cloudflare verification, I experimented with different solutions, and using an undetected ChromeDriver helped me successfully log in.

When processing the image links, I again hit a 403 error with the `requests` library. I resolved it by building a `urllib.request` opener that sends a browser-like `User-Agent` header, after which the images downloaded successfully.

Overall, due to the website's protection mechanisms and terms of service, I faced challenges in both scraping and downloading images but ultimately solved them.

In [None]:
# import necessary libraries
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import time
import undetected_chromedriver as uc
import urllib.request
import os

## Step 1 & 2: Log in & Scrape one page
Stored as `test.jpg`

### Using `Selenium` to scrape the image urls

Using undetected chromedriver

Ref: https://stackoverflow.com/questions/71518406/how-to-bypass-cloudflare-browser-checking-selenium-python

In [8]:
# Create a driver instance
driver = uc.Chrome(use_subprocess=True)

In [9]:
# Navigate to the webpage
url = "https://www.classmates.com/siteui/yearbooks/4182755124?page=7"
driver.get(url)

# Set an implicit wait: element lookups poll for up to 10 seconds before failing
driver.implicitly_wait(10)

The log-in page pops up at this point.

In [10]:
# Find the login form elements and fill in the credentials
username_input = driver.find_element(By.NAME, "emailOrRegId")
password_input = driver.find_element(By.NAME, "password")

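# Fill in credentials, read from environment variables rather than hard-coded
# (the variable names CLASSMATES_EMAIL and CLASSMATES_PASSWORD are illustrative)
username = os.environ["CLASSMATES_EMAIL"]
password = os.environ["CLASSMATES_PASSWORD"]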

username_input.send_keys(username)
password_input.send_keys(password)

time.sleep(2)

In [11]:
# Scroll down the page, mimicking human behavior
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)

In [12]:
# Submit the login form
submit_button = driver.find_element(By.ID, "login-button")
driver.execute_script("arguments[0].click();", submit_button)

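# Wait (up to 10 seconds) for a yearbook image to appear after login
wait = WebDriverWait(driver, 10)
wait.until(lambda d: d.find_elements(By.XPATH, "//img[contains(@src, 'https://yb.')]"))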

Step 2 is satisfied: the log-in process is complete and the browser has arrived at the [webpage](https://www.classmates.com/siteui/yearbooks/4182755124?page=7).
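Waiting for the page after login comes down to polling a condition until it holds or a timeout expires. The sketch below shows that idea in plain Python; `wait_until` is a hypothetical helper written for illustration, not part of Selenium, and the timeout values are assumptions.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)
```

In this notebook, the condition could be something like `lambda: driver.find_elements(By.XPATH, "//img[contains(@src, 'https://yb.')]")`, since an empty list is falsy and a non-empty list of elements is truthy.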

In [16]:
# Find all image elements using XPath
image_elements = driver.find_elements(By.XPATH, "//img[contains(@src, 'https://yb.')]")  # Example XPath

# Extract the image URLs
image_urls = [element.get_attribute("src") for element in image_elements]

In [None]:
# Get the test image url
test_url = image_urls[0]

### Processing image URL

ref: https://stackoverflow.com/questions/34692009/download-image-from-url-using-python-urllib-but-receiving-http-error-403-forbid

In [38]:
# Use opener
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

# Retrieve the image file
local = 'test.jpg'  # specify the path
urllib.request.urlretrieve(test_url, local)

('test.jpg', <http.client.HTTPMessage at 0x7fd066222f10>)

Step 1 satisfied, `test.jpg` now in the directory.
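Installing an opener changes the headers for every later `urllib.request` call in the session. An alternative is to attach the header to a single request. The following is a minimal sketch, assuming the same User-Agent workaround is enough for the image host; `build_request` and `fetch_image` are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')

def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

def fetch_image(url, local_path, timeout=30):
    """Download url to local_path using a per-request header (no global opener)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response, \
            open(local_path, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    return local_path
```

With this helper, the test download would read `fetch_image(test_url, 'test.jpg')`.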

## Step 3: Streamline the process
Scrape all the pages of the yearbook (2010-Alameda-High-School). Results stored under the directory `images`

In [39]:
# Create a list that contains every yearbook page URL
yearbook_urls = []

for i in range(1, 297):
    page_url = f"https://www.classmates.com/siteui/yearbooks/4182755124?page={i}"
    yearbook_urls.append(page_url)

In [40]:
# Inspect if I've got every page url
yearbook_urls[-1]

'https://www.classmates.com/siteui/yearbooks/4182755124?page=296'
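The same list can be produced by a small function, which makes the yearbook ID and page count explicit parameters. This is a sketch only; `build_page_urls` and `BASE_URL` are hypothetical names, and the URL pattern is taken from the page URLs used above.

```python
BASE_URL = "https://www.classmates.com/siteui/yearbooks/{yearbook_id}?page={page}"

def build_page_urls(yearbook_id, n_pages):
    """Return the page URLs 1..n_pages (inclusive) for one yearbook."""
    return [BASE_URL.format(yearbook_id=yearbook_id, page=page)
            for page in range(1, n_pages + 1)]

page_urls = build_page_urls(4182755124, 296)
```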

### Streamline the image-URL-scraping process

In [50]:
from selenium.common.exceptions import NoSuchElementException

def get_image_url(yearbook_page_url):
    '''
    Scrape the image URL from the specified yearbook page URL,
    logging in first if the log-in form appears.
    '''

    driver.get(yearbook_page_url)

    # Element lookups poll for up to 5 seconds before raising
    driver.implicitly_wait(5)

    # Check whether a log-in page was encountered
    try:
        username_input = driver.find_element(By.NAME, "emailOrRegId")
    except NoSuchElementException:
        username_input = None  # no log-in form, so the session is already logged in

    if username_input is not None:
        print("Log-in element exists")
        password_input = driver.find_element(By.NAME, "password")

        # Fill in credentials, read from environment variables rather than hard-coded
        # (the variable names CLASSMATES_EMAIL and CLASSMATES_PASSWORD are illustrative)
        username_input.send_keys(os.environ["CLASSMATES_EMAIL"])
        password_input.send_keys(os.environ["CLASSMATES_PASSWORD"])
        time.sleep(2)

        # Scroll down the page, mimicking human behavior
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)

        # Submit the login form
        submit_button = driver.find_element(By.ID, "login-button")
        driver.execute_script("arguments[0].click();", submit_button)

    # Wait (up to 10 seconds) for the yearbook image element, found via XPath
    image_element = WebDriverWait(driver, 10).until(
        lambda d: d.find_element(By.XPATH, "//img[contains(@src, 'https://yb.')]")  # Example XPath
    )

    # Extract the image URL
    image_url = image_element.get_attribute("src")

    return image_url

Scraping image URLs in batches using the function above

In [54]:
image_urls = []
failed_attempts = []
batch_size = 10  # Number of URLs to process in each batch

# Calculate the total number of batches
total_batches = len(yearbook_urls) // batch_size + (1 if len(yearbook_urls) % batch_size != 0 else 0)

for batch in range(total_batches):
    start_index = batch * batch_size
    end_index = (batch + 1) * batch_size

    for i, page_url in enumerate(yearbook_urls[start_index:end_index], start=start_index):
        try:
            image_urls.append(get_image_url(page_url))
        except Exception as e:
            failed_attempts.append(i)
            print(f"Attempt {i} failed: {str(e)}")

    print(f"Batch {batch + 1} completed")

Batch 1 completed
Batch 2 completed
Batch 3 completed
Batch 4 completed
Batch 5 completed
Batch 6 completed
Batch 7 completed
Batch 8 completed
Batch 9 completed
Batch 10 completed
Batch 11 completed
Batch 12 completed
Batch 13 completed
Batch 14 completed
Batch 15 completed
Batch 16 completed
Batch 17 completed
Batch 18 completed
Batch 19 completed
Batch 20 completed
Batch 21 completed
Batch 22 completed
Batch 23 completed
Batch 24 completed
Batch 25 completed
Batch 26 completed
Batch 27 completed
Batch 28 completed
Batch 29 completed
Batch 30 completed
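In the loop above, a failed page is skipped, so every later entry in `image_urls` shifts by one position relative to its page number. The sketch below keeps results keyed by page index and retries failures; `iter_batches` and `scrape_all` are hypothetical helpers, and `scrape_one` stands in for `get_image_url`.

```python
def iter_batches(items, batch_size):
    """Yield (batch_number, start_index, batch); the last batch may be shorter."""
    for start in range(0, len(items), batch_size):
        yield start // batch_size + 1, start, items[start:start + batch_size]

def scrape_all(page_urls, scrape_one, batch_size=10, max_retries=2):
    """Apply scrape_one to every URL, retrying failures; results keep page order."""
    results = {}
    pending = list(range(len(page_urls)))  # indices still to be scraped
    for _ in range(max_retries + 1):
        failed = []
        for batch_number, _, batch in iter_batches(pending, batch_size):
            for i in batch:
                try:
                    results[i] = scrape_one(page_urls[i])
                except Exception as e:
                    failed.append(i)
                    print(f"Attempt {i} failed: {e}")
            print(f"Batch {batch_number} completed")
        pending = failed
        if not pending:
            break
    # Pages that never succeeded are left as None so positions stay aligned
    ordered = [results.get(i) for i in range(len(page_urls))]
    return ordered, pending
```

A call such as `image_urls, failed_attempts = scrape_all(yearbook_urls, get_image_url)` would then return one entry per page, with `None` marking any page that still failed after the retries.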


In [56]:
len(image_urls)

296

In [57]:
image_urls[:5]

['https://yb.cmcdn.com/yearbooks/1/4/f/5/14f548096a10e39b9a984e8461acec86/440/0001.jpg?h=39f9265d252277f7817d0c50c274a54d',
 'https://yb.cmcdn.com/yearbooks/1/4/f/5/14f548096a10e39b9a984e8461acec86/440/0002.jpg?h=0e35a7cb66c07162259002f6c8f4a734',
 'https://yb.cmcdn.com/yearbooks/1/4/f/5/14f548096a10e39b9a984e8461acec86/440/0003.jpg?h=1ea75290fd69e34b0d17dc5d0f293674',
 'https://yb.cmcdn.com/yearbooks/1/4/f/5/14f548096a10e39b9a984e8461acec86/440/0004.jpg?h=d157fdc61755da7ac709acf742013289',
 'https://yb.cmcdn.com/yearbooks/1/4/f/5/14f548096a10e39b9a984e8461acec86/440/0005.jpg?h=9705548ecb80d5f4287c41a8fc60fdce']

Save the image URLs to `image_urls.txt`

In [58]:
file_path = 'image_urls.txt'  # Path to the text file

with open(file_path, 'w') as file:
    for url in image_urls:
        file.write(url + '\n')

print("Image URLs have been written to the file:", file_path)

Image URLs have been written to the file: image_urls.txt
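Saving the URLs to a file means the download step can be rerun later without repeating the browser session. A minimal sketch of the round trip follows; `save_urls` and `load_urls` are hypothetical helpers, and the example.com URLs are placeholders used only for the demonstration.

```python
import tempfile
from pathlib import Path

def save_urls(urls, file_path):
    """Write one URL per line."""
    Path(file_path).write_text(''.join(url + '\n' for url in urls), encoding='utf-8')

def load_urls(file_path):
    """Read the URLs back, skipping blank lines."""
    lines = Path(file_path).read_text(encoding='utf-8').splitlines()
    return [line.strip() for line in lines if line.strip()]

# Round-trip check in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'image_urls.txt'
    save_urls(['https://example.com/0001.jpg', 'https://example.com/0002.jpg'], demo_path)
    restored = load_urls(demo_path)
```

In a fresh session, `image_urls = load_urls('image_urls.txt')` would restore the list written above.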


In [None]:
driver.quit()

### Download images from the collected image URLs


In [59]:
# Create the directory images
directory = 'images'

# Create the directory if it does not already exist
os.makedirs(directory, exist_ok=True)

In [60]:
# Build an opener
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

In [61]:
# Download each image in the URL list; files are named by list index (0.jpg, 1.jpg, ...)
for i, url in enumerate(image_urls):
    local = f'images/{i}.jpg'
    urllib.request.urlretrieve(url, local)
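The loop above stops at the first failed download and names files by list index rather than by page. The sketch below is one possible variant: it reuses the page file name embedded in each URL (for example `0001.jpg`), skips files that already exist, and collects failures instead of raising. It assumes the opener with the User-Agent header has already been installed; `filename_from_url`, `download_all`, and the 0.5-second delay are hypothetical choices.

```python
import os
import time
import urllib.request
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path component of url, ignoring the query string."""
    return os.path.basename(urlparse(url).path)

def download_all(urls, directory='images', delay=0.5):
    """Download each URL into directory; return a list of (url, error) for failures."""
    os.makedirs(directory, exist_ok=True)
    failures = []
    for url in urls:
        local = os.path.join(directory, filename_from_url(url))
        if os.path.exists(local):
            continue  # already downloaded in an earlier run
        try:
            urllib.request.urlretrieve(url, local)
        except OSError as e:  # URLError and HTTPError are OSError subclasses
            failures.append((url, str(e)))
        time.sleep(delay)  # short pause between requests
    return failures
```

Calling `failures = download_all(image_urls)` would leave any failed URLs in `failures` for a later retry.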