## Project Description

### Overview

This project involves collecting and analyzing data related to adoptable dogs from Petfinder.com, a reputable online database of animals who need homes. The data was collected on November 1st, 2024 using web scraping techniques to gather information on dogs available for adoption in a specific region.

### Source of the Data

- Website: Petfinder.com
- Focus: Adoptable dogs listed on the website
- Location Filter: Data was restricted to dogs available in Georgia to maintain a manageable dataset and focus on a specific geographical area

### Data Collected

The following information was collected for each dog:
- Pet ID: A unique identifier assigned by Petfinder
- Name: The name given to the dog by the shelter or rescue organization
- Primary Breed: The primary breed of the dog
- Secondary Breed: The secondary breed of the dog
- Mixed Breed: An indicator (e.g., Yes or No) showing whether the dog is of mixed breed
- Age: Categorized as Baby, Young, Adult, or Senior
- Sex: Male or Female
- Size: The size category of the dog, such as Small, Medium, Large, or Extra Large
- Primary Colour: The predominant color of the dog's coat
- Secondary Colour: The secondary color present in the dog's coat, if any
- Coat Length: The length of the dog's coat, categorized as Hairless, Short, Medium, or Long
- Shelter Name: The name of the shelter or rescue organization currently caring for the dog
- Zip Code: The postal code of the shelter's location
- Number of Photos: The number of photos available for the dog in its listing
- Children: An indicator of whether the dog is suitable for homes with children
- Cats: An indicator of whether the dog gets along well with cats
- Other Dogs: An indicator of whether the dog is friendly towards other dogs
- Characteristics: Descriptive traits and personality attributes of the dog
- House Trained: An indicator of whether the dog is trained to eliminate outside or in designated areas
- Health: Information regarding the dog's health status, including vaccinations, spay/neuter status, and any special needs
- Adoption Fee: The cost associated with adopting the dog, when available

To begin, I am collecting all available data for each dog and will refine the scope during the analysis phase once I determine the specific questions I want to address.
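
Most of the fields above are read later in this notebook from a JSON string stored in a `targeting` attribute on each dog's detail page. The sketch below shows that extraction step in isolation. The payload values are invented for illustration, the field list is shortened, and `extract_fields` is a hypothetical helper name rather than a function used by the scraper.

```python
import json

# Invented example payload; real pages carry many more keys.
sample_targeting = '{"Pet_ID": "12345", "Pet_Name": "Biscuit", "Age": "Young"}'

# (output column name, key in the JSON payload)
FIELDS = [
    ("pet_id", "Pet_ID"),
    ("pet_name", "Pet_Name"),
    ("age", "Age"),
    ("size", "Size"),
]

def extract_fields(targeting_json, fields=FIELDS, missing="N/A"):
    """Return (record, missing_columns) for one dog."""
    payload = json.loads(targeting_json)
    # Fall back to a placeholder when a key is absent from the payload
    record = {column: payload.get(key, missing) for column, key in fields}
    missing_columns = [column for column, value in record.items() if value == missing]
    return record, missing_columns

record, missing_columns = extract_fields(sample_targeting)
print(record)
print(missing_columns)
```

Returning the list of missing columns alongside the record makes it straightforward to log gaps per dog without interrupting the scrape.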

### Decisions to Restrict Data Collection

- Respecting Terms of Service: Ensured that the scraping process complied with Petfinder's terms of use and robots.txt file
- Error Handling: Implemented exception handling to manage unexpected issues without causing undue strain on the website
- Duplicate Avoidance: Implemented checks to prevent collecting duplicate entries, ensuring each pet is uniquely represented
- **Data Cleaning and Validation: Please see "data_cleaning.ipynb"**
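
The robots.txt check in the first bullet can be partly automated with Python's standard `urllib.robotparser`. The sketch below tests paths against a set of rules that has already been downloaded. The rules, URLs, and the `build_parser` helper are made-up examples for illustration; they are not Petfinder's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only;
# this is NOT Petfinder's real file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("*", "https://example.com/search/dogs/"))
print(parser.can_fetch("*", "https://example.com/private/data"))
print(parser.crawl_delay("*"))
```

A scraper could call `can_fetch` before each request and sleep for the reported crawl delay between requests.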

## Importing Needed Libraries 

In [2]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException, InvalidSelectorException
from selenium.webdriver.chrome.service import Service

import pandas as pd
import time
import json
import re
import matplotlib.pyplot as plt
import numpy as np
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, wait
import threading
import traceback


## Pulling the Data

In [None]:
def thread_name_fn():
    thread_name_raw = threading.current_thread().name
    thread_name = re.sub(r"ThreadPoolExecutor-\d+_(\d+)", r"ThreadPoolExecutor-0_\1", thread_name_raw)
    return thread_name 
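
To illustrate the substitution above: in CPython, `ThreadPoolExecutor` names its worker threads `ThreadPoolExecutor-<pool>_<worker>` by default, and the pool index increases each time a new executor is created. The regex rewrites the pool index to 0 so that only the worker index distinguishes threads. A small standalone check, with thread names passed in as plain strings (`normalize_thread_name` is a hypothetical name for this sketch):

```python
import re

def normalize_thread_name(raw_name: str) -> str:
    """Rewrite the pool index in a default executor thread name to 0."""
    return re.sub(r"ThreadPoolExecutor-\d+_(\d+)", r"ThreadPoolExecutor-0_\1", raw_name)

# A worker from the fourth executor created in this process
print(normalize_thread_name("ThreadPoolExecutor-3_2"))
# Names that do not match the pattern pass through unchanged
print(normalize_thread_name("MainThread"))
```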

In [None]:
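# Cache of WebDriver instances keyed by normalized thread name,
# so that each worker thread reuses a single browser
drivers_dict = {}

def create_driver_fn(options):
    global drivers_dict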
    thread_name = thread_name_fn()
    if thread_name not in drivers_dict:
        service = Service(executable_path=ChromeDriverManager().install())
        drivers_dict[thread_name] = webdriver.Chrome(service=service, options=options)
    driver = drivers_dict[thread_name]
    return driver
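
An alternative to keying a shared dictionary by thread name is `threading.local`, which gives each thread its own attribute namespace. The sketch below uses that swapped-in technique; it is not the approach taken in this notebook. `make_resource` is a hypothetical stand-in for the `webdriver.Chrome(...)` call so that the example runs without a browser.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def make_resource():
    """Hypothetical stand-in for creating a WebDriver."""
    return object()

def get_resource():
    """Create the resource on first use in this thread, then reuse it."""
    if not hasattr(_local, "resource"):
        _local.resource = make_resource()
    return _local.resource

def worker(_):
    # Two calls in the same thread should return the same object
    first = get_resource()
    second = get_resource()
    return first is second

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(worker, range(6)))

print(all(results))
```

With this pattern no thread name parsing is needed, although cleaning up the per-thread resources at shutdown still has to be handled separately.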

In [None]:
def navigate_to_page(driver, page_number):
    current_page = 1
    while current_page < page_number:
        try:
            # Waiting for the "Next" button to be clickable
            next_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, '//button[span[text()="Next"]]'))
            )
            # Scrolling to the "Next" button and clicking it
            driver.execute_script("arguments[0].scrollIntoView(true);", next_button)
            next_button.click()
            current_page += 1
            # Waiting for the new page of results to load
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//a[@class="petCard-link"]'))
            )
            # Adding a short delay to ensure the page loads 
            time.sleep(1)  
        except Exception as e:
            print(f"Error navigating from page {current_page} toward page {page_number}: {e}")
            break

In [None]:
def get_total_pages(driver):
    try:
        # Waiting for the custom element to be present
        total_pages_element = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located(
                (By.XPATH, '//*[@id="page-select_List_Box_Btn"]')
            )
        )

        # Getting the number of total pages and extracting the digits
        total_pages_text = total_pages_element.text.strip()
        total_page_result = re.search(r'/(\d+)', total_pages_text)
        total_page = total_page_result.group(1)
        
        # Converting from a string to an integer 
        total_page_num = int(total_page)
        return total_page_num

    except Exception as e:
        print(f"Error getting total pages: {e}")
        if 'driver' in locals():
            driver.save_screenshot(f"get_total_pages_error_{int(time.time())}.png")
        traceback.print_exc()
        return 1
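
The page-count parsing above relies on the button text containing a slash followed by the total, for example `1/25` (the exact text format is an assumption inferred from the regex). A standalone sketch of that parsing step, which returns a fallback instead of raising when the pattern is absent; `parse_total_pages` is a hypothetical helper name:

```python
import re

def parse_total_pages(button_text: str, default: int = 1) -> int:
    """Extract the number after '/' from text such as '1/25'."""
    match = re.search(r"/(\d+)", button_text.strip())
    return int(match.group(1)) if match else default

print(parse_total_pages("1/25"))
print(parse_total_pages("PAGE 3/140 "))
print(parse_total_pages("no pages"))
```

Checking for a missing match explicitly avoids relying on the broad `except` block to catch the `AttributeError` raised by calling `.group()` on `None`.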
    
    

In [None]:
# Function to search for dogs in a given location and return the search results URL
def search_dogs(URL, location, driver):
    try:
        # Navigating to the website
        driver.get(URL)

        # Giving the website some time to load (to ensure "Dogs" will set in the animal type search bar) 
        time.sleep(2)

        # Waiting for animal type search bar to load and be ready for interaction 
        form_type = WebDriverWait(driver, 40).until(
            EC.element_to_be_clickable((By.ID, 'simpleSearchAnimalType'))
        )
        form_type.clear()
        form_type.send_keys("Dogs")

        # Finding the location search bar element and ensuring it is ready for interaction
        form_location = WebDriverWait(driver, 40).until(
            EC.element_to_be_clickable((By.ID, 'simpleSearchLocation'))
        )
        form_location.clear()
        form_location.send_keys(location)
        
        # Allowing the location to be set before clicking the search button
        time.sleep(2)

        # Clicking the search button to submit the form
        search_button = WebDriverWait(driver, 40).until(
            EC.element_to_be_clickable((By.ID, 'petSearchBarSearchButton'))
        )
        # search_button = driver.find_element(By.ID, 'petSearchBarSearchButton')
        search_button.click()

        # Wait for the results to load
        WebDriverWait(driver, 40).until(
            EC.presence_of_element_located((By.XPATH, '//a[@class="petCard-link"]'))
        )
        search_results_url = driver.current_url
        print(search_results_url)
        return search_results_url

    except Exception as e:
        print(f"An error occurred while searching for adoptable dogs: {str(e)}")
        print("driver.current_url", driver.current_url)
        if 'driver' in locals():
            driver.save_screenshot(f"search_dogs_error_{int(time.time())}.png")
        traceback.print_exc()
        return None

In [None]:
def get_info(driver, page_number):
    try:
        # Creating a list to store dog data 
        dog_data_list = []
        # Creating a list to track missing data for each dog 
        incomplete_data = []
        # Creating a set to keep track of pet IDs already seen on this page
        processed_pet_ids = set()

        # Waiting for the dog cards to be present on the page
        WebDriverWait(driver, 40).until(
            EC.presence_of_all_elements_located((By.XPATH, '//a[@class="petCard-link"]'))
        )
        dog_cards = driver.find_elements(By.XPATH, '//a[@class="petCard-link"]')
        print(f"Found {len(dog_cards)} dog cards on page {page_number}.")

        # Extracting the hrefs from the dog cards
        dog_links = [card.get_attribute('href') for card in dog_cards]

        # Looping through each dog card on the page
        for i, link in enumerate(dog_links):
            # Creating dictionary to store data for each dog 
            dog_info_dict = {}
            try:
                print(f"Processing dog {i + 1} on page {page_number}.")

                # Navigating to the dog detail page
                driver.get(link)

                # Waiting for the dog's detail page to load
                WebDriverWait(driver, 40).until(
                    EC.presence_of_element_located((By.XPATH, '//pf-ad[contains(@id, "PetDetail")]'))
                )

                # Finding the <pf-ad> element that contains the dog's details
                pf_ad = driver.find_element(By.XPATH, '//pf-ad[contains(@id, "PetDetail")]')
                targeting_data = pf_ad.get_attribute("targeting")
                dog_info = json.loads(targeting_data)

                # Getting the pet ID
                pet_id = dog_info.get('Pet_ID', 'N/A')
                if pet_id == 'N/A':
                    print(f"Pet ID not found for dog {i + 1} on page {page_number}. Skipping.")
                    continue

                # Checking if pet ID is already processed
                if pet_id in processed_pet_ids:
                    print(f"Pet ID {pet_id} already processed. Skipping dog {i + 1} on page {page_number}.")
                    continue


                # Defining the fields I want and what I want to call the column names
                fields = [
                    ('pet_id', 'Pet_ID'),
                    ('pet_name', 'Pet_Name'), 
                    ('primary_breed', 'Primary_Breed'), 
                    ('secondary_breed', 'Secondary_Breed'),
                    ('mixed_breed', 'Mixed_Breed'), 
                    ('age', 'Age'), 
                    ('gender', 'Gender'), 
                    ('size', 'Size'), 
                    ('primary_colour', 'Primary_color'), 
                    ('secondary_colour', 'Secondary_color'), 
                    ('coat_length', 'Coat_length'), 
                    ('shelter_name', 'Shelter_Name'), 
                    ('shelter_id', 'Shelter_ID'), 
                    ('zip_code', 'Zip_Code'), 
                    ('num_photos', 'Number_of_photos_in_profile'), 
                    ('children', 'Good_with_children'), 
                    ('cats', 'Good_with_cats'), 
                    ('other_dogs', 'Good_with_dogs'), 
                    ('other_animals', 'Good_with_other_animals'), 
                    ('fee_waived', 'Adoption_fee_waived')
                ]

                # Iterating through each field
                for key, dog_info_key in fields:
                    try:
                        dog_info_dict[key] = dog_info.get(dog_info_key, 'N/A')
                        if dog_info_dict[key] == 'N/A':
                            incomplete_data.append({'pet_id': pet_id, 'field': key, 'error': 'Data not found in JSON'})
                    except Exception as e:
                        dog_info_dict[key] = 'N/A'
                        incomplete_data.append({'pet_id': pet_id, 'field': key, 'error': str(e)})

                # Handling fields that are not in the <pf-ad> tag
                try:
                    pet_location = driver.find_element(By.XPATH, '//span[@data-test="Pet_Location"]').text
                    dog_info_dict['pet_location'] = pet_location
                except Exception:
                    dog_info_dict['pet_location'] = 'N/A'
                    incomplete_data.append({'pet_id': pet_id, 'field': 'pet_location', 'error': 'Not found on page'})

                try:
                    characteristics = driver.find_element(By.XPATH, '//dt[contains(text(), "Characteristics")]/following-sibling::dd').text
                    dog_info_dict['characteristics'] = characteristics
                except Exception:
                    dog_info_dict['characteristics'] = 'N/A'
                    incomplete_data.append({'pet_id': pet_id, 'field': 'characteristics', 'error': 'Not found on page'})

                try:
                    house_trained = driver.find_element(By.XPATH, '//dt[contains(text(), "House-trained")]/following-sibling::dd').text
                    dog_info_dict['house_trained'] = house_trained
                except Exception:
                    dog_info_dict['house_trained'] = 'N/A'
                    incomplete_data.append({'pet_id': pet_id, 'field': 'house_trained', 'error': 'Not found on page'})

                try:
                    health = driver.find_element(By.XPATH, '//dt[contains(text(), "Health")]/following-sibling::dd').text
                    dog_info_dict['health'] = health
                except Exception:
                    dog_info_dict['health'] = 'N/A'
                    incomplete_data.append({'pet_id': pet_id, 'field': 'health', 'error': 'Not found on page'})

                try:
                    # First, check if the adoption fee is in the about section
                    adoption_fee_element = WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located(
                            (By.XPATH, "//dt[contains(text(), 'Adoption fee')]/following-sibling::dd")
                        )
                    )
                    adoption_fee = adoption_fee_element.text.strip()
                except (NoSuchElementException, TimeoutException, InvalidSelectorException) as e:
                    # If not in the about section, log the error and check the pet story section
                    incomplete_data.append({'pet_id': pet_id, 'field': 'adoption_fee_about_section', 'error': str(e)})
                    try:
                        # Wait for the pet story section to be present
                        pet_story_element = WebDriverWait(driver, 10).until(
                            EC.presence_of_element_located((By.XPATH, '//div[@data-test="Pet_Story_Section"]'))
                        )
                        pet_story_text = pet_story_element.text
                        # Use regex to search for the adoption fee
                        adoption_fee_search = re.search(r'Adoption fee[:\s$]*([\d.,]+)', pet_story_text, re.IGNORECASE)
                        if adoption_fee_search:
                            adoption_fee = adoption_fee_search.group(1)
                        else:
                            adoption_fee = 'N/A'
                            incomplete_data.append({'pet_id': pet_id, 'field': 'adoption_fee_pet_story', 'error': 'Not found in pet story text'})
                    except (NoSuchElementException, TimeoutException) as e:
                        # Log failure in pet story section
                        adoption_fee = 'N/A'
                        incomplete_data.append({'pet_id': pet_id, 'field': 'adoption_fee_pet_story_section', 'error': str(e)})
                # Storing the adoption fee (or 'N/A' if it was not found in either section)
                dog_info_dict['adoption_fee'] = adoption_fee

                # Appending a dictionary of all the values for the given dog to the dog_data_list 
                dog_data_list.append(dog_info_dict)
                print(f"Collected data for pet_name {dog_info_dict.get('pet_name', 'N/A')} with pet_id {dog_info_dict.get('pet_id', 'N/A')}")

                # Adding the pet ID to the set of processed IDs
                processed_pet_ids.add(pet_id)

                # Navigating back to the results page
                driver.back()

                # Waiting for dog cards to reload
                WebDriverWait(driver, 40).until(
                    EC.presence_of_all_elements_located((By.XPATH, '//a[@class="petCard-link"]'))
                )

            # Error message if cannot process dog on the given page but continuing with the next dog 
            except Exception as e:
                print(f"Error processing dog {i + 1} on page {page_number}: {e}")
                traceback.print_exc()
                continue

        # # Printing incomplete data information
        # if incomplete_data:
        #     print("Summary of incomplete data fields:")
        #     for entry in incomplete_data:
        #         print(entry)

        # Returning dog data and incomplete data
        return {'dog_data': dog_data_list, 'incomplete_data': incomplete_data}


    finally:
        pass
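
The pet story fallback above uses a regex to find a number that follows the words "Adoption fee". The sketch below exercises that same pattern on invented sample strings. `extract_fee` is a hypothetical helper, and the trailing punctuation cleanup is an extra step that the scraper does not perform.

```python
import re

FEE_PATTERN = re.compile(r"Adoption fee[:\s$]*([\d.,]+)", re.IGNORECASE)

def extract_fee(story_text: str, missing: str = "N/A") -> str:
    """Return the first fee-like number after 'Adoption fee', else `missing`."""
    match = FEE_PATTERN.search(story_text)
    if not match:
        return missing
    # The character class can also capture trailing punctuation such as "300,",
    # so strip it (an extra cleanup step not present in the scraper).
    return match.group(1).rstrip(".,")

print(extract_fee("Adoption fee: $250.00 covers vaccines"))
print(extract_fee("ADOPTION FEE $300, paid at pickup"))
print(extract_fee("Her adoption fee is $150"))
```

The third string does not match because the pattern only allows colons, whitespace, and dollar signs between the phrase and the number, so wording such as "fee is $150" is missed. Listings phrased that way would need a looser pattern or manual review during cleaning.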


In [None]:
def custom_scraping(options, search_results_url, page_number):
    try:
        service = Service(executable_path=ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=options)

        # Modifying the URL to include the page number
        if 'page=' in search_results_url:
            page_url = re.sub(r'page=\d+', f'page={page_number}', search_results_url)
        else:
            delimiter = '&' if '?' in search_results_url else '?'
            page_url = f"{search_results_url}{delimiter}page={page_number}"

        driver.get(page_url)

        # Waiting for the page to load
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.XPATH, '//a[@class="petCard-link"]'))
        )

        # Extracting data from the page
        dog_data = get_info(driver, page_number)
        return dog_data

    except Exception as e:
        print(f"An error occurred in thread {threading.current_thread().name}: {e}")
        if 'driver' in locals():
            driver.save_screenshot(f"custom_scraping_error_{int(time.time())}.png")
        traceback.print_exc()
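        # Returning the same structure as get_info so main() can index it safely
        return {'dog_data': [], 'incomplete_data': []}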
    finally:
        if 'driver' in locals():
            driver.quit()
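
The URL rewriting step above can be tested on its own without a browser. The sketch below copies that logic into a hypothetical helper, `with_page`, and runs it on made-up example URLs.

```python
import re

def with_page(search_results_url: str, page_number: int) -> str:
    """Set or append the page query parameter on a search results URL."""
    if "page=" in search_results_url:
        # Replace the existing page number
        return re.sub(r"page=\d+", f"page={page_number}", search_results_url)
    # Otherwise append it, choosing the delimiter based on existing parameters
    delimiter = "&" if "?" in search_results_url else "?"
    return f"{search_results_url}{delimiter}page={page_number}"

print(with_page("https://example.com/search/dogs/", 3))
print(with_page("https://example.com/search/dogs/?distance=100", 3))
print(with_page("https://example.com/search/dogs/?page=1&sort=nearest", 3))
```

One limitation of the substring check is that it would also match a parameter whose name merely ends in `page=`, such as a hypothetical `per_page=`. Parsing the query string with `urllib.parse` would be sturdier if that ever became a concern.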

In [None]:
def main():
    location = "Georgia"
    URL = "https://www.petfinder.com"
    options = Options()
    options.add_argument("headless")


    # Initializing driver to perform the search and get the search results URL
    service = Service(executable_path=ChromeDriverManager().install())
    temp_driver = webdriver.Chrome(service=service, options=options)
    search_results_url = search_dogs(URL, location, temp_driver)

    if not search_results_url:
        print("Failed to get search results URL.")
        return

    # Getting total pages
    total_pages = get_total_pages(temp_driver)
    temp_driver.quit()

    print(f"Total pages to scrape: {total_pages}")

    # Generating a list of page numbers
    page_numbers = list(range(1, total_pages + 1))

    # Limiting the number of threads
    max_workers = min(5, total_pages)

    # Initializing lists to collect all data
    all_dog_data = []
    all_incomplete_data = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(custom_scraping, options, search_results_url, page_number)
            for page_number in page_numbers
        ]
        for future in concurrent.futures.as_completed(futures):
            try:
                data = future.result()
                all_dog_data.extend(data['dog_data'])
                all_incomplete_data.extend(data['incomplete_data'])
            except Exception as e:
                print(f"An error occurred: {e}")
                traceback.print_exc()

    print(f"Total dogs collected: {len(all_dog_data)}")
    print(f"Total incomplete data entries: {len(all_incomplete_data)}")   

    # Saving dog data to a csv 
    if all_dog_data:
        df_dog_data = pd.DataFrame(all_dog_data)
        df_dog_data.to_csv('georgia_dogs.csv', index=False)
        print("Data saved to georgia_dogs.csv")
    else:
        print("No data collected.")

    # Saving incomplete dog data to a csv 
    if all_incomplete_data:
        df_incomplete = pd.DataFrame(all_incomplete_data)
        df_incomplete.to_csv('incomplete_data.csv', index=False)
        print("Incomplete data saved to incomplete_data.csv")
    else:
        print("No incomplete data collected.")

if __name__ == "__main__":
    main()
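
The fan-out and collection pattern in `main()` can be tried without Selenium by swapping in a stub worker. In the sketch below, `scrape_page_stub` is a hypothetical stand-in for `custom_scraping` that returns fake rows and fails on one page, and `collect` is a hypothetical helper that also records which pages failed (something `main()` does not currently do).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_page_stub(page_number):
    """Hypothetical stand-in for custom_scraping; returns fake rows."""
    if page_number == 3:
        raise RuntimeError("simulated failure")
    return {
        "dog_data": [{"pet_id": f"{page_number}-{i}"} for i in range(2)],
        "incomplete_data": [],
    }

def collect(page_numbers, max_workers=5):
    """Run the stub across pages and gather rows plus failed page numbers."""
    all_dog_data, failed_pages = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its page so failures can be attributed
        future_to_page = {executor.submit(scrape_page_stub, p): p for p in page_numbers}
        for future in as_completed(future_to_page):
            page = future_to_page[future]
            try:
                all_dog_data.extend(future.result()["dog_data"])
            except Exception:
                failed_pages.append(page)
    return all_dog_data, failed_pages

dogs, failed = collect(range(1, 5))
print(len(dogs), failed)
```

Keeping a future-to-page mapping makes it possible to retry only the pages that failed instead of rerunning the whole scrape.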
    