# Data Pull

## Importing Libraries

In [1]:
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
import time
import random
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## Perform Web Scrape from Craigslist in Los Angeles

**A Note on the Default Search**

The default starting page has several characteristics fixed for the purposes of this project. I have narrowed the focus to 1-bedroom, 1-bathroom apartments only. The search also uses a specific map area: a view centered around Santa Monica that encompasses the West Los Angeles region north of Manhattan Beach and south of Pacific Palisades. This area contains a high density of apartments, giving us a steady supply of data to pull.

This search returns many results while holding two big cost factors, bedroom count and bathroom count, constant in the forthcoming regression equation. That makes it easier to see the effects of the engineered features, which we will build later.

Our goal here is to tailor the data science insights to me first, with an eye towards scaling to a potential use case by anybody. Thus, while we will engineer the data infrastructure so that the data quantity and features can scale, we want the insights to be useful to at least one person (myself) before we expand further. This lets us behave pragmatically within the time constraints of a 7-week project.

First, let's declare global variables that will be used in our code.

In [2]:
## Global Variables ##

craigslist_search_first_page_url = 'https://losangeles.craigslist.org/search/santa-monica-ca/apa?lat=34.0315&lon=-118.461&max_bathrooms=1&max_bedrooms=1&min_bathrooms=1&min_bedrooms=1&postal=90095&search_distance=3.6#search=1~list~0~0'
chrome_driver_path = '../Other_Material/chromedriver-mac-arm64/chromedriver'
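
To make the filters baked into the search URL easier to inspect, here is a minimal sketch that splits the URL into its query parameters using the standard library. The names `search_url`, `describe_search_filters`, and `search_filters` are hypothetical helpers introduced only for this illustration, and what each parameter means is inferred from its name rather than from any Craigslist documentation.

```python
from urllib.parse import urlsplit, parse_qsl

# Same search URL as the global variable above, repeated so this snippet runs on its own.
search_url = (
    'https://losangeles.craigslist.org/search/santa-monica-ca/apa'
    '?lat=34.0315&lon=-118.461&max_bathrooms=1&max_bedrooms=1'
    '&min_bathrooms=1&min_bedrooms=1&postal=90095&search_distance=3.6'
    '#search=1~list~0~0'
)

def describe_search_filters(url):
    """Return the query-string parameters of a search URL as a dict (hypothetical helper)."""
    parts = urlsplit(url)  # separates the path, the query, and the '#...' fragment
    return dict(parse_qsl(parts.query))

search_filters = describe_search_filters(search_url)
for name, value in search_filters.items():
    print(f'{name}: {value}')
```

The minimum and maximum bedroom and bathroom values are all 1, which matches the 1-bedroom, 1-bathroom focus described earlier.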

Next, let's implement our Selenium code to get the total number of listings. We write a function called "get_postings_count" that takes the two arguments "website" and "path" and returns the post count. We use Selenium with a short JavaScript snippet here, because the post count is loaded dynamically into the webpage.

In [6]:
def get_postings_count(website, path):
    
    # prevent a window from opening in Selenium
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    
    # set up the Chrome driver path for Selenium usage
    service = Service(path)
    driver = webdriver.Chrome(service=service, options=options)

    # Call a "get" instance of the initial Craigslist page to initialize Selenium
    driver.get(website)

    # Use a waiting period to make sure all the elements load for Selenium to inspect
    wait = WebDriverWait(driver, 10)  # Wait for up to 10 seconds

    try:
        # Wait for the specific element to be present before executing the script
        # (the returned element is not needed, so it is not stored)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.cl-count-save-bar > div')))

        # Use JavaScript to set up a script to return the postings count
        postings_count_script = """
            var postingsDiv = document.querySelector('.cl-count-save-bar > div');
            return postingsDiv ? postingsDiv.textContent : 'Postings count not found';
        """

        # Execute the script to get the post count and return it
        postings_count = driver.execute_script(postings_count_script)
        return postings_count
    finally:
        # Always close the browser, even if the wait above times out
        driver.quit()

Let's call the function now to get the postings count.

In [7]:
postings_count = get_postings_count(craigslist_search_first_page_url, chrome_driver_path)

Finally, let's print the number of posts to see what we are working with.

In [8]:
print(postings_count)

2,399 postings
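
The printed value is a string such as "2,399 postings", not a number. If the count is needed for arithmetic later, one possible way to convert it is sketched below. The function `parse_postings_count` and the variable `total_postings` are hypothetical names that are not part of the notebook so far, and the sketch assumes the text always contains the count as its first number.

```python
import re

def parse_postings_count(text):
    """Extract the first integer from text such as '2,399 postings' (hypothetical helper)."""
    match = re.search(r'\d[\d,]*', text)  # a digit followed by more digits or thousands separators
    if match is None:
        # Covers the fallback string 'Postings count not found' returned by the JavaScript snippet
        raise ValueError(f'No number found in {text!r}')
    return int(match.group().replace(',', ''))

total_postings = parse_postings_count('2,399 postings')
print(total_postings)
```

Raising an error when no number is found keeps the "Postings count not found" fallback from silently flowing into later calculations.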
