# Data Pull

## Importing Libraries

In [1]:
import pandas as pd
import requests
from requests import get
from bs4 import BeautifulSoup
import time
import random
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## Perform Web Scrape from Craigslist in Los Angeles

**A Note on the Default Search**

The default search page standardizes certain characteristics for the purposes of this project. I have narrowed the focus to 1-bedroom, 1-bathroom apartments within a specific map area. The map view is centered on Santa Monica and encompasses the West Los Angeles region north of Manhattan Beach and south of Pacific Palisades. This area contains a high density of apartments, giving us a steady supply of data to pull.

This search returns many results while holding two major cost factors, bedroom count and bathroom count, constant in the forthcoming regression equation. This lets us better isolate the effects of the engineered features we will create later.

Our goal here is to tailor the data science insights to me first, with an eye towards scaling to a potential use case for anybody. Thus, while we will engineer the data infrastructure so that the data quantity and features can scale, we want the insights to be useful to at least one person (myself) before we expand further. This lets us behave pragmatically within the time constraints of a 7-week project.

First, let's declare global variables that will be used in our code.

In [2]:
## Global Variables ##

craigslist_base_url = 'https://losangeles.craigslist.org/search/santa-monica-ca/apa?lat=34.0315&lon=-118.461&max_bathrooms=1&max_bedrooms=1&min_bathrooms=1&min_bedrooms=1&postal=90095&search_distance=3.6#search=1'
craigslist_search_first_page_url = 'https://losangeles.craigslist.org/search/santa-monica-ca/apa?lat=34.0315&lon=-118.461&max_bathrooms=1&max_bedrooms=1&min_bathrooms=1&min_bedrooms=1&postal=90095&search_distance=3.6#search=1~list~0~0'
chrome_driver_path = '../Other_Material/chromedriver-mac-arm64/chromedriver'

Next, let's implement our Selenium code to get the total number of listings. We write a function called "get_postings_count" that takes two arguments, "website" and "path", and returns the post count. We need a browser that executes JavaScript here, because the post count is loaded dynamically into the webpage.

In [3]:
# Function to return the number of posts at a given time
def get_postings_count(website, path):
    
    # prevent a window from opening in Selenium
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    
    # set up the Chrome driver path for Selenium usage
    service = Service(path)
    driver = webdriver.Chrome(service=service, options=options)

    # Call a "get" instance of the initial Craigslist page to initialize Selenium
    driver.get(website)

    # Use a waiting period to make sure all the elements load for Selenium to inspect
    wait = WebDriverWait(driver, 10)  # Wait for up to 10 seconds

    try:
        # Wait for the specific element to be present before executing the script
        postings_count_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.cl-count-save-bar > div')))

        # Use JavaScript to set up a script to return the postings count
        postings_count_script = """
            var postingsDiv = document.querySelector('.cl-count-save-bar > div');
            return postingsDiv ? postingsDiv.textContent : 'Postings count not found';
        """

        # Execute the script to get the post count and return it
        postings_count = driver.execute_script(postings_count_script)
        return postings_count
    finally:
        # Exits Selenium
        driver.quit()

Let's call the function now to get the postings count.

In [8]:
# Call the get_postings_count function
postings_count = get_postings_count(craigslist_search_first_page_url, chrome_driver_path)

Finally, let's print the number of posts to see what we are working with.

In [9]:
# Check to see how many postings there are
print(postings_count)

2,353 postings


We can now use the post count to get the number of pages to loop through. There are 120 posts per page, so we extract the post count, divide it by 120, and round up so that a partially filled last page is still counted.

In [10]:
# Function to calculate the number of pages for us to loop through
def calculate_pages_from_postings(postings_count_str):
    
    # Remove commas and extract the numerical part of the string
    num_postings = int(postings_count_str.replace(" postings", "").replace(",", ""))
    
    # 120 posts per page
    postings_per_page = 120
    
    # Calculate the number of pages needed to display all postings, accounting for remainder
    num_pages = -(-num_postings // postings_per_page)  
    
    return num_pages

In [11]:
# Call the calculate_pages_from_postings function
number_of_pages = calculate_pages_from_postings(postings_count)
print(f"Number of pages: {number_of_pages}")

Number of pages: 20
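
As a quick check on the rounding-up step, the sketch below isolates the ceiling-division expression used above and runs it on a few counts. The function name `pages_needed` is hypothetical and only for illustration; the 120-per-page figure is taken from the text above.

```python
def pages_needed(num_postings, postings_per_page=120):
    """Ceiling division using integer arithmetic only.

    Floor-dividing the negated count rounds toward negative infinity,
    so negating the result again rounds the original quotient up.
    """
    return -(-num_postings // postings_per_page)


# 2,353 postings is about 19.6 pages of 120, so a 20th page is needed
print(pages_needed(2353))

# An exact multiple of 120 does not add an extra page
print(pages_needed(2400))

# One posting past a multiple spills onto a new page
print(pages_needed(2401))
```

Staying in integer arithmetic avoids a float conversion, which is why this form is sometimes preferred over `math.ceil(num_postings / postings_per_page)`; either should give the same page count for values of this size.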


Next, we write a function to extract the links from each of the pages. We do this by:

1. initializing a list
2. looping through the number of pages
3. finding all the 'li' tags within each page that use the class 'cl-static-search-result'
4. finding the 'a' tags within the 'li' tags, which contain the link to the individual listing
5. appending these links to the list

In [12]:
# Create the extract_listings_links function
def extract_listings_links(number_of_pages):
    
    # Initialize a list to store the links
    listings_links = []

    # Iterate through the number of pages
    for i in range(number_of_pages):  

        # Use the page number to get the webpages containing the listings
        page_number = i
        page_url = f'{craigslist_base_url}~list~{page_number}~0'

        # Call a get instance with the URL
        response = requests.get(page_url)

        # Sleep in order to not overwhelm servers
        time.sleep(5 + 10 * random.random())

        # Find all the listings links on the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Look for all 'li' tags with the class 'cl-static-search-result'
        listings = soup.find_all('li', class_='cl-static-search-result')

        # Loop through all the listings and append links to the list
        for listing in listings:
            a_tag = listing.find('a', href=True)
            if a_tag:
                listings_links.append(a_tag['href'])

    return listings_links

In [13]:
# Call the extract_listings_links function and store the returned list in the 'all_links' variable
all_links = extract_listings_links(number_of_pages)

Let's check the length of the "all_links" list to see whether the extraction was successful.

In [14]:
print(len(all_links))

7200


The all_links list is populated, though 7200 links exceeds the 2,353 postings counted earlier, so duplicates are likely. Next, let's sample three of the links to confirm we pulled what we wanted.

In [15]:
print('Link #1: ', all_links[0])
print('Link #2: ', all_links[1])
print('Link #3: ', all_links[2])

Link #1:  https://losangeles.craigslist.org/wst/apa/d/venice-bedroom-in-marina-del-rey-quartz/7726576879.html
Link #2:  https://losangeles.craigslist.org/wst/apa/d/los-angeles-bedroom-ba-in-west-la/7726576204.html
Link #3:  https://losangeles.craigslist.org/wst/apa/d/los-angeles-westwood-bedroom-bath/7726575683.html
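
Because 7200 links were collected against 2,353 postings, a deduplication pass is a reasonable precaution before looping over the links. One plausible cause (an assumption, not verified here) is that everything after `#` in the search URL is a fragment, which HTTP clients such as requests do not send to the server, so each page request may return the same static results. The sketch below removes repeated links while keeping first-seen order; the helper name `deduplicate_links` and the example URLs are hypothetical.

```python
def deduplicate_links(links):
    """Return the links with duplicates removed, keeping first-seen order.

    dict.fromkeys keeps only the first occurrence of each key, and
    dictionaries preserve insertion order in Python 3.7+.
    """
    return list(dict.fromkeys(links))


# Hypothetical example: the same two links collected from two page requests
collected = [
    'https://example.org/listing/1.html',
    'https://example.org/listing/2.html',
    'https://example.org/listing/1.html',
    'https://example.org/listing/2.html',
]

unique_links = deduplicate_links(collected)
print(f'{len(collected)} collected, {len(unique_links)} unique')
```

Applied to the real list, this would look like `all_links = deduplicate_links(all_links)`, after which `len(all_links)` can be compared against the postings count.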


**Let us dive into working with the data within each listing link.**

We will start out by using just one of the links to do a single data pull. Once we do this, we can plan to write a function that will loop through all of the links in the all_links list.

First, we will use the requests and BeautifulSoup packages again to access the webpage.

In [16]:
# Call a get instance from the "requests" package using the URL
link_response = requests.get('https://losangeles.craigslist.org/wst/apa/d/los-angeles-live-at-palms-caribbean/7724486915.html')

# Sleep in order to not overwhelm servers
time.sleep(5 + 10 * random.random())

# Parse the HTML of the individual listing page
link_soup = BeautifulSoup(link_response.text, 'html.parser')

Below we isolate some of the pieces of information we want to use in the upcoming structured data frame we will put together.

In [17]:
title = link_soup.find("span", id="titletextonly").text.strip()
price = link_soup.find("span", class_="price").text.strip()
bedroom_info = link_soup.find("span", class_="housing").text.split("/")[1].split("-")[0].strip()
square_feet = link_soup.find("span", class_="housing").text.split("-")[1].split("ft")[0].strip()
full_address = link_soup.find("h2", class_="street-address").text.strip()

attribute_search = link_soup.find_all('div', class_='attr')
attributes = []
for listing in attribute_search:
    value_span = listing.find('span', class_='valu')
    # Skip attribute blocks that have no value span
    if value_span:
        attributes.append(value_span.text.strip())

Let us pause here to see what we have pulled from the listing we are testing.

In [18]:
print(title)
print(price)
print(bedroom_info)
print(square_feet)
print(full_address)
print(attributes)

Live at Palms Caribbean Apts 1 Bedroom 1 BA with Fridge, 2 Weeks Free!
$1,900
1br
550
3258 Overland Avenue, Palms, CA 90034
['monthly', 'air conditioning', 'cats are OK - purrr', 'apartment', 'laundry on site', 'off-street parking']
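
The chained `split` calls used for the bedroom and square-footage fields depend on the housing text containing both segments in a fixed order; a listing that omits one can raise an `IndexError` or yield an empty string. As a hedged alternative, the sketch below uses regular expressions and returns `None` for a missing field. The helper name `parse_housing` and the sample strings are assumptions for illustration; the exact text Craigslist places in the `housing` span may differ.

```python
import re


def parse_housing(text):
    """Extract (bedrooms, square_feet) as integers from a housing string.

    Either value is None when the corresponding segment is absent.
    """
    bedrooms_match = re.search(r'(\d+)\s*br', text, re.IGNORECASE)
    sqft_match = re.search(r'(\d+)\s*ft', text, re.IGNORECASE)

    bedrooms = int(bedrooms_match.group(1)) if bedrooms_match else None
    square_feet = int(sqft_match.group(1)) if sqft_match else None
    return bedrooms, square_feet


# Assumed shape of the housing text, inferred from the split logic above
print(parse_housing('/ 1br - 550ft2 - '))

# A listing without square footage no longer breaks the parse
print(parse_housing('/ 1br - '))
```

Returning integers rather than strings such as '1br' is a design choice of this sketch, made so the values can feed a numeric column directly.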


Now let's create a sample dataframe using the information we have so far.

In [19]:
# Create a pandas DataFrame
sample_df = pd.DataFrame({
    "Title": [title],
    "Price": [price],
    "Bedrooms": [bedroom_info],
    "Square Feet": [square_feet],
    "Full Address": [full_address],
    "Attributes": [attributes]
})

In [21]:
sample_df

| | Title | Price | Bedrooms | Square Feet | Full Address | Attributes |
|---|---|---|---|---|---|---|
| 0 | Live at Palms Caribbean Apts 1 Bedroom 1 BA wi... | $1,900 | 1br | 550 | 3258 Overland Avenue, Palms, CA 90034 | [monthly, air conditioning, cats are OK - purr... |


Next, we work on extracting the content body of the listing. The content body will contain information in a less structured way. 

In [22]:
# Find the section body content located within the "postingbody" id
section_body = link_soup.find('section', id='postingbody')

As a starting point, it will be easiest to treat the body content as one large string, since each listing's content format and information will vary considerably. Thus, below we use BeautifulSoup's ".text" attribute to do this.

In [23]:
# Extract all text within the section as one string
description_text = section_body.text.strip()

print(description_text)

QR Code Link to This Post



2 Weeks Free Off the 2nd Month With A 12 Month Lease


Virtual Tour Unit 5819: https://my.matterport.com/show/?m=Q9v2nFtipfU

Welcome to Palms Caribbean Apartments!

Palms Caribbean Apartments is centrally located in the Palms/West Los Angeles/Culver City Adj area near shops and restaurants on Venice Blvd., and just a few minutes from the 405/10 freeways. We have apartments ranging from 1-3 bedrooms. Amenities include a pool, parking, and laundry on site. Our apartment amenities include stainless steel kitchen appliances in select units, vinyl plank flooring, and more. Come check us out today!


 Your New Apartment Home Features: 

- Vinyl Plank Flooring 
- Granite Countertops 
- Air Conditioning 
- Disposal 
- Stainless Steel Appliances (In Select Units) 
- Ceiling Fan(s) 
- Carpeting 
- Refrigerator 
- Gas Stove 


Property Features: 

- Uncovered Parking 
- Pool 
- Laundry On Site 


Pet Policy: 

Cats Welcome. Additional Fees and Deposit May Apply. 
Ca