# Data Pull

## Importing Libraries

In [1]:
import pandas as pd
import requests
from requests import get
from bs4 import BeautifulSoup
import time
import random
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re

## Perform Web Scrape from Craigslist in Los Angeles

**A Note on the Default Search**

The default starting page will have certain characteristics already standardized for the purposes of this project. I have narrowed the focus to only 1 bedroom, 1 bathroom apartments. There is also a specific map area being used. The map view is an area centered around Santa Monica, and encompasses the West Los Angeles region north of Manhattan Beach and south of Pacific Palisades. This area contains a high density of apartments, giving us a steady supply of data to pull.

This strategy pulls many results and isolates two big cost factors in the forthcoming regression equation. This allows us to better view the effects of engineered features, which we will perform later. 

Our goal here is to tailor the data science insights to me (the author) first, with an eye toward scaling to a broader use case later. While we will build the data infrastructure so that the data quantity and features can grow, for now we narrow the insights so they are useful to at least one person (myself) before we expand further. This keeps the work pragmatic within the time constraints of a 7-week project.

#### First, we declare global variables that will be used in our code.

In [2]:
## Global Variables ##

# craigslist_base_url = 'https://losangeles.craigslist.org/search/santa-monica-ca/apa?lat=34.0315&lon=-118.461&max_bathrooms=1&max_bedrooms=1&min_bathrooms=1&min_bedrooms=1&postal=90095&search_distance=3.6#search=1'
# craigslist_search_first_page_url = 'https://losangeles.craigslist.org/search/santa-monica-ca/apa?lat=34.0315&lon=-118.461&max_bathrooms=1&max_bedrooms=1&min_bathrooms=1&min_bedrooms=1&postal=90095&search_distance=3.6#search=1~list~0~0'
craigslist_base_url = 'https://losangeles.craigslist.org/search/santa-monica-ca/apa?lat=34.019&lon=-118.4724&max_bathrooms=1&max_bedrooms=1&min_bathrooms=1&min_bedrooms=1&postal=90095&search_distance=0.7#search=1'
craigslist_search_first_page_url = 'https://losangeles.craigslist.org/search/santa-monica-ca/apa?lat=34.019&lon=-118.4724&max_bathrooms=1&max_bedrooms=1&min_bathrooms=1&min_bedrooms=1&postal=90095&search_distance=0.7#search=1~list~0~0'
chrome_driver_path = '../Other_Material/chromedriver-mac-arm64/chromedriver'

#### We will be using BeautifulSoup to access URLs in this notebook, so we write a function to perform this operation now.

In [3]:
# Create the access_beautiful_soup function
def access_beautiful_soup(url):
    # Call a get instance with the URL
    response = requests.get(url)

    # Sleep in order to not overwhelm servers
    time.sleep(5 + 10 * random.random())

    # Parse the returned HTML into a BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')

    return soup
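
The pause inside `access_beautiful_soup` draws a delay spread evenly between 5 and 15 seconds. Below is a minimal sketch of that calculation in isolation. The helper name `jittered_delay` and its parameters are hypothetical and are not part of the notebook.

```python
import random

def jittered_delay(base_seconds=5.0, jitter_seconds=10.0):
    """Return a delay in [base_seconds, base_seconds + jitter_seconds)."""
    # random.random() returns a float in [0.0, 1.0)
    return base_seconds + jitter_seconds * random.random()

# Draw a batch of sample delays to see the spread (no actual sleeping here)
sample_delays = [jittered_delay() for _ in range(1000)]
average_delay = sum(sample_delays) / len(sample_delays)

print(f"min={min(sample_delays):.2f}s  max={max(sample_delays):.2f}s  avg={average_delay:.2f}s")
```

With an average pause of roughly 10 seconds per request, the sleeping alone for a few thousand links adds up to several hours, which is worth keeping in mind before running the full scrape.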

#### Before we access the URLs of the listings, we need to find out how many results are in the search query. 

For the search in the area of Los Angeles we are doing, the number is typically around 2400 at any given time (posts expire after 30 days). However, we want to construct the application with an eye towards scaling, so we need to make the infrastructure flexible for different result amounts.

Since this information is not accessible via BeautifulSoup based on the way the Craigslist HTML structure is set up, we need to use Selenium. 

We now proceed with implementing the Selenium code to get the total listings amount. We write a function called "get_postings_count" that takes in the two arguments "website" and "path" and returns the post count. We need to use JavaScript here, because the post count is loaded dynamically into the webpage. 

In [4]:
# Function to return the number of posts at a given time
def get_postings_count(website, path):
    
    # prevent a window from opening in Selenium
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    
    # set up the Chrome driver path for Selenium usage
    service = Service(path)
    driver = webdriver.Chrome(service=service, options=options)

    # Call a "get" instance of the initial Craigslist page to initialize Selenium
    driver.get(website)

    # Use a waiting period of up to 10 seconds to make sure all the elements load for Selenium to inspect
    wait = WebDriverWait(driver, 10)  

    try:
        # Wait for the specific element to be present before executing the script
        postings_count_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.cl-count-save-bar > div')))

        # Use JavaScript to set up a script to return the postings count
        postings_count_script = """
            var postingsDiv = document.querySelector('.cl-count-save-bar > div');
            return postingsDiv ? postingsDiv.textContent : 'Postings count not found';
        """

        # Execute the script to get the post count and return it
        postings_count = driver.execute_script(postings_count_script)
        return postings_count
    finally:
        # Exits Selenium
        driver.quit()

#### Let's call the function now to get the postings count.

In [5]:
# Call the get_postings_count function
postings_count = get_postings_count(craigslist_search_first_page_url, chrome_driver_path)

KeyboardInterrupt: 

#### Finally, we print the amount of posts to see how many we are working with.

In [6]:
# Check to see how many postings there are
print(postings_count)

48 postings


#### We can now use the post count to get the number of pages to loop through when we extract the listings. 

There are 120 posts per page, so we want to extract the post count, divide it by 120, and round up so that a partially filled last page is still counted.

In [7]:
# Function to calculate the number of pages for us to loop through
def calculate_pages_from_postings(postings_count_str):
    
    # Remove commas and extract the numerical part of the string
    num_postings = int(postings_count_str.replace(" postings", "").replace(",", ""))
    
    # 120 posts per page
    postings_per_page = 120
    
    # Calculate the number of pages needed to display all postings, accounting for remainder
    num_pages = -(-num_postings // postings_per_page)  
    
    return num_pages

In [8]:
# Call the calculate_pages_from_postings function
number_of_pages = calculate_pages_from_postings(postings_count)
print(f"Number of pages: {number_of_pages}")

Number of pages: 1
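
The expression `-(-num_postings // postings_per_page)` is a ceiling division written with integer arithmetic only. As a small sketch, the hypothetical helper `pages_needed` below repeats that expression and cross-checks it against `math.ceil` for a few post counts. The counts other than 48 are illustrative values, not results from the scrape.

```python
import math

POSTINGS_PER_PAGE = 120  # page size stated in the notebook

def pages_needed(num_postings, per_page=POSTINGS_PER_PAGE):
    # Floor-dividing the negated count and negating again rounds up
    return -(-num_postings // per_page)

page_counts = {}
for n in (48, 120, 121, 2400):
    # math.ceil gives the same answer and is used here only as a cross-check
    assert pages_needed(n) == math.ceil(n / POSTINGS_PER_PAGE)
    page_counts[n] = pages_needed(n)
    print(n, "postings ->", page_counts[n], "page(s)")
```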


#### Next, we write a function to extract the links from each of the pages. We do this by: 

1. initializing a list
2. looping through the number of pages
3. finding all the 'li' tags within the page that use the classes 'cl-search-result' and 'cl-search-view-mode-list'
4. finding the 'a' tags within the 'li' tags, which contain the link to the individual listing
5. appending these links to the list

In [9]:
def extract_listing_links(path, base_url, number_of_pages):
    all_listing_links = []
    
    for page_number in range(number_of_pages):
        page_url = f'{base_url}~list~{page_number}~0'
        
        # prevent a window from opening in Selenium
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        
        # set up the Chrome driver path for Selenium usage
        service = Service(path)
        driver = webdriver.Chrome(service=service, options=options)
        
        driver.get(page_url)
        # Wait for the listings to be present
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.cl-search-result.cl-search-view-mode-list"))
        )
        # Now that the page is loaded, find all the `a` tags within the listings
        listing_links = [a.get_attribute('href') for a in driver.find_elements(By.CSS_SELECTOR, "li.cl-search-result.cl-search-view-mode-list a")]

        all_listing_links.extend(listing_links)

        driver.quit()
        
    return all_listing_links
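
To make the paging scheme easier to inspect, here is a minimal sketch of just the URL construction used in the loop above. The helper name `build_page_urls` and the placeholder base URL are assumptions for illustration. Only the `~list~{page_number}~0` suffix comes from the notebook, and the meaning of the trailing `~0` is not documented here.

```python
def build_page_urls(base_url, number_of_pages):
    """Return one search URL per results page, numbered from 0."""
    return [
        f"{base_url}~list~{page_number}~0"
        for page_number in range(number_of_pages)
    ]

# Placeholder base URL ending in '#search=1', like craigslist_base_url
example_base_url = "https://example.org/search/apa#search=1"
page_urls = build_page_urls(example_base_url, 3)

for url in page_urls:
    print(url)
```

For page 0 this produces the same suffix as `craigslist_search_first_page_url`, which ends in `#search=1~list~0~0`.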

## NOTE TO SELF: DO NOT RE-RUN THE BELOW CELL

In [10]:
all_links = extract_listing_links(chrome_driver_path, craigslist_base_url, number_of_pages)

## NOTE TO SELF: DO NOT RE-RUN THE ABOVE CELL

#### Let's check out the "all_links" list to see the extraction was successful.

In [11]:
print(len(all_links))

48


#### We see that there is content within the all_links list. 

Let us also get a sample of three of the links to ensure we pulled what we wanted.

In [12]:
print('Link #1: ', all_links[0])
print('Link #2: ', all_links[1])
print('Link #3: ', all_links[2])

Link #1:  https://losangeles.craigslist.org/wst/apa/d/santa-monica-refrigerator-hardwood/7728203084.html
Link #2:  https://losangeles.craigslist.org/wst/apa/d/santa-monica-new-renovated-interior/7728136256.html
Link #3:  https://losangeles.craigslist.org/wst/apa/d/santa-monica-remodeled-cottage-style/7728056080.html


#### We now dive into working with the data within each listing link.

First, we set up a dataframe containing basic column information.

In [108]:
# Initialize the DataFrame
df_columns = ["Title", "Price", "Bedrooms", "Square Feet", "Full Address"]
listings_df = pd.DataFrame(columns=df_columns)

# Set the max columns to infinite so that we may view all of them
pd.set_option('display.max_columns', None)

## We declare an object called "links_and_soups" to pair each link with its BeautifulSoup content.

This is a lengthy process because we sleep for a random interval each time we fetch a page for BeautifulSoup.

We need to do this because the soup content will contain the information we want, so we need the soup content for each of the links. 

We start by declaring the object that will hold the links and soups as key value pairs.

## NOTE TO SELF DO NOT RUN BELOW CELL

In [109]:
# Load the CSV file
df = pd.read_csv("../Data/linksnsoups.csv", header=None, names=['link', 'soup_string'])

In [110]:
links_and_soups = {}

In [111]:
# Convert string back to BeautifulSoup objects and populate the dictionary
for index, row in df.iterrows():
    soup_object = BeautifulSoup(row['soup_string'], 'html.parser')  # Assuming 'html.parser' was used initially
    links_and_soups[row['link']] = soup_object

print(f"Loaded {len(links_and_soups)} entries.")

Loaded 2395 entries.


## NOTE TO SELF DO NOT RUN ABOVE CELL

#### We write a function for pairing the links with their soup content.

In [112]:
# # Function to pair links with soup content
# def pair_links_and_soups(list_of_links):
#     for link in list_of_links:
#         the_soup = access_beautiful_soup(link)
#         links_and_soups[link] = the_soup

## NOTE TO SELF DO NOT RUN BELOW CELL

We run the function using the links that were pulled previously.

**NOTE: This can take a long time. For example, 2300 links took 7.5 hours**

In [113]:
# pair_links_and_soups(all_links)

## NOTE TO SELF DO NOT RUN ABOVE CELL

#### Let's check to see if the links and soups object was successfully populated with data.

In [114]:
print(len(links_and_soups))

2395


#### Next, we begin the process of creating boolean values for different attributes.

Each listing contains different attributes that the poster uses to convey information about a property and market it. While there is a lot of overlap between listings, we need to see all of the options. To do this, we initialize a dictionary called "global_attribute_counts," then add unique values and count them. Ultimately, we want to create one column per attribute and fill it with "1" or "0", meaning "present" or "not present" in the listing.
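
As a toy illustration of the idea, the sketch below counts attributes across three made-up listings and then encodes each listing as a row of 1/0 flags. The sample data and the names `sample_listings`, `attribute_counts`, and `encoded_rows` are hypothetical. The notebook's own functions work on BeautifulSoup objects rather than plain lists.

```python
from collections import Counter

# Hypothetical attribute lists for three listings
sample_listings = [
    ["monthly", "apartment", "laundry on site"],
    ["monthly", "apartment", "carport"],
    ["monthly", "house"],
]

# Step 1: count how often each attribute appears across all listings
attribute_counts = Counter(
    attribute for listing in sample_listings for attribute in listing
)

# Step 2: one column per attribute, 1 if the listing has it and 0 otherwise
columns = sorted(attribute_counts)
encoded_rows = [
    {column: int(column in listing) for column in columns}
    for listing in sample_listings
]

print(dict(attribute_counts))
for row in encoded_rows:
    print(row)
```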

In [115]:
global_attribute_counts = {}

In [116]:
# Define the process_attributes function to collect a listing's attributes, count them globally, and flag fee-related ones
def process_attributes(the_soup):
    attribute_search = the_soup.find_all('div', class_='attr')
    attributes = []
    fee_needed = 0  # Initialize a flag for fees

    fee_pattern = re.compile(r'\b\d+\b')  # Regex to identify fee-related attributes

    for listing in attribute_search:
        value_span = listing.find('span', class_='valu')
        if value_span:
            attribute = value_span.text.strip()
            global_attribute_counts[attribute] = global_attribute_counts.get(attribute, 0) + 1 
            if fee_pattern.search(attribute):  # Check if attribute suggests a fee
                fee_needed = 1
            else:
                attributes.append(attribute)  # Only add non-fee attributes to the list

    return attributes, fee_needed

#### Below we run the process_attributes function

In [117]:
# Run process_attributes on every soup in links_and_soups
# This pass is used for its side effect of filling global_attribute_counts
for link, soup in links_and_soups.items():
    attributes, fee_needed = process_attributes(soup)

#### Let's take a look at all the attributes that were pulled.

In [118]:
# Storing the object results in a variable called "raw_attributes" 
raw_attributes = global_attribute_counts

In [119]:
raw_attributes

{'monthly': 2323,
 'air conditioning': 1339,
 'cats are OK - purrr': 1823,
 'apartment': 2315,
 'laundry on site': 1392,
 'off-street parking': 817,
 'dogs are OK - wooof': 1425,
 'carport': 501,
 'laundry in bldg': 371,
 'w/d in unit': 532,
 'detached garage': 394,
 'attached garage': 431,
 'wheelchair accessible': 354,
 'no smoking': 498,
 'EV charging': 684,
 'no parking': 140,
 'furnished': 83,
 '$52': 146,
 'house': 6,
 'street parking': 42,
 '$49.50': 146,
 '$49': 1,
 'None': 2,
 'CRE Inc': 2,
 '$30': 1,
 '50$': 1,
 '25.00': 1,
 'no laundry on site': 27,
 '$30 application fee per applicant': 2,
 '$50 Dollars per applicant.': 2,
 'NO BROKER FEE': 2,
 'PRIVATE OWNER': 1,
 'REMAX': 1,
 '$45.00 Application Screening Fee Per Adult': 11,
 '40': 1,
 '40 per person': 2,
 'valet parking': 2,
 'w/d hookups': 5,
 '$45': 3,
 'daily': 3,
 '$45 Application Fee': 3,
 '$50': 1,
 '39.99': 2,
 '$42.50': 2,
 '$26 per adult': 2,
 'www.conradpm.com $45': 1,
 '$35': 2,
 'condo': 2,
 'duplex': 3,
 '45.

#### We see a lot of overlap but also noise to clean up
We see a high proportion of overlap, with "monthly" being the most common item. Beyond the most common attributes, there is some noise to clean up, and on closer inspection most of it consists of application and credit check fees. To clean this up, we group all of these into a single key called "Fee Needed To Apply". Let's write a function that cleans up listings with fee information using a very simple check: if an attribute contains a standalone number, it is categorized as a fee.
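
To see what the pattern `\b\d+\b` actually catches, here is a short sketch that applies it to a handful of strings. Most of the strings are taken from the attribute output above. `'2 parking spaces'` and `'1br'` are hypothetical inputs added to show the edge cases.

```python
import re

fee_pattern = re.compile(r'\b\d+\b')

sample_attributes = [
    '$52',                    # from the scrape
    '50$',                    # from the scrape
    '40 per person',          # from the scrape
    'www.conradpm.com $45',   # from the scrape
    'NO BROKER FEE',          # from the scrape
    'w/d in unit',            # from the scrape
    '2 parking spaces',       # hypothetical
    '1br',                    # hypothetical
]

# True means the attribute would be counted as a fee
flagged = {text: bool(fee_pattern.search(text)) for text in sample_attributes}

for text, is_fee in flagged.items():
    print(f"{text!r:28} -> {is_fee}")
```

The check is deliberately crude. It flags any standalone number, so a hypothetical attribute such as `'2 parking spaces'` would be treated as a fee, while text that mentions a fee without a number, such as `'NO BROKER FEE'`, is not. A digit glued to letters, as in `'1br'`, does not match because there is no word boundary after the digit.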

In [120]:
# Function to group together fee related attributes
def clean_up_the_fees(attributes_dictionary):
    
    # Initialize a count for the "Fee Needed To Apply" key
    fees_needed_to_apply = 0

    # Set up a regex to identify keys containing a standalone number
    # We will use the re package to do this
    fee_pattern = re.compile(r'\b\d+\b')

    # Iterate through the input dictionary, summing up counts for fee-related attributes
    for key, value in attributes_dictionary.items():
        if fee_pattern.search(key):
            fees_needed_to_apply += value
    
    # Build a new dictionary without the fee-related keys and add a key called "Fee Needed To Apply"
    cleaned_attributes = {key: value for key, value in attributes_dictionary.items() if not fee_pattern.search(key)}
    cleaned_attributes["Fee Needed To Apply"] = fees_needed_to_apply

    return cleaned_attributes

In [121]:
# Run the clean_up_the_fees function using the raw_attributes as input
cleaned_attributes = clean_up_the_fees(raw_attributes)

In [122]:
# Sort the cleaned_attributes in descending order of instance count
cleaned_attributes = dict(sorted(cleaned_attributes.items(), key=lambda item: item[1], reverse=True))

#### Inspecting the cleaned attributes dictionary.

We see the fee related material is now grouped into one key called "Fee Needed To Apply," such that we answer the question of whether or not a fee is needed for an application.

In [123]:
cleaned_attributes

{'monthly': 2323,
 'apartment': 2315,
 'cats are OK - purrr': 1823,
 'dogs are OK - wooof': 1425,
 'laundry on site': 1392,
 'air conditioning': 1339,
 'off-street parking': 817,
 'EV charging': 684,
 'w/d in unit': 532,
 'carport': 501,
 'no smoking': 498,
 'attached garage': 431,
 'detached garage': 394,
 'laundry in bldg': 371,
 'Fee Needed To Apply': 361,
 'wheelchair accessible': 354,
 'no parking': 140,
 'furnished': 83,
 'street parking': 42,
 'no laundry on site': 27,
 'house': 6,
 'w/d hookups': 5,
 'daily': 3,
 'duplex': 3,
 'None': 2,
 'CRE Inc': 2,
 'NO BROKER FEE': 2,
 'valet parking': 2,
 'condo': 2,
 'PRIVATE OWNER': 1,
 'REMAX': 1,
 'cottage/cabin': 1,
 'weekly': 1,
 'Parking space fee is negotiable': 1}

#### We will have some noise that we don't know the meaning of. 

Many of the attributes seem to be one-off items that appear in only one or two posts. Let us drop the ones with fewer than 5 instances, which makes "w/d hookups" (5 instances) the last attribute we keep.

In [124]:
# Filtering out attributes with fewer than 5 instances
filtered_attributes = {key: value for key, value in cleaned_attributes.items() if value >= 5}

In [125]:
filtered_attributes

{'monthly': 2323,
 'apartment': 2315,
 'cats are OK - purrr': 1823,
 'dogs are OK - wooof': 1425,
 'laundry on site': 1392,
 'air conditioning': 1339,
 'off-street parking': 817,
 'EV charging': 684,
 'w/d in unit': 532,
 'carport': 501,
 'no smoking': 498,
 'attached garage': 431,
 'detached garage': 394,
 'laundry in bldg': 371,
 'Fee Needed To Apply': 361,
 'wheelchair accessible': 354,
 'no parking': 140,
 'furnished': 83,
 'street parking': 42,
 'no laundry on site': 27,
 'house': 6,
 'w/d hookups': 5}

In [126]:
def collect_basic_information(the_soup):
    title_element = the_soup.find("span", id="titletextonly")
    title = title_element.text.strip() if title_element else "Title Not Found"
    
    price_element = the_soup.find("span", class_="price")
    price = price_element.text.strip() if price_element else "Price Not Found"
    
    housing_element = the_soup.find("span", class_="housing")
    if housing_element:
        try:
            bedroom_info = housing_element.text.split("/")[1].split("-")[0].strip()
            square_feet = housing_element.text.split("-")[1].split("ft")[0].strip()
        except IndexError:
            bedroom_info = "Bedrooms Info Not Found"
            square_feet = "Square Feet Not Found"
    else:
        bedroom_info = "Bedrooms Info Not Found"
        square_feet = "Square Feet Not Found"
    
    full_address_element = the_soup.find("h2", class_="street-address")
    full_address = full_address_element.text.strip() if full_address_element else "None listed"

    return title, price, bedroom_info, square_feet, full_address
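
The bedroom and square footage parsing relies on the layout of the text inside the "housing" span. The sketch below isolates that split logic in a hypothetical helper, `split_housing_text`, and runs it on three strings. The first string is an assumed typical layout and the other two are made-up edge cases. None of them were copied from a real listing.

```python
def split_housing_text(housing_text):
    """Repeat the split logic from collect_basic_information on a plain string."""
    try:
        # Text between "/" and the first "-" is treated as the bedroom info
        bedroom_info = housing_text.split("/")[1].split("-")[0].strip()
        # Text between the first "-" and "ft" is treated as the square footage
        square_feet = housing_text.split("-")[1].split("ft")[0].strip()
    except IndexError:
        bedroom_info = "Bedrooms Info Not Found"
        square_feet = "Square Feet Not Found"
    return bedroom_info, square_feet

typical_result = split_housing_text("/ 1br - 750ft2 -")  # assumed typical layout
no_sqft_result = split_housing_text("/ 1br -")           # square footage missing
no_slash_result = split_housing_text("1br")              # no "/" separator at all

print(typical_result)
print(no_sqft_result)
print(no_slash_result)
```

Note the middle case: when the square footage is missing but the dash is present, no `IndexError` is raised and the square footage comes back as an empty string rather than "Square Feet Not Found". That is worth remembering when cleaning the "Square Feet" column later.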


In [127]:
# # Define the count_attributes_function to view all the attributes used in apartment listings
# def create_dataframe(links_and_soups, listings_df):
#     localized_df = listings_df.copy()
    
#     for link, soup in links_and_soups.items():
#         title, price, bedroom_info, square_feet, full_address = collect_basic_information(soup)
#         listing_attributes, fee_needed = process_attributes(soup)
        
#         new_row_data = {"Title": title, "Price": price, "Bedrooms": bedroom_info, "Square Feet": square_feet, "Full Address": full_address}
        
#         # For each attribute in filtered_attributes, add to new_row_data with 1 or 0
#         for attribute in filtered_attributes.keys():
#             new_row_data[attribute] = 1 if attribute in listing_attributes else 0

#         # Convert new_row_data to a DataFrame row and concat to localized_df
#         new_row_df = pd.DataFrame([new_row_data])
#         localized_df = pd.concat([localized_df, new_row_df], ignore_index=True)
    
#     return localized_df


def create_dataframe(links_and_soups, listings_df):
    localized_df = listings_df.copy()
    
    for link, soup in links_and_soups.items():
        title, price, bedroom_info, square_feet, full_address = collect_basic_information(soup)
        listing_attributes, fee_needed = process_attributes(soup)  # Capture fee_needed flag here
        
        # Start with basic info
        new_row_data = {
            "Title": title,
            "Price": price,
            "Bedrooms": bedroom_info,
            "Square Feet": square_feet,
            "Full Address": full_address,
        }
        
        # For each attribute in filtered_attributes, add to new_row_data with 1 or 0
        for attribute in filtered_attributes.keys():
            new_row_data[attribute] = 1 if attribute in listing_attributes else 0
        
        # Add "Fee Needed To Apply" after processing filtered_attributes
        new_row_data["Fee Needed To Apply"] = fee_needed

        # Convert new_row_data to a DataFrame row and concat to localized_df
        new_row_df = pd.DataFrame([new_row_data])
        localized_df = pd.concat([localized_df, new_row_df], ignore_index=True)
    
    return localized_df


In [128]:
# Now call the updated function
listings_df = create_dataframe(links_and_soups, listings_df)

In [129]:
listings_df.head()

Unnamed: 0,Title,Price,Bedrooms,Square Feet,Full Address,monthly,apartment,cats are OK - purrr,dogs are OK - wooof,laundry on site,air conditioning,off-street parking,EV charging,w/d in unit,carport,no smoking,attached garage,detached garage,laundry in bldg,Fee Needed To Apply,wheelchair accessible,no parking,furnished,street parking,no laundry on site,house,w/d hookups
0,Title Not Found,Price Not Found,Bedrooms Info Not Found,Square Feet Not Found,None listed,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1 Bedroom in Marina Del Rey -Quartz Counters -...,"$3,295",1br,750,"415 Washington Boulevard, Venice, CA 90292",1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1 Bedroom 1 BA in West L.A. | Hardwood Style F...,"$2,250",1br,700,None listed,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Lease TODAY, Save BIG! One Month FREE Rent Offer!","$2,700",1br,590,"11411 Rochester Avenue, Los Angeles, CA 90025",1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1 Bedroom in the Heart of Venice* Plank Floors...,"$2,895",1br,750,"237 Fourth Avenue, Venice, CA 90291",1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [132]:
len(listings_df)

2395

In [133]:
listings_df.to_csv('../Data/craigslist_data.csv', index=False)