Project 2: Web Scraping and Regression

The goal of this project is to build a linear regression model to predict some numerical value using data scraped from the world wide web.

For this project, I have chosen to scrape Craigslist car postings in the SF Bay Area. The scrape covers July 10, 2020 through July 12, 2020. I will collect as many attributes as I can from each posting and see how they correlate with price, which is the target variable I am trying to predict.

In this notebook, I scrape Craigslist car postings, save the post URLs to CSV files, and save the scraped data as pickles for later analysis.

### Notebook Table of Contents

1. Imports
2. Scraping a Craigslist Car Post
3. Scraping Craigslist Results Page
4. Building my DataFrame
5. Saving the Data

### 1. Imports

In [1]:
# import Web Scraping tools
from bs4 import BeautifulSoup
import requests

# import tools to slow down web scraping
import time
import random

In [2]:
# import Data Science Tools
import pandas as pd
import numpy as np
import pickle

# import csv to save data
import csv

# import datetime
from datetime import datetime

### 2. Scraping a Craigslist Car Post

For my first web scraping exercise, I will practice scraping one post and refine my scraping from there.

In [3]:
# This url may not work because the post may have been deleted.
url = "https://sfbay.craigslist.org/sfc/cto/d/san-francisco-2007-toyota-camry-hybrid/7156527385.html"

response = requests.get(url)

In [4]:
response.status_code # 200 = success!

200

In [5]:
page = response.text
page

'<!DOCTYPE html>\n<html class="no-js">\n<head>\n    \n\t<meta charset="UTF-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta name="viewport" content="width=device-width,initial-scale=1">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:title" content="2007 Toyota Camry Hybrid Sedan! - cars &amp; trucks - by owner - vehicle...">\n\t<meta name="description" content="Greetings i am selling my 2007 Toyota Camry Hybrid LE that is currently in awesome condition...">\n\t<meta property="og:description" content="Greetings i am selling my 2007 Toyota Camry Hybrid LE that is currently in awesome condition...">\n\t<meta property="og:image" content="https://images.craigslist.org/00101_jzVubKwL5kx_0CI0t2_600x450.jpg">\n\t<meta property="og:url" content="https://sfbay.craigslist.org/sfc/cto/d/san-francisco-2007-toyota-camry-hybrid/7156527385.html">\n\t<meta property="og:type" content="article">\n\t<meta na

In [6]:
soup = BeautifulSoup(page, 'lxml')
soup

<!DOCTYPE html>
<html class="no-js">
<head>
<meta charset="utf-8"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="craigslist" property="og:site_name"/>
<meta content="preview" name="twitter:card"/>
<meta content="2007 Toyota Camry Hybrid Sedan! - cars &amp; trucks - by owner - vehicle..." property="og:title"/>
<meta content="Greetings i am selling my 2007 Toyota Camry Hybrid LE that is currently in awesome condition..." name="description"/>
<meta content="Greetings i am selling my 2007 Toyota Camry Hybrid LE that is currently in awesome condition..." property="og:description"/>
<meta content="https://images.craigslist.org/00101_jzVubKwL5kx_0CI0t2_600x450.jpg" property="og:image"/>
<meta content="https://sfbay.craigslist.org/sfc/cto/d/san-francisco-2007-toyota-camry-hybrid/7156527385.html" property="og:url"/>
<meta content="article" property="og:type"/>
<meta content="noarchive,nofollow,unavaila

That's great! I got a successful status code and was able to parse through the HTML with BeautifulSoup.

What information do I want?
* Year, make, and model of the car
* Price
* Other car attributes (odometer reading, transmission, type, size)
* Nice to have: date of posting, post title

In [110]:
car_attribute_dict = {}

In [51]:
car_attr = soup.find_all('p', class_='attrgroup') # found the attributes
car_attr

[<p class="attrgroup">
 <span><b>2007 toyota camry</b></span>
 <br/>
 </p>,
 <p class="attrgroup">
 <span>fuel: <b>gas</b></span>
 <br/>
 <span>odometer: <b>132000</b></span>
 <br/>
 <span>paint color: <b>blue</b></span>
 <br/>
 <span>title status: <b>salvage</b></span>
 <br/>
 <span>transmission: <b>automatic</b></span>
 <br/>
 </p>]

In [71]:
# parsing through 2 "attrgroup" classes
attr_group_1 = soup.find('p', class_='attrgroup')
attr_group_2 = attr_group_1.find_next_sibling('p')
print(attr_group_1)
print("\n", attr_group_2)  

<p class="attrgroup">
<span><b>2007 toyota camry</b></span>
<br/>
</p>

 <p class="attrgroup">
<span>fuel: <b>gas</b></span>
<br/>
<span>odometer: <b>132000</b></span>
<br/>
<span>paint color: <b>blue</b></span>
<br/>
<span>title status: <b>salvage</b></span>
<br/>
<span>transmission: <b>automatic</b></span>
<br/>
</p>


In [72]:
# grabbing year, make, and model
attr_group_1.text.strip().split()

['2007', 'toyota', 'camry']

In [80]:
# storing year, make, and model
car_year = int(attr_group_1.text.strip().split()[0]) # convert the year from text to an integer
car_make = attr_group_1.text.strip().split()[1] # this doesn't always work
car_model = ' '.join(attr_group_1.text.strip().split()[2:])
print(car_year, car_make, car_model, type(car_year), type(car_make), type(car_model))

2007 toyota camry <class 'int'> <class 'str'> <class 'str'>
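The comment above notes that splitting on whitespace doesn't always work. Below is a minimal sketch of a more defensive split, assuming the text looks like `"2007 toyota camry"`. The function name `split_year_make_model` is my own and not part of the notebook, and multi-word makes (e.g. "land rover") would still be split incorrectly.

```python
import re

def split_year_make_model(text):
    """Split text like '2007 toyota camry' into (year, make, model).

    Returns (None, None, None) when the text does not start with a
    four-digit year followed by at least one more word.
    """
    tokens = text.strip().split()
    # Require a plausible four-digit year followed by at least a make
    if len(tokens) < 2 or not re.fullmatch(r"(19|20)\d{2}", tokens[0]):
        return None, None, None
    year = int(tokens[0])
    make = tokens[1]
    # Everything after the make is treated as the model (may be missing)
    model = " ".join(tokens[2:]) or None
    return year, make, model

print(split_year_make_model("2007 toyota camry"))
print(split_year_make_model("toyota camry"))
```

Checking that the first token is a year keeps posts with a missing year from silently shifting the make into the year column.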


In [111]:
# populating the dictionary to be used for my DataFrame
car_attribute_dict['year'] = car_year
car_attribute_dict['make'] = car_make
car_attribute_dict['model'] = car_model

In [98]:
# figuring out a strategy to grab attributes from the 2nd "attrgroup" class
attr_group_2

<p class="attrgroup">
<span>fuel: <b>gas</b></span>
<br/>
<span>odometer: <b>132000</b></span>
<br/>
<span>paint color: <b>blue</b></span>
<br/>
<span>title status: <b>salvage</b></span>
<br/>
<span>transmission: <b>automatic</b></span>
<br/>
</p>

In [99]:
# using the <span> tag to collect attributes
attr_group_2.find_all('span')

[<span>fuel: <b>gas</b></span>,
 <span>odometer: <b>132000</b></span>,
 <span>paint color: <b>blue</b></span>,
 <span>title status: <b>salvage</b></span>,
 <span>transmission: <b>automatic</b></span>]

In [112]:
# looping through the <span> tag to collect and add attributes to the dictionary
for attr in attr_group_2.find_all('span'):
    if ':' in attr.text: # don't grab attributes that say "cryptocurrency ok!" or something similar
        attribute = attr.text.split(':')[0].strip()
        value = attr.text.split(':')[1].strip()
        car_attribute_dict[attribute] = value
car_attribute_dict

{'year': 2007,
 'make': 'toyota',
 'model': 'camry',
 'fuel': 'gas',
 'odometer': '132000',
 'paint color': 'blue',
 'title status': 'salvage',
 'transmission': 'automatic'}
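One thing to watch for in the loop above: `split(':')[1]` keeps only the text between the first and second colon, so a value that itself contains a colon would be cut short. The sketch below shows one way around that using `str.partition`. It works on plain strings standing in for `attr.text`, and the helper name `parse_attribute` is hypothetical.

```python
def parse_attribute(span_text):
    """Return (name, value) for 'name: value' text, or None if there is no colon."""
    # partition splits on the first colon only, so the value keeps any later colons
    name, separator, value = span_text.partition(":")
    if not separator:
        return None
    return name.strip(), value.strip()

# Stand-ins for the text of each <span> in the second attribute group
span_texts = ["fuel: gas", "odometer: 132000", "cryptocurrency ok!"]

parsed = dict(
    pair for pair in map(parse_attribute, span_texts) if pair is not None
)
print(parsed)
```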

In [113]:
# grabbing post title, but it doesn't look right
title_text = soup.find('span', class_='postingtitletext')
car_attribute_dict['post_title'] = title_text
car_attribute_dict

{'year': 2007,
 'make': 'toyota',
 'model': 'camry',
 'fuel': 'gas',
 'odometer': '132000',
 'paint color': 'blue',
 'title status': 'salvage',
 'transmission': 'automatic',
 'post_title': <span class="postingtitletext">
 <span id="titletextonly">2007 Toyota Camry Hybrid Sedan!</span> - <span class="price">$3800</span><small> (San Francisco)</small> </span>}

In [22]:
# trying to use an id instead of text to get the post title
soup.find('span', id='titletextonly')

<span id="titletextonly">2007 Toyota Camry Hybrid Sedan!</span>

In [114]:
# figuring out how to grab the post title and add to my dictionary
title = soup.find('span', id='titletextonly').text
car_attribute_dict['post_title'] = title
car_attribute_dict

{'year': 2007,
 'make': 'toyota',
 'model': 'camry',
 'fuel': 'gas',
 'odometer': '132000',
 'paint color': 'blue',
 'title status': 'salvage',
 'transmission': 'automatic',
 'post_title': '2007 Toyota Camry Hybrid Sedan!'}

In [23]:
# figuring out how to grab price
soup.find('span', class_='price')

<span class="price">$3800</span>

In [116]:
price = int(soup.find('span', class_='price').text.replace('$', '')) # "$3800" -> 3800

In [120]:
# adding price to the dictionary
car_attribute_dict['price'] = price
car_attribute_dict

{'year': 2007,
 'make': 'toyota',
 'model': 'camry',
 'fuel': 'gas',
 'odometer': '132000',
 'paint color': 'blue',
 'title status': 'salvage',
 'transmission': 'automatic',
 'post_title': '2007 Toyota Camry Hybrid Sedan!',
 'price': 3800}
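Prices on Craigslist are text such as `$3800`, and the scraped data later in this notebook also contains values like `$9999.00`. Here is a rough sketch of a price cleaner that handles both. The name `parse_price` is hypothetical, and the comma handling is an assumption on my part, since none of the prices shown here contain commas.

```python
def parse_price(price_text):
    """Convert text like '$3800' or '$9999.00' to a float, or None if it can't be parsed."""
    if price_text is None:
        return None
    # Drop the dollar sign and any thousands separators before converting
    cleaned = price_text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

for raw in ["$3800", "$9999.00", None]:
    print(raw, "->", parse_price(raw))
```

Returning `None` instead of raising keeps a single odd posting from stopping a long scrape.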

In [123]:
# adding location to the dictionary
location = soup.find('small').text.strip().replace('(','').replace(')','')
car_attribute_dict['location'] = location
car_attribute_dict

{'year': 2007,
 'make': 'toyota',
 'model': 'camry',
 'fuel': 'gas',
 'odometer': '132000',
 'paint color': 'blue',
 'title status': 'salvage',
 'transmission': 'automatic',
 'post_title': '2007 Toyota Camry Hybrid Sedan!',
 'price': 3800,
 'location': 'San Francisco'}

In [126]:
# adding posting date to the dictionary
post_timeago = soup.find('time', class_='date timeago').text.strip()
car_attribute_dict['posting_time'] = post_timeago
car_attribute_dict

{'year': 2007,
 'make': 'toyota',
 'model': 'camry',
 'fuel': 'gas',
 'odometer': '132000',
 'paint color': 'blue',
 'title status': 'salvage',
 'transmission': 'automatic',
 'post_title': '2007 Toyota Camry Hybrid Sedan!',
 'price': 3800,
 'location': 'San Francisco',
 'posting_time': '2020-07-09 15:54'}

In [97]:
posting_time = datetime.strptime(post_timeago, '%Y-%m-%d %H:%M')
posting_time

datetime.datetime(2020, 7, 9, 15, 54)
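`datetime.strptime` raises a `ValueError` when the text doesn't match the format, which would interrupt a long scraping loop. A small sketch of a forgiving wrapper follows, assuming the same `'%Y-%m-%d %H:%M'` format as above. The name `parse_posting_time` is my own.

```python
from datetime import datetime

def parse_posting_time(text, fmt="%Y-%m-%d %H:%M"):
    """Parse a posting time string, returning None when it is missing or malformed."""
    if not text:
        return None
    try:
        return datetime.strptime(text.strip(), fmt)
    except ValueError:
        # The text did not match the expected format
        return None

print(parse_posting_time("2020-07-09 15:54"))
print(parse_posting_time("Jul  9"))
```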

### 3. Scraping Craigslist Results Page

Now that I know how to scrape one Craigslist car post, how do I get links to all the Craigslist car postings?

There are 2 things I need to scrape:
1. Each Craigslist car post
2. The main Craigslist car search results with 120 car posts per search result page

In [140]:
# Looking at Craigslist's main search result page
search_url = "https://sfbay.craigslist.org/search/cta"
search_response = requests.get(search_url)
search_response.status_code

200

In [141]:
search_page = search_response.text
search_page

'\ufeff<!DOCTYPE html>\n<html class="no-js"><head>\n    <title>SF bay area cars &amp; trucks  - craigslist</title>\n\n    <meta name="description" content="SF bay area cars &amp; trucks  - craigslist">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>\n    <link rel="canonical" href="https://sfbay.craigslist.org/search/cta">\n    <link rel="alternate" type="application/rss+xml" href="https://sfbay.craigslist.org/search/cta?format=rss" title="RSS feed for craigslist | SF bay area cars &amp; trucks  - craigslist">\n        <link rel="next" href="https://sfbay.craigslist.org/search/cta?s=120">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/cl.css?v=5826de27c327d61d2169c6a45af814f9">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/search.css?v=bc035cbbc3978b0ec9df93944cdf349b">\n    <link type="text/css" rel="stylesheet" med

In [142]:
search_soup = BeautifulSoup(search_page, 'lxml')
search_soup

<html><body><p>﻿<!DOCTYPE html>

</p>
<title>SF bay area cars &amp; trucks  - craigslist</title>
<meta content="SF bay area cars &amp; trucks  - craigslist" name="description"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<link href="https://sfbay.craigslist.org/search/cta" rel="canonical"/>
<link href="https://sfbay.craigslist.org/search/cta?format=rss" rel="alternate" title="RSS feed for craigslist | SF bay area cars &amp; trucks  - craigslist" type="application/rss+xml"/>
<link href="https://sfbay.craigslist.org/search/cta?s=120" rel="next"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="//www.craigslist.org/styles/cl.css?v=5826de27c327d61d2169c6a45af814f9" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.craigslist.org/styles/search.css?v=bc035cbbc3978b0ec9df93944cdf349b" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.craigslist.org/styles/jquery-ui-clcustom.css?v=3b05ddffb7c7f5b62066deff2dda9339" media

In [146]:
# Pinpointing results and seeing how I can grab post URLs
result_info = search_soup.find_all('p', class_='result-info')
result_info

[<p class="result-info">
 <span class="icon icon-star" role="button">
 <span class="screen-reader-text">favorite this post</span>
 </span>
 <time class="result-date" datetime="2020-07-09 22:10" title="Thu 09 Jul 10:10:23 PM">Jul  9</time>
 <a class="result-title hdrlnk" data-id="7156661924" href="https://sfbay.craigslist.org/sby/cto/d/san-jose-2003-toyota-camry-xle-4/7156661924.html">*2003 Toyota Camry XLE , 4 cylender .Price to Sell</a>
 <span class="result-meta">
 <span class="result-price">$3800</span>
 <span class="result-hood"> (san jose north)</span>
 <span class="result-tags">
 <span class="pictag">pic</span>
 </span>
 <span class="banish icon icon-trash" role="button">
 <span class="screen-reader-text">hide this posting</span>
 </span>
 <span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
 <a class="restore-link" href="#">
 <span class="restore-narrow-text">restore</span>
 <span class="restore-wide-text">restore this posting</span>
 </a>
 </span>


In [148]:
# Generating a list of Craigslist post URLs
result_urls = []
for info in result_info:
    result_urls.append(info.find('a')['href'])
result_urls

['https://sfbay.craigslist.org/sby/cto/d/san-jose-2003-toyota-camry-xle-4/7156661924.html',
 'https://sfbay.craigslist.org/sby/cto/d/los-gatos-bmw-750-li-2011/7156661748.html',
 'https://sfbay.craigslist.org/nby/cto/d/vineburg-2002-lexus-ls430-sedan-great/7156661493.html',
 'https://sfbay.craigslist.org/eby/cto/d/pittsburg-03-toyota-camry-se-5-speed/7156661075.html',
 'https://sfbay.craigslist.org/pen/cto/d/san-mateo-2004-lexus-gx470/7156661023.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2010-toyota-prius-free/7156660966.html',
 'https://sfbay.craigslist.org/sby/ctd/d/san-jose-2011-toyota-corolla-hybrid/7156660916.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2008-toyota-highlander-hybrid/7156660889.html',
 'https://sfbay.craigslist.org/sby/ctd/d/san-jose-2008-toyota-highlander-hybrid/7156660868.html',
 'https://sfbay.craigslist.org/sby/ctd/d/san-jose-2008-toyota-camry-hybrid-rare/7156660805.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2012-toyota-pr

In [149]:
# Checking that I grabbed all 120 URLs listed on Craigslist's main search results page
print(len(result_urls))

120


Now that I know how to scrape one of Craigslist's main search results pages, how do I get all of the other search pages?  
I noticed that the search result URLs follow a pattern: Craigslist appends `?s=120`, then `?s=240`, and so on to the base URL in increments of 120, all the way up to `?s=2880`.

In [154]:
# Generating a list of Craigslist search result URLs to pull individual car post URLs from.
base_url = 'https://sfbay.craigslist.org/search/cta'
url_list = [base_url]
for i in range(120, 2881, 120):
    url_list.append(f'{base_url}?s={i}')
url_list

['https://sfbay.craigslist.org/search/cta',
 'https://sfbay.craigslist.org/search/cta?s=120',
 'https://sfbay.craigslist.org/search/cta?s=240',
 'https://sfbay.craigslist.org/search/cta?s=360',
 'https://sfbay.craigslist.org/search/cta?s=480',
 'https://sfbay.craigslist.org/search/cta?s=600',
 'https://sfbay.craigslist.org/search/cta?s=720',
 'https://sfbay.craigslist.org/search/cta?s=840',
 'https://sfbay.craigslist.org/search/cta?s=960',
 'https://sfbay.craigslist.org/search/cta?s=1080',
 'https://sfbay.craigslist.org/search/cta?s=1200',
 'https://sfbay.craigslist.org/search/cta?s=1320',
 'https://sfbay.craigslist.org/search/cta?s=1440',
 'https://sfbay.craigslist.org/search/cta?s=1560',
 'https://sfbay.craigslist.org/search/cta?s=1680',
 'https://sfbay.craigslist.org/search/cta?s=1800',
 'https://sfbay.craigslist.org/search/cta?s=1920',
 'https://sfbay.craigslist.org/search/cta?s=2040',
 'https://sfbay.craigslist.org/search/cta?s=2160',
 'https://sfbay.craigslist.org/search/cta?s=22
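The same URL pattern can be wrapped in a small function so the page size and last offset aren't hard-coded. This is only a sketch: `build_search_urls` and its parameters are hypothetical names, and the defaults of 120 and 2880 are taken from the pattern observed above.

```python
from urllib.parse import urlencode

def build_search_urls(base_url, page_size=120, last_offset=2880):
    """Return the base search URL followed by one URL per result offset."""
    urls = [base_url]
    for offset in range(page_size, last_offset + 1, page_size):
        # urlencode builds the "s=120" style query string
        urls.append(f"{base_url}?{urlencode({'s': offset})}")
    return urls

urls = build_search_urls("https://sfbay.craigslist.org/search/cta")
print(len(urls))
print(urls[1])
print(urls[-1])
```

With 120 posts per page, the number of search pages times 120 gives an upper bound on how many post URLs to expect from one pass.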

### 4. Building my DataFrame

Here, I will build out a pipeline for web scraping.

In [3]:
def get_year_make_model(car_soup):
    '''
    This function parses through the first group of attributes in the HTML and returns car year, make, and model

    Parameters:
        car_soup: BeautifulSoup HTML object
    Returns:
        car_year: year the car was made
        car_make: make of the car (e.g. Honda)
        car_model: model of the car (e.g. Accord)
    '''    
    
    if car_soup.find('p', class_='attrgroup'): # only assign year, make, and model if the attributes exist
        year_make_model = car_soup.find('p', class_='attrgroup')
    
        if len(year_make_model.text.strip().split()) > 2:
            car_year = year_make_model.text.strip().split()[0]
            car_make = year_make_model.text.strip().split()[1]
            car_model = ' '.join(year_make_model.text.strip().split()[2:])
        else:
            car_year = None
            car_make = None
            car_model = None
    else:
        car_year = None
        car_make = None
        car_model = None
    
    return car_year, car_make, car_model
    

In [4]:
def get_car_attributes(car_soup, car_dict):
    '''
    This function parses through the second group of attributes in the HTML and returns a dictionary with attribute names and values added
    Parameters:
        car_soup: BeautifulSoup HTML object
        car_dict: main dictionary to add car attributes
    Returns:
        car_dict: dictionary with attribute names and values added
    '''
    
    # if the second attribute group HTML tag exists, go to it
    if car_soup.find('p', class_='attrgroup') and car_soup.find('p', class_='attrgroup').find_next_sibling('p'):
        year_make_model = car_soup.find('p', class_='attrgroup')
        car_attributes = year_make_model.find_next_sibling('p')
    
        # grab attributes from the second attribute group
        for attr in car_attributes.find_all('span'):
            # attributes should look like "condition: excellent" rather than "cryptocurrency ok!" 
            if ':' in attr.text:
                # split on the first colon only so values that contain a colon are kept whole
                attribute_name = attr.text.split(':', 1)[0].strip()
                attribute_value = attr.text.split(':', 1)[1].strip()
                car_dict[attribute_name] = attribute_value
    
    return car_dict

In [5]:
def get_title(car_soup):
    '''
    This function parses through the post title HTML id and returns the post title
    Parameters:
        car_soup: BeautifulSoup HTML object
    Returns:
        (the post title as text or None if the title can't be found)
    '''
    
    if car_soup.find('span', id='titletextonly'):
        return car_soup.find('span', id='titletextonly').text
    else:
        return None

In [6]:
def get_price(car_soup):
    '''
    This function parses through the price HTML class and returns the price
    Parameters:
        car_soup: BeautifulSoup HTML object
    Returns:
        (the price as text or None if the price can't be found)
    '''    
    
    if car_soup.find('span', class_='price'):
        return car_soup.find('span', class_='price').text
    else:
        return None

In [7]:
def get_location(car_soup):
    '''
    This function parses through the HTML and returns the post location
    Parameters:
        car_soup: BeautifulSoup HTML object
    Returns:
        (the post location as text or None if the location can't be found)
    '''    

    if car_soup.find('small'):
        # return "san mateo" instead of "(san mateo)"
        return car_soup.find('small').text.strip().replace('(','').replace(')','')
    else:
        return None

In [8]:
def get_posting_date(car_soup):
    '''
    This function parses through the time HTML tag and returns the posting time
    Parameters:
        car_soup: BeautifulSoup HTML object
    Returns:
        (the post time as text or None if the posting time can't be found)
    '''
    
    if car_soup.find('time', class_='date timeago'):
        return car_soup.find('time', class_='date timeago').text.strip()
    else:
        return None

In [9]:
def soupify_url(craigslist_url):
    '''
    This function sends a request to a Craigslist URL and tries to return the HTML
    as a BeautifulSoup HTML object.
    Parameters:
        craigslist_url: the url of the Craigslist car posting
    Returns:
        (the Craigslist car post BeautifulSoup HTML object)
    '''    
    
    try:
        response = requests.get(craigslist_url)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        car_page = response.text
        return BeautifulSoup(car_page, 'lxml')
    # if the URL doesn't work, print out the URL and return None
    except requests.exceptions.RequestException:
        print(craigslist_url + " is not okay")
        return None
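Some failures are temporary, so a retry can save a URL that would otherwise be dropped. The sketch below is not part of my pipeline. It assumes a caller-supplied `fetch` function (for example, one that wraps `requests.get`), and the names `fetch_with_retries`, `retries`, and `delay` are hypothetical.

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times, sleeping between failed attempts.

    Returns whatever fetch returns, or None if every attempt raises.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # a real scraper would catch a narrower exception
            print(f"attempt {attempt} failed for {url}: {error}")
            if attempt < retries:
                time.sleep(delay)
    return None

# Usage with a stand-in fetch function that fails once, then succeeds
attempts = []

def fake_fetch(url):
    attempts.append(url)
    if len(attempts) < 2:
        raise ConnectionError("temporary failure")
    return "<html></html>"

print(fetch_with_retries(fake_fetch, "https://example.com", delay=0))
```

Passing the fetch function in as an argument makes the retry logic easy to test without touching the network.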

In [10]:
def get_car_info(craigslist_url):
    '''
    This is the main function that puts everything together. It scrapes
    the Craigslist car url for attributes and returns a dictionary with all the scraped attributes.
    Parameters:
        craigslist_url: the url of the Craigslist car posting
    Returns:
        car_dict: a dictionary with all scraped attributes
    '''    
    
    car_soup = soupify_url(craigslist_url)
    
    if car_soup is None:
        return None
    
    car_dict = {}
    
    get_car_attributes(car_soup, car_dict)
    
    car_year, car_make, car_model = get_year_make_model(car_soup)
    car_dict['year'] = car_year
    car_dict['make'] = car_make
    car_dict['model'] = car_model
    
    car_dict['post_title'] = get_title(car_soup)
    car_dict['price'] = get_price(car_soup)
    car_dict['location'] = get_location(car_soup)
    car_dict['posting_date'] = get_posting_date(car_soup)
    
    car_dict['url'] = craigslist_url
    
    return car_dict


In [4]:
# testing my main function
sample_test_1 = get_car_info('https://sfbay.craigslist.org/sfc/cto/d/san-francisco-2007-toyota-camry-hybrid/7156527385.html')
# Note: this posting was deleted by the author

test_dict_1 = []
test_dict_1.append(sample_test_1)

https://sfbay.craigslist.org/sfc/cto/d/san-francisco-2007-toyota-camry-hybrid/7156527385.html is not okay


Now, I want to test my function on a Craigslist search results page.

In [2]:
page_url = "https://sfbay.craigslist.org/search/cta"
search_page = requests.get(page_url).text
search_soup = BeautifulSoup(search_page, 'lxml')

In [None]:
# for one page of results, grab all the individual post URLs
all_results = search_soup.find_all('p', class_='result-info')
result_urls = []

for info in all_results:
    result_urls.append(info.find('a')['href'])
    
result_urls

In [11]:
# generate main craigslist search URLs
def generate_craigslist_urls():
    '''
    Generate a list of all Craigslist car posting search URLs
    Returns:
        search_craigslist_urls: a list of all craigslist search urls from ?s=120 to ?s=2880
    ''' 
    
    base_url = "https://sfbay.craigslist.org/search/cta"
    search_craigslist_urls = [base_url]

    for i in range(120, 2881, 120):
        search_craigslist_urls.append(f'{base_url}?s={i}')
    return search_craigslist_urls

In [32]:
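# doing more testing
# 480 results = 4 pages x 120 posts, so this test appears to have used only the first four search pages
result_urls = []
for link in generate_craigslist_urls()[:4]: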
    search_page = requests.get(link).text
    search_soup = BeautifulSoup(search_page, 'lxml')
    
    all_results = search_soup.find_all('p', class_='result-info')
    
    for result in all_results:
        result_urls.append(result.find('a')['href'])
        
    time.sleep(random.randint(1,2))
        
print(len(result_urls))

480


### 5. Saving the Data

My goal is to scrape Craigslist on July 10, 11, and 12, save the URLs to a CSV, use the URLs to generate dictionaries for each car posting, turn the dictionaries into a DataFrame, and then save the DataFrame as a pickle.

In [None]:
# Generate main Craigslist search result URLs
search_craigslist_urls = generate_craigslist_urls()
search_craigslist_urls

In [37]:
# This was performed on July 10, 2020. Grab all the individual Craigslist post URLs from the main search pages.
result_urls = []
for link in search_craigslist_urls:
    search_page = requests.get(link).text
    search_soup = BeautifulSoup(search_page, 'lxml')
    
    all_results = search_soup.find_all('p', class_='result-info')
    
    for result in all_results:
        result_urls.append(result.find('a')['href'])
        
    time.sleep(random.randint(1,2))
result_urls

['https://sfbay.craigslist.org/eby/cto/d/concord-2013-chevy-tahoe-salvage-con/7156858229.html',
 'https://sfbay.craigslist.org/nby/ctd/d/2017-volvo-xc90-design/7156857632.html',
 'https://sfbay.craigslist.org/scz/ctd/d/monterey-2013-audi-q5-quattro-4dr-20t/7156857622.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2013-hyundai-veloster-z/7156857576.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2012-mitsubishi-galant-z/7156857334.html',
 'https://sfbay.craigslist.org/sby/ctd/d/watsonville-2019-gmc-yukon-4wd-4d-sport/7156857189.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2012-acura-tl-z-financing/7156857020.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2012-acura-tl-z-financing/7156856746.html',
 'https://sfbay.craigslist.org/scz/ctd/d/watsonville-2017-jeep-cherokee-fwd-4d/7156856673.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2007-toyota-camry-z-financing/7156856536.html',
 'https://sfbay.craigslist.org/eby/cto/d/martinez-2004-vol

In [39]:
# Turn the list of URLs into a DataFrame
result_urls_df = pd.DataFrame(result_urls)
result_urls_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3000 non-null   object
dtypes: object(1)
memory usage: 23.6+ KB


In [40]:
# Save the July 10, 2020 URLs to a CSV
result_urls_df.to_csv('data/july_10_craigslist_urls.csv', index=False)

In [None]:
# Test opening the CSV
with open('data/july_10_craigslist_urls.csv', 'r') as file:
    # there is a 0 saved as the first value of the csv that I don't want
    july_10_urls = [url.strip() for url in file][1:]
july_10_urls
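The stray `0` on the first line of the CSV is the default column name pandas gives an unnamed column. An alternative is to write the file with an explicit `url` header and read it back by name. This is a sketch using the standard `csv` module. `save_urls` and `load_urls` are hypothetical helpers, and the file is written to a temporary folder rather than `data/`.

```python
import csv
import os
import tempfile

def save_urls(urls, path):
    """Write one URL per row under a 'url' header."""
    with open(path, "w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["url"])
        writer.writerows([url] for url in urls)

def load_urls(path):
    """Read the 'url' column back into a list."""
    with open(path, newline="") as file:
        return [row["url"] for row in csv.DictReader(file)]

sample_urls = [
    "https://sfbay.craigslist.org/eby/cto/d/concord-2013-chevy-tahoe-salvage-con/7156858229.html",
    "https://sfbay.craigslist.org/nby/ctd/d/2017-volvo-xc90-design/7156857632.html",
]
path = os.path.join(tempfile.mkdtemp(), "example_urls.csv")
save_urls(sample_urls, path)
print(load_urls(path) == sample_urls)
```

Reading by column name avoids having to remember to skip the first line.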

In [14]:
# Use my main function to generate car attribute dictionaries for each Craigslist post URL
july_10_info = []

for url in july_10_urls:
    july_10_info.append(get_car_info(url))
    time.sleep(1)

https://sfbay.craigslist.org/sby/ctd/d/redwood-city-2014-dodge-challenger-t/7156835400.html is not okay
https://sfbay.craigslist.org/pen/cto/d/south-san-francisco-reduced-price-2014/7155810710.html is not okay


In [None]:
# Check the first five entries of the list
july_10_info[:5]

In [None]:
# Remove any values where the URL didn't work and the data is None rather than a dictionary
cleaned_july_10_info = [data for data in july_10_info if data is not None]
cleaned_july_10_info

In [22]:
# Generate a DataFrame for the July 10, 2020 data
july_10_df = pd.DataFrame(cleaned_july_10_info)
july_10_df.head()

| | condition | cylinders | drive | fuel | odometer | paint color | title status | transmission | type | year | make | model | post_title | price | location | posting_date | url | VIN | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | good | 8 cylinders | rwd | gas | 160000 | black | salvage | automatic | SUV | 2013 | chevy | tahoe lt | 2013 Chevy Tahoe salvage con 160000 millas | $10300 | concord / pleasant hill / martinez | 2020-07-10 09:14 | https://sfbay.craigslist.org/eby/cto/d/concord... | | |
| 1 | excellent | 4 cylinders | 4wd | gas | 36368 | white | clean | automatic | SUV | 2017 | volvo | xc90 | ✭2017 Volvo XC90 R-Design | $34800 | san rafael | 2020-07-10 09:14 | https://sfbay.craigslist.org/nby/ctd/d/2017-vo... | YV4102KMXH1131603 | mid-size |
| 2 | | 4 cylinders | 4wd | gas | 59541 | black | clean | other | SUV | 2013 | Audi | Q5 | 2013 Audi Q5 quattro 4dr 2.0T Premium Plus | $17900 | 2013 Audi Q5 | 2020-07-10 09:14 | https://sfbay.craigslist.org/scz/ctd/d/montere... | WA1LFAFP2DA057459 | |
| 3 | | 4 cylinders | fwd | gas | 102693 | black | clean | automatic | other | 2013 | Hyundai | Veloster | 2013 Hyundai Veloster - E-Z Financing! | $8499 | | 2020-07-10 09:14 | https://sfbay.craigslist.org/sfc/ctd/d/san-jos... | KMHTC6AE1DU158388 | |
| 4 | | 4 cylinders | fwd | gas | 64183 | white | clean | automatic | sedan | 2012 | Mitsubishi | Galant | 2012 Mitsubishi Galant - E-Z Financing! | $6499 | | 2020-07-10 09:13 | https://sfbay.craigslist.org/sfc/ctd/d/san-jos... | 4A32B3FF9CE017464 | |


In [24]:
# Save the July 10 data as a pickle
with open('data/july_10.pickle', 'wb') as to_write:
    pickle.dump(july_10_df, to_write)
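A quick round trip is an easy way to confirm that a pickle was written correctly before moving on. The sketch below uses a small list of dictionaries as a stand-in for the DataFrame and a temporary folder instead of `data/`. pandas also provides `DataFrame.to_pickle` and `pd.read_pickle` for the same job.

```python
import os
import pickle
import tempfile

# Stand-in for the scraped data (the notebook pickles a DataFrame instead)
records = [{"year": "2013", "make": "chevy", "model": "tahoe lt", "price": "$10300"}]

path = os.path.join(tempfile.mkdtemp(), "example.pickle")

# Write the object to disk
with open(path, "wb") as to_write:
    pickle.dump(records, to_write)

# Read it back and compare with the original
with open(path, "rb") as to_read:
    restored = pickle.load(to_read)

print(restored == records)
```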

In [14]:
# make a function to get all the car posting URLs
def get_individual_car_urls(search_craigslist_urls):
    '''
    Obtain individual car posting URLs
    Parameters:
        search_craigslist_urls: the urls of the Craigslist main results page
    Returns:
        result_urls: a list of URLs corresponding to individual Craigslist car postings
    '''
    
    result_urls = []
    
    for link in search_craigslist_urls:
        search_page = requests.get(link).text
        search_soup = BeautifulSoup(search_page, 'lxml')
        
        all_results = search_soup.find_all('p', class_='result-info')
        
        for result in all_results:
            result_urls.append(result.find('a')['href'])
        
        time.sleep(random.randint(1,2))
    
    return result_urls

In [None]:
# Grab individual Craigslist car post URLs for July 11
july_11_result_urls = get_individual_car_urls(search_craigslist_urls)
july_11_result_urls

In [9]:
# Turn the URLs into CSV
july_11_result_urls_df = pd.DataFrame(july_11_result_urls)
july_11_result_urls_df.to_csv('data/july_11_craigslist_urls.csv', index=False)
july_11_result_urls_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3000 non-null   object
dtypes: object(1)
memory usage: 23.6+ KB


In [20]:
# Remove URLs that I already collected attributes for
def remove_duplicate_urls(new_urls, old_urls):
    '''
    Go through a list of recently acquired urls and remove any duplicate urls that were in a previously
    collected list of urls.
    Parameters:
        new_urls: a list of Craigslist car post URLs that were recently collected
        old_urls: a list of Craigslist car post URLs that were collected previously
    Returns:
        new_urls_deduped: a list of recently collected Craigslist car post URLs with previously
        collected URLs removed
    '''
    
    old_urls_set = set(old_urls)  # set membership checks are much faster than list lookups
    new_urls_deduped = []
    
    for url in new_urls:
        if url not in old_urls_set:
            new_urls_deduped.append(url)
    
    return new_urls_deduped

In [13]:
# load the July 11 individual Craigslist car posting URLs from the CSV
with open('data/july_11_craigslist_urls.csv', 'r') as file:
    # there is a 0 saved as the first value of the csv that I don't want
    july_11_urls = [url.strip() for url in file][1:]
july_11_urls[:5]

['https://sfbay.craigslist.org/eby/ctd/d/livermore-2015-subaru-impreza-awd-sedan/7157444766.html',
 'https://sfbay.craigslist.org/eby/cto/d/hayward-2005-honda-pilot-ex-4wd/7157444757.html',
 'https://sfbay.craigslist.org/eby/ctd/d/fremont-2006-mercedes-benz-clk-class/7157444754.html',
 'https://sfbay.craigslist.org/sfc/ctd/d/san-jose-2001-honda-civic-lx-all-great/7157444565.html',
 'https://sfbay.craigslist.org/eby/ctd/d/fremont-2008-mercedes-benz-clk-class/7157444545.html']

In [15]:
# remove duplicate links scraped from July 11th that were already in my july 10th urls
july_11_urls_deduped = remove_duplicate_urls(july_11_urls, july_10_urls)
len(july_11_urls_deduped)

2537
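`remove_duplicate_urls` only drops URLs that appear in the older list, so a URL that shows up twice within the same day's scrape would still be scraped twice. Here is a sketch of a variant that also removes repeats inside the new list while keeping the original order. The name `dedupe_urls` is hypothetical, and the short strings are placeholders for real URLs.

```python
def dedupe_urls(new_urls, old_urls):
    """Return new_urls without anything in old_urls and without internal repeats."""
    seen = set(old_urls)
    unique = []
    for url in new_urls:
        if url not in seen:
            seen.add(url)  # remember it so later repeats are skipped
            unique.append(url)
    return unique

old = ["post-b"]
new = ["post-a", "post-b", "post-a", "post-c"]
print(dedupe_urls(new, old))
```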

In [24]:
# Use my main function to generate car attribute dictionaries for each Craigslist post URL 
july_11_info = []

for url in july_11_urls_deduped:
    july_11_info.append(get_car_info(url))
    time.sleep(0.5)

https://sfbay.craigslist.org/eby/ctd/d/stockton-2012-mercedes-benz-ml350-ml/7156914606.html is not okay


In [None]:
# Remove any values where the URL didn't work or the data is None rather than a dictionary
cleaned_july_11_info = [data for data in july_11_info if data is not None]
cleaned_july_11_info

In [26]:
# Generate a DataFrame for the July 11, 2020 data
july_11_df = pd.DataFrame(cleaned_july_11_info)
july_11_df.head()

| | VIN | condition | cylinders | drive | fuel | odometer | paint color | title status | transmission | type | year | make | model | post_title | price | location | posting_date | url | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JF1GJAA63FH008498 | excellent | 4 cylinders | 4wd | gas | 99245 | blue | clean | automatic | sedan | 2015.0 | Subaru | Impreza sedan | 2015 Subaru Impreza AWD Sedan. Clean, Runs exc... | $8800 | dublin / pleasanton / livermore | 2020-07-11 09:05 | https://sfbay.craigslist.org/eby/ctd/d/livermo... | |
| 1 | | | | 4wd | gas | 210000 | | clean | automatic | | | | | 2005 Honda Pilot EX 4WD | $4200 | hayward / castro valley | 2020-07-11 09:05 | https://sfbay.craigslist.org/eby/cto/d/hayward... | |
| 2 | WDBTK76GX6T068610 | excellent | | rwd | gas | 141303 | silver | clean | automatic | convertible | 2006.0 | Mercedes-Benz | CLK-Class | 2006 Mercedes-Benz CLK-Class CLK 55 AMG Cabrio... | $9999.00 | \*2006\* \*Mercedes-Benz\* \*CLK-Class\* \*CLK\* \*55\* ... | 2020-07-11 09:05 | https://sfbay.craigslist.org/eby/ctd/d/fremont... | |
| 3 | 2HGES16501H559491 | excellent | 4 cylinders | fwd | gas | 148 | white | salvage | automatic | sedan | 2001.0 | honda | civic LX | 2001 Honda Civic LX All Great Tires Clean/Runs... | $2650 | | 2020-07-11 09:05 | https://sfbay.craigslist.org/sfc/ctd/d/san-jos... | compact |
| 4 | WDBTJ56H38F256449 | excellent | | rwd | gas | 133849 | silver | clean | automatic | coupe | 2008.0 | Mercedes-Benz | CLK-Class | 2008 Mercedes-Benz CLK-Class CLK 350 Coupe 2D ... | $6999.00 | \*2008\* \*Mercedes-Benz\* \*CLK-Class\* \*CLK\* \*350\*... | 2020-07-11 09:05 | https://sfbay.craigslist.org/eby/ctd/d/fremont... | |


In [27]:
# Save the July 11 data as a pickle
with open('data/july_11.pickle', 'wb') as to_write:
    pickle.dump(july_11_df, to_write)
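
For the later analysis, the pickle can be read back with `pickle.load` on a file opened in `'rb'` mode. The sketch below is a minimal round trip under stated assumptions: it uses a plain list of dictionaries as a stand-in for `july_11_df` and a temporary directory instead of `data/`, so it runs without pandas or the scraped files.

```python
import os
import pickle
import tempfile

# Stand-in records; the notebook pickles a DataFrame instead
records = [
    {"make": "Subaru", "price": "$8800"},
    {"make": "honda", "price": "$2650"},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pickle")

    # Write in binary mode, as in the cell above
    with open(demo_path, "wb") as to_write:
        pickle.dump(records, to_write)

    # Read back in binary mode
    with open(demo_path, "rb") as to_read:
        restored = pickle.load(to_read)

print(restored == records)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.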

In [15]:
# Grab individual Craigslist car post URLs for July 12
july_12_result_urls = get_individual_car_urls(search_craigslist_urls)
july_12_result_urls[:5]

['https://sfbay.craigslist.org/nby/ctd/d/santa-rosa-imca-dirt-modified-harris/7158004149.html',
 'https://sfbay.craigslist.org/sby/ctd/d/fremont-2019-subaru-forester-suv-acura/7158003421.html',
 'https://sfbay.craigslist.org/nby/cto/d/rohnert-park-2003-ford-expedition-eddie/7158003322.html',
 'https://sfbay.craigslist.org/sby/cto/d/palo-alto-2009-honda-fit-sport/7158002840.html',
 'https://sfbay.craigslist.org/eby/cto/d/livermore-2008-dodge-ram-mega-can/7158002379.html']

In [16]:
# Turn the URLs into CSV
july_12_result_urls_df = pd.DataFrame(july_12_result_urls)
july_12_result_urls_df.to_csv('data/july_12_craigslist_urls.csv', index=False)
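
Because `to_csv` writes the column label `0` as a header row, the loading cells have to slice it off with `[1:]`. One way to avoid that is to pass `header=False` to `to_csv`. Another, sketched below with only the standard library, is to write one URL per row with no header. The helpers `save_urls` and `load_urls` and the example URLs are hypothetical, introduced only for illustration.

```python
import csv
import os
import tempfile


def save_urls(urls, path):
    """Write one URL per row with no header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url])


def load_urls(path):
    """Read the URLs back; no [1:] slice is needed because no header was written."""
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]


example_urls = [
    "https://example.org/post/1.html",
    "https://example.org/post/2.html",
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "urls.csv")
    save_urls(example_urls, demo_path)
    reloaded_urls = load_urls(demo_path)

print(reloaded_urls == example_urls)
```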

In [18]:
# load the July 12 individual Craigslist car posting URLs
with open('data/july_12_craigslist_urls.csv', 'r') as file:
    # the first line of the csv is a 0 (the default column header written by to_csv) that I don't want
    july_12_urls = [url.strip() for url in file][1:]
july_12_urls[:5]

['https://sfbay.craigslist.org/nby/ctd/d/santa-rosa-imca-dirt-modified-harris/7158004149.html',
 'https://sfbay.craigslist.org/sby/ctd/d/fremont-2019-subaru-forester-suv-acura/7158003421.html',
 'https://sfbay.craigslist.org/nby/cto/d/rohnert-park-2003-ford-expedition-eddie/7158003322.html',
 'https://sfbay.craigslist.org/sby/cto/d/palo-alto-2009-honda-fit-sport/7158002840.html',
 'https://sfbay.craigslist.org/eby/cto/d/livermore-2008-dodge-ram-mega-can/7158002379.html']

In [21]:
# remove duplicate links scraped on July 12th that were already in my July 11th URLs
july_12_urls_deduped = remove_duplicate_urls(july_12_urls, july_11_urls)
len(july_12_urls_deduped)

1909

In [22]:
# Use my main function to generate car attribute dictionaries for each Craigslist post URL 
july_12_info = []

for url in july_12_urls_deduped:
    july_12_info.append(get_car_info(url))
    time.sleep(0.5)

In [None]:
# Remove any values where the URL didn't work, so the data is None rather than a dictionary
cleaned_july_12_info = [data for data in july_12_info if data is not None]
cleaned_july_12_info

In [24]:
# Generate a DataFrame for the July 12, 2020 data
july_12_df = pd.DataFrame(cleaned_july_12_info)
july_12_df.head()

Unnamed: 0,fuel,title status,transmission,year,make,model,post_title,price,location,posting_date,url,VIN,odometer,type,condition,cylinders,drive,paint color,size
0,other,clean,other,2017,dirt,modified,Imca Dirt Modified Harris Chassis,$9000,santa rosa,2020-07-12 09:25,https://sfbay.craigslist.org/nby/ctd/d/santa-r...,,,,,,,,
1,gas,clean,automatic,2019,Subaru,Forester,2019 Subaru Forester SUV ( Acura of Fremont : ...,$29458.00,google map,2020-07-12 09:24,https://sfbay.craigslist.org/sby/ctd/d/fremont...,JF2SKAEC1KH442013,17691.0,SUV,,,,,
2,gas,clean,automatic,2003,ford,expedition,2003 Ford Expedition Eddie Bauer,$4200,rohnert pk / cotati,2020-07-12 09:24,https://sfbay.craigslist.org/nby/cto/d/rohnert...,,152600.0,SUV,good,8 cylinders,4wd,black,full-size
3,gas,clean,manual,2009,Honda,Fit Sport,2009 Honda Fit Sport,$7300,Palo Alto,2020-07-12 09:23,https://sfbay.craigslist.org/sby/cto/d/palo-al...,JHMGE87419S063728,61000.0,hatchback,excellent,4 cylinders,fwd,,
4,gas,clean,automatic,2008,dodge,ram mega cab,2008 Dodge Ram mega cab,$20000,Livermore,2020-07-12 09:22,https://sfbay.craigslist.org/eby/cto/d/livermo...,,109600.0,truck,good,8 cylinders,4wd,white,full-size


In [25]:
# Save the July 12 data as a pickle
with open('data/july_12.pickle', 'wb') as to_write:
    pickle.dump(july_12_df, to_write)