# Putting it all together: A complete web scraping example

In this example, we will scrape the car listings on Craigslist in New York and store them in a CSV file with three columns: Listing Title, Location, and Price.

Let's begin by capturing the three pieces of data from the first page: https://newyork.craigslist.org/search/cta#search=1~gallery~0~0. We create a list for each listing and collect all the listings on the page in a single list (a list of lists).

Note that the cars may appear in a different order in the list than they do in your browser.

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url='https://newyork.craigslist.org/search/cta#search=1~gallery~0~0'

html = urlopen(url)
bs = BeautifulSoup(html.read(),'html.parser')
cars=bs.find_all('li',{'class':'cl-static-search-result'})

scrapedCarsList=[]
for car in cars:
    salesTitle=car.find('div',{'class':'title'})
    price=car.find('div',{'class':'price'})
    location=car.find('div',{'class':'location'})
    # Some listings do not have a price; skip them.
    if price is not None:
        new_car=[salesTitle.get_text().strip(),location.get_text().strip(),price.get_text().strip()]
        # print(new_car)  # uncomment to see each car on the first page
        scrapedCarsList.append(new_car)
len(scrapedCarsList)

344
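The selector logic can also be tried without any network access. The following is a minimal sketch that runs the same `find_all` and `find` calls on a small hand-written HTML fragment. The fragment and the `parse_cars` helper name are made up for illustration; they only assume the class names used in the cell above, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the structure assumed above.
SAMPLE_HTML = """
<ol>
  <li class="cl-static-search-result">
    <div class="title"> 2015 Honda Civic </div>
    <div class="price">$9,500</div>
    <div class="location"> Brooklyn </div>
  </li>
  <li class="cl-static-search-result">
    <div class="title">2009 Toyota Corolla</div>
    <div class="location">Queens</div>
  </li>
</ol>
"""

def parse_cars(html):
    """Return [title, location, price] for every listing that has all three."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for car in soup.find_all('li', {'class': 'cl-static-search-result'}):
        title = car.find('div', {'class': 'title'})
        price = car.find('div', {'class': 'price'})
        location = car.find('div', {'class': 'location'})
        # find() returns None when a tag is missing, so skip such listings.
        if title is None or price is None or location is None:
            continue
        rows.append([title.get_text().strip(),
                     location.get_text().strip(),
                     price.get_text().strip()])
    return rows

print(parse_cars(SAMPLE_HTML))
```

The second listing in the fragment has no price, so only the first one is returned. Checking the title and location for `None` as well guards against an `AttributeError` if a listing is missing either of them.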

Now let's revise the code to write the results of the first page to a CSV file named `craigslist_cars.csv`. We first create the file with the column titles and then append the data, as shown below:

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

with open('craigslist_cars.csv', 'w',newline='',encoding='utf-8') as myFile:
    writer = csv.writer(myFile)
    writer.writerow(["Listing Title", "Location", "Price"])

url='https://newyork.craigslist.org/search/cta?s=0'
html = urlopen(url)
bs = BeautifulSoup(html.read(),'html.parser')
cars=bs.find_all('li',{'class':'cl-static-search-result'})

scrapedCarsList=[]
for car in cars:
    salesTitle=car.find('div',{'class':'title'})
    price=car.find('div',{'class':'price'})
    location=car.find('div',{'class':'location'})
    # Some listings do not have a price; skip them.
    if price is not None:
        new_car=[salesTitle.get_text().strip(),location.get_text().strip(),price.get_text().strip()]
        scrapedCarsList.append(new_car)

with open('craigslist_cars.csv', 'a',newline='',encoding='utf-8') as myFile:
    writer = csv.writer(myFile)
    writer.writerows(scrapedCarsList)
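The header-then-append pattern can be checked in isolation. Below is a minimal sketch that writes a header, appends a few made-up rows (standing in for `scrapedCarsList`), and reads the file back. The file name `cars_demo.csv` and the sample rows are assumptions for illustration only.

```python
import csv
import os
import tempfile

# Made-up rows standing in for scrapedCarsList.
sample_rows = [
    ['2015 Honda Civic', 'Brooklyn', '$9,500'],
    ['2012 Ford Focus, low miles', 'Queens', '$4,200'],
]

# Write to a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cars_demo.csv')

    # Step 1: create the file and write the header row.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(['Listing Title', 'Location', 'Price'])

    # Step 2: reopen in append mode and add the data rows.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(sample_rows)

    # Read the file back to confirm that the header and rows round-trip.
    with open(path, newline='', encoding='utf-8') as f:
        rows_read = list(csv.reader(f))

print(rows_read)
```

Values that contain commas, such as prices or titles, are quoted by `csv.writer` and restored by `csv.reader`, which is one reason to prefer the `csv` module over joining strings by hand.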

How do we visit the second page, search another city, or add filters such as price or model year? All we need to do is understand how these changes are reflected in the URL. Once we understand the structure of the URL, we can build a list of all the URLs to scrape.
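As a rough illustration of building such URLs programmatically, the sketch below uses `urllib.parse.urlencode` to attach a query string to a base URL. The helper `build_search_url` is hypothetical, the parameter names `min_price` and `max_price` are placeholders rather than confirmed site parameters, and the assumption that other cities follow the same subdomain pattern should be verified. Apply a filter in the browser and read the address bar to find the names the site actually uses.

```python
from urllib.parse import urlencode

def build_search_url(city, filters):
    """Build a search URL for a city subdomain plus an optional query string."""
    base = 'https://{}.craigslist.org/search/cta'.format(city)
    if not filters:
        return base
    # urlencode escapes special characters and joins pairs with '&'.
    return base + '?' + urlencode(filters)

# The parameter names below are placeholders for illustration.
print(build_search_url('newyork', {'min_price': 5000, 'max_price': 15000}))
```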

Here is how we can create the list of URLs for the first 60 pages of listings.

In [6]:
urlList=[]
for i in range(60):
    newURL= 'https://newyork.craigslist.org/search/cta#search=1~gallery~{}~0'.format(str(i))
    urlList.append(newURL)
urlList[0:5]

['https://newyork.craigslist.org/search/cta#search=1~gallery~0~0',
 'https://newyork.craigslist.org/search/cta#search=1~gallery~1~0',
 'https://newyork.craigslist.org/search/cta#search=1~gallery~2~0',
 'https://newyork.craigslist.org/search/cta#search=1~gallery~3~0',
 'https://newyork.craigslist.org/search/cta#search=1~gallery~4~0']
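When working with URL patterns like these, it helps to check which component of the URL actually changes from page to page. The sketch below splits one of the URLs above with `urllib.parse.urlsplit`. Here the page counter sits in the fragment (the part after `#`). A fragment is handled by the client and is not transmitted to the server in the HTTP request, so it is worth verifying that each URL in the list really returns a different set of listings.

```python
from urllib.parse import urlsplit

url = 'https://newyork.craigslist.org/search/cta#search=1~gallery~3~0'
parts = urlsplit(url)

# Print each component separately to see where the page counter lives.
print('host:    ', parts.netloc)
print('path:    ', parts.path)
print('query:   ', repr(parts.query))
print('fragment:', parts.fragment)
```

For this URL the query string is empty and the counter appears only in the fragment.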

The next step is to convert the scraping script into a function so that we can call it on different pages. The `craigslistCarsScrape` function takes the page number (0, 1, 2, ...) as input and returns all the cars on that page as a list of lists.

In [7]:
def craigslistCarsScrape(pageNumber):
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    print('*** Scraping cars on page:',pageNumber,'***\n\n')

    url= 'https://newyork.craigslist.org/search/cta#search=1~gallery~{}~0'.format(str(pageNumber))
    html = urlopen(url)
    bs = BeautifulSoup(html.read(),'html.parser')
    cars=bs.find_all('li',{'class':'cl-static-search-result'})
    scrapedCarsList=[]            
    for car in cars:
        salesTitle=car.find('div',{'class':'title'})
        price=car.find('div',{'class':'price'})
        location=car.find('div',{'class':'location'})
        if price is not None:
            new_car=[salesTitle.get_text().strip(),location.get_text().strip(),price.get_text().strip()]
            scrapedCarsList.append(new_car)
    return scrapedCarsList

To make the code more robust, let's add exception handling. The new function is called `craigslistCarsScrapeWithExceptions`.

In [13]:
def craigslistCarsScrapeWithExceptions(pageNumber):
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
    from bs4 import BeautifulSoup

    url= 'https://newyork.craigslist.org/search/cta#search=1~gallery~{}~0'.format(str(pageNumber))
    print('*** Scraping cars on page: {} ({}) ***'.format(pageNumber,url))
    
    
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        print('-----------------------HTTPError----------------------')
        return None
    except URLError as e:
        print('Server could not be found')
        print('-----------------------URLError----------------------')
        return None
    
    bs = BeautifulSoup(html.read(),'html.parser')
    
    try:        
        cars=bs.find_all('li',{'class':'cl-static-search-result'})
    except AttributeError as e:
        print('Tag was not found')
        print('-----------------------AttributeError----------------------')
        return None
    else:
        if len(cars) == 0:
            print ('Page has no cars')
            print('---------------------No cars on the page------------------------')
            return None
        else:
            scrapedCarsList=[]           
            for car in cars:
                salesTitle=car.find('div',{'class':'title'})
                price=car.find('div',{'class':'price'})
                location=car.find('div',{'class':'location'})
                if price is not None:
                    new_car=[salesTitle.get_text().strip(),location.get_text().strip(),price.get_text().strip()]
                    scrapedCarsList.append(new_car)            
            return scrapedCarsList
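One detail of the `try`/`except` block above is the order of the clauses: `HTTPError` is a subclass of `URLError`, so the `HTTPError` clause has to come first, otherwise the `URLError` clause would catch both. The following is a minimal sketch of that behavior. The helper `describe_error` is hypothetical, and the exceptions are constructed by hand rather than raised by a real request.

```python
from urllib.error import HTTPError, URLError

def describe_error(exc):
    """Classify an exception using the same clause order as the function above."""
    try:
        raise exc
    except HTTPError as e:
        # The server answered, but with an error status code.
        return 'HTTPError {}'.format(e.code)
    except URLError as e:
        # The server could not be reached at all.
        return 'URLError: {}'.format(e.reason)

print(issubclass(HTTPError, URLError))

# Hand-built exceptions for illustration; no request is made.
http_err = HTTPError('https://example.com', 404, 'Not Found', None, None)
print(describe_error(http_err))
print(describe_error(URLError('name resolution failed')))
```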

Now we can run the function in a loop and write the results to a CSV file:

In [14]:
import csv

with open('craigslist_cars_final.csv', 'w',newline='',encoding='utf-8') as myFile:
    writer = csv.writer(myFile)
    writer.writerow(["Listing Title", "Location", "Price"])

with open('craigslist_cars_final.csv', 'a',newline='',encoding='utf-8') as myFile:
    writer = csv.writer(myFile)
    for i in range(0,60):
        scrapedCarsList=craigslistCarsScrapeWithExceptions(i)
        # The function returns None on errors or empty pages; skip those.
        if scrapedCarsList:
            writer.writerows(scrapedCarsList)

print('''
____________________¶¶¶¶¶¶¶¶¶¶¶
___________________¶¶__________¶¶
______¶¶¶________¶¶______________¶¶
_____¶___¶______¶¶________________¶¶
_____¶____¶____¶¶____¶¶______¶¶____¶¶
_____¶____¶___¶¶____¶__¶____¶__¶____¶¶
_____¶____¶__¶¶_____¶__¶____¶__¶_____¶¶
____¶¶___ ¶___________________________¶¶
____¶____¶¶¶¶¶¶_______________________¶¶
___¶¶_________¶_¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶¶__¶¶
__¶¶____¶¶¶¶¶¶¶__¶¶___¶__¶__¶____¶¶___¶¶
__¶____________¶__¶¶__¶__¶__¶___¶¶___¶¶
__¶_____¶¶¶¶¶¶¶____¶¶_¶__¶__¶__¶¶___¶¶
__¶____________¶____¶¶¶__¶__¶_¶¶___¶¶
__¶_____¶¶¶¶¶¶¶_¶¶___¶¶¶¶¶¶¶¶¶¶___¶¶
__¶¶__________¶___¶¶_____________¶¶
____¶¶¶¶¶¶¶¶¶¶______¶¶_________¶¶
______________________¶¶¶¶¶¶¶¶¶
''')

*** Scraping cars on page: 0 (https://newyork.craigslist.org/search/cta#search=1~gallery~0~0) ***
*** Scraping cars on page: 1 (https://newyork.craigslist.org/search/cta#search=1~gallery~1~0) ***
*** Scraping cars on page: 2 (https://newyork.craigslist.org/search/cta#search=1~gallery~2~0) ***
*** Scraping cars on page: 3 (https://newyork.craigslist.org/search/cta#search=1~gallery~3~0) ***
*** Scraping cars on page: 4 (https://newyork.craigslist.org/search/cta#search=1~gallery~4~0) ***
*** Scraping cars on page: 5 (https://newyork.craigslist.org/search/cta#search=1~gallery~5~0) ***
*** Scraping cars on page: 6 (https://newyork.craigslist.org/search/cta#search=1~gallery~6~0) ***
*** Scraping cars on page: 7 (https://newyork.craigslist.org/search/cta#search=1~gallery~7~0) ***
*** Scraping cars on page: 8 (https://newyork.craigslist.org/search/cta#search=1~gallery~8~0) ***
*** Scraping cars on page: 9 (https://newyork.craigslist.org/search/cta#search=1~gallery~9~0) ***
*** Scraping cars on