## Craigslist web scraping

In [2]:
from bs4 import BeautifulSoup #For parsing HTML
import requests #For HTTP requests
import time #For sleep
import re #For regular expressions
import os #For saving files

#### Part 1: Scraping and Saving HTML Content
##### Start by navigating to the “free” section on the Craigslist San Francisco Bay Area site (https://sfbay.craigslist.org/search/zip). This page lists items that people are giving away for free.

In [3]:
url1='https://sfbay.craigslist.org/search/zip'
time.sleep(10)
headers={'User-Agent':'Mozilla/5.0'}
#Pass headers by keyword; the second positional argument of requests.get is params
pageurl=requests.get(url1,headers=headers)
soup=BeautifulSoup(pageurl.content,'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="SF bay area free stuff - craigslist" property="og:title"/>
  <meta content="SF bay area free stuff - craigslist" name="description"/>
  <meta content="SF bay area free stuff - craigslist" property="og:description"/>
  <meta content="https://sfbay.craigslist.org/search/zip" property="og:url"/>
  <title>
   SF bay area free stuff - craigslist
  </title>
  <link href="https://sfbay.craigslist.org/search/zip" rel="canonical"/>
  <link href="https://sfbay.craigslist.org/search/zip" hreflang="x-default" rel="alternate"/>
  <link href="/favicon.ico" id="favicon" rel="icon">
   <script id="ld_searchpage_data" type="application/ld+json">
    {"description":"Free Stuff in SF Bay Area","@type":"Se
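In `requests.get`, the second positional parameter is `params`, so a header dictionary only reaches the server as headers when it is passed with `headers=`. The sketch below builds two requests without sending them (no network access is needed) to show where the dictionary ends up in each case. The variable names `as_params` and `as_headers` are illustrative only.

```python
import requests

url = "https://sfbay.craigslist.org/search/zip"
headers = {"User-Agent": "Mozilla/5.0"}

# Build (but do not send) two requests to compare how the dict is used.
as_params = requests.Request("GET", url, params=headers).prepare()
as_headers = requests.Request("GET", url, headers=headers).prepare()

# Passed as params, the dict is appended to the query string.
print(as_params.url)

# Passed as headers, the URL is unchanged and the header is set.
print(as_headers.url)
print(as_headers.headers["User-Agent"])
```

Printing the prepared URL is a quick way to check what a call would actually request before adding it to a scraping loop.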

##### Changing the sorting order to “oldest” first by interacting with the page’s UI and pagination

In [None]:
n = input("Please enter the page number you want to go to in Craigslist:")
pagination_cnt=int(n)-1

In [None]:
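#Note: everything after '#' is a URL fragment; it is handled by the browser and is not sent to the server
url=url1+'#search=1~gallery~'+str(pagination_cnt)+'~0'
time.sleep(2)
pageurl=requests.get(url,headers=headers)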
soup=BeautifulSoup(pageurl.content,'html.parser')
print(soup.prettify())
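One thing to keep in mind with this URL: the part after `#` is a fragment, which browsers use client-side and HTTP clients do not send to the server. The request above therefore likely returns the same HTML as the plain search URL. The following sketch uses only the standard library to split the URL; `page_index` is a hypothetical stand-in for `pagination_cnt`.

```python
from urllib.parse import urldefrag, urlsplit

base = "https://sfbay.craigslist.org/search/zip"
page_index = 2  # hypothetical value standing in for pagination_cnt
url = base + "#search=1~gallery~" + str(page_index) + "~0"

# urlsplit separates the path from the client-side fragment.
parts = urlsplit(url)
print(parts.path)
print(parts.fragment)

# urldefrag returns the URL without its fragment, plus the fragment itself.
requested, fragment = urldefrag(url)
print(requested)
```

Whether the site exposes a server-side parameter for pagination is not established in this notebook, so the sketch only shows what the fragment-based URL reduces to.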

##### Fetch Listing URLs:
Using `requests` to access the first page of the “free” section, ordered “newest” first.

Deploying `BeautifulSoup` to parse the HTML content.

In [64]:
url='https://sfbay.craigslist.org/search/zip?sort=date'
#Setting the URL and requesting access to the first page of "free", ordered "newest"
pageurl=requests.get(url,headers=headers)
time.sleep(10)
#BeautifulSoup to parse HTML
soup=BeautifulSoup(pageurl.content,'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="SF bay area free stuff - craigslist" property="og:title"/>
  <meta content="SF bay area free stuff - craigslist" name="description"/>
  <meta content="SF bay area free stuff - craigslist" property="og:description"/>
  <meta content="https://sfbay.craigslist.org/search/zip" property="og:url"/>
  <title>
   SF bay area free stuff - craigslist
  </title>
  <link href="https://sfbay.craigslist.org/search/zip" rel="canonical"/>
  <link href="https://sfbay.craigslist.org/search/zip" hreflang="x-default" rel="alternate"/>
  <link href="/favicon.ico" id="favicon" rel="icon">
   <script id="ld_searchpage_data" type="application/ld+json">
    {"description":"Free Stuff in SF Bay Area","@context":

##### Identify the structure that holds the links to individual listing pages. What selector do you choose to grab the link?
Using the `li` tag with `class="cl-static-search-result"`, I found the links to the individual listing pages.

In [None]:
#Identify the structure that holds the links to individual listing pages.
n= soup.find_all("li",{"class":"cl-static-search-result"})
#Capture each href value; [^"]* stops at the closing quote instead of matching greedily
listings=re.findall(r'<a\shref="([^"]*)">',str(n))
print("The listings links are as follows:")
m=0
for i in listings:
    m=m+1
    print('Listing Number '+str(m)+' :- '+i)
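As an alternative to running a regular expression over the stringified tags, the same `li.cl-static-search-result` structure can be queried with a CSS selector and the `href` read as an attribute. This is a minimal sketch: `sample_html` is a hypothetical snippet shaped like the search results in this notebook, not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure of the search results.
sample_html = """
<ol>
  <li class="cl-static-search-result">
    <a href="https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-sofa/7710855801.html">
      <div class="title">Free Sofa</div>
    </a>
  </li>
  <li class="cl-static-search-result">
    <a href="https://sfbay.craigslist.org/sby/zip/d/campbell-chicken-coop/7708721456.html">
      <div class="title">Chicken coop</div>
    </a>
  </li>
</ol>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Select every <a> with an href inside a result <li>, then read the attribute.
links = [a["href"] for a in sample_soup.select("li.cl-static-search-result a[href]")]
for number, link in enumerate(links, start=1):
    print("Listing Number", number, ":-", link)
```

Reading the attribute directly avoids depending on how the tag happens to be serialized, which is what the regular expression relies on.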

##### Alternative approach
Another way is to collect every `a` tag and apply a regular expression to its `href`, since all the links for listings point to other Craigslist pages.

In [118]:
n2=soup.find_all('a')
n2[0:4]

[<a href="#" id="cl-unrecoverable-hard-refresh" onclick="location.reload(true);">refresh the page.</a>,
 <a href="/">craigslist</a>,
 <a href="https://sfbay.craigslist.org/nby/zip/d/santa-rosa-boulders-need-excivator-and/7709347969.html">
 <div class="title">Boulders - Need excivator and trailer to pick up</div>
 <div class="details">
 <div class="price">$0</div>
 <div class="location">
                         Santa Rosa
                     </div>
 </div>
 </a>,
 <a href="https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-rei-camping-pad/7707002091.html">
 <div class="title">Free REI Camping Pad</div>
 <div class="details">
 <div class="price">$0</div>
 <div class="location">
                         San Francisco
                     </div>
 </div>
 </a>]


###### Extract the first 250 unique listing URLs and save them to a list.

In [140]:
#Finding all the listing links using the <a> tag and regular expressions
listings2=re.findall(r'<a\shref="(https://sfbay\.craigslist[^"]*)">',str(n2))
len(listings2)

360
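The heading asks for unique URLs, while slicing a list keeps any duplicates it contains. A minimal sketch of order-preserving de-duplication before the slice is shown below; `first_unique` is a hypothetical helper name, and the sample list is made up for illustration.

```python
def first_unique(urls, limit=250):
    """Return the first `limit` distinct URLs, keeping their original order."""
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(urls))[:limit]

# Made-up sample with one repeated link.
sample_links = [
    "https://sfbay.craigslist.org/eby/zip/d/oakland-jars/7714871749.html",
    "https://sfbay.craigslist.org/eby/zip/d/alameda-toilet/7711844119.html",
    "https://sfbay.craigslist.org/eby/zip/d/oakland-jars/7714871749.html",
]
unique_links = first_unique(sample_links)
print(len(unique_links))
```

Applied to `listings2`, this would guarantee that the 250 saved pages correspond to 250 different listings.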

In [147]:
#Storing the first 250 listings
listings2=listings2[:250]
#Printing the first 250 listings links
print("The listings links are as follows:\n")
m=0
for i in listings2:
    m=m+1
    print('Listing Number '+str(m)+' :- '+i)

The listings links are as follows:

Listing Number 1 :- https://sfbay.craigslist.org/nby/zip/d/santa-rosa-boulders-need-excivator-and/7709347969.html
Listing Number 2 :- https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-rei-camping-pad/7707002091.html
Listing Number 3 :- https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-sofa/7710855801.html
Listing Number 4 :- https://sfbay.craigslist.org/eby/zip/d/moraga-free-litter-robot-and-assorted/7714908972.html
Listing Number 5 :- https://sfbay.craigslist.org/sby/zip/d/campbell-chicken-coop/7708721456.html
Listing Number 6 :- https://sfbay.craigslist.org/eby/zip/d/oakland-gold-crushed-velvet-sofa-chaise/7714908158.html
Listing Number 7 :- https://sfbay.craigslist.org/sby/zip/d/san-jose-queen-bed-mattress-not-included/7714906223.html
Listing Number 8 :- https://sfbay.craigslist.org/sfc/zip/d/san-francisco-pineapple-party-decor/7713532695.html
Listing Number 9 :- https://sfbay.craigslist.org/eby/zip/d/oakland-lots-of-art-frames/771

##### Saving HTML Pages:
For each of the 250 listing URLs, use `requests` to fetch the listing page.

Save each HTML content to a separate file on disk.  Use each listing’s ID to organize files in a way that makes them easily identifiable (e.g., save listing ID 7713901653 to file “7713901653.html”).

In [404]:
directory = os.path.join(os.path.expanduser('~'), 'Downloads', 'ddr_nemo')
if not os.path.exists(directory):
    os.makedirs(directory)
else:
    print('Good to go!')

Good to go!


In [168]:
listings02=listings2[:250]
m=0
for link in listings02:
    listing_url=link
    m+=1
    print(m)
    print(listing_url)
    l_id=re.findall(r'/([0-9]+)\.html',str(listing_url)) #Escape the dot so it matches a literal "."
    print(l_id)
    time.sleep(20)
    pageurl_listing = requests.get(listing_url, headers=headers)
    if pageurl_listing.status_code == 200:
        soup = BeautifulSoup(pageurl_listing.content, 'html.parser')
        filename = os.path.join(os.path.expanduser('~'), 'Downloads/ddr_nemo', l_id[0]+'.html')
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(soup.prettify())
    else:
        print("Failed!")


1
https://sfbay.craigslist.org/nby/zip/d/santa-rosa-boulders-need-excivator-and/7709347969.html
['7709347969']
2
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-rei-camping-pad/7707002091.html
['7707002091']
3
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-sofa/7710855801.html
['7710855801']
4
https://sfbay.craigslist.org/eby/zip/d/moraga-free-litter-robot-and-assorted/7714908972.html
['7714908972']
5
https://sfbay.craigslist.org/sby/zip/d/campbell-chicken-coop/7708721456.html
['7708721456']
6
https://sfbay.craigslist.org/eby/zip/d/oakland-gold-crushed-velvet-sofa-chaise/7714908158.html
['7714908158']
7
https://sfbay.craigslist.org/sby/zip/d/san-jose-queen-bed-mattress-not-included/7714906223.html
['7714906223']
8
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-pineapple-party-decor/7713532695.html
['7713532695']
9
https://sfbay.craigslist.org/eby/zip/d/oakland-lots-of-art-frames/7714905210.html
['7714905210']
10
https://sfbay.craigslist.org/pen/zip/d/burlinga

79
https://sfbay.craigslist.org/pen/zip/d/palo-alto-kitchen-cabinet-doors-free/7714874254.html
['7714874254']
80
https://sfbay.craigslist.org/eby/zip/d/antioch-punch-bowl-set/7714873880.html
['7714873880']
81
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-black-california-king/7714872947.html
['7714872947']
82
https://sfbay.craigslist.org/eby/zip/d/oakland-jars/7714871749.html
['7714871749']
83
https://sfbay.craigslist.org/sby/zip/d/san-jose-free-pink-gaming-study-desk/7714870304.html
['7714870304']
84
https://sfbay.craigslist.org/scz/zip/d/santa-cruz-free-shaped-mid-century/7714870090.html
['7714870090']
85
https://sfbay.craigslist.org/pen/zip/d/mountain-view-cordless-dremel-tool-106v/7714868942.html
['7714868942']
86
https://sfbay.craigslist.org/eby/zip/d/berkeley-free-moving-boxes/7714867966.html
['7714867966']
87
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-vintage-doorbell/7714867783.html
['7714867783']
88
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-vintage-

157
https://sfbay.craigslist.org/eby/zip/d/emeryville-free-brand-new-art-print/7714826362.html
['7714826362']
158
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-pizza-prep-table/7714825659.html
['7714825659']
159
https://sfbay.craigslist.org/eby/zip/d/richmond-free-fill-dirt-dry-too/7711279189.html
['7711279189']
160
https://sfbay.craigslist.org/eby/zip/d/berkeley-packing-materials/7714823840.html
['7714823840']
161
https://sfbay.craigslist.org/eby/zip/d/san-jose-free-pallets-in-fremont/7714822466.html
['7714822466']
162
https://sfbay.craigslist.org/eby/zip/d/alameda-toilet/7711844119.html
['7711844119']
163
https://sfbay.craigslist.org/pen/zip/d/redwood-city-desk-free-dog-not-included/7714821262.html
['7714821262']
164
https://sfbay.craigslist.org/eby/zip/d/emeryville-free-new-zoya-nail-polish/7714821045.html
['7714821045']
165
https://sfbay.craigslist.org/nby/zip/d/san-rafael-free-bathroom-vanity/7711777371.html
['7711777371']
166
https://sfbay.craigslist.org/nby/zip/d/san-rafa

236
https://sfbay.craigslist.org/eby/zip/d/oakland-coffee-table/7714033870.html
['7714033870']
237
https://sfbay.craigslist.org/eby/zip/d/oakland-full-size-mattress/7714762612.html
['7714762612']
238
https://sfbay.craigslist.org/nby/zip/d/larkspur-double-bed-mattress-and-bread/7714771047.html
['7714771047']
239
https://sfbay.craigslist.org/eby/zip/d/san-leandro-free-coin-magnets-pieces/7714768144.html
['7714768144']
240
https://sfbay.craigslist.org/pen/zip/d/millbrae-kids-chalkboard-white-board/7708570461.html
['7708570461']
241
https://sfbay.craigslist.org/sby/zip/d/san-jose-out-of-order-tatung-cooker/7714764215.html
['7714764215']
242
https://sfbay.craigslist.org/pen/zip/d/palo-alto-free-easter-decorations/7714763829.html
['7714763829']
243
https://sfbay.craigslist.org/eby/zip/d/concord-navy-blue-couch-for-free/7714763271.html
['7714763271']
244
https://sfbay.craigslist.org/sby/zip/d/san-jose-food-items-past-best-by-dates/7713785941.html
['7713785941']
245
https://sfbay.craigslist.or
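The ID-based file naming used in the loop above can be isolated into small helpers, which makes it easier to handle a URL that does not end in a numeric ID. This is a sketch under assumptions: `listing_id` and `save_listing` are hypothetical names, and a temporary directory stands in for the `ddr_nemo` folder.

```python
import os
import re
import tempfile

def listing_id(url):
    """Return the numeric listing ID at the end of a listing URL, or None."""
    match = re.search(r"/(\d+)\.html$", url)
    return match.group(1) if match else None

def save_listing(html, url, directory):
    """Write html to <directory>/<listing id>.html and return the path."""
    post_id = listing_id(url)
    if post_id is None:
        return None  # skip URLs that do not look like listing pages
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, post_id + ".html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path

# Temporary directory used only for this demonstration.
demo_dir = tempfile.mkdtemp()
saved = save_listing(
    "<html></html>",
    "https://sfbay.craigslist.org/eby/zip/d/oakland-jars/7714871749.html",
    demo_dir,
)
print(os.path.basename(saved))
```

Returning `None` for an unexpected URL avoids the `IndexError` that indexing an empty `re.findall` result would raise.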

#### Part 2: Parsing and Displaying Information from Saved HTML

##### Reading Saved HTML Files:
Write a script that reads each of the saved HTML files from the disk.


In [12]:
directory = os.path.join(os.path.expanduser('~'), 'Downloads', 'ddr_nemo')
i=0
for filename in os.listdir(directory) :
    i+=1
    #Stop after the first directory entry so that only one file is previewed
    if i>=2:
        break
    else:
        if filename.endswith(".html"):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as file:
                html = file.read()
            print(html)


<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="carrier - free stuff - craigslist" property="og:title"/>
  <meta content="Email for address. Pick up today!" name="description"/>
  <meta content="Email for address. Pick up today!" property="og:description"/>
  <meta content="https://images.craigslist.org/00Q0Q_ch7Wf5xfUV3_0t20CI_600x450.jpg" property="og:image"/>
  <meta content="https://sfbay.craigslist.org/sby/zip/d/los-gatos-carrier/7714783862.html" property="og:url"/>
  <meta content="article" property="og:type"/>
  <meta content="unavailable_after: 2024-03-06T21:27:37Z" name="robots"/>
  <meta content="37.241700;-121.955400" name="geo.position"/>
  <meta content="37.241700, -121.955400" name="ICBM"/>
  <meta content="Los Gatos" n
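When reading the saved pages back, it helps to filter on the `.html` suffix and to use the same encoding that was used when writing (UTF-8 in Part 1). The sketch below uses a temporary directory as a stand-in for `ddr_nemo`, and the file contents are made up for illustration; it also shows what happens if the file is decoded as latin-1 instead.

```python
from pathlib import Path
import tempfile

# Temporary directory standing in for the ddr_nemo folder.
demo_dir = Path(tempfile.mkdtemp())
text = '<p>Shelves, 96"×16", don’t miss out</p>'  # made-up content
(demo_dir / "7714899254.html").write_text(text, encoding="utf-8")
(demo_dir / "notes.txt").write_text("not a listing", encoding="utf-8")

# Read every .html file, keyed by listing ID (the file name without suffix).
pages = {}
for path in sorted(demo_dir.glob("*.html")):
    pages[path.stem] = path.read_text(encoding="utf-8")

# Decoding the same bytes as latin-1 does not fail, but garbles non-ASCII text.
garbled = (demo_dir / "7714899254.html").read_text(encoding="latin-1")
print(pages["7714899254"] == text)
print(garbled == text)
```

Because latin-1 maps every byte to some character, a mismatch like this never raises an error; it only shows up later as odd characters in titles and descriptions.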

##### Extract Information:
For each HTML file, use `BeautifulSoup` to parse the file content.

Extract and print the following details:

Title: The title of the listing.

URL of first image (if an image exists): The URL of the displayed image. It can be found in the `src` attribute of `<img>`.

Description: The full description text of the listing.

Post ID: Usually found at the bottom of the page or within the page's HTML structure.

Posted Date: The date when the listing was originally posted.

Last Updated Date: The date when the listing was last updated.

In [206]:
directory = os.path.join(os.path.expanduser('~'), 'Downloads', 'ddr_nemo')
dir_list = os.listdir(directory)
for i in dir_list:
    filepath=os.path.join(directory,i)
    print(filepath)

/Users/mrimonguha/Downloads/ddr_nemo/7714783862.html
/Users/mrimonguha/Downloads/ddr_nemo/7710855801.html
/Users/mrimonguha/Downloads/ddr_nemo/7714842895.html
/Users/mrimonguha/Downloads/ddr_nemo/7714884073.html
/Users/mrimonguha/Downloads/ddr_nemo/7714768144.html
/Users/mrimonguha/Downloads/ddr_nemo/7714859349.html
/Users/mrimonguha/Downloads/ddr_nemo/7714863171.html
/Users/mrimonguha/Downloads/ddr_nemo/7714875230.html
/Users/mrimonguha/Downloads/ddr_nemo/7714847253.html
/Users/mrimonguha/Downloads/ddr_nemo/7714792137.html
/Users/mrimonguha/Downloads/ddr_nemo/7714890753.html
/Users/mrimonguha/Downloads/ddr_nemo/7714772590.html
/Users/mrimonguha/Downloads/ddr_nemo/7712902779.html
/Users/mrimonguha/Downloads/ddr_nemo/7710977171.html
/Users/mrimonguha/Downloads/ddr_nemo/7714881103.html
/Users/mrimonguha/Downloads/ddr_nemo/7714881416.html
/Users/mrimonguha/Downloads/ddr_nemo/7714903849.html
/Users/mrimonguha/Downloads/ddr_nemo/7714870090.html
/Users/mrimonguha/Downloads/ddr_nemo/771476237

In [313]:
directory = os.path.join(os.path.expanduser('~'), 'Downloads', 'ddr_nemo')
dir_list = os.listdir(directory)
cnt=0
for i in dir_list:
    cnt=cnt+1
    filepath=os.path.join(directory,i)
    #The files were written as UTF-8, so read them back as UTF-8; latin-1 garbles non-ASCII characters
    with open(filepath, 'r',encoding='utf-8') as file:
        html_content = file.read()
        #print(html_content)
        print("List Item Number:",cnt)
        soup = BeautifulSoup(html_content, 'html.parser')
        try:
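            ####       TITLE OF THE LISTING
            #title=soup.title.text
            #find returns None when the tag is missing, so check before reading .text
            title_tag=soup.find('span',id='titletextonly')
            t=title_tag.text.strip() if title_tag else ''
            if t:
                print("TITLE:",t)
            else:
                print("TITLE:Title not found")

            ####       URL OF IMAGE
            #select_one returns None when the page has no <img>, so check before calling .get
            img_tag=soup.select_one('img')
            img_url=img_tag.get('src') if img_tag else None
            if img_url:
                print("\nURL of first image:",img_url)
            else:
                print("Image:Image not found")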
    
            ####       DESCRIPTION
            description_tag = soup.find('div', class_='print-information print-qrcode-container')
            description = description_tag.next_sibling
            print("\nDescription:",description.strip())

            ####       POST ID
            body=soup.select("p.postinginfo")
            postid=re.findall(r'post\s+id:\s+(\d+)', str(body)) 
            print("\nPost ID:",postid[0])
    
            ####       POSTED AND UPDATED DATES        
            dates=soup.find_all('time',{"class":'date timeago'})
            n=len(dates)

            if n==2:
                posted_date=re.findall(r'(\d+-\d+.*)',dates[n-2].get_text())[0]
                updated_date=''
            else:
                posted_date=re.findall(r'(\d+-\d+.*)',dates[n-2].get_text())[0]
                updated_date=re.findall(r'(\d+-\d+.*)',dates[n-1].get_text())[0]
            print("\nPosted Date:",posted_date)
            print("\nUpdated Date:",updated_date)

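        except Exception:
            #A missing element (e.g. no description or date tag) raises here and skips the remaining fields
            print("DETAILS NOT FOUND")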
        print("\n....................................................................\n")

List Item Number: 1
TITLE: carrier

URL of first image: https://images.craigslist.org/00Q0Q_ch7Wf5xfUV3_0t20CI_600x450.jpg

Description: Email for address.

Post ID: 7714783862

Posted Date: 2024-02-05 13:23

Updated Date: 

....................................................................

List Item Number: 2
TITLE: Free Sofa

URL of first image: https://images.craigslist.org/01414_kocNH60uVG6_0CI0t2_600x450.jpg

Description: Working and clean condition, from a pet free and non smoking house.

Post ID: 7710855801

Posted Date: 2024-01-24 13:00

Updated Date: 2024-02-05 23:32

....................................................................

List Item Number: 3
TITLE: 4 casters for furniture dolly

URL of first image: https://images.craigslist.org/00505_56nbM8eNwWe_03S05a_600x450.jpg

Description: Free

Post ID: 7714842895

Posted Date: 2024-02-05 16:20

Updated Date: 2024-02-05 16:23

....................................................................

List Item Number: 4
TITL

List Item Number: 50
TITLE: Homebrew keg

URL of first image: https://images.craigslist.org/00505_iYkUQ1k2Q60_0t20CI_600x450.jpg

Description: Free home brewing keg to make all grain beer. Lightly used and in great shape. Text or call Jeff to come pick it up!

Post ID: 7714896560

Posted Date: 2024-02-05 20:59

Updated Date: 

....................................................................

List Item Number: 51
TITLE: Free adjustable desk chair

URL of first image: https://images.craigslist.org/00g0g_1zGxfygyXIO_0t20CI_600x450.jpg

Description: Free adjustable desk/office chair. Giving away because we bought a new one.

Post ID: 7714901654

Posted Date: 2024-02-05 21:51

Updated Date: 

....................................................................

List Item Number: 52
TITLE: Lots of art frames

URL of first image: https://images.craigslist.org/00l0l_keIzzoYZyME_0CI0t2_600x450.jpg

Description: I have too many. Photos are of six but I have more. Many sizes. Sorry I don’t 

List Item Number: 108
TITLE: Laptop bag rolls on wheels great condition

URL of first image: https://images.craigslist.org/00q0q_77aDvdxb2rk_0t20CI_600x450.jpg

Description: This is in great condition and a laptop bag with file storage and many compartments!

Post ID: 7714139399

Posted Date: 2024-02-03 12:45

Updated Date: 2024-02-05 19:41

....................................................................

List Item Number: 109
TITLE: Free wood chips
DETAILS NOT FOUND

....................................................................

List Item Number: 110
TITLE: Punch Bowl set
DETAILS NOT FOUND

....................................................................

List Item Number: 111
TITLE: Desk (free)
DETAILS NOT FOUND

....................................................................

List Item Number: 112
TITLE: Free Folding Table

URL of first image: https://images.craigslist.org/00Y0Y_kQ38QwwIVBH_0t20CI_600x450.jpg

Description: Please text directly at

Post ID: 77147

List Item Number: 147
TITLE: Kvm and serial data switcher

URL of first image: https://images.craigslist.org/00q0q_geVc6GOReZa_0kE0rx_600x450.jpg

Description: Kvm and serial data switcher

Post ID: 7714104578

Posted Date: 2024-02-03 11:17

Updated Date: 2024-02-05 17:45

....................................................................

List Item Number: 148
TITLE: Folding Leg table

URL of first image: https://images.craigslist.org/00d0d_7jKGxH7qyeC_09G07g_600x450.jpg

Description: Free table. Has some scratches but no issues. Throw a table cloth over it, and nobody will know.

Post ID: 7714797464

Posted Date: 2024-02-05 14:01

Updated Date: 2024-02-05 16:28

....................................................................

List Item Number: 149
TITLE: 5 free closetmaid shelves

URL of first image: https://images.craigslist.org/00F0F_gsIoreQ4Mi6_0lN0CI_600x450.jpg

Description: 5 free closetmaid wire shelves with wall attachments. 96"×16".

Post ID: 7714899254

Posted Date:

List Item Number: 205
TITLE: Free shipping crate

URL of first image: https://images.craigslist.org/00s0s_5d3hMxrgt3q_0t20CI_600x450.jpg

Description: Do you need to ship a steering rack, telescope, stripper pole, or stilts?  Look no further!  This shipping crate made of OSB is in great shape for it's next shipping job.

Post ID: 7714835841

Posted Date: 2024-02-05 15:57

Updated Date: 

....................................................................

List Item Number: 206
TITLE: Kohler cast iron bath tub -price reduced

URL of first image: https://images.craigslist.org/00O0O_fyifXiLesDM_0CI0t2_600x450.jpg

Description: Free cast iron tub removed during my bathroom remodel. tub is in perfect condition once cleaned and wiped down.   No cracks.  Comes with original brass fittings.  measures 60in. by 30 in.

Post ID: 7707181899

Posted Date: 2024-01-13 10:19

Updated Date: 2024-02-05 17:00

....................................................................

List Item Number: 207
TI

List Item Number: 243
TITLE: Large mirror

URL of first image: https://images.craigslist.org/01111_eDeUqUeOjAh_0t20CA_600x450.jpg

Description: Crack in top right loose piece can be glued or cut smaller

Post ID: 7710929494

Posted Date: 2024-01-24 16:38

Updated Date: 2024-02-05 19:23

....................................................................

List Item Number: 244
TITLE: 1000 piece puzzle 🧩

URL of first image: https://images.craigslist.org/00H0H_1QSTF3uXEEt_0tT0t2_600x450.jpg

Description: 1000 piece puzzle 🧩

Post ID: 7714897627

Posted Date: 2024-02-05 21:09

Updated Date: 

....................................................................

List Item Number: 245
TITLE: 10" Craftsman tablesaw

URL of first image: https://images.craigslist.org/01515_dwU8snNrNRr_0t20CI_600x450.jpg

Description: 1970s era Craftsman. Works fine. Has new motor

Post ID: 7714850184

Posted Date: 2024-02-05 16:48

Updated Date: 

........................................................
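
In the extraction loop earlier in this part, all fields share one `try` block, so a page that lacks a single element prints only "DETAILS NOT FOUND" and hides the fields that were present. A possible refinement is to look up each field separately and return `None` for whatever is missing. The sketch below covers three of the fields; `text_or_none` and `extract_fields` are hypothetical helper names, and `sample` is a made-up page with no image.

```python
import re
from bs4 import BeautifulSoup

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag else None

def extract_fields(html):
    """Extract title, first image URL and post ID; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.select_one("img")
    # Join the text of all posting-info paragraphs and search it for the ID.
    info = " ".join(p.get_text(" ", strip=True) for p in soup.select("p.postinginfo"))
    post_id = re.search(r"post\s+id:\s*(\d+)", info)
    return {
        "title": text_or_none(soup.find("span", id="titletextonly")),
        "image_url": img.get("src") if img else None,
        "post_id": post_id.group(1) if post_id else None,
    }

# Made-up page: it has a title and a post ID but no <img> tag.
sample = """
<span id="titletextonly">Free wood chips</span>
<p class="postinginfo">post id: 1234567890</p>
"""
fields = extract_fields(sample)
print(fields)
```

With this shape, a listing without photos still reports its title and post ID, and the caller can decide how to display the `None` values.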