# Individual Project - DDR

## Author - Manickashree "Madhu" Thayumana Sundaram

## Part 1: Scraping and Saving HTML Content



In [48]:
## Importing the required libraries
from bs4 import BeautifulSoup
import requests
import time
import os

### Q1. Identify the Target:
Start with navigating to the “free” section on the Craigslist San Francisco Bay Area site (https://sfbay.craigslist.org/search/zip).

This page lists items that people are giving away for free.

#### Answer 
##### The URL to navigate to for the list of free items is 'https://sfbay.craigslist.org/search/zip#search=1~gallery~0~0'.

### Q2. Interact with the Page-Sorting:
Initially, the listings might be sorted by “newest” first.  Try changing the sorting order to “oldest” first by interacting with the page’s UI.

Observe any changes in the URL after you change the sorting order back and forth.

Can you trigger the sorting change directly by modifying only the URL in your browser’s address bar?  If so, how?

Explain what type of request is made when you change the sort order (GET or POST).

What is the variable in the URL associated with sorting?

#### Answer
##### When the listings are sorted by "newest", the URL is "https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0"; when sorted by "oldest", it is "https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0". Sorting can therefore be triggered directly by editing the URL in the address bar.
##### The variable in the URL associated with sorting is 'sort'. When the order changes from newest to oldest, its value changes from 'date' to 'dateoldest'.
##### Changing the sort order issues a GET request: the parameter travels in the URL's query string rather than in a request body.
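
The URL edit described above can be sketched with the standard library. This is only an illustration: the helper name `with_sort` is hypothetical, and the URLs are the ones quoted in the answer.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def with_sort(url, order):
    """Return `url` with its 'sort' query parameter set to `order` (hypothetical helper)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query['sort'] = [order]
    # Rebuild the URL; the '#search=...' fragment is carried over unchanged
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

newest = 'https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0'
oldest = with_sort(newest, 'dateoldest')
print(oldest)
print(parse_qs(urlsplit(oldest).query))
```

Only the query string differs between the two URLs, which is consistent with the sort order being an ordinary GET parameter.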

### Q3. Interact with the Page-Pagination:
Craigslist paginates listings, typically displaying a limited number of items per page (120 for me).

Navigate to the second and third pages of results and observe the changes in the URL.

Exploration Task:

Determine how to move between pages by only changing the URL.  What part of the URL changes as you navigate through different pages?

This task will help you understand how pagination works on Craigslist and how you can programmatically access different pages of listings.

Identify the variable associated with page changes.  How does altering this variable in the URL affect the page you’re viewing?

Explain.



#### Answer:

### Q4. Fetch Listing URLs:
Use `requests` to access the first page of the “free” section, ordered “newest” first.

Deploy `BeautifulSoup` to parse the HTML content.

Identify the structure that holds the links to individual listing pages.  What selector do you choose to grab the link?

Can you identify one more possible selection method to retrieve the link to the individual listing?  Explain.

Extract the first 250 unique listing URLs and save them to a list.  Consider the pagination feature of Craigslist to navigate through pages.  Explain your strategy.

Print the list to screen.

In [73]:
## Setting the headers and url
headers = {'User-Agent': 'Chrome/120.0.0.0'}
# url = 'https://sfbay.craigslist.org/search/zip#search=1~gallery~0~0'

## Accessing the first page of the “free” section, ordered “newest” first
url = 'https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0'

## Request to access the newest in free section 
page = requests.get(url, headers = headers)

## Using BeautifulSoup to parse HTML Content
soup_ip = BeautifulSoup(page.content, 'html.parser')

## Displaying the page content
print(soup_ip.prettify())

## Verification: checking the HTTP status, since the soup object is never None
if page.ok:
    print(f"Access website, link: {url}")
else:
    print("Failed.")

Access website, link: https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0


##### Based on the soup output, we can infer the website's structure. It is clear that all the URLs can be obtained from the parent `<li>` tag with a class of "cl-static-search-result". For this, using the "select()" method would be the most effective approach.

In [50]:
## Fetching all the URLs - first method, using select()
url_of_interest = soup_ip.select('li.cl-static-search-result > a[href]')
unique_list = []

## Looping through the anchors and keeping each URL only once
for link in url_of_interest:
    href = link.get('href')
    if href not in unique_list:  ## comparing the URL itself, not the loop index
        unique_list.append(href)
        if len(unique_list) >= 250:  ## Limiting to the first 250 URLs
            break

In [52]:
## Displaying the list of unique URLs
print(unique_list)

['https://sfbay.craigslist.org/sby/zip/d/saratoga-free-outdoor-standing/7715547249.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-bookpattern-and-flow/7715547173.html', 'https://sfbay.craigslist.org/sby/zip/d/san-jose-chaise-lounge/7715546331.html', 'https://sfbay.craigslist.org/eby/zip/d/oakland-entertainment-unit-59x15x23/7715546023.html', 'https://sfbay.craigslist.org/nby/zip/d/santa-rosa-trampoline/7715545952.html', 'https://sfbay.craigslist.org/scz/zip/d/santa-cruz-free-speakers-and-klipsch-sub/7715545532.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-glass-coffee-table/7715544591.html', 'https://sfbay.craigslist.org/nby/zip/d/healdsburg-free-crate-barrel-willow/7712753370.html', 'https://sfbay.craigslist.org/eby/zip/d/el-cerrito-books/7715542855.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-build-your-own-birdhouse/7715542366.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-dining-chairs/7715540749.html', 'https://sfbay.

In [51]:
## Ensuring that only the first 250 URLs are extracted
print(len(unique_list))

250


##### Using the select() method, we access the `<li>` tags with the class 'cl-static-search-result' and then the `<a>` tag within each one. We then iterate through the resulting anchors to collect the first 250 unique URLs.
##### Because select() returned 360 URLs from the first page alone, pagination was not needed to reach 250.
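
As a side note, the "first 250 unique URLs" step can also be expressed without an explicit loop. The sketch below is an alternative, not the method used in this notebook; `first_unique` is a hypothetical helper and the sample list is made up for illustration.

```python
from itertools import islice

def first_unique(urls, limit=250):
    """Return the first `limit` distinct items of `urls`, keeping their original order."""
    # dict.fromkeys drops duplicates while preserving insertion order
    return list(islice(dict.fromkeys(urls), limit))

# Made-up sample with one duplicate entry
sample = ['a.html', 'b.html', 'a.html', 'c.html']
print(first_unique(sample, limit=2))
print(first_unique(sample))
```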

##### An alternative is the find_all() method: we fetch all anchor tags and then read the href attribute of each, which holds the URL.

In [53]:
## Fetching all the URLs - second method, using find_all()
url_of_interest_2 = soup_ip.find_all('a')
unique_list_2 = []

for link in url_of_interest_2:
    href = link.get('href')
    if href not in unique_list_2:  ## comparing the URL itself, not the loop index
        unique_list_2.append(href)
        if len(unique_list_2) >= 250:  ## Limiting to the first 250 entries
            break

## Displaying the list of unique URLs
print(unique_list_2)


['#', '/', 'https://sfbay.craigslist.org/sby/zip/d/saratoga-free-outdoor-standing/7715547249.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-bookpattern-and-flow/7715547173.html', 'https://sfbay.craigslist.org/sby/zip/d/san-jose-chaise-lounge/7715546331.html', 'https://sfbay.craigslist.org/eby/zip/d/oakland-entertainment-unit-59x15x23/7715546023.html', 'https://sfbay.craigslist.org/nby/zip/d/santa-rosa-trampoline/7715545952.html', 'https://sfbay.craigslist.org/scz/zip/d/santa-cruz-free-speakers-and-klipsch-sub/7715545532.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-glass-coffee-table/7715544591.html', 'https://sfbay.craigslist.org/nby/zip/d/healdsburg-free-crate-barrel-willow/7712753370.html', 'https://sfbay.craigslist.org/eby/zip/d/el-cerrito-books/7715542855.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-build-your-own-birdhouse/7715542366.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-dining-chairs/7715540749.html', 'http

In [54]:
## Ensuring that only the first 250 entries are extracted
print(len(unique_list_2))

250


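##### find_all('a') also returns 250 entries. However, as the output shows, the first two ('#' and '/') are navigation links rather than listings, so this result needs filtering before it can be used as a list of listing URLs.
##### Thus both select() and find_all() can be used to retrieve the links, with select() being the more precise of the two here.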
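
One possible way to filter the find_all() result down to listing pages is a pattern match on the href. This is a sketch under an assumption: the pattern `LISTING_RE` is inferred only from the URLs printed above (a numeric ID followed by `.html`) and is not a documented Craigslist format.

```python
import re

# Assumed pattern, based solely on the listing URLs shown in the output above
LISTING_RE = re.compile(r'^https://sfbay\.craigslist\.org/.+/\d+\.html$')

# A few hrefs as find_all('a') might return them, including non-listing entries
hrefs = [
    '#',
    '/',
    'https://sfbay.craigslist.org/sby/zip/d/saratoga-free-outdoor-standing/7715547249.html',
    None,  # an anchor tag without an href attribute
]

listing_links = [h for h in hrefs if h and LISTING_RE.match(h)]
print(listing_links)
```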

### Q5. Save HTML Pages:
For each of the 250 listing URLs, use `requests` to fetch the listing page.

Save each HTML content to a separate file on disk.  Use each listing’s ID to organize files in a way that makes them easily identifiable (e.g., save listing ID 7713901653 to file “7713901653.html”).

In [68]:
## Defining the directory path
directory_path = 'DDR_Dir'
## Creating the directory (exist_ok avoids an error when the cell is re-run)
os.makedirs(directory_path, exist_ok=True)

In [56]:
## Saving each listing page in the directory
for listing_url in unique_list:
    response_html_page = requests.get(listing_url, headers=headers)
    file_name = listing_url.split('/')[-1]  # Last path segment, e.g. "7715547249.html" (listing ID plus extension)
    file_path = os.path.join(directory_path, file_name)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(response_html_page.text)
    time.sleep(1)  # Throttle requests
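
The file-naming step can be made slightly more defensive. The sketch below is an optional variant, not what the cell above does: `listing_filename` is a hypothetical helper that ignores any query string or fragment and checks that the ID is numeric before building the file name.

```python
import posixpath
from urllib.parse import urlsplit

def listing_filename(url):
    """Build '<listing id>.html' from a listing URL (hypothetical helper)."""
    # Use only the path component so '?query' or '#fragment' parts are ignored
    last_segment = posixpath.basename(urlsplit(url).path)
    listing_id, _extension = posixpath.splitext(last_segment)
    if not listing_id.isdigit():
        raise ValueError(f"No numeric listing ID found in {url!r}")
    return f"{listing_id}.html"

example_url = 'https://sfbay.craigslist.org/sby/zip/d/saratoga-free-outdoor-standing/7715547249.html'
print(listing_filename(example_url))
```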

In [60]:
## Listing the files and its count
print(os.listdir(directory_path))
n_lst = os.listdir(directory_path)
print('\n')
print('Total No. Of Files in this directory - ',len(n_lst))

['7715526266.html', '7714731668.html', '7715347447.html', '7715528405.html', '7712106559.html', '7715472124.html', '7715545532.html', '7714735168.html', '7712935607.html', '7715547173.html', '7715354139.html', '7715503718.html', '7715394714.html', '7715395017.html', '7715314438.html', '7715530347.html', '7715423737.html', '7715414014.html', '7715453823.html', '7715530107.html', '7713160483.html', '7715394295.html', '7715544591.html', '7705769410.html', '7715466412.html', '7715524329.html', '7715468334.html', '7711667389.html', '7715525646.html', '7715524057.html', '7715484285.html', '7715545952.html', '7715520868.html', '7715509713.html', '7712229203.html', '7715451413.html', '7715542366.html', '7710208630.html', '7715530720.html', '7715308197.html', '7707133848.html', '7713360347.html', '7713089936.html', '7715495119.html', '7715500343.html', '7715369693.html', '7715306849.html', '7714717921.html', '7705495184.html', '7710582593.html', '7715517490.html', '7715436732.html', '7715329244

##### We can see that the 250 HTML files are saved in a separate directory named 'DDR_Dir'.

## Part 2: Parsing and Displaying Information from Saved HTML

### Q1. Read Saved HTML Files:

In [61]:
directory = 'DDR_Dir'
os.listdir(directory)

['7715526266.html',
 '7714731668.html',
 '7715347447.html',
 '7715528405.html',
 '7712106559.html',
 '7715472124.html',
 '7715545532.html',
 '7714735168.html',
 '7712935607.html',
 '7715547173.html',
 '7715354139.html',
 '7715503718.html',
 '7715394714.html',
 '7715395017.html',
 '7715314438.html',
 '7715530347.html',
 '7715423737.html',
 '7715414014.html',
 '7715453823.html',
 '7715530107.html',
 '7713160483.html',
 '7715394295.html',
 '7715544591.html',
 '7705769410.html',
 '7715466412.html',
 '7715524329.html',
 '7715468334.html',
 '7711667389.html',
 '7715525646.html',
 '7715524057.html',
 '7715484285.html',
 '7715545952.html',
 '7715520868.html',
 '7715509713.html',
 '7712229203.html',
 '7715451413.html',
 '7715542366.html',
 '7710208630.html',
 '7715530720.html',
 '7715308197.html',
 '7707133848.html',
 '7713360347.html',
 '7713089936.html',
 '7715495119.html',
 '7715500343.html',
 '7715369693.html',
 '7715306849.html',
 '7714717921.html',
 '7705495184.html',
 '7710582593.html',


In [62]:
# Loop through each file in the directory
for filename in os.listdir(directory):
    # Check if the file ends with .html
    if filename.endswith(".html"):
        # Construct the full file path
        filepath = os.path.join(directory, filename)
        # Read file to string
        with open(filepath, 'r', encoding='utf-8') as file:
            html_p1 = file.read()
        soup_file_1 = BeautifulSoup(html_p1, "html.parser")
        # print(soup_file_1.prettify())

        ## Ensuring the file is read correctly
        ## (the fallback string is applied after the lookup, so .text is never called on a str)
        title_tag = soup_file_1.find('title')
        title_text = title_tag.text if title_tag else 'No Title Found'
        print('Title: ', title_text)


Title:  3 drawer lateral file cabinet - free stuff - craigslist
Title:  FREE TEDDY BEARS & GALAXY ROSES FOR VALENTINE'S DAY TO SPREAD LOVE - free stuff - craigslist
Title:  Glassware - free stuff - craigslist
Title:  Free rooster - free stuff - craigslist
Title:  Free 3 Bags of sand, CONSTRUCTION MATERIALS **** - free stuff - craigslist
Title:  Wall art - free - free stuff - craigslist
Title:  Free speakers and klipsch sub - free stuff - craigslist
Title:  Free - free stuff - craigslist
Title:  Free loose dirt for planting - free stuff - craigslist
Title:  Book—Pattern and Flow - free stuff - craigslist
Title:  Two chairs - free stuff - craigslist
Title:  Free bags of wood kindling - free stuff - craigslist
Title:  Free TV 42-inch Panasonic TH-42PX20 - free stuff - craigslist
Title:  Free Sony Bravia 32" TV - free stuff - craigslist
Title:  Hospital bed - free stuff - craigslist
Title:  2-window fans - free stuff - craigslist
Title:  Free King size bed - free stuff - craigslist
Title: 

##### All files in the DDR_Dir directory are read correctly.

### Q2. Extract Information:
For each HTML file, use `BeautifulSoup` to parse the file content.

Extract and print the following details:

Title: The title of the listing.

URL of first image (if an image exists):  The URL of the displayed image.  It can be found in the `src` attribute of `<img>`

Description: The full description text of the listing.

Post ID: Usually found at the bottom of the page or within the page's HTML structure.

Posted Date: The date when the listing was originally posted.

Last Updated Date: The date when the listing was last updated.

In [67]:
## Looping through each file in the directory
for filename1 in os.listdir(directory):
    if filename1.endswith('.html'):
        filepath1 = os.path.join(directory, filename1)
        with open(filepath1, 'r', encoding='utf-8') as file:
            html1 = file.read()
        soup_file = BeautifulSoup(html1, "html.parser")

        ## Printing the Title
        title_tag = soup_file.find('title')
        title_text = title_tag.text if title_tag else 'No Title Found'
        print('Title: ', title_text)

        ## Printing the URL of the first image
        image_tag = soup_file.select_one('img')
        image_url = image_tag.get('src', 'No Image found') if image_tag else 'No Image found'
        print('URL: ', image_url)

        ## Printing the Description
        description_element = soup_file.select_one('noscript#no-js > div > p') or soup_file.select_one('div > p')
        description_text = description_element.text if description_element else 'No Description found'
        print('Description: ', description_text)

        ## Printing the Post ID
        post_id = soup_file.select_one('div.postinginfos > p.postinginfo')
        post_id_text = post_id.text if post_id else 'No Post Id found'
        print('Post ID: ', post_id_text)

        ## Printing the Posted Date (visible text of the first <time> tag)
        post_date = soup_file.select_one('div.postinginfos > p.postinginfo.reveal > time.date.timeago')
        post_date_text = post_date.text if post_date else 'No Post Date found'
        print('Posted Date: ', post_date_text)

        ## Printing the Last Updated Date (datetime attribute of the second <time> tag, if present)
        updated_date = soup_file.select('p.postinginfo.reveal > time.date.timeago')
        updated_date_text = updated_date[1].get('datetime') if len(updated_date) > 1 else 'No Updated Date Found'
        print('Last Updated Date: ', updated_date_text)

        print('\n')
    

Title:  3 drawer lateral file cabinet - free stuff - craigslist
URL:  https://images.craigslist.org/00Q0Q_5aDc88w6ZRt_0t20CI_600x450.jpg
Description:  Giving away a lateral file cabinet in excellent condition. I dont have the keys, so it wont lock. other than that it works great. I am having a bunch of stuff hauled away on Saturday morning, so you...
Post ID:  post id: 7715526266
Posted Date:  2024-02-07 16:39
Last Updated Date:  2024-02-07T16:39:26-0800


Title:  FREE TEDDY BEARS & GALAXY ROSES FOR VALENTINE'S DAY TO SPREAD LOVE - free stuff - craigslist
URL:  https://images.craigslist.org/00Y0Y_8bQG1tUwXSq_0cI0oc_600x450.jpg
Description:  FREE TEDDY BEARS, VALENTINE PLUSH TOYS &amp; GALAXY ROSES FOR VALENTINE'S DAY TO SPREAD LOVE on GARAGE SALE FRIDAY FEB 9, SATURDAY FEB 10 AT AND SUNDAY FEB 11 AT 249 SAGAMORE STREET IN SF FROM 10AM...
Post ID:  post id: 7714731668
Posted Date:  2024-02-05 11:19
Last Updated Date:  2024-02-05T11:19:00-0800


Title:  Glassware - free stuff - craigslis

##### The Title, Image URL, Description, Post ID, Posted Date, and Last Updated Date are printed for all 250 saved HTML files.
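
The values printed above are raw strings. If they were needed for further analysis, they could be normalized along the following lines. This is a sketch only: `parse_post_id` and `parse_timestamp` are hypothetical helpers, and the two sample strings are copied from the first listing in the output above.

```python
import re
from datetime import datetime

def parse_post_id(text):
    """Pull the numeric ID out of a string such as 'post id: 7715526266'."""
    match = re.search(r'\d+', text)
    return match.group(0) if match else None

def parse_timestamp(value):
    """Parse a datetime attribute such as '2024-02-07T16:39:26-0800'."""
    return datetime.strptime(value, '%Y-%m-%dT%H:%M:%S%z')

post_id = parse_post_id('post id: 7715526266')
updated = parse_timestamp('2024-02-07T16:39:26-0800')
print(post_id)
print(updated.isoformat())
```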

## Part 3: Automating Login on The Old Reader

### Q1. Creating and Verifying a The Old Reader Account
Account Creation:  Create an account on https://theoldreader.com.  Use an email address and password that you are comfortable sharing with us.

Manual Login Verification: Before automating the login process, ensure you can manually log in to theoldreader.com with your new credentials.  This confirms that your account is active and your credentials are correct.

#### Answer:
Created an account on The Old Reader and verified the login process manually. The URL for the sign-up page is 'https://theoldreader.com/users/sign_up'.

### Q2. Exploring the Login Mechanism
Navigate to the login page of https://theoldreader.com.

Use your browser’s developer tools to inspect the page, focusing on the `<form>` tag involved in the login process.

Document all `<input>` fields within the login form, paying special attention to their name attributes. These fields are crucial for submitting the login request programmatically.

 

#### Answer:
##### The `<input>` fields within the login form capture the user's username (or email) and password, with the name attributes user[login] and user[password]. There are also hidden inputs named utf8 and authenticity_token, and an input of type 'submit' that submits the sign-in details.
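
Documenting the inputs can also be done programmatically. The sketch below uses the standard library's `HTMLParser` instead of BeautifulSoup so that it runs without extra packages. `SAMPLE_FORM` is an illustrative stand-in, not the site's actual markup: only the field names are taken from this notebook, and the values are placeholders.

```python
from html.parser import HTMLParser

class InputCollector(HTMLParser):
    """Collect (name, type) pairs for every <input> tag encountered."""
    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attributes = dict(attrs)
            self.fields.append((attributes.get('name'), attributes.get('type')))

# Illustrative form; field names follow the answer above, values are placeholders
SAMPLE_FORM = """
<form id="new_user" method="post">
  <input name="utf8" type="hidden" value="1"/>
  <input name="authenticity_token" type="hidden" value="PLACEHOLDER"/>
  <input id="user_login" name="user[login]" type="text"/>
  <input name="user[password]" type="password"/>
  <input name="commit" type="submit" value="Sign In"/>
</form>
"""

collector = InputCollector()
collector.feed(SAMPLE_FORM)
for name, input_type in collector.fields:
    print(f"{name}: {input_type}")
```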

### Q3. Analyzing Network Traffic for Login Request
With the network tab of your browser’s developer tools open, log in to the site again.

Identify the network request made when you submit the login form (GET or POST).  Explain why this method was chosen.

Carefully examine the payload that was submitted to the server during login.  Compare this payload to the `<form>` / `<input>` fields you previously analyzed.  Explain your observation.

#### Answer:
##### The network request made when the login form is submitted is a POST. POST is used because the username and password travel in the request body rather than in the URL, which keeps the credentials out of the address bar, browser history, and server access logs. Protection in transit comes from HTTPS, which the site uses.
##### The payload contains every field captured by the form's `<input>` elements: utf8, authenticity_token, user[login], user[password], and commit (the sign-in button).
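
To make the comparison between form fields and payload concrete, the sketch below shows how such fields are serialized into a form-encoded POST body. All values are placeholders chosen for illustration; none are real credentials or tokens.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder values; the keys match the name attributes of the form inputs
payload = {
    'utf8': '1',
    'authenticity_token': 'TOKEN',
    'user[login]': 'user@example.com',
    'user[password]': 'secret',
    'commit': 'Sign In',
}

# The body as it would be sent with Content-Type application/x-www-form-urlencoded
body = urlencode(payload)
print(body)

# Decoding the body recovers the original field names and values
decoded = parse_qs(body)
print(sorted(decoded))
```

The brackets and the `@` sign are percent-encoded in the body, which is why the raw payload in the network tab can look different from the field names in the HTML.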

In [75]:
## Setting the Headers and URL
headers = {'User-Agent': 'Chrome/120.0.0.0'}
url = 'https://theoldreader.com/users/sign_in'  ## sign-in page

## Requests
page = requests.get(url, headers = headers)
soup_p3 = BeautifulSoup(page.content, 'html.parser')

## Displaying the page content
print(soup_p3.prettify())

## Verification: checking the HTTP status, since the soup object is never None
if page.ok:
    print(f"Access website, link: {url}")
else:
    print("Failed.")

Access website, link: https://theoldreader.com/users/sign_in


In [70]:
## Extracting the login field from the form (named login_input to avoid shadowing the built-in input())
login_input = soup_p3.select('input#user_login.form-control')
print(login_input)

[<input autocapitalize="off" autocorrect="off" autofocus="autofocus" class="form-control" id="user_login" name="user[login]" placeholder="Username/Email" size="30" spellcheck="false" type="text"/>]


### Q4. Automating the Login Process

In [71]:
## Selecting the authenticity token for automating the login
token_input = soup_p3.select_one('#new_user input[name="authenticity_token"]')
a_token = token_input.get('value')

## Displaying the token value
print(a_token)

## Pausing before the next request
time.sleep(5)

session = requests.Session()

## Configuring the post request
post_r = session.post('https://theoldreader.com/users/sign_in',
                     data = {'authenticity_token' : a_token,  ## passing the token variable, not the literal string 'a_token'
                             'user[login]': 'krishimadhu@icloud.com',
                             'user[password]': 'r5Pjft$5EBB$ugR'                           
                         
                     },
                     timeout=20)

## Extracting the cookies and displaying them
cookies = session.cookies.get_dict()
print(cookies)

29ovAdEHqW1oiYOi+4M/Nix/PFtSNB35Bw3SgWoh59A=
{'_new_reader_session': 'BAh7CkkiD3Nlc3Npb25faWQGOgZFVEkiJTA4ZTQwNzlkMzc0N2Q3MjI4NmZhZjk3NzlhOGVjMWZiBjsAVEkiGXdhcmRlbi51c2VyLnVzZXIua2V5BjsAVFsHWwZVOhpNb3BlZDo6QlNPTjo6T2JqZWN0SWQiERUuoC%2F17hUhzdqei0kiIiQyYSQwNSQzaVhoWVFVaUdtSzNNbFFZd3JZbmEuBjsAVEkiDWxhbmd1YWdlBjsARjoHZW5JIhByZWRpcmVjdF90bwY7AEZJIgYvBjsARkkiEF9jc3JmX3Rva2VuBjsARkkiMVFDK053Nzg5UXdVOXFhVmtGWXF3WVRvc3YvZ1FPVWZmRFBsemVXQ3NxQms9BjsARg%3D%3D--d5dea4afcfb30466f6bb671c416a8d20707d7e78', 'i_know_you': 'Madhu', 'remember_user_token': 'BAhbB1sGVToaTW9wZWQ6OkJTT046Ok9iamVjdElkIhEVLqAv9e4VIc3anotJIiIkMmEkMDUkM2lYaFlRVWlHbUszTWxRWXdyWW5hLgY6BkVU--9acde38587ba34395d09db03c46851a965b69d47', 'signed_at': '1707362514'}


##### Successfully logged in with my credentials and displayed the session cookies.

### Q5. Verifying Successful Login

In [72]:
# Pausing between requests.
time.sleep(5) # 5secs wait period

# Setting the cookies and stay logged in
page3 = requests.get('https://theoldreader.com/', cookies=cookies) # or explicitly set cookies of the session (when not using session.xyz)
soup3_p3 = BeautifulSoup(page3.content, 'html.parser')

## Checking if logged in using username
elements = soup3_p3.find_all(string = 'Madhu  ') 
for element in elements:
    print(str(element.parent))

<a class="dropdown-toggle" data-hover="dropdown" data-toggle="dropdown" href="#" title="Madhu">Madhu  <i class="fa fa-caret-down"></i></a>


##### The extracted HTML content contains the username 'Madhu', which verifies that the login succeeded.