## Part 1: Scraping and Saving HTML Content

### 2. Interact with the Page-Sorting:

- Can you trigger the sorting change directly by modifying only the URL in your browser’s address bar?  If so, how?  
  - **Answer:** Add sort variable after zip
    - To sort by the oldest listings first, you would use: https://sfbay.craigslist.org/search/zip?sort=dateoldest
    - To sort by the newest listings first, you would use: https://sfbay.craigslist.org/search/zip?sort=date

- Explain what type of request is made when you change the sort order (GET or POST).
  - **Answer:** GET request is made to retrieve posts made with that filter.

- What is the variable in the URL associated with sorting?
  - **Answer:** "sort"


### 3. Interact with the Page-Pagination:

- Determine how to move between pages by only changing the URL.  What part of the URL changes as you navigate through different pages?
  - **Answer:** Every page starts as "~gallery~0~0". As you scroll down through the page, the second number changes to show which listing you can see. As you move from page to page, the first number changes accordingly. Ex: Going to two pages after (3rd page with items 361-480) and scrolling to the middle of the page would result in "~gallery~3~50"
    - https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~3~50

- Identify the variable associated with page changes.  How does altering this variable in the URL affect the page you’re viewing?
  - **Answer:** Fully explained how to navigate in previous answer. Variable at use is "gallery".

### 4. Fetch Listing URLs:

- Use `requests` to access the first page of the “free” section, ordered “newest” first.

- Deploy `BeautifulSoup` to parse the HTML content.

- Identify the structure that holds the links to individual listing pages.  What selector do you choose to grab the link?

- Can you identify one more possible selection method to retrieve the link to the individual listing?  Explain.

- Extract the first 250 unique listing URLs and save them to a list.  Consider the pagination feature of Craigslist to navigate through pages.  Explain your strategy.

- Print the list to screen.

In [101]:
import requests
from bs4 import BeautifulSoup
from time import sleep

headers = {'User-Agent': 'Mozilla/5.0'}

url = "https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# print raw output of the page
# print(soup.prettify())

# print count of elements with <li class="cl-static-search-result">
print("Number of posts found:", len(soup.find_all("li", class_="cl-static-search-result")))

list_of_urls = []
# print all href inside "li", class_="cl-static-search-result". Only the first 250 elements
for i in soup.find_all("li", class_="cl-static-search-result")[:250]:
    print(i.a.get("href"))
    list_of_urls.append(i.a.get("href"))


Number of posts found: 358
https://sfbay.craigslist.org/eby/zip/d/clayton-free-watchmakers-bench-world/7704727968.html
https://sfbay.craigslist.org/scz/zip/d/aptos-kombucha-bottles/7696710009.html
https://sfbay.craigslist.org/nby/zip/d/santa-rosa-free-metal-rings/7702095048.html
https://sfbay.craigslist.org/nby/zip/d/el-verano-thin-wire-for-hobbys/7704813557.html
https://sfbay.craigslist.org/eby/zip/d/concord-free-furniture-curbside/7698740287.html
https://sfbay.craigslist.org/eby/zip/d/berkeley-free-metal-glass-filing/7704818546.html
https://sfbay.craigslist.org/eby/zip/d/antioch-tub-lids-for-art-teachers/7704825207.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-lots-of-free-items/7703526770.html
https://sfbay.craigslist.org/eby/zip/d/lafayette-pottery-barn-stools/7704834690.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-queen-size-mattress/7704839118.html
https://sfbay.craigslist.org/eby/zip/d/walnut-creek-three-free-pillows-two/7704841928.html
https://sfba

### 5. Save HTML Pages:
- For each of the 250 listing URLs, use `requests` to fetch the listing page.

- Save each HTML content to a separate file on disk.  Use each listing’s ID to organize files in a way that makes them easily identifiable (e.g., save listing ID 7713901653 to file “7713901653.html”).

In [102]:
# For each of the 250 listing URLs in list_of_urls, use `requests` to fetch the listing page.
# Change IP Address before the for loop

for url in list_of_urls:
    sleep(7)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")

    # add raw output of the page to a file named "<listingId>.html"
    # Urls are in the format: https://sfbay.craigslist.org/eby/zip/d/clayton-free-watchmakers-bench-world/<listingId>.html
    listingId = url.split("/")[-1].split(".")[0]
    with open(f"{listingId}.html", "w") as file:
        file.write(soup.prettify())

    print(f"File {listingId}.html created")

# FILEPATH: /Users/quino/Documents/GitHub/UC-Davis-HW/BAX 422 Data Design/Individual Project 1/project.ipynb
# Number of posts processed: 250
print("Number of posts processed:", len(list_of_urls))

File 7704727968.html created
File 7696710009.html created
File 7702095048.html created
File 7704813557.html created
File 7698740287.html created
File 7704818546.html created
File 7704825207.html created
File 7703526770.html created
File 7704834690.html created
File 7704839118.html created
File 7704841928.html created
File 7704842962.html created
File 7704844703.html created
File 7704845214.html created
File 7704850004.html created
File 7704853833.html created
File 7700373237.html created
File 7704862210.html created
File 7704864061.html created
File 7704865308.html created
File 7704865930.html created
File 7704869989.html created
File 7704874694.html created
File 7695811368.html created
File 7704878191.html created
File 7704878469.html created
File 7704878908.html created
File 7699582137.html created
File 7704873866.html created
File 7704885157.html created
File 7704888726.html created
File 7704889133.html created
File 7704889847.html created
File 7695914006.html created
File 770489523

## Part 2: Parsing and Displaying Information from Saved HTML

### 1. Write a script that reads each of the saved HTML files from the disk.

In [106]:
# Read the files and print the first 50 characters of each file. 
# Loop over all html files in my current directory and 
# print the first 50 characters of each file.
# I used my own implementation

import os
for file in os.listdir():
    if file.endswith(".html"):
        with open(file, "r") as file:
            print(file.read()[:50])
        

<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>
  <meta charset="ut
<!DOCTYPE html>
<html>
 <head>


### 2. Extract Information:

For each HTML file, use `BeautifulSoup` to parse the file content.

Extract and print the following details:

- Title: The title of the listing.

- URL of first image (if an image exists):  The URL of the displayed image.  It can be found in the `src` attribute of `<img>`

- Description: The full description text of the listing.

- Post ID: Usually found at the bottom of the page or within the page's HTML structure.

- Posted Date: The date when the listing was originally posted.

- Last Updated Date: The date when the listing was last updated.


In [130]:
# Read the files and print the first 50 characters of each file. 
# Loop over all html files in my current directory and 
# print the first 50 characters of each file.
# I used my own implementation
import os

for filename in os.listdir():
    # Check if the file is an HTML file
    if filename.endswith(".html"):
        with open(filename, "r", encoding="utf-8") as file:
            # Read the file content
            html_content = file.read()
            
            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')
            
            # Extract title from here:
            #     <meta content="Car seat - free stuff - craigslist" property="og:title"/>
            title = soup.find('span', id='titletextonly').text.strip()
            
            # Extract URL of the first image if exists from here:   
            #     <meta content="https://images.craigslist.org...600x450.jpg" property="og:image"/>
            image_url = soup.find('meta', property='og:image')['content'] if soup.find('meta', property='og:image') else 'No image found'
            
            # Extract description from here:   
            #     <meta content="Please let me know ... Thanks" name="description"/>
            description = soup.find('meta', attrs={'name': 'description'})['content']

            # Extract post ID
            div_element = soup.find('div', class_='postinginfos')
            p_text = div_element.find('p').text.strip() if div_element else 'No p tag found'
            p_text = p_text.replace("post id: ", "")
            
            # Extract posted date
            posted_date = soup.find('time', class_='date timeago').text.strip()
            
            # Print extracted information
            print(f"Title: {title}")
            print(f"URL of first image: {image_url}")
            print(f"Description: {description}")
            print(f"Post ID: {p_text}")
            print(f"Posted Date: {posted_date}")
            print("-" * 50)  # Print a separator line for readability

Title: Various infant and baby supplies
URL of first image: https://images.craigslist.org/00m0m_kcJ6VWOX97U_0lM0CI_600x450.jpg
Description: Email with phone number so we can text to coordinate efficiently. Take one take all just let me know. -Various meds like gripe water, iron drops, and liquid multivitamin. Most new or totally full...
Post ID: 7698463463
Posted Date: 2023-12-16 11:10
--------------------------------------------------
Title: Mazda Miata Automatic Gas & Brake Pedal
URL of first image: https://images.craigslist.org/00808_8yOZlxvchtM_0t20CI_600x450.jpg
Description: In good, working condition. Originally off a ‘93 Miata. Message me through Craigslist chat if possible, easier than digging through all my junky ass email. 
Post ID: 7705222704
Posted Date: 2024-01-07 14:29
--------------------------------------------------
Title: Couch
URL of first image: https://images.craigslist.org/00W0W_58zCEtOZb4s_0fu079_600x450.jpg
Description: Free couch. Cute loveseat couch.
Post ID: 

## Part 3: Automating Login on The Old Reader

### 1. Creating and Verifying a The Old Reader Account

- **Account Creation:**  Create an account on https://theoldreader.com . Use an email address and password that you are comfortable sharing with us.

- **Manual Login Verification:** Before automating the login process, ensure you can manually log in to theoldreader.com with your new credentials.  This confirms that your account is active and your credentials are correct.

Successfully created account.

### 2. Exploring the Login Mechanism

- Navigate to the login page of https://theoldreader.com to an external site.

- Use your browser’s developer tools to inspect the page, focusing on the <form> tag involved in the login process.

- Document all `<input>` fields within the login form, paying special attention to their name attributes. These fields are crucial for submitting the login request programmatically.

In [136]:
# Navigate to the login page of https://theoldreader.com to an external site using bs

url = "https://theoldreader.com/users/sign_in"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
# print(soup)


# Extract all input fields within the form
input_fields = soup.find_all('input')

# Print the name attribute of each input field
for input_field in input_fields:
    print(input_field['name'])

# We are only interested in user[login] and user[password] fields.


utf8
authenticity_token
user[login]
user[password]
commit


### 3. Analyzing Network Traffic for Login Request

- With the network tab of your browser’s developer tools open, log in to the site again.

- Identify the network request made when you submit the login form (GET or POST).  Explain why this method was chosen.

- Carefully examine the payload that was submitted to the server during login.  Compare this payload to the `<form>` / `<input>` fields you previously analyzed.  Explain your observation.

**Answer:** Payload uses the same 5 fields as we captured earlier. It is a POST request to https://theoldreader.com/users/sign_in . We use POST to securely transfer credentials hidden within the body rather than being part of the URL to keep information out of browser history and server logs.

### 4. Automating the Login Process

- Using Python and appropriate libraries like requests, simulate the login process.

- Create a session object to maintain your login state across multiple requests.

- Prepare a payload with your login credentials and other necessary form data identified from the login page and the network analysis.

- Send a POST request to the login form’s action URL to log in, using the session object.

In [137]:
url = "https://theoldreader.com/users/sign_in"  # replace with the actual login URL
data = {
    "utf8": "✓",
    "authenticity_token": "E7UZS//Yd8c+ETSTaG0ApHxKZ+I8EWRloGKXuI/R3D0=",
    "user[login]": "quinocarreteromartinez@gmail.com",
    "user[password]": "Quinox98!",
    "commit": "Sign In",
}

response = requests.post(url, data=data)
print(response.status_code)

200


### 5. Verifying Successful Login

- After attempting to log in, inspect the cookies saved in the session object to understand the information WhoScored.com stores on your computer.

- Use the session object to access https://theoldreader.com to an external site.

- Verify successful login by checking for the presence of your user information that is only available when logged in.

In [141]:
# Check if we logged in by extracting text in <li class="dropdown"> and printing <a title> inside it
soup = BeautifulSoup(response.content, "html.parser")
print(soup.find('li', class_='dropdown').a['title'])

Joaquin


In [3]:
import requests
from bs4 import BeautifulSoup

# The URL of the Craigslist "free" section.
url = 'https://sfbay.craigslist.org/search/zip?'

# Send a GET request to the URL
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

listings = soup.select('li.cl-search-result cl-search-view-mode-gallery > a.main singleton')

print(listings)

# Extract the URLs from the 'href' attribute of each <a> tag
urls = [listing.get('href') for listing in listings]

print(url)

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="craigslist" property="og:site_name"/>
<meta content="preview" name="twitter:card"/>
<meta content="SF bay area free stuff - craigslist" property="og:title"/>
<meta content="SF bay area free stuff - craigslist" name="description"/>
<meta content="SF bay area free stuff - craigslist" property="og:description"/>
<meta content="https://sfbay.craigslist.org/search/zip" property="og:url"/>
<title>SF bay area free stuff - craigslist</title>
<link href="https://sfbay.craigslist.org/search/zip" rel="canonical"/>
<link href="https://sfbay.craigslist.org/search/zip" hreflang="x-default" rel="alternate"/>
<link href="/favicon.ico" id="favicon" rel="icon">
<script id="ld_searchpage_data" type="application/ld+json">
    {"breadcrumb":{"@type":"BreadcrumbList","@context":"https://schema.org","itemListElement":