## Tutorial Notebook

### 1. What the Scraper is Doing

The script performs the following tasks:

- **Input and Output CSVs**: It reads URLs from an input CSV file and writes the extracted data to an output CSV file.
- **Fetching Web Page Content**: For each URL, it sends an HTTP GET request with browser-like headers, which reduces the chance of the request being blocked.
- **Parsing HTML**: It uses BeautifulSoup to parse the HTML content and searches for specific elements by tag, class, and attribute (for example, an `h1` with class `y-css-olzveb` for the business name).
- **Extracting Data**: It retrieves the text of five fields per URL: business name, review rating, number of reviews, food type, and operational hours.
- **Handling Errors**: The script catches failed requests so that it doesn't crash if a URL is unreachable; the affected row gets empty values instead.
- **Writing to CSV**: The extracted values are written to five new columns (`Business Name`, `Review Rating`, `Number of Reviews`, `Food Type`, `Operational Hours`) in the output CSV.




---




### 2. How to Inspect a Website and Find the Information to Extract

When you want to scrape data from a website, you first need to understand how the information is structured in the HTML. Here's a step-by-step guide on how to do this:

##### 1. Visual Inspection:

- **Start by Visiting the Webpage**: Open the webpage in a browser (e.g., Chrome, Firefox).
- **Identify the Data You Need**: Look at the page and decide what specific information you want to extract (e.g., product names, prices, reviews, etc.).

##### 2. Inspecting the HTML Code:

- **Right-Click on the Desired Data**: Find the piece of data you want on the page (e.g., the name of a restaurant). Right-click on it and select "Inspect" or "Inspect Element" from the context menu.

Note: Make sure you are inspecting the desktop version of the page, not the mobile version. If the developer tools show the mobile layout, use the dimensions control just above the page preview to switch to the desktop version.

- **Examine the HTML Panel**: Choosing "Inspect" opens the developer tools in your browser and highlights the HTML element that corresponds to the data you selected.
  - **Understanding Tags and Attributes**: HTML is made up of various tags like `<div>`, `<span>`, `<a>`, etc. Each tag can have attributes like `class`, `id`, `href`, etc., that help identify and style it.

##### 3. Finding Patterns:

- **Look for Classes or IDs**: Check whether the element has a distinctive `class` or `id`. A class is usually shared by similar elements on the page (e.g., all product names might be in `<span>` tags with the class `product-title`), while an `id` normally identifies a single element.
- **Navigate Up or Down the HTML Structure**: Sometimes, the data is nested inside another tag. You might need to go one level up or down in the HTML hierarchy to find a container that groups several elements you’re interested in.
- **Identify Repeating Patterns**: If you’re scraping multiple items (like a list of products), find the common pattern in the HTML structure that repeats for each item.
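
As a minimal sketch of the pattern-finding steps above, the snippet below parses a small, made-up HTML fragment and collects every repeating item with BeautifulSoup. The `product` and `product-price` class names and the product data are invented for illustration; `product-title` is the example class mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: each product sits in a repeating <div class="product"> container.
SAMPLE_HTML = """
<div class="product">
  <span class="product-title">Green Tea</span>
  <span class="product-price">$4.50</span>
</div>
<div class="product">
  <span class="product-title">Black Coffee</span>
  <span class="product-price">$3.25</span>
</div>
"""


def extract_products(html):
    """Return one (title, price) tuple per repeating product container."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    # find_all returns every element that matches the tag and class.
    for container in soup.find_all("div", class_="product"):
        # Go one level down from the container to reach the nested spans.
        title = container.find("span", class_="product-title")
        price = container.find("span", class_="product-price")
        products.append((
            title.get_text(strip=True) if title else "",
            price.get_text(strip=True) if price else "",
        ))
    return products


print(extract_products(SAMPLE_HTML))
```

Searching inside each container, rather than collecting all titles and all prices separately, keeps the values of one item together even when an item is missing a field.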

##### 4. Testing Your Selection:

- **Use the Console for Testing**: In the browser's developer tools, you can use the Console to write small JavaScript snippets to see if your selection correctly identifies all desired elements.
- **Copy the Selector**: Once you’ve identified the correct tag and class or id, you can write the selector in your code to target these elements.
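
A CSS selector checked in the Console (for instance with `document.querySelectorAll("li > span.product-title")`) can usually be reused as-is in BeautifulSoup's `select` method. The sketch below assumes a made-up HTML fragment and selector; neither comes from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; only two of the three spans carry the target class.
SAMPLE_HTML = """
<ul>
  <li><span class="product-title">Green Tea</span></li>
  <li><span class="product-title">Black Coffee</span></li>
  <li><span class="sponsored">Advertisement</span></li>
</ul>
"""

# The same selector string you would test in the browser Console.
SELECTOR = "li > span.product-title"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = [tag.get_text(strip=True) for tag in soup.select(SELECTOR)]

# Compare the count and text with what the Console reported for the same selector.
print(len(matches), matches)
```

If the count in Python differs from the count in the browser, the page the script downloaded is probably not the same HTML the browser rendered.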

This process helps you translate what you see on the webpage into code that can reliably extract the information.


# Incomplete Scraper Script for Students

import csv
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # import directly from urllib3 rather than through requests.packages
from requests.exceptions import RequestException
import time

# Function to extract specific words from a given URL
def extract_words_from_url(url):
    try:
        # Create a session object to persist settings across requests
        session = requests.Session()

        # Retry logic to handle transient server errors
        retries = Retry(
            total=5,
            backoff_factor=1,
            status_forcelist=[500, 502, 503, 504]  # retry only on these status codes
        )

        # Mount the retry adapter for both HTTP and HTTPS
        session.mount('http://', HTTPAdapter(max_retries=retries))
        session.mount('https://', HTTPAdapter(max_retries=retries))

        # Mimic browser headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        }

        # Request the URL; the timeout keeps a stalled connection from hanging the script
        response = session.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Initialize dictionary for extracted data
        data = {
            'Business Name': '',
            'Review Rating': '',
            'Number of Reviews': '',
            'Food Type': '',
            'Operational Hours': ''
        }

        # Extract business name
        name_tag = soup.find('h1', class_='y-css-olzveb')
        if name_tag:
            data['Business Name'] = name_tag.text.strip()

        # Extract review score and count
        review_div = soup.find('div', {'data-testid': 'BizHeaderReviewCount'})
        if review_div:
            rating = review_div.find('span', class_='y-css-1jz061g')
            reviews = review_div.find('span', class_='y-css-1cafv3i')
            if rating:
                data['Review Rating'] = rating.text.strip()
            if reviews:
                # Trim whitespace first so the parentheses are at the ends of the string
                data['Number of Reviews'] = reviews.text.strip().strip('()')

        # Extract food/restaurant type
        food_type = soup.find('a', class_='y-css-1x1e1r2')
        if food_type:
            data['Food Type'] = food_type.text.strip()

        # Extract operational hours
        hours = soup.find('span', {'data-font-weight': 'semibold', 'class': 'y-css-1jz061g'})
        if hours:
            data['Operational Hours'] = hours.text.strip()

        return list(data.values())

    except RequestException as e:
        print(f"Error scraping {url}: {e}")
        return [""] * 5

    finally:
        time.sleep(1)  # pause after every request, successful or not, to avoid overwhelming the server

def scrape_urls_from_csv(input_csv, output_csv):
    # # Testing the web scraper with a single URL (to avoid getting IP blocked again LOL)
    # test_url = "https://www.yelp.ca/biz/sushi-sama-montr%C3%A9al-11"
    # extracted_data = extract_words_from_url(test_url)
    # print("Test Results:")
    # print("Business Name:", extracted_data[0])
    # print("Review Rating:", extracted_data[1])
    # print("Number of Reviews:", extracted_data[2])
    # print("Food Type:", extracted_data[3])
    # print("Operational Hours:", extracted_data[4])

    with open(input_csv, mode='r', newline='', encoding='utf-8') as infile, \
            open(output_csv, mode='w', newline='', encoding='utf-8') as outfile:

        # CSV Reader initialization
        reader = csv.DictReader(infile)

        # Fieldnames for the output CSV (reader.fieldnames is None when the input file is empty)
        fieldnames = list(reader.fieldnames or []) + [
            'Business Name',
            'Review Rating',
            'Number of Reviews',
            'Food Type',
            'Operational Hours'
        ]

        # CSV Writer initialization
        writer = csv.DictWriter(outfile, fieldnames=fieldnames) 
        writer.writeheader()

        # Process each row
        for row in reader:
            url = row.get('restaurant_url')
            
            if not url:
                print(f"Skipping row with missing URL: {row}")
                continue

            print(f"Processing URL: {url}")
            
            # Extract business data
            extracted_data = extract_words_from_url(url)
            
            # Add business data to row
            row['Business Name'] = extracted_data[0]
            row['Review Rating'] = extracted_data[1]
            row['Number of Reviews'] = extracted_data[2]
            row['Food Type'] = extracted_data[3]
            row['Operational Hours'] = extracted_data[4]
            
            # Write row once
            writer.writerow(row)
            time.sleep(1)  # delay to avoid overloading server 

if __name__ == "__main__":
    input_csv = "/Users/hari/Documents/INSY 669/Module 1/INSY669_module_1/URLs to scrape.csv"
    output_csv = "/Users/hari/Documents/INSY 669/Module 1/INSY669_module_1/scraped_results.csv"
    # Call the main scraping function
    scrape_urls_from_csv(input_csv, output_csv)


Error scraping https://www.yelp.ca/biz/sushi-sama-montr%C3%A9al-11: 403 Client Error: Forbidden for url: https://www.yelp.ca/biz/sushi-sama-montr%C3%A9al-11
Test Results:
Business Name: 
Review Rating: 
Number of Reviews: 
Food Type: 
Operational Hours: 


### Key Details of the Code

**Session with Retry Logic**:  
This makes the code more robust against intermittent failures. Requests that return a 500, 502, 503, or 504 status are retried up to 5 times with increasing delays (exponential backoff). Other status codes, such as the 403 in the output above, are not retried.
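
To make the idea of exponential backoff concrete, here is a small standalone sketch. It illustrates the general technique (each wait is double the previous one) and is not the exact schedule that urllib3's `Retry` follows, which depends on the library version. The function names and the `sleep` parameter are hypothetical and exist only for this example.

```python
import time


def backoff_delays(retries, backoff_factor):
    """Delays in seconds that double on each attempt: factor * 2**attempt."""
    return [backoff_factor * (2 ** attempt) for attempt in range(retries)]


def call_with_retries(func, retries=5, backoff_factor=1, sleep=time.sleep):
    """Call func(); on failure wait and try again, up to `retries` extra attempts."""
    delays = backoff_delays(retries, backoff_factor)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            sleep(delays[attempt])


print(backoff_delays(5, 1))  # [1, 2, 4, 8, 16]
```

Passing `sleep` as a parameter lets you swap in a function that only records the waits, which is handy for trying the logic without actually pausing.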

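**Browser-Like Headers**:  
The `User-Agent` and `Accept` headers make the request look like it comes from a browser, which keeps some websites from blocking it. This is not a guarantee: the run above still received a `403 Forbidden` response.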

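**HTML Parsing with BeautifulSoup**:  
The code uses `soup.find` to locate specific elements by tag, class, or attribute (for example, the `h1` holding the business name, or the `div` whose `data-testid` is `BizHeaderReviewCount`) and extracts their text. Each result is checked before use, so a missing element leaves its field empty.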
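
The lookups can be tried offline on a small HTML string before pointing them at a live page. The sketch below uses invented markup and invented class and attribute values (`biz-name`, `review-summary`, `rating`, `count`, `hours`); the real page uses different, site-specific names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely shaped like a business page header.
SAMPLE_HTML = """
<h1 class="biz-name">Sushi Example</h1>
<div data-testid="review-summary">
  <span class="rating">4.5</span>
  <span class="count">(120 reviews)</span>
</div>
"""


def text_or_blank(tag):
    """Return the tag's stripped text, or '' when find() returned None."""
    return tag.get_text(strip=True) if tag else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

name = text_or_blank(soup.find("h1", class_="biz-name"))
# Hyphenated attribute names cannot be keyword arguments, so pass them in a dict.
summary = soup.find("div", {"data-testid": "review-summary"})
rating = text_or_blank(summary.find("span", class_="rating")) if summary else ""
count = text_or_blank(summary.find("span", class_="count")) if summary else ""
count = count.strip("()")
# A lookup that matches nothing returns None, which becomes an empty string.
hours = text_or_blank(soup.find("span", class_="hours"))

print([name, rating, count, hours])
```

Wrapping the `None` check in a helper like `text_or_blank` avoids repeating the same `if` test for every field.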

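**CSV Handling**:  
The input CSV is read row by row, the URL in each row's `restaurant_url` column is processed, and the row is written to a new CSV with five additional columns (`Business Name`, `Review Rating`, `Number of Reviews`, `Food Type`, `Operational Hours`). Rows without a URL are skipped.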
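
The read, extend, and write pattern can be tested without touching any files by using in-memory text streams. In this sketch the input data, the two-column subset, and the `fake_extract` function are assumptions made so the example runs offline; only the `restaurant_url` column name comes from the script.

```python
import csv
import io

# Hypothetical input; the second row has no URL.
INPUT_CSV = "id,restaurant_url\n1,https://example.com/a\n2,\n"
NEW_COLUMNS = ["Business Name", "Review Rating"]


def fake_extract(url):
    """Stand-in for the real scraper so the example needs no network access."""
    return ["Example Bistro", "4.5"]


def add_columns(infile, outfile):
    """Copy rows from infile to outfile, appending the scraped columns."""
    reader = csv.DictReader(infile)
    # Keep the original columns and append the new ones.
    fieldnames = list(reader.fieldnames or []) + NEW_COLUMNS
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        url = row.get("restaurant_url")
        if not url:
            continue  # skip rows without a URL, as the script does
        row.update(zip(NEW_COLUMNS, fake_extract(url)))
        writer.writerow(row)


out = io.StringIO()
add_columns(io.StringIO(INPUT_CSV), out)
print(out.getvalue())
```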

**Error Handling**:  
If a request fails, the error is printed, five empty values are returned for that row, and the script moves to the next URL, so one failure doesn't stop the entire process.
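
The same "log it and keep going" behaviour can be shown in isolation. In the sketch below, `FetchError` is a hypothetical stand-in for `requests.exceptions.RequestException`, and `fake_fetch` and its return values are invented so the example runs without network access.

```python
class FetchError(Exception):
    """Hypothetical stand-in for requests.exceptions.RequestException."""


def fake_fetch(url):
    """Pretend scraper: fails for one URL and succeeds for the rest."""
    if "blocked" in url:
        raise FetchError(f"403 Client Error: Forbidden for url: {url}")
    return ["Example Bistro", "4.5", "120", "Sushi", "Open"]


def scrape_all(urls, fetch, n_fields=5):
    """Scrape every URL; a failure yields blank fields instead of stopping the loop."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except FetchError as exc:
            print(f"Error scraping {url}: {exc}")
            results[url] = [""] * n_fields
    return results


results = scrape_all(
    ["https://example.com/blocked", "https://example.com/ok"],
    fake_fetch,
)
print(results)
```

Catching one specific exception type, rather than every exception, means programming mistakes such as a typo in a variable name still surface immediately.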

**Delay Between Requests**:  
The script calls `time.sleep(1)` after each request (in the `finally` block) and again after each row is written, so it pauses about two seconds between URLs. Both values can be adjusted to avoid overloading a server or getting IP blocked.
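
One way to keep the delay adjustable is to put it in a small helper. The sketch below is an optional variation, not part of the original script: `polite_sleep`, its parameters, and the random extra wait (jitter) are all assumptions made for illustration.

```python
import random
import time


def polite_sleep(base_delay=1.0, jitter=0.5, sleep=time.sleep):
    """Pause for base_delay plus a random extra of up to `jitter` seconds."""
    delay = base_delay + random.uniform(0, jitter)
    sleep(delay)
    return delay


# Short delays here only so the demonstration finishes quickly.
for url in ["https://example.com/a", "https://example.com/b"]:
    # ... the request for `url` would go here ...
    waited = polite_sleep(base_delay=0.1, jitter=0.1)
    print(f"Waited {waited:.2f}s after {url}")
```

Raising `base_delay` slows the scraper down for sites that block quickly; setting `jitter` to `0` gives a fixed pause.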
