## Agenda (25/03/23)

- Discuss basics of web scraping and address Part 1 in the assignment using BeautifulSoup.

### Notes:

- Selenium comes in handy when we have to simulate clicks or enter text on web pages. It is not required for either question.
- No plug-ins are required. We will use built-in Chrome functionality to get the job done.
- Need to install the BeautifulSoup package.
- Documentation can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### A few myths about web scraping:

1. Web scraping involves extensive coding and is tough to do.
2. Prior knowledge of HTML is required to scrape web data.
3. BeautifulSoup has a lot of commands and I cannot remember them all.


### My opinion: 

1. Not at all. Web scraping is divided between coding and non-coding components. However, basic knowledge of Python (lists, dictionaries, loops, dataframes, functions) is required.
2. While HTML is not a prerequisite, having a very basic knowledge of it helps. You may find this link useful: https://betterprogramming.pub/understanding-html-basics-for-web-scraping-ae351ee0b3f9
3. We can get the job done using just one command in Soup!

### Notes:

a) The difference between **.content** and **.text** on a **requests** response object:
- .content returns the raw content in bytes. Preferred for binary file types such as a PDF or an image file.
- .text returns the content decoded into a Unicode string. Generally preferred for textual responses such as an HTML or XML document.
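
A small offline sketch of how the two attributes relate. It uses plain `bytes` and `str` values as stand-ins for `response.content` and `response.text` (the variable names are illustrative and no request is made), and it assumes the server sends UTF-8.

```python
# Stand-in for response.content: the raw bytes as they arrive over the network.
raw_bytes = 'Table 18\u2033 round'.encode('utf-8')

# Stand-in for response.text: the same bytes decoded into a Unicode string.
decoded_text = raw_bytes.decode('utf-8')

print(type(raw_bytes).__name__)     # bytes
print(type(decoded_text).__name__)  # str

# The byte count and the character count differ because the double-prime
# character (U+2033) takes three bytes in UTF-8.
print(len(raw_bytes), len(decoded_text))  # 17 15
```

For an HTML page, the decoded string is usually what gets passed to the parser; the raw bytes are what you would write to disk when downloading an image.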

b) The differences between **.find()** and **.find_all()** methods in **BeautifulSoup** are:
- find() returns the first _tag_ that matches the condition. find_all() gives a _list_ of all _tags_ that match the condition.
- The output type of find() is bs4.element.Tag, while the output of find_all() is a ResultSet. For all practical purposes, you can think of the ResultSet as a list of tags.
- Access the individual elements of a ResultSet using list indices.
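
The difference is easiest to see on a tiny, made-up HTML snippet. The sketch below runs offline; the markup and class name are invented for illustration, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item">Desk</li>
  <li class="item">Chair</li>
  <li class="item">Lamp</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li', class_='item')          # a single Tag
all_items = soup.find_all('li', class_='item')  # a ResultSet, which behaves like a list

print(type(first).__name__)      # Tag
print(type(all_items).__name__)  # ResultSet
print(first.get_text())          # Desk
print(all_items[1].get_text())   # Chair
print(len(all_items))            # 3

# When nothing matches, find() returns None and find_all() returns an empty ResultSet.
print(soup.find('table'))           # None
print(len(soup.find_all('table')))  # 0
```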

c) CSV package documentation link: https://docs.python.org/3/library/csv.html


## Methodology:

- a) Open the URL and get the underlying HTML code.
- b) Use the "blue box" approach to zero in on the box/block of code that is relevant to you.
- c) Perform data cleaning operations, if needed.
- d) Save the output in a required format.
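
The four steps can be sketched end to end on a small inline HTML string, so that the example runs without a network connection. The markup, class names, and column names here are invented for illustration; the real page uses different ones. The output goes to an in-memory buffer rather than a file.

```python
import csv
import io
from bs4 import BeautifulSoup

# a) Stand-in for the HTML that requests.get(...).text would return.
html = """
<div id="ads"><div class="card"><span class="name">Sponsored</span></div></div>
<ul class="results">
  <li class="card"><span class="name"> Oak table </span><span class="price">$55.87</span></li>
  <li class="card"><span class="name">Desk</span><span class="price">$42.66</span></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# b) Narrow the search to the block that holds the results (the "blue box").
blue_box = soup.find('ul', class_='results')
cards = blue_box.find_all('li', class_='card')

# c) Clean each field and d) write the rows as tab-separated values.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(['Product Name', 'Price'])
for card in cards:
    name = card.find('span', class_='name').get_text().strip()
    price = card.find('span', class_='price').get_text().strip()
    writer.writerow([name, price])

print(buffer.getvalue())
```

The "Sponsored" entry sits outside the blue box, so it never reaches the output.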

## Code

First, install the BeautifulSoup package

In [1]:
#!pip install lxml

!pip install beautifulsoup4



In [2]:
# Import the necessary packages

import requests
from bs4 import BeautifulSoup
import csv

Pass the following headers so that the request mimics a web browser

In [3]:
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}


page = requests.get('https://www.ebay.com/sch/i.html?_nkw=table&_pgn=2',headers=hdr)
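
The URL above carries the search term in `_nkw` and the results page in `_pgn`. A minimal sketch of building such URLs with the standard library follows; `search_url` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ebay.com/sch/i.html'

def search_url(keyword, page):
    """Return a search URL for the given keyword and results page."""
    query = urlencode({'_nkw': keyword, '_pgn': page})
    return f'{BASE_URL}?{query}'

print(search_url('table', 2))
# https://www.ebay.com/sch/i.html?_nkw=table&_pgn=2

# Spaces in the keyword are encoded automatically.
print(search_url('coffee table', 3))
# https://www.ebay.com/sch/i.html?_nkw=coffee+table&_pgn=3
```

`requests.get` also accepts a `params=` dictionary and performs the same encoding, which avoids assembling the query string by hand.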

In [4]:
#Check whether or not the URL was loaded successfully 

if page.status_code == 200:
    print("Page loaded Successfully ")
else:
    print("Error Loading page")

Page loaded Successfully 


### The next step is to get the underlying HTML code using a parser. 

In [5]:
webpage = BeautifulSoup(page.text,'html')

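# 'html' lets BeautifulSoup pick the best HTML parser that is installed.
# Alternative ways of writing this command:
# webpage = BeautifulSoup(page.text,'html.parser')  # names Python's built-in parser explicitly
# webpage = BeautifulSoup(page.text)                # works, but bs4 warns that it had to guess the parser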


#Run this command to view the output in an indented form
#print(webpage.prettify())


### **Next, find a good starting point. Remember, we don't need to access all of the HTML code but get only that part which is of relevance to us. We called this the blue box in the session.**

In [6]:
bluebox = webpage.find('ul',class_='srp-results srp-list clearfix' )

### Now, the blue box may contain some additional information along with what we are looking for. We need to find a way to exclude such information. Another way of saying this is to include ONLY the info we require.

In [7]:
products = bluebox.find_all('div',class_='s-item__wrapper clearfix')
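
One detail worth knowing when passing several class names in a single string: BeautifulSoup compares that string against the exact value of the `class` attribute, so the order matters. The sketch below shows this on a one-line stand-in for the results list (the HTML is a simplified assumption, not the real page), along with a `None` check before chaining further calls.

```python
from bs4 import BeautifulSoup

html = ('<ul class="srp-results srp-list clearfix">'
        '<li><div class="s-item__wrapper clearfix">A</div></li></ul>')
soup = BeautifulSoup(html, 'html.parser')

# Exact class string, in the same order as in the HTML: matches.
exact = soup.find('ul', class_='srp-results srp-list clearfix')

# Same classes in a different order: no match, so find() returns None.
reordered = soup.find('ul', class_='clearfix srp-results srp-list')

# A single class name matches regardless of the other classes on the tag.
single = soup.find('ul', class_='srp-results')

# A CSS selector matches the listed classes in any order.
by_selector = soup.select_one('ul.clearfix.srp-results')

print(exact is not None, reordered is None, single is not None, by_selector is not None)

# Check for None before calling find_all(), in case the page layout has changed.
if exact is None:
    products = []
    print('Results list not found')
else:
    products = exact.find_all('div', class_='s-item__wrapper clearfix')
print(len(products))  # 1
```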

### At this step we have all of the product information with us. We will have to loop through every product to extract the necessary fields. Finally, the output must be saved as a .tsv file.

In [8]:
with open('Second_Page.tsv','w',newline='',encoding='utf-8') as file:
    headers = ['Product Name','Price','Shipping','Number of Watchers','Condition','Image URL','Product URL']

    # Create the writer object
    writer = csv.writer(file,delimiter='\t')

    # First, add the column headers to the file
    writer.writerow(headers)

    for product in products:

        name = product.find('div',class_='s-item__title').get_text()

        # Store the link itself (the href attribute), not the whole <a> tag
        product_url = product.find('a')['href']

        price = product.find('span',class_='s-item__price').get_text()

        # Shipping and watcher details are missing for some products
        shipping = product.find('span',class_='s-item__shipping s-item__logisticsCost')

        if shipping:
            shipping_cost = shipping.get_text()
        else:
            shipping_cost = "No shipping info found"

        watchers = product.find('span',class_='s-item__dynamic s-item__watchCountTotal')

        if watchers:
            watch = watchers.get_text()
        else:
            watch = "No watchers info"

        condition = product.find('div',class_='s-item__subtitle').get_text()

        image_url = product.find('img')['src']

        row = [name,price,shipping_cost,watch,condition,image_url,product_url]

        writer.writerow(row)

# Note: without encoding='utf-8' in open(), this cell failed under the platform's default encoding with
# UnicodeEncodeError: 'charmap' codec can't encode character '\u2033' in position 13: character maps to <undefined>
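
The `'charmap' codec can't encode character '\u2033'` message most likely means the file was opened with a legacy single-byte default encoding, which has no slot for the double-prime (inch) character found in some product titles. The sketch below reproduces the problem with `cp1252` (an assumption about which code page was in use) and shows that an explicit `encoding='utf-8'` avoids it. The row contents and the temporary file name are made up.

```python
import csv
import os
import tempfile

row = ['Side table 18\u2033', '$399.00']

# A legacy single-byte code page cannot represent the double-prime character.
try:
    row[0].encode('cp1252')
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False
print(legacy_ok)  # False

# Passing encoding='utf-8' makes the result independent of the platform default.
path = os.path.join(tempfile.mkdtemp(), 'demo.tsv')
with open(path, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter='\t')
    writer.writerow(['Product Name', 'Price'])
    writer.writerow(row)

# Read the file back with the same encoding to confirm the round trip.
with open(path, newline='', encoding='utf-8') as file:
    rows = list(csv.reader(file, delimiter='\t'))
print(rows[1])
```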

In [None]:
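# NOTE: the opening lines of this cell are missing from the source. The file name
# and headers below are assumptions chosen to match the row written further down.
with open('First_Page.tsv', 'w', newline='', encoding='utf-8') as file:
    headers = ['Product Name', 'Price', 'Shipping', 'Number of Watchers', 'Condition']

    # Create the writer object
    writer = csv.writer(file, delimiter='\t')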
    
    # First, add the column header to the file
    writer.writerow(headers)
    
    for products in inner_table:
        Product_Name = products.find('div', class_='s-item__title').get_text().strip()
        
        Price = products.find('span', class_='s-item__price').get_text().strip()
        
        Shipping_cost = products.find('span', class_='s-item__shipping s-item__logisticsCost')
        if Shipping_cost:
            Shipping_cost = Shipping_cost.get_text().strip()
        else:
            Shipping_cost = 'Shipping cost not available for this product'
        
        Number_of_Watchers = products.find('span', class_='s-item__watchers s-item__bidCount')
        if Number_of_Watchers:
            Number_of_Watchers = Number_of_Watchers.get_text().strip()
        else:
            Number_of_Watchers = 'No watchers for this product'
        
        Product_condition = products.find('div', class_='s-item__subtitle').get_text().strip()
        if 'Brand New' in Product_condition or 'Open Box' in Product_condition or 'Pre-Owned' in Product_condition:
            row = [Product_Name, Price, Shipping_cost, Number_of_Watchers, Product_condition]
            writer.writerow(row)
        else:
            print('Product condition not specified')
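
The find-then-check pattern used for shipping and watchers repeats for every optional field, so it can be factored into a small helper. This is only a sketch: `text_or_default` is a hypothetical name, and the HTML is a one-line stand-in for a product block.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, tag_name, class_name, default):
    """Return the stripped text of the first matching tag, or default if absent."""
    tag = parent.find(tag_name, class_=class_name)
    return tag.get_text().strip() if tag is not None else default

html = '<div><span class="s-item__price"> $59.99 </span></div>'
product = BeautifulSoup(html, 'html.parser').div

print(text_or_default(product, 'span', 's-item__price', 'No price found'))
# $59.99
print(text_or_default(product, 'span', 's-item__shipping', 'No shipping info found'))
# No shipping info found
```

The same helper would also protect the name and condition lookups, which call `.get_text()` directly on the result of `find()` and raise an `AttributeError` when the tag is missing.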

In [2]:



#Importing libraries
import numpy as np
import pandas as pd
import requests
import csv
from bs4 import BeautifulSoup as bs


In [3]:
#Loading the given website
Website=requests.get('https://www.ebay.com/sch/i.html?_nkw=table')
print(Website) # If the response is 200 then it means that our given website has loaded 

<Response [200]>


In [4]:
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}


page = requests.get('https://www.ebay.com/sch/i.html?_nkw=table&_pgn=2',headers=hdr)

In [5]:
site = bs(Website.text,'lxml') # Parse the HTML with BeautifulSoup, using the lxml parser
site

<!DOCTYPE html>
<!--[if IE 9]><html class="ie9" lang="en"><![endif]--><!--[if gt IE 9]><!--><html lang="en"><!--<![endif]--><!--M#s0-1--><body><noscript class="x-page-config"></noscript><!--M/--><meta content="IE=edge" http-equiv="X-UA-Compatible"/><script>"use strict";if(window.PerformanceObserver&&performance&&performance.mark&&performance.getEntriesByName){window.SRP=window.SRP||{};var paintObserver=new window.PerformanceObserver(function(e){var r=e.getEntries();if(r.sort(function(e,r){return e.startTime-r.startTime}),r&&!(r.length<2)){var n=r[1].startTime;window.SRP.TTI_TIMER={lastInteractiveWindow:n};var t=new window.PerformanceObserver(function(e){for(var r=e.getEntries(),i=0,a=r.length;i<a;i++)r[i].startTime-n>=5e3&&(window.SRP.TTI_TIMER.timeToInteract=n,t.disconnect()),n=r[i].startTime+r[i].duration,window.SRP.TTI_TIMER.lastInteractiveWindow=n});t.observe({entryTypes:["longtask"]}),paintObserver.disconnect()}});paintObserver.observe({entryTypes:["paint"]})};</script><title>tabl

In [6]:
# Find the unordered list (the "blue box") that holds the search results, selected by its class
table = site.find('ul', class_="srp-results srp-list clearfix")
# Find every block inside it that follows the same pattern and holds the information about one product
inner_table = table.find_all('div', class_='s-item__info clearfix')

In [8]:
for products in inner_table:
    Product_Name = products.find('div', class_='s-item__title').get_text()
    print(Product_Name)
    
    Price = products.find('span',class_='s-item__price').get_text()
    print(Price)
    
    Shipping_cost = products.find('span',class_='s-item__shipping s-item__logisticsCost')
    if Shipping_cost:
        print(Shipping_cost.get_text())
    else:
        print('Shipping cost not available for this product')
        

    Image_URL = products.find('img')
    if Image_URL is not None:
        print(Image_URL.get('src'))

    Number_of_Watchers = products.find('span',class_='s-item__dynamic s-item__watchCountTotal')
    if Number_of_Watchers:
        print(Number_of_Watchers.get_text())
    else:
        print('No Watchers for this product')
        
    product_condition = products.find('div', class_='s-item__subtitle').get_text()
    if product_condition in {'Brand New', 'Open Box', 'Pre-Owned'}:
        print(product_condition)
    else:
        print('Product condition not specified')
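
The two versions of the condition check behave differently: the earlier cell tests whether a known condition appears anywhere in the subtitle, while the loop above requires the subtitle to equal one of the conditions exactly. A sketch of the more tolerant variant follows. `extract_condition` and `KNOWN_CONDITIONS` are hypothetical names, and the decorated subtitle is an invented example rather than text taken from the page.

```python
KNOWN_CONDITIONS = ('Brand New', 'Open Box', 'Pre-Owned')

def extract_condition(subtitle):
    """Return the first known condition contained in subtitle, or None."""
    cleaned = subtitle.strip()
    for condition in KNOWN_CONDITIONS:
        if condition in cleaned:
            return condition
    return None

print(extract_condition('Brand New'))             # Brand New
print(extract_condition('  Pre-Owned \n'))        # Pre-Owned
print(extract_condition('Open Box | Unbranded'))  # Open Box (invented subtitle)
print(extract_condition('Refurbished'))           # None

# Exact set membership rejects the padded variant.
print('  Pre-Owned \n' in set(KNOWN_CONDITIONS))  # False
```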


Malachite Round Top Coffee Table Top Semi Precious Mosaic Art Inlay Home Decor
$333.00 to $2,300.40
Free shipping
No Watchers for this product
Brand New
Round[360-DEGREE SWIVEL]Counter Pub Bar Table Adjustable Height Wooden Tabletop
$59.99
+$204.06 shipping
No Watchers for this product
Brand New
New Listing18" Agate Side End Table Top Natural stones Handmade Work
$399.00
Free shipping
No Watchers for this product
Brand New
New Listing18" Mix Agate Side Table Top Natural stones Handmade Work
$399.00
Free shipping
No Watchers for this product
Brand New
Industrial Style Porch Table Single Layer Black Oak Triamine Board [105 * 30 * 7
$55.87
+$3.99 shipping
17 watchers
Brand New
Industrial Style Porch Table Single Layer Light Walnut Color Triamine Board
$55.63
+$3.99 shipping
13 watchers
Brand New
66" L-Shaped Gaming Desk Corner Computer Desk PC Laptop Study Table Workstation
$42.66
Free International Shipping
No Watchers for this product
Brand New
Million Dollar Cube Side Table - With Free