**Lab Exercise 1. Scraping Static Websites**

**Task Description**

Scrape the information about the products on the following page: https://clevershop.mk/product-category/mobilni-laptopi-i-tableti/

**Import libraries**

In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

**Scraping**

In [9]:
baseURL = "https://clevershop.mk/product-category/mobilni-laptopi-i-tableti"

**Send an HTTP request to fetch the page, then create a BeautifulSoup object to parse its content before selecting elements from it**

In [11]:
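odgovor = requests.get(baseURL)
if odgovor.status_code == 200:
    soupObject = BeautifulSoup(odgovor.text, 'html.parser')

    #every product card on the page carries the "product" class
    produkti = soupObject.select('.product')
    #keep the first product to test the extraction function on it
    edenProdukt = produkti[0]

    print(edenProdukt)
else:
    print("Failed to retrieve page")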

<div class="product-grid-item product wd-hover-standard col-lg-3 col-md-3 col-6 first type-product post-21494 status-publish instock product_cat-laptopi has-post-thumbnail taxable shipping-taxable purchasable product-type-simple" data-id="21494" data-loop="1"><div class="berocket_better_labels berocket_better_labels_image"><div class="berocket_better_labels_position berocket_better_labels_position_right"><div class="berocket_better_labels_line berocket_better_labels_line_1"><div class="berocket_better_labels_inline berocket_better_labels_inline_1"><div class="br_alabel br_alabel_type_text br_alabel_template_type_image berocket_alabel_id_14967 berocket-label-user-image br_alabel_better_compatibility" style=""><span class="berocket-label-user-image" style=" background: transparent url('https://clevershop.mk/wp-content/uploads/2023/05/majska-popust.png') no-repeat right top/contain;"><i class="template-span-before" style="background-color: #f16543; border-color: #f16543;"></i><i class="te

In [13]:
def extractingToDict(produkt):
    title = produkt.select_one('.wd-entities-title').text.strip()

    #Products on sale show the original price crossed out inside a <del> tag,
    #so look there first for the regular price
    regularPrice = produkt.select_one('span.price del span.woocommerce-Price-amount bdi')

    #No <del> tag means the product has no discount: the only price shown is the regular one
    if regularPrice is None:
        regularPrice = produkt.select_one('span.price span.woocommerce-Price-amount bdi')

    #Convert the element to text only if one of the two selectors found a price
    if regularPrice is not None:
        regularPrice = regularPrice.text.strip()

    #Check for discount price, if available, inside the <ins> tag
    discountPrice = produkt.select_one('span.price ins span.woocommerce-Price-amount bdi')
    if discountPrice is not None:
        discountPrice = discountPrice.text.strip()
        
    URLtoProduct = produkt.select_one('h3.wd-entities-title a')
    if URLtoProduct is not None:
        URLtoProduct = URLtoProduct.get("href")
        
    cartURL = produkt.select_one('div.wd-add-btn-replace a')
    if cartURL is not None:
        cartURL = cartURL.get("href")

    #Collect all extracted values in a dictionary
    produktDict = {
        "title": title,
        "regular price": regularPrice,
        "discount price": discountPrice,
        "url to product" : URLtoProduct,
        "add to cart" : cartURL
    }

    #Return the dictionary; scrapingMultiplePages builds the DataFrame from these dictionaries
    return produktDict
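
The `<del>`/`<ins>` logic above can be tried offline on a small piece of markup. The sketch below is illustrative only: the HTML is a hypothetical, heavily simplified version of a WooCommerce price block (the real page has many more wrappers), and `read_prices` is a helper name introduced here, not part of the lab code.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one product on sale, one at regular price.
html = """
<div class="product" id="discounted">
  <span class="price">
    <del><span class="woocommerce-Price-amount"><bdi>18.999 ден</bdi></span></del>
    <ins><span class="woocommerce-Price-amount"><bdi>15.999 ден</bdi></span></ins>
  </span>
</div>
<div class="product" id="regular">
  <span class="price">
    <span class="woocommerce-Price-amount"><bdi>17.590 ден</bdi></span>
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def read_prices(product):
    """Return (regular, discount) price strings; discount is None if absent."""
    # The crossed-out price in <del> is the regular price of a product on sale.
    regular = product.select_one("span.price del span.woocommerce-Price-amount bdi")
    if regular is None:
        # No <del>: the single price shown is the regular price.
        regular = product.select_one("span.price span.woocommerce-Price-amount bdi")
    # The <ins> tag, when present, holds the discounted price.
    discount = product.select_one("span.price ins span.woocommerce-Price-amount bdi")
    regular_text = regular.text.strip() if regular is not None else None
    discount_text = discount.text.strip() if discount is not None else None
    return regular_text, discount_text


prices = {p["id"]: read_prices(p) for p in soup.select(".product")}
print(prices)
```

Checking for `<del>` first matters: on a discounted product the fallback selector would also match, and it would return whichever price appears first in the markup.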

In [15]:
extractingToDict(edenProdukt)

{'title': 'Acer A315-23-A7KD',
 'regular price': '17.590\xa0ден',
 'discount price': None,
 'url to product': 'https://clevershop.mk/product/acer-a315-23-a7kd/',
 'add to cart': '?add-to-cart=21494'}
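
The output shows two things that may need post-processing before analysis: the price is a string containing a non-breaking space (`\xa0`) and the currency label, and the add-to-cart link is relative. A minimal sketch of both clean-up steps follows. `parse_price` is a hypothetical helper, and it assumes the dot is a thousands separator and that prices have no decimal part.

```python
from urllib.parse import urljoin


def parse_price(text):
    """Convert a scraped price string to an integer number of denars.

    Assumption: the dot is a thousands separator and there are no decimals,
    so keeping only the digits gives the amount.
    """
    if text is None:
        return None
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None


# The relative link was found on the category page, so resolve it against that page.
listing_url = "https://clevershop.mk/product-category/mobilni-laptopi-i-tableti/"
cart_url = urljoin(listing_url, "?add-to-cart=21494")

print(parse_price("17.590\xa0ден"))
print(parse_price(None))
print(cart_url)
```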

**Collecting products from multiple pages**

**--> Explanation for .format()**

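.format(pageNumber) replaces a {} placeholder in a string with the current value of pageNumber. The placeholder must be written as {} with no space between the braces, since "{ }".format(1) raises a KeyError. If the string contains no placeholder, as is the case for baseURL defined above, .format() returns it unchanged.

example: "https://example.com/products/page/{}".format(2) gives "https://example.com/products/page/2"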
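
As a quick illustration of the placeholder behaviour, the sketch below builds a list of page URLs from a template. The template and the `/page/{}/` suffix are hypothetical and stand in for whatever pagination pattern the target site uses.

```python
# Hypothetical template; "/page/{}/" is an assumed pagination pattern.
page_template = "https://example.com/products/page/{}/"

# One URL per page number, with {} replaced by the number.
urls = [page_template.format(n) for n in range(1, 4)]
for url in urls:
    print(url)

# A string without a placeholder is returned unchanged by .format().
no_placeholder = "https://example.com/products"
unchanged = no_placeholder.format(2)
print(unchanged)
```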

In [17]:
def scrapingMultiplePages(firstPage, lastPage):
    #created to store the data for all products across the pages
    productsList = []

    #loop that iterates over each page in the specified range
    for pageNumber in range(firstPage, lastPage+1):
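        #baseURL has no {} placeholder, so append the page path before formatting;
        #"/page/<number>/" is assumed to be the site's pagination pattern
        URLofThisPage = (baseURL + "/page/{}/").format(pageNumber)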
        #Send HTTP GET request to fetch the HTML content of each page
        odgovor = requests.get(URLofThisPage)
        
        if odgovor.status_code == 200:
            soup = BeautifulSoup(odgovor.text, 'html.parser') #parses the HTML content for easy element selection
            produkti = soup.select('.product-grid-item')
            
            for edenProdukt in produkti:
                produktDict = extractingToDict(edenProdukt)
                productsList.append(produktDict)
        else:
            print(f"Failed to retrieve page {pageNumber}")

    return pd.DataFrame(productsList)

**Running the scraping function**

In [21]:
df = scrapingMultiplePages(1,3)
df.head(10)

| | title | regular price | discount price | url to product | add to cart |
|---|---|---|---|---|---|
| 0 | Acer A315-23-A7KD | 17.590 ден | | https://clevershop.mk/product/acer-a315-23-a7kd/ | ?add-to-cart=21494 |
| 1 | Acer A315-23-R5P2 | 27.490 ден | | https://clevershop.mk/product/acer-a315-23-r5p2/ | ?add-to-cart=21510 |
| 2 | ACER Aspire 1 A115-22 | 18.999 ден | 15.999 ден | https://clevershop.mk/product/acer-aspire-1-nx... | ?add-to-cart=20826 |
| 3 | Acer Aspire 3 A315-23-R26A | 29.990 ден | | https://clevershop.mk/product/acer-aspire-3-a3... | ?add-to-cart=21516 |
| 4 | Acer Aspire 3 A315-58-33WK | 24.490 ден | | https://clevershop.mk/product/21498/ | ?add-to-cart=21498 |
| 5 | Acer Aspire 3 A315-58-33WK | 25.990 ден | | https://clevershop.mk/product/acer-aspire-3-a3... | ?add-to-cart=21506 |
| 6 | ACER Aspire 5 (A515-56-35KA) | 23.990 ден | | https://clevershop.mk/product/acer-aspire-5-a5... | ?add-to-cart=21693 |
| 7 | ACER ASPIRE 5 A515-45 | 29.990 ден | | https://clevershop.mk/product/acer-aspire-5-a5... | ?add-to-cart=21523 |
| 8 | Acer Aspire 5 A515-45-R07Y Silver | 24.990 ден | | https://clevershop.mk/product/acer-aspire-5-a5... | ?add-to-cart=21501 |
| 9 | Acer Aspire 5 A515-45-R1FG | 36.990 ден | | https://clevershop.mk/product/acer-aspire-5-a5... | ?add-to-cart=18324 |


**Saving the dataframe as .csv**

In [23]:
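#index=False keeps the row index out of the file, so no "Unnamed: 0" column appears when it is read back
df.to_csv("vnp-lab1.csv", index=False)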
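
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below is a small round trip on a hypothetical one-row sample (not the scraped data), written to a temporary directory, to show what `index=False` changes and how a missing discount price comes back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row sample in the same shape as the scraped records.
sample = pd.DataFrame([
    {
        "title": "Acer A315-23-A7KD",
        "regular price": "17.590 ден",
        "discount price": None,
    }
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # index=False leaves the row index out of the file.
    sample.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

print(restored.columns.tolist())
print(restored)
```

A missing discount price is written as an empty field and read back as NaN, so checks on the restored data should use `pd.isna` rather than a comparison with `None`.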