<a href="https://colab.research.google.com/github/masadlara/AmazonWebScraping/blob/main/JOR_Mall_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#JOR Mall Web Scraping

##1. Notebook Description

This notebook presents a web scraping project focused on extracting product title and price information from JOR Mall, an online shopping website. The objective is to collect data for a specific product and analyze its pricing over time.

The notebook begins by introducing the JOR Mall website and the target product. It then demonstrates step-by-step how to scrape the website using Python and BeautifulSoup, a popular web scraping library. The code includes establishing a connection to the website, sending HTTP requests, parsing the HTML content, and extracting relevant information.

The scraping process specifically focuses on retrieving the product title and price. The notebook showcases the identification of the HTML elements containing the desired data, such as the product title and price tags. It demonstrates how to navigate the HTML structure using BeautifulSoup's methods to extract the required information accurately.

Overall, this notebook serves as a guide to scraping the JOR Mall website: extracting the product title and price, exporting the scraped data to a CSV file, and automating the price check so that it runs every day.

##2. Introduction
JOR Mall is an online shopping platform that offers a wide range of products across various categories. It provides customers with a convenient and user-friendly shopping experience, allowing them to explore and purchase products from the comfort of their homes.

One of the featured products on JOR Mall is the Samsung 55" QLED Smart 4K TV (2020) with the model number QA55Q95TAUXTW. This television model offers advanced features and high-quality visuals, making it an attractive choice for those seeking an immersive viewing experience.

The Samsung 55" QLED Smart 4K TV boasts a QLED display technology, which ensures vibrant and lifelike colors with enhanced contrast. It supports a 4K resolution, delivering sharp and detailed images. Additionally, as a smart TV, it comes equipped with built-in Wi-Fi and various connectivity options, allowing users to access online streaming services, browse the web, and connect to other devices effortlessly.

With its sleek design and slim profile, the Samsung 55" QLED Smart 4K TV adds a touch of elegance to any living space. It offers a range of features and functionalities, including voice control, multi-view capabilities, and a variety of pre-installed apps.

As a flagship model from Samsung's TV lineup, the QA55Q95TAUXTW is designed to provide an exceptional viewing experience, whether you're watching movies, playing games, or enjoying your favorite TV shows. With its combination of cutting-edge technology and stylish design, this TV aims to meet the demands of discerning customers who seek both performance and aesthetics.

Overall, the Samsung 55" QLED Smart 4K TV available on JOR Mall is a top-tier television model that caters to individuals who value superior picture quality, smart features, and a visually appealing design. It offers a compelling option for those looking to upgrade their home entertainment system and enjoy an immersive viewing experience.

##3. Load Needed Libraries

This section loads the libraries needed for this case study.

In [55]:
#import libraries
from bs4 import BeautifulSoup
import requests
import time
import datetime

import smtplib # for sending emails to yourself

##4. Connect to Website

In [56]:
#Connect to Jormall Website
URL = 'https://jormall.net/collections/electronic/products/samsung-55-qled-smart-4k-tv-2020-qa55q95tauxtw'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
}

page = requests.get(URL, headers=headers)
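
A request can fail or return an error page, in which case the later parsing steps would run on the wrong HTML. A minimal sketch of a more defensive fetch follows. `fetch_html` and its `get` parameter are hypothetical helpers introduced here for illustration, and the 10-second timeout is an arbitrary assumption.

```python
import requests

def fetch_html(url, headers, timeout=10, get=requests.get):
    """Return the page HTML as text, or None if the request fails.

    `get` defaults to requests.get; it is a parameter only so the function
    can be exercised without network access (an illustrative choice).
    """
    try:
        response = get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    except requests.RequestException as exc:
        print('Request failed:', exc)
        return None
    return response.text
```

It could be called as `html = fetch_html(URL, headers)` before building the `BeautifulSoup` object, skipping the parsing step when the result is `None`.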

##5. Show HTML Code of Website

In [57]:
#this shows the html code of the webpage
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<!-- Global site tag (gtag.js) - Google Ads: 798571495 -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=AW-798571495"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'AW-798571495');
</script>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, height=device-height, minimum-scale=1.0, maximum-scale=1.0" name="viewport"/>
<meta content="#00aeef" name="theme-color"/><title>Samsung - 55" QLED Smart 4K Tv (2020) – JorMall
</title><meta content='- Series: 9 Display: - Screen Size: 65" - Resolution: 3,840 x 2,160 - Moth Eye: Yes Smart Service: - Samsung SMART TV: Smart - Operating System: Tizen™ - Bixby: US English, Korean, UK English, French, German, Italian, Spanish, India English (features vary by language) - Far-Field Vo

In [58]:
#this code shows the html code of the webpage in a better format
soup1= BeautifulSoup(soup.prettify(),"html.parser")
print(soup1)

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<!-- Global site tag (gtag.js) - Google Ads: 798571495 -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=AW-798571495">
</script>
<script>
   window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'AW-798571495');
  </script>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, height=device-height, minimum-scale=1.0, maximum-scale=1.0" name="viewport"/>
<meta content="#00aeef" name="theme-color"/>
<title>
   Samsung - 55" QLED Smart 4K Tv (2020) – JorMall
  </title>
<meta content='- Series: 9 Display: - Screen Size: 65" - Resolution: 3,840 x 2,160 - Moth Eye: Yes Smart Service: - Samsung SMART TV: Smart - Operating System: Tizen™ - Bixby: US English, Korean, UK English, French, German, Italian, Spanish, India English (features vary by language) - 

##6. Scrape Product Title & Product Price from Website

In [59]:
#product title html code (hard-coded snippet):
title_html = '<h1 class="product-meta__title heading h1">Samsung - 55" QLED Smart 4K Tv (2020)</h1>'

#extract the product title from an HTML snippet
def scrape_product_title(title_html):
    souptitle = BeautifulSoup(title_html, 'html.parser')
    title_element = souptitle.find('h1', class_='product-meta__title')

    #return the stripped text, or None if the element is missing
    if title_element:
        return title_element.get_text(strip=True)
    return None

product_title = scrape_product_title(title_html)
if product_title:
    print('Product Title:', product_title)
else:
    print('Product title not found.')

Product Title: Samsung - 55" QLED Smart 4K Tv (2020)


In [60]:
#product price html code (hard-coded snippet):
price_html = '<span class="money">2,189 JOD</span>'

#extract the product price from an HTML snippet
def scrape_product_price(price_html):
    soupprice = BeautifulSoup(price_html, 'html.parser')
    price_element = soupprice.find('span', class_='money')

    #return the stripped text, or None if the element is missing
    if price_element:
        return price_element.get_text(strip=True)
    return None

product_price = scrape_product_price(price_html)
if product_price:
    print('Product Price:', product_price)
else:
    print('Product price not found.')

Product Price: 2,189 JOD
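
The two cells above parse hard-coded HTML snippets. A minimal sketch of applying the same class names to an already parsed page follows. `extract_product` and `sample_html` are hypothetical names, and the sample markup simply combines the two snippets, so the live page may need different selectors.

```python
from bs4 import BeautifulSoup

# Stand-in for page.content; it reuses the two snippets from the cells above.
sample_html = '''
<div>
  <h1 class="product-meta__title heading h1">Samsung - 55" QLED Smart 4K Tv (2020)</h1>
  <span class="money">2,189 JOD</span>
</div>
'''

def extract_product(soup):
    """Return (title, price) as strings; a missing element yields None."""
    title_element = soup.select_one('h1.product-meta__title')
    price_element = soup.select_one('span.money')
    title = title_element.get_text(strip=True) if title_element else None
    price = price_element.get_text(strip=True) if price_element else None
    return title, price

soup_sample = BeautifulSoup(sample_html, 'html.parser')
print(extract_product(soup_sample))
```

With the fetched page, the call would be `extract_product(soup)`, assuming the first `span.money` on the page is the product price.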


In [61]:
#Print Scraped Data
print(product_title)
print(product_price)

Samsung - 55" QLED Smart 4K Tv (2020)
2,189 JOD


In [62]:
#Print Today's Date
import datetime
today = datetime.date.today()
print(today)

2023-07-01


In [63]:
#Check data types of scraped data (the notebook displays only the last expression's type)
type(product_title)
type(product_price)

str

In [64]:
type(today)

datetime.date
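
The type check shows that the price is stored as a string such as `2,189 JOD`. To compare prices numerically over time, the amount has to be separated from the currency code. A minimal sketch follows; `parse_price` is a hypothetical helper, and it assumes prices use commas as thousands separators followed by a three-letter currency code, as in the value scraped above.

```python
import re
from decimal import Decimal

def parse_price(text):
    """Split a string like '2,189 JOD' into (Decimal amount, currency code).

    Returns None when no number is found. The currency part is None when
    no three-letter code follows the number.
    """
    match = re.search(r'(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]{3})?', text)
    if match is None:
        return None
    amount = Decimal(match.group(1).replace(',', ''))
    return amount, match.group(2)

print(parse_price('2,189 JOD'))
```

`Decimal` is used instead of `float` so that prices with fractional parts are compared exactly.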

##7. Export to CSV

In [67]:
#Write Scraped Data into CSV File
import csv
header = ['Product_Title', 'Product_Price', 'Date']
data = [product_title, product_price, today]

#Run this cell only once: mode 'w' overwrites the file, so re-running it later deletes all previously scraped rows
with open('JOR Mall Web Scraper.csv', 'w', newline='', encoding='UTF8') as f:
  writer = csv.writer(f)
  writer.writerow(header)
  writer.writerow(data) # the CSV file is found in the content folder if you are using Google Colab
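
Because opening the file in `'w'` mode replaces its contents, re-running the export cell loses earlier rows. One way to avoid that, sketched here under the assumption that a single helper handles every write, is to open in append mode and write the header only when the file is new or empty. `append_row` is a hypothetical name, not part of the original notebook.

```python
import csv
from pathlib import Path

def append_row(path, header, row):
    """Append one row to a CSV file, writing the header first if needed."""
    path = Path(path)
    # The header is written only when the file is missing or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open('a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerow(row)
```

A call such as `append_row('JOR Mall Web Scraper.csv', header, data)` could then be run any number of times without deleting earlier data.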

In [68]:
#alternatively, you can use this code to download the CSV file into your local machine
from google.colab import files

files.download('JOR Mall Web Scraper.csv')


##8. Automating the Process of Scraping Data & Exporting to CSV Every Day

In [69]:
#Appending Data to the CSV || Adding more rows into our data file
with open('JOR Mall Web Scraper.csv', 'a+', newline='', encoding='UTF8') as f:
  writer = csv.writer(f)
  writer.writerow(data)

In [70]:
#Automating the process of web scraping - price checking every day (grouping all previous code in this cell)
import csv
import datetime

def check_price():
    URL = 'https://jormall.net/collections/electronic/products/samsung-55-qled-smart-4k-tv-2020-qa55q95tauxtw'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    }

    #fetch and parse the live page so that each run records the current price
    page = requests.get(URL, headers=headers, timeout=30)
    soup = BeautifulSoup(page.content, 'html.parser')

    #scraping product title from webpage
    #assumes the live page uses the same classes as the hard-coded snippets in section 6
    title_element = soup.find('h1', class_='product-meta__title')
    product_title = title_element.get_text(strip=True) if title_element else None
    if product_title:
        print('Product Title:', product_title)
    else:
        print('Product title not found.')

    #scraping product price from webpage
    #assumes the first span with class "money" is the product price
    price_element = soup.find('span', class_='money')
    product_price = price_element.get_text(strip=True) if price_element else None
    if product_price:
        print('Product Price:', product_price)
    else:
        print('Product price not found.')

    today = datetime.date.today()
    data = [product_title, product_price, today]

    #append one row per run to the CSV file
    with open('JOR Mall Web Scraper.csv', 'a+', newline='', encoding='UTF8') as f:
        writer = csv.writer(f)
        writer.writerow(data)

In [71]:
#Run the price check once a day (86400 seconds) to see whether the price ever changes
while True:
  check_price()
  time.sleep(86400)

Product Title: Samsung - 55" QLED Smart 4K Tv (2020)
Product Price: 2,189 JOD


KeyboardInterrupt: ignored
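
The loop above runs until it is interrupted, and an exception raised inside a single check would stop it. A minimal sketch of a variant that logs failures and can be limited to a fixed number of runs follows. `run_daily`, `max_runs`, and the injectable `sleep` parameter are hypothetical additions for illustration.

```python
import time

def run_daily(task, interval_seconds=86400, max_runs=None, sleep=time.sleep):
    """Call task() repeatedly, waiting interval_seconds between calls.

    A failing call is reported and does not stop the loop. When max_runs is
    None the loop runs until interrupted. Returns the number of calls made.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
        except Exception as exc:
            print('Check failed, will retry at the next interval:', exc)
        runs += 1
        # Skip the final wait once the run limit has been reached.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

For example, `run_daily(check_price, max_runs=7)` would check the price once a day for a week. A notebook session that disconnects will still end the loop, so a scheduler outside the notebook may be more dependable for long runs.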

In [72]:
from google.colab import files

files.download('JOR Mall Web Scraper.csv')


##General Note for Web Scraping Projects:
Scraping code like this sometimes succeeds and sometimes fails, because some websites use techniques to prevent web scraping and automated data extraction. These include CAPTCHA challenges, IP blocking, user-agent detection, and dynamic HTML structures.

A website (Amazon, for example) may also change the HTML structure or class names of its elements, which can break scraping code or produce "Product title not found" errors. Such changes are sometimes made intentionally to deter automated scraping and protect the site's data.
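
One way to soften the effect of renamed classes, sketched here as an assumption and not as a complete defense, is to try several candidate selectors in order and treat "nothing matched" as an explicit result. `first_match` is a hypothetical helper, and `span.price` is an invented fallback class used only to illustrate the idea.

```python
from bs4 import BeautifulSoup

def first_match(soup, selectors):
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element.get_text(strip=True)
    return None

# 'span.price' is a hypothetical fallback, not a class confirmed on any site.
PRICE_SELECTORS = ['span.money', 'span.price']

old_layout = BeautifulSoup('<span class="money">2,189 JOD</span>', 'html.parser')
new_layout = BeautifulSoup('<span class="price">2,189 JOD</span>', 'html.parser')

print(first_match(old_layout, PRICE_SELECTORS))
print(first_match(new_layout, PRICE_SELECTORS))
```

When every selector fails, the `None` result is a signal to inspect the page manually and update the selector list.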