# Web Scraping and Data Analysis - Halal Meat Products

This program gathers data on the cost of halal meat products sold by **Asda** and **Tesco**.

Asda and Tesco are major British retail chains, operating as supermarkets and offering a wide range of products, including groceries, clothing, electronics, and household goods.

The program scrapes the data from the web, formats it into an appropriate data structure, and finishes by visualising the results.

## 1. Web Scraping - Collecting the Data 

* Web scraping is an automated process that involves extracting data from websites using scripts or bots to navigate and parse the HTML structure. 
  * This enables users to collect and analyze information for various purposes. 
* `BeautifulSoup` is a commonly used Python library for parsing HTML, but it only parses the markup it is given and does not execute JavaScript, so on its own it cannot retrieve content that is generated dynamically in the browser. 
* To handle such pages, the `selenium` package and its `WebDriver` API can drive a real browser that renders the page first. 
* `Selenium` is a powerful tool for browser automation, allowing the program to control web browsers, perform actions like navigating to URLs and clicking buttons, and scrape dynamic content efficiently.

### 1.1 Importing the Libraries

These imports are commonly used when combining `BeautifulSoup` with Selenium for web scraping tasks. `BeautifulSoup` parses HTML, while Selenium automates browser interactions and handles dynamic content. The `By` class specifies how to locate elements on a webpage, and `WebDriverWait` with `expected_conditions` waits until certain conditions are met before the script proceeds. `pandas` is imported to hold the results in a DataFrame.

In [1]:
# Import the BeautifulSoup class from the bs4 (Beautiful Soup) library
from bs4 import BeautifulSoup

# Import the webdriver module from the selenium library
from selenium import webdriver

# Import the 'By' class from selenium.webdriver.common.by module
# This class is used to specify the mechanism used to locate elements on a web page
from selenium.webdriver.common.by import By

# Import the WebDriverWait class from selenium.webdriver.support.ui module
# WebDriverWait is used to wait for a certain condition before proceeding with the execution
from selenium.webdriver.support.ui import WebDriverWait

# Import the expected_conditions module from selenium.webdriver.support
# This module provides predefined conditions to use with WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Import pandas to transform the data into a dataframe
import pandas as pd

By using Selenium in combination with BeautifulSoup, you can interact with dynamic web pages, wait for elements to load, and then extract information using BeautifulSoup. This combination is especially useful when dealing with websites that rely heavily on JavaScript or have dynamic content.

### 1.2 WebDriver, Web Scraping and Data Transformation

In [2]:
# Initialize SafariDriver - It opens a new Safari browser window that this Python script can control.
driver = webdriver.Safari()

# List to store data - Initializes an empty list called data. This list will be used to store information extracted from the web pages.
data = []

# Loop through the pages - iterates through three pages by changing the page_number variable in the URL (in this case, Asda has 3 search result pages)
for page_number in range(1, 4):
    # Construct the URL for each page
    url = f"https://groceries.asda.com/search/halal%20meat/products?page={page_number}"

    # Navigate to the URL
    driver.get(url)

    # Wait for the product listings to be present (adjust the timeout as needed) - starting with the products name or titles
    # Waits for the presence of an element with the class name "co-product__title" on the page, ensuring that the page has loaded before proceeding.
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "co-product__title")))


    # Retrieve the HTML source of the rendered page from the web driver
    page_source = driver.page_source

    # Parse the HTML with BeautifulSoup, a library for pulling data out of HTML and XML files
    soup = BeautifulSoup(page_source, "html.parser")

    # Extract product titles
    # Finds all the HTML elements with the tag <h3> and class "co-product__title". These are the titles of the halal meat products on the page.
    titles = soup.find_all("h3", class_="co-product__title")

    # Extract other information and store in the data list 
    # Iterates through each product title, extracts additional information like ratings, price, and price per kilogram, and appends this data to the 'data' list.
    for title in titles:
        # 1. Title
        product_title = title.text.strip()

        # 2. Ratings - the first word of the button's aria-label is the numeric rating
        ratings_button = title.find_next("button", class_="co-product__rating")
        ratings = ratings_button["aria-label"].split(" ")[0] if ratings_button else None

        # 3. Price
        price_element = title.find_next("strong", class_="co-product__price")
        prices = price_element.text.strip() if price_element else None

        # 4. Price per Kilogram
        price_per_kgs_elements = title.find_next("p", class_="co-item__price-per-uom-msg")
        price_per_kg_span = price_per_kgs_elements.find("span", class_="co-product__price-per-uom") if price_per_kgs_elements else None
        price_per_kg = price_per_kg_span.text.strip() if price_per_kg_span else None

        # Append the data to the list
        data.append([product_title, ratings, prices, price_per_kg])

# Closes the web browser once all pages have been processed.
driver.quit()

# Create a DataFrame from the collected data
columns = ["Title", "Ratings", "Price", "Price_per_Kg"]
df = pd.DataFrame(data, columns=columns)

# Display the DataFrame
print(df)

                                                 Title Ratings      Price  \
0    Shazans Halal Beef Mince (Typically Less Than ...    2.38  now £4.40   
1                 Shazans Halal Chicken Breast Fillets    3.58  now £4.00   
2                         Shazans Diced Chicken Breast    3.08  now £4.35   
3                         Shazans Chicken Mini Fillets       3  now £4.50   
4                        Shazans Chicken Thigh Fillets    2.15  now £4.50   
..                                                 ...     ...        ...   
123      Melis Pastirma Turkish Style Spicy Cured Beef    2.33  now £2.50   
124  Aunty Noray's 6 Hand Made Charcoaled Premium C...    3.63  now £3.50   
125      Tahira 4 Chicken Grills, Four Peppers Flavour     2.5  now £1.25   
126  Wai Wai X-Press Instant Noodles Creamy Chicken...       5  now £2.50   
127  Indomie Indomie Noodles Special Chicken Flavou...       5  now £2.50   

     Price_per_Kg  
0      (£8.80/kg)  
1      (£8.89/kg)  
2      (£9.67/k

In the for loop, we iterate through the product titles (`titles`) extracted from each page of the Asda website. For each title, we extract various information such as ratings, price, and price per kilogram.

***Ratings Extraction***

We locate the rating button that follows each product title using the class "co-product__rating". Note that `find_next` searches forward through the document from the title rather than inside it. The rating is stored in the `aria-label` attribute of this button. We use the `split(" ")` method to split the label into a list of words. For example, the label string `"2.38 stars out of 5 based on 50+ reviews"` becomes `["2.38", "stars", "out", "of", "5", "based", "on", "50+", "reviews"]`.

The `[0]` index is then used to extract the first element of this list, which corresponds to the numeric rating. If no rating button follows the title, the `ratings` variable is set to `None`. Because `find_next` keeps searching to the end of the document, a product without its own rating button would pick up the rating of a later product on the same page.
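The split-and-index step above can be sketched in isolation. The helper name `parse_rating` is hypothetical and not part of the original notebook. It also converts the rating to a `float`, which the scraping loop does not do, and returns `None` when the label is missing or does not start with a number.

```python
def parse_rating(aria_label):
    """Return the leading numeric rating from an aria-label string, or None."""
    if not aria_label:
        return None
    # The first space-separated word is expected to be the rating, e.g. "2.38"
    first_word = aria_label.split(" ")[0]
    try:
        return float(first_word)
    except ValueError:
        # The label did not start with a number (unexpected format)
        return None


label = "2.38 stars out of 5 based on 50+ reviews"
print(label.split(" "))     # the list of words shown above
print(parse_rating(label))  # 2.38
print(parse_rating(None))   # None
```

Storing ratings as numbers rather than strings makes later sorting and averaging straightforward.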

***Price Extraction***

Similarly, we locate the price element that follows each product title using the class "co-product__price". The `.text.strip()` method extracts the text content of the price element. If no price element is found, the `prices` variable is set to `None`.
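The scraped price text carries a prefix and a currency symbol (for example `now £4.40`), so it usually needs cleaning before analysis. A minimal sketch follows. The helper `parse_price` is hypothetical, and it assumes prices are always quoted in pounds with a `£` sign; other formats, such as pence-only prices, are not handled.

```python
import re


def parse_price(price_text):
    """Extract the amount in pounds from text such as 'now £4.40'."""
    if price_text is None:
        return None
    # Capture the digits (with optional decimals) that follow the pound sign
    match = re.search(r"£(\d+(?:\.\d+)?)", price_text)
    return float(match.group(1)) if match else None


print(parse_price("now £4.40"))  # 4.4
print(parse_price(None))         # None
```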

***Price per Kilogram Extraction***

For the price per kilogram, we locate the corresponding element with the class "co-item__price-per-uom-msg". Within this element, we find the span element with the class "co-product__price-per-uom". The `.text.strip()` method extracts the text content, representing the price per kilogram. If no such element is found, or if the span element is not present, the `price_per_kg` variable is set to `None`.
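The unit-price text is not always quoted per kilogram: the sample output contains both `(£8.80/kg)` and `(£3.12/100g)`. The sketch below shows one way to normalise such values to pounds per kilogram. The helper `parse_price_per_kg` and the `UNIT_TO_KG_FACTOR` table are assumptions for illustration and cover only those two units; any other unit yields `None`.

```python
import re

# Multipliers that convert a price per unit into a price per kilogram.
# Only the two units seen in the sample output are covered (an assumption).
UNIT_TO_KG_FACTOR = {"kg": 1, "100g": 10}


def parse_price_per_kg(unit_price_text):
    """Convert text such as '(£3.12/100g)' into pounds per kilogram."""
    if unit_price_text is None:
        return None
    # Capture the amount after the pound sign and the unit after the slash
    match = re.search(r"£(\d+(?:\.\d+)?)/(\w+)", unit_price_text)
    if not match:
        return None
    amount = float(match.group(1))
    factor = UNIT_TO_KG_FACTOR.get(match.group(2))
    if factor is None:
        # Unknown unit: leave it for manual inspection rather than guessing
        return None
    return round(amount * factor, 2)


print(parse_price_per_kg("(£8.80/kg)"))
print(parse_price_per_kg("(£3.12/100g)"))
```

Normalising the unit first keeps later comparisons between products on a like-for-like basis.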

These extracted details are then appended to the `data` list, which will be used to create a DataFrame containing information about halal meat products from Asda.
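One caveat with `find_next` is that it searches forward through the whole document, so a product that lacks a field can pick up the value of a later product. The sketch below illustrates this on a small hand-written HTML snippet and shows an alternative that searches inside each product's container. The snippet and the container class `co-item` are assumptions for illustration, not markup confirmed from the Asda site.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup: Product A has no rating button
html = """
<ul>
  <li class="co-item">
    <h3 class="co-product__title">Product A</h3>
    <strong class="co-product__price">now £1.00</strong>
  </li>
  <li class="co-item">
    <h3 class="co-product__title">Product B</h3>
    <button class="co-product__rating" aria-label="4.5 stars out of 5"></button>
    <strong class="co-product__price">now £2.00</strong>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_next: the search from Product A's title runs on into Product B
first_title = soup.find("h3", class_="co-product__title")
leaked = first_title.find_next("button", class_="co-product__rating")
print(leaked["aria-label"].split(" ")[0])  # Product B's rating

# Scoped search: look only inside each product's own container
rows = []
for item in soup.find_all("li", class_="co-item"):
    title = item.find("h3", class_="co-product__title")
    button = item.find("button", class_="co-product__rating")
    price = item.find("strong", class_="co-product__price")
    rows.append([
        title.text.strip() if title else None,
        button["aria-label"].split(" ")[0] if button else None,
        price.text.strip() if price else None,
    ])
print(rows)
```

With the scoped search, Product A's rating stays `None` instead of borrowing the value from Product B.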


Let's view the data we created:

In [3]:
df.head()

|     | Title | Ratings | Price | Price_per_Kg |
| --- | --- | --- | --- | --- |
| 0 | Shazans Halal Beef Mince (Typically Less Than ... | 2.38 | now £4.40 | (£8.80/kg) |
| 1 | Shazans Halal Chicken Breast Fillets | 3.58 | now £4.00 | (£8.89/kg) |
| 2 | Shazans Diced Chicken Breast | 3.08 | now £4.35 | (£9.67/kg) |
| 3 | Shazans Chicken Mini Fillets | 3.0 | now £4.50 | (£9.00/kg) |
| 4 | Shazans Chicken Thigh Fillets | 2.15 | now £4.50 | (£7.50/kg) |


In [4]:
df.tail()

|     | Title | Ratings | Price | Price_per_Kg |
| --- | --- | --- | --- | --- |
| 123 | Melis Pastirma Turkish Style Spicy Cured Beef | 2.33 | now £2.50 | (£3.12/100g) |
| 124 | Aunty Noray's 6 Hand Made Charcoaled Premium C... | 3.63 | now £3.50 | (£11.67/kg) |
| 125 | Tahira 4 Chicken Grills, Four Peppers Flavour | 2.5 | now £1.25 | (£4.81/kg) |
| 126 | Wai Wai X-Press Instant Noodles Creamy Chicken... | 5.0 | now £2.50 | (£7.14/kg) |
| 127 | Indomie Indomie Noodles Special Chicken Flavou... | 5.0 | now £2.50 | (£6.67/kg) |


## 2. Saving the data into a CSV File

In [5]:
df.to_csv("Data/halal_meat_data_asda.csv", index=False)
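`to_csv` does not create missing parent directories, so the call above relies on the `Data` folder already existing. A minimal sketch of a safer save follows. The helper `save_dataframe` is hypothetical, and the example writes to a temporary directory with a single sample row so that it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_dataframe(df, csv_path):
    """Write df to csv_path, creating the parent directory if needed."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False, encoding="utf-8")
    return csv_path


# One sample row taken from the output shown earlier
sample = pd.DataFrame(
    [["Shazans Halal Chicken Breast Fillets", "3.58", "now £4.00", "(£8.89/kg)"]],
    columns=["Title", "Ratings", "Price", "Price_per_Kg"],
)

with tempfile.TemporaryDirectory() as tmp:
    saved = save_dataframe(sample, Path(tmp) / "Data" / "halal_meat_data_asda.csv")
    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(saved)
    print(reloaded.shape)
```

In the notebook itself, the equivalent call would be `save_dataframe(df, "Data/halal_meat_data_asda.csv")`.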