#### Project Statement 
_____
- In this project, we extract data from Jumia (www.jumia.co.ke), an e-commerce website.

- We scrape the website to find products that are currently discounted.

- The data will be loaded into a Postgres database hosted on Aiven (https://aiven.io/).

#### Key libraries for this project
___

1. Beautiful Soup - `pip install beautifulsoup4`

2. Pandas - `pip install pandas`

3. Requests - `pip install requests`

4. lxml (the parser passed to Beautiful Soup) - `pip install lxml`

#### Stage 1: Setting up the project 

- Importing the libraries

- Setting project variables

In [49]:
# Importing the necessary libraries

from bs4 import BeautifulSoup 
import pandas as pd 
import lxml
import requests 
import time

In [83]:
BASE_URL = "https://www.jumia.co.ke/{}/?page={}#catalog-listing" # URL template: the first {} is the product category, the second {} is the page number

# This list will hold the product categories we shall scrape
PRODUCT_CATEGORIES = [
    "electronics",
    "phones-tablets",
    # "category-fashion-by-jumia",
    # "home-office",
    # "health-beauty",
    # "home-office-appliances",
    # "computing",
    # "baby-products",
    # "sporting-goods"
]

MAX_PAGE_COUNT = 2 # Sets the number of pages to scrape for every product category. Max = 50

# Send a browser-like User-Agent header with every HTTP request,
# instead of the default user agent that the requests library sends.
PAGE_HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
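
The URL template and the page limit above determine exactly which pages get requested. As a minimal sketch (the helper name `build_page_urls` is hypothetical, and it assumes Jumia numbers its listing pages from 1), the full list of URLs can be previewed before any request is sent:

```python
# Same template, categories, and page limit as in the cell above.
BASE_URL = "https://www.jumia.co.ke/{}/?page={}#catalog-listing"
PRODUCT_CATEGORIES = ["electronics", "phones-tablets"]
MAX_PAGE_COUNT = 2


def build_page_urls(categories, max_pages):
    """Hypothetical helper: return a (category, page, url) tuple for every page to request."""
    return [
        (category, page, BASE_URL.format(category, page))
        for category in categories
        for page in range(1, max_pages + 1)  # assumption: pages are numbered from 1
    ]


for category, page, url in build_page_urls(PRODUCT_CATEGORIES, MAX_PAGE_COUNT):
    print(category, page, url)
```

With two categories and two pages each, this yields four URLs, which is a quick way to check the settings before running the scraper.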

#### Stage 2: Scraping the website

In [86]:
def scrapper() -> list:
    """
    Scrape the Jumia category pages for products, their current prices, and their old (pre-discount) prices.

    Returns:
        all_products (list): A list of dictionaries, one per scraped product.
    """

    all_products = [] # The scraped products are collected here as a list of dictionaries

    # Loop through the product categories of interest
    for product_category in PRODUCT_CATEGORIES:

        # Listing pages are numbered from 1; scrape at most MAX_PAGE_COUNT pages per category
        for current_page_num in range(1, MAX_PAGE_COUNT + 1):

            response = requests.get(BASE_URL.format(product_category, current_page_num), headers=PAGE_HEADERS)

            soup = BeautifulSoup(response.text, 'lxml') # Parse the returned HTML

            products_wrapper = soup.find_all("article", {"class": "prd _fb col c-prd"}) # Find all the HTML tags wrapping each product

            # Loop through the wrappers to read the details of each product
            for product in products_wrapper:
                product_name = product.find("h3", {"class": "name"}).text # Access the product name

                current_price = product.find("div", {"class": "prc"}).text # Access the current price

                # Products without a discount have no old price; fall back to "0"
                old_price_tag = product.find("div", {"class": "old"})
                old_price = old_price_tag.text if old_price_tag is not None else "0"

                # Create a dictionary for this product and append it to all_products
                current_product_details = {
                    "product_name": product_name,
                    "category": product_category,
                    "current_price": current_price,
                    "old_price": old_price
                }

                all_products.append(current_product_details)

            print("current_category = {} | current_page = {}".format(product_category, current_page_num))

            # Pause for 4 seconds before making the next request
            time.sleep(4)

    return all_products

In [87]:
len(scrapper()) 

current_category = electronics | current_page = 1
current_category = electronics | current_page = 2
current_category = phones-tablets | current_page = 1
current_category = phones-tablets | current_page = 2


160
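
Since the goal is to find products that are currently discounted, the list returned by `scrapper()` still needs filtering: products without an old price carry the placeholder `"0"`. A minimal sketch of that filter follows; the helper name `discounted_only` is hypothetical, and the two sample rows only mimic the shape of the scraped dictionaries.

```python
def discounted_only(products):
    """Hypothetical helper: keep products whose old_price is not the "0" placeholder."""
    return [product for product in products if product["old_price"] != "0"]


# Illustrative rows in the same shape as the dictionaries built by scrapper()
sample_products = [
    {
        "product_name": "Fashion 6Pcs Soft Cotton Checked Men's Boxers – Multicolor",
        "category": "category-fashion-by-jumia",
        "current_price": "KSh 618",
        "old_price": "0",
    },
    {
        "product_name": "Fashion 4 In 1 Ladies Handbags Women Shoulder Bags Set PU Leather -Pink",
        "category": "category-fashion-by-jumia",
        "current_price": "KSh 764",
        "old_price": "KSh 1,505",
    },
]

discounted = discounted_only(sample_products)
print(len(discounted))  # 1
```

The filtered list can then be passed to `pd.DataFrame(...)` if a tabular view is preferred.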

In [13]:
# Scrape the page 

response = requests.get(
    url=BASE_URL.format('category-fashion-by-jumia', 1),
    headers=PAGE_HEADERS
)

In [14]:
# create a soup 

soup = BeautifulSoup(response.text, 'lxml') 

# Get the products 
products = soup.find_all("article", {"class": "prd _fb col c-prd"})

In [15]:
# Access product names and details
# Loop through the result set and access each individual product

for product in products:

    product_name = product.find("h3", {"class": "name"}).text

    current_price = product.find("div", {"class": "prc"}).text 

    try: # Products without a discount have no old price
        old_price = product.find("div", {"class": "old"}).text
    except AttributeError: # find() returned None
        old_price = "0"

    print("{}, {}, {}".format(product_name, current_price, old_price))

Fashion 6Pcs Soft Cotton Checked Men's Boxers – Multicolor, KSh 618, 0
Fashion 4 In 1 Ladies Handbags Women Shoulder Bags Set PU Leather -Pink, KSh 764, KSh 1,505
Fashion 4pcs Ladies Handbags Women Shoulder Tote Bags Set PU Leather-Black, KSh 895, KSh 1,402
Fashion 2024 Mens Casual High-Top Shoes Running Sneakers - Beige, KSh 1,674 - KSh 1,860, KSh 2,735
Fashion Couple Canvas Low Top Lace-up Shoes Classic Casual Sneakers Black, KSh 690 - KSh 1,000, KSh 1,200 - KSh 1,500
Fashion Mens Sneakers Shoes Sports Shoes Breathable Running Shoes, KSh 1,000 - KSh 1,100, KSh 1,800 - KSh 2,000
Fashion Men Shoes Sneakers Skateboarding Shoes Sport Shoes Running Sneakers Casual Shoes, KSh 989 - KSh 1,099, KSh 1,371 - KSh 1,913
Fashion New Young Fashion Trendy Board Shoes - Blue, KSh 1,590, KSh 2,606 - KSh 2,839
Fashion Men's Trendy Sneakers - White, KSh 1,341 - KSh 1,490, KSh 2,223
Fashion Slipon Shoes Loafers Shoes Canvas Shoes Casual Shoes Mens Sneaker Black, KSh 650 - KSh 759, KSh 1,200 - KSh 1,500
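
The output above shows that prices arrive as text, sometimes as a range such as `KSh 1,674 - KSh 1,860`. Before the data can be compared or stored as numbers, these strings need parsing. The sketch below is one possible approach, not part of the original notebook: the helpers `parse_price` and `discount_percent` are hypothetical, they assume whole-shilling amounts as seen in the output, and they compare the lower bounds of the current and old price ranges.

```python
import re


def parse_price(price_text):
    """Hypothetical helper: return (low, high) in KSh, or None if no amount is found."""
    # Assumption: amounts are whole numbers with optional thousands separators
    amounts = [int(match.replace(",", "")) for match in re.findall(r"\d[\d,]*", price_text)]
    if not amounts:
        return None
    return min(amounts), max(amounts)


def discount_percent(current_text, old_text):
    """Hypothetical helper: percentage discount based on the lower bound of each range."""
    current = parse_price(current_text)
    old = parse_price(old_text)
    # The "0" placeholder means the product has no old price, so no discount
    if current is None or old is None or old[0] == 0:
        return None
    return round((old[0] - current[0]) / old[0] * 100, 1)


print(parse_price("KSh 1,674 - KSh 1,860"))  # (1674, 1860)
print(parse_price("KSh 618"))                # (618, 618)
print(discount_percent("KSh 618", "0"))      # None
```

Comparing lower bounds is only one convention; comparing upper bounds or midpoints would be equally defensible for products listed with a price range.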
