#### Project Statement 
_____
- In this project, we shall extract data from Jumia (www.jumia.co.ke), an e-commerce website.

- We shall scrape the website to find products that are currently discounted.

- The data will be moved to a Postgres database hosted on Aiven (https://aiven.io/).

#### Key libraries for this project include:
___

1. Beautiful Soup - `pip install beautifulsoup4`

2. Pandas - `pip install pandas`

3. Requests - `pip install requests`

### Step 1: Setting up the project

- Importing the libraries,

- Setting project variables

In [86]:
# Importing the necessary libraries

from bs4 import BeautifulSoup 
import pandas as pd 
import lxml
import requests 
import time
import re
import os

# To be used with database 
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

In [76]:
BASE_URL = "https://www.jumia.co.ke/{}/?page={}#catalog-listing" # URL template: the first {} takes the category slug, the second {} the page number

# This list will hold the product categories we shall scrape
PRODUCT_CATEGORIES = [
    "electronics",
    "phones-tablets",
    # "category-fashion-by-jumia",
    # "home-office",
    # "health-beauty",
    # "home-office-appliances",
    # "computing",
    # "baby-products",
    # "sporting-goods"
]

MAX_PAGE_COUNT = 1 # Sets the number of pages to scrape for every product category. Max = 50

# Send a browser-like User-Agent header with every HTTP request.
# It replaces the default User-Agent that the requests library sends (python-requests/<version>).
PAGE_HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
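To make the role of the two `{}` placeholders in `BASE_URL` concrete, here is a minimal sketch that expands the template into the full list of page URLs. The helper name `build_page_urls` is hypothetical and is not part of the original notebook.

```python
# URL template copied from the configuration cell above
BASE_URL = "https://www.jumia.co.ke/{}/?page={}#catalog-listing"


def build_page_urls(categories, max_page_count):
    """Return one URL per (category, page) pair, pages numbered from 1.

    Hypothetical helper, shown only to illustrate how the template expands.
    """
    return [
        BASE_URL.format(category, page_number)
        for category in categories
        for page_number in range(1, max_page_count + 1)
    ]


# Two categories, two pages each: four URLs in total
page_urls = build_page_urls(["electronics", "phones-tablets"], 2)
for url in page_urls:
    print(url)
```

With `MAX_PAGE_COUNT = 1`, as configured above, each category yields exactly one URL.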

### Step 2: Scrape the Website

In [77]:
def scrapper() -> list:
    """ 
    This function scrapes the catalog pages of every category in PRODUCT_CATEGORIES
    to find products, their current prices, and their old (pre-discount) prices.

    Returns:
        all_products (list): A list of dictionaries, one per scraped product.
    """ 

    all_products = [] # The scraped products will be added here as a list of dictionaries

    current_page_num = 1 # Holds the number of the page currently being scraped

    # Looping through the product categories of interest
    for product_category in PRODUCT_CATEGORIES:

        print("Now Scrapping {}".format(product_category)) # Outputs the category currently being scraped

        # Stop once MAX_PAGE_COUNT pages have been scraped for this category
        while current_page_num <= MAX_PAGE_COUNT: 

            # print("Current Page Number {}".format(current_page_num)) # Outputs the current page being scraped

            response = requests.get(BASE_URL.format(product_category, current_page_num), headers=PAGE_HEADERS) 

            soup = BeautifulSoup(response.text, 'lxml') # Create a soup 

            products_wrapper = soup.find_all("article", {"class": "prd _fb col c-prd"})  # Find all the HTML tags wrapping each product
            
            # Loop through the wrappers to extract the details of each product
            for product in products_wrapper:
                product_name = product.find("h3", {"class": "name"}).text # Access the product name 

                current_price = product.find("div", {"class": "prc"}).text # Access the current price 
 
                # Products without a discount have no "old" price element, so
                # find() returns None and .text raises AttributeError
                try:
                    old_price = product.find("div", {"class": "old"}).text
                except AttributeError:
                    old_price = "0" 

                # Create a dictionary for this product and append to the list all_products
                current_product_details = {
                    "product_name": product_name,
                    "category": product_category,
                    "current_price": current_price,
                    "old_price": old_price
                } 

                all_products.append(current_product_details)
            
            current_page_num = current_page_num + 1 # Increment this to move to the next page

            # Pause for 4 seconds before making another request
            print("Scrapper going to sleep...")
            print("")
            time.sleep(4)
            
        # Reset the page counter when done with each category
        current_page_num = 1

    return all_products 
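The parsing step inside the loop can be tried without any network access. The sketch below runs the same lookups against a small hand-written HTML snippet. The snippet, the helper names `text_or_default` and `parse_products`, and the use of Python's built-in `html.parser` instead of `lxml` are all assumptions made for illustration; the real Jumia markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a catalog page; NOT real Jumia markup
SAMPLE_HTML = """
<article class="prd _fb col c-prd">
  <h3 class="name">Sample TV</h3>
  <div class="prc">KSh 19,799</div>
  <div class="old">KSh 28,599</div>
</article>
<article class="prd _fb col c-prd">
  <h3 class="name">Sample Speaker</h3>
  <div class="prc">KSh 5,450</div>
</article>
"""


def text_or_default(parent, tag, css_class, default=None):
    """Return the stripped text of the first matching child, or default if absent."""
    node = parent.find(tag, {"class": css_class})
    return node.get_text(strip=True) if node is not None else default


def parse_products(html, category):
    """Extract name, current price, and old price from each product card."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for article in soup.find_all("article", {"class": "prd"}):
        name = text_or_default(article, "h3", "name")
        price = text_or_default(article, "div", "prc")
        if name is None or price is None:
            continue  # Skip cards that lack the required fields
        products.append({
            "product_name": name,
            "category": category,
            "current_price": price,
            # "0" marks a product without a discount, as in the scraper above
            "old_price": text_or_default(article, "div", "old", default="0"),
        })
    return products


sample_products = parse_products(SAMPLE_HTML, "electronics")
print(sample_products)
```

Checking for `None` explicitly, rather than catching an exception, also protects the name and price lookups, which would otherwise stop the whole run if a single card were malformed.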

### Step 3: Data Storage 
___ 
This stage involves storing the scraped data in the database.

Below are the implementation details:

 - **#1.** Move the data to a Pandas DataFrame. 

 - **#2.** Perform some data cleaning tasks, e.g., transformations 

 - **#3.** Set up the database

 - **#4.** Use Pandas to move the cleaned data to our database.

#### #1. Moving the data to a Pandas DataFrame.

- Since the scraper returns a list of dictionaries, we can create a Pandas DataFrame directly from it:

    `pd.DataFrame(list_of_dictionaries)`
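As a small illustration of this step, the sketch below builds a DataFrame from two made-up product dictionaries that have the same keys the scraper produces. The values are placeholders, not scraped data.

```python
import pandas as pd

# Made-up records with the same keys the scraper returns
list_of_dictionaries = [
    {"product_name": "Sample TV", "category": "electronics",
     "current_price": "KSh 19,799", "old_price": "KSh 28,599"},
    {"product_name": "Sample Speaker", "category": "electronics",
     "current_price": "KSh 5,450", "old_price": "0"},
]

# Each dictionary becomes a row; the dictionary keys become the column names
example_df = pd.DataFrame(list_of_dictionaries)

print(example_df.shape)
print(list(example_df.columns))
```

Every column holds strings at this point, which is why a cleaning step is needed before the prices can be treated as numbers.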

In [78]:

products_df = pd.DataFrame(scrapper()) # Creates a DataFrame from the scraped data

Now Scrapping electronics
Scrapper going to sleep...

Now Scrapping phones-tablets
Scrapper going to sleep...



#### #2. Perform some data cleaning tasks, e.g., transformations 

In [79]:
# Overview of the data 

products_df.head()

|   | product_name | category | current_price | old_price |
|---|---|---|---|---|
| 0 | Vitron HTC4388FS - 43" Smart Android Frameless... | electronics | KSh 19,799 | KSh 28,599 |
| 1 | Vitron V527 - 2.1 CH Multimedia Speaker, BT/US... | electronics | KSh 5,450 | KSh 7,599 |
| 2 | Vitron HTC3200S,32"Inch Bluetooth Enabled Fram... | electronics | KSh 14,399 | KSh 18,500 |
| 3 | Amtec AM-02 2.1CH Multimedia Speaker BT/USB/SD... | electronics | KSh 5,450 | KSh 7,500 |
| 4 | Vitron V643 3.1Ch Bluetooth Speaker System, 12... | electronics | KSh 4,749 | KSh 5,600 |


In [80]:
# The current_price and old_price columns hold strings such as "KSh 19,799".
# Before they can be converted to float, "KSh" and "," must be stripped from the values.

# Removing "KSh" (case-insensitive) and "," from the current_price and old_price values
products_df["current_price"] = products_df.current_price.str.replace("Ksh ", "", regex=True, flags=re.IGNORECASE).str.replace(",", "") 

products_df["old_price"] = products_df.old_price.str.replace("Ksh ", "", regex=True, flags=re.IGNORECASE).str.replace(",", "") 

products_df.head()

|   | product_name | category | current_price | old_price |
|---|---|---|---|---|
| 0 | Vitron HTC4388FS - 43" Smart Android Frameless... | electronics | 19799 | 28599 |
| 1 | Vitron V527 - 2.1 CH Multimedia Speaker, BT/US... | electronics | 5450 | 7599 |
| 2 | Vitron HTC3200S,32"Inch Bluetooth Enabled Fram... | electronics | 14399 | 18500 |
| 3 | Amtec AM-02 2.1CH Multimedia Speaker BT/USB/SD... | electronics | 5450 | 7500 |
| 4 | Vitron V643 3.1Ch Bluetooth Speaker System, 12... | electronics | 4749 | 5600 |
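After the stripping step the price columns still contain strings. One possible way to finish the conversion to float is sketched below on a small stand-in DataFrame. The `discount_pct` column is an illustrative addition and is not part of the original notebook; it relies on the scraper's convention that an old price of `"0"` means no discount.

```python
import pandas as pd

# Stand-in for products_df after "KSh" and "," have been stripped
sample_df = pd.DataFrame({
    "current_price": ["19799", "5450", "1200"],
    "old_price": ["28599", "7599", "0"],
})

# Convert both price columns to float; unparseable values become NaN
for column in ["current_price", "old_price"]:
    sample_df[column] = pd.to_numeric(sample_df[column], errors="coerce").astype("float64")

# Percentage discount, left as NaN where no old price was recorded
has_old_price = sample_df["old_price"] > 0
sample_df["discount_pct"] = (
    (sample_df["old_price"] - sample_df["current_price"])
    / sample_df["old_price"] * 100
).where(has_old_price).round(1)

print(sample_df)
```

Using `errors="coerce"` keeps a single malformed value from raising an exception, at the cost of silently producing `NaN`, so it is worth counting the missing values afterwards.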


#### #3. Setting up the database 

- The database is hosted on Aiven:

  `https://aiven.io/` 

- We will use SQLAlchemy to move the data in our DataFrame to the database.

- Database details have already been stored as environment variables.
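As a hedged alternative to formatting the connection string by hand, the sketch below assembles it with SQLAlchemy's `URL.create`, which escapes special characters in the password, and fails early when a setting is missing. The helper name `build_db_url` and the placeholder values are hypothetical; only the environment variable names come from this notebook.

```python
from sqlalchemy.engine import URL

# Environment variable names used in this notebook
REQUIRED_KEYS = ("DB_HOST", "DB_NAME", "DB_PASSWORD", "DB_PORT", "DB_USER")


def build_db_url(env):
    """Build a PostgreSQL URL from a mapping of settings (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing database settings: " + ", ".join(missing))
    return URL.create(
        drivername="postgresql",
        username=env["DB_USER"],
        password=env["DB_PASSWORD"],
        host=env["DB_HOST"],
        port=int(env["DB_PORT"]),  # Environment variables are always strings
        database=env["DB_NAME"],
        query={"sslmode": "require"},
    )


# Placeholder values for illustration only
example_env = {
    "DB_HOST": "db.example.com",
    "DB_NAME": "defaultdb",
    "DB_PASSWORD": "p@ss/word",
    "DB_PORT": "5432",
    "DB_USER": "demo_user",
}
example_url = build_db_url(example_env)

# hide_password=True masks the password when the URL is printed
print(example_url.render_as_string(hide_password=True))
```

In the notebook itself, the same helper could be called as `create_engine(build_db_url(os.environ))`.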

In [89]:

# Accessing the database details 
db_credentials = {
   "HOST": os.getenv("DB_HOST"),
   "NAME": os.getenv("DB_NAME"),
   "PASSWORD": os.getenv("DB_PASSWORD"),
   "PORT": os.getenv("DB_PORT"),
   "USER": os.getenv("DB_USER")
} 

# Create a database engine 
db_engine = create_engine(
    url="postgresql://{}:{}@{}:{}/{}?sslmode=require".format(
        db_credentials.get("USER"),
        db_credentials.get("PASSWORD"),
        db_credentials.get("HOST"),
        db_credentials.get("PORT"),
        db_credentials.get("NAME")
    )
) 

# We want the data to be moved to a table called "jumia_products" 
table_name = "jumia_products" 

# Using Pandas to move the scraped data to our database 
products_df.to_sql(name=table_name, con=db_engine, if_exists="replace", index=False)

80
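To see what `if_exists="replace"` does without touching the hosted database, the sketch below repeats the same `to_sql` call against an in-memory SQLite database and reads the row count back. SQLite is swapped in purely for illustration; the notebook itself writes to Postgres, and the sample rows are made up.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the Postgres database (assumption for this demo)
demo_engine = create_engine("sqlite:///:memory:")

demo_df = pd.DataFrame([
    {"product_name": "Sample TV", "category": "electronics",
     "current_price": 19799.0, "old_price": 28599.0},
    {"product_name": "Sample Speaker", "category": "electronics",
     "current_price": 5450.0, "old_price": 7599.0},
])

# Writing twice with if_exists="replace" drops and recreates the table,
# so the second write does not duplicate the rows
for _ in range(2):
    demo_df.to_sql(name="jumia_products", con=demo_engine, if_exists="replace", index=False)

# Read the row count back to confirm what was stored
with demo_engine.connect() as connection:
    row_count = connection.execute(text("SELECT COUNT(*) FROM jumia_products")).scalar()

print(row_count)
```

With `if_exists="append"` instead, every run of the notebook would add the scraped rows again, which may or may not be what is wanted for tracking prices over time.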