# Web Scraping the Jumia Site

**Objective** <br>
The goal of this script is to scrape product data from the Jumia website. The dataset will be transformed and stored to form a data catalog for our price prediction app.

## Step 1: Import the Necessary Libraries

In [4]:
import requests  # make a request to a url
from bs4 import BeautifulSoup  # parse the requests as html
import pandas as pd  # data manipulation
import re

## Step 2: Create a Dictionary

In [6]:
product_data = {
    "Product Name": [],
    "Current Price": [],
    "Old Price": [],
    "Discount": [],
    "Rating": [],
    "Vendor": []
}

## Step 3: Loop through the pages

In [149]:
for page_num in range(1, 50):  # pages 1 to 49
    URL = f"https://www.jumia.com.ng/laptops/?page={page_num}#catalog-listing"

    try:
        response = requests.get(url=URL)
    except requests.RequestException as error:
        # skip this page; otherwise the previous page's content would be parsed again
        # (or `content` would be undefined on the first page)
        print(f"Request failed for {URL}: {error}")
        continue

    if response.status_code != 200:
        print(f"Resource Not Found! (status code {response.status_code})")
        continue

    # soup
    soup = BeautifulSoup(response.content, "html.parser")
    # find articles
    articles = soup.find_all('article', class_="prd _fb col c-prd")

    # looping the articles
    for article in articles:

        # append so that "Vendor" stays a list like the other columns
        product_data["Vendor"].append("Jumia")

        name = article.find('h3', class_='name')
        if name is not None:
            product_data['Product Name'].append(name.text)
        else:
            product_data['Product Name'].append("")

        current_price = article.find('div', class_='prc')
        if current_price is not None:
            product_data["Current Price"].append(current_price.text)
        else:
            product_data["Current Price"].append("")

        old_price = article.find('div', class_='old')
        if old_price is not None:
            product_data["Old Price"].append(old_price.text)
        else:
            product_data["Old Price"].append("")

        discount = article.find('div', class_='bdg _dsct _sm')
        if discount is not None:
            product_data["Discount"].append(discount.text)
        else:
            product_data["Discount"].append("")

        rating = article.find('div', class_='stars _s')
        if rating is not None:
            product_data["Rating"].append(rating.text)
        else:
            product_data["Rating"].append("")

    print(f"Done Collecting Data from {URL}")

Done Collecting Data from https://www.jumia.com.ng/laptops/?page=1#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=2#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=3#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=4#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=5#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=6#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=7#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=8#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=9#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=10#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=11#catalog-listing
Done Collecting Data from https://www.jumia.com.ng/laptops/?page=12#catalog-listing
...
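
The scraping loop above repeats the same find-then-append pattern for every field. As a minimal sketch, that pattern could be driven by a mapping from column name to selector. The names `SELECTORS` and `extract_product` are hypothetical, and the HTML fragment is made up for illustration; it is not real Jumia markup.

```python
from bs4 import BeautifulSoup

# Column name -> (tag, class string), copied from the loop above.
SELECTORS = {
    "Product Name": ("h3", "name"),
    "Current Price": ("div", "prc"),
    "Old Price": ("div", "old"),
    "Discount": ("div", "bdg _dsct _sm"),
    "Rating": ("div", "stars _s"),
}


def extract_product(article):
    """Return one product as a dict; fields that are absent become empty strings."""
    row = {}
    for column, (tag, css_class) in SELECTORS.items():
        node = article.find(tag, class_=css_class)
        row[column] = node.text if node is not None else ""
    row["Vendor"] = "Jumia"
    return row


# Made-up fragment with no old price and no rating (assumption, for illustration only).
sample_html = """
<article class="prd _fb col c-prd">
  <h3 class="name">Demo Laptop</h3>
  <div class="prc">₦ 100,000</div>
  <div class="bdg _dsct _sm">10%</div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = [extract_product(a) for a in soup.find_all("article", class_="prd _fb col c-prd")]
print(rows)
```

A list of row dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, which avoids keeping six parallel lists the same length by hand.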

## Step 4: Store in a DataFrame

In [208]:
jumia_laptop_df = pd.DataFrame.from_dict(product_data)
jumia_laptop_df

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...","₦ 294,325","₦ 547,000",46%,4.6 out of 5,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,"₦ 290,900","₦ 1,666,000",83%,5 out of 5,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,"₦ 400,660","₦ 500,000",20%,,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...","₦ 260,300","₦ 1,606,500",84%,3.8 out of 5,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,"₦ 135,000",,,5 out of 5,Jumia
...,...,...,...,...,...,...
2075,Hp 15 - Intel Celeron - 500GB HDD 4GB RAM - Wi...,"₦ 325,000",,,,Jumia
2076,Hp ProBook 11 X360- TOUCH- 128GB SSD/4GB RAM-I...,"₦ 260,000","₦ 370,000",30%,,Jumia
2077,Hp ProBook 11 X360 TOUCH INTEL CORE I5 512GB S...,"₦ 330,000",,,,Jumia
2078,Hp 15 TOUCHSCREEN 12TH GEN INTEL CORE I5 16GB ...,"₦ 760,500","₦ 880,000",14%,,Jumia


## Step 5: Store data into CSV

In [210]:
jumia_laptop_df.to_csv("jumia_laptop.csv", index=False)

## Performing Data Cleaning on the Jumia Data

In [297]:
jumia_df = pd.read_csv("jumia_laptop.csv")
jumia_df

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...","₦ 294,325","₦ 547,000",46%,4.6 out of 5,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,"₦ 290,900","₦ 1,666,000",83%,5 out of 5,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,"₦ 400,660","₦ 500,000",20%,,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...","₦ 260,300","₦ 1,606,500",84%,3.8 out of 5,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,"₦ 135,000",,,5 out of 5,Jumia
...,...,...,...,...,...,...
2075,Hp 15 - Intel Celeron - 500GB HDD 4GB RAM - Wi...,"₦ 325,000",,,,Jumia
2076,Hp ProBook 11 X360- TOUCH- 128GB SSD/4GB RAM-I...,"₦ 260,000","₦ 370,000",30%,,Jumia
2077,Hp ProBook 11 X360 TOUCH INTEL CORE I5 512GB S...,"₦ 330,000",,,,Jumia
2078,Hp 15 TOUCHSCREEN 12TH GEN INTEL CORE I5 16GB ...,"₦ 760,500","₦ 880,000",14%,,Jumia


In [299]:
jumia_df.shape

(2080, 6)

In [301]:
jumia_df.ndim

2

In [303]:
jumia_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2080 entries, 0 to 2079
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Product Name   2080 non-null   object
 1   Current Price  2080 non-null   object
 2   Old Price      1414 non-null   object
 3   Discount       1414 non-null   object
 4   Rating         696 non-null    object
 5   Vendor         2080 non-null   object
dtypes: object(6)
memory usage: 97.6+ KB
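
The non-null counts reported by `info()` can be turned into a share of missing values per column, which shows how many rows the later `fillna` calls will affect. A minimal sketch using only the counts printed above (the variable names are illustrative):

```python
# Counts copied from the info() output above.
total_rows = 2080
non_null = {"Old Price": 1414, "Discount": 1414, "Rating": 696}

# Share of rows with no value in each column, rounded to three decimals.
missing_share = {
    column: round(1 - count / total_rows, 3)
    for column, count in non_null.items()
}
print(missing_share)
```

On the loaded DataFrame itself, `jumia_df.isna().mean()` should give the same shares directly.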


In [305]:
# current price
jumia_df['Current Price'] = jumia_df['Current Price'].str.replace("₦ ","")

In [307]:
jumia_df['Current Price'] = jumia_df['Current Price'].str.replace(",","")

In [309]:
# drop records where the current price is a range such as '413250 - 650000'

jumia_df = jumia_df[jumia_df['Current Price'] != '413250 - 650000']
jumia_df = jumia_df[jumia_df['Current Price'] != '238800 - 322800']
jumia_df = jumia_df[jumia_df['Current Price'] != '1200000 - 1500000']
jumia_df = jumia_df[jumia_df['Current Price'] != '420000 - 650000']

In [311]:
jumia_df[jumia_df['Current Price'].str.contains("-")]

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
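
The range prices above are removed one hard-coded value at a time, so a new scrape with different ranges would need the list updated. As a sketch of a more general approach, a boolean mask can flag every price that contains a hyphen. The sample values below are hypothetical, and the raw range format is an assumption inferred from the cleaned strings.

```python
import pandas as pd

# Hypothetical sample in the same shape as the scraped "Current Price" column.
df = pd.DataFrame(
    {"Current Price": ["₦ 294,325", "₦ 413,250 - ₦ 650,000", "₦ 135,000"]}
)

# Strip the currency symbol, thousands separators and surrounding spaces.
price = (
    df["Current Price"]
    .str.replace("₦", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
)

# Rows whose price is a range still contain a hyphen after cleaning.
is_range = price.str.contains("-", regex=False)

# Keep only single-value prices and convert them to floats.
df = df.loc[~is_range].copy()
df["Current Price"] = price[~is_range].astype("float64")
print(df)
```

The `.copy()` makes the filtered frame independent of the original, so the later column assignment does not act on a view.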


In [313]:
jumia_df['Current Price'] = jumia_df['Current Price'].astype("float64")

In [315]:
jumia_df.head()

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...",294325.0,"₦ 547,000",46%,4.6 out of 5,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,290900.0,"₦ 1,666,000",83%,5 out of 5,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,400660.0,"₦ 500,000",20%,,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...",260300.0,"₦ 1,606,500",84%,3.8 out of 5,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,,,5 out of 5,Jumia


In [317]:
# Old price
jumia_df['Old Price'] = jumia_df['Old Price'].str.replace("₦ ","")
jumia_df['Old Price'] = jumia_df['Old Price'].str.replace(",","")
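# assign the result back: fillna(inplace=True) on a selected column may not update jumia_df in newer pandas versions
jumia_df['Old Price'] = jumia_df['Old Price'].fillna(value=0)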

In [319]:
jumia_df.head()

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...",294325.0,547000,46%,4.6 out of 5,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,290900.0,1666000,83%,5 out of 5,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,400660.0,500000,20%,,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...",260300.0,1606500,84%,3.8 out of 5,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,0,,5 out of 5,Jumia


In [325]:
jumia_df.tail()

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
2075,Hp 15 - Intel Celeron - 500GB HDD 4GB RAM - Wi...,325000.0,0,,,Jumia
2076,Hp ProBook 11 X360- TOUCH- 128GB SSD/4GB RAM-I...,260000.0,370000,30%,,Jumia
2077,Hp ProBook 11 X360 TOUCH INTEL CORE I5 512GB S...,330000.0,0,,,Jumia
2078,Hp 15 TOUCHSCREEN 12TH GEN INTEL CORE I5 16GB ...,760500.0,880000,14%,,Jumia
2079,Lenovo IDEAPAD 15 INTEL CELERON 4GB RAM 256GB ...,298450.0,392000,24%,,Jumia


In [321]:
jumia_df['Old Price'].isnull().sum()

0

In [333]:
jumia_df = jumia_df[jumia_df['Old Price'] != '685000 - 900000']

In [337]:
jumia_df['Old Price'] = jumia_df['Old Price'].astype("float64")

In [339]:
jumia_df.head()

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...",294325.0,547000.0,46%,4.6 out of 5,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,290900.0,1666000.0,83%,5 out of 5,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,400660.0,500000.0,20%,,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...",260300.0,1606500.0,84%,3.8 out of 5,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,0.0,,5 out of 5,Jumia


In [341]:
# Discount
jumia_df['Discount'] = jumia_df['Discount'].str.replace("%","")

In [343]:
# fill missing values in Discount (assign back rather than using inplace=True on a selected column)
jumia_df["Discount"] = jumia_df["Discount"].fillna(value=0)

In [345]:
jumia_df["Discount"].isnull().sum()

0

In [347]:
# now change the datatype
jumia_df['Discount'] = jumia_df['Discount'].astype("int")

In [349]:
jumia_df['Rating'] = jumia_df['Rating'].str.replace("out of 5","")

In [351]:
jumia_df['Rating'] = jumia_df['Rating'].str.strip()

In [353]:
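# unrated products get a rating of 0 (assign back rather than using inplace=True on a selected column)
jumia_df["Rating"] = jumia_df["Rating"].fillna(value=0)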

In [355]:
jumia_df['Rating'] = jumia_df['Rating'].astype("float64")
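
The four Rating cells above (replace, strip, fill, cast) can also be expressed as one regular-expression extraction. This is a sketch on made-up values, assuming every rating string starts with the numeric score as in the output shown earlier:

```python
import pandas as pd

# Hypothetical sample in the same shape as the scraped "Rating" column.
ratings = pd.Series(["4.6 out of 5", "5 out of 5", None, "3.8 out of 5"])

# Capture the leading number; rows without a rating stay missing (NaN).
numeric = ratings.str.extract(r"^(\d+(?:\.\d+)?)", expand=False).astype("float64")

# Same convention as the notebook: unrated products become 0.
filled = numeric.fillna(0)
print(filled.tolist())
```

Keeping `numeric` around before filling makes it possible to tell an unrated product apart from one that would genuinely score 0.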

In [357]:
jumia_df

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...",294325.0,547000.0,46,4.6,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,290900.0,1666000.0,83,5.0,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,400660.0,500000.0,20,0.0,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...",260300.0,1606500.0,84,3.8,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,0.0,0,5.0,Jumia
...,...,...,...,...,...,...
2075,Hp 15 - Intel Celeron - 500GB HDD 4GB RAM - Wi...,325000.0,0.0,0,0.0,Jumia
2076,Hp ProBook 11 X360- TOUCH- 128GB SSD/4GB RAM-I...,260000.0,370000.0,30,0.0,Jumia
2077,Hp ProBook 11 X360 TOUCH INTEL CORE I5 512GB S...,330000.0,0.0,0,0.0,Jumia
2078,Hp 15 TOUCHSCREEN 12TH GEN INTEL CORE I5 16GB ...,760500.0,880000.0,14,0.0,Jumia


In [359]:
jumia_df['Product Name'] = jumia_df['Product Name'].str.strip()

In [361]:
jumia_df

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...",294325.0,547000.0,46,4.6,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,290900.0,1666000.0,83,5.0,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,400660.0,500000.0,20,0.0,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...",260300.0,1606500.0,84,3.8,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,0.0,0,5.0,Jumia
...,...,...,...,...,...,...
2075,Hp 15 - Intel Celeron - 500GB HDD 4GB RAM - Wi...,325000.0,0.0,0,0.0,Jumia
2076,Hp ProBook 11 X360- TOUCH- 128GB SSD/4GB RAM-I...,260000.0,370000.0,30,0.0,Jumia
2077,Hp ProBook 11 X360 TOUCH INTEL CORE I5 512GB S...,330000.0,0.0,0,0.0,Jumia
2078,Hp 15 TOUCHSCREEN 12TH GEN INTEL CORE I5 16GB ...,760500.0,880000.0,14,0.0,Jumia


In [363]:
jumia_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2075 entries, 0 to 2079
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Product Name   2075 non-null   object 
 1   Current Price  2075 non-null   float64
 2   Old Price      2075 non-null   float64
 3   Discount       2075 non-null   int64  
 4   Rating         2075 non-null   float64
 5   Vendor         2075 non-null   object 
dtypes: float64(3), int64(1), object(2)
memory usage: 113.5+ KB


In [365]:
jumia_df[jumia_df["Discount"] == 0]

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,0.0,0,5.0,Jumia
12,Lenovo THINKPAD X1 CARBON GEN 8 CORE I7-1035G7...,993900.0,0.0,0,0.0,Jumia
15,Acer TRAVELMATE B3 TMB311 CELERON N4020 4GB RA...,226850.0,0.0,0,0.0,Jumia
30,Hp Stream 11 Pro- Intel Celeron - 4GB RAM - 64...,130000.0,0.0,0,4.6,Jumia
35,Hp Stream11intel Celeron D/C 64GB HDD+4GB RAM+...,137000.0,0.0,0,3.3,Jumia
...,...,...,...,...,...,...
2069,Hp Stream 11 Intel Celeron D/C 2GB RAM- 32GB ...,141000.0,0.0,0,4.0,Jumia
2070,Hp ProBook 11 X360- TOUCHSCREN- Intel P 128GB ...,199999.0,0.0,0,0.0,Jumia
2074,Asus Vivobook X515JA Intel Core I5 4GB/512 SSD...,699000.0,0.0,0,0.0,Jumia
2075,Hp 15 - Intel Celeron - 500GB HDD 4GB RAM - Wi...,325000.0,0.0,0,0.0,Jumia


In [369]:
# duplicate
jumia_df.duplicated().sum()

352

In [371]:
jumia_df[jumia_df.duplicated()]

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
40,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...",294325.0,547000.0,46,4.6,Jumia
41,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,290900.0,1666000.0,83,5.0,Jumia
42,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,400660.0,500000.0,20,0.0,Jumia
43,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...",260300.0,1606500.0,84,3.8,Jumia
44,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,0.0,0,5.0,Jumia
...,...,...,...,...,...,...
2054,DELL Latitude 7490 TOUCHSCREEN Core I7/ 32GB R...,750000.0,950000.0,21,0.0,Jumia
2057,Hp ProBook 11 X360- TOUCH- 512GB SSD/4GB RAM-I...,300000.0,350000.0,14,0.0,Jumia
2063,Hp EliteBook 840 G5 Intel Core I5- 8GB RAM/512...,465000.0,550000.0,15,0.0,Jumia
2067,Hp EliteBook 830 G6 TOUCHSCREEN Core I5-16GB R...,489250.0,750000.0,35,0.0,Jumia


In [375]:
jumia_df.drop_duplicates(inplace=True)
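
By default, `duplicated()` compares all columns and marks every repeat of an earlier row, leaving the first occurrence unmarked; `drop_duplicates()` removes exactly those marked rows. A small sketch on made-up data:

```python
import pandas as pd

# Made-up rows: product "A" appears three times with identical values.
df = pd.DataFrame(
    {
        "Product Name": ["A", "B", "A", "A"],
        "Current Price": [1.0, 2.0, 1.0, 1.0],
    }
)

dup_mask = df.duplicated()          # False for the first "A", True for its repeats
n_duplicates = int(dup_mask.sum())  # number of rows that would be dropped

# Drop the repeats and renumber the index from 0.
deduped = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```

Without `reset_index(drop=True)` the original row labels are kept, which is why the index in the output above jumps from 2075 to 2077.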

In [377]:
jumia_df

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...",294325.0,547000.0,46,4.6,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,290900.0,1666000.0,83,5.0,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,400660.0,500000.0,20,0.0,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...",260300.0,1606500.0,84,3.8,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,0.0,0,5.0,Jumia
...,...,...,...,...,...,...
2074,Asus Vivobook X515JA Intel Core I5 4GB/512 SSD...,699000.0,0.0,0,0.0,Jumia
2075,Hp 15 - Intel Celeron - 500GB HDD 4GB RAM - Wi...,325000.0,0.0,0,0.0,Jumia
2077,Hp ProBook 11 X360 TOUCH INTEL CORE I5 512GB S...,330000.0,0.0,0,0.0,Jumia
2078,Hp 15 TOUCHSCREEN 12TH GEN INTEL CORE I5 16GB ...,760500.0,880000.0,14,0.0,Jumia


In [379]:
# convert the percentage to a fraction (e.g. 46 -> 0.46)
jumia_df["Discount"] = jumia_df["Discount"] / 100
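
The Discount values appear to follow from the two price columns as `1 - Current Price / Old Price`. A quick arithmetic check on four rows from the outputs above, assuming the listed percentage is rounded to the nearest whole number:

```python
# (current price, old price, listed discount in %) copied from rows 0-3 above.
rows = [
    (294325.0, 547000.0, 46),
    (290900.0, 1666000.0, 83),
    (400660.0, 500000.0, 20),
    (260300.0, 1606500.0, 84),
]

implied = []
for current, old, listed in rows:
    # Discount implied by the prices, as a whole-number percentage.
    percent = round((1 - current / old) * 100)
    implied.append(percent)
    print(f"listed {listed}%  implied {percent}%")
```

Rows with an Old Price of 0 have to be excluded from a check like this, since the division is undefined for them.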

In [381]:
jumia_df

Unnamed: 0,Product Name,Current Price,Old Price,Discount,Rating,Vendor
0,"AOCWEI 14.1"" Intel Celeron N4020 6GB+256GB, SS...",294325.0,547000.0,0.46,4.6,Jumia
1,AOCWEI Laptop Windows 11 Intel Celeron 6GB+256...,290900.0,1666000.0,0.83,5.0,Jumia
2,Hp Refurbished EliteBook 840 G6 Intel Core I5-...,400660.0,500000.0,0.20,0.0,Jumia
3,"WOZIFAN 14.1""Intel Celeron N4020 6GB+256GB,SSD...",260300.0,1606500.0,0.84,3.8,Jumia
4,Hp Stream 11 Intel Celeron 2GB RAM- 32GB HDD W...,135000.0,0.0,0.00,5.0,Jumia
...,...,...,...,...,...,...
2074,Asus Vivobook X515JA Intel Core I5 4GB/512 SSD...,699000.0,0.0,0.00,0.0,Jumia
2075,Hp 15 - Intel Celeron - 500GB HDD 4GB RAM - Wi...,325000.0,0.0,0.00,0.0,Jumia
2077,Hp ProBook 11 X360 TOUCH INTEL CORE I5 512GB S...,330000.0,0.0,0.00,0.0,Jumia
2078,Hp 15 TOUCHSCREEN 12TH GEN INTEL CORE I5 16GB ...,760500.0,880000.0,0.14,0.0,Jumia


In [383]:
jumia_df.to_csv("jumia_clean_laptop.csv", index=False)