Author: Gickel OKABI

Python developer

.

The successful candidate will be tasked with developing a Python script that can scrape all the product information from the website www.scoop.co.za. The information to be scraped includes, but is not limited to:

Product names
Product descriptions
Product weights and dimensions
Product prices (various prices as they have retail and dealer pricing)

The data should be saved in an easy-to-use and easy-to-read format such as a CSV file or JSON file.

Finally, I will retain full ownership of the script upon completion and payment, and would like to be able to use it for future scraping tasks. Please ensure that the code is well-documented and easy to modify as needed.

.

In [5]:
# Import necessary libraries

import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    }

.

First of all, we will create a function that will allow us to extract the menus and submenus from each menu to automate searches and extract the desired information.


In [70]:
def creat_research_option_file(url):
    
    
    menu_liste = []
    sub_menu_liste = []
    sub_menu_link_liste = []
    
    req = requests.get(url, headers=headers)
    
    soup = BeautifulSoup(req.content, 'html.parser')
    
    #Extract menus
    menu = soup.find('div', class_=["section-item-content nav-sections-item-content"] , id="store.menu").find('li', class_=["level0 nav-1 category-item first level-top parent"])

    menu_ul = menu.select('ul.level0.submenu > li')
    
    menu_list = [menu for menu in menu_ul]

    # for all menu in a menu list
    for m in menu_list:
        menu_text = m.find('a').text
    

        #Select the submenus
        sous_menu_soup = m.select('ul.level1 > li.level2')

        # For all the submenu of each menu
        sub_menu_text = [sub_menu_text.text for sub_menu_text in sous_menu_soup]
        
        #Get links for each category and subcategory
        sub_menu_link = ["".join([sub_menu_text.find('a')['href'],'?product_list_limit=all']) for sub_menu_text in sous_menu_soup]
        

        menu_liste.append(menu_text)
        sub_menu_liste.append(sub_menu_text)
        sub_menu_link_liste.append(sub_menu_link)
    
    option_frame = pd.DataFrame({
        'menu' : menu_liste,
        'sub_menu' : sub_menu_liste,
        'sub_menu_link' : sub_menu_link_liste
    })
    
        
    return option_frame

.

In [71]:
option_frame = creat_research_option_file("https://scoop.co.za/")

In [72]:
option_frame.shape

(11, 3)

In [73]:
option_frame.head()
    

Unnamed: 0,menu,sub_menu,sub_menu_link
0,wired-networking,"[fast-ethernet, gigabit-ethernet, poe-switches...",[https://scoop.co.za/products/wired-networking...
1,wireless-networking,"[2,4-ghz-antennas, 5,8-ghz-wireless-antennas, ...",[https://scoop.co.za/products/wireless-network...
2,structured-cabling,"[free-standing-cabinets, wall-mount-cabinets, ...",[https://scoop.co.za/products/structured-cabli...
3,fibre,"[fibre-cable, fibre-flyleads, fibre-pigtails, ...",[https://scoop.co.za/products/fibre/fibre-cabl...
4,installation-accessories,"[brackets-&-masts, tape, tools, trunking, cabl...",[https://scoop.co.za/products/installation-acc...


11


In [56]:
# former un lien de recherche à partir du dataframe

base_url = "https://scoop.co.za/products/"

In [89]:
def make_like_data_frame(dataframe):

    menu_liste = []
    sub_menu_liste = []
    sub_menu_link_liste = []

    for index, row in dataframe.iterrows ():
        sub_menu = row['sub_menu']
        sub_menu_link = row['sub_menu_link']
        
        for i in range(len(sub_menu_link)):    
            sub_menu_link_liste.append(sub_menu_link[i])
            menu_liste.append(row['menu'])
            sub_menu_liste.append(sub_menu[i])
            
    option_frame = pd.DataFrame({
        'menu' : menu_liste,
        'sub_menu' : sub_menu_liste,
        'sub_menu_link' : sub_menu_link_liste
    })
    
    return option_frame

In [90]:
like_data_frame = make_like_data_frame(option_frame)

In [91]:
like_data_frame.head()

Unnamed: 0,menu,sub_menu,sub_menu_link
0,wired-networking,fast-ethernet,https://scoop.co.za/products/wired-networking/...
1,wired-networking,gigabit-ethernet,https://scoop.co.za/products/wired-networking/...
2,wired-networking,poe-switches,https://scoop.co.za/products/wired-networking/...
3,wired-networking,ubiquiti-unifi-switch,https://scoop.co.za/products/wired-networking/...
4,wired-networking,ubiquiti-edgemax,https://scoop.co.za/products/wired-networking/...


In [92]:
for link in like_data_frame['sub_menu_link']:
    print(link)

https://scoop.co.za/products/wired-networking/fast-ethernet/?product_list_limit=all
https://scoop.co.za/products/wired-networking/gigabit-ethernet/?product_list_limit=all
https://scoop.co.za/products/wired-networking/poe-switches/?product_list_limit=all
https://scoop.co.za/products/wired-networking/ubiquiti-unifi-switch/?product_list_limit=all
https://scoop.co.za/products/wired-networking/ubiquiti-edgemax/?product_list_limit=all
https://scoop.co.za/products/wired-networking/mikrotik-cloudcore/?product_list_limit=all
https://scoop.co.za/products/wired-networking/mikrotik-rack-mount/?product_list_limit=all
https://scoop.co.za/products/wired-networking/mikrotik-switches/?product_list_limit=all
https://scoop.co.za/products/wired-networking/mikrotik-desktop/?product_list_limit=all
https://scoop.co.za/products/wired-networking/mikrotik-gpen/?product_list_limit=all
https://scoop.co.za/products/wired-networking/mikrotik-software/?product_list_limit=all
https://scoop.co.za/products/wired-networ

.

Maintenant nous pouvons scraper les données sur tout le site car nous avons toutes les combinaisons de recherches possible.

In [135]:
def data_collecte(links):
    Product_names = []
    Product_descriptions = []
    Dealer_Excl_VATs = []
    Retail_Price_Incl_VATs = []
    Product_specifications = []
    for url in links:
        req = requests.get(url, headers=headers)
        soup_ = BeautifulSoup(req.content, 'html.parser')
        products = soup_.find_all('li', class_=["item product product-item"])
        for product in products:
            product_link = product.find('div', class_=["product-item-info"]).find('a')['href']
            product_link_contenu = requests.get(product_link, headers=headers)
            product_soup = BeautifulSoup(product_link_contenu.content, 'html.parser')


            product_name = product_soup.find('div', class_ = ["page-title-wrapper product"]).text
            Product_description = product_soup.find('div', class_=["product attribute description"]).text

            product_price = [re.sub(r"\n","",prices.text) for prices in product_soup.find_all('div', class_=["product-item-final-price"])]

            Dealer_Excl_VAT = product_price[0].split(':')[-1] 
            Retail_Price_Incl_VAT = product_prices[1].split(':')[-1] 

            Product_info = product_soup.select('td.col.data[data-th = "Specifications"]')

            Product_specification = [info.text for info in Product_info]
            
            
            Product_names.append(product_name)
            Product_descriptions.append(Product_description)
            Dealer_Excl_VATs.append(Dealer_Excl_VAT)
            Retail_Price_Incl_VATs.append(Retail_Price_Incl_VAT)
            Product_specifications.append(Product_specification)
            

    dataframe = pd.DataFrame({
                'Product_names' : Product_names,
                'Product_descriptions':Product_descriptions,
                'Dealer_Excl_VATs' : Dealer_Excl_VATs,
                'Retail_Price_Incl_VATs' :Retail_Price_Incl_VATs,
                'Product_specifications' : Product_specifications,
            })        
        
    return dataframe   

    

Function 1 takes as a parameter the dataframe containing the whole combination of searches


In [137]:
dataframe = data_collecte(like_data_frame['sub_menu_link'])

In [138]:
dataframe.head()

Unnamed: 0,Product_names,Product_descriptions,Dealer_Excl_VATs,Retail_Price_Incl_VATs,Product_specifications
0,\n\nTenda 16 Port Fast Ethernet Desktop Switch...,\nScoop's S16 by Tenda is a 16-port Desktop Et...,R450.00,"R7,150.00",[Ethernet Ports: 16x 10/100\r\nHardware Button...
1,\n\nTenda 16 Port Fast Ethernet Rack Mount Swi...,\nTenda's TEF1016D is a 16-Port 10/100Mbps Fas...,R750.00,"R7,150.00",[Ethernet Ports: 16x 10/100\r\nHardware Button...
2,\n\nCudy 16 Port Gigabit Rack-Mount Switch | G...,\nCudy's CD-GS1016 is a 16 Port Gigabit Switch...,"R1,085.00","R7,150.00",[Ethernet Ports: 16x 10/100/1000 \r\nHardware ...
3,\n\nCudy 24 Port Gigabit Rack-Mount Switch | G...,\nCudy's CD-GS1024 is a 24 Port Gigabit Switch...,"R1,425.00","R7,150.00",[Ethernet Ports: 24x 10/100/1000 \r\nHardware ...
4,\n\nCudy 5 Port Gigabit Desktop Switch | GS105...,\nEstimated Arrival: 19 June at our Midrand Br...,R250.00,"R7,150.00",[Ethernet Ports: 5x 10/100/1000 \r\nHardware B...


.

Thanks for looking through the notebook.

If you have a data collection project, do not hesitate to entrust it to me. I will provide you with both the code and the collected dataset.

Gickel OKABI