<img src="https://s3.cloud.cmctelecom.vn/tinhte1/2018/03/4267082_CV.jpg" width=browser_width >

# **Tiki Web Scraping with Selenium**


**Overview**: Build a web crawler that takes in a Tiki URL and returns a DataFrame.

**Requirements** 
1. Your function should take in a URL and return a pandas DataFrame.
2. The final DataFrame should contain at least the following information:
    * Product Name
    * Price
    * URL of the product image
    * URL of that product page

These 4 fields are mandatory. The extra information listed at the end of the project is optional.

# Install resources

In [None]:
# install selenium and other resources for crawling data
!pip install selenium
!apt-get update
!apt install chromium-chromedriver

Collecting selenium
  Downloading selenium-4.1.3-py3-none-any.whl (968 kB)
Collecting urllib3[secure,socks]~=1.26
  Downloading urllib3-1.26.8-py2.py3-none-any.whl (138 kB)
Collecting trio~=0.17
  Downloading trio-0.20.0-py3-none-any.whl (359 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting sniffio
  Downloading sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting outcome
  Downloading outcome-1.1.0-py2.py3-none-any.whl (9.7 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting cryptography>=1.3.4
  Downloading cryptography-36.0.2-cp36-abi3-manylinux_2_24_x86_64.whl (3.6 MB)

# Import necessary libraries

In [None]:
import re
import time
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Configuration for Driver and links

In [None]:
# Urls
TIKI = 'https://tiki.vn'
MAIN_CATEGORIES = [
    {'Name': 'Điện Thoại - Máy Tính Bảng',
     'URL': 'https://tiki.vn/dien-thoai-may-tinh-bang/c1789?src=c.1789.hamburger_menu_fly_out_banner'},

    {'Name': 'Điện Tử - Điện Lạnh',
     'URL': 'https://tiki.vn/tivi-thiet-bi-nghe-nhin/c4221?src=c.4221.hamburger_menu_fly_out_banner'},

    {'Name': 'Phụ Kiện - Thiết Bị Số', 
     'URL': 'https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815?src=c.1815.hamburger_menu_fly_out_banner'},

    {'Name': 'Laptop - Thiết bị IT', 
     'URL': 'https://tiki.vn/laptop-may-vi-tinh/c1846?src=c.1846.hamburger_menu_fly_out_banner'},

    {'Name': 'Máy Ảnh - Quay Phim', 
     'URL': 'https://tiki.vn/may-anh/c1801?src=c.1801.hamburger_menu_fly_out_banner'},

    {'Name': 'Điện Gia Dụng', 
     'URL': 'https://tiki.vn/dien-gia-dung/c1882?src=c.1882.hamburger_menu_fly_out_banner'},

    {'Name': 'Nhà Cửa Đời Sống', 
     'URL': 'https://tiki.vn/nha-cua-doi-song/c1883?src=c.1883.hamburger_menu_fly_out_banner'},

    {'Name': 'Hàng Tiêu Dùng - Thực Phẩm', 
     'URL': 'https://tiki.vn/bach-hoa-online/c4384?src=c.4384.hamburger_menu_fly_out_banner'},

    {'Name': 'Đồ chơi, Mẹ & Bé', 
     'URL': 'https://tiki.vn/me-va-be/c2549?src=c.2549.hamburger_menu_fly_out_banner'},

    {'Name': 'Làm Đẹp - Sức Khỏe', 
     'URL': 'https://tiki.vn/lam-dep-suc-khoe/c1520?src=c.1520.hamburger_menu_fly_out_banner'},

    {'Name': 'Thể Thao - Dã Ngoại', 
     'URL': 'https://tiki.vn/the-thao/c1975?src=c.1975.hamburger_menu_fly_out_banner'},

    {'Name': 'Xe Máy, Ô tô, Xe Đạp', 
     'URL': 'https://tiki.vn/o-to-xe-may-xe-dap/c8594?src=c.8594.hamburger_menu_fly_out_banner'},

    {'Name': 'Hàng quốc tế', 
     'URL': 'https://tiki.vn/hang-quoc-te/c17166?src=c.17166.hamburger_menu_fly_out_banner'},

    {'Name': 'Sách, VPP & Quà Tặng', 
     'URL': 'https://tiki.vn/nha-sach-tiki/c8322?src=c.8322.hamburger_menu_fly_out_banner'},

    {'Name': 'Voucher - Dịch Vụ - Thẻ Cào', 
     'URL': 'https://tiki.vn/voucher-dich-vu/c11312?src=c.11312.hamburger_menu_fly_out_banner'}
]

# Global driver to use throughout the script
DRIVER = None
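
Each category URL above ends its path with a `c<digits>` segment (for example `c1815`), which looks like a numeric category id. The sketch below pulls that number out with a regular expression. `category_id` is a hypothetical helper, not part of the original notebook, and treating the number as a category id is an inference from the URLs listed, not something Tiki documents here.

```python
import re

def category_id(url):
    """Return the number from the trailing 'c<digits>' path segment, or None."""
    # Match '/c' followed by digits, ending at the query string or end of URL.
    match = re.search(r'/c(\d+)(?:\?|$)', url)
    return int(match.group(1)) if match else None

example_url = 'https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815?src=c.1815.hamburger_menu_fly_out_banner'
print(category_id(example_url))
```

A key like this could be handy for naming output files per category.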

# 0. Function to Start and Close Driver

In [None]:
# Wrapper to quit the driver if it has been created
def close_driver():
    global DRIVER
    if DRIVER is not None:
        DRIVER.quit()  # quit() ends the whole browser session; close() only closes the current window
    DRIVER = None

# Function to (re)start driver
def start_driver(force_restart=False):
    global DRIVER

    if force_restart:
        close_driver()

    # Setting up the driver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # run Chrome in the background without opening a browser window
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    # chromedriver is looked up on PATH (installed in the first cell)
    DRIVER = webdriver.Chrome(options=options)
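
As an alternative to a global driver with explicit start and close calls, the driver lifecycle can be wrapped in a context manager so the browser is shut down even if scraping raises an error. This is a minimal sketch, not what the notebook uses: `managed_driver` is a hypothetical name, and `factory` is assumed to be any zero-argument callable that returns an object with a `quit()` method, which is true of Selenium drivers.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory):
    """Create a driver via factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions alike.
        driver.quit()
```

With Selenium this might be used as `with managed_driver(lambda: webdriver.Chrome(options=options)) as driver: ...`, where `options` is configured as in `start_driver()`.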

In [None]:
close_driver()

In [None]:
start_driver(force_restart=True)

In [None]:
DRIVER.get('https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815?src=c.1815.hamburger_menu_fly_out_banner')

In [None]:
DRIVER.current_url

'https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815?src=c.1815.hamburger_menu_fly_out_banner'

In [None]:
news_elements = DRIVER.find_elements(By.CLASS_NAME, 'product-item')

In [None]:
len(news_elements)

48

The first page of this category contains 48 product items. The next section (Function to get info from one product) extracts the information from each of them.

# 1. Function to get info from one product

>**NOTE:** Sometimes the web element returned by the driver can be faulty because of the way the website is set up. In that case, calling `.text` on the element returns an empty string even though text is visible inside the element when you check it manually with Inspect. When this happens, use `.get_attribute('innerHTML')` instead of `.text`.

In [None]:
news_element = news_elements[0]
print(news_element)


<selenium.webdriver.remote.webelement.WebElement (session="b1634437eaec614b6ed5e84f26ec66d5", element="253eb3bd-a437-4f0b-b092-b9ee7f9563ed")>


In [None]:
# Function to extract product info from a single product element
def get_product_info_single(news_element):
    info = {'name':'',
            'price':'',
            'discount':'',
            'product_url':'',
            'image':'',
            'freeship':'',
            'rating':'',
            'badge_rhht':'',
            'freegift':'',
            'installment':''}

    # name: innerHTML of the <span> inside the element with class "name"
    try:
        info['name'] = news_element.find_element(By.CLASS_NAME, 'name').find_element(By.TAG_NAME, 'span').get_attribute('innerHTML')
    except NoSuchElementException:
        pass

    # price
    try:
        info['price'] = news_element.find_element(By.CLASS_NAME, 'price-discount__price').get_attribute('innerHTML')
    except NoSuchElementException:
        info['price'] = -1

    # discount
    try:
        info['discount'] = news_element.find_element(By.CLASS_NAME, 'price-discount__discount').get_attribute('innerHTML')
    except NoSuchElementException:
        info['discount'] = False

    # product link: the product element itself carries the href attribute
    info['product_url'] = news_element.get_attribute('href')

    # product image: src of the <img> inside the thumbnail container
    try:
        thumbnail = news_element.find_element(By.CLASS_NAME, 'webpimg-container')
        info['image'] = thumbnail.find_element(By.TAG_NAME, 'img').get_attribute('src')
    except NoSuchElementException:
        pass

    # freeship badge
    try:
        thumbnail = news_element.find_element(By.CLASS_NAME, 'thumbnail')
        info['freeship'] = thumbnail.find_element(By.TAG_NAME, 'img').get_attribute('src')
    except NoSuchElementException:
        pass

    # star rating: the digits in the style attribute are read as a percentage and scaled to 5 stars
    try:
        elem_review = news_element.find_element(By.CLASS_NAME, 'average')
        info['rating'] = float(re.sub(r'\D', '', elem_review.get_attribute('style'))) / 100 * 5
    except (NoSuchElementException, ValueError):
        # no rating element, or no digits in its style attribute
        info['rating'] = 0

    # badge "Rẻ hơn hoàn tiền"
    try:
        badge = news_element.find_element(By.CLASS_NAME, 'item')
        info['badge_rhht'] = badge.find_element(By.TAG_NAME, 'img').get_attribute('src')
    except NoSuchElementException:
        info['badge_rhht'] = False

    # free gift badge
    try:
        fg = news_element.find_element(By.CLASS_NAME, 'freegift_list')
        info['freegift'] = fg.find_element(By.TAG_NAME, 'img').get_attribute('src')
    except NoSuchElementException:
        info['freegift'] = False

    # installment badge (previously written to info['freegift'] by mistake)
    try:
        inst = news_element.find_element(By.CLASS_NAME, 'badge_benefits')
        info['installment'] = inst.find_element(By.TAG_NAME, 'img').get_attribute('src')
    except NoSuchElementException:
        info['installment'] = False

    return info
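
The rating step strips every non-digit from the `style` attribute, so a value with a decimal point (say `92.5%`) would be read as `925`. The sketch below is a more defensive variant. It assumes the style string has the form `width: N%`, which is inferred from the arithmetic in the function above and not confirmed against Tiki's markup; `rating_from_style` is a hypothetical helper name.

```python
import re

def rating_from_style(style, max_stars=5):
    """Convert a 'width: N%' style string to a star rating; 0.0 if absent."""
    # Capture an integer or decimal percentage after 'width:'.
    match = re.search(r'width:\s*(\d+(?:\.\d+)?)\s*%', style or '')
    if match is None:
        return 0.0
    return round(float(match.group(1)) * max_stars / 100, 2)

print(rating_from_style('width: 90%;'))
```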


In [None]:
get_product_info_single(news_element)



{'badge_rhht': 'https://salt.tikicdn.com/ts/upload/9f/32/dd/8a8d39d4453399569dfb3e80fe01de75.png',
 'discount': '-41%',
 'freegift': False,
 'freeship': 'https://salt.tikicdn.com/ts/upload/dc/0d/49/3251737db2de83b74eba8a9ad6d03338.png',
 'image': 'https://salt.tikicdn.com/cache/200x200/ts/product/66/07/85/9239ee60729c18c92f5393841d4864f4.jpg',
 'installment': '',
 'name': '',
 'price': '129.000 ₫',
 'product_url': 'https://tka.tiki.vn/pixel/pixel?data=djAwMQtPOst0ylYv0Zm9KyF-a4JalO_nwhHbnGOwk8kBS0qqh6b0x0UD5lt9YSAMrixUSG8Ag0P4wnIAQESTYQE1F5DM9Ba8k0uswm3tfCpgkDbm8VJ7-2lF33T2TSzNKvaOrTQB00F6ZjeJNG1EdpzSM_C4qvH5r44UiVp_t2_5fWXQGq4u_n79AB17_DYLx1Z53ctPTIyNLB1SVh-I4wFW2qQVVpIZAJLnAh6qGHM6OnkrdI-qquxKcL3WCv6gTlPJqwQ6NBSZ0eN4sEzC2vgEqyOLgJCZp9NMzlSosdS4L96bhZ8qWdAEp7hyeRWX1dGK13iCPc5phuf3oxJ7t1t0WwshXn4es3n7Bv4MIUstgVRZmdJs4MpGE2SvU_zJxCQElRd3YANx17NVB2wqJjZUhKA1mt93WaOh6DYooCTLzcVxwSAebe_xhBSrAl43NrlkdKXkfWwvzl8lnq3lKdu4S07GBxlubz-7VFh_WMfjgkTZGOh2CYbLqtf5Glw_5KsVgtit4IfZ2Y7xM-Y1c42s9bIeBs7c

# 2. Function to scrape info of all products from a Page URL

To make your own life easier, you should use the function for a single product inside this one.

In [None]:
# Function to scrape all products from a page
def get_product_info_from_page(page_url):

    global DRIVER

    data = []            # Store the info dictionary of each product in this list
    DRIVER.get(page_url) # Use the driver to load the product listing page
    time.sleep(2)        # Wait for the page to finish loading before querying elements

    products_all = DRIVER.find_elements(By.CLASS_NAME, 'product-item')
    print(type(len(products_all)))  # debug output: the count is an int
    print(f'Found {len(products_all)} products')

    for product in products_all:
        result = get_product_info_single(product)
        data.append(result)
    return data
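
The fixed `time.sleep(2)` either waits longer than needed or not long enough on a slow connection. A common alternative is to poll until the elements appear or a timeout expires. The helper below is a minimal pure-Python sketch of that idea; `wait_until` is a hypothetical name, not a Selenium function. Selenium ships its own version of this pattern as `WebDriverWait` in `selenium.webdriver.support.ui`.

```python
import time

def wait_until(predicate, timeout=10.0, poll=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll)
```

Inside `get_product_info_from_page()` this could replace the sleep with something like `products_all = wait_until(lambda: DRIVER.find_elements(By.CLASS_NAME, 'product-item'))`. Note that a page with no products would then raise `TimeoutError` instead of returning an empty list, so the caller would need to handle that case.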

In [None]:
get_product_info_from_page('https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815?src=c.1815.hamburger_menu_fly_out_banner')

<class 'int'>
Found 48 products




[{'badge_rhht': False,
  'discount': '-7%',
  'freegift': False,
  'freeship': 'https://salt.tikicdn.com/ts/upload/dc/0d/49/3251737db2de83b74eba8a9ad6d03338.png',
  'image': 'https://salt.tikicdn.com/cache/200x200/ts/product/8b/7b/4d/b552547d3f21139e0bf413c60d7d7105.jpg',
  'installment': '',
  'name': '',
  'price': '649.000 ₫',
  'product_url': 'https://tka.tiki.vn/pixel/pixel?data=djAwMet2jZ0GEMnTDq25Gw0qRhu5lQd9POr81XByr87wJLxc5W8P0t_tQBaQNo5A_VsN167VGP5Cw8OgmYgH2HNnT2jH4ntTU1LazoPPd9OnsweAPtPKpZQkYDFhRxJJHo16uKMAS6XQqwav8DPu9IE-8JZnOW9KmvlEjy97h-BcohUaZ6F1ujvvAnFy38-DL_OaPODeHd0mYYE_51kRDyTxbch5AvXlLlN_827uRAb8pNxZsoWa3aOPwCHGkBSsQ8aql1YBIOc7SML1ZE1EdXlFCK-HO1GDVChMHaucbayAmWjW4Fgi6lQMd6YFyG9uJ7h2uU4l6USqF_Fdhrjlo6gxmeAsWelsq08xAFSeOFFUY6CFdzcL6_XP7J_N0JmN21p91R7yEa9FLzGWUNbSQVfIpkVABLPkCycnRKoQdqdKiescueGSCTbgKxLFfRVxk4vKguiwDtU7wDf0RPPy92WU85_Kq346qEOS43g9nQHDRUmbg09ctbyW9gL2qntrfqkFsnCUPvz6KZAa-acdHaWP7wWniKVzaEtSChW6yU5d9ACOr905ehDs0g01Be898PJV3m1EdZiLjgTBXKpkemXFgVfh0kBLkJFZ1

# Function to scrape one main category (Optional)

In [None]:
### Function to get product info from a main category
def get_product_info_from_category(cat_url, max_page=0, extra_info=False):
    '''
    Scrape for multiple pages of products of a category.
    Uses get_product_info_from_page().

    Args:
        cat_url: (string) a url string of a category
        max_page: (int) an integer denoting the maximum number of pages to scrape.
                  Default value is 0 to scrape all pages.
        extra_info: (bool) accepted for the optional extra fields, but not used by this function.
    Returns: 
        products: a list in which every element is a dictionary of one product's information
    '''
    products = []

    page_n = 1
    main_url, url_opts = cat_url.split('?')
    cat_page_url = main_url + f'?page={page_n}&' + url_opts
    product_list = get_product_info_from_page(cat_page_url)

    while len(product_list)>0:
        products.extend(product_list)
        page_n += 1

        # stop_flag = False if max_page <= 0 else (page_n > max_page)
        stop_flag = max_page>0 and page_n>max_page # For stopping the scrape according to max_page
        if stop_flag:
            break

        cat_page_url = main_url + f'?page={page_n}&' + url_opts
        product_list = get_product_info_from_page(cat_page_url)
    
    return products
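
The function above builds each page URL with `cat_url.split('?')`, which raises a `ValueError` if the category URL has no query string. The sketch below builds the same URL with `urllib.parse` and also handles that case. `build_page_url` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def build_page_url(cat_url, page_n):
    """Return cat_url with its 'page' query parameter set to page_n."""
    parts = urlsplit(cat_url)
    # Keep existing parameters, dropping any previous 'page' value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'page']
    # Put 'page' first to mirror the URL layout used in the function above.
    query.insert(0, ('page', str(page_n)))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(build_page_url('https://tiki.vn/may-anh/c1801', 3))
```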

# 3. Start scraping

Try to scrape at least 2 pages of your chosen main category.

In [None]:

main_cat  = MAIN_CATEGORIES[2] # Pick any category you like by changing the index

start_driver(force_restart=True)
print('Scraping', main_cat['Name'])
print('Link:', main_cat['URL'])

prod_data = [] # STORE YOUR PRODUCT INFO DICTIONARIES IN HERE
num_max_page = 3 # Scraping 3 pages of my main category
extra_info = False # Whether to scrape more info or not

prod_per_cat = get_product_info_from_category(main_cat['URL'], num_max_page, extra_info=extra_info)
prod_data.extend(prod_per_cat)

print(prod_data)

close_driver() # Close driver when we're done

Scraping Phụ Kiện - Thiết Bị Số
Link: https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815?src=c.1815.hamburger_menu_fly_out_banner
<class 'int'>
Found 48 products




<class 'int'>
Found 48 products
<class 'int'>
Found 48 products
[{'name': '', 'price': '649.000 ₫', 'discount': '-7%', 'product_url': 'https://tka.tiki.vn/pixel/pixel?data=djAwMXQGWCK9N6NkrrWhDWeN7nshlqpT1heWTJlRB0bNFd_Kipyi1XLDr52U-500rc9x8RZvwRZO_zQkngty-3thhqdM5apJGtb3-YvtjheyDCtB3RMUeRwkAgX0NTuk-ZkNvI1OvL6TuOCjEZw1OyXPCZj-oEyz4nsJCD9LnSF0KtP3EmFDvSyNn_YNWRIR8_bRrT5qGZMMq-dHdA0w10iM9_V2Pw4z7gcqPcHpnxDZULZXNfmCnAIG3l5h4sFrh3o9uugVwtSgjVhIpCT5lQx6hFEJhQ70TsMPi2c_obxJQV4TIFdcYFZW1Zjb6CKq_l0O8X-3I9kLp9UfHNDpC8pBmdfBSXFhucdYOfoENECJOncWu0DyqkGsT6-NefWp30iaV83ABXFoQF0rd2ORC0QQtHhVdx7LmQezVT_UFrzSRJsvVxWj9l3l5OLCYphVQ4LoUl2OGHFpYUlKudwgJRwr9hrcPHdOhTHsJRHqoeaEynnfGFbh2-T4Xa9Rifs9JXFA1wHXYo-9FDSgxKBMo1hWz79d6-w58t-1kLAthB2LxCrXKnLNxl4KnWI9ZFz4lxiIA00hvkP8Ub0-KvTJXtM2VKbDMdgSapHMvL0biYrMwSKl8NK1K72ZfLwSY9UVv_xPwGkvoeP2Jc13dXJ5PJgjl1IhN3mUYefvuqbSp_KRGZhpgtTnp3WXeaV9SiUOBHaw3MswcObM88mxCpI2Fdpp9JR5QScWL0YmYXdVvOkDl8sFyOQrJmO80evTBfBmuZStKNDVmXhEhYukL7ad1Pkg3QZlOCdwEQEAzsKny7QhJlQPzodFTTV-LBc0

In this section, we scraped 3 pages of the chosen category, as shown above. The results are now ready to be converted into a DataFrame and saved as a .csv file below.

# 4. Run cell below to save your scraped data into a .csv file
If you've scraped correctly, then the cell should run without error and the information in the table should look reasonable.

In [None]:
# SAVE DATA TO CSV FILE
df = pd.DataFrame(data=prod_data, columns=prod_data[0].keys())
df.to_csv('tiki_products.csv')

n_products_to_view = 50 # Change this as you like to check more products
df.head(n_products_to_view)

Unnamed: 0,name,price,discount,product_url,image,freeship,rating,badge_rhht,freegift,installment
0,,649.000 ₫,-7%,https://tka.tiki.vn/pixel/pixel?data=djAwMXQGW...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/ts/upload/dc/0d/49/32...,4.5,False,False,
1,,1.540.000 ₫,False,https://tiki.vn/loa-tro-giang-bluetooth-soundm...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/ts/upload/dc/0d/49/32...,0.0,https://salt.tikicdn.com/ts/upload/51/ac/cc/52...,False,
2,,225.000 ₫,-44%,https://tiki.vn/tai-nghe-nhet-tai-jbl-c150si-h...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/ts/upload/dc/0d/49/32...,4.6,https://salt.tikicdn.com/ts/upload/9f/32/dd/8a...,False,
3,,389.900 ₫,-<!-- -->33<!-- -->%,https://tiki.vn/loa-bluetooth-suntek-kimiso-km...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/ts/upload/dc/0d/49/32...,0.0,False,False,
4,,450.000 ₫,False,https://tka.tiki.vn/pixel/pixel?data=djAwMS7jV...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/ts/upload/dc/0d/49/32...,0.0,False,False,
5,,237.000 ₫,False,https://tiki.vn/webcam-hoc-online-hd-full-1080...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/cache/200x200/ts/prod...,0.0,https://salt.tikicdn.com/ts/upload/51/ac/cc/52...,False,
6,,490.000 ₫,-11%,https://tiki.vn/adapter-sac-1-cong-usb-type-c-...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/ts/upload/dc/0d/49/32...,5.0,https://salt.tikicdn.com/ts/upload/9f/32/dd/8a...,False,
7,,339.000 ₫,False,https://tiki.vn/tai-nghe-chup-tai-game-thu-rem...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/ts/upload/dc/0d/49/32...,0.0,False,False,
8,,199.000 ₫,-33%,https://tka.tiki.vn/pixel/pixel?data=djAwMacRq...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/ts/upload/dc/0d/49/32...,5.0,False,False,
9,,169.000 ₫,False,https://tiki.vn/xuat-khau-my-cap-day-hdmi-full...,https://salt.tikicdn.com/cache/200x200/ts/prod...,https://salt.tikicdn.com/cache/200x200/ts/prod...,5.0,False,False,
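
The table above stores prices and discounts as raw strings such as `649.000 ₫` and `-<!-- -->33<!-- -->%`, and uses `-1` or `False` when a field is missing. For numeric analysis these values would need cleaning first. The helpers below are a minimal sketch of one way to do that. `parse_price` and `parse_discount` are hypothetical names, and the sketch assumes prices use `.` only as a thousands separator, as in the rows shown.

```python
import re

def parse_price(raw):
    """'649.000 ₫' -> 649000. Returns None for non-string or digit-free input."""
    if not isinstance(raw, str):
        return None
    digits = re.sub(r'\D', '', raw)  # drop separators, spaces and the currency sign
    return int(digits) if digits else None

def parse_discount(raw):
    """'-<!-- -->33<!-- -->%' -> 33. Returns None when no discount is present."""
    if not isinstance(raw, str):
        return None
    text = re.sub(r'<!--.*?-->', '', raw)  # strip HTML comment markers left by innerHTML
    match = re.search(r'(\d+)\s*%', text)
    return int(match.group(1)) if match else None
```

With pandas these could be applied column-wise, for example `df['price'].map(parse_price)`.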


# OPTIONAL: Extra information


If you've successfully completed all of the above, you can try to get extra information for each product.

* Does it have FreeShip? <img src="https://salt.tikicdn.com/ts/upload/dc/0d/49/3251737db2de83b74eba8a9ad6d03338.png">
* Number of reviews?
* Number of sales?
* How many stars or percentage of stars?
* Does it have the "badge under price" (Rẻ hơn hoàn tiền)? <img src="https://salt.tikicdn.com/ts/upload/51/ac/cc/528e80fe3f464f910174e2fdf8887b6f.png">
* Discount percentage?
* Does it have the "shocking price" badge? <img src="https://salt.tikicdn.com/ts/upload/75/34/d2/4a9a0958a782da8930cdad8f08afff37.png">
* Can it be paid for in installments? <img src="https://salt.tikicdn.com/ts/upload/ba/4e/6e/26e9f2487e9f49b7dcf4043960e687dd.png">
* Does it come with free gifts? <img src="https://salt.tikicdn.com/ts/upload/47/35/8c/446f61d046eba9a305d3f39dc0834c4a.png">
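
Several of the questions above are yes/no questions, while the scraper stores each badge as an image URL, `False`, or an empty string. The sketch below derives boolean flags from those fields. `has_badge` and `add_badge_flags` are hypothetical helpers, and the field names are taken from the `info` dictionary built in `get_product_info_single()`.

```python
def has_badge(value):
    """True only when the scraped value is a non-empty string (a badge image URL)."""
    return isinstance(value, str) and value.strip() != ''

def add_badge_flags(info, fields=('freeship', 'badge_rhht', 'freegift', 'installment')):
    """Return a copy of info with an extra has_<field> boolean for each badge field."""
    flags = dict(info)
    for field in fields:
        flags[f'has_{field}'] = has_badge(info.get(field))
    return flags
```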
    
