<img src="https://s3.cloud.cmctelecom.vn/tinhte1/2018/03/4267082_CV.jpg" width=browser_width >

# **Tiki Web Scraping with Selenium**


**Overview**: Build a web crawler that takes a Tiki URL and returns a pandas dataframe.

**Due Date**: Before Monday next week.

**Requirements** 
1. Your function should take a URL and return a pandas dataframe.
2. The final dataframe should contain at least the following information:
    * Product Name
    * Price
    * URL of the product image
    * URL of that product page

Try to follow the guidelines below.

# Install resources

In [None]:
# install selenium and other resources for crawling data
!pip install selenium
!apt-get update
!apt install chromium-chromedriver


# Import necessary libraries

In [None]:
import re
import time
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, InvalidSessionIdException


# Configuration for Driver and links

In [None]:
# Urls
TIKI = 'https://tiki.vn'
MAIN_CATEGORIES = [
    {'Name': 'Điện Thoại - Máy Tính Bảng',
     'URL': 'https://tiki.vn/dien-thoai-may-tinh-bang/c1789?src=c.1789.hamburger_menu_fly_out_banner'},

    {'Name': 'Điện Tử - Điện Lạnh',
     'URL': 'https://tiki.vn/tivi-thiet-bi-nghe-nhin/c4221?src=c.4221.hamburger_menu_fly_out_banner'},

    {'Name': 'Phụ Kiện - Thiết Bị Số', 
     'URL': 'https://tiki.vn/thiet-bi-kts-phu-kien-so/c1815?src=c.1815.hamburger_menu_fly_out_banner'},

    {'Name': 'Laptop - Thiết bị IT', 
     'URL': 'https://tiki.vn/laptop-may-vi-tinh-linh-kien/c1846?page=0'}, #'https://tiki.vn/laptop-may-vi-tinh/c1846?src=c.1846.hamburger_menu_fly_out_banner'},

    {'Name': 'Máy Ảnh - Quay Phim', 
     'URL': 'https://tiki.vn/may-anh/c1801?src=c.1801.hamburger_menu_fly_out_banner'},

    {'Name': 'Điện Gia Dụng', 
     'URL': 'https://tiki.vn/dien-gia-dung/c1882?src=c.1882.hamburger_menu_fly_out_banner'},

    {'Name': 'Nhà Cửa Đời Sống', 
     'URL': 'https://tiki.vn/nha-cua-doi-song/c1883?src=c.1883.hamburger_menu_fly_out_banner'},

    {'Name': 'Hàng Tiêu Dùng - Thực Phẩm', 
     'URL': 'https://tiki.vn/bach-hoa-online/c4384?src=c.4384.hamburger_menu_fly_out_banner'},

    {'Name': 'Đồ chơi, Mẹ & Bé', 
     'URL': 'https://tiki.vn/me-va-be/c2549?src=c.2549.hamburger_menu_fly_out_banner'},

    {'Name': 'Làm Đẹp - Sức Khỏe', 
     'URL': 'https://tiki.vn/lam-dep-suc-khoe/c1520?src=c.1520.hamburger_menu_fly_out_banner'},

    {'Name': 'Thể Thao - Dã Ngoại', 
     'URL': 'https://tiki.vn/the-thao/c1975?src=c.1975.hamburger_menu_fly_out_banner'},

    {'Name': 'Xe Máy, Ô tô, Xe Đạp', 
     'URL': 'https://tiki.vn/o-to-xe-may-xe-dap/c8594?src=c.8594.hamburger_menu_fly_out_banner'},

    {'Name': 'Hàng quốc tế', 
     'URL': 'https://tiki.vn/hang-quoc-te/c17166?src=c.17166.hamburger_menu_fly_out_banner'},

    {'Name': 'Sách, VPP & Quà Tặng', 
     'URL': 'https://tiki.vn/nha-sach-tiki/c8322?src=c.8322.hamburger_menu_fly_out_banner'},

    {'Name': 'Voucher - Dịch Vụ - Thẻ Cào', 
     'URL': 'https://tiki.vn/voucher-dich-vu/c11312?src=c.11312.hamburger_menu_fly_out_banner'}
]

# Global driver to use throughout the script
DRIVER = None

# 0. Function to Start and Close Driver

In [None]:
# Wrapper to quit the driver if it has been created
def close_driver():
    global DRIVER
    if DRIVER is not None:
        DRIVER.quit()  # quit() ends the whole browser session; close() only closes the current window
    DRIVER = None

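# Function to (re)start driver
def start_driver(force_restart=False):
    global DRIVER

    if force_restart:
        close_driver()
    elif DRIVER is not None:
        return  # A driver is already running; reuse it instead of leaking a second browser

    # Setting up the driver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # run Chrome in the background without opening a window
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    # Assumes chromedriver is discoverable (on PATH or via Selenium Manager);
    # the positional driver-path argument was removed in recent Selenium 4 releases.
    DRIVER = webdriver.Chrome(options=options)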
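
If scraping code raises an exception halfway through, the global driver above stays open until `close_driver()` is called by hand. A minimal sketch of one alternative, assuming you are free to pass the driver around instead of using a global: a context manager that always quits the driver. The names `driver_session`, `factory`, and `FakeDriver` are hypothetical and only serve the illustration; the fake driver lets the sketch run without Chrome installed.

```python
from contextlib import contextmanager


@contextmanager
def driver_session(factory):
    """Create a driver with factory() and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises
        driver.quit()


class FakeDriver:
    """Minimal stand-in for a Selenium driver (hypothetical, for the demo only)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with driver_session(lambda: fake) as driver:
        raise RuntimeError('simulated scraping failure')
except RuntimeError:
    pass

print(fake.closed)  # the driver was quit even though the body raised
```

With a real browser, the factory could be something like `lambda: webdriver.Chrome(options=options)`, using the same options as in `start_driver`.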

# 1. Function to get info from one product

>**NOTE:** Sometimes the web element returned by the driver can be faulty because of the way the website is set up. In that case, calling `.text` on the element returns an empty string even though text is visible inside the element when you check it manually with Inspect. When this happens, use `.get_attribute('innerHTML')` instead of `.text`.
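
The fallback described in the note can be wrapped in a small helper. This is only a sketch: `element_text` and `FakeElement` are hypothetical names, and the fake element stands in for a Selenium `WebElement` so the example runs without a browser. It assumes the `innerHTML` is simple enough that removing comments and tags with regular expressions leaves the visible text.

```python
import re
from html import unescape


def element_text(element):
    """Return the element's text, falling back to its innerHTML when .text is empty."""
    text = (element.text or '').strip()
    if text:
        return text
    raw = element.get_attribute('innerHTML') or ''
    raw = re.sub(r'<!--.*?-->', '', raw, flags=re.S)  # drop HTML comments
    raw = re.sub(r'<[^>]+>', ' ', raw)                # replace tags with spaces
    return ' '.join(unescape(raw).split())            # decode entities, collapse whitespace


class FakeElement:
    """Stand-in for a WebElement (hypothetical, for the demo only)."""

    def __init__(self, text, inner_html):
        self.text = text
        self._inner_html = inner_html

    def get_attribute(self, name):
        return self._inner_html if name == 'innerHTML' else None


print(element_text(FakeElement('', '<span>Laptop <b>Dell</b></span>')))
```

Any object with a `.text` attribute and a `.get_attribute()` method works here, so the same helper could be called on the elements returned by `find_element`.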

In [None]:
# product_class = DRIVER.find_element(By.CLASS_NAME, 'product-item')
# print(product_class.get_attribute('outerHTML'))

In [None]:
#list all products in class = product-item
# list_products = DRIVER.find_elements(By.CLASS_NAME, 'product-item')
# len(list_products)

In [None]:
#find product name
# product_name = product_class.find_element(By.CLASS_NAME, 'name')
# product_name.text

In [None]:
#find product price
# product_price = product_class.find_element(By.CLASS_NAME, 'price-discount__price')
# print(product_price.text)

In [None]:
#find url of the product image
# product_image = product_class.find_element(By.CLASS_NAME, 'webpimg-container')
# image_url = product_image.find_element(By.TAG_NAME, 'img').get_attribute('src')
# print(product_image.get_attribute('outerHTML'))
# print(image_url)

In [None]:
#find url of product
# product_url = product_class.get_attribute('href')
# print(product_url)

In [None]:
# print(get_product_info_single(product_class))

In [None]:
# Function to extract product info from the product
def get_product_info_single(product_item):
    info = {'name':'',
            'price':'',
            'product_url':'',
            'image':'',
            'freeshipping':'',
            'number_of_sale':'',
            'badge_under_price':'',
            'star':'',
            'discount':'',
            'paid_by_installments':'',
            'freegift':''}
    # product_class = product_item.find_element(By.CLASS_NAME, 'product-item')
    # Get the name from the <span> inside the element with class 'name'
    try:
        name_class = product_item.find_element(By.CLASS_NAME, 'name')
        info['name'] = name_class.find_element(By.TAG_NAME, 'span').get_attribute('innerHTML') #YOUR CODE HERE
    except NoSuchElementException:
        pass

    # Get the price by class name; fall back to -1 when the element is missing
    try:
        info['price'] = product_item.find_element(By.CLASS_NAME, 'price-discount__price').get_attribute('innerHTML').strip('₫') #YOUR CODE HERE
    except NoSuchElementException:
        info['price'] = -1
    
    # Get the product link from the item's href attribute
    # (get_attribute returns None instead of raising when the attribute is absent)
    product_link = product_item.get_attribute('href')
    info['product_url'] = product_link or ''  # YOUR CODE HERE: clean the link with string manipulation

    # get thumbnail by class_name and Tag name and get_attribute()
    try:
        thumbnail =  product_item.find_element(By.CLASS_NAME, 'webpimg-container')#YOUR CODE HERE
        info['image'] = thumbnail.find_element(By.TAG_NAME, 'img').get_attribute('src')#YOUR CODE HERE
    except NoSuchElementException:
        pass
    
    # Free-shipping badge: look for an <img> inside the 'thumbnail' element.
    # find_element raises NoSuchElementException when nothing matches,
    # so reaching the assignment means the badge was found.
    try:
        thumbnail_class = product_item.find_element(By.CLASS_NAME, 'thumbnail')
        thumbnail_class.find_element(By.TAG_NAME, 'img')
        info['freeshipping'] = 'YES'
    except NoSuchElementException:
        info['freeshipping'] = 'NO'
    
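    # Number sold: drop the 'Đã bán' ("sold") prefix, keeping values like '36' or '1000+'
    try:
        number_of_sale = product_item.find_element(By.CLASS_NAME, 'styles__StyledQtySold-sc-732h27-2').get_attribute('innerHTML')
        # str.strip('Đã bán ') strips a set of characters, not the prefix, so remove the prefix explicitly
        info['number_of_sale'] = number_of_sale.replace('Đã bán', '').strip()
    except NoSuchElementException:
        pass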

    #"rẻ hơn hoàn tiền" badge
    try:
        badge_under_class = product_item.find_element(By.CLASS_NAME, 'badge-under-price')
        badge_under_price = badge_under_class.find_element(By.TAG_NAME, 'img').get_attribute('src')
        # The badge counts as present only when its <img> has a non-empty src
        info['badge_under_price'] = 'YES' if badge_under_price else 'NO'
    except NoSuchElementException:
        info['badge_under_price'] = 'NO'
    
    # Star rating: the 'average' element's inline style holds a width percentage
    # (e.g. 'width: 94%;'), which maps linearly onto a 0-5 star scale
    try:
        star = product_item.find_element(By.CLASS_NAME, 'average').get_attribute('style')
        match = re.search(r'(\d+(?:\.\d+)?)\s*%', star or '')
        if match:
            info['star'] = float(match.group(1)) * 5 / 100
    except NoSuchElementException:
        pass
    
    # Discount percentage: the innerHTML mixes the number with '-', '%' and
    # '<!-- -->' comment markers, so keep only the first run of digits
    try:
        discount_html = product_item.find_element(By.CLASS_NAME, 'price-discount__discount').get_attribute('innerHTML')
        match = re.search(r'\d+', discount_html or '')
        if match:
            info['discount'] = match.group()
    except NoSuchElementException:
        pass
    
    # Paid by installments: the first <span> in 'badge-benefits' reads 'Trả góp' (installment payment)
    try:
        badge_benefits = product_item.find_element(By.CLASS_NAME, 'badge-benefits')
        string_span_tag = badge_benefits.find_element(By.TAG_NAME, 'span').get_attribute('innerHTML')
        info['paid_by_installments'] = 'YES' if string_span_tag == 'Trả góp' else 'NO'
    except NoSuchElementException:
        info['paid_by_installments'] = 'NO'
    
    # Free gift: the presence of a 'freegift-list' element marks it
    try:
        product_item.find_element(By.CLASS_NAME, 'freegift-list')
        info['freegift'] = 'YES'
    except NoSuchElementException:
        info['freegift'] = 'NO'
    return info
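

The function above keeps `price` and `number_of_sale` as raw strings (for example `'6.290.000 '` and `'1000+'`). If numeric columns are wanted later, a sketch of two converters could look like this. The helper names `parse_price` and `parse_sold` are hypothetical, and the code assumes that dots in Tiki prices are thousands separators and that a trailing `+` on a sold count means "at least".

```python
import re


def parse_price(price_text):
    """Convert a price string such as '6.290.000 ₫' to an int; None if not parseable."""
    if not isinstance(price_text, str):
        return None  # e.g. the -1 placeholder used when no price element was found
    digits = re.sub(r'\D', '', price_text)  # keep digits only (drops dots, spaces, '₫')
    return int(digits) if digits else None


def parse_sold(sold_text):
    """Convert a sold count such as 'Đã bán 1000+' or '36' to an int lower bound."""
    match = re.search(r'\d+', str(sold_text))
    return int(match.group()) if match else None


print(parse_price('6.290.000 '), parse_sold('1000+'))
```

Returning `None` for missing values keeps them distinguishable from a real price of 0 once the data is loaded into a dataframe.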


# 2. Function to scrape info of all products from a Page URL

To make your own life easier, you should use the function for a single product inside this one.

In [None]:
# Function to scrape all products from a page
def get_product_info_from_page(page_url):
    """ Extract info from all products of a specific page_url on the Tiki website
        Args:
            page_url: (string) url of the page to scrape
        Returns:
            data: list of dictionaries of product info. If no products are shown, return an empty list.
    """
    global DRIVER

    data = []            # Store the info dictionary of each product in this list
    DRIVER.get(page_url) # Load the listing page in the driver
    time.sleep(5)        # Give the page time to render its products before querying

    products_all = DRIVER.find_elements(By.CLASS_NAME, 'product-item')
    print(f'Found {len(products_all)} products')
    for product in products_all:
        # Extract the info dictionary for each product on the page
        data.append(get_product_info_single(product))
    return data

In [None]:
# DRIVER.find_element(By.CLASS_NAME,'product-item').find_element(By.CLASS_NAME,'style__StyledNotFoundProductView-sc')

In [None]:
# Function to scrape all products from the first n pages
def get_product_info_from_n_page(page_url, page_number):
    """ Extract info from all products on the first page_number pages of a Tiki listing
        Args:
            page_url: (string) url of the listing; the page index is appended to it,
                      so it should end in a page query such as '?page='
                      (a trailing '0', as in MAIN_CATEGORIES[3], yields 'page=01', 'page=02', ...)
            page_number: (int) number of pages to scrape, e.g. 3 scrapes pages 1 to 3
        Returns:
            data: list of dictionaries of product info. If no products are shown, return an empty list.
    """
    global DRIVER

    data = []  # Store the info dictionary of each product in this list
    for i in range(1, page_number + 1):
        cur_url = page_url + str(i)
        print(cur_url)
        DRIVER.get(cur_url)  # Load the listing page in the driver
        time.sleep(5)        # Give the page time to render its products before querying
        products_all = DRIVER.find_elements(By.CLASS_NAME, 'product-item')
        print(f'Found {len(products_all)} products')
        for product in products_all:
            # Extract the info dictionary for each product on the page
            data.append(get_product_info_single(product))
    return data
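

The loop above builds each page URL by appending the page number to the string, which only works when the URL already ends in a `page=` query. A more general sketch, assuming Tiki listings select the page through a `page` query parameter (as the `?page=0` URL in `MAIN_CATEGORIES` suggests), sets that parameter with `urllib.parse`. The helper name `build_page_url` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, page):
    """Return base_url with its 'page' query parameter set to the given page number."""
    parts = urlsplit(base_url)
    # Keep every existing parameter except a previous 'page' value
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != 'page']
    query.append(('page', str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(build_page_url('https://tiki.vn/laptop-may-vi-tinh-linh-kien/c1846?page=0', 2))
print(build_page_url('https://tiki.vn/may-anh/c1801?src=c.1801.hamburger_menu_fly_out_banner', 3))
```

Because other query parameters such as `src` are preserved, this works for every entry in `MAIN_CATEGORIES`, not only the one that already ends in `page=0`.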

# 3. Start scraping

Try to scrape at least 2 pages of your chosen main category.

In [None]:
# start_driver(force_restart=True)
# url = MAIN_CATEGORIES[3]['URL']
# DRIVER.get(url)
# test = DRIVER.find_element(By.CLASS_NAME, 'Pagination__Root-sc-cyke21-0')
# print(test.get_attribute('innerHTML'))
# close_driver()

In [None]:
from random import randint

In [None]:
main_cat  = MAIN_CATEGORIES[3] # Pick any category you like by changing the index

start_driver(force_restart=True)
print('Scraping', main_cat['Name'])
print('Link:', main_cat['URL'])

prod_data = [] # STORE YOUR PRODUCT INFO DICTIONARIES IN HERE

##################################
### YOUR CODE HERE TO GET DATA ###
##################################
n = randint(1, 10)  # pick a random number of pages (1 to 10) to scrape
print(n)
prod_data = get_product_info_from_n_page(main_cat['URL'],n)
close_driver() # Close driver when we're done

Scraping Laptop - Thiết bị IT
Link: https://tiki.vn/laptop-may-vi-tinh-linh-kien/c1846?page=0
3
https://tiki.vn/laptop-may-vi-tinh-linh-kien/c1846?page=01
Found 72 products
https://tiki.vn/laptop-may-vi-tinh-linh-kien/c1846?page=02
Found 72 products
https://tiki.vn/laptop-may-vi-tinh-linh-kien/c1846?page=03
Found 72 products


In [None]:
print(len(prod_data))

216


In [None]:
prod_data[:5]

# 4. Run the cell below to save your scraped data into a .csv file
If you've scraped correctly, then the cell should run without error and the information in the table should look reasonable.

In [None]:
# SAVE DATA TO CSV FILE
df = pd.DataFrame(data=prod_data, columns=prod_data[0].keys())
df.to_csv('tiki_products.csv')

n_products_to_view = 20 # Change this as you like to check more products
df.head(n_products_to_view)

| | name | price | product_url | image | freeshipping | number_of_sale | badge_under_price | star | discount | paid_by_installments | freegift |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bộ nguồn cấp điện liên tục UPS PROLiNK 850VA (... | 1.122.000 | https://tka.tiki.vn/pixel/pixel?data=djAwMR4db... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 36 | YES | 4.6 | 17.0 | NO | NO |
| 1 | Bộ Phát Wifi TP-Link Archer C54 Băng Tần Kép C... | 419.000 | https://tiki.vn/bo-phat-wifi-tp-link-archer-c5... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 1000+ | YES | 4.7 | 53.0 | NO | NO |
| 2 | Phần Mềm Diệt Virus BKAV Profressional 12 Thán... | 189.000 | https://tiki.vn/phan-mem-diet-virus-bkav-profr... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 1000+ | YES | 4.8 | 2.0 | NO | NO |
| 3 | Bộ Mở Rộng Sóng Wifi TP-Link TL-WA850RE Chuẩn ... | 248.000 | https://tiki.vn/bo-mo-rong-song-wifi-tp-link-t... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 149 | YES | 4.5 | | NO | NO |
| 4 | Bộ chia mạng switch 5 cổng RJ45 10/100/1000Mbq... | 299.000 | https://tka.tiki.vn/pixel/pixel?data=djAwMQSZ7... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 20 | NO | 5.0 | | NO | NO |
| 5 | Ổ Cứng SSD Kingston A400 (240GB) - Hàng Chính ... | 779.000 | https://tiki.vn/o-cung-ssd-kingston-a400-240gb... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 1000+ | YES | 4.6 | 22.0 | NO | NO |
| 6 | Bộ Kích Sóng Wifi Repeater Mercusys MW300RE 30... | 195.000 | https://tiki.vn/bo-kich-song-wifi-repeater-mer... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 958 | NO | 4.5 | 35.0 | NO | NO |
| 7 | Bộ Chuyển Đổi USB Wifi TP-Link Archer T2U Plus... | 279.000 | https://tiki.vn/bo-chuyen-doi-usb-wifi-tp-link... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 7 | NO | 4.8 | | NO | NO |
| 8 | Bộ phát Wifi di động Tenda 4G LTE 4G185 - Hàng... | 1.055.000 | https://tka.tiki.vn/pixel/pixel?data=djAwMTjSZ... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 16 | NO | 5.0 | 30.0 | NO | YES |
| 9 | Bộ Chuyển Đổi USB Wifi TP-Link Archer T2U Nano... | 239.000 | https://tiki.vn/bo-chuyen-doi-usb-wifi-tp-link... | https://salt.tikicdn.com/cache/200x200/ts/prod... | YES | 1000+ | NO | 4.8 | 27.0 | NO | NO |


In [None]:
prod_data[111]

{'badge_under_price': 'YES',
 'discount': '',
 'freegift': 'NO',
 'freeshipping': 'YES',
 'image': 'https://salt.tikicdn.com/cache/200x200/ts/product/c4/28/b7/97574fd96b2578c7c14ca5c001f32f95.jpg',
 'name': 'Màn Hình Dell S2721HN 27inch FHD (1920x1080) 4ms 75Hx IPS 300nits/HDMI+Audio/AMD FreeSync - Hàng Chính Hãng',
 'number_of_sale': '87',
 'paid_by_installments': 'YES',
 'price': '6.290.000 ',
 'product_url': 'https://tiki.vn/man-hinh-dell-s2721hn-27inch-fhd-1920x1080-4ms-75hx-ips-300nits-hdmi-audio-amd-freesync-hang-chinh-hang-p78496225.html?itm_campaign=tiki-reco_UNK_DT_UNK_UNK_tiki-listing_UNK_p-category-mpid-listing-v1_202112100600_MD_PID.78496226&itm_medium=CPC&itm_source=tiki-reco&spid=78496226',
 'star': 4.7}
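
To check that scraped records "look reasonable" beyond eyeballing a single one, a quick count of missing required fields can help. This is a sketch under assumptions: the helper `count_missing` is hypothetical, the four required fields come from the requirements at the top of this notebook, and the sample records are made up for the demo. An empty string, `None`, or the `-1` price placeholder counts as missing.

```python
REQUIRED_FIELDS = ('name', 'price', 'product_url', 'image')


def count_missing(records, fields=REQUIRED_FIELDS):
    """Count, per field, how many records hold an empty or placeholder value."""
    missing = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            if record.get(field) in ('', None, -1):
                missing[field] += 1
    return missing


# Made-up sample records, shaped like the dictionaries in prod_data
sample = [
    {'name': 'A', 'price': '419.000 ', 'product_url': 'https://tiki.vn/a', 'image': 'https://salt.tikicdn.com/a.jpg'},
    {'name': '', 'price': -1, 'product_url': 'https://tiki.vn/b', 'image': ''},
]
print(count_missing(sample))
```

Calling `count_missing(prod_data)` on the real list would show at a glance whether a selector has stopped matching, since the affected field would be missing for most or all records.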

# OPTIONAL: Extra information


If you've successfully completed all of the above, you can try to get extra information for each product.

* Does it have FreeShip? <img src="https://salt.tikicdn.com/ts/upload/dc/0d/49/3251737db2de83b74eba8a9ad6d03338.png">
* Number of reviews?
* Number of sales?
* How many stars, or what percentage of stars?
* Does it have the "badge under price" (Rẻ hơn hoàn tiền)? <img src="https://salt.tikicdn.com/ts/upload/51/ac/cc/528e80fe3f464f910174e2fdf8887b6f.png">
* Discount percentage?
* Does it have the "shocking price" badge? <img src="https://salt.tikicdn.com/ts/upload/75/34/d2/4a9a0958a782da8930cdad8f08afff37.png">
* Can it be paid for in installments? <img src="https://salt.tikicdn.com/ts/upload/ba/4e/6e/26e9f2487e9f49b7dcf4043960e687dd.png">
* Does it come with free gifts? <img src="https://salt.tikicdn.com/ts/upload/47/35/8c/446f61d046eba9a305d3f39dc0834c4a.png">
    
