

# Web Scraping with Beautiful Soup

![](https://sixfeetup.com/blog/an-introduction-to-beautifulsoup/@@images/1cc2fc44-5ac8-4378-bef2-95048d5bc5ad.png)

We often don't have access to enough data (or any data at all) to perform an accurate analysis. This is a common issue for new or niche projects. In those cases, we may need to collect the data ourselves.

__A Web Scraper__ is a program that extracts data from a website. __BeautifulSoup__ is a Python library that provides many functions for pulling data out of HTML and XML files.

### Problem Statement
- Build a Web Scraper to collect data about articles on [https://vnexpress.net/](https://vnexpress.net/).
- Required information:
  - Title
  - Description
  - Link to the Article
  - Link to Thumbnail Image


## Send GET Request to the Website

The first step is to request the HTML text of vnexpress. We can achieve that by sending a [GET request](https://www.w3schools.com/tags/ref_httpmethods.asp) to [https://vnexpress.net/](https://vnexpress.net/) using the Python library `requests`.
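
A minimal sketch of this step, assuming the `requests` package is installed. The helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook. Preparing the request first lets us inspect what would be sent without touching the network.

```python
import requests


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return response.text


# Inspect the request offline: prepare() builds it but does not send it
prepared = requests.Request('GET', 'https://vnexpress.net/').prepare()
print(prepared.method, prepared.url)
```

Calling `fetch_html('https://vnexpress.net/')` would then perform the actual download. Checking the status code before parsing avoids silently scraping an error page.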

In [0]:
# import requests

# r = requests.get('https://vnexpress.net/')

# print(r.text)

## Parse the Raw Text with BeautifulSoup
In order to extract the information from the HTML file, we need to put it through a parser. 

A parser builds a data structure from input data (usually text), allowing us to easily find and extract components of the data. __BeautifulSoup__ is one of the most popular HTML parsers for Python.
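
To make the idea of a parsed data structure concrete, here is a small sketch that parses a made-up HTML string (not taken from any real site) and reads values out of the resulting tree. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# The parsed tree can be navigated by tag name
page_title = soup.title.string   # text inside <title>
heading_id = soup.h1['id']       # attributes are read like dictionary keys
first_paragraph = soup.p.text    # .p returns only the first <p>

print(page_title, heading_id, first_paragraph)
```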

In [0]:
import requests

r = requests.get('https://tiki.vn/laptop/c8095?src=c.1846.hamburger_menu_fly_out_banner&_lc=Vk4wMzkwMTkwMDQ=')

# print (r.text)

In [0]:
from bs4 import BeautifulSoup

# r.text is an HTML document, so we will use html.parser
#soup = BeautifulSoup(r.text, 'html.parser')

# Make the soup object look nicer
#print(soup.prettify()[:500])

In [0]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')

# print (soup.prettify()[:5000])

## Extract Information
A __soup__ object has many methods to find and extract information.

- `<soup>.find(<tag>, {<attribute1>:<value1>, <attribute2>:<value2>})`: Return the FIRST occurrence of `<tag>` with `<attributes>` equal to `<values>` in the `<soup>` object. Output: Tag Object.

- `<soup>.find_all(<tag>, {<attribute1>:<value1>, <attribute2>:<value2>})`: Return ALL occurrences of `<tag>` with `<attributes>` equal to `<values>` in the `<soup>` object. Output: ResultSet (List) containing zero or more Tag Objects.

- `<soup>.<tag>`: Return the FIRST occurrence of `<tag>` in the `<soup>` object. Output: Tag Object.
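
The three lookups listed above can be compared side by side on a small snippet. The markup below is made up and only imitates a product listing; it is a sketch, not the real page structure.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates a product listing
html = """
<div class="product-item" data-id="1"><span class="price-regular">10.000đ</span></div>
<div class="product-item" data-id="2"><span class="price-regular">20.000đ</span></div>
<div class="banner">Not a product</div>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div', {'class': 'product-item'})          # a single Tag
all_items = soup.find_all('div', {'class': 'product-item'})  # a ResultSet
missing = soup.find('div', {'class': 'does-not-exist'})      # None when nothing matches

print(first['data-id'])                         # attribute of the first match
print([item['data-id'] for item in all_items])  # attribute of every match
print(soup.div['data-id'])                      # first <div>, like soup.find('div')
print(missing)
```

Because `find` returns `None` when nothing matches, chaining `.text` directly onto it raises an `AttributeError` for elements that are absent.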

In [0]:
# First occurrence of the tag
first_article = soup.find('div', {'class': 'product-item'})

# All occurrences of the tag
articles = soup.find_all('div', {'class': 'product-item'})
# print(articles)



In [18]:
# seller_id = first_article["data-seller-product-id"]
# print(seller_id)
# product_id = first_article["data-id"]
# print(product_id)
# product_title = first_article["data-title"]
# print(product_title)
# price = first_article["data-price"]
# print(price)
# url_tag = first_article.find('img',{'class': 'product-image'})
# print(url_tag)
# product_image = url_tag["src"]
# print (product_image)

prices = soup.find_all('span', {'class': 'price-regular'})
# prices
for price in prices:
  print(price.text)

32.990.000đ
14.990.000đ
23.990.000đ
16.990.000đ
32.990.000đ
15.990.000đ
36.990.000đ
24.750.000đ
12.990.000đ
6.090.000đ
11.990.000đ
23.990.000đ
6.990.000đ
24.750.000đ
17.990.000đ
15.990.000đ
16.490.000đ
23.990.000đ
6.990.000đ
21.990.000đ
22.290.000đ
21.990.000đ
16.990.000đ
40.990.000đ
19.990.000đ
30.990.000đ
17.490.000đ
16.990.000đ
24.990.000đ
15.990.000đ
24.490.000đ
19.990.000đ
28.000.000đ
16.990.000đ
7.990.000đ
39.990.000đ
74.990.000đ
18.990.000đ
32.990.000đ
17.990.000đ
22.990.000đ
10.990.000đ
6.990.000đ
8.990.000đ
26.990.000đ
20.990.000đ
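
The prices printed above are strings with thousands separators and a currency symbol. A minimal sketch of turning such a string into an integer follows; the helper name `parse_price` is hypothetical.

```python
def parse_price(text):
    """Keep only the digits of a price string and convert them to an int."""
    digits = "".join(ch for ch in text if ch.isdigit())
    if not digits:
        # Fail loudly rather than returning a misleading number
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_price("32.990.000đ"))
print(parse_price("6.090.000đ"))
```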


## Putting it all together!

Now that we have learned how to find and extract information with BeautifulSoup, let's write a program that meets the requirements!

### Main Component of the Scraper

In [0]:
# We want to save information about all products in a list
data = []

# Find all product tags
products = soup.find_all('div', {'class': 'product-item'})

# Extract information from each product
for product in products:
    d = {'product_id': '', 'seller_id': '', 'product_title': '', 'price': '', 'original_price': '', 'image_url': ''}
    try:
        # Attributes of the product <div> are read like dictionary keys
        d['product_id'] = product['data-id']
        d['seller_id'] = product['data-seller-product-id']
        d['product_title'] = product['data-title']
        d['price'] = product['data-price']

        # Keep only the digits of the listed price, e.g. '32.990.000đ' -> 32990000
        original_price = product.find('span', {'class': 'price-regular'}).text
        d['original_price'] = int("".join(e for e in original_price if e.isdigit()))

        # Image URL, if the product has an <img> tag
        if product.img:
            d['image_url'] = product.img['src']

        # Append the dictionary to the data list
        data.append(d)
    except (KeyError, AttributeError, ValueError):
        # Skip products that are missing an attribute or a price
        pass

# print(data)

### Package into Functions

In [0]:
# Import Library
import requests
from bs4 import BeautifulSoup

def get_url(url):
    """Get parsed HTML from url
      Input: url to the webpage
      Output: Parsed HTML text of the webpage
    """
    # Send GET request
    r = requests.get(url)

    # Parse HTML text
    soup = BeautifulSoup(r.text, 'html.parser')

    return soup

def scrape_tiki(url='https://tiki.vn/laptop/c8095?src=c.1846.hamburger_menu_fly_out_banner&_lc=Vk4wMzkwMTkwMDQ='):
    """Scrape a Tiki product listing page
      Input: url to the webpage. Default: 'https://tiki.vn/laptop/c8095?src=c.1846.hamburger_menu_fly_out_banner&_lc=Vk4wMzkwMTkwMDQ='
      Output: A list containing scraped data of all products
    """

    # Get parsed HTML
    soup = get_url(url)

    # Find all product tags
    products = soup.find_all('div', {'class':'product-item'})

    # List containing data of all products
    data = []
    # Extract information of each product
    for product in products:
        d = {'seller_id':'', 'product_id':'', 'product_title':'', 'price':'', 'original_price':'', 'product_image':''}
        
        try:
            d['seller_id'] = product['data-seller-product-id']
            d['product_id'] = product['data-id'] 
            d['product_title'] = product['data-title']
            d['price'] = product['data-price']
            
            original_price = product.find('span', {'class': 'price-regular'}).text
            new_original_price = "".join(e for e in original_price if e.isdigit())
            d['original_price'] = int(new_original_price)

            # Flag products that contain a <span class="code"> element
            d['tiki_now'] = product.find('span', {'class': 'code'}) is not None

            if product.img:
                d['product_image'] = product.img['src']

            # Append the dictionary to data list
            data.append(d)
        except (KeyError, AttributeError, ValueError):
            # Skip products that are missing an attribute or a price
            pass
  
    return data
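
One way to exercise the extraction logic without hitting the network is to run it on a small HTML string. The sketch below uses a hypothetical helper, `parse_products`, which is a trimmed-down variant of the loop above, and made-up sample markup that reuses the same attribute names. It skips incomplete entries with explicit checks rather than a catch-all `except`.

```python
from bs4 import BeautifulSoup


def parse_products(html):
    """Extract a few fields from every 'product-item' div in an HTML string.

    Hypothetical helper, not part of the original notebook.
    """
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for product in soup.find_all('div', {'class': 'product-item'}):
        price_tag = product.find('span', {'class': 'price-regular'})
        if price_tag is None or not product.has_attr('data-id'):
            continue  # skip incomplete entries explicitly
        rows.append({
            'product_id': product['data-id'],
            'product_title': product.get('data-title', ''),
            'original_price': int("".join(c for c in price_tag.text if c.isdigit())),
            'product_image': product.img['src'] if product.img else '',
        })
    return rows


# Made-up sample markup; the second div has no price and should be skipped
sample = """
<div class="product-item" data-id="1" data-title="Laptop A">
  <img src="a.jpg"><span class="price-regular">1.500.000đ</span>
</div>
<div class="product-item" data-id="2" data-title="Laptop B"></div>
"""
rows = parse_products(sample)
print(rows)
```

Separating parsing from downloading like this makes it possible to check the parsing rules against a saved page even when the live site changes or is unreachable.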

### Test the Scraper

In [21]:
# Test the scraper
data = scrape_tiki()
print(data)

[{'seller_id': '52179840', 'product_id': '52179836', 'product_title': 'Apple Macbook Air 2020 - 13 Inchs (i3-10th/ 8GB/ 256GB) - Hàng Nhập Khẩu Chính Hãng', 'price': '25390000', 'original_price': 32990000, 'product_image': 'https://salt.tikicdn.com/cache/280x280/ts/product/bb/37/9c/23a0437ae9c6911db6caa4ca873e09c7.jpg', 'tiki_now': True}, {'seller_id': '22615307', 'product_id': '22615306', 'product_title': 'Laptop Asus Vivobook X509FJ-EJ053T Core i5-8265U/ MX230 2GB/ Win10 (15.6 FHD) - Hàng Chính Hãng', 'price': '12399000', 'original_price': 14990000, 'product_image': 'https://salt.tikicdn.com/cache/280x280/ts/product/9a/d4/0a/5a56bcb447297643327693af076f5b6a.jpg', 'tiki_now': True}, {'seller_id': '21922157', 'product_id': '21922156', 'product_title': 'Laptop Lenovo Legion Y540-15IRH 81SY0037VN Core i5-9300H/ GTX 1650 4GB/ Dos (15.6 FHD IPS) - Hàng Chính Hãng', 'price': '19449000', 'original_price': 23990000, 'product_image': 'https://salt.tikicdn.com/cache/280x280/ts/product/05/0b/13/

In [22]:
# Save data to a DataFrame
import pandas as pd #alias
# print(data[0].keys())
# print(data)
# print(data[0].values())

data = scrape_tiki()
# The keys of each dictionary become the column names
data_frame = pd.DataFrame(data)
data_frame
# print(data_frame.info())

| | seller_id | product_id | product_title | price | original_price | product_image | tiki_now |
|---|---|---|---|---|---|---|---|
| 0 | 52179840 | 52179836 | Apple Macbook Air 2020 - 13 Inchs (i3-10th/ 8G... | 25390000 | 32990000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 1 | 22615307 | 22615306 | Laptop Asus Vivobook X509FJ-EJ053T Core i5-826... | 12399000 | 14990000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 2 | 21922157 | 21922156 | Laptop Lenovo Legion Y540-15IRH 81SY0037VN Cor... | 19449000 | 23990000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 3 | 49334630 | 49334629 | Laptop Dell Inspiron N3593 70205744 (Core i5-1... | 14359000 | 16990000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 4 | 23264104 | 23264100 | Apple Macbook Air 2019 - 13 inchs (i5/ 8GB/ 12... | 24599000 | 32990000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 5 | 20955085 | 20955084 | Laptop Asus Vivobook A512DA-EJ406T AMD R5-3500... | 13369000 | 15990000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 6 | 23264375 | 23264373 | Apple Macbook Pro Touch Bar 2019 - 13 inchs (i... | 30199000 | 36990000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 7 | 25661193 | 721995 | Macbook Air 2017 MQD32 (13.3 inch) - Hàng Chín... | 18899000 | 24750000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 8 | 24351538 | 24351537 | Laptop Asus Vivobook A412FA-EK377T Core i3-814... | 11699000 | 12990000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
| 9 | 4378945 | 4378943 | Laptop Asus E203MAH-FD004T Celeron N4000/Win10... | 5199000 | 6090000 | https://salt.tikicdn.com/cache/280x280/ts/prod... | True |
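
In the scraped data, `price` is read from an HTML attribute and is therefore a string, while `original_price` is already an integer. The sketch below, using two rows copied from the output above, converts `price` to a number and derives a discount percentage. The column name `discount_pct` is a hypothetical addition.

```python
import pandas as pd

# Two rows copied from the scraped output shown above
df = pd.DataFrame([
    {'product_id': '52179836', 'price': '25390000', 'original_price': 32990000},
    {'product_id': '22615306', 'price': '12399000', 'original_price': 14990000},
])

# 'price' is a string until converted
df['price'] = pd.to_numeric(df['price'])

# Discount relative to the original price, in percent (hypothetical column)
df['discount_pct'] = (1 - df['price'] / df['original_price']) * 100

print(df.dtypes)
print(df[['product_id', 'discount_pct']].round(1))
```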
