## Team
* 1612418 - Phạm Lưu Trọng Nghĩa
* 1612377 - Tô Hiếu Minh

## Idea
* Predict the price of a mobile phone given its specifications, manufacturer, and release date.
* Help buyers avoid overpaying.

## Website
* Data is crawled from https://thegioididong.com and https://fptshop.com.vn
* Crawling is permitted according to each site's robots.txt: https://thegioididong.com/robots.txt, https://fptshop.com.vn/robots.txt
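
A robots.txt check like the one mentioned above can be automated with the standard library. The sketch below is only an illustration: the rules in `sample_rules` and the `example.com` URLs are hypothetical, and the helper name `is_allowed` is not part of the notebook.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if `url` may be fetched according to the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

# Hypothetical rules for illustration only; the real files are at the URLs listed above.
sample_rules = [
    "User-agent: *",
    "Disallow: /cart",
    "Allow: /",
]

print(is_allowed(sample_rules, "https://example.com/dtdd/some-phone"))
print(is_allowed(sample_rules, "https://example.com/cart"))
```

To check the live files instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads and parses the file.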

## Data
* Current data: 500 rows, 13 columns
* Columns used for prediction:
    * **prices**: Price
    * **screens**: Screen size (inches)
    * **os**: Operating system
    * **front_cams**: Front camera (MP)
    * **back_cams**: Rear camera (MP)
    * **cpus**: CPU type
    * **rams**: RAM (GB)
    * **memories**: Internal storage (GB)
    * **batteries**: Battery capacity (mAh)
    * **companies**: Manufacturer
    * **release_date**: Release date
* Columns used for displaying the data:
    * **images**: Image link
    * **urls**: Link to the product detail page
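
The column groups above can be written down once and reused when selecting data. This is a small sketch, not code from the notebook: the constant names are hypothetical, and treating `prices` as the prediction target is an assumption based on the stated idea of predicting price.

```python
# Hypothetical constants; the column names themselves come from the list above.
TARGET_COLUMN = "prices"  # assumption: the value to predict
FEATURE_COLUMNS = [
    "screens", "os", "front_cams", "back_cams", "cpus",
    "rams", "memories", "batteries", "companies", "release_date",
]
DISPLAY_COLUMNS = ["images", "urls"]  # shown to the user, not used by the model

ALL_COLUMNS = [TARGET_COLUMN] + FEATURE_COLUMNS + DISPLAY_COLUMNS
print(len(ALL_COLUMNS))
```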

In [1]:
from requests_html import HTMLSession
from requests_html import HTML
import json
import time
import pandas as pd
import datetime as dt
import re
import io
from selenium import webdriver

## Helper functions

1. init_array(): Initialize one empty list for each attribute to crawl.
1. get_html_session(url): Open an HTML session and return the response for *url*.
1. find_first(html, query): Find the first element in *html* that matches *query*.
1. add_element_to_array(array, element): Append *element* to *array*; if the element is empty, append *None* instead.
1. return_none_if_empty(value): Return *None* if *value* is an empty string.
1. read_html(in_file): Read HTML from a file.
1. read_from_csv(csv_file): Read a CSV file that uses tab as the separator.
1. find_by_selenium(driver, query): Find an element by CSS selector through a Selenium *driver*.
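
One detail of these helpers is easy to miss: `add_element_to_array` builds a new list rather than modifying its argument, so the caller has to reassign the result. The sketch below copies the two helpers as defined in the notebook; the sample values are hypothetical.

```python
def return_none_if_empty(value):
    # Any falsy value (empty string, None, 0) becomes None.
    return value if value else None

def add_element_to_array(array, element):
    # `array + [...]` creates a new list; `array` itself is left untouched.
    return array + [return_none_if_empty(element)]

screens = []
add_element_to_array(screens, '6.5"')            # result discarded: screens stays empty
assert screens == []

screens = add_element_to_array(screens, '6.5"')  # reassign to keep the new list
screens = add_element_to_array(screens, "")      # empty string is stored as None
print(screens)
```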

In [2]:
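# HELPERS

def init_array():
    """Create one empty list per attribute to crawl."""
    images = []
    prices = []
    screens = []
    os = []
    front_cams = []
    back_cams = []
    cpus = []
    rams = []
    memories = []
    batteries = []
    urls = []
    companies = []
    release_date = []
    
    return (images, prices, screens, os, front_cams, back_cams, cpus, rams, memories, batteries, urls, companies, release_date)

def get_html_session(url):
    """Send a GET request to url and return the response."""
    session = HTMLSession()
    return session.get(url)

def find_first(html, query):
    """Return the first element matching query, or a placeholder whose text is None."""
    result = html.find(query, first=True)
    if result is None:
        # Placeholder so callers can read .text without a None check.
        # It has no .attrs, so only use it where .text is needed.
        class E:
            text = None
        return E()
    return result

def add_element_to_array(array, element):
    """Return a new list: array plus element (None if element is empty)."""
    return array + [return_none_if_empty(element)]

def return_none_if_empty(value):
    """Return None for an empty value, otherwise the value itself."""
    return value if value else None

def read_html(in_file):
    """Read the contents of an HTML file."""
    with open(in_file, 'r', encoding="utf8") as f:
        return f.read()

def read_from_csv(csv_file):
    """Read a tab-separated file into a DataFrame."""
    return pd.read_csv(csv_file, sep='\t')

def find_by_selenium(driver, query):
    """Return the first element matching the CSS selector query."""
    # find_element_by_css_selector was removed in Selenium 4;
    # 'css selector' is the value of By.CSS_SELECTOR.
    return driver.find_element('css selector', query)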

## Implement crawler

<ul>
    <li>
        <strong>Step 1</strong>: Collect all product URLs and write them to a file.
    </li>
    <li>
        <strong>Step 2</strong>: Read the URL file and use requests-html to crawl the details of each phone.
    </li>
    <li>
        <strong>Step 3</strong>: Normalize the data.
    </li>
</ul>
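
As an illustration of Step 1, the sketch below extracts product links from a listing page using only the standard library's `html.parser`, rather than the requests-html `HTML` class the notebook uses. The class name `ProductLinkParser`, the function `extract_product_links`, and the sample markup are all hypothetical. Unlike the notebook, which takes the first link of each list item, this version collects every link inside the list.

```python
from html.parser import HTMLParser

class ProductLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <ul class="homeproduct">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <ul> nesting depth inside the target list (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            classes = (attrs.get("class") or "").split()
            if self.depth > 0 or "homeproduct" in classes:
                self.depth += 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth > 0:
            self.depth -= 1

def extract_product_links(html_text, skip=("/dtdd",)):
    """Return product links from the listing, leaving out the URLs in `skip`."""
    parser = ProductLinkParser()
    parser.feed(html_text)
    return [link for link in parser.links if link not in skip]

# Hypothetical listing markup for illustration only.
SAMPLE_HTML = """
<ul class="homeproduct">
  <li><a href="/dtdd">All phones</a></li>
  <li><a href="/dtdd/phone-a">Phone A</a></li>
  <li><a href="/dtdd/phone-b">Phone B</a></li>
</ul>
<ul class="other"><li><a href="/ignored">Ignored</a></li></ul>
"""

print(extract_product_links(SAMPLE_HTML))
```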

In [3]:
def get_all_urls(out_file):
    # 'a.html' is a locally saved HTML file containing the product list (ul.homeproduct).
    contents = read_html('a.html')
    html = HTML(html=contents)

    urls = []
    ul = html.find('ul.homeproduct', first=True)
    for item in ul.find('li'):
        url = find_first(item, 'a').attrs['href']
        # Skip the category link and one excluded product page.
        if url in ('/dtdd', '/dtdd/vivo-s1-pro'):
            continue
        urls = add_element_to_array(urls, url)
        
    with io.open(out_file, "w+", encoding='utf-8') as f:
        f.write('\n'.join(urls))

In [4]:
def get_info_tgdd(urls_file, out_file, sleep_time=1, num=100):
    (images, prices, screens, os, front_cams, back_cams, cpus, rams, memories, batteries, urls, companies, release_date) = init_array()
    
    # Read the product paths collected in step 1, ignoring blank lines.
    with io.open(urls_file, 'r', encoding='utf-8') as f:
        url_list = [line for line in f.read().split('\n') if line]
    
    domain = 'https://thegioididong.com'
    
    i = 0
    # Stop after `num` phones or when the URL list is exhausted.
    while len(screens) < num and i < len(url_list):
        r = get_html_session(domain + url_list[i])
        if r.ok:
            print(url_list[i], i)
            html = r.html
            if html.find('aside.picture > img', first=True) is not None:
                # Append to every column in this branch so that index k
                # of each list describes the same phone.
                urls = add_element_to_array(urls, url_list[i])
                companies = add_element_to_array(companies, re.findall('/([a-z]+)-', url_list[i])[0])
                images = add_element_to_array(images, find_first(html, 'aside.picture > img').attrs['src'])
                prices = add_element_to_array(prices, find_first(html, 'aside.price_sale > div.area_price > strong').text)

                li = html.find('ul.parameter > li > div')
                if len(li) >= 10:
                    screens = add_element_to_array(screens, li[0].text)
                    os = add_element_to_array(os, li[1].text)
                    back_cams = add_element_to_array(back_cams, li[2].text)
                    front_cams = add_element_to_array(front_cams, li[3].text)
                    cpus = add_element_to_array(cpus, li[4].text)
                    rams = add_element_to_array(rams, li[5].text)
                    memories = add_element_to_array(memories, li[6].text)
                    # The battery entry is at index 10 when the list has more than 10 entries.
                    if len(li) > 10:
                        batteries = add_element_to_array(batteries, li[10].text)
                    else:
                        batteries = add_element_to_array(batteries, li[9].text)
                else:
                    screens = add_element_to_array(screens, None)
                    os = add_element_to_array(os, None)
                    back_cams = add_element_to_array(back_cams, None)
                    front_cams = add_element_to_array(front_cams, None)
                    cpus = add_element_to_array(cpus, None)
                    rams = add_element_to_array(rams, None)
                    memories = add_element_to_array(memories, None)
                    batteries = add_element_to_array(batteries, None)
        # Always advance, so a URL that keeps failing cannot block the loop.
        i += 1
        time.sleep(sleep_time)
        
    with io.open(out_file, "w+", encoding='utf-8') as f:
        f.write("images\tprices\tscreens\tos\tfront_cams\tback_cams\tcpus\trams\tmemories\tbatteries\turls\tcompanies\n")
        for k in range(len(screens)):
            f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (images[k], prices[k], screens[k], os[k], front_cams[k], back_cams[k], cpus[k], rams[k], memories[k], batteries[k], urls[k], companies[k]))
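
The tab-separated output can also be produced with the standard `csv` module, which writes `None` as an empty field instead of the literal text `None`. This is a sketch of that alternative, not the notebook's method; the function name `write_rows` and the sample row are hypothetical.

```python
import csv
import io

def write_rows(handle, columns, rows):
    """Write `rows` (a list of dicts) to `handle` as tab-separated text with a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        # csv.writer writes None as an empty string.
        writer.writerow([row.get(column) for column in columns])

# Hypothetical sample row; a missing screen size is stored as None.
buffer = io.StringIO()
write_rows(buffer, ["prices", "screens"], [{"prices": "43.990.000", "screens": None}])
print(repr(buffer.getvalue()))
```

An empty field is read back by `pandas.read_csv` as a missing value, which fits the later `dropna` step.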

In [5]:
# def get_info_tgdd(urls_file, out_file, sleep_time=2):
#     (images, prices, screens, os, front_cams, back_cams, cpus, rams, memories, batteries, urls, companies, release_date) = init_array()
    
#     f = io.open(urls_file, 'r', encoding='utf-8')
#     if f.mode == 'r':
#         contents = f.read()
#     f.close()
    
#     domain = 'https://thegioididong.com'
#     urls = contents.split('\n')
    
#     num = 100
#     finish = False
#     i=0
#     driver = webdriver.Chrome()
#     while finish == False:
#         r = get_html_session(domain + urls[i])
#         driver.get(domain + urls[i])
#         if r.ok == True:
#             print(urls[i], i)
#             html = r.html
#             companies = add_element_to_array(companies, re.findall('/([a-z]+)-', urls[i])[0])
#             if(html.find('aside.picture > img', first=True) != None):
#                 images = add_element_to_array(images, find_first(html, 'aside.picture > img').attrs['src'])
#                 prices = add_element_to_array(prices, find_first(html, 'aside.price_sale > div.area_price > strong').text)

#                 li = html.find('ul.parameter > li > div')
#                 if len(li) >= 10:
#                     screens = add_element_to_array(screens, li[0].text)
#                     os = add_element_to_array(os, li[1].text)
#                     back_cams = add_element_to_array(back_cams, li[2].text)
#                     front_cams = add_element_to_array(front_cams, li[3].text)
#                     cpus = add_element_to_array(cpus, li[4].text)
#                     rams = add_element_to_array(rams, li[5].text)
#                     memories = add_element_to_array(memories, li[6].text)
#                     if len(li) > 10:
#                         batteries = add_element_to_array(batteries, li[10].text)
#                     else:
#                         batteries = add_element_to_array(batteries, li[9].text)
#                 else:
#                     screens = add_element_to_array(screens, None)
#                     os = add_element_to_array(os, None)
#                     back_cams = add_element_to_array(back_cams, None)
#                     front_cams = add_element_to_array(front_cams, None)
#                     cpus = add_element_to_array(cpus, None)
#                     rams = add_element_to_array(rams, None)
#                     memories = add_element_to_array(memories, None)
#                     batteries = add_element_to_array(batteries, None)
                    
#                 # Selenium click to get release date
#                 find_by_selenium(driver, 'button.viewparameterfull').click()
#                 time.sleep(sleep_time)
#                 release_date = add_element_to_array(release_date, find_by_selenium(driver, 'li.g13045 > div').get_attribute('innerHTML'))
                
#                 if len(screens) == num:
#                     finish = True
#             i+=1
        
#     f = io.open(out_file, "w+", encoding='utf-8')
#     f.write("images\tprices\tscreens\tos\tfront_cams\tback_cams\tcpus\trams\tmemories\tbatteries\turls\tcompanies\trelease_date\n")
#     for i in range(num):
#         f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (images[i], prices[i], screens[i], os[i], front_cams[i], back_cams[i], cpus[i], rams[i], memories[i], batteries[i], urls[i], companies[i], release_date[i]))
#     f.close()

In [6]:
def pre_processing_data():
    """Clean the raw crawled TGDD data and return it as a DataFrame."""
    domain = 'https://thegioididong.com'
    df = read_from_csv('data_tgdd_file.csv')

    # Drop the last character of the price (presumably the currency symbol).
    df['prices'] = df['prices'].str[:-1]
    # Keep only the relevant part of each specification string.
    # The dot in the screen pattern is escaped so it matches a literal decimal point.
    df['screens'] = df['screens'].str.extract(r'([0-9]\.[0-9]+)"', expand=False)
    df['os'] = df['os'].str.extract(r'(.+\s\d\.?\d?)', expand=False).str.lower()
    df['front_cams'] = df['front_cams'].str.extract(r'(\d+)\sMP', expand=False)
    df['back_cams'] = df['back_cams'].str.extract(r'(\d+)\sMP', expand=False)
    df['cpus'] = df['cpus'].str.extract(r'(\w+\s\w+)\s', expand=False)
    df['rams'] = df['rams'].str.extract(r'(\d+)', expand=False)
    df['memories'] = df['memories'].str.extract(r'(\d+)', expand=False)
    df['batteries'] = df['batteries'].str.extract(r'(\d+)\smAh', expand=False)
    # Turn the relative product path into a full URL.
    df['urls'] = domain + df['urls'].map(str)
    # The company is taken from the URL, where iPhones appear as "iphone".
    df['companies'] = df['companies'].replace(to_replace='iphone', value='apple')
    # Drop every row that still has a missing value.
    df = df.dropna()

    return df
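
To see what the extraction patterns keep, the sketch below applies a few of them to single strings with `re.search`, which returns the first match just as `Series.str.extract` does. The raw strings are hypothetical examples of crawled text, and `extract_first` is an illustrative helper, not part of the notebook.

```python
import re

# Patterns of the kind used in pre_processing_data (screen dot escaped).
PATTERNS = {
    "screens": r'([0-9]\.[0-9]+)"',
    "front_cams": r'(\d+)\sMP',
    "rams": r'(\d+)',
    "batteries": r'(\d+)\smAh',
}

def extract_first(pattern, text):
    """Return the first captured group of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical raw values for illustration only.
raw = {
    "screens": 'OLED, 6.5", Super Retina XDR',
    "front_cams": "12 MP",
    "rams": "4 GB",
    "batteries": "3969 mAh, fast charging",
}

cleaned = {column: extract_first(PATTERNS[column], text) for column, text in raw.items()}
print(cleaned)
```

The extracted values are still strings; a value that does not match becomes `None`, which corresponds to the rows removed by `dropna`.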

In [7]:
# get_all_urls('urls_tgdd_file.txt')

In [8]:
# get_info_tgdd('urls_tgdd_file.txt', 'data_tgdd_file.csv')

In [9]:
tgdd_df = pre_processing_data()
tgdd_df['companies'].hist()

[Figure: histogram of the `companies` column of `tgdd_df`]

In [24]:
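fpt_df = read_from_csv('fpt.csv')
# Keep only the FPT rows that have a price.
fpt_df = fpt_df.dropna(subset=['prices'])
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
# sort=True orders the columns alphabetically, as in the output below.
phone_df = pd.concat([tgdd_df, fpt_df], sort=True)

phone_df.head()
# phone_df.info()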

| Unnamed: 0 | back_cams | batteries | companies | cpus | front_cams | images | memories | os | prices | rams | release_date | screens | urls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 3969 | apple | Apple A13 | 12 | https://cdn.tgdd.vn/Products/Images/42/210654/... | 512 | ios 13 | 43.990.000 | 4 | 11/2019 | 6.5 | https://thegioididong.com/dtdd/iphone-11-pro-m... |
| 1 | 12 | 3969 | apple | Apple A13 | 12 | https://cdn.tgdd.vn/Products/Images/42/210653/... | 256 | ios 13 | 37.990.000 | 4 | 11/2019 | 6.5 | https://thegioididong.com/dtdd/iphone-11-pro-m... |
| 2 | 12 | 3046 | apple | Apple A13 | 12 | https://cdn.tgdd.vn/Products/Images/42/210655/... | 256 | ios 13 | 34.990.000 | 4 | 11/2019 | 5.8 | https://thegioididong.com/dtdd/iphone-11-pro-2... |
| 3 | 12 | 3969 | apple | Apple A13 | 12 | https://cdn.tgdd.vn/Products/Images/42/200533/... | 64 | ios 13 | 33.990.000 | 4 | 11/2019 | 6.5 | https://thegioididong.com/dtdd/iphone-11-pro-max |
| 4 | 12 | 3174 | apple | Apple A12 | 7 | https://cdn.tgdd.vn/Products/Images/42/190322/... | 256 | ios 12 | 33.990.000 | 4 | 11/2018 | 6.5 | https://thegioididong.com/dtdd/iphone-xs-max-2... |
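
In the output above, `prices` and `release_date` are still text (`43.990.000`, `11/2019`). Before modelling they would need to become numbers. The sketch below shows one possible conversion, assuming dots are thousands separators and dates are written as month/year; both function names are hypothetical.

```python
def parse_price(text):
    """Convert a price such as '43.990.000' to an integer (dots are thousands separators)."""
    return int(text.replace(".", ""))

def parse_release_date(text):
    """Convert a 'month/year' string such as '11/2019' to a (year, month) tuple."""
    month, year = text.split("/")
    return int(year), int(month)

print(parse_price("43.990.000"))
print(parse_release_date("11/2019"))
```

Returning `(year, month)` keeps the values sortable in chronological order.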
