# Automatic URL classification based on IAB taxonomy categories

## Introduction
The Interactive Advertising Bureau (IAB) is an organization that empowers the media and marketing industries to thrive in the digital economy. The IAB Technology Laboratory (IAB Tech Lab) is a nonprofit research and development consortium charged with producing, and helping companies implement, global industry technical standards and solutions for the digital media and advertising industries.

In this project I'll create a classifier and an engine that detect the content behind a URL and label it automatically.


## My day-to-day problem
* I have to classify each URL manually by looking at the website and tagging it one by one in an SQL database. URL/website content changes every day, so this eats into my day-to-day time.
* There are many competing taxonomies out there.

## Goal of my project
* Be able to classify unknown websites.
* Reduce manual searching when tagging websites.
* Use a single taxonomy to classify websites.
* Can be extended to DMP integration.


## Scope
This is my EDS final project. The output is a classification of each URL into the IAB taxonomy. Since the IAB taxonomy has more than 1,100 categories, we will focus only on e-commerce websites and the IAB categories below:

1. Others (600)
2. Technology & Computing > Consumer Electronics > Cameras and Camcorders (621)
3. Technology & Computing > Consumer Electronics > Home Entertainment Systems (622)
4. Technology & Computing > Consumer Electronics > Smartphones (623)
5. Technology & Computing > Consumer Electronics > Tablets and E-readers (624)
6. Technology & Computing > Consumer Electronics > Wearable Technology (625)
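
The six categories in scope can be kept in a small lookup table so that labels are stored as numeric IDs and turned back into taxonomy paths when needed. This is a minimal sketch; the names `IAB_CATEGORIES`, `OTHERS_ID` and `category_name` are illustrative assumptions, not part of the project code.

```python
# Categories in scope, keyed by the numeric ID shown in the list above.
# IAB_CATEGORIES, OTHERS_ID and category_name are hypothetical names.
CONSUMER_ELECTRONICS = "Technology & Computing > Consumer Electronics"

IAB_CATEGORIES = {
    600: "Others",
    621: CONSUMER_ELECTRONICS + " > Cameras and Camcorders",
    622: CONSUMER_ELECTRONICS + " > Home Entertainment Systems",
    623: CONSUMER_ELECTRONICS + " > Smartphones",
    624: CONSUMER_ELECTRONICS + " > Tablets and E-readers",
    625: CONSUMER_ELECTRONICS + " > Wearable Technology",
}

OTHERS_ID = 600


def category_name(category_id):
    """Return the taxonomy path for an ID; anything out of scope maps to 'Others'."""
    return IAB_CATEGORIES.get(category_id, IAB_CATEGORIES[OTHERS_ID])


print(category_name(623))
print(category_name(999))
```

Mapping unknown IDs to "Others" mirrors the scope above, where everything outside the five consumer electronics categories falls into category 600.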

The first phase will target Lazada, since 70% of our links go to Lazada compared to other e-commerce sites.

## Task
### To create the training dataset
1. Go to the e-commerce website and scrape its content.
2. Label the content based on the IAB categories in scope.
3. Train the defined classifier on the dataset.
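
This section does not say which classifier is used for step 3, so the following is only a rough illustration of what "label, then train" could look like. It assumes a bag-of-words multinomial Naive Bayes with add-one smoothing, written in plain Python. The function names and the toy training texts are made up for the example; the category IDs come from the scope list.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Fit word counts per category from (text, category_id) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {"class_docs": class_docs, "word_counts": word_counts, "vocabulary": vocabulary}


def predict(model, text):
    """Return the category ID with the highest log-probability for the text."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["class_docs"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)
        for token in text.lower().split():
            if token in model["vocabulary"]:  # words never seen in training are skipped
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data, only to show the call pattern.
training = [
    ("dual sim smartphone 4g android phone", 623),
    ("unlocked smartphone touch screen phone", 623),
    ("dslr camera lens 24mp camcorder", 621),
    ("mirrorless camera zoom lens", 621),
    ("kitchen storage rack stainless steel", 600),
    ("cotton long sleeved shirt", 600),
]
model = train_naive_bayes(training)
print(predict(model, "android phone dual sim"))
```

In practice a library implementation would probably replace this hand-rolled version; the sketch is only meant to show the shape of the training data: one cleaned text per scraped page, paired with one category ID.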

### To test the result
1. Extract unknown-category data from Hadoop and create a list.
2. Feed the data into the machine learning preprocessing.
3. Extract the data and remove unwanted content.

## Go to the e-commerce website and scrape its content

* Using the Web Scraper (webscraper.io) Chrome extension, we provide the listing link and it scrapes every item URL, handling pagination automatically.

![image1](images/webscraper_tools.JPG)

* Export the links to a CSV file.

![image2](images/extracted_data.JPG)

## Preprocess extracted links from Lazada for the training set
* Since there is no existing training dataset, I need to create one by scraping the website and categorizing the data.

* I explored many scrapers, but most web-scraping packages don't support AJAX/JavaScript. Selenium is different: it drives a real browser engine such as Chrome or Firefox.

* For this project I'm using PhantomJS, since it can run in the background.

In [1]:
# import all the required packages
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.action_chains import ActionChains
from bs4 import BeautifulSoup

import re
import pandas as pd
import numpy as np
import csv

In [2]:
pd.set_option('display.max_colwidth', -1)  # show full cell contents instead of truncating them (-1 on older pandas; pandas 1.0+ expects None)
raw_links = pd.read_csv("./webpage_scrap/lz_smartphone.csv")  # load the URL list exported by the scraper

In [3]:
clean_links = raw_links['linked-href'][raw_links['pagination'] != 100]
print('clean_links len:',len(clean_links))
clean_links2 = clean_links.drop_duplicates() #remove duplicate inside list
clean_links2.dropna(inplace=True)
print('clean_links2 len: ',len(clean_links2))
clean_links2[4:5].to_string()

clean_links len: 216
clean_links2 len:  178


'5    https://www.lazada.com.my/allwin-original-unlocked-50-touch-screen-dual-sim-dual-standbysmart-phone-white-11793727.html?ff=1&sc=EwI='
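
The cell above does three things: it skips rows whose `pagination` value is 100, drops duplicate links, and drops missing links. The sketch below reproduces those steps with only the standard library on a made-up CSV, to make the effect visible. The column names `pagination` and `linked-href` come from the export; the sample rows, the `example.com` URLs and the function name `clean_links_from_csv` are assumptions for illustration.

```python
import csv
import io

# Made-up stand-in for the Web Scraper export; the real file has more rows.
sample_csv = """pagination,linked-href
1,https://example.com/phone-a.html
1,https://example.com/phone-a.html
2,
100,https://example.com/phone-b.html
3,https://example.com/phone-c.html
"""


def clean_links_from_csv(handle, skip_page="100"):
    """Return unique, non-empty links in file order, skipping rows from skip_page."""
    seen = set()
    links = []
    for row in csv.DictReader(handle):
        link = (row.get("linked-href") or "").strip()
        if row.get("pagination") == skip_page or not link or link in seen:
            continue
        seen.add(link)
        links.append(link)
    return links


links = clean_links_from_csv(io.StringIO(sample_csv))
print(links)
# ['https://example.com/phone-a.html', 'https://example.com/phone-c.html']
```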

In [4]:
def gettingResult(pageSource):
    # Parse the HTML handed in by the caller
    bsObj = BeautifulSoup(pageSource,'lxml')

    # The title sits under one of two selectors, depending on the page layout
    title = bsObj.find(id = 'prod_title') or bsObj.find(class_ = 'product-description__title')
    description = bsObj.find(class_='product-description__block').getText(' ').strip().replace("\n"," ").replace("\t"," ")
    attributes = bsObj.find(class_='prd-attributesList ui-listBulleted js-short-description').getText(' ').strip()
    result = title.text.strip() + description + attributes

    # Lower-case, replace brackets, double quotes, newlines and slashes, then collapse repeated spaces
    result2 = re.sub(' +',' ',result.lower().strip().replace(")"," ").replace("("," ").replace("\"","\'").replace("\n"," ").replace("/","or"))
    print('=========== data scraped! ============')
    return result2
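
To see what the final cleaning line in `gettingResult` does to a piece of text, the same chain of replacements can be run on its own. This is a small sketch; `normalize_text` is a hypothetical helper name and the input string is made up.

```python
import re


def normalize_text(text):
    """Apply the same replacements as the cleaning line in gettingResult."""
    cleaned = (
        text.lower().strip()
        .replace(")", " ")
        .replace("(", " ")
        .replace('"', "'")
        .replace("\n", " ")
        .replace("/", "or")
    )
    # Collapse runs of spaces left behind by the replacements
    return re.sub(" +", " ", cleaned)


print(normalize_text('  Dual SIM/Standby (5.0")\nTouch  Screen '))
# dual simorstandby 5.0' touch screen
```

Note that replacing `/` with `or` without surrounding spaces joins the neighbouring words (`sim/standby` becomes `simorstandby`), which may or may not be what a downstream tokenizer should see.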

In [5]:
final = []
count = 1
ttl_url = len(clean_links2)
csv_name = './scraping_data/result_smartphone.csv'
# ttl_url = len(clean_links2) - 62

In [76]:
def click_if_present(browser, class_name):
    # Scroll to the first element with this class and click it; report whether that worked
    try:
        element = browser.find_element_by_class_name(class_name)
        ActionChains(browser).move_to_element(element).perform()
        element.click()
        return True
    except Exception:
        return False

for url in clean_links2:
    print('====== Getting web browser ready ======')
#     browser = webdriver.Chrome('./engine/chromedriver.exe')
    browser = webdriver.PhantomJS('./engine/phantomjs.exe')  # Selenium 3 API; PhantomJS support was dropped in Selenium 4
    browser.get(url)
    counter = str(count)+'/'+str(ttl_url)
    print(counter,'Brows to',url)
    delay = 3 # seconds
    count +=1

    try:
        # Dismiss the overseas-delivery popup; if it is absent, wait for the description to load instead
        if not click_if_present(browser, "delivery-option-st__label"):
            WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'product-description__block-expand-button')))
        # Expand the description and the "more" section when those buttons exist
        click_if_present(browser, "product-description__block-expand-button")
        click_if_present(browser, "more-desc-button")
        final.append(gettingResult(browser.page_source))
    except TimeoutException:
        print ("Loading took too much time!")
    finally:
        browser.close()

1/216 Brows to https://www.lazada.com.my/korean-style-solid-spring-and-elastic-waist-wide-leg-pants-40956105.html?ff=1&sc=IYsE
2/216 Brows to https://www.lazada.com.my/summer-icy-sleeves-sun-protection-gloves-for-men-and-women-anti-uv-thin-long-ice-silk-sun-protection-sleeve-sleeves-driving-arm-sleeve-black-thumb-models-90562797.html?ff=1&sc=IYsE
3/216 Brows to https://www.lazada.com.my/kitchen-three-layer-multi-function-storage-rack-dish-rack-82660746.html?ff=1&sc=IYsE
4/216 Brows to https://www.lazada.com.my/men-backpack-mens-bag-kangaroo-business-vertical-version-messengerbag-70401280.html?ff=1&sc=IYsE
5/216 Brows to https://www.lazada.com.my/yoga-exercise-shoes-2017-new-style-korean-style-breathable-casual-flat-mesh-shoes-couple-sports-shoes-women-running-shoes-pink-color-73933767.html?ff=1&sc=IYsE
6/216 Brows to https://www.lazada.com.my/couple-shoes-2017-new-style-korean-style-fashion-wild-casual-shoeshong-kong-style-breathable-running-shoes-street-shooting-sportsshoes-female-sum

34/216 Brows to https://www.lazada.com.my/black-female-thin-section-summer-pencil-pants-bottoming-pants-black-black-40040450.html?ff=1&sc=IYsE
35/216 Brows to https://www.lazada.com.my/cat-bath-cage-supplies-two-door-pet-wash-cat-cage-cat-hairinjections-anti-catch-bite-cat-cage-free-shipping-87222652.html?ff=1&sc=IYsE
36/216 Brows to https://www.lazada.com.my/teapot-old-bamboo-pot-large-capacity-ore-famous-suit-58214305.html?ff=1&sc=IYsE
37/216 Brows to https://www.lazada.com.my/lauva-3-way-cat-tunnel-toy-for-cats-kittens-puppy-bunny-90873553.html?ff=1&sc=IYsE
38/216 Brows to https://www.lazada.com.my/han-feng-chic-top-female-wild-retro-hoodie-female-hong-kong-stylelong-sleeved-t-shirt-female-ulzzang-social-female-clothes-158-white-thin-section-81099863.html?ff=1&sc=IYsE
39/216 Brows to https://www.lazada.com.my/shuangqing-single-layer-kitchen-soap-rack-soap-dish-110928486.html?ff=1&sc=IYsE
40/216 Brows to https://www.lazada.com.my/evening-dresses-2016-summer-long-party-dress-big-size-

70/216 Brows to https://www.lazada.com.my/2017-summer-new-style-plus-sized-loose-short-sleeved-t-shirt-femaleround-neck-striped-top-wild-bottoming-shirt-female-small-effort579-red-striped-73932750.html?ff=1&sc=IYsE
71/216 Brows to https://www.lazada.com.my/children39s-3-6-12-a-month-newborn-children-music-toys-baby-toys-fitness-frame-baby-0-1-year-old-fitness-device-52517767.html?ff=1&sc=IYsE
72/216 Brows to https://www.lazada.com.my/i-the-world-people-aberdeen-minecraft-building-blocks-people-cooliecave-scene-dolls-doll-hand-to-do-toys-59001308.html?ff=1&sc=IYsE
73/216 Brows to https://www.lazada.com.my/ms-new-style-shoulder-tide-wild-chest-pack-67291575.html?ff=1&sc=IYsE
74/216 Brows to https://www.lazada.com.my/milk-tea-barrel-insulation-barrel-not-stainless-steel-commercialmilk-bucket-herbal-tea-barrel-juice-bucket-drink-faucet-doublelarge-capacity-90941948.html?ff=1&sc=IYsE
75/216 Brows to https://www.lazada.com.my/home-folding-hanging-scarf-the-28-circle-single-scarf-rack-4345868

104/216 Brows to https://www.lazada.com.my/hino-300-gogo-van-hino-tail-plate-truck-model-64617148.html?ff=1&sc=IYsE
105/216 Brows to https://www.lazada.com.my/color-diana-2017-autumn-and-winter-new-korean-women-slim-wildbottoming-long-sleeved-shirt-fashion-shirt-chiffon-shirt-pinkshort-sleeve-43439119.html?ff=1&sc=IYsE
106/216 Brows to https://www.lazada.com.my/byl-12-hanging-clothes-hanger-coat-stand-storage-popheko-96044111.html?ff=1&sc=IYsE
107/216 Brows to https://www.lazada.com.my/lucky-0780-natural-gold-feng-shui-home-accessories-furnishings-39645522.html?ff=1&sc=IYsE
108/216 Brows to https://www.lazada.com.my/childrens-multi-function-educational-nut-car-50943615.html?ff=1&sc=IYsE
109/216 Brows to https://www.lazada.com.my/yongjun-athletic-flying-stack-of-cup-tournament-cup-104499818.html?ff=1&sc=IYsE
110/216 Brows to https://www.lazada.com.my/fiat-hot-wheels-wind-fire-wheel-boutique-car-51055617.html?ff=1&sc=IYsE
111/216 Brows to https://www.lazada.com.my/art-linen-plus-sized-lo

139/216 Brows to https://www.lazada.com.my/storage-rack-long-75-wide-40-kitchen-shelf-floor-microwave-ovenstorage-rack-stainless-steel-color-kitchen-shelf-82663172.html?ff=1&sc=IYsE
140/216 Brows to https://www.lazada.com.my/retro-female-long-sleeved-women-heattech-top-122-more-casual-96055217.html?ff=1&sc=IYsE
141/216 Brows to https://www.lazada.com.my/simple-solid-color-bedroom-living-room-blackout-curtains-half-curtain-58168476.html?ff=1&sc=IYsE
142/216 Brows to https://www.lazada.com.my/baby-stitching-t-shirt-2017-spring-korean-style-new-style-boy-children39s-clothing-children39s-casual-long-sleeved-shirt-bottoming-tx-7600-navy-blue-47350259.html?ff=1&sc=IYsE
143/216 Brows to https://www.lazada.com.my/c-nex-adapter-ring-movie-lens-cctv-lens-to-sony-nex6nex5ra5000-41315018.html?ff=1&sc=IYsE
144/216 Brows to https://www.lazada.com.my/week-notepad-business-work-day-notes-cute-message-stickers-70356407.html?ff=1&sc=IYsE
145/216 Brows to https://www.lazada.com.my/fall-fashion-shoes-new-

174/216 Brows to https://www.lazada.com.my/10-dress-wood-slip-hanger-seamless-wood-hangers-wooden-hotelhanger-retro-clothing-store-hanger-garment-support-62170180.html?ff=1&sc=IYsE
175/216 Brows to https://www.lazada.com.my/spring-and-autumn-men-and-women-hotel-seven-points-long-sleeveddouble-sided-cotton-towel-material-long-plus-sized-robe-bathrobe-79577938.html?ff=1&sc=IYsE
176/216 Brows to https://www.lazada.com.my/korean-style-spring-new-loose-white-shirt-40906501.html?ff=1&sc=IYsE
177/216 Brows to https://www.lazada.com.my/applicable-new-style-fujifilm-xt10-x100-x100st-x10-x20-xa2-microsingle-camera-shoulder-87111753.html?ff=1&sc=IYsE
178/216 Brows to https://www.lazada.com.my/english-enlighten-color-word-card-english-flash-cards-large-cardfruit-class-24-zhang-english-children-teacher-teaching-41905422.html?ff=1&sc=IYsE
179/216 Brows to https://www.lazada.com.my/gelert-camping-quick-drying-outdoor-comes-with-storage-bag-39851616.html?ff=1&sc=IYsE
180/216 Brows to https://www.lazad

209/216 Brows to https://www.lazada.com.my/european-style-wrought-iron-wedding-road-lead-stage-ornaments-candlestick-lantern-60173834.html?ff=1&sc=IYsE
210/216 Brows to https://www.lazada.com.my/security-sanitation-traffic-rain-pants-reflective-raincoat-navyblue-suit-52002586.html?ff=1&sc=IYsE
211/216 Brows to https://www.lazada.com.my/color-diana-2017-autumn-and-winter-new-korean-long-sleeved-knitslim-bottoming-waist-was-thin-package-hip-flounced-dress-81103149.html?ff=1&sc=IYsE
212/216 Brows to https://www.lazada.com.my/shuangqing-wall-hangers-toothbrush-hanging-box-teeth-with-seat-toothbrush-rack-110927418.html?ff=1&sc=IYsE
213/216 Brows to https://www.lazada.com.my/cartoon-cute-small-cat-kitchen-long-carpet-non-slip-bedroom-bed-front-floor-mats-absorbent-non-slip-foot-mat-77541223.html?ff=1&sc=IYsE
214/216 Brows to https://www.lazada.com.my/storage-rack-long-60-wide-45-kitchen-floor-storage-rack-microwaveoven-rack-stainless-steel-color-3-layer-4-layer-finishing-rack-potrack-8261160

In [7]:
len(final)

215

In [80]:
#Push the list into csv file
with open(csv_name, "w", encoding='utf-8') as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in final:
        writer.writerow([val])  
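
The file written above holds one scraped text per row and no label yet. For step 2 of the task list (labelling), one option is to store the category ID next to each text. The sketch below assumes that every text from one scrape run gets the same ID, for example 623 for a smartphone listing; that labelling rule, the column names and the function name `write_labelled_rows` are assumptions, not something this notebook defines.

```python
import csv
import io


def write_labelled_rows(handle, texts, category_id):
    """Write a header plus one (text, category_id) row per scraped text."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["text", "iab_category_id"])
    for text in texts:
        writer.writerow([text, category_id])


# An in-memory buffer stands in for the output file in this example.
buffer = io.StringIO()
write_labelled_rows(buffer, ["phone a, dual sim", "phone b 4g"], 623)

rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
# [['text', 'iab_category_id'], ['phone a, dual sim', '623'], ['phone b 4g', '623']]
```

Going through `csv.writer` rather than joining strings by hand keeps texts that contain commas or quotes intact when the file is read back.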