Requirements
------------------


From the index URL `http://www.5metal.com.hk/ajax/pager/company_fea?view_amount=36&page=1`, we get a list of URLs for detail pages such as `http://www.5metal.com.hk/wongwahkee/contact`, from which we scrape each company's info.

------------


Info to scrape:

1. address
2. company name
3. company website
4. contact number
5. contact person
6. email
7. mobile phone

------------

In [1]:
from IPython.core.display import display, HTML
import urllib2
import bs4
import urlparse
import pandas as pd
import numpy as np

Observation
------------

Since the website's pagination is crappy (or convenient), you can set `view_amount` to a very large number to **avoid handling pagination** altogether. Also note that the `page` parameter is **zero-indexed**.
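A small sketch of how the index URL could be assembled from its query parameters. The helper name `build_index_url` is hypothetical, and the sketch assumes only the two parameters visible in the URL above, `view_amount` and `page`. It is written against Python 3's `urllib.parse`; in Python 2, `urlencode` lives in the `urllib` module instead.

```python
from urllib.parse import urlencode

# Endpoint taken from the index URL used in this notebook
INDEX_ENDPOINT = 'http://www.5metal.com.hk/ajax/pager/company_fea'


def build_index_url(view_amount, page=None):
    """Return the index URL; `page` is optional and passed through unchanged."""
    params = [('view_amount', view_amount)]
    if page is not None:
        params.append(('page', page))
    return INDEX_ENDPOINT + '?' + urlencode(params)


# One huge "page" instead of walking through many small ones
print(build_index_url(9999999))
# A single page of 36 entries
print(build_index_url(36, page=0))
```

Leaving `page` out matches the URL used in the next cell, which relies on the large `view_amount` alone.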

In [2]:
index_page_url = 'http://www.5metal.com.hk/ajax/pager/company_fea?view_amount=9999999'

In [3]:
index_page_src = urllib2.urlopen(index_page_url).read()
index_page_soup = bs4.BeautifulSoup(index_page_src, 'html.parser')

The index page HTML source:

In [4]:
print(index_page_soup.prettify())

<div class="view view-company-list view-id-company_list view-display-id-block_1 view-dom-id-e38206ac6bd084b90ccf4c18d81e44a7">
 <div class="view-content">
  <div>
   <div class="views-field views-field-nothing">
    <span class="field-content">
     <div class="company">
      <div class="logo">
       <a href="/node/10624" target="_blank">
        <img alt="" height="113" src="https://www.5metal.com.hk/sites/default/files/styles/company_list_logo/public/logo%20my%20home.jpg?itok=IJL9ebcH" width="152"/>
       </a>
      </div>
      <div class="title">
       <a href="/node/10624" target="_blank">
        愛家潔具有限公司
       </a>
      </div>
     </div>
    </span>
   </div>
  </div>
  <div>
   <div class="views-field views-field-nothing">
    <span class="field-content">
     <div class="company">
      <div class="logo">
       <a href="/node/13112" target="_blank">
        <img alt="" height="113" src="https://www.5metal.com.hk/sites/default/files/styles/company_list_logo/public/Kwong

The index page looks like:

In [5]:
display(HTML(index_page_src))

Thank you for the nice website design =3

In [6]:

detail_pages_links = [anchor.get('href') for anchor in index_page_soup.select('.views-field .title a[href]')]
detail_pages_links = [urlparse.urljoin(index_page_url, link) for link in detail_pages_links if link is not None]
scraped_data = pd.DataFrame(data={'detail_page_url': detail_pages_links})
scraped_data['address'] = None
scraped_data['company_name'] = None
scraped_data['company_website'] = None
scraped_data['contact_number'] = None
scraped_data['contact_person'] = None
scraped_data['fax'] = None
scraped_data['email'] = None
scraped_data['mobile_phone'] = None

print('No. of links got: ' + str(scraped_data.shape[0]))
print(scraped_data.head(10))

No. of links got: 36
                       detail_page_url address company_name company_website  \
0  http://www.5metal.com.hk/node/10624    None         None            None   
1  http://www.5metal.com.hk/node/13112    None         None            None   
2      http://www.5metal.com.hk/node/1    None         None            None   
3   http://www.5metal.com.hk/node/3869    None         None            None   
4   http://www.5metal.com.hk/node/8422    None         None            None   
5   http://www.5metal.com.hk/node/3652    None         None            None   
6    http://www.5metal.com.hk/node/149    None         None            None   
7  http://www.5metal.com.hk/node/14513    None         None            None   
8   http://www.5metal.com.hk/node/6018    None         None            None   
9  http://www.5metal.com.hk/node/15998    None         None            None   

  contact_number contact_person   fax email mobile_phone  
0           None           None  None  None       
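The links in the index page are site-relative (`/node/10624`), which is why the cell above passes them through `urljoin`. A minimal illustration of that step, written for Python 3 (`urllib.parse.urljoin`; the Python 2 equivalent is `urlparse.urljoin`, as used above). The `hrefs` list here is a hand-written stand-in for the values extracted from the page.

```python
from urllib.parse import urljoin

index_url = 'http://www.5metal.com.hk/ajax/pager/company_fea?view_amount=9999999'

# Stand-in for the href values pulled from the index page; None mimics
# an anchor without an href attribute.
hrefs = ['/node/10624', '/node/13112', None]

# A leading slash makes urljoin keep the scheme and host of the base URL
# and replace its path and query string.
absolute_links = [urljoin(index_url, href) for href in hrefs if href is not None]
print(absolute_links)
```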

Develop scraping function for detail page
-----------

Now we work on getting the company info from a detail page URL. It turns out that each detail page URL points (after URL redirection) to a company website in a standard format, and all the info we need can be found on that site's `Contact` page.

Here's the process for figuring out the contact page URL and getting its HTML source:

In [7]:
sample_detail_page_url = detail_pages_links[0]
print('sample_detail_page_url:')
print(sample_detail_page_url)

redirected_sample_detail_page_url = urllib2.urlopen(sample_detail_page_url).geturl()
print('redirected_sample_detail_page_url:')
print(redirected_sample_detail_page_url)

sample_contact_page_url = redirected_sample_detail_page_url + '/contact'
print('sample_contact_page_url:')
print(sample_contact_page_url)

sample_contact_page_src = urllib2.urlopen(sample_contact_page_url).read()
sample_contact_page_soup = bs4.BeautifulSoup(sample_contact_page_src, 'html.parser')

sample_detail_page_url:
http://www.5metal.com.hk/node/10624
redirected_sample_detail_page_url:
http://www.5metal.com.hk/myhomesanitaryware
sample_contact_page_url:
http://www.5metal.com.hk/myhomesanitaryware/contact
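Plain string concatenation works for the sample above, but it would produce a double slash if a redirected URL happened to end with `/`. A hedged sketch of a slightly more tolerant helper follows; `build_contact_url` is a hypothetical name, and the sketch assumes the redirected URL carries no query string or fragment.

```python
def build_contact_url(redirected_url):
    """Append '/contact' to a redirected detail page URL.

    Trailing slashes are stripped first so the result never contains '//contact'.
    """
    return redirected_url.rstrip('/') + '/contact'


print(build_contact_url('http://www.5metal.com.hk/myhomesanitaryware'))
print(build_contact_url('http://www.5metal.com.hk/myhomesanitaryware/'))
```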


Contact page's source:

In [8]:
print(sample_contact_page_soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html dir="ltr" version="XHTML+RDFa 1.0" xml:lang="zh-hant" xmlns="http://www.w3.org/1999/xhtml">
 <head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta content="noindex" name="robots">
   <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
   <link href="https://www.5metal.com.hk/sites/default/files/img_watermark_0_0.png" rel="shortcut icon" type="image/png"/>
   <meta content="Drupal 7 (http://drupal.org)" name="Generator"/>
   <title>
    愛家潔具有限公司 - 聯絡我們 | 香港五金網
   </title>
   <meta content="" http-equiv="Content-Security-Policy">
    <style media="all" type="text/css">
     @import url("https://www.5metal.com.hk/modules/system/system.base.css?o4ncg3");
@import url("https://www.5metal.com.hk/modules/system/system.menus.css?o4ncg3");
@import url("https://www.5metal.com.hk/modules/system/system.messages.css?o4ncg3");
@import url("https://www.5metal.com.hk/m

Contact page's look:

In [9]:
display(HTML(sample_contact_page_src))

The section **we're interested in**:

In [10]:
contact_info_soup = sample_contact_page_soup.select_one('#cp-content .cp-box-content')
print(contact_info_soup.prettify())

<div class="cp-box-content">
 <div class="row company_name">
  <span class="val">
   愛家潔具有限公司
  </span>
 </div>
 <div class="row">
  <label>
   地址：
  </label>
  <span class="val">
   新界火炭禾寮坑路2-16號安盛工業大廈1樓A室
  </span>
 </div>
 <div class="row">
  <label>
   聯絡人：
  </label>
  <span class="val">
   Ms Kathy Choi
  </span>
 </div>
 <div class="row">
  <label>
   電話(公司)：
  </label>
  <span class="val">
   26689782
  </span>
 </div>
 <div class="row">
  <label>
   電話(手提)：
  </label>
  <span class="val">
  </span>
 </div>
 <div class="row">
  <label>
   傳真：
  </label>
  <span class="val">
   26689810
  </span>
 </div>
 <div class="row">
  <label>
   電郵：
  </label>
  <span class="val">
   <a href="mailto:myhomesanitaryware@yahoo.com.hk">
    myhomesanitaryware@yahoo.com.hk
   </a>
  </span>
 </div>
 <div class="row">
  <label>
   網址：
  </label>
  <span class="val">
   <a href="www.myhome-hk.com" target="_blank">
    www.myhome-hk.com
   </a>
  </span>
 </div>
</div>



In [11]:
fields = ['company_name', 'address', 'contact_person', 'contact_number', 'mobile_phone', 'fax', 'email', 'company_website']
values = [contact_info_soup.select_one('.row.company_name' + ' + .row' * i + ' .val').getText() for i in xrange(len(fields))]
company_info = pd.DataFrame(data={'values': values}, index=fields)
print(company_info)

                                          values
company_name                            愛家潔具有限公司
address                  新界火炭禾寮坑路2-16號安盛工業大廈1樓A室
contact_person                     Ms Kathy Choi
contact_number                          26689782
mobile_phone                                    
fax                                     26689810
email            myhomesanitaryware@yahoo.com.hk
company_website                www.myhome-hk.com
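The positional selector above depends on the rows always appearing in the same order. As an alternative, the rows could be matched by their label text. The sketch below is not the approach used in this notebook: to stay self-contained it swaps BeautifulSoup for the standard library's `html.parser` (Python 3; in Python 2 the class lives in the `HTMLParser` module). The names `LABEL_TO_FIELD`, `ContactRowParser` and `parse_contact_rows` are hypothetical, and the label strings are copied from the sample page above.

```python
from html.parser import HTMLParser

# Label text (without the trailing colon) -> field name used in this notebook
LABEL_TO_FIELD = {
    u'地址': 'address',
    u'聯絡人': 'contact_person',
    u'電話(公司)': 'contact_number',
    u'電話(手提)': 'mobile_phone',
    u'傳真': 'fax',
    u'電郵': 'email',
    u'網址': 'company_website',
}


class ContactRowParser(HTMLParser):
    """Collect {field: text} from <div class="row"> blocks, keyed by label."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.info = {}
        self._field = None    # field name of the row being read
        self._target = None   # 'label' or 'val' while inside such an element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and 'row' in classes:
            # The company-name row has no <label>; its class identifies it.
            self._field = 'company_name' if 'company_name' in classes else None
        elif tag == 'label':
            self._target, self._buffer = 'label', []
        elif tag == 'span' and 'val' in classes:
            self._target, self._buffer = 'val', []

    def handle_data(self, data):
        if self._target is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = ''.join(self._buffer).strip()
        if tag == 'label' and self._target == 'label':
            # Drop the trailing full-width or ASCII colon before the lookup
            self._field = LABEL_TO_FIELD.get(text.rstrip(u'：:').strip())
            self._target = None
        elif tag == 'span' and self._target == 'val':
            if self._field is not None:
                self.info[self._field] = text
            self._target = None


def parse_contact_rows(html):
    parser = ContactRowParser()
    parser.feed(html)
    return parser.info


sample_html = u'''
<div class="cp-box-content">
 <div class="row company_name"><span class="val">愛家潔具有限公司</span></div>
 <div class="row"><label>傳真：</label><span class="val">26689810</span></div>
 <div class="row"><label>地址：</label><span class="val">新界火炭禾寮坑路2-16號安盛工業大廈1樓A室</span></div>
</div>
'''
print(parse_contact_rows(sample_html))
```

Rows whose label is not in `LABEL_TO_FIELD` are skipped, so a reordered or partially filled contact box would still map each value to the right field.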


Combining the steps above, we can create a function that, given a detail page URL, returns the company's info:

In [12]:
def get_company_info_from_detail_page_url(detail_page_url):
    """Return a dict of company info scraped from the contact page behind a detail page URL."""
    # Follow the redirect from the /node/... URL to the company's own site
    redirected_detail_page_url = urllib2.urlopen(detail_page_url).geturl()

    contact_page_url = redirected_detail_page_url + '/contact'

    contact_page_src = urllib2.urlopen(contact_page_url).read()
    contact_page_soup = bs4.BeautifulSoup(contact_page_src, 'html.parser')
    contact_info_soup = contact_page_soup.select_one('#cp-content .cp-box-content')
    # The i-th field sits in the i-th `.row` sibling after `.row.company_name` (i = 0 is that row itself)
    fields = ['company_name', 'address', 'contact_person', 'contact_number', 'mobile_phone', 'fax', 'email', 'company_website']
    values = [contact_info_soup.select_one('.row.company_name' + ' + .row' * i + ' .val').getText() for i in xrange(len(fields))]
    return dict(zip(fields, values))

Test run :)

In [13]:
test_run_result = get_company_info_from_detail_page_url(detail_pages_links[0])
print(test_run_result)

{'contact_number': u'26689782', 'mobile_phone': u'', 'fax': u'26689810', 'company_website': u'www.myhome-hk.com', 'company_name': u'\u611b\u5bb6\u6f54\u5177\u6709\u9650\u516c\u53f8', 'address': u'\u65b0\u754c\u706b\u70ad\u79be\u5bee\u5751\u8def2-16\u865f\u5b89\u76db\u5de5\u696d\u5927\u5ec81\u6a13A\u5ba4', 'contact_person': u'Ms Kathy Choi', 'email': u'myhomesanitaryware@yahoo.com.hk'}


The main scraping
------------

Now, let's go and scrape them all:

In [14]:
final_scraped_data = scraped_data.copy()

In [15]:
n = final_scraped_data.shape[0]
start_index = 0
end_index = n

print('%d detail pages to scrape from' % (end_index - start_index))
for i in xrange(start_index, end_index):
    url_to_scrape_from = final_scraped_data.iloc[i]['detail_page_url']
    print('[%4d/%4d] scraping url: %s' % (i, n,  url_to_scrape_from))
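    try:
        company_info = get_company_info_from_detail_page_url(url_to_scrape_from)

        for col in final_scraped_data.columns:
            # Compare strings by value; `is not` tests object identity
            if col != 'detail_page_url':
                # Write straight into the frame instead of via chained assignment
                final_scraped_data.iat[i, final_scraped_data.columns.get_loc(col)] = company_info[col]
    except Exception:
        print('there was an exception when scraping %dth url: %s' % (i, url_to_scrape_from))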

36 detail pages to scrape from
[   0/  36] scraping url: http://www.5metal.com.hk/node/10624
[   1/  36] scraping url: http://www.5metal.com.hk/node/13112
[   2/  36] scraping url: http://www.5metal.com.hk/node/1
[   3/  36] scraping url: http://www.5metal.com.hk/node/3869
[   4/  36] scraping url: http://www.5metal.com.hk/node/8422
[   5/  36] scraping url: http://www.5metal.com.hk/node/3652
[   6/  36] scraping url: http://www.5metal.com.hk/node/149
[   7/  36] scraping url: http://www.5metal.com.hk/node/14513
[   8/  36] scraping url: http://www.5metal.com.hk/node/6018
[   9/  36] scraping url: http://www.5metal.com.hk/node/15998
[  10/  36] scraping url: http://www.5metal.com.hk/node/3563
[  11/  36] scraping url: http://www.5metal.com.hk/node/13379
[  12/  36] scraping url: http://www.5metal.com.hk/node/8507
[  13/  36] scraping url: http://www.5metal.com.hk/node/3568
[  14/  36] scraping url: http://www.5metal.com.hk/node/3705
[  15/  36] scraping url: http://www.5metal.com.hk/no
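The loop above only prints a message when a page fails. A hedged sketch of a variant that also records which URLs failed (and why) and pauses between requests is shown below. `scrape_all`, `fake_fetch` and the `delay_seconds` parameter are hypothetical names; the fetch function is passed in so that the sketch runs without network access, and in the notebook it would be `get_company_info_from_detail_page_url`.

```python
import time


def scrape_all(urls, fetch_info, delay_seconds=1.0):
    """Call fetch_info(url) for every URL and return (results, failures).

    results maps url -> info dict; failures maps url -> short error description.
    """
    results = {}
    failures = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # be gentle with the server
        try:
            results[url] = fetch_info(url)
        except Exception as exc:
            failures[url] = '%s: %s' % (type(exc).__name__, exc)
    return results, failures


def fake_fetch(url):
    """Stand-in for the real scraper: fails for URLs ending in '/2'."""
    if url.endswith('/2'):
        raise AttributeError('contact box not found')
    return {'company_name': 'Company at ' + url}


demo_urls = ['http://example.invalid/1', 'http://example.invalid/2']
results, failures = scrape_all(demo_urls, fake_fetch, delay_seconds=0)
print(results)
print(failures)
```

Keeping the failed URLs in a dict makes it easy to revisit just those pages later.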

In [16]:
final_scraped_data

Unnamed: 0,detail_page_url,address,company_name,company_website,contact_number,contact_person,fax,email,mobile_phone
0,http://www.5metal.com.hk/node/10624,新界火炭禾寮坑路2-16號安盛工業大廈1樓A室,愛家潔具有限公司,www.myhome-hk.com,26689782,Ms Kathy Choi,26689810,myhomesanitaryware@yahoo.com.hk,
1,http://www.5metal.com.hk/node/13112,九龍深水埗福榮街15號地下,港城五金建材公司,http://www.5metal.com.hk/kwongshing,2763 0844,余小姐,26433381,info@kshw.com.hk,
2,http://www.5metal.com.hk/node/1,香港九龍油麻地新填地街226號地下 (油麻地港鐵站B2出口直行窩打老道口),李維記(寶源)五金工程,http://www.lwkpy.com.hk/,34223539,Ben Lee,34223790,leewaikee01@gmail.com,94228002
3,http://www.5metal.com.hk/node/3869,---------- 九龍旺角道6號K地下 ---------- 九龍旺角道6號N地下 --...,誠信建築材料有限公司,http://www.yp.com.hk/sinceredecoration,"23976061 , 23977365 (6K舖) / 23938404 , 2390363...",許生,------------ 23802867 (6K舖) ------------ 27899...,sincerebldg@gmail.com,
4,http://www.5metal.com.hk/node/8422,九龍大角咀通州街131號地下,準誠建材公司,5metal.com.hk/Goodwill,28513869,梁德成,25423094,goodwillconstruct@gmail.com,
5,http://www.5metal.com.hk/node/3652,九龍旺角新填地街461B號地下,同利號,http://kwlaser.net/,23949778,張生,23949773,winwinco@hotmail.com,
6,http://www.5metal.com.hk/node/149,新界元朗大旗嶺737號 (大棠路加德士油站對面),萬昌五金建材有限公司,http://www.man-cheong.com.hk/,24783868,李先生,24747057,jacksonli@mcmhk.com,
7,http://www.5metal.com.hk/node/14513,(A)九龍長沙灣青山道442號A鋪 / (B)香港筲箕灣道186號D鋪,黃華記裝修工程,5metal.com.hk/wongwahkee,31118648,黃逸明,31118649,kat90313806@gmail.com,"90313806 , 92728436"
8,http://www.5metal.com.hk/node/6018,九龍旺角砵蘭街351號地下,環球五金製品公司,http://5metal.com.hk/ump694/,"23952006 , 23813930",劉先生,23951716,ump694@netvigator.com,
9,http://www.5metal.com.hk/node/15998,新界火炭桂地街10號華麗工業中心10樓9室,津滙貿易公司,,,Sarah Chan,,jinhui11115@gmail.com,54482368


Profit
---------

Save our important information :)

In [17]:
import datetime

timestamp = str(datetime.datetime.now()).split('.')[0]
filename = './scraped_result %s.csv' % timestamp
print('saving data...')
final_scraped_data.to_csv(filename, encoding='utf-8')
print('saved data to %s'%filename)

saving data...
saved data to ./scraped_result 2016-04-06 09:25:32.csv
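The filename above contains a space and colons, which is awkward to quote in a shell, and colons are not allowed in Windows filenames. A minimal sketch of an alternative naming scheme using `strftime` follows. `make_result_filename` is a hypothetical helper, and the underscore/hyphen layout is just one possible choice.

```python
import datetime


def make_result_filename(now=None):
    """Build a CSV filename whose timestamp has no spaces or colons."""
    if now is None:
        now = datetime.datetime.now()
    return './scraped_result_%s.csv' % now.strftime('%Y-%m-%d_%H-%M-%S')


print(make_result_filename(datetime.datetime(2016, 4, 6, 9, 25, 32)))
# prints ./scraped_result_2016-04-06_09-25-32.csv
```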


In [18]:
!cat '$filename'

,detail_page_url,address,company_name,company_website,contact_number,contact_person,fax,email,mobile_phone
0,http://www.5metal.com.hk/node/10624,新界火炭禾寮坑路2-16號安盛工業大廈1樓A室,愛家潔具有限公司,www.myhome-hk.com,26689782,Ms Kathy Choi,26689810,myhomesanitaryware@yahoo.com.hk,
1,http://www.5metal.com.hk/node/13112,九龍深水埗福榮街15號地下,港城五金建材公司,http://www.5metal.com.hk/kwongshing,2763 0844,余小姐,26433381,info@kshw.com.hk,
2,http://www.5metal.com.hk/node/1,香港九龍油麻地新填地街226號地下 (油麻地港鐵站B2出口直行窩打老道口),李維記(寶源)五金工程,http://www.lwkpy.com.hk/,34223539,Ben Lee,34223790,leewaikee01@gmail.com,94228002
3,http://www.5metal.com.hk/node/3869,---------- 九龍旺角道6號K地下 ---------- 九龍旺角道6號N地下 ----------,誠信建築材料有限公司,http://www.yp.com.hk/sinceredecoration,"23976061 , 23977365 (6K舖) / 23938404 , 23903638 (6N舖)",許生,------------ 23802867 (6K舖) ------------ 27899043 (6N舖) ------------,sincerebldg@gmail.com,
4,http://www.5metal.com.hk/node/8422,九龍大角咀通州街131號地下,準誠建材公司,5metal.com.hk/Goodwill,28513869,梁德成,25423094,goodwillconstruct@gmail.com,
5,h

Afterthought
-------------------
Some of the pages raise an exception when scraped. After checking them in the browser, I found that these websites use a new design (or rather a new template, since the sites in this group look very similar to each other) with a different HTML structure. By distinguishing the old and new templates, we could certainly improve the scraping results further.
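As a first step towards telling the templates apart, pages could be sorted by whether they contain the contact box that the selector above relies on. The sketch below is only a crude marker check: the structure of the new template was not inspected here, so anything without the `cp-box-content` marker is simply labelled `'unknown'`. `detect_template` and `split_by_template` are hypothetical names.

```python
def detect_template(page_src):
    """Return 'old' if the page contains the contact box scraped above, else 'unknown'."""
    return 'old' if 'cp-box-content' in page_src else 'unknown'


def split_by_template(pages):
    """Group URLs by detected template; `pages` maps url -> HTML source."""
    groups = {'old': [], 'unknown': []}
    for url, page_src in sorted(pages.items()):
        groups[detect_template(page_src)].append(url)
    return groups


# Hand-written stand-ins for downloaded contact pages
demo_pages = {
    'http://example.invalid/a/contact': '<div id="cp-content"><div class="cp-box-content"></div></div>',
    'http://example.invalid/b/contact': '<div class="some-other-layout"></div>',
}
print(split_by_template(demo_pages))
```

The `'unknown'` group is then the list of pages that would need a second, template-specific scraping function.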