Requirements
------------------


The index page at `http://www.5metal.com.hk/ajax/pager/company_fea?view_amount=36&page=1` lists the companies.

From it we get the URLs of detail pages such as `http://www.5metal.com.hk/wongwahkee/contact`, which we scrape for each company's info.

------------


Info to scrape:

1. address
2. company name
3. company website
4. contact number
5. contact person
6. email
7. fax
8. mobile phone

------------

In [None]:
from IPython.display import display, HTML
import urllib2
import bs4
import urlparse
import pandas as pd
import numpy as np

Get all the detail page links to scrape
------------

The index is paginated (10 pages at the time of writing), so we need to handle pagination. Note that the `page` parameter is **zero-indexed**.

In [None]:
index_page_url = 'http://www.5metal.com.hk/ajax/pager/company?keyword=&page=0&tid=1'
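
Because `page` is zero-indexed, an index with N pages uses `page=0` through `page=N-1`. A minimal sketch of generating those URLs, assuming the same URL pattern as above (the helper name `index_page_urls` is made up for illustration):

```python
INDEX_PAGE_URL_TEMPLATE = 'http://www.5metal.com.hk/ajax/pager/company?keyword=&page=%d&tid=1'

def index_page_urls(number_of_pages):
    """Return the URL of every index page; `page` runs from 0 to number_of_pages - 1."""
    return [INDEX_PAGE_URL_TEMPLATE % page for page in range(number_of_pages)]

urls = index_page_urls(3)
print(urls[0])   # first page uses page=0
print(urls[-1])  # last of 3 pages uses page=2
```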

In [None]:
index_page_src = urllib2.urlopen(index_page_url).read()
index_page_soup = bs4.BeautifulSoup(index_page_src, 'html.parser')

The index page HTML source:

In [None]:
print(index_page_soup.prettify())

The index page looks like:

In [None]:
display(HTML(index_page_src))

In [None]:
number_of_index_pages = int(index_page_soup.select_one('.pager-current').getText().split('of')[1])
print(number_of_index_pages)
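
The cell above splits the pager text on `'of'`, which suggests the text looks like `1 of 10` (that exact format is an assumption here). A slightly more defensive sketch that fails with a clear message when the text is not in that shape; `parse_page_count` is a hypothetical helper, not part of the notebook:

```python
import re

def parse_page_count(pager_text):
    """Extract the total page count from pager text such as '1 of 10'."""
    match = re.search(r'of\s*(\d+)', pager_text)
    if match is None:
        raise ValueError('unexpected pager text: %r' % pager_text)
    return int(match.group(1))

print(parse_page_count('1 of 10'))
```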

In [None]:
detail_pages_links = []

for i in xrange(number_of_index_pages):
    current_index_page_url = ('http://www.5metal.com.hk/ajax/pager/company?keyword=&page=%d&tid=1' % i)
    print('scraping index page: %s' % current_index_page_url)
    try:
        current_index_page_src = urllib2.urlopen(current_index_page_url).read()
        current_index_page_soup = bs4.BeautifulSoup(current_index_page_src, 'html.parser')

        detail_pages_links_in_current_page = [anchor.get('href') for anchor in current_index_page_soup.select('.views-field .company .logo a[href]')]
        detail_pages_links_in_current_page = [urlparse.urljoin(current_index_page_url, link) for link in detail_pages_links_in_current_page if link is not None]
        detail_pages_links += detail_pages_links_in_current_page
    except Exception:
        print('there was an exception when scraping index page: %s' % current_index_page_url)
    
scraped_data = pd.DataFrame(data={'detail_page_url': detail_pages_links})
scraped_data['address'] = None
scraped_data['company_name'] = None
scraped_data['company_website'] = None
scraped_data['contact_number'] = None
scraped_data['contact_person'] = None
scraped_data['fax'] = None
scraped_data['email'] = None
scraped_data['mobile_phone'] = None

print('No. of links got: ' + str(scraped_data.shape[0]))
print(scraped_data.head(10))
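
The loop above passes every `href` through `urljoin` so that relative links become absolute. A small sketch of what that does; the `href` values are assumptions modelled on the `/node/562` example used later, not values copied from the site:

```python
try:
    from urlparse import urljoin  # Python 2, as used in this notebook
except ImportError:
    from urllib.parse import urljoin  # Python 3

base = 'http://www.5metal.com.hk/ajax/pager/company?keyword=&page=0&tid=1'

# a root-relative href is resolved against the host of the index page
root_relative = urljoin(base, '/node/562')
# an href that is already absolute is returned unchanged
already_absolute = urljoin(base, 'http://www.5metal.com.hk/node/562')

print(root_relative)
print(already_absolute)
```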

Develop scraping function for detail page
-----------

Next we extract the company info from each detail page URL. Each detail page URL points to a company website in a **nearly** standard format (**some** only after a URL redirect), and all the info we need is on that site's `Contact` page.

There are currently **2 designs**, so we need two scraping routines.

This is how we handle redirection:

In [None]:
sample_detail_page_url = detail_pages_links[0]
print('sample_detail_page_url:')
print(sample_detail_page_url)

redirected_sample_detail_page_url = urllib2.urlopen(sample_detail_page_url).geturl()
print('redirected_sample_detail_page_url:')
print(redirected_sample_detail_page_url)

sample_contact_page_url = redirected_sample_detail_page_url + '/contact'
print('sample_contact_page_url:')
print(sample_contact_page_url)

So we create this helper function:

In [None]:
def get_contact_page_soup_from_detail_page_url(detail_page_url):

    redirected_detail_page_url = urllib2.urlopen(detail_page_url).geturl()
    contact_page_url = redirected_detail_page_url + '/contact'
    contact_page_src = urllib2.urlopen(contact_page_url).read()
    return bs4.BeautifulSoup(contact_page_src, 'html.parser')

Test the helper:

In [None]:
get_contact_page_soup_from_detail_page_url(detail_pages_links[0])
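
The helper appends `'/contact'` to the redirected URL, which would produce a double slash if the redirected URL ever ended with `/`. A minimal sketch of a more tolerant join, assuming the redirected URL carries no query string (`build_contact_page_url` is a hypothetical name):

```python
def build_contact_page_url(redirected_detail_page_url):
    """Append '/contact' to a company site URL, tolerating a trailing slash."""
    return redirected_detail_page_url.rstrip('/') + '/contact'

print(build_contact_page_url('http://www.5metal.com.hk/wongwahkee'))
print(build_contact_page_url('http://www.5metal.com.hk/wongwahkee/'))
```

Both calls give the same contact page URL as the example in the requirements.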

Handle design 1 (e.g. http://www.5metal.com.hk/node/562)
----------

In [None]:
detail_page_url_design1 = 'http://www.5metal.com.hk/node/562'
sample_contact_page_soup_design1 = get_contact_page_soup_from_detail_page_url(detail_page_url_design1)
print(sample_contact_page_soup_design1.prettify())

The section **we're interested in**:

In [None]:
contact_info_soup = sample_contact_page_soup_design1.select_one('.company_right')
print(contact_info_soup.prettify())

In [None]:
# (field name, CSS selector relative to `.company_right`) pairs
fields_and_selectors = [
    ('company_name', '.node_title'),
    ('address', '.field-name-field-company-address .field-item'),
    ('contact_person', '.field-name-field-contact-person .field-item'),
    ('contact_number', '.field-name-field-company-tel .field-item'),
    ('mobile_phone', '.field-name-field-mobile .field-item'),
    ('fax', '.field-name-field-company-fax .field-item'),
    ('email', '.field-name-field-email .field-item'),
    ('company_website', '.field-name-field-company-url .field-item')
]
# select_one returns None for a missing field; `element and ...` keeps that None
elements = [contact_info_soup.select_one(css_selector) for (field, css_selector) in fields_and_selectors]
values = [(element and element.getText().strip()) for element in elements]
company_info = pd.DataFrame(data={'values': values}, index=[field for (field, css_selector) in fields_and_selectors])
print(company_info)

Combining the steps above, we can write a function that, given the soup of a contact page, returns the company's info:

In [None]:
def get_company_info_from_soup_design1(contact_page_soup):
    # design 1 keeps all contact fields inside `.company_right`
    contact_info_soup = contact_page_soup.select_one('.company_right')

    # (field name, CSS selector relative to `.company_right`) pairs
    fields_and_selectors = [
        ('company_name', '.node_title'),
        ('address', '.field-name-field-company-address .field-item'),
        ('contact_person', '.field-name-field-contact-person .field-item'),
        ('contact_number', '.field-name-field-company-tel .field-item'),
        ('mobile_phone', '.field-name-field-mobile .field-item'),
        ('fax', '.field-name-field-company-fax .field-item'),
        ('email', '.field-name-field-email .field-item'),
        ('company_website', '.field-name-field-company-url .field-item')
    ]

    company_info = {}
    for (field, css_selector) in fields_and_selectors:
        element = contact_info_soup.select_one(css_selector)
        # a field missing from the page is stored as None
        company_info[field] = element and element.getText().strip()
    return company_info

Test run :)

In [None]:
test_run_result = get_company_info_from_soup_design1(get_contact_page_soup_from_detail_page_url(detail_page_url_design1))
print(test_run_result)

Handle design 2 (e.g. http://www.5metal.com.hk/node/13112)
----------

In [None]:
detail_page_url_design2 = 'http://www.5metal.com.hk/node/13112'
sample_contact_page_soup_design2 = get_contact_page_soup_from_detail_page_url(detail_page_url_design2)
print(sample_contact_page_soup_design2.prettify())

The section **we're interested in**:

In [None]:
contact_info_soup = sample_contact_page_soup_design2.select_one('.cp-contact .cp-box-content')
print(contact_info_soup.prettify())

In [None]:
contact_info_soup = sample_contact_page_soup_design2.select_one('.cp-contact .cp-box-content')
fields = ['company_name', 'address', 'contact_person', 'contact_number', 'mobile_phone', 'fax', 'email', 'company_website']
values = [contact_info_soup.select_one('.row.company_name' + ' + .row' * i + ' .val').getText().strip() for i in xrange(len(fields))]
{field:value for (field, value) in zip(fields, values)}
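
Design 2 has no per-field class names, so the cell above finds each value by position: it starts at `.row.company_name` and adds one `+ .row` adjacent-sibling step per field. A short sketch that only prints the generated selectors, to make the trick visible (`design2_row_selector` is a hypothetical helper name):

```python
def design2_row_selector(row_index):
    """Selector for the value cell of the row `row_index` siblings after `.row.company_name`."""
    return '.row.company_name' + ' + .row' * row_index + ' .val'

fields = ['company_name', 'address', 'contact_person', 'contact_number',
          'mobile_phone', 'fax', 'email', 'company_website']
for i, field in enumerate(fields):
    print('%-16s %s' % (field, design2_row_selector(i)))
```

This only works as long as the rows appear in exactly this order on every design 2 page.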

Scraping function for design 2

In [None]:
def get_company_info_from_soup_design2(contact_page_soup):
    contact_info_soup = contact_page_soup.select_one('.cp-contact .cp-box-content')
    
    fields = ['company_name', 'address', 'contact_person', 'contact_number', 'mobile_phone', 'fax', 'email', 'company_website']
    values = [contact_info_soup.select_one('.row.company_name' + ' + .row' * i + ' .val').getText().strip() for i in xrange(len(fields))]
    return {field:value for (field, value) in zip(fields, values)}

test run:

In [None]:
test_run_result = get_company_info_from_soup_design2(get_contact_page_soup_from_detail_page_url(detail_page_url_design2))
print(test_run_result)

The main scraping
------------

Now, let's go and sc**rape** them all:

In [None]:
final_scraped_data = scraped_data.copy()

In [None]:
n = final_scraped_data.shape[0]
start_index = 0
end_index = n

print('%d detail pages to scrape from' % (end_index - start_index))
for i in xrange(start_index, end_index):
    url_to_scrape_from = final_scraped_data.iloc[i]['detail_page_url']
    print('[%4d/%4d] scraping url: %s' % (i, n,  url_to_scrape_from))
    try:
        contact_page_soup = get_contact_page_soup_from_detail_page_url(url_to_scrape_from)
        is_design1 = (contact_page_soup.select_one('.company_right') is not None)
        
        if is_design1:
            company_info = get_company_info_from_soup_design1(contact_page_soup)
        else:
            company_info = get_company_info_from_soup_design2(contact_page_soup)
    
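        for col in final_scraped_data.columns:
            # compare strings with !=, since `is not` tests object identity
            if col != 'detail_page_url':
                # write by position in one step; chained `df[col].iloc[i] = ...` may assign to a copy
                final_scraped_data.iloc[i, final_scraped_data.columns.get_loc(col)] = company_info[col]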
    except Exception:
        print('there was an exception when scraping %dth url: %s' % (i, url_to_scrape_from))

In [None]:
final_scraped_data
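
The loop above prints a message when a page fails but does not keep a record of it. A minimal sketch of collecting the failures so they can be inspected or retried later; `scrape_all` and `fake_scrape_one` are hypothetical names, and the fake scraper stands in for the real fetch-and-parse step:

```python
def scrape_all(urls, scrape_one):
    """Scrape every URL with `scrape_one`; return (results, failures).

    results maps url -> scraped dict, failures is a list of (url, error message).
    """
    results = {}
    failures = []
    for url in urls:
        try:
            results[url] = scrape_one(url)
        except Exception as error:
            # remember what failed and why, so it can be inspected or retried later
            failures.append((url, str(error)))
    return results, failures


# stand-in scraper for illustration only; the real one would fetch and parse the page
def fake_scrape_one(url):
    if 'bad' in url:
        raise ValueError('unknown template')
    return {'company_name': 'Example Co.'}

results, failures = scrape_all(['http://example.com/good', 'http://example.com/bad'], fake_scrape_one)
print(failures)
```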

Profit
---------

Save our important information :)

In [None]:
import datetime

timestamp = str(datetime.datetime.now()).split('.')[0]
filename = './scraped_result %s.csv' % timestamp
print('saving data...')
final_scraped_data.to_csv(filename, encoding='utf-8')
print('saved data to %s'%filename)

In [None]:
!cat '$filename'
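
The timestamp built with `str(datetime.datetime.now())` contains a space and colons, which is why the filename has to be quoted in the shell and why it would not be a valid filename on Windows. A small sketch of an alternative using `strftime`; the name `make_result_filename` and the exact format are just one possible choice:

```python
import datetime

def make_result_filename(now):
    """Build a CSV filename with a sortable timestamp free of spaces and colons."""
    return 'scraped_result_%s.csv' % now.strftime('%Y%m%d_%H%M%S')

print(make_result_filename(datetime.datetime.now()))
```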

Afterthought
-------------------
Some of the scrapes raise exceptions. After checking those pages in the browser, I found that these websites use a new design (or rather a new template, since the sites in this group look very similar to each other) with a different HTML structure. By distinguishing the old and new templates, we can certainly improve the scraping results further.
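
One way to make room for more templates is to keep a list of (marker selector, extractor) pairs and pick the first one whose marker is present on the page. This is only a sketch: `pick_extractor` is a hypothetical helper, and the marker selector of the new template is unknown and left as a placeholder in the comment.

```python
def pick_extractor(contact_page_soup, templates):
    """Return the extractor of the first template whose marker selector matches, else None."""
    for marker_selector, extractor in templates:
        if contact_page_soup.select_one(marker_selector) is not None:
            return extractor
    return None

# Possible wiring with the functions defined in this notebook:
# templates = [
#     ('.company_right', get_company_info_from_soup_design1),
#     ('.cp-contact .cp-box-content', get_company_info_from_soup_design2),
#     # (marker selector of the new template, its extractor) would go here
# ]
# extractor = pick_extractor(contact_page_soup, templates)
# company_info = extractor(contact_page_soup) if extractor is not None else None
```

A page that matches no known template then yields `None` instead of an exception, which makes unknown designs easy to list afterwards.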