Requirements
------------------


Starting from the index URL `http://www.5metal.com.hk/ajax/pager/company_fea?view_amount=36&page=1`, we get a list of URLs for detail pages such as `http://www.5metal.com.hk/wongwahkee/contact`, from which we scrape each company's info.

------------


Info to scrape:

1. address
2. company name
3. company website
4. contact number
5. contact person
6. email
7. fax
8. mobile phone

------------

In [1]:
from IPython.core.display import display, HTML
import urllib2
import bs4
import urlparse
import pandas as pd
import numpy as np

Get all the detail page links to scrape
------------

The index is split across multiple pages (10 at the time of writing), so we need to handle pagination. Note that the `page` parameter is **zero-indexed**.
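A minimal sketch of the zero-indexed pagination, assuming the URL pattern used in the cells below. `INDEX_PAGE_URL_TEMPLATE` and `build_index_page_urls` are hypothetical names introduced here for illustration only:

```python
# URL pattern copied from the notebook; %d is the zero-indexed page number
INDEX_PAGE_URL_TEMPLATE = 'http://www.5metal.com.hk/ajax/pager/company?keyword=&page=%d&tid=1'


def build_index_page_urls(number_of_pages):
    """Return one URL per index page; `page` starts at 0, not 1."""
    return [INDEX_PAGE_URL_TEMPLATE % page for page in range(number_of_pages)]


urls = build_index_page_urls(10)
print(urls[0])   # first page uses page=0
print(urls[-1])  # tenth page uses page=9
```

With 10 pages, the last valid value of `page` is 9; requesting `page=10` would be off by one.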

In [2]:
index_page_url = 'http://www.5metal.com.hk/ajax/pager/company?keyword=&page=0&tid=1'

In [3]:
index_page_src = urllib2.urlopen(index_page_url).read()
index_page_soup = bs4.BeautifulSoup(index_page_src, 'html.parser')

The index page HTML source:

In [4]:
print(index_page_soup.prettify())

<div class="view view-company-list view-id-company_list view-display-id-page_1 view-dom-id-4d100f87fa2f9c3a789d11e8e8bd67be">
 <div class="view-content">
  <div>
   <div class="views-field views-field-nothing">
    <span class="field-content">
     <div class="company">
      <div class="logo">
       <a href="/node/13112" target="_blank">
        <img alt="" height="113" src="https://www.5metal.com.hk/sites/default/files/styles/company_list_logo/public/KwongShingHW%20Capture%20Logo.PNG?itok=FeWMN_D-" width="180"/>
       </a>
      </div>
      <div class="title">
       <a href="/node/[nid_1]" target="_blank">
        港城五金建材公司
       </a>
      </div>
     </div>
    </span>
   </div>
  </div>
  <div>
   <div class="views-field views-field-nothing">
    <span class="field-content">
     <div class="company">
      <div class="logo">
       <a href="/node/1" target="_blank">
        <img alt="" height="59" src="https://www.5metal.com.hk/sites/default/files/styles/company_list_logo/pub

The index page looks like:

In [5]:
display(HTML(index_page_src))

In [6]:
number_of_index_pages = int(index_page_soup.select_one('.pager-current').getText().split('of')[1])
print(number_of_index_pages)

10
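The `split('of')` call above relies on the pager text having the form `<current> of <total>`. A slightly more defensive sketch using a regular expression is shown here. The sample string `' 1 of 10 '` and the helper name `parse_total_pages` are assumptions for illustration, not taken from the site:

```python
import re


def parse_total_pages(pager_text):
    """Extract the total page count from text such as '1 of 10'."""
    match = re.search(r'(\d+)\s*of\s*(\d+)', pager_text)
    if match is None:
        # fail loudly instead of with an opaque IndexError
        raise ValueError('unexpected pager text: %r' % pager_text)
    return int(match.group(2))


print(parse_total_pages(' 1 of 10 '))
```

Raising a `ValueError` with the offending text makes it easier to notice if the site ever changes its pager markup.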


In [7]:
detail_pages_links = []

for i in xrange(number_of_index_pages):
    current_index_page_url = ('http://www.5metal.com.hk/ajax/pager/company?keyword=&page=%d&tid=1' % i)
    print('scraping index page: %s' % current_index_page_url)
    try:
        current_index_page_src = urllib2.urlopen(current_index_page_url).read()
        current_index_page_soup = bs4.BeautifulSoup(current_index_page_src, 'html.parser')

        detail_pages_links_in_current_page = [anchor.get('href') for anchor in current_index_page_soup.select('.views-field .company .logo a[href]')]
        detail_pages_links_in_current_page = [urlparse.urljoin(current_index_page_url, link) for link in detail_pages_links_in_current_page if link is not None]
        detail_pages_links += detail_pages_links_in_current_page
    except Exception:
        print('there was an exception when scraping index page: %s' % current_index_page_url)
    
scraped_data = pd.DataFrame(data={'detail_page_url': detail_pages_links})
scraped_data['address'] = None
scraped_data['company_name'] = None
scraped_data['company_website'] = None
scraped_data['contact_number'] = None
scraped_data['contact_person'] = None
scraped_data['fax'] = None
scraped_data['email'] = None
scraped_data['mobile_phone'] = None

print('No. of links got: ' + str(scraped_data.shape[0]))
print(scraped_data.head(10))

scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=0&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=1&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=2&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=3&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=4&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=5&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=6&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=7&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=8&tid=1
scraping index page: http://www.5metal.com.hk/ajax/pager/company?keyword=&page=9&tid=1
No. of links got: 360
                       detail_page_url address company_name company_website  \
0  http://www.5metal.com.hk/n
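The link collection above turns relative `href` values such as `/node/13112` into absolute URLs. A small offline sketch of that step follows. The sample `hrefs` list is made up, and the duplicate check is an optional extra that the notebook's loop does not perform:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in this notebook

index_page_url = 'http://www.5metal.com.hk/ajax/pager/company?keyword=&page=0&tid=1'
# sample values only; None mimics an anchor without an href attribute
hrefs = ['/node/13112', None, '/node/1', '/node/13112']

absolute_links = []
for href in hrefs:
    if href is None:
        continue
    link = urljoin(index_page_url, href)
    if link not in absolute_links:  # optional: skip duplicates
        absolute_links.append(link)

print(absolute_links)
```

Because the `href` starts with `/`, `urljoin` keeps only the scheme and host of the index URL and drops its path and query string.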

Develop scraping function for detail page
-----------

Next, we extract the company info from each detail page URL. Each detail page URL points to a company website in a **nearly** standard format (**some** only after a URL redirection), and all the info we need can be found on the `Contact` page of that site.

Note that there are currently **2 designs**, so we need two scraping approaches.

This is how we handle redirection:

In [8]:
sample_detail_page_url = detail_pages_links[0]
print('sample_detail_page_url:')
print(sample_detail_page_url)

redirected_sample_detail_page_url = urllib2.urlopen(sample_detail_page_url).geturl()
print('redirected_sample_detail_page_url:')
print(redirected_sample_detail_page_url)

sample_contact_page_url = redirected_sample_detail_page_url + '/contact'
print('sample_contact_page_url:')
print(sample_contact_page_url)

sample_detail_page_url:
http://www.5metal.com.hk/node/13112
redirected_sample_detail_page_url:
http://www.5metal.com.hk/kwongshing
sample_contact_page_url:
http://www.5metal.com.hk/kwongshing/contact


So we create this helper function:

In [9]:
def get_contact_page_soup_from_detail_page_url(detail_page_url):

    redirected_detail_page_url = urllib2.urlopen(detail_page_url).geturl()
    contact_page_url = redirected_detail_page_url + '/contact'
    contact_page_src = urllib2.urlopen(contact_page_url).read()
    return bs4.BeautifulSoup(contact_page_src, 'html.parser')
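The helper appends `/contact` directly to the redirected URL. If a redirected URL ever ended with a trailing slash, that would produce `//contact`. A hedged sketch of a small guard against this is below; `build_contact_page_url` is a hypothetical name, and whether the site ever returns trailing slashes is not known from the notebook:

```python
def build_contact_page_url(redirected_detail_page_url):
    """Append '/contact' without producing a double slash."""
    return redirected_detail_page_url.rstrip('/') + '/contact'


print(build_contact_page_url('http://www.5metal.com.hk/kwongshing'))
print(build_contact_page_url('http://www.5metal.com.hk/kwongshing/'))
```

Both calls produce the same contact page URL.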

test~

In [10]:
get_contact_page_soup_from_detail_page_url(detail_pages_links[0])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"\r\n  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n\n<html dir="ltr" version="XHTML+RDFa 1.0" xml:lang="zh-hant" xmlns="http://www.w3.org/1999/xhtml">\n<head profile="http://www.w3.org/1999/xhtml/vocab">\n<meta content="noindex" name="robots">\n<meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/>\n<link href="https://www.5metal.com.hk/sites/default/files/img_watermark_0_0.png" rel="shortcut icon" type="image/png"/>\n<meta content="Drupal 7 (http://drupal.org)" name="Generator"/>\n<title>\u6e2f\u57ce\u4e94\u91d1\u5efa\u6750\u516c\u53f8 - \u806f\u7d61\u6211\u5011 | \u9999\u6e2f\u4e94\u91d1\u7db2</title>\n<meta content="" http-equiv="Content-Security-Policy">\n<style media="all" type="text/css">@import url("https://www.5metal.com.hk/modules/system/system.base.css?o58yvz");\n@import url("https://www.5metal.com.hk/modules/system/system.menus.css?o58yvz");\n@import url("https://www.5metal.com.hk/modules/syste

Handle design 1 (e.g. http://www.5metal.com.hk/node/562):
----------

In [11]:
detail_page_url_design1 = 'http://www.5metal.com.hk/node/562'
sample_contact_page_soup_design1 = get_contact_page_soup_from_detail_page_url(detail_page_url_design1)
print(sample_contact_page_soup_design1.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html dir="ltr" version="XHTML+RDFa 1.0" xml:lang="zh-hant" xmlns="http://www.w3.org/1999/xhtml">
 <head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta content="noindex" name="robots">
   <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
   <link href="https://www.5metal.com.hk/sites/default/files/img_watermark_0_0.png" rel="shortcut icon" type="image/png"/>
   <link href="/node/562" rel="shortlink"/>
   <link href="/content/%E5%BB%A3%E6%A5%AD%E6%A9%9F%E5%99%A8%E5%BB%A0" rel="canonical"/>
   <meta content="Drupal 7 (http://drupal.org)" name="Generator"/>
   <title>
    廣業機器廠 | 香港五金網
   </title>
   <meta content="" http-equiv="Content-Security-Policy">
    <style media="all" type="text/css">
     @import url("https://www.5metal.com.hk/modules/system/system.base.css?o58yvz");
@import url("https://www.5metal.com.hk/modules/system/system.menus.css?o58yvz");
@

The section **we're interested in**:

In [12]:
contact_info_soup = sample_contact_page_soup_design1.select_one('.company_right')
print(contact_info_soup.prettify())

<div class="company_right">
 <div class="node_title">
  廣業機器廠
 </div>
 <div class="company_detail">
  <!-- google_ad_section_start -->
  <div class="field field-name-field-company-type field-type-taxonomy-term-reference field-label-inline clearfix">
   <div class="field-label">
    行業類別:
   </div>
   <div class="field-items">
    <div class="field-item even">
     五金工具
    </div>
    <div class="field-item odd">
     手工具
    </div>
    <div class="field-item even">
     泵類工具
    </div>
    <div class="field-item odd">
     建築與安全設備
    </div>
    <div class="field-item even">
     個人護理用具
    </div>
   </div>
  </div>
  <!-- google_ad_section_end -->
  <div class="field field-name-field-contact-method field-type-list-text field-label-inline clearfix">
   <div class="field-label">
    常用聯絡方法:
   </div>
   <div class="field-items">
    <div class="field-item even">
     電話(公司)
    </div>
   </div>
  </div>
  <div class="field field-name-field-email field-type-text field-label-inline clearf

In [13]:
# (field name, CSS selector relative to the .company_right container)
fields_and_selectors = [
    ('company_name', '.node_title'),
    ('address', '.field-name-field-company-address .field-item'),
    ('contact_person', '.field-name-field-contact-person .field-item'),
    ('contact_number', '.field-name-field-company-tel .field-item'),
    ('mobile_phone', '.field-name-field-mobile .field-item'),
    ('fax', '.field-name-field-company-fax .field-item'),
    ('email', '.field-name-field-email .field-item'),
    ('company_website', '.field-name-field-company-url .field-item')
]
# a missing element (None) stays None instead of raising
values = [(value and value.getText().strip()) for value in [contact_info_soup.select_one(x[1]) for x in fields_and_selectors]]
company_info = pd.DataFrame(data={'values': values}, index=[field for (field, css_selector) in fields_and_selectors])
print(company_info)

                                     values
company_name                          廣業機器廠
address                        九龍旺角甘霖街12號地下
contact_person                         None
contact_number          23230733 / 27711305
mobile_phone                           None
fax                     21914243 / 23854177
email            kwongyipmetal@yahoo.com.hk
company_website     http://www.kwongyip.com
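The `None` values for `contact_person` and `mobile_phone` come from the `value and value.getText().strip()` idiom: when `select_one` finds nothing it returns `None`, and `and` short-circuits before `getText()` is called. The sketch below illustrates this without any network access. `FakeTag` is a hypothetical stand-in that mimics only the `getText()` method of a real bs4 tag, and `text_or_none` is a name introduced for illustration:

```python
class FakeTag(object):
    """Hypothetical stand-in for a bs4 Tag; only getText() is mimicked."""

    def __init__(self, text):
        self._text = text

    def getText(self):
        return self._text


def text_or_none(tag):
    # `tag and ...` short-circuits: a missing element (None) stays None
    return tag and tag.getText().strip()


print(text_or_none(FakeTag('  23230733 / 27711305 \n')))
print(text_or_none(None))
```

An explicit `tag.getText().strip() if tag is not None else None` behaves the same way for these two cases and may read more clearly.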


Combining the steps above, we can create a function that, given the soup of a contact page, returns the company's info:

In [14]:
def get_company_info_from_soup_design1(contact_page_soup):
    """Return a dict of company info from a design-1 contact page soup."""
    contact_info_soup = contact_page_soup.select_one('.company_right')

    # (field name, CSS selector relative to the .company_right container)
    fields_and_selectors = [
        ('company_name', '.node_title'),
        ('address', '.field-name-field-company-address .field-item'),
        ('contact_person', '.field-name-field-contact-person .field-item'),
        ('contact_number', '.field-name-field-company-tel .field-item'),
        ('mobile_phone', '.field-name-field-mobile .field-item'),
        ('fax', '.field-name-field-company-fax .field-item'),
        ('email', '.field-name-field-email .field-item'),
        ('company_website', '.field-name-field-company-url .field-item')
    ]

    company_info = {}
    for field, css_selector in fields_and_selectors:
        element = contact_info_soup.select_one(css_selector)
        # a missing element yields None instead of raising
        company_info[field] = element and element.getText().strip()
    return company_info

test run :) :

In [15]:
test_run_result = get_company_info_from_soup_design1(get_contact_page_soup_from_detail_page_url(detail_page_url_design1))
print(test_run_result)

{'contact_number': u'23230733 / 27711305', 'mobile_phone': None, 'fax': u'21914243 / 23854177', 'company_website': u'http://www.kwongyip.com', 'company_name': u'\u5ee3\u696d\u6a5f\u5668\u5ee0', 'address': u'\u4e5d\u9f8d\u65fa\u89d2\u7518\u9716\u885712\u865f\u5730\u4e0b', 'contact_person': None, 'email': u'kwongyipmetal@yahoo.com.hk'}


Handle design 2 (e.g. http://www.5metal.com.hk/node/13112):
----------

In [16]:
detail_page_url_design2 = 'http://www.5metal.com.hk/node/13112'
sample_contact_page_soup_design2 = get_contact_page_soup_from_detail_page_url(detail_page_url_design2)
print(sample_contact_page_soup_design2.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html dir="ltr" version="XHTML+RDFa 1.0" xml:lang="zh-hant" xmlns="http://www.w3.org/1999/xhtml">
 <head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta content="noindex" name="robots">
   <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
   <link href="https://www.5metal.com.hk/sites/default/files/img_watermark_0_0.png" rel="shortcut icon" type="image/png"/>
   <meta content="Drupal 7 (http://drupal.org)" name="Generator"/>
   <title>
    港城五金建材公司 - 聯絡我們 | 香港五金網
   </title>
   <meta content="" http-equiv="Content-Security-Policy">
    <style media="all" type="text/css">
     @import url("https://www.5metal.com.hk/modules/system/system.base.css?o58yvz");
@import url("https://www.5metal.com.hk/modules/system/system.menus.css?o58yvz");
@import url("https://www.5metal.com.hk/modules/system/system.messages.css?o58yvz");
@import url("https://www.5metal.com.hk/m

The section **we're interested in**:

In [17]:
contact_info_soup = sample_contact_page_soup_design2.select_one('.cp-contact .cp-box-content')
print(contact_info_soup.prettify())

<div class="cp-box-content">
 <div class="row company_name">
  <span class="val">
   港城五金建材公司
  </span>
 </div>
 <div class="row">
  <label>
   地址：
  </label>
  <span class="val">
   九龍深水埗福榮街15號地下
  </span>
 </div>
 <div class="row">
  <label>
   聯絡人：
  </label>
  <span class="val">
   余小姐
  </span>
 </div>
 <div class="row">
  <label>
   電話(公司)：
  </label>
  <span class="val">
   2763 0844
  </span>
 </div>
 <div class="row">
  <label>
   電話(手提)：
  </label>
  <span class="val">
  </span>
 </div>
 <div class="row">
  <label>
   傳真：
  </label>
  <span class="val">
   26433381
  </span>
 </div>
 <div class="row">
  <label>
   電郵：
  </label>
  <span class="val">
   <a href="mailto:info@kshw.com.hk">
    info@kshw.com.hk
   </a>
  </span>
 </div>
 <div class="row">
  <label>
   網址：
  </label>
  <span class="val">
   <a href="http://www.5metal.com.hk/kwongshing" target="_blank">
    http://www.5metal.com.hk/kwongshing
   </a>
  </span>
 </div>
</div>



In [18]:
contact_info_soup = sample_contact_page_soup_design2.select_one('.cp-contact .cp-box-content')
fields = ['company_name', 'address', 'contact_person', 'contact_number', 'mobile_phone', 'fax', 'email', 'company_website']
values = [contact_info_soup.select_one('.row.company_name' + ' + .row' * i + ' .val').getText().strip() for i in xrange(len(fields))]
{field:value for (field, value) in zip(fields, values)}

{'address': u'\u4e5d\u9f8d\u6df1\u6c34\u57d7\u798f\u69ae\u885715\u865f\u5730\u4e0b',
 'company_name': u'\u6e2f\u57ce\u4e94\u91d1\u5efa\u6750\u516c\u53f8',
 'company_website': u'http://www.5metal.com.hk/kwongshing',
 'contact_number': u'2763 0844',
 'contact_person': u'\u4f59\u5c0f\u59d0',
 'email': u'info@kshw.com.hk',
 'fax': u'26433381',
 'mobile_phone': u''}
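Design 2 has no per-field class names, so the cell above addresses each row by position: it starts at `.row.company_name` and chains one adjacent-sibling combinator (`+ .row`) per step. The sketch below only prints the selector strings that expression builds; it does not touch the network or bs4:

```python
fields = ['company_name', 'address', 'contact_person', 'contact_number',
          'mobile_phone', 'fax', 'email', 'company_website']

# same expression as in the cell above, with range() so it runs on Python 2 and 3
selectors = ['.row.company_name' + ' + .row' * i + ' .val' for i in range(len(fields))]

for field, selector in zip(fields, selectors):
    print('%-16s %s' % (field, selector))
```

This approach depends on the rows always appearing in the same order. If a page omitted or reordered a row, values would be assigned to the wrong fields, so matching on the `<label>` text could be a sturdier alternative.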

Scraping function for design 2:

In [19]:
def get_company_info_from_soup_design2(contact_page_soup):
    contact_info_soup = contact_page_soup.select_one('.cp-contact .cp-box-content')
    
    fields = ['company_name', 'address', 'contact_person', 'contact_number', 'mobile_phone', 'fax', 'email', 'company_website']
    values = [contact_info_soup.select_one('.row.company_name' + ' + .row' * i + ' .val').getText().strip() for i in xrange(len(fields))]
    return {field:value for (field, value) in zip(fields, values)}

test run:

In [20]:
test_run_result = get_company_info_from_soup_design2(get_contact_page_soup_from_detail_page_url(detail_page_url_design2))
print(test_run_result)

{'contact_number': u'2763 0844', 'mobile_phone': u'', 'fax': u'26433381', 'company_website': u'http://www.5metal.com.hk/kwongshing', 'company_name': u'\u6e2f\u57ce\u4e94\u91d1\u5efa\u6750\u516c\u53f8', 'address': u'\u4e5d\u9f8d\u6df1\u6c34\u57d7\u798f\u69ae\u885715\u865f\u5730\u4e0b', 'contact_person': u'\u4f59\u5c0f\u59d0', 'email': u'info@kshw.com.hk'}
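Note that the two designs report a missing value differently: design 1 yields `None` (the element is absent), while design 2 yields an empty string (the `.val` span exists but is empty, as with `mobile_phone` above). A minimal sketch of one way to make the two consistent follows; `normalise_company_info` is a hypothetical helper, not part of the notebook:

```python
def normalise_company_info(company_info):
    """Map empty or whitespace-only strings to None; leave other values untouched."""
    return {field: (value if value and value.strip() else None)
            for field, value in company_info.items()}


sample = {'mobile_phone': u'', 'fax': u'26433381', 'contact_person': None}
print(normalise_company_info(sample))
```

Applying this to the result of either scraping function would give a single convention for missing values before they are written into the DataFrame.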


The main scraping
------------

Now, let's go and scrape them all:

In [21]:
final_scraped_data = scraped_data.copy()

In [22]:
n = final_scraped_data.shape[0]
start_index = 0
end_index = n

print('%d detail pages to scrape from' % (end_index - start_index))
for i in xrange(start_index, end_index):
    url_to_scrape_from = final_scraped_data.iloc[i]['detail_page_url']
    print('[%4d/%4d] scraping url: %s' % (i, n,  url_to_scrape_from))
    try:
        contact_page_soup = get_contact_page_soup_from_detail_page_url(url_to_scrape_from)
        is_design1 = (contact_page_soup.select_one('.company_right') is not None)
        
        if is_design1:
            company_info = get_company_info_from_soup_design1(contact_page_soup)
        else:
            company_info = get_company_info_from_soup_design2(contact_page_soup)
    
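        for col in final_scraped_data.columns:
            # compare string values with !=, not identity (`is not`)
            if col != 'detail_page_url':
                # single positional assignment avoids pandas chained-assignment pitfalls
                final_scraped_data.iloc[i, final_scraped_data.columns.get_loc(col)] = company_info[col]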
    except Exception:
        print('there was an exception when scraping %dth url: %s' % (i, url_to_scrape_from))

360 detail pages to scrape from
[   0/ 360] scraping url: http://www.5metal.com.hk/node/13112
[   1/ 360] scraping url: http://www.5metal.com.hk/node/1
[   2/ 360] scraping url: http://www.5metal.com.hk/node/3652
[   3/ 360] scraping url: http://www.5metal.com.hk/node/149
[   4/ 360] scraping url: http://www.5metal.com.hk/node/15998
[   5/ 360] scraping url: http://www.5metal.com.hk/node/3563
[   6/ 360] scraping url: http://www.5metal.com.hk/node/13379
[   7/ 360] scraping url: http://www.5metal.com.hk/node/8507
[   8/ 360] scraping url: http://www.5metal.com.hk/node/3568
[   9/ 360] scraping url: http://www.5metal.com.hk/node/3705
[  10/ 360] scraping url: http://www.5metal.com.hk/node/18680
[  11/ 360] scraping url: http://www.5metal.com.hk/node/5334
[  12/ 360] scraping url: http://www.5metal.com.hk/node/14179
[  13/ 360] scraping url: http://www.5metal.com.hk/node/3732
[  14/ 360] scraping url: http://www.5metal.com.hk/node/8620
[  15/ 360] scraping url: http://www.5metal.com.hk/n
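The loop above only prints a message when a page fails, so the failing URLs are lost once the output scrolls away. A hedged sketch of collecting them for later inspection is shown below. `scrape_all` and `fake_scrape_one` are hypothetical names, and the `example.com` URLs are placeholders; the real loop would pass a function that fetches and parses a contact page:

```python
def scrape_all(urls, scrape_one):
    """Call scrape_one(url) for each URL; return (results, failed_urls)."""
    results, failed_urls = {}, []
    for url in urls:
        try:
            results[url] = scrape_one(url)
        except Exception:
            failed_urls.append(url)  # keep for a later retry or manual check
    return results, failed_urls


def fake_scrape_one(url):
    # hypothetical stand-in for the real fetch-and-parse step
    if url.endswith('/bad'):
        raise ValueError('unrecognised page layout')
    return {'company_name': 'example'}


results, failed_urls = scrape_all(
    ['http://example.com/ok', 'http://example.com/bad'], fake_scrape_one)
print(failed_urls)
```

Keeping the list of failures makes it straightforward to open those pages in a browser and check whether they use a layout the scraper does not know yet.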

In [23]:
final_scraped_data

Unnamed: 0,detail_page_url,address,company_name,company_website,contact_number,contact_person,fax,email,mobile_phone
0,http://www.5metal.com.hk/node/13112,九龍深水埗福榮街15號地下,港城五金建材公司,http://www.5metal.com.hk/kwongshing,2763 0844,余小姐,26433381,info@kshw.com.hk,
1,http://www.5metal.com.hk/node/1,香港九龍油麻地新填地街226號地下 (油麻地港鐵站B2出口直行窩打老道口),李維記(寶源)五金工程,http://www.lwkpy.com.hk/,34223539,Ben Lee,34223790,leewaikee01@gmail.com,94228002
2,http://www.5metal.com.hk/node/3652,九龍旺角新填地街461B號地下,同利號,http://kwlaser.net/,23949778,張生,23949773,winwinco@hotmail.com,
3,http://www.5metal.com.hk/node/149,新界元朗大旗嶺737號 (大棠路加德士油站對面),萬昌五金建材有限公司,http://www.man-cheong.com.hk/,24783868,李先生,24747057,jacksonli@mcmhk.com,
4,http://www.5metal.com.hk/node/15998,新界火炭桂地街10號華麗工業中心10樓9室,津滙貿易公司,,,Sarah Chan,,jinhui11115@gmail.com,54482368
5,http://www.5metal.com.hk/node/3563,九龍旺角甘霖街 29 號地下,興發五金(香港)有限公司,http://www.5metal.com.hk/hingfatmetal,"23843218 , 23843216","吳永裕,何卓怡,麥浩倫",23885774,hingfatmetal@hotmail.com,
6,http://www.5metal.com.hk/node/13379,九龍觀塘偉業街169號中懋工業大廈B座3樓2室,建安五金,http://5metal.com.hk/kinonmetalware,23448857,林小姐,27934747,kinonmetalware@gmail.com,97850311
7,http://www.5metal.com.hk/node/8507,九龍青山道552號地下A舖,永盛五金電器貨倉批發,,27851583,Kevin Hui,27448978,huikinfukhk@yahoo.com.hk,
8,http://www.5metal.com.hk/node/3568,香港九龍上海街359號地下,光輝鋼竹蒸籠厨具,https://facebook.com/kwongfai275/,"27809980 , 27826050",梁月珍,27826145,kwongfai275@gmail.com,
9,http://www.5metal.com.hk/node/3705,九龍旺角大南街 14-16 號地下,時茂企業有限公司,http://www.smooth.com.cn,23955297,Ronnie Cheng,27892083,enquiry@smooth.hk,69585738


Profit
---------

save our important information :)

In [24]:
import datetime

timestamp = str(datetime.datetime.now()).split('.')[0]
filename = './scraped_result %s.csv' % timestamp
print('saving data...')
final_scraped_data.to_csv(filename, encoding='utf-8')
print('saved data to %s'%filename)

saving data...
saved data to ./scraped_result 2016-04-08 09:18:02.csv
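The filename above contains a space and colons, which need quoting in a shell and are not allowed in filenames on some systems. A small sketch of an alternative naming scheme using `strftime` follows; `build_output_filename` and the exact format string are assumptions, not the notebook's own choice:

```python
import datetime


def build_output_filename(now):
    """Build a CSV filename from a datetime, without spaces or colons."""
    return './scraped_result_%s.csv' % now.strftime('%Y%m%d_%H%M%S')


# fixed datetime matching the run shown above
print(build_output_filename(datetime.datetime(2016, 4, 8, 9, 18, 2)))
# prints ./scraped_result_20160408_091802.csv
```

In the notebook, `datetime.datetime.now()` would be passed instead of the fixed value.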


In [25]:
!cat '$filename'

,detail_page_url,address,company_name,company_website,contact_number,contact_person,fax,email,mobile_phone
0,http://www.5metal.com.hk/node/13112,九龍深水埗福榮街15號地下,港城五金建材公司,http://www.5metal.com.hk/kwongshing,2763 0844,余小姐,26433381,info@kshw.com.hk,
1,http://www.5metal.com.hk/node/1,香港九龍油麻地新填地街226號地下 (油麻地港鐵站B2出口直行窩打老道口),李維記(寶源)五金工程,http://www.lwkpy.com.hk/,34223539,Ben Lee,34223790,leewaikee01@gmail.com,94228002
2,http://www.5metal.com.hk/node/3652,九龍旺角新填地街461B號地下,同利號,http://kwlaser.net/,23949778,張生,23949773,winwinco@hotmail.com,
3,http://www.5metal.com.hk/node/149,新界元朗大旗嶺737號 (大棠路加德士油站對面),萬昌五金建材有限公司,http://www.man-cheong.com.hk/,24783868,李先生,24747057,jacksonli@mcmhk.com,
4,http://www.5metal.com.hk/node/15998,新界火炭桂地街10號華麗工業中心10樓9室,津滙貿易公司,,,Sarah Chan,,jinhui11115@gmail.com,54482368
5,http://www.5metal.com.hk/node/3563,九龍旺角甘霖街 29 號地下,興發五金(香港)有限公司,http://www.5metal.com.hk/hingfatmetal,"23843218 , 23843216","吳永裕,何卓怡,麥浩倫",23885774,hingfatmetal@hotmail.com,
6,http://www.5metal.com.hk/node

Afterthought
-------------------
Some of the scrapes result in exceptions. After checking the failing pages in the browser, I found that these websites use a new design (or rather a new template, since the pages in this group look very similar to each other) with a different HTML structure. By distinguishing between the old and new templates, we can certainly improve the scraping results further.

**Update**:

I've added code for handling the 2 different designs, and it seems to work well (360 successful scrapes out of 360 company detail pages).