## Webscraping: the hunt for more customer information

Customers could supply a website when they registered. The field is optional, so it is often empty or contains spurious information.

Step One: Run a DNS lookup to check whether each listed website is active and to identify which e-commerce platform it is on.

Step Two: Run a webscrape against the websites that Step One found to be active.


In [73]:
import subprocess

In [74]:
# Call the program 'host' to do a DNS lookup.
# TODO: include 'customerId' in the returned dict so results can be indexed.

dns_lookups = 1 
def dns_info(website):
    website_unchanged = website
    global dns_lookups
    print dns_lookups, website, type(website)
    dns_lookups += 1
    # Missing sites arrive from pandas as float NaN rather than as strings.
    if isinstance(website, float):
        return { 'dns_lookup': 'no name specified', 'website': website_unchanged,
            'wix': False, 'shopify': False, 'bigcommerce': False, 'valid_domain': False, 'expired_domain': False}        
    website = website.lower()
    if website.startswith('http://'):
        website = website[7:]
    if website.startswith('https://'):
        website = website[8:]
    website = website.split('/')[0]
    if '.' not in website:
        return { 'dns_lookup': 'missing .', 'website': website_unchanged,
            'wix': False, 'shopify': False, 'bigcommerce': False, 'valid_domain': False, 'expired_domain': False}

    try:
        dns_lookup = subprocess.check_output(['host', website])
        return { 'dns_lookup': dns_lookup,  'website': website_unchanged,
                'wix': 'wix' in dns_lookup,
                'shopify':  'shopify' in dns_lookup,
                'bigcommerce': 'bigcommerce' in dns_lookup,
                'valid_domain': True, 
                'expired_domain': False
               }
    except subprocess.CalledProcessError:
        # 'host' exited non-zero, i.e. the lookup failed; flag the domain as expired.
        return { 'dns_lookup': 'no such domain',  'website': website_unchanged,
            'wix': False, 'shopify': False, 'bigcommerce': False, 'valid_domain': True, 'expired_domain': True}
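
The hostname clean-up at the top of `dns_info` can be checked without touching the network if it is pulled out into its own function. The following is a minimal sketch of that idea, written for Python 3 (the cells above are Python 2). `clean_hostname` is a hypothetical helper, not part of the notebook, and the extra `strip()` is an assumption about stray whitespace in the field.

```python
# Python 3 sketch; clean_hostname is a hypothetical helper, not part of the notebook.
def clean_hostname(website):
    """Return the bare hostname from a user-supplied site value, or None."""
    # Missing values arrive as float NaN, so anything that is not a string is rejected.
    if not isinstance(website, str):
        return None
    # strip() is an addition; the original only lower-cases the value.
    host = website.strip().lower()
    # Drop the scheme, if any.
    for prefix in ('http://', 'https://'):
        if host.startswith(prefix):
            host = host[len(prefix):]
    # Keep only the part before the first path separator.
    host = host.split('/')[0]
    # A usable hostname needs at least one dot.
    if '.' not in host:
        return None
    return host
```

For example, `clean_hostname('http://sansche-yoga.com/')` returns `'sansche-yoga.com'`, while a NaN value or a string without a dot returns `None`.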

In [75]:
# example of the outputs from above:
dns_info('www.zopella.com')

1 www.zopella.com <type 'str'>


{'bigcommerce': False,
 'dns_lookup': 'www.zopella.com is an alias for edgetrade.myshopify.com.\nedgetrade.myshopify.com is an alias for shops.myshopify.com.\nshops.myshopify.com has address 23.227.38.64\n',
 'expired_domain': False,
 'shopify': True,
 'valid_domain': True,
 'website': 'www.zopella.com',
 'wix': False}
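
Because `dns_info` tests for each keyword anywhere in the `host` output, a customer domain that merely contains `wix` in its own name would also be flagged. One possible refinement, sketched below for Python 3, is to look only at the alias targets. `detect_platforms` is a hypothetical function, and it assumes the `is an alias for` wording shown in the output above. It is stricter than the original check, so it may miss matches that the plain substring test would catch.

```python
import re

# Same keywords that dns_info searches for.
PLATFORM_KEYWORDS = ('wix', 'shopify', 'bigcommerce')

def detect_platforms(dns_lookup):
    """Flag each platform whose keyword appears in an alias target of `host` output."""
    # Lines look like "<name> is an alias for <target>."; capture only <target>.
    targets = re.findall(r'is an alias for (\S+)', dns_lookup)
    return {name: any(name in target for target in targets)
            for name in PLATFORM_KEYWORDS}
```

Run on the `www.zopella.com` output above, this flags only `shopify`. A made-up line such as `wixotic.example has address 192.0.2.1` has no alias target, so nothing is flagged.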

In [76]:
# this is only running on .head(10) for the sake of demonstration

dns_df = pd.DataFrame(list(customers.site.head(10).map(dns_info)))
dns_df

2 www.sovereignglobaladvisors.com <type 'str'>
3 http://sansche-yoga.com/ <type 'str'>
4 amazon.co.uk <type 'str'>
5 www.zopella.com <type 'str'>
6 zachhiltyphoto.com <type 'str'>
7 www.jainson.net <type 'str'>
8 http://www.illumenature.com <type 'str'>
9 paperboyshop.com <type 'str'>
10 Www.lachish-homemade-body-scrubs@myshopify.com <type 'str'>
11 glitzplugsbybee.storenvy.com <type 'str'>


| | bigcommerce | dns_lookup | expired_domain | shopify | valid_domain | website | wix |
|---|---|---|---|---|---|---|---|
| 0 | False | www.sovereignglobaladvisors.com is an alias fo... | False | False | True | www.sovereignglobaladvisors.com | True |
| 1 | False | sansche-yoga.com has address 192.0.78.25\nsans... | False | False | True | http://sansche-yoga.com/ | False |
| 2 | False | amazon.co.uk has address 176.32.108.186\namazo... | False | False | True | amazon.co.uk | False |
| 3 | False | www.zopella.com is an alias for edgetrade.mysh... | False | True | True | www.zopella.com | False |
| 4 | False | zachhiltyphoto.com has address 198.49.23.145\n | False | False | True | zachhiltyphoto.com | False |
| 5 | False | no such domain | True | False | True | www.jainson.net | False |
| 6 | True | www.illumenature.com is an alias for illumenat... | False | False | True | http://www.illumenature.com | False |
| 7 | False | paperboyshop.com has address 98.124.199.46\npa... | False | False | True | paperboyshop.com | False |
| 8 | False | no such domain | True | False | True | Www.lachish-homemade-body-scrubs@myshopify.com | False |
| 9 | False | glitzplugsbybee.storenvy.com has address 104.2... | False | False | True | glitzplugsbybee.storenvy.com | False |


This can now be exported to CSV for use in a separate notebook or analysis.
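
As a rough sketch of that export, the dicts returned by `dns_info` can be written out with Python's standard `csv` module (Python 3 shown here). `write_dns_csv` and the column order are assumptions made for illustration. The `dns_lookup` field contains newlines, which the writer quotes so that each result stays on one logical row.

```python
import csv

# Column order is an assumption; the names mirror the keys returned by dns_info.
FIELDS = ['website', 'dns_lookup', 'valid_domain', 'expired_domain',
          'wix', 'shopify', 'bigcommerce']

def write_dns_csv(rows, handle):
    """Write a list of dns_info-style dicts to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    # Multi-line dns_lookup values are quoted automatically by the csv module.
    writer.writerows(rows)

# Usage (hypothetical file name and `results` list):
# with open('dns_lookups.csv', 'w', newline='') as f:
#     write_dns_csv(results, f)
```

With the DataFrame already in hand, pandas' `dns_df.to_csv('dns_lookups.csv', index=False)` would do the same in one call; the file name is again only a placeholder.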