# Web-Scraping from Xero for profile details

Here, we go through the [search result pages](https://www.xero.com/uk/advisors/find-advisors/?type=advisors&tag[]=xero:advisor-directory/industries-served/repairs-and-personal-services&orderBy=ADVISOR_RELEVANCE&sort=ASC&pageNumber=1) on Xero.com, filtered to advisors in the UK who serve the 'Repairs and Personal Services' industry. For each profile in the search results, we extract the details listed below into a pandas DataFrame and also write them to a CSV file.


Example Profile URL:
https://www.xero.com/uk/advisors/accountant/mha-macintyre-hudson-db7b338a4f1d/

Required fields:
- Name (e.g. MHA MacIntyre Hudson)
- Type (e.g. Accountant)
- Address (e.g. 1 The Forum, Minerva Business Park, Lynchwood, Peterborough, England)
- About us text
- Website
- Phone number
- Facebook address (if available)
- Twitter address (if available)
- LinkedIn address (if available)

_Throughout the notebook, I have commented out some lines of code so that it does not get too long when viewed on GitHub (the HTML listings are huge). These lines can be uncommented and run if needed._

In [1]:
# import necessary libraries

import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
import re

### Exploration and Preparation

**Let's have a look at the search page html and see what we find.**

In [2]:
link = 'https://www.xero.com/uk/advisors/find-advisors/?type=advisors&tag[]=xero:advisor-directory/industries-served/repairs-and-personal-services&orderBy=ADVISOR_RELEVANCE&sort=ASC&pageNumber=1'

req = requests.get(link)
soup = BeautifulSoup(req.content, 'html.parser')   # name the parser explicitly to avoid bs4's "no parser specified" warning

In [3]:
# soup

In [4]:
#soup.find_all('div', class_ = "advisors-result-card-view-profile")

soup.find_all('a', class_ = re.compile('btn-primary-alt'))

[<a class="btn btn-primary-alt" href="https://www.xero.com/uk/advisors/Accountant/johnston-carmichael-11033f669adb/">View Profile</a>,
 <a class="btn btn-primary-alt" href="https://www.xero.com/uk/advisors/Accountant/pkf-francis-clark-af93f9b08c7f/">View Profile</a>,
 <a class="btn btn-primary-alt" href="https://www.xero.com/uk/advisors/Accountant/armstrong-watson-156b74095297/">View Profile</a>,
 <a class="btn btn-primary-alt" href="https://www.xero.com/uk/advisors/Accountant/armstrong-watson-northallerton-3bfd8b4ba1af/">View Profile</a>,
 <a class="btn btn-primary-alt" href="https://www.xero.com/uk/advisors/Accountant/tc-group-5a2406a09705/">View Profile</a>,
 <a class="btn btn-primary-alt" href="https://www.xero.com/uk/advisors/Accountant/mha-macintyre-hudson-db7b338a4f1d/">View Profile</a>,
 <a class="btn btn-primary-alt" href="https://www.xero.com/uk/advisors/Accountant/armstrong-watson-dcde9e76086a/">View Profile</a>,
 <a class="btn btn-primary-alt" href="https://www.xero.com/uk/

**Let's extract the profile links.**

In [5]:
profile_links = []
for text in soup.find_all('a', class_ = re.compile('btn-primary-alt')):
    profile_links.append(text.get('href'))
profile_links

['https://www.xero.com/uk/advisors/Accountant/johnston-carmichael-11033f669adb/',
 'https://www.xero.com/uk/advisors/Accountant/pkf-francis-clark-af93f9b08c7f/',
 'https://www.xero.com/uk/advisors/Accountant/armstrong-watson-156b74095297/',
 'https://www.xero.com/uk/advisors/Accountant/armstrong-watson-northallerton-3bfd8b4ba1af/',
 'https://www.xero.com/uk/advisors/Accountant/tc-group-5a2406a09705/',
 'https://www.xero.com/uk/advisors/Accountant/mha-macintyre-hudson-db7b338a4f1d/',
 'https://www.xero.com/uk/advisors/Accountant/armstrong-watson-dcde9e76086a/',
 'https://www.xero.com/uk/advisors/Accountant/bdo-llp-f9cfe0847a5c/',
 'https://www.xero.com/uk/advisors/Accountant/silver-levene-llp-54755ecbb05c/',
 'https://www.xero.com/uk/advisors/Accountant/kreston-reeves-llp-0dc0404b7c45/']
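The same extraction can be sketched without any third-party package. The block below swaps BeautifulSoup for `html.parser` from the standard library; the `ProfileLinkParser` class and the `sample_html` snippet are illustrative assumptions modeled on the output above, not the live page. It matches the class token `btn-primary-alt` exactly, which is slightly stricter than the regular-expression match used in the notebook.

```python
from html.parser import HTMLParser


class ProfileLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class='btn-primary-alt'):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list of class names.
        classes = (attr_map.get('class') or '').split()
        href = attr_map.get('href')
        if self.css_class in classes and href:
            self.links.append(href)


# Hypothetical markup, shaped like the "View Profile" buttons listed above.
sample_html = '''
<a class="btn btn-primary-alt" href="https://www.xero.com/uk/advisors/Accountant/johnston-carmichael-11033f669adb/">View Profile</a>
<a class="btn" href="https://example.com/other">Something else</a>
'''

parser = ProfileLinkParser()
parser.feed(sample_html)
print(parser.links)
```

Only the first link is collected, because the second anchor does not carry the `btn-primary-alt` class.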

> **Each search page lists 10 profiles. Once we know the total number of profiles in the search results, we can work out how many pages we need to read.**

**Let's see the total number of profiles in the search results.**

In [6]:
# get total number of profiles in 'results'
for text in soup.find_all('div', class_ = re.compile('globalsearch-results')):
    results = text.get('data-global-search-total')

results = int(results)
results

953

**Now that we have the number of profiles, let's calculate how many pages we need to read in order to get all of them.**

In [7]:
# Each page has 10 profiles, so the number of pages is the total number of profiles
# divided by 10 and rounded up: a partly filled last page still has to be read.

# Adding 9 before the floor division rounds up and keeps page_num an integer,
# which range() needs later on.

page_num = (results + 9) // 10

page_num

96
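The page arithmetic can be wrapped in a small helper so that both cases (an exact multiple of the page size, and a remainder) are easy to check. This is only a sketch; the function name `pages_needed` and the `per_page` parameter are assumptions introduced here.

```python
def pages_needed(total_profiles, per_page=10):
    """Return how many result pages are needed to cover total_profiles."""
    full_pages, remainder = divmod(total_profiles, per_page)
    # One extra page holds the leftover profiles, if there are any.
    return full_pages + (1 if remainder else 0)


print(pages_needed(953))  # 96
print(pages_needed(950))  # 95
```

The result is always an integer, so it can be passed straight to `range()`.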

**Make a list that contains all the search page links**

In [8]:
# The below link does not contain the page number at the end. We will be adding it through computation.
link_ = 'https://www.xero.com/uk/advisors/find-advisors/?type=advisors&tag[]=xero:advisor-directory/industries-served/repairs-and-personal-services&orderBy=ADVISOR_RELEVANCE&sort=ASC&pageNumber='

# create empty list
search_pages = []

# add page numbers to the link and append in the list
for i in range(page_num):
    link = link_ + str(i+1)
    search_pages.append(link)

In [9]:
# search_pages
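As an alternative to string concatenation, the search URL can be assembled from its query parameters. The sketch below uses `urllib.parse.urlencode` from the standard library; the names `BASE_URL` and `build_search_url` are assumptions, and the parameters are simply read off the search link used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.xero.com/uk/advisors/find-advisors/'


def build_search_url(page_number):
    """Build the search-results URL for one page number."""
    params = [
        ('type', 'advisors'),
        ('tag[]', 'xero:advisor-directory/industries-served/repairs-and-personal-services'),
        ('orderBy', 'ADVISOR_RELEVANCE'),
        ('sort', 'ASC'),
        ('pageNumber', page_number),
    ]
    # safe='[]:/' leaves these characters unescaped so the URL reads like the original link.
    return BASE_URL + '?' + urlencode(params, safe='[]:/')


first_three_pages = [build_search_url(n) for n in range(1, 4)]
for url in first_three_pages:
    print(url)
```

Keeping the parameters in a list makes it easier to change a single filter later, for example a different industry tag.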

**Now that we have the search-page links, we will go through them one by one and extract the links for all the profiles that are listed.**

In [10]:
profile_links = []      # create empty list
i = 0                   # counter that we will be using in the loop


# go through search page links and extract profile links.
for page in search_pages:
    req = requests.get(page)
    soup = BeautifulSoup(req.content, 'html.parser')   # explicit parser, avoids bs4's parser warning
    
    for text in soup.find_all('a', class_ = re.compile('btn-primary-alt')):
        profile_links.append(text.get('href'))
    
    i += 1
    if i % 10 == 0:
        print(str(i) + ' search pages done')

10 search pages done
20 search pages done
30 search pages done
40 search pages done
50 search pages done
60 search pages done
70 search pages done
80 search pages done
90 search pages done
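The loop above sends its requests back to back and stops at the first network error. A minimal sketch of a retry wrapper is shown below, assuming that any exception is worth another attempt; `fetch_with_retries` and the stand-in `flaky_fetch` are hypothetical names, and no real request is made here.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), trying up to `retries` times before giving up."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # assumption: every failure is retried
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    raise last_error


# Stand-in for a network call that fails twice before succeeding.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return 'page for ' + url


result = fetch_with_retries(flaky_fetch, 'https://example.com/page', delay=0)
print(result, len(calls))
```

In the notebook, `fetch` could be something like `lambda u: requests.get(u, timeout=10)`, and a short `time.sleep` between pages would keep the load on the site low.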


### Read Profiles and extract required fields

We will create a list of dictionaries, one per profile. This list will then be used to build a pandas dataframe.

In [12]:
rows = []          # create empty list
i = 0              # counter


# Go through the profile links one by one and extract fields.

for link in profile_links:
    req = requests.get(link)
    soup = BeautifulSoup(req.content, 'html.parser')   # explicit parser, avoids bs4's parser warning
    
    # initialize fields
    name = np.nan
    p_type = np.nan
    address = np.nan
    website = np.nan
    phone = np.nan
    about_us = np.nan
    facebook, twitter, linkedin = np.nan, np.nan, np.nan

    # get name
    if soup.find_all('h1'):
        name = soup.find_all('h1')[0].contents[0]
    
    # get profile type - p_type
    if soup.find_all('p', class_ = "advisors-profile-hero-detailed-info-sub national"):
        p_type = soup.find_all('p', class_ = "advisors-profile-hero-detailed-info-sub national")[0].contents[0].split()[0]
    
    # get address
    try:
        address = soup.find_all('p', class_ = "advisors-profile-hero-detailed-info-sub national")[0].contents[2].split('\n')[1].lstrip()
    except Exception:   # address block missing or laid out differently; a bare except would also swallow Ctrl-C
        address = np.nan
    
    # get website address
    if soup.find_all(class_ = re.compile('advisors-profile-hero-detailed-contact-website')):
        for text in soup.find_all(class_ = re.compile('advisors-profile-hero-detailed-contact-website')):
            website = text.get('href')
    
    # get phone number
    if soup.find_all(class_ = re.compile('advisors-profile-hero-detailed-contact-phone')):
        for text in soup.find_all(class_ = re.compile('advisors-profile-hero-detailed-contact-phone')):
            phone = text.get('data-phone')
    
    # extract 'about us' text
    if soup.find_all('div', class_ = "advisor-profile-practice-desc"):
        about_us = soup.find_all('div', class_ = "advisor-profile-practice-desc")[0].find('p').contents[0]
    
    # get social profile links
    for text in soup.find_all('li', class_ = "advisor-profile-practice-social-item"):
        if text.find_all('a', href = re.compile('twitter')):
            for tw in text.find_all('a', href = re.compile('twitter')):
                twitter = tw.get('href')
    
        if text.find_all('a', href = re.compile('linkedin')):
            for tw in text.find_all('a', href = re.compile('linkedin')):
                linkedin = tw.get('href')
        
        if text.find_all('a', href = re.compile('facebook')):
            for tw in text.find_all('a', href = re.compile('facebook')):
                facebook = tw.get('href')

    # append profile details in the list
    rows.append({'name' : name,
                 'type' : p_type,
                 'address' : address,
                 'website' : website,
                 'phone' : phone,
                 'about_us' : about_us,
                 'Twitter' : twitter,
                 'LinkedIn' : linkedin,
                 'Facebook' : facebook
                })
    
    # check to monitor the progress
    i += 1
    if i%20 == 0:
        print(str(i) + ' profiles retrieved')


20 profiles retrieved
40 profiles retrieved
60 profiles retrieved
80 profiles retrieved
100 profiles retrieved
120 profiles retrieved
140 profiles retrieved
160 profiles retrieved
180 profiles retrieved
200 profiles retrieved
220 profiles retrieved
240 profiles retrieved
260 profiles retrieved
280 profiles retrieved
300 profiles retrieved
320 profiles retrieved
340 profiles retrieved
360 profiles retrieved
380 profiles retrieved
400 profiles retrieved
420 profiles retrieved
440 profiles retrieved
460 profiles retrieved
480 profiles retrieved
500 profiles retrieved
520 profiles retrieved
540 profiles retrieved
560 profiles retrieved
580 profiles retrieved
600 profiles retrieved
620 profiles retrieved
640 profiles retrieved
660 profiles retrieved
680 profiles retrieved
700 profiles retrieved
720 profiles retrieved
740 profiles retrieved
760 profiles retrieved
780 profiles retrieved
800 profiles retrieved
820 profiles retrieved
840 profiles retrieved
860 profiles retrieved
880 profiles re
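The social links in the cell above are picked out by searching each `href` for the words `twitter`, `linkedin` and `facebook`. A slightly stricter variant, sketched below, compares the host name of each link instead, so that a URL which merely mentions a network in its path is not picked up. `SOCIAL_DOMAINS` and `classify_social` are assumed names, and the second sample link is made up for illustration.

```python
from urllib.parse import urlparse

# Assumed mapping from output column name to the network's domain.
SOCIAL_DOMAINS = {
    'Twitter': 'twitter.com',
    'LinkedIn': 'linkedin.com',
    'Facebook': 'facebook.com',
}


def classify_social(hrefs):
    """Map each network to the last matching link, or None if there is none."""
    found = {network: None for network in SOCIAL_DOMAINS}
    for href in hrefs:
        host = (urlparse(href).hostname or '').lower()
        for network, domain in SOCIAL_DOMAINS.items():
            # Accept the bare domain and any subdomain such as www.
            if host == domain or host.endswith('.' + domain):
                found[network] = href
    return found


links = classify_social([
    'https://www.twitter.com/WiseandCo',
    'https://example.com/blog/why-we-left-facebook',
])
print(links)
```

Links written without a scheme (for example `www.twitter.com/name`) have no host name after parsing and would be skipped by this sketch.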

**Create dataframe from the list**

In [13]:
profile_df = pd.DataFrame(rows)

print(profile_df.shape)
profile_df.sample(4)

(953, 9)


|     | name | type | address | website | phone | about_us | Twitter | LinkedIn | Facebook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 736 | Riverview Portfolio Ltd | Accountant | 1 Market Hill, Calne, England | http://www.riverviewportfolio.co.uk | +44 1249 816 810 | Accountants and tax advisor’s serving the Wilt... | | | |
| 163 | Wise & Co | Accountant | Union Road, Wey Court West, Farnham, England | http://www.wiseandco.co.uk/what-we-do/online-a... | +44 01252 711244 | Our dedicated cloud accounting team is on hand... | https://www.twitter.com/WiseandCo | https://www.linkedin.com/company/wise-&-co-cha... | |
| 888 | Pebbles Bookkeeping | Bookkeeper | Brandon House, Potterne Road , Devizes, England | | +44 07759802552 | My current client base is varied, from small ... | | | |
| 732 | Harrison and Jones Accountancy | Accountant | Fauld Lane, Unit 152, Tutbury, England | http://www.harrisonjonesaccountancy.com | +44 (0)1283240025 | Harrison and Jones Accountancy is a leading fi... | | | |


**Check duplicates**

In [14]:
profile_df[profile_df.duplicated()]

|     | name | type | address | website | phone | about_us | Twitter | LinkedIn | Facebook |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
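`duplicated()` compares complete rows, so it only flags profiles whose fields are all identical. A complementary check, sketched here as an assumption rather than as part of the original workflow, is to de-duplicate the profile links before any profile is fetched. The helper name `unique_in_order` is hypothetical; the sample links are taken from the first search page.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(items))


sample_links = [
    'https://www.xero.com/uk/advisors/Accountant/armstrong-watson-156b74095297/',
    'https://www.xero.com/uk/advisors/Accountant/tc-group-5a2406a09705/',
    'https://www.xero.com/uk/advisors/Accountant/armstrong-watson-156b74095297/',
]

deduplicated = unique_in_order(sample_links)
print(len(sample_links), '->', len(deduplicated))
```

Profiles that share a name but have different identifiers in the URL stay separate, since only exact link matches are dropped.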


In [15]:
profile_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      953 non-null    object
 1   type      953 non-null    object
 2   address   943 non-null    object
 3   website   946 non-null    object
 4   phone     946 non-null    object
 5   about_us  946 non-null    object
 6   Twitter   565 non-null    object
 7   LinkedIn  467 non-null    object
 8   Facebook  498 non-null    object
dtypes: object(9)
memory usage: 67.1+ KB


**Save dataframe into a csv file**

In [17]:
profile_df.to_csv('xero_profiles.csv', index = False)
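If pandas is not available, a list of dictionaries like `rows` can also be written with the standard library. The sketch below uses `csv.DictWriter` on an in-memory buffer instead of a file; the two sample records reuse values from the table above, reduced to three columns, and use `None` for a missing website (an assumption made here, since the notebook stores missing values as `np.nan`).

```python
import csv
import io

sample_rows = [
    {'name': 'Riverview Portfolio Ltd', 'type': 'Accountant',
     'website': 'http://www.riverviewportfolio.co.uk'},
    {'name': 'Pebbles Bookkeeping', 'type': 'Bookkeeper', 'website': None},
]

# newline='' lets the csv module control line endings itself.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['name', 'type', 'website'])
writer.writeheader()
writer.writerows(sample_rows)  # None is written as an empty field

# Read the text back to check what a consumer of the file would see.
buffer.seek(0)
rows_back = list(csv.DictReader(buffer))
print(rows_back[1])
```

Missing values come back as empty strings, which is also how they appear in the CSV written by `to_csv`.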

---

<br>

## <center><font color = '#e8b72e'> Awesome!! Did it :D </font></center>

![](https://media.giphy.com/media/vFKqnCdLPNOKc/giphy.gif)