# Purplebricks - scrape using BeautifulSoup

Import the key modules:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

### First, get the html code
On [Purple Bricks](https://www.purplebricks.co.uk/) I search for properties within 30 miles of London:

In [27]:
http='https://www.purplebricks.co.uk/search/property-for-sale/greater-london/london?page=1&sortBy=2&location=london&searchRadius=30&searchType=ForSale&soldOrLet=false&latitude=51.5073509&longitude=-0.1277583&betasearch=true'
source = requests.get(http).text
soup=BeautifulSoup(source,'lxml')

Manual inspection of the page shows that the key details of each listing sit inside a `div` with the class below:

In [28]:
listing=soup.find('div', class_="property-cardstyled__StyledPropertyCard-sc-15g6092-0 fwvNIe")
print(listing.prettify())

<div class="property-cardstyled__StyledPropertyCard-sc-15g6092-0 fwvNIe">
 <a aria-label="5 bedroom detached house - £4,250,000" class="property-cardstyled__StyledLink-sc-15g6092-1 eQIvCR" href="/property-for-sale/5-bedroom-detached-house-chigwell-1086709">
 </a>
 <div>
  <div class="property-cardstyled__StyledImageAndStatusContainer-sc-15g6092-2 creKyf">
   <div class="property-cardstyled__StyledImageContainer-sc-15g6092-3 hqlVhF">
    <div data-testid="lazy-loading-image-container">
     <div class="skeletonstyled__StyledSkeleton-sc-15jgb87-0 XujuQ ghost-property-cardstyled__StyledSkeletonCard-jm127k-3 jybbKl">
      <div class="ghost-property-cardstyled__StyledSkeletonTopContainer-jm127k-4 jhbtuW">
       <div class="ghost-property-cardstyled__StyledSkeletonImage-jm127k-5 inaEIe">
       </div>
      </div>
     </div>
    </div>
   </div>
   <span class="property-cardstyled__StyledStatus-sc-15g6092-5 cXQHXV">
   </span>
  </div>
 </div>
 <div class="property-cardstyled__StyledPrima

In [29]:
print(listing.a['aria-label'])

5 bedroom detached house - £4,250,000


These details now need converting into structured data:

In [30]:
listing_details=re.split(' bedroom | - |£',listing.a['aria-label'])
print(listing_details)
bedrooms=int(listing_details[0])
details=(listing_details[1])
price=int(listing_details[3].replace(',',''))
print(bedrooms, details, price)

['5', 'detached house', '', '4,250,000']
5 detached house 4250000
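
The empty string at index 2 appears because `' - '` and `'£'` are separate delimiters that sit next to each other in the label, so `re.split` returns the empty text between them. A minimal sketch, assuming the price is always introduced by `' - £'` as a single unit, of a pattern that avoids the empty element (the variable names are illustrative only):

```python
import re

label = '5 bedroom detached house - £4,250,000'

# Original pattern: ' - ' and '£' are adjacent delimiters, leaving '' between them
parts_original = re.split(' bedroom | - |£', label)

# Treating ' - £' as a single delimiter removes the empty element
parts_combined = re.split(' bedroom | - £', label)

print(parts_original)
print(parts_combined)
```

With the combined pattern the price would sit at index 2 rather than index 3.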


In [6]:
for listing in soup.find_all('div', class_="property-cardstyled__StyledPropertyCard-sc-15g6092-0 fwvNIe"):
    print(listing.a['aria-label'])

5 bedroom detached house - £4,250,000
5 bedroom semi-detached house - £4,000,000
3 bedroom flat - £3,250,000
4 bedroom end of terrace house - £2,850,000
4 bedroom flat - £1,950,000
6 bedroom terraced house - £1,850,000
5 bedroom detached house - £1,850,000
6 bedroom detached house - £1,840,000
3 bedroom town house - £1,750,000
6 bedroom detached house - £1,750,000


### Troubleshooting
Initial attempts at this scrape worked for about 90% of listings, but raised an error whenever a listing did not state a number of bedrooms, i.e. when the label was in the format below:

In [7]:
details2="apartment - £470,000"

And so running the below encountered an error:

In [8]:
bedrooms=int(re.split(' bedroom | - |£',details2)[0])

ValueError: invalid literal for int() with base 10: 'apartment'

Therefore wrap the parsing in a try/except:

In [32]:
try:
    listing_details=re.split(' bedroom | - |£',details2)
    bedrooms=int(listing_details[0])
    details=listing_details[1]
    price=int(listing_details[3].replace(',',''))
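except ValueError:  # no leading bedroom count, so int() fails on the first element
    listing_details=re.split(' - |£',details2)
    bedrooms=1  # fall back to one bedroom when none is stated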
    details=listing_details[0]
    price=int(listing_details[2].replace(',',''))
print(bedrooms, details, price)

1 apartment 470000


Check it still works for the original details type:

In [31]:
try:
    bedrooms=int(re.split(' bedroom | - |£',listing.a['aria-label'])[0])
    details=(re.split(' bedroom | - |£',listing.a['aria-label'])[1])
    price=int(re.split(' bedroom | - |£',listing.a['aria-label'])[3].replace(',',''))
except ValueError:
    listing_details=re.split(' - |£',listing.a['aria-label'])
    bedrooms=1  # same fallback as used in the scraping function
    details=listing_details[0]
    price=int(listing_details[2].replace(',',''))
print(bedrooms, details, price)

5 detached house 4250000
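
As an alternative to try/except, both label formats can be handled by one regular expression with an optional bedroom group. This is a sketch under assumptions rather than the method used in this notebook: `LABEL_PATTERN` and `parse_label` are hypothetical names, and returning `None` for a missing bedroom count (instead of a default) is a design choice of the sketch.

```python
import re

# Optional "<n> bedroom " prefix, then the property type, then " - £<price>"
LABEL_PATTERN = re.compile(r'^(?:(\d+) bedroom )?(.+?) - £([\d,]+)$')

def parse_label(label):
    """Return (bedrooms, details, price), or None if the label does not match."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        return None
    bedrooms, details, price = match.groups()
    return (
        int(bedrooms) if bedrooms is not None else None,  # None: no count stated
        details,
        int(price.replace(',', '')),
    )

print(parse_label('5 bedroom detached house - £4,250,000'))
print(parse_label('apartment - £470,000'))
```

Returning `None` for an unmatched label makes unexpected formats visible instead of letting them fall into the except branch.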


## Finding the total number of properties for a search, to determine how many pages to scrape

In [55]:
properties=soup.find('span',class_='pagination-locationstyled__StyledContainer-sc-1wpe7y8-0 gTivhm pagination-barstyled__StyledPaginationLocation-sc-1t8z84i-2 cHdiez')
print(properties.prettify())

<span class="pagination-locationstyled__StyledContainer-sc-1wpe7y8-0 gTivhm pagination-barstyled__StyledPaginationLocation-sc-1t8z84i-2 cHdiez" data-testid="search-results-number">
 1
 <!-- -->
 -
 <!-- -->
 10
 <!-- -->
 of
 <!-- -->
 <strong>
  3056
 </strong>
 properties
</span>


In [12]:
properties2=properties.strong.text
print(properties2)

3056


Combine this code into one line:

In [13]:
number_properties=int(soup.find('span',class_='pagination-locationstyled__StyledContainer-sc-1wpe7y8-0 gTivhm pagination-barstyled__StyledPaginationLocation-sc-1t8z84i-2 cHdiez').strong.text)
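
The class string used above looks auto-generated, so it may change when the site is rebuilt. The printed span also carries a `data-testid="search-results-number"` attribute, which could serve as a selector instead. The sketch below runs against a hard-coded copy of that span (with the class shortened) rather than the live page, and assumes, without confirmation, that the attribute is more stable than the class names:

```python
from bs4 import BeautifulSoup

# Static copy of the pagination span printed above, class attribute shortened
sample_html = (
    '<span class="pagination-location" data-testid="search-results-number">'
    '1<!-- --> - <!-- -->10<!-- --> of <!-- --><strong>3056</strong> properties'
    '</span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Select on the data-testid attribute instead of the generated class string
counter = sample_soup.find('span', attrs={'data-testid': 'search-results-number'})
number_properties = int(counter.strong.text)
print(number_properties)
```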

The maximum number of listings per page is 10, so the number of pages to scrape is the property count divided by 10, rounded up:

In [56]:
(number_properties + 9) // 10

306
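
A small worked check of the page arithmetic: with 10 listings per page, the page count is the property count divided by 10 and rounded up. The expression `number_properties // 10 + 1` gives the same answer except when the count is an exact multiple of 10, where it asks for one page too many. `pages_needed` is an illustrative helper, not part of the original notebook.

```python
def pages_needed(number_properties, per_page=10):
    """Return how many result pages hold the given number of listings."""
    # Ceiling division using integers only
    return -(-number_properties // per_page)

for count in (3056, 3060, 3061):
    print(count, pages_needed(count), count // 10 + 1)
```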

### URL deconstruction and building:

Some manual observation shows that the URL can be broken down into location, search radius, whether already sold/let properties are included, and whether the search is for sale or for rent.

In [15]:
base_http='https://www.purplebricks.co.uk/search/property-for-sale/'
location = 'london'
radius='30'
sold_let='false'
type_='ForSale' # Change to 'ForRent' if wanting rental properties
page=1 # Looped over later; fixed here for illustration

In [57]:
url=f'{base_http}{location}?page={page}&searchRadius={radius}&searchType={type_}&soldOrLet={sold_let}'
print(url)

https://www.purplebricks.co.uk/search/property-for-sale/london?page=1&searchRadius=30&searchType=ForSale&soldOrLet=false
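
The same URL can also be assembled with `urllib.parse.urlencode` from the standard library, which handles escaping if a location or parameter ever contains spaces or special characters. A minimal sketch: `build_search_url` and `BASE_HTTP` are hypothetical names, and the parameter defaults simply repeat the values used above.

```python
from urllib.parse import urlencode

BASE_HTTP = 'https://www.purplebricks.co.uk/search/property-for-sale/'

def build_search_url(location, page=1, radius='30', type_='ForSale', sold_let='false'):
    """Build a search URL from the query parameters identified above."""
    params = {
        'page': page,
        'searchRadius': radius,
        'searchType': type_,
        'soldOrLet': sold_let,
    }
    return f'{BASE_HTTP}{location}?{urlencode(params)}'

print(build_search_url('london'))
```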


### Define a function to retrieve number of pages to scrape over:

In [58]:
def get_max_pages(base_http,location,radius,type_,sold_let): 
    url=f'{base_http}{location}?page=1&searchRadius={radius}&searchType={type_}&soldOrLet={sold_let}'
    source = requests.get(url).text
    soup=BeautifulSoup(source,'lxml')
    number_properties=int(soup.find('span',class_='pagination-locationstyled__StyledContainer-sc-1wpe7y8-0 gTivhm pagination-barstyled__StyledPaginationLocation-sc-1t8z84i-2 cHdiez').strong.text)
    max_pages=(number_properties+9)//10  # round up: 10 listings per page
    return max_pages

And check it works:

In [59]:
max_pages=get_max_pages(base_http,location,radius,type_,sold_let)
print(max_pages)

306


## Getting the data

Set up the dataframe to store the results in:

In [24]:
purplebricks_df=pd.DataFrame(columns=['Bedrooms','Type','Price'])

Define a function to scrape. The commented-out print calls were used to check that the loop works:

In [25]:
def get_data(base_http, location,radius,type_,sold_let,max_pages,cols):
    purplebricks_df=pd.DataFrame(columns=cols)
    i=0
    for page in range(1,max_pages+1):
        url=f'{base_http}{location}?page={page}&sortBy=2&searchRadius={radius}&searchType={type_}&soldOrLet={sold_let}'
        #print(url)
        source = requests.get(url).text
        soup=BeautifulSoup(source,'lxml')
        #print(i)
        for listing in soup.find_all('div', class_="property-cardstyled__StyledPropertyCard-sc-15g6092-0 fwvNIe"):
            label=listing.a['aria-label']
            try:
                listing_details=re.split(' bedroom | - |£',label)
                bedrooms=int(listing_details[0])
                details=listing_details[1]
                price=int(listing_details[3].replace(',',''))
            except (ValueError, IndexError):
                # No leading bedroom count, e.g. "apartment - £470,000"
                listing_details=re.split(' - |£',label)
                bedrooms=1
                details=listing_details[0]
                price=int(listing_details[2].replace(',',''))
            purplebricks_df.loc[i]=bedrooms, details, price
            #print(i)
            i+=1
    return purplebricks_df
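
Writing one row at a time with `.loc[i]` gets slow as the dataframe grows. A common alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below shows that pattern on a hard-coded list of labels instead of live pages; `split_label` and `labels_to_frame` are hypothetical helpers, and the one-bedroom fallback mirrors the choice made in the function above.

```python
import re
import pandas as pd

def split_label(label):
    """Split an aria-label into (bedrooms, details, price)."""
    parts = re.split(' bedroom | - £', label)
    if len(parts) == 3:
        return int(parts[0]), parts[1], int(parts[2].replace(',', ''))
    # No bedroom count stated: fall back to 1, as in the scraping function
    return 1, parts[0], int(parts[1].replace(',', ''))

def labels_to_frame(labels, cols=('Bedrooms', 'Type', 'Price')):
    """Parse every label into a row, then build the dataframe in one call."""
    rows = [split_label(label) for label in labels]
    return pd.DataFrame(rows, columns=list(cols))

sample_labels = ['5 bedroom detached house - £4,250,000', 'apartment - £470,000']
sample_df = labels_to_frame(sample_labels)
print(sample_df)
```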

Set out the base URL, location, radius (miles) and listing type, fetch the number of pages to scrape, then collect the scraped data into a dataframe.

In [26]:
base_http='https://www.purplebricks.co.uk/search/property-for-sale/'
location = 'london'
radius='30'
sold_let='false'
type_='ForSale'
max_pages=get_max_pages(base_http,location,radius,type_,sold_let)
print(f'Max pages ={max_pages}')
cols=['Bedrooms','Type','Price']
purplebricks_df=get_data(base_http, location,radius,type_,sold_let,max_pages,cols)
print(len(purplebricks_df))
purplebricks_df

Max pages =306
3060


| | Bedrooms | Type | Price |
|---|---|---|---|
| 0 | 5 | detached house | 4250000 |
| 1 | 5 | semi-detached house | 4000000 |
| 2 | 3 | flat | 3250000 |
| 3 | 4 | end of terrace house | 2850000 |
| 4 | 4 | flat | 1950000 |
| ... | ... | ... | ... |
| 3055 | 1 | retirement property | 120000 |
| 3056 | 2 | park home | 120000 |
| 3057 | 1 | retirement property | 120000 |
| 3058 | 1 | retirement property | 115000 |


Note that there are 4 more properties now than when the count was taken above (3,060 rows versus 3,056 properties). I believe this is due to the slight time delay between the two runs, with new properties being uploaded in between.

In [60]:
purplebricks_df.to_csv(r'C:\Users\maxan\Documents\Python\purplebricks.csv', index = False)

In [61]:
purplebricks_df.to_json(r'C:\Users\maxan\Documents\Python\purplebricks.json')

## Potential extensions

Postcodes

Adding minimum prices (this is just a case of adding another URL argument, '&priceFrom=X')

Using the latitude & longitude
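
As a sketch of the minimum-price extension, the helper below adds extra query arguments to an existing search URL using the standard library. `with_query_params` is a hypothetical name, the value 500000 is purely illustrative, and `priceFrom` is taken from the note above and has not been tested against the site here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_query_params(url, **extra):
    """Return url with the extra query arguments added (or overwritten)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update({key: str(value) for key, value in extra.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))

search_url = ('https://www.purplebricks.co.uk/search/property-for-sale/london'
              '?page=1&searchRadius=30&searchType=ForSale&soldOrLet=false')
priced_url = with_query_params(search_url, priceFrom=500000)
print(priced_url)
```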