# Welcome to Zoopla Scraper

## Purpose

This Python-based notebook is designed to accept any Zoopla search query, and automatically scrape all details from the corresponding listings into *csv* files for further analysis.

## Prerequisites

This notebook should be fairly easy to use with no other prerequisites, although it is experimental, so expect error messages. Basic technical fluency will help in diagnosing these, particularly knowledge of Python and HTML; JavaScript could also be useful.

## Usage

Most of the cells below can be ignored by a non-expert user. Cells which can be customised are clearly marked. Enjoy!

# Code

## Dependencies

These cells install required additional packages.

* '_bs4_' is *Beautiful Soup*, a package for parsing HTML content and navigating the resulting document tree.
* '_lxml_' is a package for efficiently parsing XML and HTML content; it is used here as the parser behind Beautiful Soup.

In [10]:
!pip install bs4



In [11]:
!pip install lxml



## Package Imports

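In addition to the packages installed above, the notebook imports several standard-library modules, along with *requests*, *pandas* and *numpy*, which are assumed to be available in the environment already.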

In [None]:
import requests
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np
import time
import os
import sys
import json
import datetime

## Search Link Input

Simply go to Zoopla, design a search query, and paste its URL below. Note that if your search returns more than 25 results, you'll need to customise the page size.

Note the use of `page_size=100` in the example below. This puts up to 100 results on a single page; a search with more results than that would still be split across pages.

``https://www.zoopla.co.uk/to-rent/flats/e14/?include_rented=true&include_shared_accommodation=false&page_size=100&polyenc=uglyHmn@{BxCVlElDlEpDz@`FFbB|Aj@`DnElA~CoBtDfArEsf@oI{CcD|EmBfOc@f@kAy@{HqGqDm@&price_frequency=per_month&view_type=list&q=E14&radius=0&results_sort=most_reduced&search_source=facets``

In [None]:
search_link = 'https://www.zoopla.co.uk/to-rent/flats/e14/?include_rented=true&include_shared_accommodation=false&page_size=100&polyenc=uglyHmn@{BxCVlElDlEpDz@`FFbB|Aj@`DnElA~CoBtDfArEsf@oI{CcD|EmBfOc@f@kAy@{HqGqDm@&price_frequency=per_month&view_type=list&q=E14&radius=0&results_sort=most_reduced&search_source=facets'
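
Rather than editing the query string by hand, the page size can be set programmatically. The following is a minimal sketch using only the standard library; the helper name `with_page_size` is hypothetical, and it assumes Zoopla continues to honour the `page_size` parameter seen in the link above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page_size(url, size=100):
    """Return url with its page_size query parameter set to size (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every existing parameter except any previous page_size
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key != 'page_size']
    params.append(('page_size', str(size)))
    # urlencode percent-encodes special characters (such as those in polyenc)
    return urlunsplit(parts._replace(query=urlencode(params)))

example_link = with_page_size('https://www.zoopla.co.uk/to-rent/flats/e14/?q=E14&page_size=25')
print(example_link)
```

The result could then be assigned to `search_link` in place of a hand-edited URL.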

## Data Retrieval

The cells below step through the following:

* Fetch the search results HTML
* Extract the URLs for each listing to create a 'links list'
* Visit each URL to pull the HTML of the individual listings
* Extract the 'listingDetails' data object, which Zoopla retrieves through a GraphQL query via the Apollo Client (luckily, the result is simply embedded as JSON in the last script tag of each listing's HTML)

In [None]:
# Fetch the search results page and fail early on an HTTP error
search_page = requests.get(search_link)
search_page.raise_for_status()
bsobj = soup(search_page.content, 'lxml')

In [None]:
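links_list = []
# Each search result links to its listing through an anchor carrying this data-testid
for result in bsobj.find_all('a', {'data-testid': 'listing-details-link'}):
    # Read the href straight from the tag; re-parsing it as XML is unnecessary and can fail on HTML
    links_list.append('https://www.zoopla.co.uk' + result.get('href'))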
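
The link-extraction step can be tried offline on a small HTML snippet. The sketch below swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra packages; the class and function names and the sample listing paths are made up for illustration. It also uses `urljoin` rather than string concatenation, on the assumption that an href might be either relative or absolute.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = 'https://www.zoopla.co.uk'

class ListingLinkParser(HTMLParser):
    """Collect hrefs from <a data-testid="listing-details-link"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'a' and attr_map.get('data-testid') == 'listing-details-link':
            href = attr_map.get('href')
            if href:
                # urljoin handles both relative and absolute hrefs
                self.links.append(urljoin(BASE_URL, href))

def extract_listing_links(html):
    parser = ListingLinkParser()
    parser.feed(html)
    return parser.links

# Invented sample markup, not real Zoopla output
sample_html = '''
<div>
  <a data-testid="listing-details-link" href="/to-rent/details/111/">Flat A</a>
  <a href="/about/">About</a>
  <a data-testid="listing-details-link" href="/to-rent/details/222/">Flat B</a>
</div>
'''
sample_links = extract_listing_links(sample_html)
print(sample_links)
```

Only the two anchors carrying the `data-testid` attribute are collected; the plain "About" link is skipped.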

In [None]:
listingDetails_list = []

for listing in links_list:
    listing_page = requests.get(listing)
    # Use a separate name so the search results soup is not overwritten
    listing_soup = soup(listing_page.content, 'lxml')

    # The last <script> tag holds the page data as JSON
    json_props = json.loads(listing_soup.find_all('script')[-1].get_text())
    listingDetails = json_props['props']['pageProps']['listingDetails']

    listingDetails_list.append(listingDetails)
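
Because the page structure may change without notice, the chained key lookup above is a likely source of `KeyError` messages. As a rough sketch, a small helper can walk the nested keys and return a default when any step is missing. The name `dig` and the sample JSON are hypothetical; only the key path `props` / `pageProps` / `listingDetails` comes from the cell above.

```python
import json

def dig(data, *keys, default=None):
    """Follow a chain of dict keys, returning default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Invented stand-in for the JSON text found in the final <script> tag
sample_script_text = '{"props": {"pageProps": {"listingDetails": {"adTargeting": {"listingId": "123"}}}}}'
page_data = json.loads(sample_script_text)

details = dig(page_data, 'props', 'pageProps', 'listingDetails')
missing = dig(page_data, 'props', 'pageProps', 'noSuchKey')
print(details, missing)
```

A listing for which `dig` returns `None` could then be skipped or logged instead of stopping the whole run.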

## Data Parsing

With all the data retrieved, it can now be split into its component parts. This is necessary because the parts have different structures, and each needs slightly different treatment before it can be included in the main data object.

The data is eventually combined into Pandas dataframes, and then written to csv files.

In [None]:
listing_df_list = []

for listingDetail in listingDetails_list:

    adTargeting = listingDetail['adTargeting']
    branch = listingDetail['branch']
    feature_bullets = listingDetail['features']['bullets']
    feature_flags = listingDetail['features']['flags']
    viewCount = listingDetail['viewCount']
    pricing = listingDetail['pricing']
    
    if 'alternateRentFrequencyPrice' in pricing:
        del pricing['alternateRentFrequencyPrice']
    
    listingId = adTargeting['listingId']

    adTargeting_df = pd.DataFrame.from_dict(adTargeting, orient='index', columns=[0]).transpose()
    branch_df = pd.DataFrame.from_dict(branch, orient='index', columns=[0]).transpose()
    feature_flags_df = pd.DataFrame.from_dict(feature_flags, orient='index', columns=[0]).transpose()
    viewCount_df = pd.DataFrame.from_dict(viewCount, orient='index', columns=[0]).transpose()
    pricing_df = pd.DataFrame.from_dict(pricing, orient='index', columns=[0]).transpose()

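    # Place the one-row frames side by side to form a single row for this listing
    listing_df = pd.concat(objs = [adTargeting_df,branch_df,feature_flags_df,viewCount_df,pricing_df], axis = 1)

    # Collapse the list of bullet points into one comma-separated string
    listing_df['feature_bullets'] = ', '.join(feature_bullets)

    listing_df_list.insert(0, listing_df)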
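
The same flattening idea can be shown without pandas. The sketch below merges a few sections of one listing into a single flat row, prefixing each key with its section name so that two sections sharing a key cannot collide. The function name, the choice of sections, and the sample values (including the `name` and `label` keys) are assumptions for illustration, not Zoopla's actual schema.

```python
def flatten_listing(listing_detail, sections=('adTargeting', 'branch', 'pricing')):
    """Merge the chosen sections of one listing into a single flat dict (hypothetical helper)."""
    row = {}
    for section in sections:
        # Treat a missing or null section as empty
        values = listing_detail.get(section) or {}
        for key, value in values.items():
            row[f'{section}.{key}'] = value
    bullets = (listing_detail.get('features') or {}).get('bullets') or []
    row['feature_bullets'] = ', '.join(bullets)
    return row

# Invented sample data
sample_detail = {
    'adTargeting': {'listingId': '123'},
    'branch': {'name': 'Example Lettings'},
    'pricing': {'label': '1500 pcm'},
    'features': {'bullets': ['Balcony', 'Concierge']},
}
sample_row = flatten_listing(sample_detail)
print(sample_row)
```

A list of such rows can be passed to `pd.DataFrame(rows)` to build the table in one step.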


In [None]:
priceHistory_df_list = []

for listingDetail in listingDetails_list:
    
    priceChanges = listingDetail['priceHistory']['priceChanges']
    listingId = listingDetail['adTargeting']['listingId']
    
    if priceChanges is not None:
        for priceChange in priceChanges:
            priceChange['listingId'] = listingId
    
            priceChanges_df = pd.DataFrame.from_dict(priceChange, orient='index', columns=[0]).transpose()
            priceHistory_df_list.insert(0, priceChanges_df)
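
The price history is naturally a "long" table: one row per price change, tagged with the listing it belongs to. A minimal sketch of that shape in plain Python follows. The function name and the `date` and `price` fields are invented for illustration; the `priceHistory`, `priceChanges` and `listingId` keys are those used in the cell above.

```python
def price_history_rows(listing_details):
    """Return one dict per price change, each tagged with its listingId (hypothetical helper)."""
    rows = []
    for detail in listing_details:
        listing_id = detail['adTargeting']['listingId']
        # Treat a missing or null price history as an empty list
        changes = (detail.get('priceHistory') or {}).get('priceChanges') or []
        for change in changes:
            # Build a new dict so the scraped data is left unmodified
            rows.append({**change, 'listingId': listing_id})
    return rows

# Invented sample data: one listing with a price change, one without
sample_details = [
    {'adTargeting': {'listingId': '123'},
     'priceHistory': {'priceChanges': [{'date': '2022-01-01', 'price': 1500}]}},
    {'adTargeting': {'listingId': '456'},
     'priceHistory': {'priceChanges': None}},
]
sample_rows = price_history_rows(sample_details)
print(sample_rows)
```

Listings with no recorded changes simply contribute no rows, which mirrors the `None` check in the loop above.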

In [None]:
master_df = pd.concat(objs = listing_df_list, axis = 0)

In [None]:
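# pd.concat raises a ValueError on an empty list, so fall back to an empty dataframe
master_priceHistory_df = pd.concat(objs = priceHistory_df_list, axis = 0) if priceHistory_df_list else pd.DataFrame()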

In [None]:
date_extracted = datetime.datetime.now().strftime("%d-%m-%Y_%H%M")
print(date_extracted)
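
The day-first timestamp above works, but files named that way do not sort chronologically in a directory listing. As a possible alternative, a year-first format does. The helper name `timestamped_name` is hypothetical, and a fixed example time is used so the result is reproducible.

```python
import datetime

def timestamped_name(prefix, when, extension='csv'):
    """Build a filename whose alphabetical order matches its chronological order."""
    stamp = when.strftime('%Y-%m-%d_%H%M')
    return f'{prefix}_{stamp}.{extension}'

# Fixed example time; datetime.datetime.now() would be used in practice
example_time = datetime.datetime(2022, 3, 9, 14, 5)
example_name = timestamped_name('master', example_time)
print(example_name)
```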

In [None]:
master_df.to_csv('master_' + date_extracted + '.csv', index=False)
master_priceHistory_df.to_csv('master_priceHistory_' + date_extracted + '.csv', index=False)