# Web Scraping for Property Sales Data from Rightmove
This project demonstrates how to scrape data for sold properties in London using Python. The scraped data includes property addresses, sale prices, transaction dates, property types, and more, organized into a structured format (CSV/Excel).

## Objectives
- Extract comprehensive property data from Rightmove.
- Handle nested JSON structures and dynamic content.
- Organize and present the data in a structured format.

### Importing Libraries
The following libraries are used:
- `BeautifulSoup` (from `bs4`): For parsing HTML.
- `requests`: To fetch web pages.
- `json`: To handle embedded JSON data.
- `re`: For regular expressions.
- `pandas`: For organizing and saving the data.
- `IPython.display.JSON`: For inspecting JSON interactively in the notebook.


In [71]:
from bs4 import BeautifulSoup
import requests
import json
import re
import pandas as pd
from IPython.display import JSON
#from time import sleep


In [72]:
#headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

### fetch_page_data(page_number)
Fetches the HTML content of the specified page from the Rightmove website.
- **Input**: Page number.
- **Output**: Parsed HTML using BeautifulSoup or `None` if the request fails.


In [None]:
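def fetch_page_data(page_number):
    # Sold-price search results for London; `pageNumber` selects the results page.
    URL = f"https://www.rightmove.co.uk/house-prices/london.html?soldIn=1&pageNumber={page_number}"
    try:
        # A timeout keeps the scraper from hanging on a stalled connection.
        response = requests.get(URL, timeout=30)
    except requests.RequestException as exc:
        print(f"Error fetching page {page_number}: {exc}")
        return None
    if response.status_code == 200:
        return BeautifulSoup(response.text, "html.parser")
    print(f"Error fetching page {page_number}: {response.status_code}")
    return None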
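
Sites can throttle or reject automated requests, so a fetch step often benefits from a browser-like `User-Agent`, a timeout, and a few retries. The following is a minimal sketch under assumptions, not part of the original scraper: `fetch_with_retries`, its parameters, and the set of status codes treated as retryable are all hypothetical. The `get` argument is any callable with the same interface as `requests.get`, which keeps the helper easy to exercise offline.

```python
import time

# Hypothetical defaults; the User-Agent is shortened from the commented-out header above.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
RETRY_STATUSES = {429, 500, 502, 503, 504}  # assumption: treat these as transient


def fetch_with_retries(url, get, retries=3, backoff=1.0, timeout=30, sleep=time.sleep):
    """Return the response body as text, or None if every attempt fails.

    `get` is any callable shaped like `requests.get(url, headers=..., timeout=...)`.
    """
    for attempt in range(retries):
        try:
            response = get(url, headers=DEFAULT_HEADERS, timeout=timeout)
        except Exception as exc:  # e.g. requests.RequestException
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
        else:
            if response.status_code == 200:
                return response.text
            if response.status_code not in RETRY_STATUSES:
                # A 404 or similar will not improve on retry.
                print(f"Giving up on {url}: HTTP {response.status_code}")
                return None
        # Exponential backoff between attempts: 1s, 2s, ... with the defaults.
        if attempt < retries - 1:
            sleep(backoff * 2 ** attempt)
    return None
```

With `requests` installed, a call might look like `html = fetch_with_retries(URL, requests.get)`; when `html` is not `None`, `BeautifulSoup(html, "html.parser")` yields the same kind of `soup` object that `fetch_page_data` returns.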

### extract_json_data(soup)
Extracts the embedded JSON data from the HTML content.
- **Input**: Parsed HTML (`soup` object).
- **Output**: Parsed JSON (a Python dictionary) containing property details, or `None` if the data is not found.


In [None]:
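def extract_json_data(soup):
    # Check every inline script, since the first one may not hold PAGE_MODEL.
    for script in soup.find_all('script', type="text/javascript"):
        text = script.string
        if not text:  # external or empty scripts have no inline text
            continue
        match = re.search(r'window\.PAGE_MODEL\s*=\s*(\{.*?\});', text, re.DOTALL)
        if match:
            return json.loads(match.group(1))
    return None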
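
The non-greedy pattern `\{.*?\});` stops at the first `};` it meets, so it would truncate the object if that sequence ever appeared inside a string value. One alternative, sketched here as an assumption rather than as the notebook's method, is to locate the assignment with a regex and let `json.JSONDecoder.raw_decode` work out where the object ends. The helper name `parse_page_model` and the sample script text are hypothetical.

```python
import json
import re


def parse_page_model(script_text, marker="window.PAGE_MODEL"):
    """Return the object assigned to `marker` in a script, or None if absent or invalid."""
    # Find the assignment and consume the whitespace before the opening brace.
    match = re.search(re.escape(marker) + r"\s*=\s*", script_text)
    if not match:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever JavaScript follows it.
        obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return obj


# Hypothetical script text whose string value contains "};".
sample = 'window.PAGE_MODEL = {"note": "ends with };", "n": [1, 2]}; window.other = 1;'
model = parse_page_model(sample)
```

Because the decoder tracks string and brace nesting itself, the embedded `};` in `sample` does not cut the object short.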

### extract_properties_from_json(json_data)
Processes the JSON data and produces one record per transaction, so a property with several recorded sales appears in several rows. Each record contains:
- Address
- Bedrooms
- Property type
- Sale price
- Sale date
- Tenure
- New-build flag
- Geolocation (latitude and longitude)
- Property URL

- **Input**: JSON data.
- **Output**: List of property records in dictionary format.


In [None]:
def extract_properties_from_json(json_data):
    all_properties = []
    for property_data in json_data['searchResult']['properties']:
        # One record per transaction: a property sold several times yields several rows.
        for txn in property_data['transactions']:
            record = {
                'address': property_data['address'],
                'bedrooms': property_data['bedrooms'],
                'type': property_data['propertyType'],
                'transaction_prices': txn['displayPrice'],
                'transaction_dates': txn['dateSold'],
                'tenure': txn['tenure'],
                'newBuild': txn['newBuild'],
                'locationLat': property_data['location']['lat'],
                'locationLng': property_data['location']['lng'],
                'url': property_data['detailUrl']
            }
            all_properties.append(record)
    return all_properties
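
Direct indexing raises `KeyError` as soon as one listing lacks a field. A defensive variant, sketched here under the assumption that some keys may be absent or null for some listings, uses `dict.get` so that one incomplete property does not abort the whole page. The function name `extract_properties_safely` and the sample payload are hypothetical.

```python
def extract_properties_safely(json_data):
    """Like extract_properties_from_json, but missing keys become None."""
    records = []
    search_result = json_data.get("searchResult") or {}
    for prop in search_result.get("properties") or []:
        location = prop.get("location") or {}
        for txn in prop.get("transactions") or []:
            records.append({
                "address": prop.get("address"),
                "bedrooms": prop.get("bedrooms"),
                "type": prop.get("propertyType"),
                "transaction_prices": txn.get("displayPrice"),
                "transaction_dates": txn.get("dateSold"),
                "tenure": txn.get("tenure"),
                "newBuild": txn.get("newBuild"),
                "locationLat": location.get("lat"),
                "locationLng": location.get("lng"),
                "url": prop.get("detailUrl"),
            })
    return records


# Hypothetical payload: the property has no bedrooms, location, or URL.
sample = {
    "searchResult": {
        "properties": [
            {
                "address": "1, Example Road, London",
                "propertyType": "TERRACED",
                "transactions": [{"displayPrice": "£380,000", "dateSold": "30 Oct 2024"}],
            }
        ]
    }
}
rows = extract_properties_safely(sample)
```

Missing values arrive in the DataFrame as `None`/`NaN`, where they can be counted or dropped explicitly instead of stopping the scrape.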

### Main Scraper Logic
Loops through results pages 1 to 40 to fetch and parse data, stopping early if a page fails to download or parse:
1. Calls `fetch_page_data` to get the HTML content.
2. Extracts JSON data using `extract_json_data`.
3. Collects property details using `extract_properties_from_json`.
4. Combines data from all pages into a single DataFrame.


In [84]:

all_properties = []
for page in range(1, 41):  # Replace with dynamic page number if implemented
    soup = fetch_page_data(page)
    if not soup:
        break
    json_data = extract_json_data(soup)
    if not json_data:
        print(f"Error parsing JSON on page {page}")
        break
    all_properties.extend(extract_properties_from_json(json_data))
    #sleep(2)  # Rate limiting
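
The loop above can be wrapped in a function so that the page limit and delay become parameters and the commented-out rate limiting is always applied. This is a sketch under assumptions: `scrape_pages` and its arguments are hypothetical, and it additionally stops when a page yields no records, which the original loop does not do. The three pipeline steps are passed in as callables so the loop can be tried without touching the network.

```python
import time


def scrape_pages(fetch, parse, extract, max_pages=40, delay=2.0, sleep=time.sleep):
    """Run fetch -> parse -> extract for pages 1..max_pages and collect the records."""
    records = []
    for page in range(1, max_pages + 1):
        if page > 1:
            sleep(delay)  # pause between requests, not before the first one
        soup = fetch(page)
        if soup is None:
            break
        json_data = parse(soup)
        if json_data is None:
            print(f"Error parsing JSON on page {page}")
            break
        page_records = extract(json_data)
        if not page_records:
            break  # assumption: an empty page means the results have run out
        records.extend(page_records)
    return records
```

In the notebook this might be called as `all_properties = scrape_pages(fetch_page_data, extract_json_data, extract_properties_from_json)`.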

In [85]:
#JSON(all_properties)

In [86]:
df = pd.DataFrame(all_properties)


In [87]:
df.shape

(2814, 10)

### Data Preview
Below is a preview of the scraped data:


In [88]:
df.head()

| | address | bedrooms | type | transaction_prices | transaction_dates | tenure | newBuild | locationLat | locationLng | url |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Glencoe, Ashfield Avenue, Feltham, Greater Lon... | 4.0 | DETACHED | £650,000 | 31 Oct 2024 | FREEHOLD | False | 51.44575 | -0.4043 | https://www.rightmove.co.uk/house-prices/detai... |
| 1 | Glencoe, Ashfield Avenue, Feltham, Greater Lon... | 4.0 | DETACHED | £133,000 | 15 Sep 1997 | FREEHOLD | False | 51.44575 | -0.4043 | https://www.rightmove.co.uk/house-prices/detai... |
| 2 | 8, Heathcote Grove, London, Greater London E4 6RT | 2.0 | SEMI_DETACHED | £282,500 | 31 Oct 2024 | FREEHOLD | False | 51.62463 | -0.00579 | https://www.rightmove.co.uk/house-prices/detai... |
| 3 | 8, Heathcote Grove, London, Greater London E4 6RT | 2.0 | SEMI_DETACHED | £310,000 | 14 Oct 2002 | FREEHOLD | False | 51.62463 | -0.00579 | https://www.rightmove.co.uk/house-prices/detai... |
| 4 | 145, Winterbourne Road, Thornton Heath, Greate... | 3.0 | TERRACED | £380,000 | 30 Oct 2024 | FREEHOLD | False | 51.40127 | -0.11027 | https://www.rightmove.co.uk/house-prices/detai... |
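
The `transaction_prices` and `transaction_dates` columns hold strings, so they need converting before any numeric or time-based analysis. A minimal sketch, assuming every price is a whole number of pounds formatted like `£650,000` and every date looks like `31 Oct 2024`, as in the preview (the helper names `parse_price` and `parse_sale_date` are hypothetical):

```python
import re
from datetime import datetime


def parse_price(display_price):
    """Convert a string such as '£650,000' to an integer number of pounds.

    Assumes whole pounds; every non-digit character is discarded.
    """
    digits = re.sub(r"[^\d]", "", display_price)
    return int(digits) if digits else None


def parse_sale_date(date_sold):
    """Convert a string such as '31 Oct 2024' to a datetime.date.

    %b matches abbreviated month names in the current locale (English assumed).
    """
    return datetime.strptime(date_sold, "%d %b %Y").date()
```

Applied to the DataFrame, this could look like `df['price'] = df['transaction_prices'].map(parse_price)` and `df['date_sold'] = df['transaction_dates'].map(parse_sale_date)`, where the new column names are again only illustrative.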


### Data Output
The collected data is saved to a CSV file named `property_data.csv` for further analysis.


In [89]:
df.to_csv('property_data.csv', index=False)
print("Data successfully saved to property_data.csv")

Data successfully saved to property_data.csv
