# Scraping data on properties for sale in the Dublin area listed on daft.ie

A property search on daft.ie returns lists of houses, apartments, and other properties for sale in Dublin City, which can be gathered and stored for easy access and analysis.
The main purpose of this project is to analyse the web structure of daft.ie and retrieve useful data about the properties for sale and their prices.

## Analysing the web structure and testing the functions

The website daft.ie presents all its listings of properties for sale in the Dublin area at https://www.daft.ie/property-for-sale/dublin-city. The first approach is to determine how many results there are to retrieve, then loop through all the result pages. Further investigation shows that all search results are stored in an unordered list, where each list item contains most of the information needed to describe a property in general terms (address, price, number of rooms (beds, baths), floor area, etc.). So, the first step is to get all the items in the list that contain the search results.
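
The pagination arithmetic described above can be tried without touching the network. The snippet below is a minimal sketch: the headline text is a made-up example of what the results heading might contain, and `parse_result_count` and `page_offsets` are hypothetical helper names, not part of the notebook. The page size of 20 and the `from` query parameter are taken from the scraping cell.

```python
import re

PAGE_SIZE = 20  # the notebook requests results 20 at a time


def parse_result_count(headline):
    """Extract the first integer (thousands separators allowed) from the headline."""
    match = re.search(r"\d+(?:,\d{3})*", headline)
    if match is None:
        raise ValueError("no result count found in {!r}".format(headline))
    return int(match.group().replace(",", ""))


def page_offsets(total, page_size=PAGE_SIZE):
    """Offsets for the `from` query parameter: 0, 20, 40, ... strictly below `total`."""
    return range(0, total, page_size)


# Hypothetical headline text; the real page may word it differently.
headline = "2,640 Properties for Sale in Dublin City"
total = parse_result_count(headline)
offsets = list(page_offsets(total))
print(total, len(offsets), offsets[-1])
```

With 2,640 results this yields 132 offsets, the last being 2620, so no request is made for a page that starts beyond the final result.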

### *First, we retrieve all the list items from the search results*

In [1]:
import requests
from bs4 import BeautifulSoup
import re

base_url = "https://www.daft.ie/property-for-sale/dublin-city"

property_features = []
response = requests.get(base_url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    # get the number of search results available for scraping, stripping the surrounding text and the thousands separator
    result_anouncement = soup.find("h1",attrs={"data-testid": "search-h1"}).text
    number_of_result = re.findall(r'\d+(?:,\d{3})*(?:\.\d+)?', result_anouncement)[0]
    number_of_result = int(number_of_result.replace(",",""))
    counter = 0
    
    while counter < number_of_result:
        # get all the search results (unordered list items) 20 items at a time, until none are left
        response = requests.get(base_url+"?from={}".format(counter))
        if response.status_code == 200:
            soup = BeautifulSoup(response.content,"html.parser")
            # get all list items whose data-testid attribute contains 'result'
            property_features += soup.find_all("li", attrs={"data-testid": lambda value: value and "result" in value})
        
        counter+=20
else:
    print("Error")

* We now have all the properties (or, more specifically, their containers) in a list. There are a few exceptions, however: containers with a different structure from the rest, which list newly built properties whose prices are hidden from the public. Because there are only a few of them, these properties are dropped from the dataset for this project.

In [2]:
non_subunit_containers=[a for a in property_features if a.find(attrs={'data-testid':'sub-units-container'}) is None]

print('The number of search results: {}'.format(len(property_features)))
print('The number of exceptions that will be dropped from the dataset: {}'.format(len(property_features)-len(non_subunit_containers)))
print('The number of usable results: {}'.format(len(non_subunit_containers)))

The number of search results: 2640
The number of exceptions that will be dropped from the dataset: 25
The number of usable results: 2615


*As we can see, only a small fraction of the results are exceptions (25 of 2,640, under 1%).*

### *Now, we test the functions for extracting the target data from the list items, using the first item of the retrieved list*

* **Get the link to the property page**

In [3]:
non_subunit_containers[0].a.get('href')

'/for-sale/apartment-34-ballintyre-meadows-ballinteer-ballinteer-dublin-16/5286712'

In [4]:
# list of all the property links
links = [li.a.get('href') for li in non_subunit_containers]

In [5]:
print('Number of links available: {}'.format(len(links)))

Number of links available: 2615
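
The links collected above are relative paths. A minimal sketch of turning one into an absolute URL with the standard library follows; `SITE_ROOT` is an assumed constant, and reading the trailing number as a listing identifier is a guess based on the shape of the path, not something the site documents here.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.daft.ie"  # assumed root for resolving relative links

# Relative link taken from the output of the cell above.
relative_link = "/for-sale/apartment-34-ballintyre-meadows-ballinteer-ballinteer-dublin-16/5286712"

# urljoin resolves the path against the site root.
absolute_link = urljoin(SITE_ROOT, relative_link)

# Assumption: the last path segment identifies the listing.
listing_id = relative_link.rstrip("/").rsplit("/", 1)[-1]

print(absolute_link)
print(listing_id)
```

Storing the absolute URL makes each row usable on its own, for example when the CSV is opened outside this notebook.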


* **Get the physical address of the property**

In [6]:
non_subunit_containers[0].find(attrs={"data-testid": 'address'}).text

'34 Ballintyre Meadows, Ballinteer, Ballinteer, Dublin 16, D16VK61'

In [7]:
physical_addresses = [li.find(attrs={"data-testid": 'address'}).text for li in non_subunit_containers]

In [8]:
print('Number of addresses available: {}'.format(len(physical_addresses)))

Number of addresses available: 2615


* **Get the price of the property, removing any currency symbols and punctuation marks**

In [9]:
non_subunit_containers[0].find(attrs={"data-testid":'price'}).text

'€385,000 4 ONLINE OFFERS'

*As we can see, the price is not a numerical value that can be used for analysis, so further steps are needed to clean the prices and convert them to numerical values.*

In [10]:
int(non_subunit_containers[0].find(attrs={"data-testid":'price'}).text.split()[0].replace("€", "").replace(",", ""))

385000

In [11]:
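price = []
for li in non_subunit_containers:
    try:
        # keep the first token (e.g. '€385,000') and strip the currency symbol and commas
        temp_price = int(li.find(attrs={"data-testid":'price'}).text.split()[0].replace("€", "").replace(",", ""))
    except (AttributeError, IndexError, ValueError):
        # no price element, empty text, or a non-numeric price
        temp_price = None
    price.append(temp_price)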

In [12]:
print('Number of prices available: {}'.format(len(price)))

Number of prices available: 2615
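
The cleaning step above relies on the amount being the first token of the text. As a hedged alternative, the sketch below searches for the euro amount anywhere in the string with a regular expression; `parse_price` is a hypothetical helper, and the second and third sample strings are invented to illustrate cases the first-token approach would turn into `None`.

```python
import re


def parse_price(text):
    """Return the euro amount found in `text` as an int, or None if there is none."""
    match = re.search(r"€\s*(\d+(?:,\d{3})*)", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


samples = [
    "€385,000 4 ONLINE OFFERS",  # text seen in the notebook output
    "AMV: €250,000",             # invented example: amount is not the first token
    "Price on Application",      # invented example: no amount at all
]
for sample in samples:
    print(repr(sample), "->", parse_price(sample))
```

Returning `None` for unpriced listings keeps the list aligned with the other columns, which is the same convention the notebook uses.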


* **Get the number of beds and baths, and the property type**

In [13]:
beds = []
for li in non_subunit_containers:
    try:
        temp_bed = int(li.find(attrs={"data-testid":'beds'}).text.split()[0])
    except (AttributeError, IndexError, ValueError):
        # no beds element, empty text, or a non-numeric value
        temp_bed = None
    beds.append(temp_bed)

In [14]:
print('Number of beds available: {}'.format(len(beds)))

Number of beds available: 2615


In [15]:
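baths = []
for li in non_subunit_containers:
    try:
        # assumes the bathroom count is exposed under data-testid 'baths';
        # querying 'beds' here would simply duplicate the beds column
        temp_bath = int(li.find(attrs={"data-testid":'baths'}).text.split()[0])
    except (AttributeError, IndexError, ValueError):
        temp_bath = None
    baths.append(temp_bath)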

In [16]:
print('Number of baths available: {}'.format(len(baths)))

Number of baths available: 2615


In [17]:
property_type = [li.find(attrs={"data-testid":'property-type'}).text for li in non_subunit_containers]

In [18]:
print('Number of total property types available: {}'.format(len(property_type)))

Number of total property types available: 2615
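
The beds and baths cells repeat the same pattern: find an element by its `data-testid`, take the first token, convert it to an integer, and fall back to `None`. A minimal sketch of a shared helper is shown below on a small hand-written HTML fragment. `first_int` is a hypothetical name, the fragment is invented for illustration, and the `baths` identifier is an assumption about the site's markup.

```python
from bs4 import BeautifulSoup


def first_int(container, testid):
    """Return the leading integer of the element tagged with `testid`, or None."""
    node = container.find(attrs={"data-testid": testid})
    if node is None:
        return None
    tokens = node.get_text(strip=True).split()
    try:
        return int(tokens[0])
    except (IndexError, ValueError):
        return None


# Invented fragment that mimics the structure described in this notebook.
sample_html = """
<ul>
  <li data-testid="result-1">
    <p data-testid="beds">2 Bed</p>
    <p data-testid="baths">1 Bath</p>
  </li>
  <li data-testid="result-2">
    <p data-testid="beds">3 Bed</p>
  </li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")
items = soup.find_all("li")

sample_beds = [first_int(li, "beds") for li in items]
sample_baths = [first_int(li, "baths") for li in items]
print(sample_beds, sample_baths)
```

Passing the identifier as an argument means each column is produced by one line, which makes a copy-and-paste slip between the beds and baths selectors easier to spot.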


* **Get the floor area of the property. Some properties have their floor area listed in acres, which needs to be converted to square metres.**

In [19]:
non_subunit_containers[0].find(attrs={"data-testid":'floor-area'}).text

'80 m²'

In [20]:
print("Floor area units:")
print(set([a.find(attrs={"data-testid":'floor-area'}).text.split()[1] for a in non_subunit_containers if a.find(attrs={"data-testid":'floor-area'})]))

Floor area units:
{'m²', 'ac'}


In [21]:
# Convert the text representation of the floor area into a numerical value.
# Floor areas listed in acres are converted to square metres.
def floor_area_num(or_text):
    splt_text = or_text.split()
    # parse the number first; dividing the raw string would raise a TypeError
    value = float(splt_text[0].replace(",", ""))
    if splt_text[1]=='ac':
        # 1 square metre is about 0.00024711 acres
        return int(value/0.00024711)
    else:
        return int(value)

In [22]:
floor_area = []
for li in non_subunit_containers:
    try:
        temp_area = floor_area_num(li.find(attrs={"data-testid":'floor-area'}).text)
    except:
        temp_area = None
    floor_area.append(temp_area)

In [23]:
print('Number of floor areas available: {}'.format(len(floor_area)))

Number of floor areas available: 2615
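
As a worked example of the unit conversion, the sketch below multiplies by the exact size of an international acre (4046.8564224 m²) instead of dividing by the rounded factor. `to_square_metres` is a hypothetical name, and the acre inputs are invented values chosen to make the arithmetic easy to follow.

```python
SQUARE_METRES_PER_ACRE = 4046.8564224  # exact definition of the international acre


def to_square_metres(text):
    """Convert strings such as '80 m²' or '0.5 ac' to whole square metres."""
    value, unit = text.split()[:2]  # raises ValueError if the unit is missing
    number = float(value.replace(",", ""))
    if unit == "ac":
        number *= SQUARE_METRES_PER_ACRE
    return round(number)


for sample in ["80 m²", "0.5 ac", "2 ac"]:
    print(sample, "->", to_square_metres(sample), "m²")
```

Half an acre is 0.5 × 4046.8564224 ≈ 2023.43 m², which rounds to 2023, and two acres come to about 8093.71 m², which rounds to 8094.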


* **Get the name of the agency that posted the ad**

In [24]:
agents = []
for li in non_subunit_containers:
    try:
        temp_agent = li.find(attrs={"data-testid":'agent-name'}).text
    except AttributeError:
        # no agent element in this container
        temp_agent = None
    agents.append(temp_agent)

In [26]:
len(agents)

2615

* **Convert everything to a pandas DataFrame for easy manipulation**

In [81]:
import pandas as pd
data_list = [links,agents, physical_addresses,floor_area,property_type,beds,baths,price]
col_names = ['links','agents', 'physical_addresses','floor_area','property_type','beds','baths','price']
df = pd.DataFrame(data_list).T
df.columns = col_names
df

| | links | agents | physical_addresses | floor_area | property_type | beds | baths | price |
|---|---|---|---|---|---|---|---|---|
| 0 | /for-sale/apartment-34-ballintyre-meadows-ball... | Fair Deal Property Ltd -Galway | 34 Ballintyre Meadows, Ballinteer, Ballinteer,... | 80 | Apartment | 2 | 2 | 385000 |
| 1 | /for-sale/terraced-house-49-donore-avenue-sout... | Felicity Fox Auctioneers | 49 Donore Avenue, South Circular Road, South C... | 100 | Terrace | 3 | 3 | 800000 |
| 2 | /for-sale/semi-detached-house-1-kill-avenue-du... | Lisney Sotheby's International Realty (Dalkey) | 1 Kill Avenue Dun Laoghaire, Dun Laoghaire, Co... | | Semi-D | 3 | 3 | 775000 |
| 3 | /for-sale/detached-house-thornberry-thornberry... | Ed Dempsey | Thornberry, Thornberry, 4 Granville Road, Blac... | 307 | Detached | 4 | 4 | 1525000 |
| 4 | /for-sale/semi-detached-house-2a-south-avenue-... | Janet Carroll Estate Agent | 2A South Avenue, Blackrock, Co. Dublin, A94RH21 | 124 | Semi-D | 3 | 3 | 1075000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2610 | /for-sale/apartment-hampton-wood-road-finglas-... | Horan Estates and Lettings | Hampton Wood Road, Finglas, Dublin 11, D11X073 | | Apartment | 1 | 1 | 190000 |
| 2611 | /for-sale/detached-house-28a-virginia-park-fin... | Leonard Wilson Keenan Estates & Letting Agents | 28A Virginia Park, Finglas South, Finglas, Dub... | | Detached | 3 | 3 | 179950 |
| 2612 | /for-sale/terraced-house-156-parnell-street-du... | DNG Phibsboro | 156 Parnell Street, Dublin 1, D01FW92 | | Terrace | 12 | 12 | 1500000 |
| 2613 | /for-sale/semi-detached-house-13-curragh-hall-... | Horan Estates and Lettings | 13 Curragh Hall Green, Tyrrelstown, Dublin 15 | | Semi-D | 3 | 3 | 295000 |


* **Export to a CSV file for storage**

In [88]:
import datetime
df.to_csv('daft_{}.csv'.format(str(datetime.date.today())),index=False,sep=';', header=True, encoding='utf-8')
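
The export uses `;` as the separator, which suits this dataset because the addresses themselves contain commas. The sketch below illustrates the effect with the standard-library `csv` module and an in-memory buffer instead of pandas and a file; the two rows reuse addresses and prices from the table above, and the two-column layout is a simplification for illustration.

```python
import csv
import io

rows = [
    {"physical_addresses": "2A South Avenue, Blackrock, Co. Dublin, A94RH21", "price": "1075000"},
    {"physical_addresses": "156 Parnell Street, Dublin 1, D01FW92", "price": "1500000"},
]

# Write with ';' as the delimiter, mirroring sep=';' in the export cell.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["physical_addresses", "price"], delimiter=";")
writer.writeheader()
writer.writerows(rows)

# Commas inside the address need no quoting because ',' is not the delimiter.
first_data_line = buffer.getvalue().splitlines()[1]
print(first_data_line)

# Reading back requires the same delimiter, otherwise the columns are split wrongly.
buffer.seek(0)
restored = list(csv.DictReader(buffer, delimiter=";"))
print(restored == rows)
```

Anyone loading the exported file later has to pass the same separator, for example `sep=';'` when reading it back with pandas.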