# ADM Homework 4 Group 11

## 1) Does basic house information reflect the house's description?
In this assignment we will perform a clustering analysis of house announcements in Rome from Immobiliare.it.

Let's start by preparing the environment and loading the libraries:

In [8]:
import pandas as pd
from bs4 import BeautifulSoup
from requests import get
import csv
import re

We'll scrape some data from the website starting from this url:

https://www.immobiliare.it/vendita-case/roma/?criterio=rilevanza&pag=1

The url contains a parameter, `pag`, that selects the page of results. Each page contains 25 announcements.

To reach at least 10,000 announcements, we need to scrape at least 400 pages (10,000 / 25 = 400).
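
The page count is a ceiling division, and the page urls differ only in the final number. A minimal sketch of that arithmetic follows; the names `BASE_URL`, `PER_PAGE`, `TARGET` and `page_urls` are illustrative and do not appear elsewhere in this notebook.

```python
import math

# Values taken from the text above; the names themselves are illustrative
BASE_URL = 'https://www.immobiliare.it/vendita-case/roma/?criterio=rilevanza&pag='
PER_PAGE = 25      # announcements shown on each results page
TARGET = 10000     # minimum number of announcements we want

# Ceiling division, so a partially needed last page still counts
pages_needed = math.ceil(TARGET / PER_PAGE)

def page_urls(n_pages):
    """Return the result-page urls for pages 1..n_pages (inclusive)."""
    return [BASE_URL + str(i) for i in range(1, n_pages + 1)]

urls = page_urls(pages_needed)
print(pages_needed)   # 400
print(urls[0])
```

Since some results are discarded later (new constructions, pages with missing fields), scraping somewhat more than this minimum leaves a margin.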

First we create a function that returns the urls of the announcements listed on a page.

In [2]:
def get_announces(url):
    response = get(url)

    html_soup = BeautifulSoup(response.text, 'html.parser')
    announce_containers = html_soup.find_all('p', class_ = 'titolo text-primary')
    
    urls = []
    
    for container in announce_containers:
        if "/nuove_costruzioni/" not in container.a['href']: 
            urls.append(container.a['href'])
        
    return urls
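
The selection and filtering logic can be tried offline on a small piece of markup. The sketch below is not the function above: it swaps BeautifulSoup for the standard-library `HTMLParser` so that it runs without network access or third-party packages, and the sample HTML and the `TitleLinkParser` class are made up for illustration.

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested in <p class="titolo text-primary">."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'p' and attrs.get('class') == 'titolo text-primary':
            self.in_title = True
        elif tag == 'a' and self.in_title:
            href = attrs.get('href', '')
            # Same filter as get_announces: skip new-construction entries
            if '/nuove_costruzioni/' not in href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_title = False

# Made-up markup that mimics the structure the selector expects
sample_html = '''
<p class="titolo text-primary"><a href="https://example.com/1-Vendita.html">A</a></p>
<p class="titolo text-primary"><a href="https://example.com/nuove_costruzioni/2/">B</a></p>
<p class="other"><a href="https://example.com/3.html">C</a></p>
'''

parser = TitleLinkParser()
parser.feed(sample_html)
print(parser.urls)   # ['https://example.com/1-Vendita.html']
```

Only the first link survives: the second is a new construction and the third is not inside a title paragraph.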

Let's create a list with all the announcement urls we need. We save it in a csv file to avoid scraping all the pages again. The scraping cell is left commented out, and the cell after it loads the saved file.

In [3]:
#url_list = []

#for i in range(1,450):
#    url = 'https://www.immobiliare.it/vendita-case/roma/?criterio=rilevanza&pag='
#    url_list = url_list + get_announces(url + str(i))

#with open('data/url_list.csv', 'w+', newline='') as myfile:
#    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
#    for url in url_list:
#        wr.writerow([url])

In [124]:
url_list = pd.read_csv('data/url_list.csv', header=None)
url_list = url_list[0]
url_list.head()

0    https://www.immobiliare.it/53131931-Vendita-Bi...
1    https://www.immobiliare.it/70420586-Vendita-Bi...
2    https://www.immobiliare.it/70288308-Vendita-Ap...
3    https://www.immobiliare.it/70114826-Vendita-Tr...
4    https://www.immobiliare.it/70355074-Vendita-Tr...
Name: 0, dtype: object
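
The save-then-reload pattern used here can also be wrapped in a helper that scrapes only when the file is missing. This is a sketch under assumptions, not the notebook's own code: `load_or_build` and `fake_scrape` are hypothetical names, and a temporary directory stands in for `data/`.

```python
import csv
import os
import tempfile

def load_or_build(path, build):
    """Return the list cached at `path`, calling `build()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, newline='') as f:
            return [row[0] for row in csv.reader(f) if row]
    items = build()
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for item in items:
            writer.writerow([item])
    return items

# Stand-in for the scraping step, so the demo needs no network access
calls = []
def fake_scrape():
    calls.append(1)
    return ['https://example.com/a.html', 'https://example.com/b.html']

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'url_list.csv')
    first = load_or_build(cache, fake_scrape)    # builds the list and writes the file
    second = load_or_build(cache, fake_scrape)   # reads the file back

print(first == second, len(calls))   # True 1
```

The second call never reaches `fake_scrape`, which is the behaviour we want for the slow scraping step.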

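Now we define the function that extracts the information we need from an announcement page: the number of rooms, the surface area, the number of bathrooms, the floor, and the description. It returns `False` when any of these cannot be found.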

In [189]:
def get_data(url):
    
    id = re.findall(r'(\d+)', url)[0] # Get announce ID parsing the url
    
    response = get(url)

    html_soup = BeautifulSoup(response.text, 'html.parser')
    data_container = html_soup.find('ul', class_ = 'list-inline list-piped features__list')
    
    if data_container is not None:
    
        for item in data_container.children:

            # Locate rooms number
            if item.find('div', class_= 'features__label') and item.find('div', class_= 'features__label').contents[0] == 'locali':
                rooms = item.find('span', class_ = 'text-bold').contents[0]
                rooms = re.sub('[^A-Za-z0-9]+', '', rooms)

            # Locate surface extension
            if item.find('div', class_= 'features__label') and item.find('div', class_= 'features__label').contents[0] == 'superficie':
                area = item.find('span', class_ = 'text-bold').contents[0]
                area = re.sub('[^A-Za-z0-9]+', '', area)

            # Locate bathrooms number    
            if item.find('div', class_= 'features__label') and item.find('div', class_= 'features__label').contents[0] == 'bagni':
                bathrooms = item.find('span', class_ = 'text-bold').contents[0]
                bathrooms = re.sub('[^A-Za-z0-9]+', '', bathrooms)

            # Locate floor number    
            if item.find('div', class_= 'features__label') and item.find('div', class_= 'features__label').contents[0] == 'piano':
                floor = item.find('abbr', class_ = 'text-bold').contents[0]
                floor = re.sub('[^A-Za-z0-9]+', '', floor)

        # Extract the description (once, after all the features have been read)
        try:
            description = html_soup.find('div', class_ = 'col-xs-12 description-text text-compressed').div.contents[0]
            description = re.sub('[^a-zA-Z0-9-_*. ]', '', description) # Remove special characters
            description = description.lstrip(' ') # Remove leading blank spaces
        except AttributeError:
            return False

        # A feature missing from the page leaves its variable undefined
        try:
            return [[id,rooms,area,bathrooms,floor],[id,description]]
        except NameError:
            return False
    else:
        return False
            

In [190]:
get_data('https://www.immobiliare.it/70370910-Vendita-Quadrilocale-via-Silvestri-Roma.html')

False
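
The three regular expressions used in `get_data` can be checked in isolation. In the sketch below the url is the one from the cell above, while `raw_area` and `raw_description` are made-up inputs chosen to show what each pattern removes.

```python
import re

url = 'https://www.immobiliare.it/70370910-Vendita-Quadrilocale-via-Silvestri-Roma.html'

# The first run of digits in the url is the announcement ID
announce_id = re.findall(r'(\d+)', url)[0]

# Feature values: drop everything that is not an ASCII letter or digit
raw_area = ' 95 m² '                               # made-up example value
area = re.sub('[^A-Za-z0-9]+', '', raw_area)

# Description: keep ASCII letters, digits, spaces and the characters - _ * .
raw_description = '  Luminoso, 3° piano: città!'   # made-up example text
description = re.sub('[^a-zA-Z0-9-_*. ]', '', raw_description).lstrip(' ')

print(announce_id)   # 70370910
print(area)          # 95m
print(description)   # Luminoso 3 piano citt
```

Note that the description pattern also strips accented letters such as `à`, which are common in Italian text, so some words reach the later steps truncated.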

Now we can iterate over the url list, extracting the data and storing it in two dataframes.

To save execution time in later runs, we save the two dataframes in two csv files.
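
The collection step can be rehearsed offline before running it against the site. The sketch below assumes a hypothetical `fake_get_data` that mimics the return shape of `get_data` (two lists, or `False`), and writes csv text to memory with the standard library instead of using dataframes.

```python
import csv
import io

def fake_get_data(url):
    """Hypothetical stand-in for get_data: returns False for urls containing 'bad'."""
    if 'bad' in url:
        return False
    announce_id = url.rsplit('/', 1)[-1]
    return [[announce_id, '3', '80', '1', '2'], [announce_id, 'Sample description']]

urls = ['https://example.com/1', 'https://example.com/bad', 'https://example.com/2']

data_rows, description_rows = [], []
for url in urls:
    result = fake_get_data(url)   # one call per url
    if result:                    # skip pages that could not be parsed
        data_rows.append(result[0])
        description_rows.append(result[1])

def to_csv_text(header, rows):
    """Serialise a header plus rows to csv text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

data_csv = to_csv_text(['ID', 'Rooms', 'Area', 'Bathrooms', 'Floor'], data_rows)
description_csv = to_csv_text(['ID', 'Description'], description_rows)
print(len(data_rows), len(description_rows))   # 2 2
```

Calling the extraction function once per url and keeping its result matters on the real site, where every call is an HTTP request.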

In [None]:
data_rows = []
description_rows = []

for url in url_list:
    
    print(url)
    
    # Fetch and parse each page only once
    result = get_data(url)
    
    if result:
        # Collect the rows in plain lists; appending to a dataframe row by row is slow
        data_rows.append(result[0])
        description_rows.append(result[1])

# Build the two dataframes from the collected rows
data_df = pd.DataFrame(data_rows, columns = ['ID','Rooms','Area','Bathrooms','Floor'])
description_df = pd.DataFrame(description_rows, columns = ['ID','Description'])

data_df.to_csv('data/data.csv')
description_df.to_csv('data/description.csv')

https://www.immobiliare.it/53131931-Vendita-Bilocale-viale-Italo-Calvino-Roma.html
https://www.immobiliare.it/70420586-Vendita-Bilocale-via-Prenestina-59-Roma.html
https://www.immobiliare.it/70288308-Vendita-Appartamento-via-della-Fotografia-Roma.html
https://www.immobiliare.it/70114826-Vendita-Trilocale-via-Genserico-Fontana-11-Roma.html
https://www.immobiliare.it/70355074-Vendita-Trilocale-viale-Cortina-D-Ampezzo-Roma.html
https://www.immobiliare.it/69659060-Vendita-Appartamento-via-Germanico-24-Roma.html
https://www.immobiliare.it/66479763-Vendita-Appartamento-via-Sesto-Rufo-42-Roma.html
https://www.immobiliare.it/61733354-Vendita-Appartamento-via-Dandolo-Roma.html
https://www.immobiliare.it/64762998-Vendita-Attico-Mansarda-viale-Ezra-Pound-Roma.html
https://www.immobiliare.it/70043828-Vendita-Appartamento-via-del-Calice-Roma.html
https://www.immobiliare.it/69084658-Vendita-Appartamento-via-Caroncini-Roma.html
https://www.immobiliare.it/70343364-Vendita-Trilocale-viale-della-Venezia

https://www.immobiliare.it/70376488-Vendita-Quadrilocale-viale-Tito-Labieno-Roma.html
https://www.immobiliare.it/68192795-Vendita-Attico-Mansarda-largo-Arturo-Donaggio-Roma.html
https://www.immobiliare.it/68192569-Vendita-Appartamento-via-Orazio-Roma.html
https://www.immobiliare.it/70107928-Vendita-Trilocale-via-Fontanellato-75-Roma.html
https://www.immobiliare.it/69489816-Vendita-Appartamento-via-degli-Spagnoli-Roma.html
https://www.immobiliare.it/66650211-Vendita-Appartamento-via-dei-Colli-della-Roma.html
https://www.immobiliare.it/69722980-Vendita-Quadrilocale-via-Cristoforo-Sabbadino-Roma.html
https://www.immobiliare.it/69090654-Vendita-Bilocale-via-Ezio-Sciamanna-12-Roma.html
https://www.immobiliare.it/67991129-Vendita-Loft-Open-Space-via-della-Fontana-5-Roma.html
https://www.immobiliare.it/68986837-Vendita-Bilocale-via-del-Cottanello-Roma.html
https://www.immobiliare.it/68601391-Vendita-Attico-Mansarda-via-Nicolo-Piccolomini-Roma.html
https://www.immobiliare.it/67414483-Vendita-A

https://www.immobiliare.it/68008947-Vendita-Villa-via-Cristoforo-Sabbadino-151D-Roma.html
https://www.immobiliare.it/69128304-Vendita-Villa-via-degli-Aldobrandeschi-12-Roma.html
https://www.immobiliare.it/68009635-Vendita-Bilocale-via-Michelangelo-Peroglio-Roma.html
https://www.immobiliare.it/66028417-Vendita-Appartamento-via-Nizza-Roma.html
https://www.immobiliare.it/70321062-Vendita-Villetta-a-schiera-via-Simonide-2-Roma.html
https://www.immobiliare.it/68445143-Vendita-Trilocale-via-Igea-19-Roma.html
https://www.immobiliare.it/67995033-Vendita-Quadrilocale-viale-della-Tecnica-Roma.html
https://www.immobiliare.it/70178590-Vendita-Appartamento-via-di-Monte-del-Gallo-6-Roma.html
https://www.immobiliare.it/70084040-Vendita-Bilocale-via-Ostiense-Roma.html
https://www.immobiliare.it/66987121-Vendita-Villa-via-Tommaso-Traetta-Roma.html
https://www.immobiliare.it/69907978-Vendita-Appartamento-via-Onofrio-Panvinio-Roma.html
https://www.immobiliare.it/69350052-Vendita-Appartamento-viale-Marco-

https://www.immobiliare.it/69629466-Vendita-Appartamento-via-Francesco-Giacomelli-Roma.html
https://www.immobiliare.it/69249352-Vendita-Villa-viale-Gorgia-di-Leontini-Roma.html
https://www.immobiliare.it/67586735-Vendita-Trilocale-via-rubra-Roma.html
https://www.immobiliare.it/65387600-Vendita-Trilocale-via-Simeto-12-Roma.html
https://www.immobiliare.it/69581468-Vendita-Appartamento-via-Salaria-414-Roma.html
https://www.immobiliare.it/70376062-Vendita-Bilocale-via-Australia-Roma.html
https://www.immobiliare.it/65131724-Vendita-Attico-Mansarda-via-Murisengo-Roma.html
https://www.immobiliare.it/68295663-Vendita-Bilocale-piazzale-Cina-Roma.html
https://www.immobiliare.it/69605886-Vendita-Villa-via-Teodoto-Roma.html
https://www.immobiliare.it/68559089-Vendita-Villetta-a-schiera-via-Carlo-Maria-Rosini-Roma.html
https://www.immobiliare.it/68538857-Vendita-Bilocale-via-Michele-Scherillo-Roma.html
https://www.immobiliare.it/68890657-Vendita-Villa-via-Domenico-Cortopassi-Roma.html
https://www.i