# Problem Statement

**PROJECT 1** <br>
**Exploratory analysis and predictive modeling of housing prices in Barcelona using Python and SQL**

## Objective
Develop a complete analysis and a predictive model for housing prices in Barcelona, using data extracted from the Fotocasa portal. The goal is to apply data extraction, manipulation, and analysis techniques, as well as Machine Learning algorithms, to predict housing prices from a range of property characteristics.

## Data Description
- **price**: Price of the property.
- **rooms**: Number of rooms.
- **bathroom**: Number of bathrooms.
- **lift**: Whether the building has a lift (elevator).
- **terrace**: Whether the property has a terrace.
- **square_meters**: Floor area in square meters.
- **real_state**: Type of property (real estate), for example flat or attic.
- **neighborhood**: Neighborhood where the property is located.
- **square_meters_price**: Price per square meter.
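
Judging from the rows previewed in the Data Overview section, `square_meters_price` appears to be `price / square_meters` (for example, 750 / 60 = 12.5). The following is a minimal sketch to check that assumption on three rows copied from the `df.head()` output; the names `sample`, `derived`, and `matches` are illustrative.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df.head() preview in this notebook
sample = pd.DataFrame({
    "price": [750, 770, 1300],
    "square_meters": [60, 59, 30],
    "square_meters_price": [12.5, 13.050847, 43.333333],
})

# Assumption: square_meters_price is price divided by square_meters
derived = sample["price"] / sample["square_meters"]

# Compare with a small tolerance because the stored values are rounded
matches = np.isclose(derived, sample["square_meters_price"], atol=1e-6)
print(matches.all())
```

If the relationship also holds on the full dataset, the column is derived from the target `price` and would leak it if used as a predictor.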

## Importing necessary libraries

In [2]:
import pandas as pd

## Loading the Dataset

In [3]:
df = pd.read_csv('Barcelona_Fotocasa_HousingPrices.csv')
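
The previews in the Data Overview section show an `Unnamed: 0` column that duplicates the row index. One option, sketched here on an in-memory CSV snippet rather than the real file, is to pass `index_col=0` to `pd.read_csv` so that the first column becomes the index on load. The `csv_text` layout is an assumption about how the file is stored, and `df_demo` is an illustrative name.

```python
import io

import pandas as pd

# Two rows copied from the preview; the leading unnamed column is an
# assumption about how the CSV file stores the saved index.
csv_text = """,price,rooms,bathroom,lift,terrace,square_meters,real_state,neighborhood,square_meters_price
0,750,3,1,True,False,60,flat,Horta- Guinardo,12.5
1,770,2,1,True,False,59,flat,Sant Andreu,13.050847
"""

# index_col=0 uses the first CSV column as the index instead of
# loading it as a regular "Unnamed: 0" column.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_demo.columns.tolist())
```

The rest of this notebook keeps the default load, so `Unnamed: 0` still appears in the outputs shown here.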

## Data Overview

In [8]:
df.head()  # preview the first 5 rows

Unnamed: 0.1,Unnamed: 0,price,rooms,bathroom,lift,terrace,square_meters,real_state,neighborhood,square_meters_price
0,0,750,3,1,True,False,60,flat,Horta- Guinardo,12.5
1,1,770,2,1,True,False,59,flat,Sant Andreu,13.050847
2,2,1300,1,1,True,True,30,flat,Gràcia,43.333333
3,3,2800,1,1,True,True,70,flat,Ciutat Vella,40.0
4,4,720,2,1,True,False,44,flat,Sant Andreu,16.363636


In [6]:
df.tail()  # preview the last 5 rows

Unnamed: 0.1,Unnamed: 0,price,rooms,bathroom,lift,terrace,square_meters,real_state,neighborhood,square_meters_price
8183,8183,1075,2,2,False,False,65,flat,Gràcia,16.538462
8184,8184,1500,3,2,True,False,110,flat,Eixample,13.636364
8185,8185,1500,2,2,True,True,90,flat,Sarria-Sant Gervasi,16.666667
8186,8186,1500,3,2,True,False,110,flat,Eixample,13.636364
8187,8187,1500,3,2,True,False,110,flat,Eixample,13.636364
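
Rows 8184, 8186, and 8187 above are identical in every column except the index copy, which suggests the listing may contain repeated entries. The following is a hedged sketch for counting such repeats, run on the five tail rows shown above. `feature_cols` and `n_repeats` are illustrative names, and the data alone does not settle whether repeats are true duplicates or distinct units in the same building.

```python
import pandas as pd

# The five rows from the df.tail() output above
tail = pd.DataFrame({
    "Unnamed: 0": [8183, 8184, 8185, 8186, 8187],
    "price": [1075, 1500, 1500, 1500, 1500],
    "rooms": [2, 3, 2, 3, 3],
    "bathroom": [2, 2, 2, 2, 2],
    "lift": [False, True, True, True, True],
    "terrace": [False, False, True, False, False],
    "square_meters": [65, 110, 90, 110, 110],
    "real_state": ["flat"] * 5,
    "neighborhood": ["Gràcia", "Eixample", "Sarria-Sant Gervasi",
                     "Eixample", "Eixample"],
    "square_meters_price": [16.538462, 13.636364, 16.666667,
                            13.636364, 13.636364],
}, index=[8183, 8184, 8185, 8186, 8187])

# Ignore the index copy, which is unique per row by construction
feature_cols = [c for c in tail.columns if c != "Unnamed: 0"]

# duplicated() flags every repeat after the first occurrence
n_repeats = tail.duplicated(subset=feature_cols).sum()
print(n_repeats)
```

On the full dataset, the same `duplicated` call on `df` would give the total number of repeated listings.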


In [7]:
df.sample(20)  # preview 20 random rows

Unnamed: 0.1,Unnamed: 0,price,rooms,bathroom,lift,terrace,square_meters,real_state,neighborhood,square_meters_price
915,915,1100,2,1,False,True,35,flat,Eixample,31.428571
6089,6089,650,1,1,False,False,34,flat,Ciutat Vella,19.117647
3373,3373,750,2,1,True,False,49,flat,Horta- Guinardo,15.306122
3847,3847,900,3,2,True,False,63,flat,Gràcia,14.285714
3075,3075,950,3,2,True,False,100,flat,Sant Martí,9.5
7820,7820,2200,2,2,True,False,68,flat,Gràcia,32.352941
2269,2269,1350,2,1,True,True,75,attic,Eixample,18.0
7775,7775,1250,4,2,True,False,95,flat,Sarria-Sant Gervasi,13.157895
1387,1387,1275,3,2,True,True,93,flat,Sarria-Sant Gervasi,13.709677
8054,8054,761,2,1,False,True,59,attic,Sant Andreu,12.898305


In [9]:
print("There are", df.shape[0], 'rows and', df.shape[1], "columns.") # number of observations and features


There are 8188 rows and 10 columns.


In [10]:
df.dtypes # data types

Unnamed: 0               int64
price                    int64
rooms                    int64
bathroom                 int64
lift                      bool
terrace                   bool
square_meters            int64
real_state              object
neighborhood            object
square_meters_price    float64
dtype: object

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           8188 non-null   int64  
 1   price                8188 non-null   int64  
 2   rooms                8188 non-null   int64  
 3   bathroom             8188 non-null   int64  
 4   lift                 8188 non-null   bool   
 5   terrace              8188 non-null   bool   
 6   square_meters        8188 non-null   int64  
 7   real_state           7920 non-null   object 
 8   neighborhood         8188 non-null   object 
 9   square_meters_price  8188 non-null   float64
dtypes: bool(2), float64(1), int64(5), object(2)
memory usage: 527.9+ KB


In [13]:
df.describe(include="all").T # statistical summary of the data.

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Unnamed: 0,8188.0,,,,4093.5,2363.816335,0.0,2046.75,4093.5,6140.25,8187.0
price,8188.0,,,,1444.092574,1125.886215,320.0,875.0,1100.0,1540.0,15000.0
rooms,8188.0,,,,2.420738,1.138592,0.0,2.0,2.0,3.0,10.0
bathroom,8188.0,,,,1.508793,0.732798,1.0,1.0,1.0,2.0,8.0
lift,8188.0,2.0,True,5710.0,,,,,,,
terrace,8188.0,2.0,False,6518.0,,,,,,,
square_meters,8188.0,,,,84.610161,47.874028,10.0,56.0,73.0,95.0,679.0
real_state,7920.0,4.0,flat,6505.0,,,,,,,
neighborhood,8188.0,10.0,Eixample,2401.0,,,,,,,
square_meters_price,8188.0,,,,17.739121,9.245241,4.910714,12.790698,15.306122,19.444444,186.666667


## Consolidated notes on Data Overview

- There are 8188 rows and 10 columns.
- The `Unnamed: 0` column duplicates the row index and should be dropped.
- Data types are consistent with the data description.
- `real_state` has 268 missing values (7920 non-null out of 8188).
- There are four property types in `real_state`; the most common is "flat" (6505 units).
- Most units do not have a terrace (6518 of 8188).
- Most units have a lift (5710 of 8188).
- The neighborhood with the most units is "Eixample" (2401).
- Unit sizes range from 10 m2 to 679 m2, with a mean of 84.61 m2.
- Unit prices range from 320 EUR to 15000 EUR per month, with a mean of 1444 EUR per month.
- Prices are assumed to be monthly rents, so they are treated as EUR per month.
- Prices per square meter range from 4.91 EUR/m2 to 186.67 EUR/m2, with a mean of 17.74 EUR/m2.
- Some units are listed with zero rooms (to be investigated).
- The target variable for modeling is "price".
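
The follow-up actions in these notes (dropping the index copy, quantifying missing `real_state` values, and listing zero-room units) could be sketched as below. The `toy` frame is hypothetical data made up for illustration, not rows from the dataset, and `inspect_overview_issues` is an illustrative helper name.

```python
import pandas as pd


def inspect_overview_issues(df):
    """Return a cleaned copy plus the two checks flagged in the notes."""
    # Drop the index copy if present; errors="ignore" makes re-runs safe
    cleaned = df.drop(columns=["Unnamed: 0"], errors="ignore")
    # Number of missing values in real_state
    n_missing = int(cleaned["real_state"].isna().sum())
    # Units listed with zero rooms, kept for manual inspection
    zero_rooms = cleaned[cleaned["rooms"] == 0]
    return cleaned, n_missing, zero_rooms


# Hypothetical toy data (NOT from the dataset), only to exercise the helper
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "price": [700, 900, 1200],
    "rooms": [0, 2, 3],
    "real_state": ["flat", None, "flat"],
})

cleaned, n_missing, zero_rooms = inspect_overview_issues(toy)
print(cleaned.columns.tolist(), n_missing, len(zero_rooms))
```

On the full data, the equivalent call would be `inspect_overview_issues(df)`. How to treat the missing `real_state` values and the zero-room units is left open here.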

# Exploratory Data Analysis (EDA)

## EDA Functions