# EDA Project

In this notebook, you should perform EDA (exploratory data analysis) on the given dataset: real_estate_dataset.csv

We do not want to give you precise steps to follow, but based on previous lessons you should have an idea of which steps are needed and which should not be skipped. To guide you a bit, here are some ideas:

* it is always good to check how much data is missing. How can we handle missing data in this dataset?
* what data types do the columns have?
* what could be set as the index? And does it make sense here?
* the price column can surely be better formatted
* can we spot any outliers? What method can be used to find them?
* Total area should be numeric and in square metres (we are in Austria); the same goes for lot size
* who is the most active broker?
* what is the most sold object type?
* can you make a nice visualisation of the data?
* is there a correlation between price and size? Can we see it?
* are there any other patterns to describe?

## What we want you to submit:
* send us a link to your notebook on GitHub and we will give you feedback
* comment everything: explain your thoughts, why you think a column should be dropped, why you chose a visualisation, everything.
* you do not have to follow all the ideas above, but your analysis should go from start to end in logical steps
* try to summarize, in at least 3 sentences, what we can conclude about the dataset.

In [11]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Display plot output inline, directly below the code cell that produced it.
%matplotlib inline

In [2]:
#Load dataset
realstate = pd.read_csv("real_estate_dataset.csv")

In [8]:
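# Peek into the data. It looks troublesome: several columns pack a lot of information into one field.
# Note: without parentheses, realstate.head returns the bound method itself, so the output below
# is its repr (the whole frame); realstate.head() would show only the first five rows.
realstate.head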

<bound method NDFrame.head of                                                     url  \
0     http://www.loopnet.com/Listing/20157634/6060-E...   
1     http://www.loopnet.com/Listing/18499430/1901-A...   
2     http://www.loopnet.com/Listing/20000996/4510-O...   
3     http://www.loopnet.com/Listing/19729524/7645-W...   
4     http://www.loopnet.com/Listing/19535739/171-Mu...   
...                                                 ...   
9071  http://www.loopnet.com/Listing/19833597/316-Br...   
9072  http://www.loopnet.com/Listing/19875908/3005-3...   
9073  http://www.loopnet.com/Listing/19627936/12-Lar...   
9074  http://www.loopnet.com/Listing/20080133/2688-A...   
9075  http://www.loopnet.com/Listing/20003807/1775-G...   

                      Address  \
0       Anchorage, AK 99518 ·   
1       Fairbanks, AK 99701 ·   
2       Anchorage, AK 99502 ·   
3         Wasilla, AK 99654 ·   
4       Anchorage, AK 99504 ·   
...                       ...   
9071  Thermopolis, WY 82443 · 

### Clearly, I should tidy up the dataset before drawing any conclusions from it. At first glance, there are many missing values and the data inside each column is not uniformly reported, but there are a few attributes to organize the data around...

In [9]:
realstate.columns

Index(['url', 'Address', 'City', 'Owner Name', 'Mailing Address', 'Price',
       'Number of Units', 'Total area', 'Number Of Stories', 'Lot Size',
       'Type', 'Year Built', 'Other Info', 'Images', 'broker', 'phone',
       'EMAIL', 'secondary broker', 'phone, 2', 'Email'],
      dtype='object')

In [12]:
realstate.shape

(9076, 20)

What is the percentage of null values per column?

In [13]:
np.round((realstate.isna().sum() / realstate.shape[0]) * 100, 2)

url                   0.00
Address               0.00
City                  0.00
Owner Name           99.97
Mailing Address      99.97
Price                 0.00
Number of Units       0.00
Total area            0.00
Number Of Stories     0.00
Lot Size              0.00
Type                  0.00
Year Built            0.00
Other Info            1.67
Images                0.00
broker                1.67
phone                 1.67
EMAIL                88.84
secondary broker     70.38
phone, 2             70.38
Email                97.49
dtype: float64

The columns Owner Name, Mailing Address, EMAIL, secondary broker, "phone, 2" and Email (the second broker's email) have a significant share of null values. I presume that these missing values do not imply a meaningful loss of information, as they may be incidental. The important information here is the characteristics of the property on offer, the broker and the price...

The other fields contain only a small percentage of missing values, so they may not be especially problematic. I have not decided yet whether to discard the rows that lack this information.
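One way to turn the reasoning above into a repeatable step is to compute the null share per column and drop the columns above a cut-off. The sketch below uses a small made-up frame in place of `realstate`, and the 50% threshold (`MAX_NULL_PCT`) is only an assumption for illustration, not a rule taken from the assignment.

```python
import pandas as pd

# Toy frame standing in for realstate; the values are made up for illustration.
toy = pd.DataFrame({
    "Price": ["$1,000", "$2,000", "$3,000", "$4,000"],
    "broker": ["A", "B", None, "A"],
    "Owner Name": [None, None, None, "X LLC"],
})

# Share of missing values per column, as a percentage.
null_pct = toy.isna().mean().mul(100).round(2)

# Hypothetical cut-off: drop columns that are missing in more than half of the rows.
MAX_NULL_PCT = 50
sparse_cols = null_pct[null_pct > MAX_NULL_PCT].index.tolist()
trimmed = toy.drop(columns=sparse_cols)

print(null_pct)
print("dropped:", sparse_cols)
```

With the real data, the same three lines could be applied to `realstate` directly; whether 50% is a sensible cut-off depends on how much the sparse columns matter for the analysis.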

In [15]:
realstate.info()
#realstate.describe().apply(lambda s: s.apply('{0:.2f}'.format))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9076 entries, 0 to 9075
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   url                9076 non-null   object
 1   Address            9076 non-null   object
 2   City               9076 non-null   object
 3   Owner Name         3 non-null      object
 4   Mailing Address    3 non-null      object
 5   Price              9076 non-null   object
 6   Number of Units    9076 non-null   object
 7   Total area         9076 non-null   object
 8   Number Of Stories  9076 non-null   object
 9   Lot Size           9076 non-null   object
 10  Type               9076 non-null   object
 11  Year Built         9076 non-null   object
 12  Other Info         8924 non-null   object
 13  Images             9076 non-null   object
 14  broker             8924 non-null   object
 15  phone              8924 non-null   object
 16  EMAIL              1013 non-null   object


In [None]:
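# Draft type conversions (cell not executed). The column names used here ('price', 'size(sqft)',
# 'longitude', 'latitude', 'last_update') do not appear in realstate.columns above, so each line
# would raise a KeyError as written.
realstate['price'] = realstate['price'].astype(int)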
realstate['price'].describe().apply(lambda x: format(x, 'f'))
realstate['size(sqft)'] = realstate['size(sqft)'].astype(int)
realstate['size(sqft)'].describe().apply(lambda x: format(x, 'f'))
realstate['longitude'] = realstate['longitude'].astype(float)
realstate['latitude'] = realstate['latitude'].astype(float)
realstate['last_update'] = pd.to_datetime(realstate['last_update'], format='%Y-%m-%dT%H:%M:%SZ', errors='coerce')
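Since Price is stored as `object` with values such as `$3,500,000`, a plain `astype(int)` would likely fail. A minimal sketch of one possible conversion follows; `parse_price` is a hypothetical helper, and the sample values (including the "Price Not Disclosed" placeholder) are made up to show how unparseable entries end up as NaN.

```python
import pandas as pd

def parse_price(series: pd.Series) -> pd.Series:
    """Turn strings such as '$3,500,000' into floats; unparseable entries become NaN."""
    # Strip the dollar sign, thousands separators and any whitespace.
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    # errors="coerce" replaces anything that is still not a number with NaN.
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample values for illustration only.
sample = pd.Series(["$3,500,000", "$125,000", "Price Not Disclosed", None])
prices = parse_price(sample)
print(prices)
```

On the real frame this could be used as `realstate["Price"] = parse_price(realstate["Price"])`, after checking how many values turn into NaN.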

In [16]:
realstate.describe()

|  | url | Address | City | Owner Name | Mailing Address | Price | Number of Units | Total area | Number Of Stories | Lot Size | Type | Year Built | Other Info | Images | broker | phone | EMAIL | secondary broker | phone, 2 | Email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9076 | 9076 | 9076 | 3 | 3 | 9076 | 9076 | 9076 | 9076 | 9076 | 9076 | 9076 | 8924 | 9076 | 8924 | 8924 | 1013 | 2688 | 2688 | 228 |
| unique | 9076 | 5031 | 8812 | 3 | 3 | 3544 | 155 | 3491 | 23 | 4601 | 13 | 137 | 8703 | 8873 | 5370 | 5034 | 742 | 1794 | 1645 | 185 |
| top | http://www.loopnet.com/Listing/20157634/6060-E... |  | Access Denied\nAccess Denied\nYou are in breac... | ODEX INVESTMENTS V LLC | 1001 4TH AVE STE 4500, SEATTLE, WA 98154 | $3,500,000 |  |  |  |  | Land |  | " |  | Glen Kunofsky | tel:+12124305115 | alvin@themansourgroup.com | Kevin Mansour | tel:+18583733187 | kevin@themansourgroup.com |
| freq | 1 | 152 | 152 | 1 | 1 | 290 | 8527 | 4315 | 6181 | 1026 | 3232 | 5831 | 44 | 191 | 67 | 67 | 11 | 33 | 33 | 9 |
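The `top` row above shows that the most frequent City value starts with "Access Denied" (152 rows), which suggests that some pages failed to load when the data was collected. A small sketch of how such rows could be flagged is below; the toy frame is made up, and treating these rows as failed page loads is my assumption.

```python
import pandas as pd

# Toy frame standing in for realstate; only the "Access Denied" prefix comes from the output above.
toy = pd.DataFrame({
    "url": ["u1", "u2", "u3"],
    "City": ["Anchorage", "Access Denied\nAccess Denied", "Fairbanks"],
})

# Flag rows whose City field holds the error text instead of a place name.
failed = toy["City"].str.startswith("Access Denied", na=False)
print(failed.sum(), "rows look like failed page loads")

# Keep only the rows that were not flagged.
clean = toy.loc[~failed].reset_index(drop=True)
```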


### Images and URL lead to a web page. 
I looked into it, and a quick and easy way to check whether a web page is available is to open it with the webbrowser package.


In [None]:
# Open the first listing's URL in a new browser tab (kept commented out).
# import webbrowser
# url_1 = realstate.url[0]
# webbrowser.open_new_tab(url_1)
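Opening a tab works for one link, but checking many URLs by hand does not scale. As an alternative (not what the cell above does), here is a hedged sketch that uses the standard library to send a HEAD request; `is_reachable` is a hypothetical helper name, and some servers may reject HEAD requests or block automated clients, so a False result is only a hint.

```python
import http.client
import urllib.error
import urllib.request

def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` gets a response with status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        # Covers HTTP errors, DNS/connection failures, timeouts and malformed URLs.
        return False

# Example usage on the real data (commented out because it needs network access):
# is_reachable(realstate.url[0])
```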

### Type, broker and place seem like the right attributes to organize the data around. Statistics on those should be interesting; how they intersect with Price, Area and Year Built is the key information.
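A minimal sketch of the statistics mentioned above, using a made-up frame: `price_num` is a hypothetical column holding the price already converted to a number, and the broker names are invented. Note that this counts listings, so "most active" here means "appears in the most listings".

```python
import pandas as pd

# Toy frame standing in for the cleaned realstate data; all values are made up.
toy = pd.DataFrame({
    "Type": ["Land", "Land", "Retail", "Office", "Land"],
    "broker": ["Ann", "Bob", "Ann", "Ann", None],
    "price_num": [100.0, 300.0, 250.0, 400.0, 200.0],
})

# Number of listings per broker (missing brokers are ignored by value_counts).
listings_per_broker = toy["broker"].value_counts()
most_active_broker = listings_per_broker.idxmax()

# Number of listings per object type.
listings_per_type = toy["Type"].value_counts()
most_common_type = listings_per_type.idxmax()

# Listing count and median price for each object type.
price_by_type = toy.groupby("Type")["price_num"].agg(["count", "median"])

print(most_active_broker, most_common_type)
print(price_by_type)
```

The median is used instead of the mean because listing prices tend to contain extreme values, which is also something the outlier check should confirm.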