This notebook was prepared by Manuel Rafael Vázquez Gandullo. Source and license info is on [GitHub](https://github.com/mrvgME/Airbnb_Project).

## Madrid Airbnb Data Project Notebook

The objective of this notebook is to explore the Airbnb dataset in order to build the database for Grafana and to extract insights from the data for later modelling. The notebook is organized as follows:
- Explore the data
- Modelling

### Setup imports and variables

In [3]:
import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

True

In [4]:
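# Project root, expected to be defined as `project_dir` in the .env file loaded above
project_dir = os.getenv('project_dir')
data_path = r'data/raw/kaggle_data'
file = 'listings.csv'

# Full path to the raw listings file
file_path = os.path.join(project_dir, data_path, file)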
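
One caveat with the cell above: `os.getenv` returns `None` when the variable is not set, and `os.path.join(None, ...)` then fails with a `TypeError` that does not name the cause. A minimal sketch of a guard, assuming the `.env` file is meant to define `project_dir`; the helper name `build_file_path` is hypothetical and not part of the original notebook:

```python
import os


def build_file_path(env_var="project_dir",
                    data_path="data/raw/kaggle_data",
                    file_name="listings.csv"):
    """Join the project root read from the environment with the data path.

    Hypothetical helper for illustration only.
    """
    project_dir = os.getenv(env_var)
    if project_dir is None:
        # os.path.join(None, ...) would raise a TypeError, so fail early
        # with a message that names the missing variable.
        raise RuntimeError(
            f"Environment variable {env_var!r} is not set; check the .env file."
        )
    return os.path.join(project_dir, data_path, file_name)
```

Calling `build_file_path()` after `load_dotenv()` would give the same path as `file_path` above whenever the variable is present.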

### Explore the data

Read the data and preview the first five rows:

In [8]:
df = pd.read_csv(file_path)
df.head(5)

,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,6369,"Rooftop terrace room , ensuite bathroom",13660,Simon,Chamartín,Hispanoamérica,40.45724,-3.67688,Private room,60,1,78,2020-09-20,0.58,1,180
1,21853,Bright and airy room,83531,Abdel,Latina,Cármenes,40.40381,-3.7413,Private room,31,4,33,2018-07-15,0.42,2,364
2,23001,Apartmento Arganzuela- Madrid Rio,82175,Jesus,Arganzuela,Legazpi,40.3884,-3.69511,Entire home/apt,50,15,0,,,7,1
3,24805,Gran Via Studio Madrid,346366726,A,Centro,Universidad,40.42183,-3.70529,Entire home/apt,92,5,10,2020-03-01,0.13,1,72
4,26825,Single Room whith private Bathroom,114340,Agustina,Arganzuela,Legazpi,40.38975,-3.69018,Private room,26,2,149,2020-03-12,1.12,1,365


In [10]:
df.tail(5)

,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
19613,49185822,Habitación con TV con Netflix en Lavapiés,172011610,Belaid,Centro,Embajadores,40.40756,-3.69937,Private room,23,30,0,,,8,349
19614,49186179,Habitación con TV con Netflix en Gaztambide,172011610,Belaid,Chamberí,Gaztambide,40.43706,-3.71364,Private room,21,30,0,,,8,350
19615,49187258,Habitación con TV con Netflix en Chamberí,172011610,Belaid,Chamberí,Arapiles,40.43857,-3.70715,Private room,22,7,0,,,8,364
19616,49187471,Habitación con TV con Netflix en Goya,172011610,Belaid,Salamanca,Guindalera,40.43027,-3.66759,Private room,19,30,0,,,8,349
19617,49187791,Habitación con TV con Netflix en Chamberí,172011610,Belaid,Chamberí,Arapiles,40.43484,-3.70667,Private room,20,30,0,,,8,349


View the data types of each column:

In [11]:
df.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

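In pandas, the 'object' dtype usually means the column holds strings, which most machine learning algorithms cannot consume directly. If we want to use these columns as features, we'll need to convert them to numeric representations. Note that `last_review` is a date stored as a string.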
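
As an illustration of what such a conversion can look like, here is a minimal sketch on a toy frame built from the first three rows of the `df.head()` output. The choice of encodings (datetime parsing, one-hot encoding, integer category codes) is an assumption for demonstration, not necessarily the approach taken later in the project:

```python
import pandas as pd

# Toy frame with three rows copied from the df.head() output above.
sample = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt"],
    "neighbourhood_group": ["Chamartín", "Latina", "Arganzuela"],
    "last_review": ["2020-09-20", "2018-07-15", None],
})

# Dates stored as strings become a datetime dtype; missing values become NaT.
sample["last_review"] = pd.to_datetime(sample["last_review"])

# One-hot encode a low-cardinality categorical column.
room_dummies = pd.get_dummies(sample["room_type"], prefix="room_type")

# Integer codes are a compact alternative, but they impose an arbitrary
# ordering (alphabetical here) that some models may misinterpret.
neighbourhood_codes = sample["neighbourhood_group"].astype("category").cat.codes
```

One-hot encoding suits columns with few distinct values such as `room_type`; a column like `neighbourhood`, with many distinct values, would produce a much wider frame.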

Get some basic information on the DataFrame:

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19618 entries, 0 to 19617
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              19618 non-null  int64  
 1   name                            19615 non-null  object 
 2   host_id                         19618 non-null  int64  
 3   host_name                       19091 non-null  object 
 4   neighbourhood_group             19618 non-null  object 
 5   neighbourhood                   19618 non-null  object 
 6   latitude                        19618 non-null  float64
 7   longitude                       19618 non-null  float64
 8   room_type                       19618 non-null  object 
 9   price                           19618 non-null  int64  
 10  minimum_nights                  19618 non-null  int64  
 11  number_of_reviews               19618 non-null  int64  
 12  last_review                     
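
The non-null counts above show gaps in `name` (19615 of 19618) and `host_name` (19091 of 19618). Missing values can also be counted directly. The following is a minimal sketch on a toy frame that copies a few columns of the first three rows from the `df.head()` output:

```python
import pandas as pd

# First three rows of a few columns, copied from the df.head() output.
sample = pd.DataFrame({
    "host_name": ["Simon", "Abdel", "Jesus"],
    "number_of_reviews": [78, 33, 0],
    "last_review": ["2020-09-20", "2018-07-15", None],
    "reviews_per_month": [0.58, 0.42, None],
})

# Number of missing values per column.
missing_counts = sample.isna().sum()

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = sample.isna().mean()

# Show only the columns that actually have gaps.
print(missing_counts[missing_counts > 0])
```

On the full DataFrame the same calls would be `df.isna().sum()` and `df.isna().mean()`. In the rows shown here, the listing with missing `last_review` and `reviews_per_month` is the one with zero reviews, which is worth checking against the whole dataset before deciding how to fill those columns.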

Generate various descriptive statistics on the DataFrame:

In [15]:
df.describe()

,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,19618.0,19618.0,19618.0,19618.0,19618.0,19618.0,19618.0,13981.0,19618.0,19618.0
mean,29122000.0,131216500.0,40.420984,-3.69404,129.27174,6.586196,31.858803,1.125958,10.229177,159.098328
std,13518390.0,116679000.0,0.022627,0.028671,484.143545,33.286582,63.938997,1.348235,23.546472,144.252803
min,6369.0,7952.0,40.33221,-3.86391,0.0,1.0,0.0,0.01,1.0,0.0
25%,19034240.0,27653130.0,40.409393,-3.7077,35.0,1.0,0.0,0.17,1.0,0.0
50%,31875060.0,99018980.0,40.419735,-3.70112,58.0,2.0,4.0,0.59,2.0,126.0
75%,40909940.0,225689800.0,40.43029,-3.68542,100.0,3.0,31.0,1.63,6.0,320.0
max,49187790.0,396428100.0,40.56274,-3.5319,9999.0,1125.0,706.0,16.22,163.0,365.0
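
The statistics above hint at extreme values: `price` has a minimum of 0 and a maximum of 9999, while its 75th percentile is 100. One common way to flag such values is the interquartile range (IQR) rule. The sketch below is an assumption about how this check could be done, not a step taken from the original notebook; it uses the five prices from `df.head()` plus the minimum and maximum reported by `describe()`:

```python
import pandas as pd

# Prices from df.head(), plus the min (0) and max (9999) reported by describe().
prices = pd.Series([60, 31, 50, 92, 26, 0, 9999], name="price")

# Quartiles and the conventional 1.5 * IQR fences.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences and, separately, non-positive prices,
# which the IQR rule alone may not catch.
is_outlier = (prices < lower) | (prices > upper)
is_non_positive = prices <= 0

print(prices[is_outlier | is_non_positive])
```

The two masks are kept separate because a price of 0 sits inside the lower fence in this sample, yet it is unlikely to be a valid nightly rate.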


Now that we have a general idea of the dataset's contents, we can dive deeper into each column. We'll do exploratory data analysis and clean the data to set up the 'features' we'll use in our machine learning algorithms.

Plot a few features to get a better idea of each: