![](https://s14-eu5.ixquick.com/cgi-bin/serveimage?url=https%3A%2F%2Fwww.alojandoturistas.com%2Fwp-content%2Fuploads%2F2016%2F03%2Finsideairbnb01.jpg&sp=98223f3b39fbea79a7ea767e03a1cf15)

## INTRODUCTION

This Master project is done based on a dataset found in http://insideairbnb.com/, an independent service not
related to Airbnb that offers for free Airbnb listings data for major cities in the world.

Our dataset is a picture of the airbnb website, but is not Airbnb private data. It has been collected through webscrapping.

Airbnb website is free to list any property, this involves that not all listings are real. After thorough exploration of 
the data, we realised that many listings were just fake, other were not online anymore, and sometimes not even homes. Many appear in Airbnb website  but they not available for rent. 

Consequently, the first stage of this project consists of finding all those fake listings and delete them in order to 
have an accurate sample of the holiday rental occupancy rate in Madrid



## Find and remove inactive listings. 
- There are three types of inactive properties: 

    a) Listings that have never been available (although they turn up in website).
    
    b) Listings that were active in the past but for some reason they are not available any more.
    
    c) Semi active: properties only available a few days a month(ie, only weekends) or canceled by the host
    that removes on purpose all availability during the month.


In [2]:
import pandas as pd

In [4]:
%cd ~/Master-IV/PROYECTO_FM/

/home/dsc/Master-IV/PROYECTO_FM


In [5]:
# File available in dataset folder
aib = pd.read_csv('listings.csv')
aib.shape


## Filters  
The most important elements to measure if a property is available for rent full time and the listing is real are the following variables(columns):

 1) Number of reviews.
 
 2) Reviews per month.
 
 3) Calendar last updated.
 
 4) Availability_30/60/90.
 
 5) Last_review // First_review
 



## Filter 1. 
Rooms/apartments that appear sold out in the next 3 months (availability_30=0 & availability_60=0 & availability_90=0)
and  have no reviews at all, thus "reviews per month" and "number of reviews" columns equal to 0 or NaN.


In [7]:
# Find and delete
aib = aib.drop(aib[(aib.number_of_reviews==0) & (aib['reviews_per_month'].isnull()) & (aib.availability_30==0) & 
      (aib.availability_60==0)  & (aib.availability_90==0)].index)
aib.shape

(12662, 95)

## Filter 2
Properties not updated. Properties listed for more than a year,having less than 11 reviews with extreme low availability.

In [8]:

aib['last_review'] = pd.to_datetime(aib.last_review,format='%Y-%m-%d')
aib['first_review'] = pd.to_datetime(aib.first_review,format='%Y-%m-%d')

# Dataset as of April 2017.
aib =aib.drop(aib[(aib.number_of_reviews<11) & (aib.first_review<'2016-03-01') & (aib.availability_60<3) 
    & (aib.availability_90<5)].index)
aib.shape

(12293, 95)

In [13]:

aib.reviews_per_month.mean()


1.98623316622432

## Filter 3
Properties with very low availability, reviews per month well below the average (1.98) and last review in 2016.

In [9]:
aib = aib.drop(aib[(aib.availability_30<2) & (aib.availability_60<5) & (aib.reviews_per_month<1.5) 
    & (aib.last_review<'2017-01-01')].index)
aib.shape

(11557, 95)

In [10]:
# Final filter based on number of reviews along with 'reviews per month'
aib =  aib.drop(aib[(aib.availability_30==0) & (aib.number_of_reviews<=7) 
    & (aib.reviews_per_month<1.5)].index)
aib.shape

(11112, 95)

In [11]:
aib =  aib.drop(aib[(aib.availability_30==0)  & (aib.number_of_reviews<=7) 
    & (aib.reviews_per_month.isnull())].index)
aib.shape

(10687, 95)

## Filter 4.
Misleading listings showing 0 availability.

Properties that  turn up with average "number of reviews" and "reviews per month", but they are not booked, 
they have been just canceled by the host so that no one can book them in the current month. 

These kind of offerings are the hardest to spot, so to reduce the error margin we establish a more accurate and 
customized filter. Each listing with no availability_30 must show a number of reviews greater than
the neighbourhood average.
We assume that if a property is fully booked, it is very popular and therefore, it must show more
reviews than regular properties. 

In [13]:
# Find out average number of reviews of every district
aib.groupby('neighbourhood_group_cleansed').number_of_reviews.mean()

neighbourhood_group_cleansed
Arganzuela               24.296596
Barajas                  38.196970
Carabanchel              12.534483
Centro                   37.414127
Chamartín                15.654275
Chamberí                 18.672154
Ciudad Lineal            14.903226
Fuencarral - El Pardo     7.570423
Hortaleza                13.748428
Latina                   17.061433
Moncloa - Aravaca        13.313783
Moratalaz                15.176471
Puente de Vallecas       16.708333
Retiro                   22.683908
Salamanca                15.916306
San Blas - Canillejas    22.623762
Tetuán                   15.387931
Usera                    13.794872
Vicálvaro                 3.210526
Villa de Vallecas         7.857143
Villaverde               15.514706
Name: number_of_reviews, dtype: float64

In [14]:
# Filter applied
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Arganzuela') & (aib.number_of_reviews<=24)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Barajas') & (aib.number_of_reviews<=38)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Carabanchel') & (aib.number_of_reviews<=12)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Centro') & (aib.number_of_reviews<=37)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Chamartín') & (aib.number_of_reviews<=15)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Chamberí') & (aib.number_of_reviews<=18)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Ciudad Lineal') & (aib.number_of_reviews<=14)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Fuencarral - El Pardo') & (aib.number_of_reviews<=7)
    & (aib.availability_30==0)].index)
aib.shape

(10315, 95)

In [15]:
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Hortaleza') & (aib.number_of_reviews<=13)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Latina') & (aib.number_of_reviews<=17)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Moratalaz') & (aib.number_of_reviews<=15)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Moncloa - Aravaca') & (aib.number_of_reviews<=13)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Puente de Vallecas') & (aib.number_of_reviews<=16)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Retiro') & (aib.number_of_reviews<=22)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Salamanca') & (aib.number_of_reviews<=15)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='San Blas - Canillejas') & (aib.number_of_reviews<=22)
    & (aib.availability_30==0)].index)
aib.shape

(10236, 95)

In [16]:
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Tetuan') & (aib.number_of_reviews<=15)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Usera') & (aib.number_of_reviews<=13)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Vicálvaro') & (aib.number_of_reviews<=3)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Villa de Vallecas') & (aib.number_of_reviews<=7)
    & (aib.availability_30==0)].index)
aib = aib.drop(aib[(aib.neighbourhood_group_cleansed=='Villaverde') & (aib.number_of_reviews<=15)
    & (aib.availability_30==0)].index)
aib.shape


(10231, 95)

In [17]:
# Save df as csv.
aib.to_csv('aib.csv',sep='\t')