# Dataset for model implementation

The purpose of this notebook is to prepare a dataset that can be loaded into R and used with machine learning algorithms.
#### - Data Cleaning 
#### - Variable selection for model 



In [2]:
import pandas as pd
import numpy as np


In [3]:
aibb = pd.read_csv('airbnb.csv')  # file saved in notebook FIRST_DATA_FILTERING
aibb.shape

(10077, 98)

### ROOM TYPE SELECTION.

The goal of the project is to build a decision tool for landlords in order to maximize real estate income.

Airbnb offers three types of accommodation:

- Entire Homes or Apartments.
- Rooms.
- Shared rooms.

Shared rooms account for fewer than 200 records (161 of 10,077) and fall outside the scope of this project. A shared room is not considered an "asset", so it is excluded from the analysis.

In [4]:
# Shared room category removed
aibb = aibb.drop(aibb[aibb.room_type=='Shared room'].index)
aibb.shape

(9916, 98)
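
The same filter can also be written as a boolean mask, which skips the index lookup. A minimal sketch on a toy frame: the `toy` data and the `filtered` name are hypothetical, and only the `room_type` column name comes from the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column name mirrors the notebook.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "price": [90.0, 40.0, 15.0, 35.0],
})

# Keep every row whose room type is not 'Shared room'.
filtered = toy[toy["room_type"] != "Shared room"].copy()

print(filtered["room_type"].value_counts())
```

Both forms should give the same rows as long as the index labels are unique.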

### DATA CLEANING

CANCELLATION POLICY.

This feature will be encoded as a dummy variable. To keep the number of categories small, super_strict_30 and super_strict_60 (two records each) are reclassified as strict.


In [6]:
# CANCELLATION POLICY
 
aibb.cancellation_policy.value_counts()

strict             3879
moderate           3145
flexible           2888
super_strict_30       2
super_strict_60       2
Name: cancellation_policy, dtype: int64

In [7]:
# Cast to string, then fold the two rare categories into 'strict'
aibb['cancellation_policy'] = aibb.cancellation_policy.astype(str)
aibb['cancellation_policy'] = aibb.cancellation_policy.str.replace('super_strict_30', 'strict')
aibb['cancellation_policy'] = aibb.cancellation_policy.str.replace('super_strict_60', 'strict')

In [8]:
aibb.cancellation_policy.value_counts()

strict      3883
moderate    3145
flexible    2888
Name: cancellation_policy, dtype: int64
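
The notebook does not show the dummy encoding itself, which presumably happens later in R. As a rough illustration of what the three remaining categories turn into, the sketch below merges the rare levels with a single mapping and builds indicator columns with `pd.get_dummies`. The `policies` sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample containing every category seen in the notebook.
policies = pd.Series(
    ["strict", "moderate", "flexible", "super_strict_30", "super_strict_60"],
    name="cancellation_policy",
)

# Collapse the two rare categories into 'strict' with one mapping.
merged = policies.replace({"super_strict_30": "strict", "super_strict_60": "strict"})

# One indicator column per remaining category.
dummies = pd.get_dummies(merged, prefix="cancellation_policy")
print(dummies.columns.tolist())
```

With only three levels left, the encoding needs three indicator columns (or two, if one level serves as the baseline).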

ZIPCODE.

Correct wrong values (input mistakes).

In [9]:
# ZIPCODE
aibb.zipcode.value_counts()

28012           1580
28004           1383
28005            918
28013            779
28014            528
28015            422
28045            350
28010            288
28008            218
28001            187
28028            185
28011            182
28009            161
28019            156
28006            152
28007            148
28020            142
28003            126
28039            112
28002            106
28027             89
28017             88
28025             87
28043             81
28029             80
28033             72
28042             69
28026             67
28053             60
28038             54
                ... 
28040             10
28054             10
28830              2
28224              1
28730              1
08015              1
28105              1
28058              1
28052              1
28056              1
                   1
10100              1
28290              1
28060              1
2804               1
29012              1
28660        

In [10]:
# Correction of wrong values
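# Replacement values are strings so they match the existing string zip codes (e.g. '08015')
aibb['zipcode'] = aibb.zipcode.replace('2802\n28012', '28012')
aibb['zipcode'] = aibb.zipcode.replace('28051\n28051', '28051')
aibb['zipcode'] = aibb.zipcode.replace('28002\n28002', '28002')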
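
The counts above also contain values such as `2804`, `08015` and an empty entry, which the three replacements do not touch. One way to find such entries, sketched under two assumptions: a valid Madrid zip code has five digits starting with `28`, and the last line of a two-line value is the intended code. The `zips` sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample echoing the kinds of values seen in the counts above.
zips = pd.Series(["28012", "2802\n28012", "2804", "08015", " ", "28005"])

# Two-line values: keep the last line (assumed to be the intended code).
cleaned = zips.str.strip().str.split("\n").str[-1]

# Flag anything that is not exactly five digits starting with 28.
is_valid = cleaned.str.fullmatch(r"28\d{3}")
print(cleaned[~is_valid].tolist())
```

The flagged values still need a manual decision (fix, drop, or keep), since the pattern cannot tell which code was intended.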

In [11]:
# Replace the space in 'Entire home/apt' with an underscore so that R can use it in linear regression
aibb['room_type'] = aibb.room_type.str.replace('Entire home/apt', 'Entire_home/apt')
aibb.room_type.value_counts()

Entire_home/apt    6271
Private room       3645
Name: room_type, dtype: int64

## Missing Values (NaNs)

Some machine learning algorithms cannot handle NaNs and may fail when they encounter a missing value.
Many columns contain NaNs, and dropping the affected observations could bias the final results, so each missing value is replaced with the feature's mean or with 0.

In [14]:
# Mean imputation (column means entered by hand)
aibb['bathrooms'] = aibb.bathrooms.fillna(1.23)
aibb['bedrooms'] = aibb.bedrooms.fillna(1.31)
aibb['beds'] = aibb.beds.fillna(2.03)
# Zero imputation
aibb['security_deposit'] = aibb.security_deposit.fillna(0)  # NaN implies no deposit required, so replaced by 0
aibb['reviews_per_month'] = aibb.reviews_per_month.fillna(0)
aibb['review_scores_value'] = aibb.review_scores_value.fillna(0)
aibb['review_scores_accuracy'] = aibb.review_scores_accuracy.fillna(0)
aibb['review_scores_rating'] = aibb.review_scores_rating.fillna(0)
aibb['cleaning_fee'] = aibb.cleaning_fee.fillna(0)
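
The constants 1.23, 1.31 and 2.03 are typed in by hand, so they would go stale if the input file changed. A possible alternative, sketched on a hypothetical toy frame, computes each mean from the data at fill time. The `mean_cols` and `zero_cols` lists are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; column names mirror the notebook.
toy = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0],
    "bedrooms": [1.0, 3.0, np.nan],
    "cleaning_fee": [10.0, np.nan, np.nan],
})

mean_cols = ["bathrooms", "bedrooms"]  # filled with the column mean
zero_cols = ["cleaning_fee"]           # NaN read as "nothing charged"

# fillna with a Series fills each column with the value under its label.
toy[mean_cols] = toy[mean_cols].fillna(toy[mean_cols].mean())
toy[zero_cols] = toy[zero_cols].fillna(0)

print(toy)
```

`DataFrame.mean()` skips NaNs by default, so each mean is taken over the observed values only.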


### VARIABLE SELECTION FOR ML MODEL

Explore all columns to find features that may improve model accuracy.  

In [19]:

pd.set_option('display.max_columns',None)
aibb.head(2)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,Price_Person_Night
0,0,0,7830063,https://www.airbnb.com/rooms/7830063,20170407214119,2017-04-08,Quiet room in Plaza Mayor,Room in magnificent property in the historic c...,"Based in the historic center of Madrid, to 1 m...",Room in magnificent property in the historic c...,none,"In the historic heart of Madrid, near the Plaz...",Check-in from 21:00 to 23:00 has an extra char...,Very well connected by buses and subwa (Tirso ...,Quiet room on the first floor of historic buil...,Maps and all kind of information and recommend...,Please respect the silence of the other guests...,https://a0.muscache.com/im/pictures/99669989/f...,https://a0.muscache.com/im/pictures/99669989/f...,https://a0.muscache.com/im/pictures/99669989/f...,https://a0.muscache.com/im/pictures/99669989/f...,41032693,https://www.airbnb.com/users/show/41032693,Bluekitchen,2015-08-10,"Madrid, Community of Madrid, Spain",Somos dos amigos de la infancia de Madrid que ...,within an hour,100%,,f,https://a0.muscache.com/im/users/41032693/prof...,https://a0.muscache.com/im/users/41032693/prof...,La Latina,7.0,7.0,"['email', 'phone', 'reviews', 'jumio']",t,t,"La Latina, Madrid, Comunidad de Madrid 28005, ...",La Latina,Palacio,Centro,Madrid,Comunidad de Madrid,28005,Madrid,"Madrid, Spain",ES,Spain,40.412275,-3.708718,t,House,Private room,2,2.0,1.0,2.0,Real Bed,"{TV,Internet,""Wireless Internet"",Kitchen,Doorm...",,44.7342,281.663694,"$1,200.00",93.887898,14.083185,2,$0.00,2,30,2 weeks ago,,4,13,37,165,2017-04-08,100,2015-08-23,2017-04-02,92.0,9.0,9.0,10.0,10.0,10.0,9.0,t,,,t,moderate,f,f,7,5.04,22.3671
1,1,1,9898596,https://www.airbnb.com/rooms/9898596,20170407214119,2017-04-08,Homely apartment in the heart of Madrid,"Spacious apartment for up to 10 people, with a...",This spacious apartment is located in the cent...,"Spacious apartment for up to 10 people, with a...",none,The Living Santa Ana Apartments are located in...,,We are 3m walk from the La Latina subway.,,"During his stay, please contact us if you have...",The apartment is located in a community of nei...,https://a0.muscache.com/im/pictures/f3336e14-5...,https://a0.muscache.com/im/pictures/f3336e14-5...,https://a0.muscache.com/im/pictures/f3336e14-5...,https://a0.muscache.com/im/pictures/f3336e14-5...,50908516,https://www.airbnb.com/users/show/50908516,Daniel,2015-12-09,"Madrid, Community of Madrid, Spain","Somos Amanda y Dani, tenemos este establecimie...",within an hour,100%,,f,https://a0.muscache.com/im/pictures/0061d173-a...,https://a0.muscache.com/im/pictures/0061d173-a...,La Latina,1.0,1.0,"['email', 'phone', 'reviews', 'jumio']",t,t,"La Latina, Madrid, Comunidad de Madrid 28005, ...",La Latina,Palacio,Centro,Madrid,Comunidad de Madrid,28005,Madrid,"Madrid, Spain",ES,Spain,40.411093,-3.708985,t,Apartment,Entire_home/apt,10,1.0,3.0,9.0,Real Bed,"{TV,Internet,""Wireless Internet"",""Wheelchair a...",,143.7885,,,187.775796,37.555159,6,$15.00,1,1125,today,,0,5,14,230,2017-04-08,68,2015-12-31,2017-04-04,93.0,9.0,9.0,10.0,10.0,10.0,9.0,t,,,t,moderate,f,f,1,4.39,23.96475


In [20]:
# Features selected for ML model:
aibb = aibb[['neighbourhood_group_cleansed','availability_30','is_location_exact','host_identity_verified',
              'bathrooms','bedrooms','beds','minimum_nights','maximum_nights','price','Price_Person_Night',
              'room_type','security_deposit','cleaning_fee','number_of_reviews','cancellation_policy','reviews_per_month','review_scores_rating',
               'review_scores_accuracy','review_scores_value','extra_people','guests_included','accommodates','host_is_superhost',
              'instant_bookable']]

In [21]:
aibb.head()

Unnamed: 0,neighbourhood_group_cleansed,availability_30,is_location_exact,host_identity_verified,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,price,Price_Person_Night,room_type,security_deposit,cleaning_fee,number_of_reviews,cancellation_policy,reviews_per_month,review_scores_rating,review_scores_accuracy,review_scores_value,extra_people,guests_included,accommodates,host_is_superhost,instant_bookable
0,Centro,4,t,t,2.0,1.0,2.0,2,30,44.7342,22.3671,Private room,93.887898,14.083185,100,moderate,5.04,92.0,9.0,9.0,$0.00,2,2,f,t
1,Centro,0,t,t,1.0,3.0,9.0,1,1125,143.7885,23.96475,Entire_home/apt,187.775796,37.555159,68,moderate,4.39,93.0,9.0,9.0,$15.00,6,10,f,t
2,Centro,4,t,f,1.0,1.0,1.0,1,1125,86.2731,43.13655,Entire_home/apt,0.0,14.083185,34,strict,5.57,93.0,9.0,9.0,$10.00,2,2,t,t
3,Centro,1,f,f,1.0,1.0,1.0,3,1125,45.7993,45.7993,Private room,0.0,0.0,11,strict,0.24,98.0,10.0,10.0,$10.00,1,2,t,f
4,Centro,7,t,t,1.5,2.0,2.0,1,1125,53.255,26.6275,Entire_home/apt,234.719745,18.77758,6,strict,4.29,100.0,10.0,10.0,$25.00,2,4,f,f
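
In the preview above, `extra_people` still holds currency strings such as `$15.00`, so it would not be read as a number downstream. The notebook does not convert it. If a numeric column is wanted, a conversion along these lines might work; the `extra` sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample in the same currency format as the preview.
extra = pd.Series(["$0.00", "$15.00", "$1,200.00"], name="extra_people")

# Strip the dollar sign and thousands separators, then cast to float.
extra_num = extra.str.replace(r"[$,]", "", regex=True).astype(float)
print(extra_num.tolist())
```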


In [22]:
aibb.describe()

Unnamed: 0,availability_30,bathrooms,bedrooms,beds,minimum_nights,maximum_nights,price,Price_Person_Night,security_deposit,cleaning_fee,number_of_reviews,reviews_per_month,review_scores_rating,review_scores_accuracy,review_scores_value,guests_included,accommodates
count,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0,9916.0
mean,11.393304,1.233252,1.310365,2.033092,2.552945,980.922852,74.414786,52.731062,81.367555,17.17516,28.634933,1.848756,76.257765,7.82261,7.654397,1.653893,3.326039
std,8.672236,0.546748,0.85538,1.516631,8.376643,10249.174631,64.100485,54.054596,149.696621,21.273206,42.466663,1.959162,35.230043,3.627054,3.572329,1.147874,2.023835
min,0.0,0.0,0.0,1.0,1.0,1.0,9.5859,1.633153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,5.0,1.0,1.0,1.0,1.0,360.0,37.2785,26.6275,0.0,0.0,2.0,0.28,80.0,8.0,8.0,1.0,2.0
50%,9.0,1.0,1.0,2.0,2.0,1125.0,60.7107,37.2785,0.0,14.083185,12.0,1.2,92.0,9.0,9.0,1.0,3.0
75%,16.0,1.0,2.0,2.0,3.0,1125.0,90.5335,58.5805,140.831847,28.166369,37.0,2.91,97.0,10.0,10.0,2.0,4.0
max,30.0,8.0,10.0,16.0,365.0,1000000.0,1065.1,905.335,4491.597033,469.439489,446.0,17.21,100.0,10.0,10.0,16.0,16.0


In [25]:
aibb.isnull().sum()

neighbourhood_group_cleansed    0
availability_30                 0
is_location_exact               0
host_identity_verified          0
bathrooms                       0
bedrooms                        0
beds                            0
minimum_nights                  0
maximum_nights                  0
price                           0
Price_Person_Night              0
room_type                       0
security_deposit                0
cleaning_fee                    0
number_of_reviews               0
cancellation_policy             0
reviews_per_month               0
review_scores_rating            0
review_scores_accuracy          0
review_scores_value             0
extra_people                    0
guests_included                 0
accommodates                    0
host_is_superhost               0
instant_bookable                0
dtype: int64

### Final File for Linear Regression Analysis and Machine Learning in R

In [63]:
# Saved in Dataset folder
aibb.to_csv('RLM.csv')
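
The `Unnamed: 0` columns in the previews above are what appears when a frame is saved with its index and then read back. If that extra column is not wanted in R, passing `index=False` to `to_csv` is one option. The sketch below uses a hypothetical toy frame and writes to an in-memory buffer rather than to `RLM.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the final dataset.
toy = pd.DataFrame({
    "price": [44.7, 143.8],
    "room_type": ["Private room", "Entire_home/apt"],
})

# Write to an in-memory buffer instead of disk for the sake of the example.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading it back yields only the original columns.
buffer.seek(0)
roundtrip = pd.read_csv(buffer)
print(roundtrip.columns.tolist())
```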