# Bay Area, CA - Airbnb Data


## Context

Since its inception in 2008, Airbnb has disrupted the hospitality industry by allowing almost anyone to rent out a spare room and host travelers looking for an overnight stay. 

While Airbnb has publicly available data from many locations, **Santa Clara County is of particular interest as it is one of the major counties in the Bay Area and home to many prominent tech companies and startups in Silicon Valley.**
As such, home prices and the cost of living are exceptionally high in this area, as the data will show.


## Content

All data is publicly available under the Creative Commons "Public Domain Dedication" license and was last updated on June 12th, 2020. The dataset provides the metrics that are publicly visible for each listing (e.g. name, description, price, reviews), and it also contains some of Airbnb's internal metrics (review score accuracy, host acceptance rate, etc.).

## Acknowledgements

Acknowledgments go to Airbnb for their publicly released datasets that are available at this website.




## Inspiration & ideas:

1. Create a price-suggestion model for new Airbnb hosts who might not know the value of their listing.
    - Text mining with stratified house price data
2. Can we predict the rating of an Airbnb listing utilizing **NLP** of the description columns?
3. How have Airbnb prices changed over time? Are prices seasonal?
   - Time series analysis?
4. Which areas in Santa Clara County are most "popular"?
   - Region analysis
5. Which features of an Airbnb listing are important to add to its perceived value?
   - Feature analysis, PCA
   - Correlation analysis
   

## Import libraries and data

In [529]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [530]:
data = pd.read_csv('./data/Airbnb_Listings.csv')
data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,4952,https://www.airbnb.com/rooms/4952,20200530151957,2020-05-30,Butterfly Inn - Graceful Living !,Lovely garden setting in a serene and art-fill...,Very comfortable Queen bed and small desk in b...,Lovely garden setting in a serene and art-fill...,none,"Located in Professorville, Palo Alto, one of t...","Housekeeping every Monday, leave your door ope...",Walking distance to Stanford University (30 mi...,"Kitchen, laundry, family/TV room, garden, free...","Welcome and Orientation. Shared kitchen, garde...",No shoes in the house beyond the hall or laund...,,,https://a0.muscache.com/im/pictures/529448/3d7...,,7054,https://www.airbnb.com/users/show/7054,Mei Lin,2009-01-24,"Palo Alto, California, United States","I have lived and worked in Asia, Europe, Austr...",within an hour,100%,98%,t,https://a0.muscache.com/im/users/7054/profile_...,https://a0.muscache.com/im/users/7054/profile_...,Palo Alto,5.0,5.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Palo Alto, CA, United States",Palo Alto,Palo Alto,,Palo Alto,CA,94301,"South Bay, CA","Palo Alto, CA",US,United States,37.43972,-122.15672,t,Villa,Private room,2,2.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,Heating,W...",,$90.00,$440.00,"$1,428.00",$0.00,$35.00,1,$45.00,2,365,2,2,365,365,2.0,365.0,2 months ago,t,0,0,0,0,2020-05-30,75,0,2009-08-02,2019-01-12,96.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,"{""PALO ALTO""}",f,f,moderate,f,f,5,0,5,0,0.57
1,11464,https://www.airbnb.com/rooms/11464,20200530151957,2020-05-31,Deluxe Private Studio-custom int.,Custom built Studio with exquisite design. Rea...,Description A favorite for international corpo...,Custom built Studio with exquisite design. Rea...,none,10 min. to Lucky's and Starbucks at El Camino ...,Pet Policy: - Well Behaved pet up to 25 lb. of...,Public transportation at Homestead and Pomeroy...,Complimentary Wifi-internet + Basic Cable,Available any time for support and problem res...,Can house a couple- bed type is a full size be...,,,https://a0.muscache.com/im/pictures/121321/2c5...,,42458,https://www.airbnb.com/users/show/42458,Emey,2009-09-30,"Mountain View, California, United States",I pride myself on attention to details and cus...,within an hour,100%,90%,f,https://a0.muscache.com/im/users/42458/profile...,https://a0.muscache.com/im/users/42458/profile...,Santa Clara,13.0,13.0,"['email', 'phone', 'reviews']",t,f,"Santa Clara, CA, United States",Santa Clara,Santa Clara,,Santa Clara,CA,95051,"South Bay, CA","Santa Clara, CA",US,United States,37.34342,-121.98829,t,Apartment,Entire home/apt,1,1.0,,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",325.0,$85.00,,"$2,100.00",$700.00,$250.00,1,$10.00,14,365,14,14,365,365,14.0,365.0,5 months ago,t,30,60,90,179,2020-05-31,6,1,2014-06-15,2019-10-11,87.0,10.0,10.0,10.0,9.0,9.0,9.0,f,,"{""SANTA CLARA""}",f,f,moderate,f,t,12,12,0,0,0.08
2,17884,https://www.airbnb.com/rooms/17884,20200530151957,2020-05-31,Silicon Valley Suite,"A guest suite for one or two, in a house in a ...",This is a private suite at the rear of a house...,"A guest suite for one or two, in a house in a ...",none,"This is a very quiet family neighborhood, but ...","I cannot accommodate cats, sorry. I can usuall...","The CalTrain station is in walking distance, a...",Private outdoor patio. Shared washer/dryer on ...,I am usually in town. Guests have a private e...,"This is a quiet neighborhood, and the suite sh...",,,https://a0.muscache.com/im/pictures/08c0f0aa-3...,,69023,https://www.airbnb.com/users/show/69023,Janet,2010-01-08,"Lynn Haven, Florida, United States","I am a software engineer, recently retired fro...",within a day,100%,50%,f,https://a0.muscache.com/im/users/69023/profile...,https://a0.muscache.com/im/users/69023/profile...,Palo Alto,2.0,2.0,"['email', 'phone', 'reviews', 'kba']",t,t,"Mountain View, CA, United States",Mountain View,Mountain View,,Mountain View,CA,94041,"South Bay, CA","Mountain View, CA",US,United States,37.38612,-122.07357,t,Guest suite,Entire home/apt,2,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",,$65.00,"$1,400.00","$2,800.00","$1,000.00",,1,$0.00,30,180,30,30,180,180,30.0,180.0,4 weeks ago,t,29,59,89,179,2020-05-31,2,0,2018-11-20,2019-04-27,90.0,8.0,10.0,10.0,10.0,10.0,10.0,f,,"{""Mountain View"","" CA""}",f,f,strict_14_with_grace_period,f,f,2,2,0,0,0.11
3,21373,https://www.airbnb.com/rooms/21373,20200530151957,2020-05-30,Bonsai Garden Inn in Professorville,Room in gracious home with beautiful garden. ...,"Bright, garden-facing room in beautiful home. ...",Room in gracious home with beautiful garden. ...,none,This room is in an ultra convenient location i...,The family room has a flat panel tv and desk f...,The Stanford Shopping Center is 20 minutes wal...,"Kitchen, Laundry, Garden, Family Room with TV,...",We will see you everyday as we cook meals in t...,No shoes in the house beyond hall or laundry. ...,,,https://a0.muscache.com/im/pictures/527285/0f9...,,7054,https://www.airbnb.com/users/show/7054,Mei Lin,2009-01-24,"Palo Alto, California, United States","I have lived and worked in Asia, Europe, Austr...",within an hour,100%,98%,t,https://a0.muscache.com/im/users/7054/profile_...,https://a0.muscache.com/im/users/7054/profile_...,Palo Alto,5.0,5.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Palo Alto, CA, United States",Palo Alto,Palo Alto,,Palo Alto,CA,94301,"South Bay, CA","Palo Alto, CA",US,United States,37.44001,-122.15667,t,Villa,Private room,1,2.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Paid par...",,$99.00,$600.00,"$2,000.00",$0.00,$35.00,1,$50.00,2,730,2,2,730,730,2.0,730.0,3 weeks ago,t,0,0,0,212,2020-05-30,264,39,2010-05-28,2020-03-14,97.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,"{""PALO ALTO""}",f,f,moderate,f,f,5,0,5,0,2.17
4,37512,https://www.airbnb.com/rooms/37512,20200530151957,2020-05-31,Private room - Parking 3 carport,We live in a safe community close to public tr...,I have a really nice room in a quiet neighborh...,We live in a safe community close to public tr...,none,Our community is a safe environment and at nig...,Please remember that you are in my home and re...,We have a bus stop right outside our community...,The kitchen and all that it offers. The laund...,We follow our guests lead when it comes to soc...,No drunk and disorderly conduct please. You m...,,,https://a0.muscache.com/im/pictures/47698055/3...,,161741,https://www.airbnb.com/users/show/161741,Maria,2010-07-09,"San Jose, California, United States",I am in a committed relationship with my live...,within a few hours,100%,100%,t,https://a0.muscache.com/im/users/161741/profil...,https://a0.muscache.com/im/users/161741/profil...,South San Jose,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'man...",t,t,"San Jose, CA, United States",South San Jose,San Jose,,San Jose,CA,95111,"South Bay, CA","San Jose, CA",US,United States,37.30598,-121.84036,t,Bungalow,Private room,2,1.0,1.0,1.0,Real Bed,"{Internet,Wifi,""Air conditioning"",Kitchen,""Fre...",,$52.00,$393.00,"$1,386.00",,,1,$0.00,1,7,1,1,7,7,1.0,7.0,23 months ago,t,0,0,28,303,2020-05-31,183,0,2010-08-01,2017-03-14,94.0,9.0,10.0,10.0,10.0,9.0,9.0,f,,"{""SAN JOSE""}",f,f,moderate,t,t,2,0,2,0,1.53


In [531]:
data.shape

(7221, 106)

In [532]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7221 entries, 0 to 7220
Columns: 106 entries, id to reviews_per_month
dtypes: float64(23), int64(21), object(62)
memory usage: 5.8+ MB


## Data cleaning

In [533]:
# find columns that are entirely missing or hold only a single value:
del_cols = []

for col in data.columns:
    if data[col].isna().all():
        del_cols.append(col)
    elif data[col].nunique() == 1:
        del_cols.append(col)
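
One subtlety in the loop above is that `Series.nunique()` ignores missing values by default, so a column holding one distinct value plus some `NaN`s is flagged as constant too. The sketch below illustrates this on a toy frame; the column names, the values, and the helper `constant_or_empty_columns` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, not taken from the real dataset.
toy = pd.DataFrame({
    "all_missing": [np.nan, np.nan, np.nan],
    "constant": ["t", "t", "t"],
    "constant_with_gaps": ["f", np.nan, "f"],
    "varied": [1, 2, 3],
})


def constant_or_empty_columns(df):
    """Return columns that are all-missing or hold at most one distinct non-missing value."""
    # nunique(dropna=True) is 0 for an all-NaN column and 1 for a constant one.
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]


flagged = constant_or_empty_columns(toy)
print(flagged)
```

Whether a column like `constant_with_gaps` should be dropped is a judgment call: the pattern of missing values may itself carry information, in which case `nunique(dropna=False)` would keep it.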

In [534]:
del_cols

['scrape_id',
 'experiences_offered',
 'thumbnail_url',
 'medium_url',
 'xl_picture_url',
 'neighbourhood_group_cleansed',
 'country_code',
 'country',
 'has_availability',
 'requires_license',
 'license',
 'is_business_travel_ready']

In [535]:
data = data.drop(del_cols, axis=1)

In [536]:
data['id'].isna().sum()

0

In [537]:
# delete URL columns:
urls = [
    'listing_url', 'picture_url', 'host_url', 'host_thumbnail_url',
    'host_picture_url'
]


In [538]:
data = data.drop(urls, axis=1)

In [539]:
for col in data.columns:
    list_unique = (col, len(data[col].unique()))
    print(list_unique)

('id', 7221)
('last_scraped', 2)
('name', 7078)
('summary', 6052)
('space', 4403)
('description', 6440)
('neighborhood_overview', 3509)
('notes', 2747)
('transit', 3200)
('access', 3378)
('interaction', 3166)
('house_rules', 3718)
('host_id', 3533)
('host_name', 2103)
('host_since', 2085)
('host_location', 242)
('host_about', 2030)
('host_response_time', 5)
('host_response_rate', 44)
('host_acceptance_rate', 78)
('host_is_superhost', 3)
('host_neighbourhood', 145)
('host_listings_count', 62)
('host_total_listings_count', 62)
('host_verifications', 240)
('host_has_profile_pic', 3)
('host_identity_verified', 3)
('street', 37)
('neighbourhood', 37)
('neighbourhood_cleansed', 16)
('city', 35)
('state', 5)
('zipcode', 83)
('market', 5)
('smart_location', 37)
('latitude', 5882)
('longitude', 6357)
('is_location_exact', 2)
('property_type', 28)
('room_type', 4)
('accommodates', 16)
('bathrooms', 16)
('bedrooms', 11)
('beds', 21)
('bed_type', 5)
('amenities', 6396)
('square_feet', 25)
('price'

In [540]:
data['property_type'].value_counts()

House                 3715
Apartment             1152
Serviced apartment     504
Townhouse              429
Guest suite            363
Guesthouse             318
Condominium            303
Villa                  129
Bungalow               115
Loft                    41
Cottage                 29
Camper/RV               27
Boutique hotel          23
Tiny house              16
Other                   13
Bed and breakfast       12
Cabin                    7
Farm stay                6
Tent                     5
Treehouse                3
Yurt                     3
Barn                     2
Train                    1
Lighthouse               1
Campsite                 1
Earth house              1
Aparthotel               1
Chalet                   1
Name: property_type, dtype: int64

In [541]:
data['state'].unique()  # every label is a spelling of California, so the column is a candidate to drop

array(['CA', 'Ca', 'ca', nan, 'California '], dtype=object)
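
The five labels above all appear to be spellings of the same state. A minimal sketch of how they could be normalized, using only the values printed above, shows that the column collapses to a single value and therefore carries no information for this dataset:

```python
import numpy as np
import pandas as pd

# The five raw labels reported by data['state'].unique() above.
state = pd.Series(["CA", "Ca", "ca", np.nan, "California "])

# Trim whitespace, upper-case, then map the spelled-out name to the postal code.
normalized = state.str.strip().str.upper().replace({"CALIFORNIA": "CA"})

print(normalized.dropna().unique())
```

The same strip/upper/replace pattern would be useful for a column such as `city`, where inconsistent spellings might hide real categories, but that is an assumption about the data rather than something the notebook checks.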

In [542]:
data_null = data.isnull().sum()

In [543]:
data_null.sort_values(ascending=False)

square_feet                                     7192
weekly_price                                    6647
monthly_price                                   6629
notes                                           3529
access                                          3117
transit                                         3085
interaction                                     2968
neighborhood_overview                           2677
host_about                                      2334
house_rules                                     2264
host_response_rate                              1807
host_response_time                              1807
space                                           1707
security_deposit                                1457
review_scores_location                          1370
review_scores_checkin                           1370
review_scores_value                             1370
review_scores_communication                     1369
review_scores_accuracy                        

In [544]:
data = data.drop(['square_feet','zipcode','state'], axis=1)
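
`square_feet` is missing in 7192 of 7221 rows (about 99.6%), which makes it a clear candidate for removal. Rather than picking sparse columns by eye, the share of missing values can be computed directly. The sketch below uses a toy frame with made-up values and a hypothetical cut-off of 50%; neither comes from the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values standing in for the real listings data.
toy = pd.DataFrame({
    "square_feet": [np.nan, np.nan, np.nan, 325.0],
    "price": [90.0, 85.0, 65.0, 99.0],
    "cleaning_fee": [35.0, np.nan, 0.0, 35.0],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cut-off: drop columns with more than half of their values missing.
threshold = 0.5
sparse_cols = missing_share[missing_share > threshold].index.tolist()
trimmed = toy.drop(columns=sparse_cols)

print(missing_share)
print(sparse_cols)
```

A fixed threshold is only a starting point: under the same rule `weekly_price` and `monthly_price` would also be removed from the real data, which may or may not be desirable.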

In [545]:
data.columns

Index(['id', 'last_scraped', 'name', 'summary', 'space', 'description',
       'neighborhood_overview', 'notes', 'transit', 'access', 'interaction',
       'house_rules', 'host_id', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed', 'city', 'market',
       'smart_location', 'latitude', 'longitude', 'is_location_exact',
       'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'bed_type', 'amenities', 'price', 'weekly_price',
       'monthly_price', 'security_deposit', 'cleaning_fee', 'guests_included',
       'extra_people', 'minimum_nights', 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       '

### Convert currency strings to float

In [546]:
price_list = [
    'weekly_price',
    'monthly_price',
    'price',
    'security_deposit',
    'cleaning_fee',
    'extra_people',
]

In [548]:
for p in price_list: 
    print(p,data[p].isna().sum())

weekly_price 6647
monthly_price 6629
price 0
security_deposit 1457
cleaning_fee 800
extra_people 0


In [549]:
for p in price_list:
    # Treat missing amounts as $0.00, then strip '$' and ',' as literal
    # characters (regex=False) before casting to float.
    data[p] = (data[p].fillna('$0.00')
                      .str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False)
                      .astype(float))


In [550]:
data[price_list].isna().sum()

weekly_price        0
monthly_price       0
price               0
security_deposit    0
cleaning_fee        0
extra_people        0
dtype: int64
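
Filling missing amounts with `'$0.00'` means that later summary statistics count "no weekly price offered" as a price of zero. An alternative, sketched below, keeps missing values as `NaN` during conversion. The helper name `parse_currency` is hypothetical, and the sample strings follow the format of the first rows shown earlier.

```python
import numpy as np
import pandas as pd


def parse_currency(series):
    """Convert strings such as '$1,428.00' to floats, leaving missing values as NaN."""
    cleaned = (series.str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False))
    return pd.to_numeric(cleaned, errors="coerce")


# Sample values in the format of weekly_price and monthly_price above.
raw = pd.Series(["$440.00", np.nan, "$1,400.00", "$1,428.00"])
parsed = parse_currency(raw)

print(parsed.tolist())
```

Which option is preferable depends on the column: a missing `cleaning_fee` plausibly means no fee, while a missing `weekly_price` more likely means that no weekly rate was set.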

In [551]:
data.describe(include='all')

Unnamed: 0,id,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,interaction,house_rules,host_id,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,city,market,smart_location,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,7221.0,7221,7220,6922,5514,6983,4544,3692,4136,4104,4253,4957,7221.0,7220,7220,7198,4887,5414,5414,6507,7220,6040,7220.0,7220.0,7221,7220,7220,7221,6738,7221,7220,7210,7221,7221.0,7221.0,7221,7221,7221,7221.0,7220.0,7215.0,7153.0,7221,7221,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221.0,7221,7221.0,7221.0,7221.0,7221.0,7221,7221.0,7221.0,5912,5912,5854.0,5852.0,5852.0,5851.0,5852.0,5851.0,5851.0,6242,7221,7221,7221,7221,7221.0,7221.0,7221.0,7221.0,5912.0
unique,,2,7077,6051,4402,6439,3508,2746,3199,3377,3165,3717,,2102,2084,241,2029,4,43,77,2,144,,,240,2,2,37,36,16,34,4,37,,,2,28,4,,,,,5,6396,,,,,,,,,,,,,,,,74,,,,,2,,,1926,795,,,,,,,,10,2,4,2,2,,,,,
top,,2020-05-31,TRIBE ❤️ Deep Cleaned 2BR @ Great Mall Milpitas,The HackerHome Network New to the Valley? Tran...,"At Zeus, we focus exclusively on 30 day+ stays...","Dorm-style house in Silicon Valley. No-host, 5...",*Area Highlights* -Located in the Heart of the...,Parking is not included with your stay. Parkin...,This Flat Location is Uber/Lyft Accessible & C...,"• All utilities are included, including gas, e...",We're available 24/7 via Text | Email | Emerge...,By reserving your stay with Tribe you hereby a...,,Zeus,2015-11-02,"San Jose, California, United States",Zeus is reinventing corporate housing. We do t...,within an hour,100%,100%,f,Santa Clara,,,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,f,"San Jose, CA, United States",Palo Alto,San Jose,San Jose,"South Bay, CA","San Jose, CA",,,t,House,Private room,,,,,Real Bed,{},,,,,,,,,,,,,,,,3 months ago,,,,,2020-05-31,,,2019-05-10,2020-03-13,,,,,,,,"{""SAN JOSE""}",f,flexible,f,f,,,,,
freq,,4782,21,44,411,20,31,41,67,29,64,121,,411,412,1936,411,3993,4389,2218,3988,644,,,527,7215,4061,3028,786,2882,3028,5492,3028,,,6796,3715,3454,,,,,7156,36,,,,,,,,,,,,,,,,1163,,,,,4782,,,23,103,,,,,,,,2810,3987,2817,6991,6993,,,,,
mean,26239220.0,,,,,,,,,,,,88864430.0,,,,,,,,,,139.588504,139.588504,,,,,,,,,,37.352285,-121.9666,,,,3.114804,1.383449,1.42176,1.817,,,161.287079,80.958731,263.774962,253.71846,84.5448,1.699349,13.39399,9.756959,298088.0,9.142916,14.247888,298232.5,1785205.0,13.631505,1538945.0,,13.689101,30.373494,48.260352,160.282648,,29.85376,9.369201,,,95.245473,9.723855,9.572454,9.815074,9.793233,9.796103,9.527773,,,,,,32.048608,28.216729,2.775793,1.029359,1.357124
std,12364200.0,,,,,,,,,,,,93905930.0,,,,,,,,,,472.674992,472.674992,,,,,,,,,,0.064819,0.108587,,,,2.370579,0.688715,0.999833,1.633911,,,352.069,441.434587,1312.057642,506.534144,106.107352,1.620945,22.178563,34.689846,25271500.0,35.027086,63.358607,25271500.0,61880840.0,61.69182,53477420.0,,13.616884,26.501026,38.981316,141.746896,,51.488761,15.188236,,,7.734005,0.741816,0.848685,0.641301,0.691503,0.583498,0.833306,,,,,,95.159193,95.714488,6.810433,5.741521,1.530782
min,4952.0,,,,,,,,,,,,7054.0,,,,,,,,,,0.0,0.0,,,,,,,,,,36.9656,-122.19035,,,,1.0,0.0,0.0,0.0,,,10.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,0.0,0.0,0.0,0.0,,0.0,0.0,,,20.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,,,1.0,0.0,0.0,0.0,0.01
25%,16242420.0,,,,,,,,,,,,16972780.0,,,,,,,,,,1.0,1.0,,,,,,,,,,37.31977,-122.04245,,,,2.0,1.0,1.0,1.0,,,60.0,0.0,0.0,0.0,20.0,1.0,0.0,1.0,60.0,1.0,1.0,365.0,365.0,1.0,365.0,,0.0,0.0,0.0,1.0,,1.0,0.0,,,94.0,10.0,9.0,10.0,10.0,10.0,9.0,,,,,,1.0,0.0,0.0,0.0,0.32
50%,28184320.0,,,,,,,,,,,,48005490.0,,,,,,,,,,3.0,3.0,,,,,,,,,,37.35826,-121.96194,,,,2.0,1.0,1.0,1.0,,,100.0,0.0,0.0,100.0,50.0,1.0,7.0,2.0,1125.0,2.0,2.0,1125.0,1125.0,2.0,1125.0,,10.0,30.0,59.0,136.0,,10.0,3.0,,,98.0,10.0,10.0,10.0,10.0,10.0,10.0,,,,,,3.0,1.0,1.0,0.0,0.83
75%,37499480.0,,,,,,,,,,,,142933100.0,,,,,,,,,,10.0,10.0,,,,,,,,,,37.39825,-121.88634,,,,4.0,2.0,2.0,2.0,,,175.0,0.0,0.0,350.0,100.0,2.0,20.0,7.0,1125.0,5.0,7.0,1125.0,1125.0,7.0,1125.0,,29.0,59.0,89.0,323.0,,35.0,12.0,,,100.0,10.0,10.0,10.0,10.0,10.0,10.0,,,,,,8.0,2.0,3.0,0.0,1.81


## Data Exploration

### Check the numerical and nominal data separately

In [552]:
numeric_features = data.select_dtypes(include=['int64', 'float64'])
nominal_features = data.select_dtypes(include=['object'])

In [553]:
numeric_features.head()

Unnamed: 0,id,host_id,host_listings_count,host_total_listings_count,latitude,longitude,accommodates,bathrooms,bedrooms,beds,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,4952,7054,5.0,5.0,37.43972,-122.15672,2,2.0,1.0,1.0,90.0,440.0,1428.0,0.0,35.0,1,45.0,2,365,2,2,365,365,2.0,365.0,0,0,0,0,75,0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,5,0,5,0,0.57
1,11464,42458,13.0,13.0,37.34342,-121.98829,1,1.0,,1.0,85.0,0.0,2100.0,700.0,250.0,1,10.0,14,365,14,14,365,365,14.0,365.0,30,60,90,179,6,1,87.0,10.0,10.0,10.0,9.0,9.0,9.0,12,12,0,0,0.08
2,17884,69023,2.0,2.0,37.38612,-122.07357,2,1.0,1.0,1.0,65.0,1400.0,2800.0,1000.0,0.0,1,0.0,30,180,30,30,180,180,30.0,180.0,29,59,89,179,2,0,90.0,8.0,10.0,10.0,10.0,10.0,10.0,2,2,0,0,0.11
3,21373,7054,5.0,5.0,37.44001,-122.15667,1,2.0,1.0,1.0,99.0,600.0,2000.0,0.0,35.0,1,50.0,2,730,2,2,730,730,2.0,730.0,0,0,0,212,264,39,97.0,10.0,10.0,10.0,10.0,10.0,10.0,5,0,5,0,2.17
4,37512,161741,2.0,2.0,37.30598,-121.84036,2,1.0,1.0,1.0,52.0,393.0,1386.0,0.0,0.0,1,0.0,1,7,1,1,7,7,1.0,7.0,0,0,28,303,183,0,94.0,9.0,10.0,10.0,10.0,9.0,9.0,2,0,2,0,1.53


In [555]:
numeric_features.columns

Index(['id', 'host_id', 'host_listings_count', 'host_total_listings_count',
       'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'cal

In [556]:
(data['maximum_maximum_nights'] == data['minimum_maximum_nights']).value_counts()

True     7178
False      43
dtype: int64

In [557]:
(data['host_listings_count'] == data['host_total_listings_count']).value_counts()

True     7220
False       1
dtype: int64
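
The single `False` above does not necessarily mean the two columns disagree. Each column has one missing value, and an element-wise `==` treats `NaN` as unequal to everything, including another `NaN`. The sketch below uses toy values (hypothetical, not taken from the dataset) to show the difference between a naive and a NaN-aware comparison:

```python
import numpy as np
import pandas as pd

# Toy columns with hypothetical values; the last row is missing in both.
a = pd.Series([5.0, 13.0, np.nan])
b = pd.Series([5.0, 13.0, np.nan])

# Element-wise == reports False wherever either side is NaN.
naive_equal = (a == b)

# Counting a pair of missing values as a match gives the intended comparison.
nan_aware_equal = (a == b) | (a.isna() & b.isna())

print(naive_equal.tolist())
print(nan_aware_equal.tolist())
print(a.equals(b))  # Series.equals also treats NaNs in the same position as equal
```

If the NaN-aware comparison is `True` for every row of the real data, one of the two columns is redundant and could be dropped.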

In [558]:
numeric_features.isna().sum()

id                                                 0
host_id                                            0
host_listings_count                                1
host_total_listings_count                          1
latitude                                           0
longitude                                          0
accommodates                                       0
bathrooms                                          1
bedrooms                                           6
beds                                              68
price                                              0
weekly_price                                       0
monthly_price                                      0
security_deposit                                   0
cleaning_fee                                       0
guests_included                                    0
extra_people                                       0
minimum_nights                                     0
maximum_nights                                

In [559]:
nominal_features.head()

Unnamed: 0,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,interaction,house_rules,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,city,market,smart_location,is_location_exact,property_type,room_type,bed_type,amenities,calendar_updated,calendar_last_scraped,first_review,last_review,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification
0,2020-05-30,Butterfly Inn - Graceful Living !,Lovely garden setting in a serene and art-fill...,Very comfortable Queen bed and small desk in b...,Lovely garden setting in a serene and art-fill...,"Located in Professorville, Palo Alto, one of t...","Housekeeping every Monday, leave your door ope...",Walking distance to Stanford University (30 mi...,"Kitchen, laundry, family/TV room, garden, free...","Welcome and Orientation. Shared kitchen, garde...",No shoes in the house beyond the hall or laund...,Mei Lin,2009-01-24,"Palo Alto, California, United States","I have lived and worked in Asia, Europe, Austr...",within an hour,100%,98%,t,Palo Alto,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Palo Alto, CA, United States",Palo Alto,Palo Alto,Palo Alto,"South Bay, CA","Palo Alto, CA",t,Villa,Private room,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,Heating,W...",2 months ago,2020-05-30,2009-08-02,2019-01-12,"{""PALO ALTO""}",f,moderate,f,f
1,2020-05-31,Deluxe Private Studio-custom int.,Custom built Studio with exquisite design. Rea...,Description A favorite for international corpo...,Custom built Studio with exquisite design. Rea...,10 min. to Lucky's and Starbucks at El Camino ...,Pet Policy: - Well Behaved pet up to 25 lb. of...,Public transportation at Homestead and Pomeroy...,Complimentary Wifi-internet + Basic Cable,Available any time for support and problem res...,Can house a couple- bed type is a full size be...,Emey,2009-09-30,"Mountain View, California, United States",I pride myself on attention to details and cus...,within an hour,100%,90%,f,Santa Clara,"['email', 'phone', 'reviews']",t,f,"Santa Clara, CA, United States",Santa Clara,Santa Clara,Santa Clara,"South Bay, CA","Santa Clara, CA",t,Apartment,Entire home/apt,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",5 months ago,2020-05-31,2014-06-15,2019-10-11,"{""SANTA CLARA""}",f,moderate,f,t
2,2020-05-31,Silicon Valley Suite,"A guest suite for one or two, in a house in a ...",This is a private suite at the rear of a house...,"A guest suite for one or two, in a house in a ...","This is a very quiet family neighborhood, but ...","I cannot accommodate cats, sorry. I can usuall...","The CalTrain station is in walking distance, a...",Private outdoor patio. Shared washer/dryer on ...,I am usually in town. Guests have a private e...,"This is a quiet neighborhood, and the suite sh...",Janet,2010-01-08,"Lynn Haven, Florida, United States","I am a software engineer, recently retired fro...",within a day,100%,50%,f,Palo Alto,"['email', 'phone', 'reviews', 'kba']",t,t,"Mountain View, CA, United States",Mountain View,Mountain View,Mountain View,"South Bay, CA","Mountain View, CA",t,Guest suite,Entire home/apt,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",4 weeks ago,2020-05-31,2018-11-20,2019-04-27,"{""Mountain View"","" CA""}",f,strict_14_with_grace_period,f,f
3,2020-05-30,Bonsai Garden Inn in Professorville,Room in gracious home with beautiful garden. ...,"Bright, garden-facing room in beautiful home. ...",Room in gracious home with beautiful garden. ...,This room is in an ultra convenient location i...,The family room has a flat panel tv and desk f...,The Stanford Shopping Center is 20 minutes wal...,"Kitchen, Laundry, Garden, Family Room with TV,...",We will see you everyday as we cook meals in t...,No shoes in the house beyond hall or laundry. ...,Mei Lin,2009-01-24,"Palo Alto, California, United States","I have lived and worked in Asia, Europe, Austr...",within an hour,100%,98%,t,Palo Alto,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Palo Alto, CA, United States",Palo Alto,Palo Alto,Palo Alto,"South Bay, CA","Palo Alto, CA",t,Villa,Private room,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Paid par...",3 weeks ago,2020-05-30,2010-05-28,2020-03-14,"{""PALO ALTO""}",f,moderate,f,f
4,2020-05-31,Private room - Parking 3 carport,We live in a safe community close to public tr...,I have a really nice room in a quiet neighborh...,We live in a safe community close to public tr...,Our community is a safe environment and at nig...,Please remember that you are in my home and re...,We have a bus stop right outside our community...,The kitchen and all that it offers. The laund...,We follow our guests lead when it comes to soc...,No drunk and disorderly conduct please. You m...,Maria,2010-07-09,"San Jose, California, United States",I am in a committed relationship with my live...,within a few hours,100%,100%,t,South San Jose,"['email', 'phone', 'facebook', 'reviews', 'man...",t,t,"San Jose, CA, United States",South San Jose,San Jose,San Jose,"South Bay, CA","San Jose, CA",t,Bungalow,Private room,Real Bed,"{Internet,Wifi,""Air conditioning"",Kitchen,""Fre...",23 months ago,2020-05-31,2010-08-01,2017-03-14,"{""SAN JOSE""}",f,moderate,t,t


In [560]:
nominal_features.columns

Index(['last_scraped', 'name', 'summary', 'space', 'description',
       'neighborhood_overview', 'notes', 'transit', 'access', 'interaction',
       'house_rules', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_neighbourhood', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed', 'city', 'market',
       'smart_location', 'is_location_exact', 'property_type', 'room_type',
       'bed_type', 'amenities', 'calendar_updated', 'calendar_last_scraped',
       'first_review', 'last_review', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification'],
      dtype='object')

In [561]:
(nominal_features['last_scraped'] == nominal_features['calendar_last_scraped']).value_counts()

True    7221
dtype: int64

In [562]:
nominal_features.isna().sum()

last_scraped                           0
name                                   1
summary                              299
space                               1707
description                          238
neighborhood_overview               2677
notes                               3529
transit                             3085
access                              3117
interaction                         2968
house_rules                         2264
host_name                              1
host_since                             1
host_location                         23
host_about                          2334
host_response_time                  1807
host_response_rate                  1807
host_acceptance_rate                 714
host_is_superhost                      1
host_neighbourhood                  1181
host_verifications                     0
host_has_profile_pic                   1
host_identity_verified                 1
street                                 0
neighbourhood   

In [563]:
data['room_type'].unique()

array(['Private room', 'Entire home/apt', 'Shared room', 'Hotel room'],
      dtype=object)

In [564]:
accomo_type = data['accommodates'].unique()
accomo_type.sort()
accomo_type

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])

In [565]:
data['bed_type'].unique()

array(['Real Bed', 'Futon', 'Pull-out Sofa', 'Airbed', 'Couch'],
      dtype=object)

In [566]:
data['jurisdiction_names'].unique()

array(['{"PALO ALTO"}', '{"SANTA CLARA"}', '{"Mountain View"," CA"}',
       '{"SAN JOSE"}', '{Cupertino," CA"}', '{Sunnyvale," CA"}', nan,
       '{Milpitas," CA"}', '{"Morgan Hill"," CA"}', '{"Los Gatos"," CA"}',
       '{"San Benito County"," CA"}'], dtype=object)

In [567]:
data['jurisdiction_names'].isna().sum()

979

In [568]:
data['cancellation_policy'].unique()

array(['moderate', 'strict_14_with_grace_period', 'flexible',
       'super_strict_60'], dtype=object)

In [569]:
data['city'].unique()

array(['Palo Alto', 'Santa Clara', 'Mountain View', 'San Jose',
       'Cupertino', 'Sunnyvale', 'Campbell', 'Milpitas', 'Saratoga',
       'Morgan Hill', 'Los Altos', 'Los Gatos', 'Menlo Park',
       'Los Altos Hills', 'Monte Sereno', 'Hollister', 'Gilroy',
       'San Jose ', 'San Martin', 'Stanford', 'Santa Clara County',
       'Sunnyvale ', 'Palo Alto ', 'Milpitas ', 'Los Gatos ', 'san jose',
       'Mountain View ', 'Danville ', nan, 'Campbell ', '洛斯阿尔托斯',
       'santa clara', 'Fremont', '圣何塞', 'Watsonville'], dtype=object)
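
The `city` values above mix trailing whitespace, inconsistent casing, and two names written in Chinese. The following is a minimal sketch of one possible cleanup, run on a small hand-made sample rather than on the real column. The `CITY_ALIASES` mapping and the `clean_city` helper are hypothetical names introduced here, and the two translations are an assumption that should be checked against the listings themselves.

```python
import pandas as pd

# Assumed translations for the two non-English spellings seen in the output.
CITY_ALIASES = {'洛斯阿尔托斯': 'Los Altos', '圣何塞': 'San Jose'}


def clean_city(city: pd.Series) -> pd.Series:
    """Strip whitespace, map known aliases, and unify casing."""
    cleaned = city.str.strip().replace(CITY_ALIASES)
    return cleaned.str.title()


# Toy sample built from values that appear in the unique() output above.
sample = pd.Series(['San Jose ', 'san jose', '圣何塞', 'Palo Alto', None])
cleaned_sample = clean_city(sample)
print(cleaned_sample.tolist())
```

Missing values stay missing, so the `nan` entry in the original column would still need a separate decision.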

In [570]:
# find binary columns:
for col in data.columns:
    if data[col].nunique() == 2:
        print(col)

last_scraped
host_is_superhost
host_has_profile_pic
host_identity_verified
is_location_exact
calendar_last_scraped
instant_bookable
require_guest_profile_picture
require_guest_phone_verification


In [571]:
for col in data.columns:
    if data[col].nunique() == 3:
        print(col)

In [572]:
for col in data.columns:
    if data[col].nunique() == 4:
        print(col)

host_response_time
market
room_type
cancellation_policy


In [573]:
for col in data.columns:
    if data[col].nunique() == 5:
        print(col)

bed_type


In [574]:
for col, n in zip(data.columns, data.nunique()):
    if 1 < n < 10:
        print(col, n)

last_scraped 2
host_response_time 4
host_is_superhost 2
host_has_profile_pic 2
host_identity_verified 2
market 4
is_location_exact 2
room_type 4
bed_type 5
calendar_last_scraped 2
review_scores_accuracy 8
review_scores_cleanliness 8
review_scores_checkin 8
review_scores_communication 8
review_scores_location 7
review_scores_value 8
instant_bookable 2
cancellation_policy 4
require_guest_profile_picture 2
require_guest_phone_verification 2
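
Cardinality alone does not identify the true/false columns: `last_scraped` and `calendar_last_scraped` also have two unique values, but they hold dates. A minimal sketch of a stricter check follows, assuming the flag columns store only the strings 't' and 'f' as in the rows shown earlier. The helper name `to_bool_flags` and the toy frame are hypothetical.

```python
import pandas as pd


def to_bool_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Convert columns whose non-null values are only 't'/'f' to nullable booleans."""
    out = df.copy()
    for col in out.columns:
        values = set(out[col].dropna().unique())
        if values and values <= {'t', 'f'}:
            # The nullable 'boolean' dtype keeps missing entries as <NA>.
            out[col] = out[col].map({'t': True, 'f': False}).astype('boolean')
    return out


# Toy frame: one flag column and one ordinary categorical column.
toy = pd.DataFrame({
    'host_is_superhost': ['t', 'f', None],
    'room_type': ['Private room', 'Shared room', 'Hotel room'],
})
flags = to_bool_flags(toy)
print(flags.dtypes)
```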


## What we have so far

Data cleaning

  - id or host_id: which one should we keep?
  - How should missing prices be handled? Fill NaN with zero?
  - Can we drop one column from each of these pairs?
    - (data['maximum_maximum_nights'] ==  data['minimum_maximum_nights']).value_counts()
    - (data['host_listings_count'] == data['host_total_listings_count']).value_counts()
    - (nominal_features['last_scraped'] == nominal_features['calendar_last_scraped']).value_counts()

---
- The names in neighbourhoods.csv do not all match the neighbourhood values in the listings data
  - subset the data by region?
- Location: clean later
  -  ('latitude', 5882)
  -  ('longitude', 6357)
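
The column-pair checks listed under data cleaning can be wrapped in one helper. This is a minimal sketch on a toy frame, not the real data: `redundant_pairs` is a hypothetical name, and the toy values are chosen only so that one pair matches and one does not.

```python
import pandas as pd


def redundant_pairs(df, pairs):
    """Return the pairs whose two columns agree on every row (NaN counts as equal to NaN)."""
    found = []
    for a, b in pairs:
        same = df[a].eq(df[b]) | (df[a].isna() & df[b].isna())
        if same.all():
            found.append((a, b))
    return found


# Toy frame with one identical pair and one differing pair.
toy = pd.DataFrame({
    'last_scraped': ['2020-05-30', '2020-05-31'],
    'calendar_last_scraped': ['2020-05-30', '2020-05-31'],
    'minimum_maximum_nights': [365, 180],
    'maximum_maximum_nights': [365, 730],
})
pairs = [
    ('last_scraped', 'calendar_last_scraped'),
    ('minimum_maximum_nights', 'maximum_maximum_nights'),
]
dupes = redundant_pairs(toy, pairs)
# Keep the first column of each redundant pair and drop the second.
deduped = toy.drop(columns=[b for _, b in dupes])
print(dupes)
```

Only a pair that agrees on every row is safe to collapse; a pair that agrees on most rows needs a closer look before dropping anything.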

 

### Amenities hot words analysis

In [575]:
from collections import Counter

results = Counter()

In [595]:
amenities = data['amenities']

In [596]:
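# Strip the surrounding braces, remove every double quote, split on commas,
# and feed each listing's amenity list into the Counter.
# (Once all quotes are replaced, no extra lstrip/rstrip of quotes is needed.)
amenities.str.strip('{}')\
               .str.replace('"', '')\
               .str.split(',')\
               .apply(results.update)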

0       [TV, Cable TV, Internet, Wifi, Kitchen, Heatin...
1       [TV, Cable TV, Internet, Wifi, Air conditionin...
2       [TV, Cable TV, Internet, Wifi, Air conditionin...
3       [TV, Cable TV, Internet, Wifi, Kitchen, Paid p...
4       [Internet, Wifi, Air conditioning, Kitchen, Fr...
5       [TV, Cable TV, Internet, Wifi, Pets live on th...
6       [TV, Wifi, Free parking on premises, Pets live...
7       [Internet, Wifi, Free parking on premises, Pet...
8       [Internet, Wifi, Air conditioning, Kitchen, Fr...
9       [TV, Cable TV, Internet, Wifi, Kitchen, Indoor...
10      [TV, Cable TV, Internet, Wifi, Kitchen, Free s...
11      [TV, Cable TV, Internet, Wifi, Air conditionin...
12      [TV, Internet, Wifi, Free parking on premises,...
13      [Cable TV, Internet, Wifi, Free parking on pre...
14      [Wifi, Kitchen, Free parking on premises, Pets...
15      [Internet, Wifi, Air conditioning, Kitchen, Fr...
16      [Wifi, Kitchen, Free parking on premises, Free...
17      [Wifi,

In [578]:
# results

In [579]:
sub_df = pd.DataFrame(results.most_common(50), columns=['amenities', 'count'])

In [580]:
sub_df['ratio'] = round(sub_df['count'] / len(amenities),4)

In [581]:
sub_df.sort_values(by=['count'], ascending=False)

Unnamed: 0,amenities,count,ratio
0,Wifi,7071,0.9792
1,Essentials,6904,0.9561
2,Heating,6824,0.945
3,Smoke alarm,6794,0.9409
4,Hangers,6480,0.8974
5,Shampoo,6271,0.8684
6,Carbon monoxide alarm,6178,0.8556
7,Laptop-friendly workspace,5998,0.8306
8,Hair dryer,5997,0.8305
9,Kitchen,5890,0.8157
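
Because a `Counter` keeps accumulating, re-running the `apply(results.update)` cell without resetting `results` would double the counts. The sketch below is an alternative that recomputes the table from scratch with `explode` and `value_counts`. It is shown on a toy series, and `amenity_counts` is a hypothetical helper name.

```python
import pandas as pd


def amenity_counts(amenities: pd.Series) -> pd.DataFrame:
    """Count how many listings mention each amenity, plus the share of all listings."""
    tokens = (amenities.str.strip('{}')
                       .str.replace('"', '', regex=False)
                       .str.split(',')
                       .explode())
    # Drop missing values and the empty token produced by an empty '{}' entry.
    tokens = tokens[tokens.notna() & (tokens != '')]
    counts = tokens.value_counts()
    return pd.DataFrame({
        'amenities': counts.index,
        'count': counts.to_numpy(),
        'ratio': (counts / len(amenities)).round(4).to_numpy(),
    })


# Toy series in the same '{...}' format as the amenities column.
toy = pd.Series(['{TV,"Cable TV",Wifi}', '{Wifi,Kitchen}', '{}'])
table = amenity_counts(toy)
print(table)
```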


## Fusion with reviews

In [582]:
reviews = pd.read_csv('./data/reviews.csv')
print(reviews.shape)
reviews.head()

(212513, 2)


Unnamed: 0,listing_id,date
0,4952,2009-08-02
1,4952,2009-09-04
2,4952,2009-10-16
3,4952,2009-12-10
4,4952,2010-06-08


In [583]:
rev_row = reviews['listing_id'].unique()

In [584]:
data_rev = data.loc[data['id'].isin(rev_row)]
data_rev.head()

Unnamed: 0,id,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,interaction,house_rules,host_id,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,city,market,smart_location,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,4952,2020-05-30,Butterfly Inn - Graceful Living !,Lovely garden setting in a serene and art-fill...,Very comfortable Queen bed and small desk in b...,Lovely garden setting in a serene and art-fill...,"Located in Professorville, Palo Alto, one of t...","Housekeeping every Monday, leave your door ope...",Walking distance to Stanford University (30 mi...,"Kitchen, laundry, family/TV room, garden, free...","Welcome and Orientation. Shared kitchen, garde...",No shoes in the house beyond the hall or laund...,7054,Mei Lin,2009-01-24,"Palo Alto, California, United States","I have lived and worked in Asia, Europe, Austr...",within an hour,100%,98%,t,Palo Alto,5.0,5.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Palo Alto, CA, United States",Palo Alto,Palo Alto,Palo Alto,"South Bay, CA","Palo Alto, CA",37.43972,-122.15672,t,Villa,Private room,2,2.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,Heating,W...",90.0,440.0,1428.0,0.0,35.0,1,45.0,2,365,2,2,365,365,2.0,365.0,2 months ago,0,0,0,0,2020-05-30,75,0,2009-08-02,2019-01-12,96.0,10.0,10.0,10.0,10.0,10.0,10.0,"{""PALO ALTO""}",f,moderate,f,f,5,0,5,0,0.57
1,11464,2020-05-31,Deluxe Private Studio-custom int.,Custom built Studio with exquisite design. Rea...,Description A favorite for international corpo...,Custom built Studio with exquisite design. Rea...,10 min. to Lucky's and Starbucks at El Camino ...,Pet Policy: - Well Behaved pet up to 25 lb. of...,Public transportation at Homestead and Pomeroy...,Complimentary Wifi-internet + Basic Cable,Available any time for support and problem res...,Can house a couple- bed type is a full size be...,42458,Emey,2009-09-30,"Mountain View, California, United States",I pride myself on attention to details and cus...,within an hour,100%,90%,f,Santa Clara,13.0,13.0,"['email', 'phone', 'reviews']",t,f,"Santa Clara, CA, United States",Santa Clara,Santa Clara,Santa Clara,"South Bay, CA","Santa Clara, CA",37.34342,-121.98829,t,Apartment,Entire home/apt,1,1.0,,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",85.0,0.0,2100.0,700.0,250.0,1,10.0,14,365,14,14,365,365,14.0,365.0,5 months ago,30,60,90,179,2020-05-31,6,1,2014-06-15,2019-10-11,87.0,10.0,10.0,10.0,9.0,9.0,9.0,"{""SANTA CLARA""}",f,moderate,f,t,12,12,0,0,0.08
2,17884,2020-05-31,Silicon Valley Suite,"A guest suite for one or two, in a house in a ...",This is a private suite at the rear of a house...,"A guest suite for one or two, in a house in a ...","This is a very quiet family neighborhood, but ...","I cannot accommodate cats, sorry. I can usuall...","The CalTrain station is in walking distance, a...",Private outdoor patio. Shared washer/dryer on ...,I am usually in town. Guests have a private e...,"This is a quiet neighborhood, and the suite sh...",69023,Janet,2010-01-08,"Lynn Haven, Florida, United States","I am a software engineer, recently retired fro...",within a day,100%,50%,f,Palo Alto,2.0,2.0,"['email', 'phone', 'reviews', 'kba']",t,t,"Mountain View, CA, United States",Mountain View,Mountain View,Mountain View,"South Bay, CA","Mountain View, CA",37.38612,-122.07357,t,Guest suite,Entire home/apt,2,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",65.0,1400.0,2800.0,1000.0,0.0,1,0.0,30,180,30,30,180,180,30.0,180.0,4 weeks ago,29,59,89,179,2020-05-31,2,0,2018-11-20,2019-04-27,90.0,8.0,10.0,10.0,10.0,10.0,10.0,"{""Mountain View"","" CA""}",f,strict_14_with_grace_period,f,f,2,2,0,0,0.11
3,21373,2020-05-30,Bonsai Garden Inn in Professorville,Room in gracious home with beautiful garden. ...,"Bright, garden-facing room in beautiful home. ...",Room in gracious home with beautiful garden. ...,This room is in an ultra convenient location i...,The family room has a flat panel tv and desk f...,The Stanford Shopping Center is 20 minutes wal...,"Kitchen, Laundry, Garden, Family Room with TV,...",We will see you everyday as we cook meals in t...,No shoes in the house beyond hall or laundry. ...,7054,Mei Lin,2009-01-24,"Palo Alto, California, United States","I have lived and worked in Asia, Europe, Austr...",within an hour,100%,98%,t,Palo Alto,5.0,5.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,f,"Palo Alto, CA, United States",Palo Alto,Palo Alto,Palo Alto,"South Bay, CA","Palo Alto, CA",37.44001,-122.15667,t,Villa,Private room,1,2.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Paid par...",99.0,600.0,2000.0,0.0,35.0,1,50.0,2,730,2,2,730,730,2.0,730.0,3 weeks ago,0,0,0,212,2020-05-30,264,39,2010-05-28,2020-03-14,97.0,10.0,10.0,10.0,10.0,10.0,10.0,"{""PALO ALTO""}",f,moderate,f,f,5,0,5,0,2.17
4,37512,2020-05-31,Private room - Parking 3 carport,We live in a safe community close to public tr...,I have a really nice room in a quiet neighborh...,We live in a safe community close to public tr...,Our community is a safe environment and at nig...,Please remember that you are in my home and re...,We have a bus stop right outside our community...,The kitchen and all that it offers. The laund...,We follow our guests lead when it comes to soc...,No drunk and disorderly conduct please. You m...,161741,Maria,2010-07-09,"San Jose, California, United States",I am in a committed relationship with my live...,within a few hours,100%,100%,t,South San Jose,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'man...",t,t,"San Jose, CA, United States",South San Jose,San Jose,San Jose,"South Bay, CA","San Jose, CA",37.30598,-121.84036,t,Bungalow,Private room,2,1.0,1.0,1.0,Real Bed,"{Internet,Wifi,""Air conditioning"",Kitchen,""Fre...",52.0,393.0,1386.0,0.0,0.0,1,0.0,1,7,1,1,7,7,1.0,7.0,23 months ago,0,0,28,303,2020-05-31,183,0,2010-08-01,2017-03-14,94.0,9.0,10.0,10.0,10.0,9.0,9.0,"{""SAN JOSE""}",f,moderate,t,t,2,0,2,0,1.53


In [585]:
data_rev.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5734 entries, 0 to 7214
Data columns (total 86 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            5734 non-null   int64  
 1   last_scraped                                  5734 non-null   object 
 2   name                                          5734 non-null   object 
 3   summary                                       5563 non-null   object 
 4   space                                         4571 non-null   object 
 5   description                                   5606 non-null   object 
 6   neighborhood_overview                         3884 non-null   object 
 7   notes                                         3191 non-null   object 
 8   transit                                       3618 non-null   object 
 9   access                                        3642 non-null   o

In [586]:
data_rev.describe(include='all')

Unnamed: 0,id,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,interaction,house_rules,host_id,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,city,market,smart_location,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,5734.0,5734,5734,5563,4571,5606,3884,3191,3618,3642,3640,4196,5734.0,5733,5733,5721,4050,4470,4470,5419,5733,4859,5733.0,5733.0,5734,5733,5733,5734,5365,5734,5733,5725,5734,5734.0,5734.0,5734,5734,5734,5734.0,5734.0,5729.0,5707.0,5734,5734,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734.0,5734,5734.0,5734.0,5734.0,5734.0,5734,5734.0,5734.0,5707,5707,5651.0,5649.0,5649.0,5648.0,5649.0,5648.0,5648.0,5006,5734,5734,5734,5734,5734.0,5734.0,5734.0,5734.0,5707.0
unique,,2,5615,4920,3791,5224,3066,2479,2834,3047,2780,3231,,1812,1864,181,1791,4,41,74,2,114,,,222,2,2,32,36,16,30,4,32,,,2,27,4,,,,,5,5247,,,,,,,,,,,,,,,,59,,,,,2,,,1909,787,,,,,,,,10,2,4,2,2,,,,,
top,,2020-05-31,TRIBE ❤️ Deep Cleaned 2BR @ Great Mall Milpitas,The HackerHome Network New to the Valley? Tran...,"At Zeus, we focus exclusively on 30 day+ stays...","Dorm-style house in Silicon Valley. No-host, 5...",*Area Highlights* -Located in the Heart of the...,Parking is not included with your stay. Parkin...,This Flat Location is Uber/Lyft Accessible & C...,"• All utilities are included, including gas, e...",We're available 24/7 via Text | Email | Emerge...,By reserving your stay with Tribe you hereby a...,,Zeus,2015-11-02,"San Jose, California, United States",Zeus is reinventing corporate housing. We do t...,within an hour,100%,100%,t,Santa Clara,,,"['email', 'phone', 'reviews', 'kba']",t,f,"San Jose, CA, United States",Palo Alto,San Jose,San Jose,"South Bay, CA","San Jose, CA",,,t,House,Private room,,,,,Real Bed,"{Wifi,Kitchen,""Free parking on premises"",Break...",,,,,,,,,,,,,,,,3 months ago,,,,,2020-05-31,,,2019-05-10,2020-03-13,,,,,,,,"{""SAN JOSE""}",f,moderate,f,f,,,,,
freq,,3778,21,44,242,20,28,41,59,26,56,120,,242,243,1648,242,3362,3717,1947,2875,581,,,401,5730,3120,2431,666,2320,2431,4552,2431,,,5414,3046,2781,,,,,5676,19,,,,,,,,,,,,,,,,1007,,,,,3778,,,21,97,,,,,,,,2257,3262,2002,5532,5547,,,,,
mean,24618250.0,,,,,,,,,,,,84285940.0,,,,,,,,,,102.008373,102.008373,,,,,,,,,,37.352121,-121.96665,,,,3.115452,1.371992,1.400768,1.816015,,,138.385246,87.096268,305.895361,254.594698,80.045692,1.74189,14.805023,8.277293,375197.7,7.70858,9.726718,375355.9,749874.6,9.15422,688019.4,,13.247646,29.492675,47.013777,158.786188,,36.578828,11.436693,,,95.310918,9.727562,9.580457,9.81852,9.796955,9.797805,9.531339,,,,,,26.45361,22.485699,2.877224,1.058598,1.359872
std,12096880.0,,,,,,,,,,,,90378140.0,,,,,,,,,,405.285405,405.285405,,,,,,,,,,0.065509,0.109539,,,,2.365797,0.678344,0.981305,1.535468,,,209.639023,393.717528,1361.610384,491.86586,97.015465,1.649889,22.454505,30.368753,28359650.0,31.332788,37.62408,28359650.0,40103100.0,36.784421,36940220.0,,13.503198,26.329598,38.790671,141.436769,,54.834069,16.156561,,,7.749835,0.745349,0.848019,0.643308,0.691166,0.584477,0.83767,,,,,,83.43773,83.821405,6.912066,5.849206,1.535631
min,4952.0,,,,,,,,,,,,7054.0,,,,,,,,,,0.0,0.0,,,,,,,,,,36.9656,-122.18868,,,,1.0,0.0,0.0,0.0,,,10.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,0.0,0.0,0.0,0.0,,0.0,0.0,,,20.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,,,1.0,0.0,0.0,0.0,0.01
25%,14890020.0,,,,,,,,,,,,14809140.0,,,,,,,,,,1.0,1.0,,,,,,,,,,37.32029,-122.043607,,,,2.0,1.0,1.0,1.0,,,60.0,0.0,0.0,0.0,20.0,1.0,0.0,1.0,60.0,1.0,1.0,365.0,365.0,1.0,365.0,,0.0,0.0,0.0,1.0,,4.0,1.0,,,94.0,10.0,9.0,10.0,10.0,10.0,9.0,,,,,,1.0,0.0,0.0,0.0,0.32
50%,25982720.0,,,,,,,,,,,,48005490.0,,,,,,,,,,3.0,3.0,,,,,,,,,,37.35774,-121.96194,,,,2.0,1.0,1.0,1.0,,,97.0,0.0,0.0,100.0,50.0,1.0,10.0,2.0,1125.0,2.0,2.0,1125.0,1125.0,2.0,1125.0,,8.0,29.0,59.0,135.0,,16.0,5.0,,,98.0,10.0,10.0,10.0,10.0,10.0,10.0,,,,,,3.0,1.0,1.0,0.0,0.83
75%,35007040.0,,,,,,,,,,,,135143800.0,,,,,,,,,,9.0,9.0,,,,,,,,,,37.39832,-121.886337,,,,4.0,1.5,2.0,2.0,,,165.0,0.0,0.0,350.0,100.0,2.0,20.0,5.0,1125.0,4.0,6.0,1125.0,1125.0,5.0,1125.0,,29.0,59.0,89.0,318.75,,44.0,15.0,,,100.0,10.0,10.0,10.0,10.0,10.0,10.0,,,,,,8.0,2.0,3.0,0.0,1.81


In [587]:
counts = reviews['listing_id'].value_counts()


In [588]:
counts.head()

7476637     489
52786       478
10814836    445
19641513    429
13828514    427
Name: listing_id, dtype: int64
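
The per-listing counts can be attached to the listings table with a left merge, so that listings without any review keep a count of zero instead of disappearing. This is a minimal sketch on toy frames: the helper `attach_review_counts`, the column name `reviews_in_file`, the listing with id 1, and the single review date for listing 11464 are all made up for illustration.

```python
import pandas as pd


def attach_review_counts(listings: pd.DataFrame, reviews: pd.DataFrame) -> pd.DataFrame:
    """Left-join the number of rows per listing_id in reviews onto listings."""
    counts = (reviews.groupby('listing_id').size()
                     .rename('reviews_in_file')
                     .reset_index())
    merged = listings.merge(counts, how='left', left_on='id', right_on='listing_id')
    merged = merged.drop(columns='listing_id')
    # Listings that never appear in the reviews file get a count of 0.
    merged['reviews_in_file'] = merged['reviews_in_file'].fillna(0).astype(int)
    return merged


listings = pd.DataFrame({'id': [4952, 11464, 1]})  # id 1 is a made-up listing
reviews = pd.DataFrame({
    'listing_id': [4952, 4952, 4952, 4952, 4952, 11464],
    'date': ['2009-08-02', '2009-09-04', '2009-10-16',
             '2009-12-10', '2010-06-08', '2014-06-15'],
})
merged = attach_review_counts(listings, reviews)
print(merged)
```

Comparing the merged count with the existing number_of_reviews column would be one way to check that the two files describe the same scrape.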

## Fusion with neighborhood data

In [589]:
neighbors = pd.read_csv('./data/neighbourhoods.csv')
print(neighbors.shape)
neighbors.head()

(16, 2)


Unnamed: 0,neighbourhood_group,neighbourhood
0,,Campbell
1,,Cupertino
2,,Gilroy
3,,Los Altos
4,,Los Altos Hills


In [590]:
data_neighbor = data['neighbourhood'].unique().tolist()
neighbors_hood = neighbors['neighbourhood'].unique().tolist()

In [591]:
same_locations = []

for neighbor in neighbors_hood:
    if neighbor in data_neighbor:
        same_locations.append(neighbor)

In [592]:
same_locations

['Campbell',
 'Cupertino',
 'Los Altos',
 'Los Altos Hills',
 'Mountain View',
 'Palo Alto',
 'Santa Clara',
 'Sunnyvale']

In [593]:
data_neighbor

['Palo Alto',
 'Santa Clara',
 'Mountain View',
 'South San Jose',
 'Cupertino',
 'Sunnyvale',
 'Downtown',
 'Campbell',
 'West Valley',
 'Edenvale',
 nan,
 'Willow Glen',
 'Central San Jose',
 'Los Altos',
 'Berryessa',
 'Cory',
 'College Park',
 'Alum Rock',
 'Cambrian/Pioneer',
 'North San Jose',
 'Los Altos Hills',
 'Burbank/Del Monte',
 'Evergreen',
 'Newhall/Sherwood',
 'Shasta/Hanchett Park',
 'Rose Garden',
 'Five Wounds/Brookwood Terrace',
 'Naglee Park',
 'Forest/Pruneridge',
 'Alviso',
 'Japantown',
 'Stanford',
 'Delmas Park',
 'Chapman/Morse',
 'Menlo Park',
 'Autumn/Montgomery',
 'Vermont/McKendrie']

In [594]:
neighbors_hood

['Campbell',
 'Cupertino',
 'Gilroy',
 'Los Altos',
 'Los Altos Hills',
 'Los Gatos',
 'Milpitas',
 'Monte Sereno',
 'Morgan Hill',
 'Mountain View',
 'Palo Alto',
 'San Jose',
 'Santa Clara',
 'Saratoga',
 'Sunnyvale',
 'Unincorporated Areas']
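
The loop above only collects the shared names. Set operations also show which names exist on one side only, which is the part that matters for the mismatch. The sketch below uses a few values copied from the two lists; `compare_labels` is a hypothetical helper.

```python
def compare_labels(listing_values, reference_values):
    """Split two label collections into shared and one-sided names."""
    # Keep only strings so that NaN (a float) is ignored.
    listing_set = {v for v in listing_values if isinstance(v, str)}
    reference_set = {v for v in reference_values if isinstance(v, str)}
    return {
        'shared': sorted(listing_set & reference_set),
        'only_in_listings': sorted(listing_set - reference_set),
        'only_in_reference': sorted(reference_set - listing_set),
    }


# A few values copied from the two lists above.
data_sample = ['Palo Alto', 'South San Jose', 'Downtown', float('nan'), 'Campbell']
reference_sample = ['Campbell', 'Gilroy', 'Palo Alto', 'San Jose']

report = compare_labels(data_sample, reference_sample)
print(report)
```

The describe() output earlier reports 16 unique values for neighbourhood_cleansed, the same number as rows in neighbourhoods.csv, so that column may be the better join key. This is an untested guess.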


## Ideas list

### Methodology for data at this scale?


### Which columns do we need?

- e.g.

  id
  name
  host_id
  host_name

  neighbourhood_group
  neighbourhood

  latitude
  longitude

  room_type

  price

  minimum_nights
  number_of_reviews

  reviews_per_month
  calculated_host_listings_count
  availability_365
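
A short sketch of how such a column list could be applied without failing on names that are absent. In the header of Airbnb_Listings.csv shown at the top, only neighbourhood_group_cleansed appears, not neighbourhood_group, so that name would be reported as missing. `WANTED` and `select_available` are hypothetical names, and the toy frame holds only two of the rows displayed earlier.

```python
import pandas as pd

WANTED = [
    'id', 'name', 'host_id', 'host_name',
    'neighbourhood_group', 'neighbourhood',
    'latitude', 'longitude', 'room_type', 'price',
    'minimum_nights', 'number_of_reviews', 'reviews_per_month',
    'calculated_host_listings_count', 'availability_365',
]


def select_available(df: pd.DataFrame, wanted):
    """Return a copy with the wanted columns that exist, plus the names that were not found."""
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return df[present].copy(), missing


# Toy frame with a handful of columns, including one that is not wanted.
toy = pd.DataFrame({
    'id': [4952, 11464],
    'host_id': [7054, 42458],
    'price': [90.0, 85.0],
    'room_type': ['Private room', 'Entire home/apt'],
    'summary': ['Lovely garden setting', 'Custom built Studio'],
})
subset, missing = select_available(toy, WANTED)
print(list(subset.columns), missing)
```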

### What to compare

- Room Type vs Price
- Price/log price Distribution Plot
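
Both comparisons can be prepared numerically before any plotting. The sketch below uses the room_type and price values of the five listings displayed earlier. `np.log1p` is one possible choice for the log transform (it also tolerates a price of zero), not something the notes prescribe.

```python
import numpy as np
import pandas as pd

# room_type and price of the five listings shown in the table above.
toy = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt', 'Entire home/apt',
                  'Private room', 'Private room'],
    'price': [90.0, 85.0, 65.0, 99.0, 52.0],
})

# Room type vs price: count, median and mean per room type.
summary = toy.groupby('room_type')['price'].agg(['count', 'median', 'mean'])

# Log price: log(1 + price) compresses the long right tail of prices.
toy['log_price'] = np.log1p(toy['price'])

print(summary)
print(toy[['price', 'log_price']])
```

The summary table and the log_price column could then be handed to the plotting libraries imported at the top, for example one box per room type and one histogram each for price and log price.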


### Pipeline?



## Reference code

- [amenities reference](https://www.kaggle.com/brittabettendorf/predicting-prices-xgboost-feature-engineering)
- [word cloud / weighted words / sentiment analysis / topic modeling](https://www.kaggle.com/brittabettendorf/nlp-on-airbnb-data)
