# Bay Area, CA - Airbnb Data


## Context

Since its inception in 2008, Airbnb has disrupted the hospitality industry by allowing almost anyone to rent out a spare room and host travelers looking for an overnight stay. 

While Airbnb has publicly available data from many locations, **Santa Clara County is of particular interest as it is one of the major counties in the Bay Area and home to many prominent tech companies and startups in Silicon Valley.**
As such, home prices and cost of living are exceptionally high in this area, as one will see in the data.


## Content

All data is publicly available under the Creative Commons "Public Domain Dedication" license and has been updated as of June 12th, 2020. Provided are metrics that are publicly visible for each listing (e.g. name, description, price, reviews, etc.), but the dataset contains some of Airbnb's internal metrics as well (review score accuracy, host acceptance rate, etc.).

## Acknowledgements

Acknowledgments go to Airbnb for their publicly released datasets that are available at this website.




## Inspiration & ideas:

1. Create a price-suggestion model for new Airbnb hosts who might not know the value of their listing.
    - Text mining with stratified house price data
2. Can we predict the rating of an Airbnb listing utilizing **NLP** of the description columns?
3. How have Airbnb prices changed over time? Are prices seasonal?
   - Time series analysis?
4. Which areas in Santa Clara County are most "popular"?
   - Region analysis
5. Which features of an Airbnb listing are important to add to its perceived value?
   - Feature analysis, PCA
   - Correlation analysis
   

## Import libraries and data

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [78]:
data = pd.read_csv('./data/Airbnb_Listings.csv')
data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,4952,https://www.airbnb.com/rooms/4952,20200530151957,2020-05-30,Butterfly Inn - Graceful Living !,Lovely garden setting in a serene and art-fill...,Very comfortable Queen bed and small desk in b...,Lovely garden setting in a serene and art-fill...,none,"Located in Professorville, Palo Alto, one of t...",...,f,f,moderate,f,f,5,0,5,0,0.57
1,11464,https://www.airbnb.com/rooms/11464,20200530151957,2020-05-31,Deluxe Private Studio-custom int.,Custom built Studio with exquisite design. Rea...,Description A favorite for international corpo...,Custom built Studio with exquisite design. Rea...,none,10 min. to Lucky's and Starbucks at El Camino ...,...,f,f,moderate,f,t,12,12,0,0,0.08
2,17884,https://www.airbnb.com/rooms/17884,20200530151957,2020-05-31,Silicon Valley Suite,"A guest suite for one or two, in a house in a ...",This is a private suite at the rear of a house...,"A guest suite for one or two, in a house in a ...",none,"This is a very quiet family neighborhood, but ...",...,f,f,strict_14_with_grace_period,f,f,2,2,0,0,0.11
3,21373,https://www.airbnb.com/rooms/21373,20200530151957,2020-05-30,Bonsai Garden Inn in Professorville,Room in gracious home with beautiful garden. ...,"Bright, garden-facing room in beautiful home. ...",Room in gracious home with beautiful garden. ...,none,This room is in an ultra convenient location i...,...,f,f,moderate,f,f,5,0,5,0,2.17
4,37512,https://www.airbnb.com/rooms/37512,20200530151957,2020-05-31,Private room - Parking 3 carport,We live in a safe community close to public tr...,I have a really nice room in a quiet neighborh...,We live in a safe community close to public tr...,none,Our community is a safe environment and at nig...,...,f,f,moderate,t,t,2,0,2,0,1.53


In [79]:
data.shape

(7221, 106)

In [80]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7221 entries, 0 to 7220
Columns: 106 entries, id to reviews_per_month
dtypes: float64(23), int64(21), object(62)
memory usage: 5.8+ MB


## Data cleaning

In [81]:
# find missing values and single value columns in data:
del_cols = []

for col in data.columns:
    if data[col].isna().all():
        del_cols.append(col)
    elif data[col].nunique() == 1:
        del_cols.append(col)

In [82]:
del_cols

['scrape_id',
 'experiences_offered',
 'thumbnail_url',
 'medium_url',
 'xl_picture_url',
 'neighbourhood_group_cleansed',
 'country_code',
 'country',
 'has_availability',
 'requires_license',
 'license',
 'is_business_travel_ready']

In [83]:
data = data.drop(del_cols, axis=1)
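
The loop above flags a column when it is entirely missing or holds exactly one distinct non-null value. The same screen can be sketched in vectorized form; the toy frame and its column names below are made up for illustration and are not part of the dataset. Note that `nunique()` ignores NaN by default, so a column with one value plus some gaps is flagged as well.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (hypothetical columns).
toy = pd.DataFrame({
    "all_missing": [np.nan, np.nan, np.nan],
    "constant": ["US", "US", "US"],
    "constant_with_gaps": ["t", None, "t"],
    "varied": [1, 2, 3],
})

# nunique() ignores NaN by default, so this covers both the all-missing
# case (0 distinct values) and the single-value case (1 distinct value).
n_distinct = toy.nunique()
del_cols = n_distinct[n_distinct <= 1].index.tolist()

print(del_cols)
```

If a column such as `constant_with_gaps` should be kept because the gaps themselves carry information, `toy.nunique(dropna=False)` would count NaN as its own value.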

In [84]:
data['id'].isna().sum()

0

In [85]:
# drop URL columns:
urls = [
    'listing_url', 'picture_url', 'host_url', 'host_thumbnail_url',
    'host_picture_url'
]


In [86]:
data = data.drop(urls, axis=1)

In [87]:
for col in data.columns:
    list_unique = (col, len(data[col].unique()))
    print(list_unique)

('id', 7221)
('last_scraped', 2)
('name', 7078)
('summary', 6052)
('space', 4403)
('description', 6440)
('neighborhood_overview', 3509)
('notes', 2747)
('transit', 3200)
('access', 3378)
('interaction', 3166)
('house_rules', 3718)
('host_id', 3533)
('host_name', 2103)
('host_since', 2085)
('host_location', 242)
('host_about', 2030)
('host_response_time', 5)
('host_response_rate', 44)
('host_acceptance_rate', 78)
('host_is_superhost', 3)
('host_neighbourhood', 145)
('host_listings_count', 62)
('host_total_listings_count', 62)
('host_verifications', 240)
('host_has_profile_pic', 3)
('host_identity_verified', 3)
('street', 37)
('neighbourhood', 37)
('neighbourhood_cleansed', 16)
('city', 35)
('state', 5)
('zipcode', 83)
('market', 5)
('smart_location', 37)
('latitude', 5882)
('longitude', 6357)
('is_location_exact', 2)
('property_type', 28)
('room_type', 4)
('accommodates', 16)
('bathrooms', 16)
('bedrooms', 11)
('beds', 21)
('bed_type', 5)
('amenities', 6396)
('square_feet', 25)
('price'

In [88]:
data['property_type'].value_counts()

House                 3715
Apartment             1152
Serviced apartment     504
Townhouse              429
Guest suite            363
Guesthouse             318
Condominium            303
Villa                  129
Bungalow               115
Loft                    41
Cottage                 29
Camper/RV               27
Boutique hotel          23
Tiny house              16
Other                   13
Bed and breakfast       12
Cabin                    7
Farm stay                6
Tent                     5
Treehouse                3
Yurt                     3
Barn                     2
Train                    1
Lighthouse               1
Campsite                 1
Earth house              1
Aparthotel               1
Chalet                   1
Name: property_type, dtype: int64

In [89]:
data['state'].unique() # to drop

array(['CA', 'Ca', 'ca', nan, 'California '], dtype=object)

In [90]:
data_null = data.isnull().sum()

In [91]:
data_null.sort_values(ascending=False)

square_feet               7192
weekly_price              6647
monthly_price             6629
notes                     3529
access                    3117
                          ... 
maximum_minimum_nights       0
minimum_maximum_nights       0
maximum_maximum_nights       0
minimum_nights_avg_ntm       0
bed_type                     0
Length: 89, dtype: int64

In [92]:
data = data.drop(['square_feet','zipcode','state'], axis=1)

In [93]:
data.columns

Index(['id', 'last_scraped', 'name', 'summary', 'space', 'description',
       'neighborhood_overview', 'notes', 'transit', 'access', 'interaction',
       'house_rules', 'host_id', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed', 'city', 'market',
       'smart_location', 'latitude', 'longitude', 'is_location_exact',
       'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'bed_type', 'amenities', 'price', 'weekly_price',
       'monthly_price', 'security_deposit', 'cleaning_fee', 'guests_included',
       'extra_people', 'minimum_nights', 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       '

### Convert currency strings to float

In [94]:
price_list = [
    'weekly_price',
    'monthly_price',
    'price',
    'security_deposit',
    'cleaning_fee',
    'extra_people',
]

In [95]:
for p in price_list: 
    print(p,data[p].isna().sum())

weekly_price 6647
monthly_price 6629
price 0
security_deposit 1457
cleaning_fee 800
extra_people 0


In [96]:
for p in price_list:
    # Missing amounts are treated as $0.00 before conversion.
    data[p] = data[p].fillna('$0.00')
    # regex=False so that '$' is matched literally rather than as a regex anchor.
    data[p] = data[p].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)


In [97]:
data[price_list].isna().sum()

weekly_price        0
monthly_price       0
price               0
security_deposit    0
cleaning_fee        0
extra_people        0
dtype: int64
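
The conversion cell above fills missing amounts with `$0.00`, which makes a missing cleaning fee indistinguishable from a free one. A minimal alternative sketch, assuming one would rather keep missing values as NaN; `currency_to_float` is a hypothetical helper name and the sample values are made up.

```python
import pandas as pd

def currency_to_float(series: pd.Series) -> pd.Series:
    """Strip '$' and thousands separators, then convert to float.

    Missing entries stay NaN instead of being turned into 0.0.
    """
    cleaned = (
        series.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
    )
    # errors="coerce" turns anything unparseable into NaN as well.
    return pd.to_numeric(cleaned, errors="coerce")

sample = pd.Series(["$1,250.00", "$85.00", None])
parsed = currency_to_float(sample)
print(parsed.tolist())
```

Whether NaN or zero is the better choice depends on the column: for `weekly_price` and `monthly_price`, where most rows are missing, zero would drag every summary statistic down.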

In [98]:
data.describe(include='all')

Unnamed: 0,id,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,...,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,7221.0,7221,7220,6922,5514,6983,4544,3692,4136,4104,...,6242,7221,7221,7221,7221,7221.0,7221.0,7221.0,7221.0,5912.0
unique,,2,7077,6051,4402,6439,3508,2746,3199,3377,...,10,2,4,2,2,,,,,
top,,2020-05-31,TRIBE ❤️ Deep Cleaned 2BR @ Great Mall Milpitas,The HackerHome Network New to the Valley? Tran...,"At Zeus, we focus exclusively on 30 day+ stays...","Dorm-style house in Silicon Valley. No-host, 5...",*Area Highlights* -Located in the Heart of the...,Parking is not included with your stay. Parkin...,This Flat Location is Uber/Lyft Accessible & C...,"• All utilities are included, including gas, e...",...,"{""SAN JOSE""}",f,flexible,f,f,,,,,
freq,,4782,21,44,411,20,31,41,67,29,...,2810,3987,2817,6991,6993,,,,,
mean,26239220.0,,,,,,,,,,...,,,,,,32.048608,28.216729,2.775793,1.029359,1.357124
std,12364200.0,,,,,,,,,,...,,,,,,95.159193,95.714488,6.810433,5.741521,1.530782
min,4952.0,,,,,,,,,,...,,,,,,1.0,0.0,0.0,0.0,0.01
25%,16242420.0,,,,,,,,,,...,,,,,,1.0,0.0,0.0,0.0,0.32
50%,28184320.0,,,,,,,,,,...,,,,,,3.0,1.0,1.0,0.0,0.83
75%,37499480.0,,,,,,,,,,...,,,,,,8.0,2.0,3.0,0.0,1.81


## Data Exploration

### Check the numerical and nominal data separately

In [99]:
numeric_features = data.select_dtypes(include=['int64', 'float64'])
nominal_features = data.select_dtypes(include=['object'])

In [100]:
numeric_features.head()

Unnamed: 0,id,host_id,host_listings_count,host_total_listings_count,latitude,longitude,accommodates,bathrooms,bedrooms,beds,...,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,4952,7054,5.0,5.0,37.43972,-122.15672,2,2.0,1.0,1.0,...,10.0,10.0,10.0,10.0,10.0,5,0,5,0,0.57
1,11464,42458,13.0,13.0,37.34342,-121.98829,1,1.0,,1.0,...,10.0,10.0,9.0,9.0,9.0,12,12,0,0,0.08
2,17884,69023,2.0,2.0,37.38612,-122.07357,2,1.0,1.0,1.0,...,10.0,10.0,10.0,10.0,10.0,2,2,0,0,0.11
3,21373,7054,5.0,5.0,37.44001,-122.15667,1,2.0,1.0,1.0,...,10.0,10.0,10.0,10.0,10.0,5,0,5,0,2.17
4,37512,161741,2.0,2.0,37.30598,-121.84036,2,1.0,1.0,1.0,...,10.0,10.0,10.0,9.0,9.0,2,0,2,0,1.53


In [101]:
numeric_features.columns

Index(['id', 'host_id', 'host_listings_count', 'host_total_listings_count',
       'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'cal

In [204]:
(data['maximum_maximum_nights'] ==  data['minimum_maximum_nights']).value_counts()

True     7178
False      43
dtype: int64

In [103]:
(data['host_listings_count'] == data['host_total_listings_count']).value_counts()

True     7220
False       1
dtype: int64

In [104]:
numeric_features.isna().sum()

id                                                 0
host_id                                            0
host_listings_count                                1
host_total_listings_count                          1
latitude                                           0
longitude                                          0
accommodates                                       0
bathrooms                                          1
bedrooms                                           6
beds                                              68
price                                              0
weekly_price                                       0
monthly_price                                      0
security_deposit                                   0
cleaning_fee                                       0
guests_included                                    0
extra_people                                       0
minimum_nights                                     0
maximum_nights                                

In [105]:
nominal_features.head()

Unnamed: 0,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,interaction,...,amenities,calendar_updated,calendar_last_scraped,first_review,last_review,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification
0,2020-05-30,Butterfly Inn - Graceful Living !,Lovely garden setting in a serene and art-fill...,Very comfortable Queen bed and small desk in b...,Lovely garden setting in a serene and art-fill...,"Located in Professorville, Palo Alto, one of t...","Housekeeping every Monday, leave your door ope...",Walking distance to Stanford University (30 mi...,"Kitchen, laundry, family/TV room, garden, free...","Welcome and Orientation. Shared kitchen, garde...",...,"{TV,""Cable TV"",Internet,Wifi,Kitchen,Heating,W...",2 months ago,2020-05-30,2009-08-02,2019-01-12,"{""PALO ALTO""}",f,moderate,f,f
1,2020-05-31,Deluxe Private Studio-custom int.,Custom built Studio with exquisite design. Rea...,Description A favorite for international corpo...,Custom built Studio with exquisite design. Rea...,10 min. to Lucky's and Starbucks at El Camino ...,Pet Policy: - Well Behaved pet up to 25 lb. of...,Public transportation at Homestead and Pomeroy...,Complimentary Wifi-internet + Basic Cable,Available any time for support and problem res...,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",5 months ago,2020-05-31,2014-06-15,2019-10-11,"{""SANTA CLARA""}",f,moderate,f,t
2,2020-05-31,Silicon Valley Suite,"A guest suite for one or two, in a house in a ...",This is a private suite at the rear of a house...,"A guest suite for one or two, in a house in a ...","This is a very quiet family neighborhood, but ...","I cannot accommodate cats, sorry. I can usuall...","The CalTrain station is in walking distance, a...",Private outdoor patio. Shared washer/dryer on ...,I am usually in town. Guests have a private e...,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",4 weeks ago,2020-05-31,2018-11-20,2019-04-27,"{""Mountain View"","" CA""}",f,strict_14_with_grace_period,f,f
3,2020-05-30,Bonsai Garden Inn in Professorville,Room in gracious home with beautiful garden. ...,"Bright, garden-facing room in beautiful home. ...",Room in gracious home with beautiful garden. ...,This room is in an ultra convenient location i...,The family room has a flat panel tv and desk f...,The Stanford Shopping Center is 20 minutes wal...,"Kitchen, Laundry, Garden, Family Room with TV,...",We will see you everyday as we cook meals in t...,...,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Paid par...",3 weeks ago,2020-05-30,2010-05-28,2020-03-14,"{""PALO ALTO""}",f,moderate,f,f
4,2020-05-31,Private room - Parking 3 carport,We live in a safe community close to public tr...,I have a really nice room in a quiet neighborh...,We live in a safe community close to public tr...,Our community is a safe environment and at nig...,Please remember that you are in my home and re...,We have a bus stop right outside our community...,The kitchen and all that it offers. The laund...,We follow our guests lead when it comes to soc...,...,"{Internet,Wifi,""Air conditioning"",Kitchen,""Fre...",23 months ago,2020-05-31,2010-08-01,2017-03-14,"{""SAN JOSE""}",f,moderate,t,t


In [106]:
nominal_features.columns

Index(['last_scraped', 'name', 'summary', 'space', 'description',
       'neighborhood_overview', 'notes', 'transit', 'access', 'interaction',
       'house_rules', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_neighbourhood', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed', 'city', 'market',
       'smart_location', 'is_location_exact', 'property_type', 'room_type',
       'bed_type', 'amenities', 'calendar_updated', 'calendar_last_scraped',
       'first_review', 'last_review', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification'],
      dtype='object')

In [107]:
(nominal_features['last_scraped'] == nominal_features['calendar_last_scraped']).value_counts()

True    7221
dtype: int64

In [108]:
nominal_features.isna().sum()

last_scraped                           0
name                                   1
summary                              299
space                               1707
description                          238
neighborhood_overview               2677
notes                               3529
transit                             3085
access                              3117
interaction                         2968
house_rules                         2264
host_name                              1
host_since                             1
host_location                         23
host_about                          2334
host_response_time                  1807
host_response_rate                  1807
host_acceptance_rate                 714
host_is_superhost                      1
host_neighbourhood                  1181
host_verifications                     0
host_has_profile_pic                   1
host_identity_verified                 1
street                                 0
neighbourhood   

In [109]:
data['room_type'].unique()

array(['Private room', 'Entire home/apt', 'Shared room', 'Hotel room'],
      dtype=object)

In [110]:
accomo_type = data['accommodates'].unique()
accomo_type.sort()
accomo_type

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])

In [111]:
data['bed_type'].unique()

array(['Real Bed', 'Futon', 'Pull-out Sofa', 'Airbed', 'Couch'],
      dtype=object)

In [112]:
data['jurisdiction_names'].unique()

array(['{"PALO ALTO"}', '{"SANTA CLARA"}', '{"Mountain View"," CA"}',
       '{"SAN JOSE"}', '{Cupertino," CA"}', '{Sunnyvale," CA"}', nan,
       '{Milpitas," CA"}', '{"Morgan Hill"," CA"}', '{"Los Gatos"," CA"}',
       '{"San Benito County"," CA"}'], dtype=object)

In [113]:
data['jurisdiction_names'].isna().sum()

979

In [114]:
data['cancellation_policy'].unique()

array(['moderate', 'strict_14_with_grace_period', 'flexible',
       'super_strict_60'], dtype=object)

### Clean the city column

In [115]:
data['city'].unique()

array(['Palo Alto', 'Santa Clara', 'Mountain View', 'San Jose',
       'Cupertino', 'Sunnyvale', 'Campbell', 'Milpitas', 'Saratoga',
       'Morgan Hill', 'Los Altos', 'Los Gatos', 'Menlo Park',
       'Los Altos Hills', 'Monte Sereno', 'Hollister', 'Gilroy',
       'San Jose ', 'San Martin', 'Stanford', 'Santa Clara County',
       'Sunnyvale ', 'Palo Alto ', 'Milpitas ', 'Los Gatos ', 'san jose',
       'Mountain View ', 'Danville ', nan, 'Campbell ', '洛斯阿尔托斯',
       'santa clara', 'Fremont', '圣何塞', 'Watsonville'], dtype=object)

In [175]:
data['city'] = data['city'].str.replace('圣何塞', 'San Jose')
data['city'] = data['city'].str.replace('san jose', 'San Jose')
data['city'] = data['city'].str.replace('洛斯阿尔托斯', 'Los Altos')
data['city'] = data['city'].str.replace('santa clara', 'Santa Clara')
data['city'] = data['city'].str.replace('Santa Clara County','Santa Clara')
data['city'] = data['city'].str.strip()  # assign the result; str.strip() returns a new Series

In [203]:
data['city'].unique()

array(['Palo Alto', 'Santa Clara', 'Mountain View', 'San Jose',
       'Cupertino', 'Sunnyvale', 'Campbell', 'Milpitas', 'Saratoga',
       'Morgan Hill', 'Los Altos', 'Los Gatos', 'Menlo Park',
       'Los Altos Hills', 'Monte Sereno', 'Hollister', 'Gilroy',
       'San Martin', 'Stanford', 'Danville', nan, 'Fremont',
       'Watsonville'], dtype=object)
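
The chain of `str.replace` calls can also be written as a single lookup: strip whitespace, then map known variants to one canonical spelling. This is only a sketch on a few of the values seen above; `CITY_ALIASES` and `normalize_city` are hypothetical names, and the title-case fallback is an assumption that happens to suit these city names.

```python
import pandas as pd

# Variants observed in the `city` column mapped to a canonical spelling.
# Keys are lower-cased so the lookup is case-insensitive.
CITY_ALIASES = {
    "圣何塞": "San Jose",
    "洛斯阿尔托斯": "Los Altos",
    "santa clara county": "Santa Clara",
}

def normalize_city(series: pd.Series) -> pd.Series:
    stripped = series.str.strip()
    lowered = stripped.str.lower()
    # Use the alias when one exists, otherwise fall back to title case.
    return lowered.map(CITY_ALIASES).fillna(stripped.str.title())

sample = pd.Series(["San Jose ", "san jose", "圣何塞", "Santa Clara County", None])
cleaned = normalize_city(sample)
print(cleaned.tolist())
```

New variants then only need a new dictionary entry rather than another `replace` line.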

In [177]:
# find binary columns:
for col in data.columns:
    if data[col].nunique() == 2:
        print(col)

last_scraped
host_is_superhost
host_has_profile_pic
host_identity_verified
is_location_exact
calendar_last_scraped
instant_bookable
require_guest_profile_picture
require_guest_phone_verification


In [178]:
for col in data.columns:
    if data[col].nunique() == 3:
        print(col)

In [179]:
for col in data.columns:
    if data[col].nunique() == 4:
        print(col)

host_response_time
market
room_type
cancellation_policy


In [180]:
for col in data.columns:
    if data[col].nunique() == 5:
        print(col)

bed_type


In [181]:
for col, n in zip(data.columns, data.nunique()):
    if n <10 and n > 1:
        print(col,n)

last_scraped 2
host_response_time 4
host_is_superhost 2
host_has_profile_pic 2
host_identity_verified 2
market 4
is_location_exact 2
room_type 4
bed_type 5
calendar_last_scraped 2
review_scores_accuracy 8
review_scores_cleanliness 8
review_scores_checkin 8
review_scores_communication 8
review_scores_location 7
review_scores_value 8
instant_bookable 2
cancellation_policy 4
require_guest_profile_picture 2
require_guest_phone_verification 2
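
Several of the two-valued columns listed above store flags as the strings `'t'` and `'f'` (for example `instant_bookable` in the `head()` output). A small sketch of converting such columns to booleans, on a made-up frame; that `host_is_superhost` uses the same `'t'`/`'f'` encoding is an assumption here.

```python
import pandas as pd

toy = pd.DataFrame({
    "instant_bookable": ["f", "t", "f"],
    "host_is_superhost": ["t", None, "f"],  # assumed to be 't'/'f' with gaps
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
})

# Treat a column as a flag if its non-null values are a subset of {'t', 'f'}.
flag_cols = [
    col for col in toy.columns
    if set(toy[col].dropna().unique()) <= {"t", "f"}
]

for col in flag_cols:
    # The nullable "boolean" dtype keeps missing entries as <NA>.
    toy[col] = toy[col].map({"t": True, "f": False}).astype("boolean")

print(flag_cols)
```

Columns such as `last_scraped` also have two distinct values but hold dates, which is why the sketch checks the actual values rather than only the count.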


## What we have so far

Data cleaning

  - `id` or `host_id`: which one to keep?
  - How to deal with missing prices? Fill NaN with zero?
  - Can we delete one column from each of these pairs?
    - (data['maximum_maximum_nights'] ==  data['minimum_maximum_nights']).value_counts()
    - (data['host_listings_count'] == data['host_total_listings_count']).value_counts()
    - (nominal_features['last_scraped'] == nominal_features['calendar_last_scraped']).value_counts()

---
- The neighbourhoods in `neighbourhoods.csv` do not match the `neighbourhood` values in the listings data
  - Subset the data by region?
- Location: clean later
  - ('latitude', 5882)
  - ('longitude', 6357)
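
The three pairwise comparisons listed above can be wrapped in one helper that also treats two missing values as a match. In the earlier output, `host_listings_count` and `host_total_listings_count` disagree in exactly one row and each has exactly one missing value; since `NaN == NaN` evaluates to `False`, that single mismatch may simply be the missing row. The sketch below uses made-up values, and `agreement_rate` is a hypothetical helper name.

```python
import pandas as pd

def agreement_rate(df: pd.DataFrame, left: str, right: str) -> float:
    """Share of rows where two columns agree, counting NaN/NaN as agreement."""
    a, b = df[left], df[right]
    same = (a == b) | (a.isna() & b.isna())
    return float(same.mean())

# Toy values only; the real comparison would be run on `data`.
toy = pd.DataFrame({
    "host_listings_count": [5.0, 13.0, None, 2.0],
    "host_total_listings_count": [5.0, 13.0, None, 2.0],
    "minimum_maximum_nights": [30, 1125, 365, 90],
    "maximum_maximum_nights": [30, 1125, 1125, 90],
})

print(agreement_rate(toy, "host_listings_count", "host_total_listings_count"))
print(agreement_rate(toy, "minimum_maximum_nights", "maximum_maximum_nights"))
```

A rate of 1.0 means one column of the pair is redundant; anything lower means dropping a column discards some information, which is a judgment call.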

 

In [206]:
data.drop(
    ['minimum_maximum_nights', 'host_total_listings_count', 'calendar_last_scraped'],
    axis=1,
    inplace=True,
)

### Amenities keyword frequency analysis

In [182]:
from collections import Counter

results = Counter()

In [183]:
amenities = data['amenities']

In [184]:
# Strip the surrounding braces and the quotes, split on commas,
# and feed each list of amenities into the Counter.
amenities.str.strip('{}')\
         .str.replace('"', '', regex=False)\
         .str.split(',')\
         .apply(results.update)

0       None
1       None
2       None
3       None
4       None
        ... 
7216    None
7217    None
7218    None
7219    None
7220    None
Name: amenities, Length: 7221, dtype: object

In [185]:
# results

In [186]:
sub_df = pd.DataFrame(results.most_common(50), columns=['amenities', 'count'])

In [187]:
sub_df['ratio'] = round(sub_df['count'] / len(amenities),4)

In [188]:
sub_df.sort_values(by=['count'], ascending=False)

| | amenities | count | ratio |
|---|---|---|---|
| 0 | Wifi | 7071 | 0.9792 |
| 1 | Essentials | 6904 | 0.9561 |
| 2 | Heating | 6824 | 0.945 |
| 3 | Smoke alarm | 6794 | 0.9409 |
| 4 | Hangers | 6480 | 0.8974 |
| 5 | Shampoo | 6271 | 0.8684 |
| 6 | Carbon monoxide alarm | 6178 | 0.8556 |
| 7 | Laptop-friendly workspace | 5998 | 0.8306 |
| 8 | Hair dryer | 5997 | 0.8305 |
| 9 | Kitchen | 5890 | 0.8157 |
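
For a price model, the amenity strings are more useful as one 0/1 column per amenity than as a frequency table. A minimal sketch using `Series.str.get_dummies` on made-up rows in the same `{...}` format; it assumes no amenity name contains a comma, which would split it into two columns.

```python
import pandas as pd

# Made-up rows in the same format as the `amenities` column.
amenities = pd.Series([
    '{TV,"Cable TV",Wifi,Kitchen}',
    '{Wifi,"Air conditioning",Kitchen}',
    '{}',
])

# Normalize to a plain comma-separated string, then expand into indicator columns.
cleaned = (
    amenities.str.strip("{}")
             .str.replace('"', "", regex=False)
)
indicators = cleaned.str.get_dummies(sep=",")

print(indicators)
```

On the real data it would probably make sense to keep only amenities that are neither near-universal (like Wifi) nor very rare, since both carry little signal.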


## Joining with reviews data

In [189]:
reviews = pd.read_csv('./data/reviews.csv')
print(reviews.shape)
reviews.head()

(212513, 2)


| | listing_id | date |
|---|---|---|
| 0 | 4952 | 2009-08-02 |
| 1 | 4952 | 2009-09-04 |
| 2 | 4952 | 2009-10-16 |
| 3 | 4952 | 2009-12-10 |
| 4 | 4952 | 2010-06-08 |


In [190]:
rev_row = reviews['listing_id'].unique()

In [191]:
data_rev = data.loc[data['id'].isin(rev_row)]
data_rev.head()

Unnamed: 0,id,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,...,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,4952,2020-05-30,Butterfly Inn - Graceful Living !,Lovely garden setting in a serene and art-fill...,Very comfortable Queen bed and small desk in b...,Lovely garden setting in a serene and art-fill...,"Located in Professorville, Palo Alto, one of t...","Housekeeping every Monday, leave your door ope...",Walking distance to Stanford University (30 mi...,"Kitchen, laundry, family/TV room, garden, free...",...,"{""PALO ALTO""}",f,moderate,f,f,5,0,5,0,0.57
1,11464,2020-05-31,Deluxe Private Studio-custom int.,Custom built Studio with exquisite design. Rea...,Description A favorite for international corpo...,Custom built Studio with exquisite design. Rea...,10 min. to Lucky's and Starbucks at El Camino ...,Pet Policy: - Well Behaved pet up to 25 lb. of...,Public transportation at Homestead and Pomeroy...,Complimentary Wifi-internet + Basic Cable,...,"{""SANTA CLARA""}",f,moderate,f,t,12,12,0,0,0.08
2,17884,2020-05-31,Silicon Valley Suite,"A guest suite for one or two, in a house in a ...",This is a private suite at the rear of a house...,"A guest suite for one or two, in a house in a ...","This is a very quiet family neighborhood, but ...","I cannot accommodate cats, sorry. I can usuall...","The CalTrain station is in walking distance, a...",Private outdoor patio. Shared washer/dryer on ...,...,"{""Mountain View"","" CA""}",f,strict_14_with_grace_period,f,f,2,2,0,0,0.11
3,21373,2020-05-30,Bonsai Garden Inn in Professorville,Room in gracious home with beautiful garden. ...,"Bright, garden-facing room in beautiful home. ...",Room in gracious home with beautiful garden. ...,This room is in an ultra convenient location i...,The family room has a flat panel tv and desk f...,The Stanford Shopping Center is 20 minutes wal...,"Kitchen, Laundry, Garden, Family Room with TV,...",...,"{""PALO ALTO""}",f,moderate,f,f,5,0,5,0,2.17
4,37512,2020-05-31,Private room - Parking 3 carport,We live in a safe community close to public tr...,I have a really nice room in a quiet neighborh...,We live in a safe community close to public tr...,Our community is a safe environment and at nig...,Please remember that you are in my home and re...,We have a bus stop right outside our community...,The kitchen and all that it offers. The laund...,...,"{""SAN JOSE""}",f,moderate,t,t,2,0,2,0,1.53


In [192]:
data_rev.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5734 entries, 0 to 7214
Data columns (total 86 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            5734 non-null   int64  
 1   last_scraped                                  5734 non-null   object 
 2   name                                          5734 non-null   object 
 3   summary                                       5563 non-null   object 
 4   space                                         4571 non-null   object 
 5   description                                   5606 non-null   object 
 6   neighborhood_overview                         3884 non-null   object 
 7   notes                                         3191 non-null   object 
 8   transit                                       3618 non-null   object 
 9   access                                        3642 non-null   o

In [193]:
data_rev.describe(include='all')

Unnamed: 0,id,last_scraped,name,summary,space,description,neighborhood_overview,notes,transit,access,...,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,5734.0,5734,5734,5563,4571,5606,3884,3191,3618,3642,...,5006,5734,5734,5734,5734,5734.0,5734.0,5734.0,5734.0,5707.0
unique,,2,5615,4920,3791,5224,3066,2479,2834,3047,...,10,2,4,2,2,,,,,
top,,2020-05-31,TRIBE ❤️ Deep Cleaned 2BR @ Great Mall Milpitas,The HackerHome Network New to the Valley? Tran...,"At Zeus, we focus exclusively on 30 day+ stays...","Dorm-style house in Silicon Valley. No-host, 5...",*Area Highlights* -Located in the Heart of the...,Parking is not included with your stay. Parkin...,This Flat Location is Uber/Lyft Accessible & C...,"• All utilities are included, including gas, e...",...,"{""SAN JOSE""}",f,moderate,f,f,,,,,
freq,,3778,21,44,242,20,28,41,59,26,...,2257,3262,2002,5532,5547,,,,,
mean,24618250.0,,,,,,,,,,...,,,,,,26.45361,22.485699,2.877224,1.058598,1.359872
std,12096880.0,,,,,,,,,,...,,,,,,83.43773,83.821405,6.912066,5.849206,1.535631
min,4952.0,,,,,,,,,,...,,,,,,1.0,0.0,0.0,0.0,0.01
25%,14890020.0,,,,,,,,,,...,,,,,,1.0,0.0,0.0,0.0,0.32
50%,25982720.0,,,,,,,,,,...,,,,,,3.0,1.0,1.0,0.0,0.83
75%,35007040.0,,,,,,,,,,...,,,,,,8.0,2.0,3.0,0.0,1.81


In [194]:
counts = reviews['listing_id'].value_counts()


In [195]:
counts.head()

7476637     489
52786       478
10814836    445
19641513    429
13828514    427
Name: listing_id, dtype: int64
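
Filtering with `isin` keeps only listings that have reviews, but the per-listing counts are not yet attached to the listings. One possible way to join them is sketched below on toy frames; the prices and the listing id `99999` are made up, and the column names `review_count`, `first_review` and `last_review` in `review_stats` are illustrative choices.

```python
import pandas as pd

# Toy stand-ins for `data` and `reviews`.
listings = pd.DataFrame({"id": [4952, 11464, 99999], "price": [90.0, 85.0, 120.0]})
reviews = pd.DataFrame({
    "listing_id": [4952, 4952, 4952, 11464],
    "date": ["2009-08-02", "2009-09-04", "2009-10-16", "2014-06-15"],
})

# One row per listing: number of reviews plus first and last review date.
review_stats = (
    reviews.assign(date=pd.to_datetime(reviews["date"]))
           .groupby("listing_id")["date"]
           .agg(review_count="count", first_review="min", last_review="max")
           .reset_index()
)

# A left join keeps listings without reviews; their count becomes 0.
merged = listings.merge(review_stats, how="left", left_on="id", right_on="listing_id")
merged["review_count"] = merged["review_count"].fillna(0).astype(int)

print(merged[["id", "review_count"]])
```

Keeping the review dates as datetimes also prepares the ground for the seasonality question, since reviews can serve as a rough proxy for bookings over time.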

## Joining with neighborhood data

In [196]:
neighbors = pd.read_csv('./data/neighbourhoods.csv')
print(neighbors.shape)
neighbors.head()

(16, 2)


| | neighbourhood_group | neighbourhood |
|---|---|---|
| 0 | NaN | Campbell |
| 1 | NaN | Cupertino |
| 2 | NaN | Gilroy |
| 3 | NaN | Los Altos |
| 4 | NaN | Los Altos Hills |


In [197]:
data_neighbor = data['neighbourhood'].unique().tolist()
neighbors_hood = neighbors['neighbourhood'].unique().tolist()

In [198]:
same_locations = []

for neighbor in neighbors_hood:
    if neighbor in data_neighbor:
        same_locations.append(neighbor)

In [199]:
same_locations

['Campbell',
 'Cupertino',
 'Los Altos',
 'Los Altos Hills',
 'Mountain View',
 'Palo Alto',
 'Santa Clara',
 'Sunnyvale']

In [200]:
data_neighbor

['Palo Alto',
 'Santa Clara',
 'Mountain View',
 'South San Jose',
 'Cupertino',
 'Sunnyvale',
 'Downtown',
 'Campbell',
 'West Valley',
 'Edenvale',
 nan,
 'Willow Glen',
 'Central San Jose',
 'Los Altos',
 'Berryessa',
 'Cory',
 'College Park',
 'Alum Rock',
 'Cambrian/Pioneer',
 'North San Jose',
 'Los Altos Hills',
 'Burbank/Del Monte',
 'Evergreen',
 'Newhall/Sherwood',
 'Shasta/Hanchett Park',
 'Rose Garden',
 'Five Wounds/Brookwood Terrace',
 'Naglee Park',
 'Forest/Pruneridge',
 'Alviso',
 'Japantown',
 'Stanford',
 'Delmas Park',
 'Chapman/Morse',
 'Menlo Park',
 'Autumn/Montgomery',
 'Vermont/McKendrie']

In [201]:
neighbors_hood

['Campbell',
 'Cupertino',
 'Gilroy',
 'Los Altos',
 'Los Altos Hills',
 'Los Gatos',
 'Milpitas',
 'Monte Sereno',
 'Morgan Hill',
 'Mountain View',
 'Palo Alto',
 'San Jose',
 'Santa Clara',
 'Saratoga',
 'Sunnyvale',
 'Unincorporated Areas']
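
The loop above collects only the names present in both lists. Set operations give the overlap and both differences in one step, which makes the mismatch easier to inspect. The sketch uses a handful of values copied from the two outputs above rather than the full lists.

```python
# A few values copied from the two outputs above (abbreviated).
listing_neighbourhoods = {"Palo Alto", "Santa Clara", "South San Jose", "Downtown", "Willow Glen"}
reference_neighbourhoods = {"Palo Alto", "Santa Clara", "San Jose", "Gilroy"}

in_both = sorted(listing_neighbourhoods & reference_neighbourhoods)
only_in_listings = sorted(listing_neighbourhoods - reference_neighbourhoods)
only_in_reference = sorted(reference_neighbourhoods - listing_neighbourhoods)

print(in_both)
print(only_in_listings)
print(only_in_reference)
```

The `neighbourhood` column appears to mix city names with San Jose districts. Since `neighbourhood_cleansed` has 16 distinct values and `neighbourhoods.csv` has 16 rows, it may be worth checking whether that column is the one that lines up with the reference file.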


## Ideas list

### Methodology for this large-scale data?


### Which columns do we need?

- e.g.
  - `id`, `name`, `host_id`, `host_name`
  - `neighbourhood_group`, `neighbourhood`
  - `latitude`, `longitude`
  - `room_type`
  - `price`
  - `minimum_nights`, `number_of_reviews`
  - `reviews_per_month`, `calculated_host_listings_count`, `availability_365`

### What to compare

- Room Type vs Price
- Price/log price Distribution Plot
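
Both comparisons can be prototyped numerically before any plotting. The sketch below uses made-up prices; it summarizes price by room type and adds a log-transformed price column, using `np.log1p` so that a price of 0 stays defined.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only.
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Shared room"],
    "price": [60.0, 80.0, 150.0, 250.0, 30.0],
})

# log1p compresses the long right tail and is defined for a price of 0.
toy["log_price"] = np.log1p(toy["price"])

# Median is less sensitive than the mean to a few very expensive listings.
summary = (
    toy.groupby("room_type")["price"]
       .agg(["count", "median", "mean"])
       .sort_values("median", ascending=False)
)

print(summary)
```

With the real data, the same columns could feed something like `sns.boxplot(x='room_type', y='price', data=data)` and a histogram of `np.log1p(data['price'])`.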


### Pipeline?



## Reference code

- [Amenities reference](https://www.kaggle.com/brittabettendorf/predicting-prices-xgboost-feature-engineering)
- [Word cloud / weighted words / sentiment analysis / topic modeling](https://www.kaggle.com/brittabettendorf/nlp-on-airbnb-data)


### Next week

Modeling for price prediction


