In [1]:
# Import libraries
import pandas as pd

## Reviews.csv

In [2]:
# Read in Reviews dataset
reviews = pd.read_csv('/Users/jessieowens2/Desktop/general_assembly/airbnb_data/reviews.csv')
# Check shape for total number of reviews
reviews.shape

(358268, 6)

In [3]:
# Look at reviews data
reviews.head()

,listing_id,id,date,reviewer_id,reviewer_name,comments
0,3344,2185,2009-05-09,12016,Tony,The stay at Amos' condo greatly exceeded my ex...
1,3344,18774,2009-11-29,40724,Faris,"What can I say? AJ picked me up from Dulles, ..."
2,3344,20550,2009-12-16,58506,Sean,Amos is a phenomenal host. Where to start? Fir...
3,3344,293978,2011-06-01,583926,Yewwee,Aj is a great and friendly host! Excellent loc...
4,3344,296775,2011-06-04,503189,Jonathan,"As a first-time airbnb.com user, I am glad Amo..."


In [38]:
# Look at info stats
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358268 entries, 0 to 358267
Data columns (total 6 columns):
listing_id       358268 non-null int64
id               358268 non-null int64
date             358268 non-null object
reviewer_id      358268 non-null int64
reviewer_name    358268 non-null object
comments         358090 non-null object
dtypes: int64(3), object(3)
memory usage: 16.4+ MB


Are there any listings that have never gotten reviews? Or listings that only got bad reviews? 

What can be analyzed about the actual review content? Classify certain words as positive? Or negative? 

Should reviewer_name be removed, keeping only reviewer_id, for ethical reasons? Then again, the names attached to reviews are public on the site.
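For the question above about classifying words as positive or negative, one rough starting point is a simple word-list count. This is only a sketch: the word lists `POSITIVE_WORDS` and `NEGATIVE_WORDS`, the helper `sentiment_score`, and the sample rows are all made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical word lists, chosen only for illustration
POSITIVE_WORDS = {"great", "clean", "friendly", "phenomenal", "lovely"}
NEGATIVE_WORDS = {"dirty", "canceled", "noisy", "rude"}

def sentiment_score(comment):
    """Return (# positive words) - (# negative words) for one comment."""
    if not isinstance(comment, str):  # missing comments are not strings
        return 0
    # Lowercase, split on whitespace, and strip surrounding punctuation
    # so that "friendly!" still matches "friendly"
    words = [w.strip(".,!?;:\"'()") for w in comment.lower().split()]
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    return positive - negative

# Toy stand-in for the reviews DataFrame (made-up rows)
sample_reviews = pd.DataFrame({
    "comments": [
        "Great host, very clean and friendly!",
        "The host canceled and the room was dirty.",
        None,
    ]
})
sample_reviews["sentiment_score"] = sample_reviews["comments"].apply(sentiment_score)
print(sample_reviews)
```

A count like this ignores negation ("not clean") and context, so it is probably only useful as a first pass before trying a proper sentiment model.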

In [6]:
# Check for null values
reviews.isnull().sum()

listing_id         0
id                 0
date               0
reviewer_id        0
reviewer_name      0
comments         178
dtype: int64

Drop rows without any comments
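A minimal sketch of that step, using a made-up stand-in for the reviews DataFrame so the cell runs on its own. On the real data the same call would be applied to `reviews`, which should leave 358,090 rows given the 178 missing comments counted above.

```python
import pandas as pd

# Toy stand-in for the reviews DataFrame (made-up rows)
sample_reviews = pd.DataFrame({
    "listing_id": [3344, 3344, 3362],
    "comments": ["Great stay", None, "Lovely place"],
})

# Keep only rows where 'comments' is present, then renumber the index
reviews_with_comments = (
    sample_reviews.dropna(subset=["comments"]).reset_index(drop=True)
)
print(reviews_with_comments.shape)
```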

In [4]:
# Looking at how many reviews are listed for a single date
reviews[reviews['date'] == '2019-09-22']

,listing_id,id,date,reviewer_id,reviewer_name,comments
6664,223922,534301264,2019-09-22,286508382,Kierrha,Natasha was very kind and welcoming. The locat...
6716,262629,534389351,2019-09-22,27311645,Tsitsi,Hellen was an exceptional host. She was nice a...
24339,1245292,534398015,2019-09-22,155838693,Kristi,I enjoyed my stay so much! This family goes th...
53275,3729418,534326069,2019-09-22,41107304,Joshua,"Had a fantastic, stress-free stay in Sally’s b..."
75242,5495901,534414090,2019-09-22,6030437,Heidi,What a treat! Convenient location on Capitol H...
...,...,...,...,...,...,...
358028,37676860,534611186,2019-09-22,13781772,(Email hidden by Airbnb),The host canceled this reservation 18 days bef...
358114,37841043,534322275,2019-09-22,21978461,Liz,This is a great space and was perfect for my d...
358117,37841107,534362566,2019-09-22,130834330,Joe,A lovely experience. Glenn’s apartment is very...
358248,38409709,534367786,2019-09-22,206960670,Romond,Had to cut my visit short but the place was cl...


In [7]:
# Sort values by date
reviews['date'].sort_values()

368       2009-01-21
411       2009-01-21
11        2009-01-21
412       2009-01-22
413       2009-03-26
             ...    
354569    2019-09-22
358028    2019-09-22
355995    2019-09-22
247848    2019-09-22
291733    2019-09-22
Name: date, Length: 358268, dtype: object

In [8]:
# Look at number of reviews per date
reviews['date'].value_counts()

2019-04-07    1044
2019-04-28     890
2019-04-14     878
2019-03-31     878
2019-04-21     844
              ... 
2012-01-29       1
2011-04-22       1
2011-05-13       1
2012-06-23       1
2009-10-30       1
Name: date, Length: 3314, dtype: int64

The five busiest dates are consecutive Sundays within a four-week span (2019-03-31 to 2019-04-28), and each has a high volume of reviews. Worth looking into how the volume of reviews changes over time: does it increase at a similar rate to the number of listings, or does it move with (or against) something else? Also do external research on what might explain the spike: does it coincide with the Cherry Blossom Festival or some other event?

Reviews go back to 2009, which may be around the time that Airbnb started listing places in DC. Worth further investigation.
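To look at how review volume changes over time, the `date` strings would first need to be parsed as datetimes. The sketch below uses a made-up handful of dates in the same format; the names `monthly_review_counts` and `reviews_per_weekday` are placeholders, not existing variables.

```python
import pandas as pd

# Toy stand-in for reviews['date'] (a few dates in the same string format)
sample_reviews = pd.DataFrame({
    "date": ["2019-03-31", "2019-04-07", "2019-04-07", "2019-04-28", "2019-09-22"],
})

# Parse the strings into datetimes so they can be grouped by calendar period
sample_reviews["date"] = pd.to_datetime(sample_reviews["date"])

# Count reviews per calendar month (months with no reviews are simply absent)
monthly_review_counts = sample_reviews.groupby(
    sample_reviews["date"].dt.to_period("M")
).size()

# Count reviews per day of week, to check whether peaks are a weekend effect
reviews_per_weekday = sample_reviews["date"].dt.day_name().value_counts()

print(monthly_review_counts)
print(reviews_per_weekday)
```

The monthly counts could then be compared against the number of active listings per month, if that can be derived from the listings data.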

## Listings.csv

In [9]:
# Read in Listings data
listings = pd.read_csv('/Users/jessieowens2/Desktop/general_assembly/airbnb_data/listings.csv')
# Check shape for total number of listings
listings.shape

(9189, 16)

In [10]:
# Look at Listings data
listings.head()

,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,3344,"White House/Center City, 1 roommate",4957,A.J.,,"Downtown, Chinatown, Penn Quarters, Mount Vern...",38.90126,-77.02857,Private room,58,90,11,2016-08-31,0.09,2,358
1,3362,"Convention Center Rowhouse & In Law: 2 Units, 4BR",2798,Ayeh,,"Shaw, Logan Circle",38.91046,-77.01933,Entire home/apt,433,2,171,2019-08-26,1.32,5,307
2,3662,Vita's Hideaway II,4645,Vita,,Historic Anacostia,38.86193,-76.98963,Private room,65,2,36,2019-04-14,0.35,3,296
3,3670,Beautiful Sun-Lit U Street 1BR/1BA,4630,Sheila,,"Howard University, Le Droit Park, Cardozo/Shaw",38.91842,-77.0275,Private room,75,2,79,2018-07-25,1.44,1,361
4,3686,Vita's Hideaway,4645,Vita,,Historic Anacostia,38.86314,-76.98836,Private room,55,2,71,2019-08-05,0.66,3,283


Do the neighbourhood values in listings line up to the 39 available in hoods CSV?

Should host name be removed? Their names are referenced in reviews. Same ethical concern as mentioned with reviews.csv

Room type is categorical - only 4 options. Use this for some type of classification? 

Manually calculate reviews per month. Does this match what the reviews_per_month column shows? How many months are used to calculate it?

Time since last review - does this relate to anything? 
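For the time-since-last-review question, a possible sketch is below. It assumes the reference date is 2019-09-22, the pull date mentioned in the calendar notes. The first two rows reuse values from the `head()` output, and the third row (id 9999) is a hypothetical listing with no reviews.

```python
import pandas as pd

# Date the data was pulled, per the note in the calendar section (assumed reference point)
SCRAPE_DATE = pd.Timestamp("2019-09-22")

# Toy stand-in for the listings DataFrame; id 9999 is a made-up unreviewed listing
sample_listings = pd.DataFrame({
    "id": [3344, 3362, 9999],
    "last_review": ["2016-08-31", "2019-08-26", None],
})

# Parse the date strings; missing values become NaT
last_review = pd.to_datetime(sample_listings["last_review"])

# Days between the last review and the pull date; unreviewed listings stay NaN
sample_listings["days_since_last_review"] = (SCRAPE_DATE - last_review).dt.days
print(sample_listings)
```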

In [39]:
# Look at describe stats
listings.describe()

,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,9189.0,9189.0,9189.0,9189.0,9189.0,9189.0,9189.0,7421.0,9189.0,9189.0
mean,19806410.0,60538870.0,38.911971,-77.017908,197.084558,5.57928,38.988791,2.010132,8.48656,127.113614
std,10493270.0,68497930.0,0.023371,0.02923,308.754577,19.49065,64.184896,2.12114,23.161285,131.208827
min,3344.0,1585.0,38.82037,-77.12128,10.0,1.0,0.0,0.01,1.0,0.0
25%,12513820.0,9419684.0,38.89953,-77.0373,80.0,1.0,1.0,0.33,1.0,0.0
50%,18766550.0,33313860.0,38.91121,-77.02025,118.0,2.0,11.0,1.21,1.0,84.0
75%,28541580.0,95459400.0,38.92478,-76.9974,195.0,3.0,50.0,3.11,4.0,254.0
max,38817660.0,296522800.0,38.99549,-76.90482,10000.0,600.0,769.0,12.97,152.0,365.0


Prices range from $10 to $10,000. It would be good to look at how they are distributed.

Also look at relationship between price and minimum number of nights 

Reviews per month max is around 13 - is this an outlier? 

Listings per host max is 152 - is this an outlier?

Rename host listings count column

Check for outliers
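One common way to check for outliers is the interquartile range (IQR) rule. The sketch below is an assumption about how that check could be done, not a decision already made for this project. The helper `iqr_outliers` and the sample prices are made up for illustration.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy stand-in for listings['price'] (made-up values)
sample_prices = pd.Series([55, 58, 65, 75, 80, 118, 195, 433, 10000])

outlier_mask = iqr_outliers(sample_prices)
print(sample_prices[outlier_mask])
```

Using the price quartiles reported by `describe()` above (80 and 195), the same rule with k=1.5 would put the upper fence at 195 + 1.5 * 115 = 367.5. The same helper could be applied to reviews_per_month and calculated_host_listings_count.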

In [None]:
# Look at info stats


In [11]:
# Check for null values
listings.isnull().sum()

id                                   0
name                                 3
host_id                              0
host_name                            1
neighbourhood_group               9189
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       1768
reviews_per_month                 1768
calculated_host_listings_count       0
availability_365                     0
dtype: int64

Some listings have no reviews (1,768 of 9,189, about 19%). How should those be handled? They still hold valuable data and make up a large chunk of the dataset.
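One option, sketched here on made-up rows, is to keep the unreviewed listings and flag them rather than drop them. The `has_reviews` column is a hypothetical name, and filling reviews_per_month with 0 is an assumption about how the missing values should be read.

```python
import pandas as pd

# Toy stand-in for the listings DataFrame (made-up rows)
sample_listings = pd.DataFrame({
    "id": [1, 2, 3],
    "number_of_reviews": [11, 0, 171],
    "reviews_per_month": [0.09, None, 1.32],
})

# Flag listings that have at least one review so the two groups can be compared
sample_listings["has_reviews"] = sample_listings["number_of_reviews"] > 0

# Assumption: a listing with zero reviews has zero reviews per month
sample_listings["reviews_per_month"] = sample_listings["reviews_per_month"].fillna(0)
print(sample_listings)
```

The last_review column would stay missing for these rows, since there is no sensible date to fill in.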

In [12]:
# All of the neighbourhood_group values are missing, so drop this column
listings.drop('neighbourhood_group', axis=1, inplace=True)

The name of the listing normally has some location information in it - does this correspond to their actual location? Is it even valuable to explore this?

What is the average price per neighborhood? 
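A quick way to answer the average price question is a groupby on neighbourhood. The rows below are a small sample based on the `head()` output, so the numbers are only illustrative. `price_by_hood` is a placeholder name.

```python
import pandas as pd

# Toy stand-in for the listings DataFrame (a few rows based on the head() output)
sample_listings = pd.DataFrame({
    "neighbourhood": ["Shaw, Logan Circle", "Historic Anacostia", "Historic Anacostia"],
    "price": [433, 65, 55],
})

# Mean, median, and listing count per neighbourhood, most expensive first
price_by_hood = (
    sample_listings.groupby("neighbourhood")["price"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(price_by_hood)
```

Including the median alongside the mean may help here, since the $10,000 maximum suggests the mean could be pulled up by a few extreme listings.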

In [13]:
# Sort on last review date to get an idea of how many listings have not been reviewed recently
listings['last_review'].sort_values()

5       2009-01-21
148     2013-03-31
107     2013-05-28
180     2013-06-03
219     2013-06-15
           ...    
9184           NaN
9185           NaN
9186           NaN
9187           NaN
9188           NaN
Name: last_review, Length: 9189, dtype: object

Some listings haven't been reviewed in many years. Were they bad reviews? 

## Neighbourhoods.csv

In [14]:
# Read in neighbourhood data
hoods = pd.read_csv('/Users/jessieowens2/Desktop/general_assembly/airbnb_data/neighbourhoods.csv')
# Check shape of hood data
hoods.shape

(39, 2)

In [16]:
# Look at data
hoods.head()

,neighbourhood_group,neighbourhood
0,,"Brightwood Park, Crestwood, Petworth"
1,,"Brookland, Brentwood, Langdon"
2,,"Capitol Hill, Lincoln Park"
3,,"Capitol View, Marshall Heights, Benning Heights"
4,,"Cathedral Heights, McLean Gardens, Glover Park"


In [None]:
# Look at info stats


In [15]:
# Check for null values
hoods.isnull().sum()

neighbourhood_group    39
neighbourhood           0
dtype: int64

The neighbourhood group is all missing so removing. 

In [17]:
# drop column since all of data is missing
hoods.drop('neighbourhood_group', axis=1, inplace=True)

Do something with mapping out the neighbourhoods? Map out with color code for price of neighbourhoods
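An earlier question in the listings notes was whether the neighbourhood values in listings line up with the 39 in hoods. A set comparison is one way to check. The sketch uses made-up subsets of both tables, and `unknown_in_listings` and `hoods_without_listings` are placeholder names.

```python
import pandas as pd

# Toy stand-ins for the listings and hoods DataFrames (made-up subsets)
sample_listings = pd.DataFrame({
    "neighbourhood": ["Shaw, Logan Circle", "Historic Anacostia", "Historic Anacostia"],
})
sample_hoods = pd.DataFrame({
    "neighbourhood": ["Shaw, Logan Circle", "Historic Anacostia", "Capitol Hill, Lincoln Park"],
})

listing_hoods = set(sample_listings["neighbourhood"].unique())
known_hoods = set(sample_hoods["neighbourhood"].unique())

# Values used in listings that are not defined in the neighbourhoods file
unknown_in_listings = listing_hoods - known_hoods

# Neighbourhoods that have no listings at all
hoods_without_listings = known_hoods - listing_hoods

print(unknown_in_listings)
print(hoods_without_listings)
```

If both sets come back empty on the real data, the neighbourhood column can be used as a join key for any mapping work.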

## Calendar.csv

In [18]:
# Read in calendar data
calendar = pd.read_csv('/Users/jessieowens2/Desktop/general_assembly/airbnb_data/calendar.csv')
# Look at shape of calendar data
calendar.shape

(3354070, 7)

In [19]:
# Look at calendar data
calendar.head()

,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,3344,2019-09-22,f,$58.00,$58.00,90.0,720.0
1,3344,2019-09-23,f,$58.00,$58.00,90.0,720.0
2,3344,2019-09-24,f,$58.00,$58.00,90.0,720.0
3,3344,2019-09-25,f,$58.00,$58.00,90.0,720.0
4,3344,2019-09-26,f,$58.00,$58.00,90.0,720.0


In [None]:
# Look at describe stats


In [None]:
# Look at info stats

In [36]:
# Check for null values
calendar.isnull().sum()

listing_id        0
date              0
available         0
price             0
adjusted_price    0
minimum_nights    3
maximum_nights    3
price_diff        0
dtype: int64

Very small amount of data missing but will check out individually to see if it's for the same 3 rows. 
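A sketch of that check, on made-up rows: compare the two null masks to see whether the missing values fall on the same rows. `same_rows` and `rows_missing_either` are placeholder names.

```python
import pandas as pd

# Toy stand-in for the calendar DataFrame (made-up rows)
sample_calendar = pd.DataFrame({
    "listing_id": [3344, 3344, 3362, 3362],
    "minimum_nights": [90.0, None, 2.0, None],
    "maximum_nights": [720.0, None, 30.0, None],
})

missing_min = sample_calendar["minimum_nights"].isnull()
missing_max = sample_calendar["maximum_nights"].isnull()

# True when both columns are missing on exactly the same rows
same_rows = (missing_min == missing_max).all()

# Pull out the affected rows to inspect them directly
rows_missing_either = sample_calendar[missing_min | missing_max]

print(same_rows)
print(rows_missing_either)
```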

In [20]:
# Check data types
calendar.dtypes

listing_id          int64
date               object
available          object
price              object
adjusted_price     object
minimum_nights    float64
maximum_nights    float64
dtype: object

In [21]:
# Sort on date column
calendar.sort_values(by='date', ascending=False)
# this data was pulled on 9/22/19 so only represents future listings at time of data being pulled

,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
2014161,22201266,2020-10-09,f,$107.00,$107.00,2.0,7.0
2014160,22201266,2020-10-08,f,$106.00,$106.00,2.0,7.0
2014159,22201266,2020-10-07,f,$106.00,$106.00,2.0,7.0
2014158,22201266,2020-10-06,f,$106.00,$106.00,2.0,7.0
2014157,22201266,2020-10-05,f,$106.00,$106.00,2.0,7.0
...,...,...,...,...,...,...,...
664825,9750676,2019-09-22,f,$350.00,$350.00,1.0,7.0
3000794,34888096,2019-09-22,t,$111.00,$111.00,30.0,1125.0
2799665,32353389,2019-09-22,f,$24.00,$24.00,1.0,14.0
1431341,16671355,2019-09-22,t,$199.00,$199.00,2.0,1125.0


This only represents future data so won't be able to build time series off of just this.

In [22]:
# Strip dollar signs from price column (regex=False so '$' is treated as a literal character)
calendar['price'] = calendar['price'].str.replace('$', '', regex=False)

In [23]:
# Strip commas from price column
calendar['price'] = calendar['price'].str.replace(',', '')

In [25]:
# Convert price column to numeric
calendar['price'] = pd.to_numeric(calendar['price'])

In [26]:
# Check that data type has changed
calendar.dtypes

listing_id          int64
date               object
available          object
price             float64
adjusted_price     object
minimum_nights    float64
maximum_nights    float64
dtype: object

In [27]:
# Strip dollar signs from adjusted price column (regex=False so '$' is treated as a literal character)
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace('$', '', regex=False)

In [28]:
# Strip commas from adjusted price column
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace(',', '')

In [29]:
# Convert adjusted price column to numeric
calendar['adjusted_price'] = pd.to_numeric(calendar['adjusted_price'])
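The same three cleaning steps are applied to both price columns, so they could be wrapped in one helper. This is a sketch on made-up values. `currency_to_float` is a hypothetical function name, and it assumes every value looks like `$1,250.00`.

```python
import pandas as pd

def currency_to_float(series):
    """Convert strings such as '$1,250.00' to floats such as 1250.0."""
    cleaned = (
        series.str.replace("$", "", regex=False)  # treat '$' literally, not as a regex anchor
        .str.replace(",", "", regex=False)        # drop thousands separators
    )
    return pd.to_numeric(cleaned)

# Toy stand-in for the two calendar price columns (made-up values)
sample_calendar = pd.DataFrame({
    "price": ["$58.00", "$1,250.00"],
    "adjusted_price": ["$58.00", "$1,200.00"],
})

for column in ["price", "adjusted_price"]:
    sample_calendar[column] = currency_to_float(sample_calendar[column])

print(sample_calendar.dtypes)
```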

In [30]:
# Calculate price difference between price and adjusted price, put in a new column
calendar['price_diff'] = calendar['price'] - calendar['adjusted_price']

In [31]:
# Look at how big/small the price difference might be
calendar['price_diff'].value_counts()

 0.0     3326260
 2.0        4707
 1.0        2795
 5.0        1846
 9.0        1771
          ...   
-12.0          1
 53.0          1
-33.0          1
-8.0           1
 46.0          1
Name: price_diff, Length: 101, dtype: int64

Not sure what the price difference represents (why is there an adjusted price?). It might be that the price listed in this CSV changed between instances of the Inside Airbnb team pulling the data (they pull it on a semi-regular basis).

In [33]:
# Check the possible values for the available column
calendar['available'].value_counts()

f    2178644
t    1175426
Name: available, dtype: int64

The values t and f most likely stand for true and false (available or not). This needs more EDA to confirm; otherwise this feature will be left out.
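If the t/f reading holds, the column could be mapped to booleans, which makes it easy to compute shares and filter. The sketch assumes 't' means available and 'f' means not available, and `is_available` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for calendar['available'] (made-up values)
sample_calendar = pd.DataFrame({"available": ["f", "t", "t", "f"]})

# Assumption: 't' means available and 'f' means not available
sample_calendar["is_available"] = sample_calendar["available"].map(
    {"t": True, "f": False}
)

# Share of listing-nights marked available
available_share = sample_calendar["is_available"].mean()
print(available_share)
```

On the real counts above (1,175,426 t out of 3,354,070 rows), the same calculation would come out at about 35%.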

In [34]:
# seeing how many times one listing shows up in this dataset. 
calendar[calendar['listing_id'] == 3344]['date'].value_counts()

2020-08-25    1
2019-10-18    1
2020-09-15    1
2020-02-06    1
2019-12-23    1
             ..
2020-01-12    1
2020-06-22    1
2020-09-16    1
2019-12-07    1
2019-11-19    1
Name: date, Length: 365, dtype: int64

In [None]:
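# Summarize price per listing. The earlier attempt, calendar['listing_id'].groupby(by='price'),
# raised an error because a Series has no 'price' key to group by; group the DataFrame instead.
calendar.groupby('listing_id')['price'].mean().sort_values()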

## Additional notes on future actions

Learning problem: analyze factors that contribute to good reviews (unsupervised exploration), then predict on a new dataset (which would be a supervised step)? (Should be Oct and Nov data by time of capstone.)

Try to predict price changes based on review sentiment? 

Does high availability of a listing have a relationship with anything? (review sentiment, number of reviews, price per night, neighbourhood, host?)

Look at how price has changed over time

May or may not use time series to correlate data over time

Use archived calendar data set to account for seasonality or change over time

Look at how older events affected the price and see if this can predict how newer events will affect price (model)
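As a first pass at the availability question in the notes above, a correlation against the other numeric listing columns could show where to look. The sketch uses the five rows from the listings `head()` output, so the values it prints are far too small a sample to mean anything. `availability_corr` is a placeholder name.

```python
import pandas as pd

# The five rows from the listings head() output (illustrative only)
sample_listings = pd.DataFrame({
    "availability_365": [358, 307, 296, 361, 283],
    "price": [58, 433, 65, 75, 55],
    "number_of_reviews": [11, 171, 36, 79, 71],
})

# Pearson correlation of availability with each other numeric column
availability_corr = (
    sample_listings.corr()["availability_365"].drop("availability_365")
)
print(availability_corr)
```

Review sentiment and neighbourhood are not numeric, so they would need a score or an encoding before they could be included in a check like this.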