# Description:



## 1. Data Source

The data was downloaded from Inside Airbnb (http://insideairbnb.com/get-the-data.html), which compiles publicly available information from the Airbnb site.

* Target area for analysis: Hawaii, Hawaii
* Data collection date: 07/07/2020

### Data cleaning 

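* remove rows with missing values
* remove reviews whose comment has type 'float' (missing comments, shown as NaN)
* remove reviews shorter than 8 characters
* only keep reviews made within the 2 years before the collection date (on or after 2018-07-07)
* only keep reviews made in English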
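
The first cleaning steps above can be sketched as a single helper. This is only an illustration: the function name `clean_reviews`, the `min_len` parameter, and the toy data are assumptions and do not come from the notebook. The language filter is left out here.

```python
import pandas as pd


def clean_reviews(df: pd.DataFrame, min_len: int = 8) -> pd.DataFrame:
    """Drop missing comments and comments shorter than min_len characters."""
    # rows whose comment is missing (NaN) are removed first
    out = df.dropna(subset=["comments"]).copy()
    # vectorized character count of each remaining comment
    out["len"] = out["comments"].str.len()
    return out[out["len"] >= min_len]


# hypothetical toy data, not taken from reviews.csv
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "comments": [
        "Lovely quiet place near the beach",
        None,
        "great",
        "Amazing home away from home.",
    ],
})
cleaned = clean_reviews(toy)
print(cleaned["id"].tolist())  # [1, 4]
```

Using `.copy()` after the filter keeps the later column assignment from acting on a view of the original frame.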

In [1]:
# Read in the customer review data for Hawaii. The data is saved in a local folder as a .csv file.
import pandas as pd

df_review = pd.read_csv("/Users/fujinhuizi/Documents/GitHub/data/reviews.csv")
df_review.info()
# 620,548 rows and 6 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 620548 entries, 0 to 620547
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     620548 non-null  int64 
 1   id             620548 non-null  int64 
 2   date           620548 non-null  object
 3   reviewer_id    620548 non-null  int64 
 4   reviewer_name  620548 non-null  object
 5   comments       620290 non-null  object
dtypes: int64(3), object(3)
memory usage: 28.4+ MB


In [2]:
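# preview the data with rows containing missing values removed
# note: dropna() returns a new DataFrame; the result is not assigned here, so df_review itself is unchanged
df_review.dropna()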

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,5065,3578629,2013-02-18,4574728,Terry,The place was difficult to find and communicat...
1,5065,4412184,2013-05-03,3067352,Olivia,Wayne was very friendly and his place is sweet...
2,5065,55331648,2015-11-29,33781202,Elspeth And Adam,We loved our time at this BnB! Beautiful surro...
3,5065,57598810,2015-12-27,12288841,Lydia,"The organisation was very uncomplicated,\r\nwe..."
4,5065,58905911,2016-01-05,41538214,Andrew,Place was great for what we wanted. Be ready t...
...,...,...,...,...,...,...
620543,43862996,635233601,2020-07-05,301920114,Iwalani,Amazing home away from home.
620544,43872014,633204187,2020-06-28,47237367,Cash,Amazing stay! Very quiet resort! Location was ...
620545,43894616,634395596,2020-07-03,12832377,Korel,Brittany’s condo was probably the nicest Airbn...
620546,43894616,635272463,2020-07-05,10157021,Amber,"Aloha,\n\nBrittany’s place was absolutely perf..."


In [3]:
# check which Python types occur in the comments column
def trans(text):
    """Return the name of the Python type of a value, e.g. 'str' or 'float'."""
    return type(text).__name__

df_review["type"] = df_review.comments.apply(trans)

df_review.type.unique()

array(['str', 'float'], dtype=object)

In [4]:
# what's the comment like when it's a 'float'?
check1 = df_review[df_review.type == 'float']
check1
# 258 rows, missing comment, safe to remove

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,type
1807,14514,614340791,2020-03-04,287260439,Todd,,float
12121,205622,442161202,2019-04-22,224654953,Michael,,float
14509,233716,21013801,2014-10-09,2216314,McComma,,float
14779,238177,606062597,2020-02-15,232975745,Nathaniel,,float
15656,253282,614558986,2020-03-05,335728336,Nicholas,,float
...,...,...,...,...,...,...,...
618389,40865973,593441632,2020-01-15,84920240,Alejandra,,float
618815,41018679,589348517,2020-01-05,46814063,Jooyoung,,float
619651,41719929,600092381,2020-01-31,154950516,Chase,,float
620459,42844820,620144581,2020-03-20,110683284,Sebastian,,float
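
The 'float' label appears because pandas represents a missing comment as NaN, and NaN is a Python float. A shorter way to count such rows, sketched here on hypothetical toy data rather than on `reviews.csv`, is `isna()`:

```python
import pandas as pd

# hypothetical toy data with two missing comments
toy = pd.DataFrame({"comments": ["Nice stay", float("nan"), "ok", float("nan")]})

# boolean mask of missing comments, then the number of missing rows
missing = toy["comments"].isna()
n_missing = int(missing.sum())
print(n_missing)  # 2

# NaN itself is a float, which is why the type check reports 'float'
nan_type = type(float("nan")).__name__
print(nan_type)  # float
```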


In [6]:
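# keep only rows whose comment is a string; .copy() avoids a SettingWithCopyWarning when columns are added later
df_review = df_review[df_review.type == "str"].copy()
df_review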

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,type
0,5065,3578629,2013-02-18,4574728,Terry,The place was difficult to find and communicat...,str
1,5065,4412184,2013-05-03,3067352,Olivia,Wayne was very friendly and his place is sweet...,str
2,5065,55331648,2015-11-29,33781202,Elspeth And Adam,We loved our time at this BnB! Beautiful surro...,str
3,5065,57598810,2015-12-27,12288841,Lydia,"The organisation was very uncomplicated,\r\nwe...",str
4,5065,58905911,2016-01-05,41538214,Andrew,Place was great for what we wanted. Be ready t...,str
...,...,...,...,...,...,...,...
620543,43862996,635233601,2020-07-05,301920114,Iwalani,Amazing home away from home.,str
620544,43872014,633204187,2020-06-28,47237367,Cash,Amazing stay! Very quiet resort! Location was ...,str
620545,43894616,634395596,2020-07-03,12832377,Korel,Brittany’s condo was probably the nicest Airbn...,str
620546,43894616,635272463,2020-07-05,10157021,Amber,"Aloha,\n\nBrittany’s place was absolutely perf...",str


In [7]:
# check whether any comment is the literal string 'NaN' (none found)
check2 = df_review[df_review.comments == 'NaN']
check2

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,type


In [8]:
# length of each comment in characters
df_review["len"] = df_review.comments.apply(len)
df_review.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,type,len
0,5065,3578629,2013-02-18,4574728,Terry,The place was difficult to find and communicat...,str,429
1,5065,4412184,2013-05-03,3067352,Olivia,Wayne was very friendly and his place is sweet...,str,205
2,5065,55331648,2015-11-29,33781202,Elspeth And Adam,We loved our time at this BnB! Beautiful surro...,str,202
3,5065,57598810,2015-12-27,12288841,Lydia,"The organisation was very uncomplicated,\r\nwe...",str,373
4,5065,58905911,2016-01-05,41538214,Andrew,Place was great for what we wanted. Be ready t...,str,164


In [9]:
# inspect comments shorter than 8 characters
check3 = df_review[df_review.len < 8]
check3

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,type,len
199,5387,363465188,2018-12-28,36673205,Phillip,.,str,1
256,5390,21956488,2014-10-27,13831665,Dianbo,great,str,5
330,5390,344519912,2018-11-03,5794129,James Knowles And,G,str,1
657,8833,157756873,2017-06-04,71309499,Anabela,.,str,1
982,13528,6771095,2013-08-24,8227817,Cyril,Amazing,str,7
...,...,...,...,...,...,...,...,...
619874,41883138,620714921,2020-03-23,123467778,Sandra,.,str,1
620207,42310583,620800725,2020-03-24,58599981,Michael,Great!,str,6
620329,42470217,612104027,2020-02-28,274426845,Florentin,_,str,1
620399,42670704,622493126,2020-04-15,16251600,John,hi,str,2


In [10]:
# remove comments with length < 8
df_review = df_review[df_review['len'] >= 8]
df_review

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,type,len
0,5065,3578629,2013-02-18,4574728,Terry,The place was difficult to find and communicat...,str,429
1,5065,4412184,2013-05-03,3067352,Olivia,Wayne was very friendly and his place is sweet...,str,205
2,5065,55331648,2015-11-29,33781202,Elspeth And Adam,We loved our time at this BnB! Beautiful surro...,str,202
3,5065,57598810,2015-12-27,12288841,Lydia,"The organisation was very uncomplicated,\r\nwe...",str,373
4,5065,58905911,2016-01-05,41538214,Andrew,Place was great for what we wanted. Be ready t...,str,164
...,...,...,...,...,...,...,...,...
620543,43862996,635233601,2020-07-05,301920114,Iwalani,Amazing home away from home.,str,28
620544,43872014,633204187,2020-06-28,47237367,Cash,Amazing stay! Very quiet resort! Location was ...,str,136
620545,43894616,634395596,2020-07-03,12832377,Korel,Brittany’s condo was probably the nicest Airbn...,str,812
620546,43894616,635272463,2020-07-05,10157021,Amber,"Aloha,\n\nBrittany’s place was absolutely perf...",str,607


In [15]:
# check review date range
df_review.date.min() #'2009-05-07'
df_review.date.max() #'2020-07-07'

'2020-07-07'

In [17]:
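# focus on comments made within the 2 years before data collection
# (dates are ISO-formatted strings, so string comparison orders them correctly)
df_review = df_review[df_review['date'] >= '2018-07-07']
df_review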

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,type,len
33,5065,311261133,2018-08-20,81213098,Cheryl,"Beautiful, peaceful, little hale on the mountain.",str,49
34,5065,428339790,2019-03-24,138578832,Mokulua,"The unit was comfortable, clean and quiet. The...",str,203
35,5065,453792817,2019-05-16,124614704,Tiffany,Great place if you want peace and quiet and be...,str,67
36,5065,513673702,2019-08-19,4707325,Nicolas,This is the perfect place to stay if you want ...,str,77
37,5065,606909944,2020-02-16,253726221,Boyang,"Big, quiet studio in the country. Lots of wind...",str,230
...,...,...,...,...,...,...,...,...
620543,43862996,635233601,2020-07-05,301920114,Iwalani,Amazing home away from home.,str,28
620544,43872014,633204187,2020-06-28,47237367,Cash,Amazing stay! Very quiet resort! Location was ...,str,136
620545,43894616,634395596,2020-07-03,12832377,Korel,Brittany’s condo was probably the nicest Airbn...,str,812
620546,43894616,635272463,2020-07-05,10157021,Amber,"Aloha,\n\nBrittany’s place was absolutely perf...",str,607
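
The two-year cutoff can also be derived from the collection date instead of being typed by hand. The sketch below assumes the `date` column is parsed with `pd.to_datetime`; the toy data and the variable names `collection_date` and `cutoff` are hypothetical.

```python
import pandas as pd

# hypothetical toy data; the real column holds ISO date strings such as '2013-02-18'
toy = pd.DataFrame({"date": ["2013-02-18", "2018-07-07", "2020-07-05"]})
toy["date"] = pd.to_datetime(toy["date"])

# collection date from the Data Source section, minus two calendar years
collection_date = pd.Timestamp("2020-07-07")
cutoff = collection_date - pd.DateOffset(years=2)

# keep reviews dated on or after the cutoff
recent = toy[toy["date"] >= cutoff]
print(recent["date"].dt.strftime("%Y-%m-%d").tolist())  # ['2018-07-07', '2020-07-05']
```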


In [11]:
# detect review language - analyze English only
from langdetect import detect
# test_1 = df_review.head()  # small sample for a quick test
df_review["detect"] = df_review.comments.apply(detect)

KeyboardInterrupt: 
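
Applying `detect` directly to every row can also stop on comments that contain no letters. To my understanding, langdetect raises an exception for such text. A minimal sketch of a defensive wrapper follows. The names `safe_detect` and `fake_detector` are hypothetical, and `fake_detector` is only a stand-in so the example runs without langdetect installed.

```python
from typing import Callable, Optional


def safe_detect(text: str, detector: Callable[[str], str]) -> Optional[str]:
    """Return the detected language code, or None if detection fails."""
    try:
        return detector(text)
    except Exception:
        # assumption: the detector raises when the text has no usable features
        return None


def fake_detector(text: str) -> str:
    """Hypothetical stand-in for langdetect.detect; always answers 'en' for text with letters."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en"


comments = ["Amazing home away from home.", "........"]
langs = [safe_detect(c, fake_detector) for c in comments]
print(langs)  # ['en', None]
```

With the real library, the same wrapper could be used as `df_review.comments.apply(lambda t: safe_detect(t, detect))`, and rows where the result is not `'en'` could then be dropped.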