## Goal
The goal of this project is to separate positive and negative hotel reviews using VADER, then
use topic modelling to extract the most positive and negative words in each category.
For future work, I want to create a Flask app that lets a user enter a hotel name
and returns the pros and cons of that hotel.

One attribute in the reviews.csv file is hotel class. It is calculated from the positive comments on TripAdvisor. TripAdvisor has 5 rating criteria, starting from rooms, hotel facilities, swimming pool, smoking zone, WiFi, business centre, bar, coffee shop, etc. Once we have the positive and negative sentiments, we can compare them with the stars to see whether they match up.
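
As a rough sketch of the first step: VADER returns a `compound` score between -1 and 1 for each text, and a common convention is to treat scores of 0.05 or above as positive and -0.05 or below as negative. The helper names below (`label_from_compound`, `vader_label`) and the default threshold are assumptions for illustration, not code from this notebook, and `vader_label` assumes the `vaderSentiment` package is installed.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_label(text, analyzer=None, threshold=0.05):
    """Score one review with VADER and label it.

    Needs `pip install vaderSentiment`. Pass a shared analyzer when
    labelling many reviews, so it is only built once.
    """
    if analyzer is None:
        # Imported lazily so label_from_compound works without the package.
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()
    compound = analyzer.polarity_scores(text)["compound"]
    return label_from_compound(compound, threshold)
```

Keeping the thresholding separate from the scoring makes it easy to try a stricter cut-off later without rescoring every review.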



In [5]:
import pandas as pd
import json

In [6]:
with open('data/offering.json') as json_file:
    data = json.load(json_file)

In [7]:
offering_df = pd.read_json('data/offering.json')

In [4]:
offering_df.head()

Unnamed: 0,address,details,hotel_class,id,name,phone,region_id,type,url
0,"{'region': 'NY', 'street-address': '147 West 4...",,4.0,113317,Casablanca Hotel Times Square,,60763,hotel,http://www.tripadvisor.com/Hotel_Review-g60763...
1,"{'region': 'CA', 'street-address': '300 S Dohe...",,5.0,76049,Four Seasons Hotel Los Angeles at Beverly Hills,,32655,hotel,http://www.tripadvisor.com/Hotel_Review-g32655...
2,"{'region': 'NY', 'street-address': '790 Eighth...",,3.5,99352,Hilton Garden Inn Times Square,,60763,hotel,http://www.tripadvisor.com/Hotel_Review-g60763...
3,"{'region': 'NY', 'street-address': '152 West 5...",,4.0,93589,The Michelangelo Hotel,,60763,hotel,http://www.tripadvisor.com/Hotel_Review-g60763...
4,"{'region': 'NY', 'street-address': '130 West 4...",,4.0,217616,The Muse Hotel New York,,60763,hotel,http://www.tripadvisor.com/Hotel_Review-g60763...


In [14]:
#the data has 4333 rows and 9 columns
offering_df.shape

(4333, 9)

In [8]:
#make lists of placeholder values (None for now);
#for each address dictionary, if a key exists,
#replace the None with the value stored under that key

num_rows = offering_df.shape[0]

In [9]:
region_list=[None]*num_rows
address_list=[None]*num_rows
postal_code_list=[None]*num_rows
locality_list=[None]*num_rows

In [15]:
keys = ['region', 'street-address', 'postal-code', 'locality']

ct = 0

addresses, offering_df = offering_df['address'], offering_df.drop('address', axis =1)
# addresses = offering_df['address']

In [17]:
addresses[0]

{'region': 'NY',
 'street-address': '147 West 43rd Street',
 'postal-code': '10036',
 'locality': 'New York City'}

In [52]:
#fill each list by position; keys missing from an address stay None
#(enumerate avoids relying on a counter set in an earlier cell)
for i, address in enumerate(addresses):
    region_list[i] = address.get('region')
    address_list[i] = address.get('street-address')
    postal_code_list[i] = address.get('postal-code')
    locality_list[i] = address.get('locality')


offering_df['region'] = region_list
offering_df['street-address'] = address_list
offering_df['postal-code'] = postal_code_list
offering_df['locality'] = locality_list
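
The same pattern works for any column of dictionaries, so it can be written once and reused. The following is a minimal sketch, with `dicts_to_columns` as a hypothetical helper name and a made-up two-row sample; missing keys stay `None`, matching the placeholder lists above.

```python
def dicts_to_columns(records, keys):
    """Turn a sequence of dicts into {key: list of values}, one list per key.

    Keys missing from a record (or records that are not dicts) yield None.
    """
    columns = {key: [] for key in keys}
    for record in records:
        for key in keys:
            value = record.get(key) if isinstance(record, dict) else None
            columns[key].append(value)
    return columns


# Hypothetical sample shaped like the address dictionaries in offering_df.
sample = [
    {"region": "NY", "street-address": "147 West 43rd Street",
     "postal-code": "10036", "locality": "New York City"},
    {"region": "CA", "locality": "Los Angeles"},  # some keys missing
]
columns = dicts_to_columns(
    sample, ["region", "street-address", "postal-code", "locality"])
print(columns["postal-code"])  # ['10036', None]
```

Each resulting list could then be assigned back with `offering_df[key] = values`, exactly as the cell above does by hand.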

In [53]:
offering_df.head()

Unnamed: 0,details,hotel_class,id,name,phone,region_id,type,url,region,street-address,postal-code,locality
0,,4.0,113317,Casablanca Hotel Times Square,,60763,hotel,http://www.tripadvisor.com/Hotel_Review-g60763...,NY,147 West 43rd Street,10036,New York City
1,,5.0,76049,Four Seasons Hotel Los Angeles at Beverly Hills,,32655,hotel,http://www.tripadvisor.com/Hotel_Review-g32655...,CA,300 S Doheny Dr,90048,Los Angeles
2,,3.5,99352,Hilton Garden Inn Times Square,,60763,hotel,http://www.tripadvisor.com/Hotel_Review-g60763...,NY,790 Eighth Avenue,10019,New York City
3,,4.0,93589,The Michelangelo Hotel,,60763,hotel,http://www.tripadvisor.com/Hotel_Review-g60763...,NY,152 West 51st Street,10019,New York City
4,,4.0,217616,The Muse Hotel New York,,60763,hotel,http://www.tripadvisor.com/Hotel_Review-g60763...,NY,130 West 46th Street,10036,New York City


In [39]:
#look at the first row, including its id
offering_df.iloc[0]

details                                                         NaN
hotel_class                                                       4
id                                                           113317
name                                  Casablanca Hotel Times Square
phone                                                              
region_id                                                     60763
type                                                          hotel
url               http://www.tripadvisor.com/Hotel_Review-g60763...
region                                                           NY
street-address                                 147 West 43rd Street
postal-code                                                   10036
locality                                              New York City
Name: 0, dtype: object

In [40]:
#let's see if details is always NaN
offering_df.details.value_counts()
#since it's always NaN, we can drop this column

Series([], Name: details, dtype: int64)

In [54]:
offering_df=offering_df.drop('details',axis=1)

In [55]:
offering_df.to_pickle('./offering_df.pkl')

In [56]:
offering_df=pd.read_pickle('offering_df.pkl', compression='infer')

In [47]:
#hotel_class is the hotel's star class, not a review sentiment; most rated hotels are 2 to 3 stars
offering_df.hotel_class.value_counts()

2.0    753
3.0    746
2.5    631
3.5    405
4.0    394
5.0     72
1.5     67
4.5     61
1.0     12
Name: hotel_class, dtype: int64
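
To put numbers on that distribution, the counts printed above can simply be summed. This small sketch hard-codes the `value_counts()` output and the row count from `offering_df.shape`; the variable names are only for illustration.

```python
# hotel_class -> number of hotels, copied from the value_counts() output above
class_counts = {2.0: 753, 3.0: 746, 2.5: 631, 3.5: 405, 4.0: 394,
                5.0: 72, 1.5: 67, 4.5: 61, 1.0: 12}
total_rows = 4333  # from offering_df.shape

rated = sum(class_counts.values())
missing = total_rows - rated  # value_counts() skips NaN by default
at_most_three = sum(n for stars, n in class_counts.items() if stars <= 3.0)

print(rated, missing)                    # 3141 1192
print(round(at_most_three / rated, 3))   # 0.703
```

So about 70% of the rated hotels are 3 stars or below, and 1,192 of the 4,333 rows have no hotel class at all, which matters when comparing stars against sentiment later.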

## Import Review Data

In [18]:
with open('data/review.txt') as text_file:
    reviews = text_file.read()

In [19]:
reviews = reviews.split('\n')

In [75]:
reviews[0]

'{"ratings": {"service": 5.0, "cleanliness": 5.0, "overall": 5.0, "value": 5.0, "location": 5.0, "sleep_quality": 5.0, "rooms": 5.0}, "title": "\\u201cTruly is \\"Jewel of the Upper Wets Side\\"\\u201d", "text": "Stayed in a king suite for 11 nights and yes it cots us a bit but we were happy with the standard of room, the location and the friendliness of the staff. Our room was on the 20th floor overlooking Broadway and the madhouse of the Fairway Market. Room was quite with no noise evident from the hallway or adjoining rooms. It was great to be able to open windows when we craved fresh rather than heated air. The beds, including the fold out sofa bed, were comfortable and the rooms were cleaned well. Wi-fi access worked like a dream with only one connectivity issue on our first night and this was promptly responded to with a call from the service provider to ensure that all was well. The location close to the 72nd Street subway station is great and the complimentary umbrellas on the 

In [20]:
type(reviews)

list

In [21]:
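review_list = []
for review in reviews:
    #skip blank lines, such as the empty string left after the final newline
    if not review.strip():
        continue
    try:
        review_list.append(json.loads(review))
    except json.JSONDecodeError:
        #skip a malformed line instead of silently dropping every line after it
        continue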
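
Since `review.txt` holds one JSON object per line, it can also be parsed lazily, without reading the whole file into memory first. The sketch below is one way to do that; `iter_json_lines` is a hypothetical helper, and skipping malformed lines rather than stopping at the first one is an assumption about the desired behaviour.

```python
import json


def iter_json_lines(lines):
    """Yield (line_number, parsed_object) for each valid JSON line.

    Blank lines are ignored; malformed lines are reported and skipped.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield line_number, json.loads(line)
        except json.JSONDecodeError as err:
            print(f"skipping line {line_number}: {err.msg}")


# Works on any iterable of strings, including an open file handle.
sample_lines = ['{"id": 1, "title": "ok"}', '', '{"id": 2, "title": ', '{"id": 3}']
parsed = [obj for _, obj in iter_json_lines(sample_lines)]
print([obj["id"] for obj in parsed])  # [1, 3]
```

With the real data, the same helper could be fed the open file directly, for example `review_list = [obj for _, obj in iter_json_lines(text_file)]` inside the `with open('data/review.txt')` block.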

In [22]:
review_list[0]

{'ratings': {'service': 5.0,
  'cleanliness': 5.0,
  'overall': 5.0,
  'value': 5.0,
  'location': 5.0,
  'sleep_quality': 5.0,
  'rooms': 5.0},
 'title': '“Truly is "Jewel of the Upper Wets Side"”',
 'text': 'Stayed in a king suite for 11 nights and yes it cots us a bit but we were happy with the standard of room, the location and the friendliness of the staff. Our room was on the 20th floor overlooking Broadway and the madhouse of the Fairway Market. Room was quite with no noise evident from the hallway or adjoining rooms. It was great to be able to open windows when we craved fresh rather than heated air. The beds, including the fold out sofa bed, were comfortable and the rooms were cleaned well. Wi-fi access worked like a dream with only one connectivity issue on our first night and this was promptly responded to with a call from the service provider to ensure that all was well. The location close to the 72nd Street subway station is great and the complimentary umbrellas on the dri

In [23]:
reviews_df=pd.DataFrame(review_list)

In [24]:
reviews_df.head()

Unnamed: 0,author,date,date_stayed,id,num_helpful_votes,offering_id,ratings,text,title,via_mobile
0,"{'username': 'Papa_Panda', 'num_cities': 22, '...","December 17, 2012",December 2012,147643103,0,93338,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",Stayed in a king suite for 11 nights and yes i...,"“Truly is ""Jewel of the Upper Wets Side""”",False
1,"{'username': 'Maureen V', 'num_reviews': 2, 'n...","December 17, 2012",December 2012,147639004,0,93338,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...","On every visit to NYC, the Hotel Beacon is the...",“My home away from home!”,False
2,"{'username': 'vuguru', 'num_cities': 12, 'num_...","December 18, 2012",December 2012,147697954,0,1762573,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",This is a great property in Midtown. We two di...,“Great Stay”,False
3,"{'username': 'Hotel-Designer', 'num_cities': 5...","December 17, 2012",August 2012,147625723,0,1762573,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",The Andaz is a nice hotel in a central locatio...,“Modern Convenience”,False
4,"{'username': 'JamesE339', 'num_cities': 34, 'n...","December 17, 2012",December 2012,147612823,0,1762573,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",I have stayed at each of the US Andaz properti...,“Its the best of the Andaz Brand in the US....”,False


In [88]:
reviews_df.to_pickle('./reviews_df.pkl')

In [99]:
reviews_df=pd.read_pickle('reviews_df.pkl', compression='infer')

In [100]:
#now we can separate author, ratings into different columns
num_rows = reviews_df.shape[0]

username_list=[None]*num_rows
num_cities_list=[None]*num_rows
num_helpful_votes_list=[None]*num_rows
num_reviews_list=[None]*num_rows


keys= ['username', 'num_cities', 'num_helpful_votes', 'num_reviews']

ct = 0

author, reviews_df = reviews_df['author'], reviews_df.drop('author', axis =1)

for user in author:
    for key in user.keys():
        if key == 'username':
            username_list[ct] = user[key]
        elif key == 'num_cities':
            num_cities_list[ct] = user[key]
        elif key == 'num_helpful_votes':
            num_helpful_votes_list[ct] = user[key]
        elif key == 'num_reviews':
            num_reviews_list[ct] = user[key]
    ct += 1


reviews_df['username']=username_list
reviews_df['num_cities']=num_cities_list
#note: this replaces the review-level num_helpful_votes column with the author's count
reviews_df['num_helpful_votes'] =num_helpful_votes_list
reviews_df['num_reviews']=num_reviews_list
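
One thing to watch in this cell: `author` contains a `num_helpful_votes` key, and `reviews_df` already has a column with that name, so the assignment replaces the review-level values with the author-level ones. A possible way around this, sketched below with a hypothetical `flatten_field` helper and an `author_` prefix chosen purely for illustration, is to prefix the nested keys before turning them into columns.

```python
def flatten_field(record, field, prefix):
    """Return a copy of `record` with the dict stored under `field` expanded.

    Each nested key becomes `prefix + key`, so it cannot clash with an
    existing top-level key of the same name.
    """
    flat = {k: v for k, v in record.items() if k != field}
    nested = record.get(field) or {}
    for key, value in nested.items():
        flat[prefix + key] = value
    return flat


# Hypothetical record shaped like one parsed review.
review = {
    "id": 147643103,
    "num_helpful_votes": 0,
    "author": {"username": "Papa_Panda", "num_helpful_votes": 12},
}
flat = flatten_field(review, "author", "author_")
print(flat["num_helpful_votes"], flat["author_num_helpful_votes"])  # 0 12
```

Applied to every element of `review_list` before building the DataFrame, this would keep both counts side by side.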



In [101]:
reviews_df.head()

Unnamed: 0,date,date_stayed,id,num_helpful_votes,offering_id,ratings,text,title,via_mobile,username,num_cities,num_reviews
0,"December 17, 2012",December 2012,147643103,12.0,93338,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",Stayed in a king suite for 11 nights and yes i...,"“Truly is ""Jewel of the Upper Wets Side""”",False,Papa_Panda,22.0,29.0
1,"December 17, 2012",December 2012,147639004,,93338,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...","On every visit to NYC, the Hotel Beacon is the...",“My home away from home!”,False,Maureen V,2.0,2.0
2,"December 18, 2012",December 2012,147697954,17.0,1762573,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",This is a great property in Midtown. We two di...,“Great Stay”,False,vuguru,12.0,14.0
3,"December 17, 2012",August 2012,147625723,26.0,1762573,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",The Andaz is a nice hotel in a central locatio...,“Modern Convenience”,False,Hotel-Designer,5.0,5.0
4,"December 17, 2012",December 2012,147612823,65.0,1762573,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",I have stayed at each of the US Andaz properti...,“Its the best of the Andaz Brand in the US....”,False,JamesE339,34.0,104.0


In [93]:
reviews_df['ratings'][0]

{'service': 5.0,
 'cleanliness': 5.0,
 'overall': 5.0,
 'value': 5.0,
 'location': 5.0,
 'sleep_quality': 5.0,
 'rooms': 5.0}

In [105]:
service_list=[None]*num_rows
cleanliness_list=[None]*num_rows
overall_list=[None]*num_rows
value_list=[None]*num_rows
location_list=[None]*num_rows
sleep_quality_list=[None]*num_rows
rooms_list=[None]*num_rows

keys = ['service','cleanliness','overall','value','location','sleep_quality','rooms']

ratings, reviews_df = reviews_df['ratings'], reviews_df.drop('ratings', axis =1)

ct=0

for rating in ratings:
    for key in rating.keys():
        if key == 'service':
            service_list[ct] = rating[key]
        elif key == 'cleanliness':
            cleanliness_list[ct] = rating[key]
        elif key == 'overall':
            overall_list[ct] = rating[key]
        elif key == 'value':
            value_list[ct] = rating[key]
        elif key == 'location':
            location_list[ct] = rating[key]
        elif key == 'sleep_quality':
            sleep_quality_list[ct] = rating[key]
        elif key == 'rooms':
            rooms_list[ct] = rating[key]
    ct += 1


reviews_df['service']=service_list
reviews_df['cleanliness']=cleanliness_list
reviews_df['overall']=overall_list
reviews_df['value']=value_list
reviews_df['location']=location_list
reviews_df['sleep_quality']=sleep_quality_list
reviews_df['rooms']=rooms_list

In [106]:
reviews_df.head()

Unnamed: 0,date,date_stayed,id,num_helpful_votes,offering_id,text,title,via_mobile,username,num_cities,num_reviews,service,cleanliness,overall,value,location,sleep_quality,rooms
0,"December 17, 2012",December 2012,147643103,12.0,93338,Stayed in a king suite for 11 nights and yes i...,"“Truly is ""Jewel of the Upper Wets Side""”",False,Papa_Panda,22.0,29.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
1,"December 17, 2012",December 2012,147639004,,93338,"On every visit to NYC, the Hotel Beacon is the...",“My home away from home!”,False,Maureen V,2.0,2.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
2,"December 18, 2012",December 2012,147697954,17.0,1762573,This is a great property in Midtown. We two di...,“Great Stay”,False,vuguru,12.0,14.0,4.0,5.0,4.0,4.0,5.0,4.0,4.0
3,"December 17, 2012",August 2012,147625723,26.0,1762573,The Andaz is a nice hotel in a central locatio...,“Modern Convenience”,False,Hotel-Designer,5.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0
4,"December 17, 2012",December 2012,147612823,65.0,1762573,I have stayed at each of the US Andaz properti...,“Its the best of the Andaz Brand in the US....”,False,JamesE339,34.0,104.0,4.0,5.0,4.0,3.0,5.0,5.0,5.0


In [107]:
reviews_df.to_pickle('./reviews_final_df.pkl')

In [108]:
reviews_final_df=pd.read_pickle('reviews_final_df.pkl', compression='infer')

In [109]:
reviews_final_df.head()

Unnamed: 0,date,date_stayed,id,num_helpful_votes,offering_id,text,title,via_mobile,username,num_cities,num_reviews,service,cleanliness,overall,value,location,sleep_quality,rooms
0,"December 17, 2012",December 2012,147643103,12.0,93338,Stayed in a king suite for 11 nights and yes i...,"“Truly is ""Jewel of the Upper Wets Side""”",False,Papa_Panda,22.0,29.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
1,"December 17, 2012",December 2012,147639004,,93338,"On every visit to NYC, the Hotel Beacon is the...",“My home away from home!”,False,Maureen V,2.0,2.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
2,"December 18, 2012",December 2012,147697954,17.0,1762573,This is a great property in Midtown. We two di...,“Great Stay”,False,vuguru,12.0,14.0,4.0,5.0,4.0,4.0,5.0,4.0,4.0
3,"December 17, 2012",August 2012,147625723,26.0,1762573,The Andaz is a nice hotel in a central locatio...,“Modern Convenience”,False,Hotel-Designer,5.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0
4,"December 17, 2012",December 2012,147612823,65.0,1762573,I have stayed at each of the US Andaz properti...,“Its the best of the Andaz Brand in the US....”,False,JamesE339,34.0,104.0,4.0,5.0,4.0,3.0,5.0,5.0,5.0
