# MSDS 430 Module 6 Python Assignment

<div class="alert alert-block alert-warning"> In this assignment you will complete the following exercises and submit your <b>notebook</b> and <b>html file</b> to Canvas. Your files should include all output, i.e. run each cell and save your file before submitting.</div>

<div class="alert alert-block alert-info">In this exercise you will work with TripAdvisor customer review data for the <b>BEST WESTERN PLUS Pioneer Square Hotel</b> in Seattle, Washington. The data is stored in a JSON file. JSON (JavaScript Object Notation) is a popular language-independent data format derived from JavaScript. The load function in Python's json module parses a JSON file and, for this file, returns a Python dictionary. Using dictionary methods we can then extract the list of reviews for the hotel, and using string methods we can get information from the comments made by the users.</div>

### Dictionaries and Dict Methods

The hotel data we want to analyze is contained in the JSON file `hotel_reviews.json`. The data includes some information about the hotel, and a number of hotel reviews made by people who (we assume) stayed there. When we read the data into Python we will end up with a "nested" dictionary, i.e. a dictionary in which some values are themselves dictionaries or lists of dictionaries. Before we examine the structure of this nested dictionary we need to talk a bit about dictionaries in general.

Dictionaries in Python are data structures that store key/value pairs. Keys must be hashable, which in practice means an "immutable" type such as a number, a string, or a tuple of such values. The values can be various kinds of things, including lists, arrays, and other dictionaries. Keys must also be unique: a dictionary cannot hold duplicates, and assigning to an existing key replaces its value. Let us look at some examples.
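The two rules about keys can be checked directly. The following is a minimal sketch with made-up data; the names `point_labels`, `bad_dict` and `dup_dict` are illustrative only and are not used elsewhere in the assignment.

```python
# Tuples of numbers are hashable, so they can serve as keys.
point_labels = {(0, 0): "origin", (1, 2): "ordered pair"}

# Lists are mutable (unhashable), so using one as a key raises a TypeError.
try:
    bad_dict = {[1, 2]: "list key"}
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Repeating a key does not create a duplicate; the last value wins.
dup_dict = {"a": 1, "a": 2}
print(dup_dict)  # {'a': 2}
```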

In [2]:
# A dictionary with different types of keys: 1, "two" and (1,2). 
# Here (1,2) is an example of a tuple.
mixed_keys_dict = {1:"one", "two":2, (1,2):"ordered pair" }
mixed_keys_dict

{1: 'one', 'two': 2, (1, 2): 'ordered pair'}

In [3]:
#  Let us define a simple dictionary with String keys: "name", "age" and "sex":
cust_dict = {"name":"John Doe","age": 32, "sex": "M"}
cust_dict

{'name': 'John Doe', 'age': 32, 'sex': 'M'}

The keys of a dictionary can be obtained with the dictionary's `keys` method, which in Python 3 returns a view object rather than a list. We can also obtain the value of any key in the dictionary by "bracketing" the key, and we can use assignment to change that value if we wish. For more on views, see __[Dictionary View Objects](https://docs.python.org/3/library/stdtypes.html#dictionary-view-objects)__

In [4]:
# Get the list of keys--actually a dict_keys object (views) in Python 3.x.
cust_dict.keys()

dict_keys(['name', 'age', 'sex'])
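The difference between a view and a list matters when the dictionary changes later. Here is a small sketch, using a separate made-up dictionary (`sample_dict`) so that `cust_dict` is left untouched:

```python
sample_dict = {"name": "John Doe", "age": 32}

keys_view = sample_dict.keys()     # a view: stays linked to the dictionary
keys_snapshot = list(keys_view)    # a real list: a copy frozen at this moment

sample_dict["sex"] = "M"           # add a key after creating both objects

print(keys_view)       # dict_keys(['name', 'age', 'sex'])
print(keys_snapshot)   # ['name', 'age']
```

If you need to index into the keys or keep them fixed, converting the view with `list(...)` is the usual choice.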

In [5]:
# Get the value associated with the "name" key
cust_dict["name"]

'John Doe'

In [6]:
# Change the value of the "name" key
cust_dict["name"] = "John Doe Jr."

In [7]:
cust_dict

{'name': 'John Doe Jr.', 'age': 32, 'sex': 'M'}

In [8]:
# What happens when you try to access a key that is not there?
cust_dict['height']

KeyError: 'height'

In [9]:
# A better way...
cust_dict.get("height", "missing")

'missing'

In [10]:
# If the key is there it will return its value...
cust_dict.get("name", "missing")

'John Doe Jr.'
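Two related tools are worth knowing: the `in` operator tests whether a key exists, and `get` with no second argument returns `None` for a missing key. A short sketch with a made-up dictionary (`profile` is an illustrative name):

```python
profile = {"name": "John Doe Jr.", "age": 32}

has_height = "height" in profile       # membership is tested against the keys
height = profile.get("height")         # no default given, so None is returned
age = profile.get("age", "missing")    # key exists, so the default is ignored

print(has_height, height, age)  # False None 32
```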

We can also use assignment to add new key/value pairs to the dictionary. 

In [11]:
cust_dict['height'] = 6.0
cust_dict['weight'] = 200.5
cust_dict

{'name': 'John Doe Jr.', 'age': 32, 'sex': 'M', 'height': 6.0, 'weight': 200.5}

Note that we need to add keys to an *existing* dictionary even if the dictionary is empty to begin with...

In [12]:
market_dict = {}  # create an empty dictionary
market_dict['market_name'] = 'Foods R Us'
market_dict

{'market_name': 'Foods R Us'}

Let us add a new key/value pair to `cust_dict`, where the key is `"location"` and the value of that key is another dictionary (with keys: `"city"`, `"state"` and `"zip code"`).

In [13]:
# Example of a nested dictionary...
location_dict = {"city":"Miami","state":"FL","zip code":33165}
cust_dict["location"]=location_dict
cust_dict

{'name': 'John Doe Jr.',
 'age': 32,
 'sex': 'M',
 'height': 6.0,
 'weight': 200.5,
 'location': {'city': 'Miami', 'state': 'FL', 'zip code': 33165}}

Note that the value of the `"location"` key is itself a dictionary and we can access its value by "bracketing" again.

In [14]:
cust_dict['location']

{'city': 'Miami', 'state': 'FL', 'zip code': 33165}

In [15]:
cust_dict['location']['city']

'Miami'

In [16]:
cust_dict['location']['zip code']

33165
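Chained brackets raise a `KeyError` as soon as any level is missing. One possible way to guard against that, sketched here with made-up records (`customer` and `no_location` are illustrative names), is to chain `get` calls with an empty dictionary as the intermediate default:

```python
customer = {"name": "John Doe Jr.", "location": {"city": "Miami", "state": "FL"}}
no_location = {"name": "Jane Roe"}  # hypothetical record without a "location" key

# The inner key exists.
city = customer.get("location", {}).get("city", "missing")
# The outer key exists, but the inner key does not.
country = customer.get("location", {}).get("country", "missing")
# The outer key is missing, so the empty-dict default handles the second get.
city_unknown = no_location.get("location", {}).get("city", "missing")

print(city, country, city_unknown)  # Miami missing missing
```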

<div class="alert alert-block alert-success"><b>Problem 1 (2 pts.)</b>: Use Python code to add a key/value pair to the <b><i>market_dict</i></b> dictionary defined above. We want the key to be "<b><i>fruits</i></b>" and its corresponding value to be an "inventory" dictionary. This "inventory" dictionary should consist of fruit names as keys (i.e. <b><i>apples</i></b>, <b><i>oranges</i></b> and <b><i>pears</i></b>). The value of each key should be the number of such fruits being sold at the market. Assume that there are 123 apples, 98 oranges and 53 pears on sale. After adding this key/value pair to <b><i>market_dict</i></b>, display <b><i>market_dict["fruits"]</i></b> to verify your work.</div>

In [18]:
fruit_dict = {"apples":123,"oranges":98,"pears":53}
# TO DO: Add a key/value pair to the dictionary where the key is "fruits" and the value is fruit_dict
market_dict['fruits'] = fruit_dict

# The following should display the three keys: 'apples', 'oranges' and 'pears'.
print(market_dict['fruits'].keys())

dict_keys(['apples', 'oranges', 'pears'])


### Examining a JSON File

Now it is time to turn our attention to our JSON file. We want to open and read `hotel_reviews.json` and save the data as a Python dictionary to the variable `hotel_data`. This is a two-step process:

 1. Use the built-in `open` function to create a file object.
 2. Pass the file object to the `load` function in the `json` module. This function parses the contents of the file and returns a Python dictionary.

But first we need to import the `json` module.

In [19]:
import json
with open('hotel_reviews.json') as json_data:
    hotel_data = json.load(json_data)
#hotel_data
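To see how JSON types map to Python types without needing the file, here is a self-contained sketch that parses a small JSON string with `json.loads`. The text below is made up; only its top-level layout (`Reviews` and `HotelInfo`) is meant to mimic `hotel_reviews.json`.

```python
import json

# Made-up JSON text; the values are placeholders, not real review data.
sample_text = '''
{
  "Reviews": [
    {"Author": "reviewer_a", "Content": "Great location.", "Ratings": {"Overall": "5.0"}},
    {"Author": "reviewer_b", "Content": "Small bathroom.", "Ratings": {"Overall": "3.0"}}
  ],
  "HotelInfo": {"Name": "Example Hotel", "HotelID": "0"}
}
'''

# loads parses a string, whereas load parses an open file object.
sample_data = json.loads(sample_text)

print(type(sample_data))                 # a JSON object becomes a dict
print(type(sample_data["Reviews"]))      # a JSON array becomes a list
overall = sample_data["Reviews"][1]["Ratings"]["Overall"]
print(repr(overall))                     # '3.0' (quoted in the JSON, so still a string)
```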

The structure of `hotel_data` is a bit complicated but it is divided into two parts: a **HotelInfo** "section" (i.e. the value of the `'HotelInfo'` key) and the **Reviews** "section" (the value of the `'Reviews'` key).

In [20]:
hotel_data.keys()

dict_keys(['Reviews', 'HotelInfo'])

In [21]:
# The hotel information is stored in a dictionary.
hotel_data['HotelInfo']

{'Name': 'BEST WESTERN PLUS Pioneer Square Hotel',
 'HotelURL': '/ShowUserReviews-g60878-d72572-Reviews-BEST_WESTERN_PLUS_Pioneer_Square_Hotel-Seattle_Washington.html',
 'Price': '$117 - $189*',
 'Address': '<address class="addressReset"> <span rel="v:address"> <span dir="ltr"><span class="street-address" property="v:street-address">77 Yesler Way</span>, <span class="locality"><span property="v:locality">Seattle</span>, <span property="v:region">WA</span> <span property="v:postal-code">98104-2530</span></span> </span> </span> </address>',
 'HotelID': '72572',
 'ImgURL': 'http://media-cdn.tripadvisor.com/media/ProviderThumbnails/dirs/51/f5/51f5d5761c9d693626e59f8178be15442large.jpg'}

In [22]:
# The reviews are stored in a list, with the data for each review stored in a dictionary.
#hotel_data['Reviews']

The hotel information is stored in a dictionary (with keys such as `'HotelID'` and `'Address'`), while the reviews are stored in a list--a list of dictionaries, with each dictionary containing information about a particular review. Let us get the list of reviews and save them to the `reviews` variable for further analysis.

In [23]:
reviews = hotel_data['Reviews'] # list of reviews
type(reviews) # check that it is a list

list

In [24]:
print("There are",len(reviews),"reviews altogether.")

There are 233 reviews altogether.


In [25]:
# display first review
first_review = reviews[0] 
first_review

{'Ratings': {'Service': '4',
  'Cleanliness': '5',
  'Overall': '5.0',
  'Value': '4',
  'Sleep Quality': '4',
  'Rooms': '5',
  'Location': '5'},
 'AuthorLocation': 'Boston',
 'Title': '“Excellent Hotel & Location”',
 'Author': 'gowharr32',
 'ReviewID': 'UR126946257',
 'Content': 'We enjoyed the Best Western Pioneer Square. My husband and I had a room with a king bed and it was clean, quiet, and attractive. Our sons were in a room with twin beds. Their room was in the corner on the main street and they said it was a little noisier and the neon light shone in. But later hotels on the trip made them appreciate this one more. We loved the old wood center staircase. Breakfast was included and everyone was happy with waffles, toast, cereal, and an egg meal. Location was great. We could walk to shops and restaurants as well as transportation. Pike Market was a reasonable walk. We enjoyed the nearby Gold Rush Museum. Very, very happy with our stay. Staff was helpful and knowledgeable.',
 'Da

In [26]:
print("The first review's author is", first_review['Author'])

The first review's author is gowharr32


In [27]:
print(first_review['Author'],"made the following comments:",'\n')
print(first_review['Content'])

gowharr32 made the following comments: 

We enjoyed the Best Western Pioneer Square. My husband and I had a room with a king bed and it was clean, quiet, and attractive. Our sons were in a room with twin beds. Their room was in the corner on the main street and they said it was a little noisier and the neon light shone in. But later hotels on the trip made them appreciate this one more. We loved the old wood center staircase. Breakfast was included and everyone was happy with waffles, toast, cereal, and an egg meal. Location was great. We could walk to shops and restaurants as well as transportation. Pike Market was a reasonable walk. We enjoyed the nearby Gold Rush Museum. Very, very happy with our stay. Staff was helpful and knowledgeable.


### Creating a List

We want to create a list with just the comments (strings). We do this by iterating over the list of reviews...

In [28]:
comment_lst = []  # will contain the review strings
for review in reviews:
    comment_lst.append(review['Content'])

In [29]:
len(comment_lst) # contains 233 comments--one for each reviewer

233
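The same list can also be built with a list comprehension, which is a common Python idiom for "loop and append". A sketch on a small made-up list (`sample_reviews` is illustrative, standing in for `reviews`):

```python
sample_reviews = [
    {"Author": "reviewer_a", "Content": "Great location."},
    {"Author": "reviewer_b", "Content": "Small bathroom."},
]

# Explicit loop, as in the cell above.
loop_comments = []
for item in sample_reviews:
    loop_comments.append(item["Content"])

# Equivalent list comprehension.
comprehension_comments = [item["Content"] for item in sample_reviews]

print(loop_comments == comprehension_comments)  # True
```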

In [30]:
first_comment=comment_lst[0]
print("The first comment in the comment list is:",'\n')
print(first_comment)

The first comment in the comment list is: 

We enjoyed the Best Western Pioneer Square. My husband and I had a room with a king bed and it was clean, quiet, and attractive. Our sons were in a room with twin beds. Their room was in the corner on the main street and they said it was a little noisier and the neon light shone in. But later hotels on the trip made them appreciate this one more. We loved the old wood center staircase. Breakfast was included and everyone was happy with waffles, toast, cereal, and an egg meal. Location was great. We could walk to shops and restaurants as well as transportation. Pike Market was a reasonable walk. We enjoyed the nearby Gold Rush Museum. Very, very happy with our stay. Staff was helpful and knowledgeable.


### String Methods

We want to iterate over the list of comments and obtain information about the comments made by the reviewers. Since each of the comments is a String object, we are going to need some String methods to extract the information. See, for example, __[String Methods](https://www.w3schools.com/python/python_ref_string.asp)__. Let us illustrate some of the listed methods with the comments from the first reviewer.

In [31]:
# Create a new string with all characters made lower case..
first_comment.lower()

'we enjoyed the best western pioneer square. my husband and i had a room with a king bed and it was clean, quiet, and attractive. our sons were in a room with twin beds. their room was in the corner on the main street and they said it was a little noisier and the neon light shone in. but later hotels on the trip made them appreciate this one more. we loved the old wood center staircase. breakfast was included and everyone was happy with waffles, toast, cereal, and an egg meal. location was great. we could walk to shops and restaurants as well as transportation. pike market was a reasonable walk. we enjoyed the nearby gold rush museum. very, very happy with our stay. staff was helpful and knowledgeable.'

In [32]:
# Count how many times the substring "we" occurs in the comment (case-sensitive).
first_comment.count("we")

2

In [33]:
# If we wanted a "case-insensitive" search of instances of "we", we can do this...
first_comment.lower().count("we")  # include "We" as well

7
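Keep in mind that `count` matches substrings, not whole words, so it also finds "we" inside longer words such as "were" or "well". If whole words are wanted, one simple option is to split the text first and count in the resulting list. A sketch on a made-up sentence (`text` is illustrative):

```python
text = "We were in the west wing and we liked it."

# Substring matches: "we", "were", "west", "we".
substring_hits = text.lower().count("we")

# Whole-word matches: split on whitespace, then count list items equal to "we".
word_hits = text.lower().split().count("we")

print(substring_hits, word_hits)  # 4 2
```

Note that splitting on whitespace leaves punctuation attached (for example `"it."`), so this is only a rough word count.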

<div class="alert alert-block alert-success"><b>Problem 2 (4 pts.)</b>: Complete the loop below to display the number of times "bathroom" is contained within the text of the reviews for this hotel. Your output should look like this:</div>

`The word 'bathroom' occurs 29 times in the reviews for this hotel.`

In [34]:
counter = 0
for review in comment_lst:
    # To Do: Insert code in the for loop body to determine the number of times "bathroom" appears 
    # in the current review and increment the counter variable accordingly.
    bathroom_count = review.lower().count("bathroom")
    counter = counter + bathroom_count
        
# Print using an f-string
print(f"The word 'bathroom' occurs {counter} times in the reviews for this hotel.")

The word 'bathroom' occurs 29 times in the reviews for this hotel.


<div class="alert alert-block alert-success"><b>Problem 3 (5 pts.)</b>: Print the number of "wordy" comments. A comment is considered "wordy" if it contains more than 100 words. For example, "We stayed here and we liked it" contains 7 words. Your output should look like this:</div>

`There are 128 wordy comments.`

In [35]:
counter = 0
for review in comment_lst:
    # TO DO: Insert code in the for loop body to get the number of "wordy" comments.
    word_count = len(review.split())
    if word_count > 100:
        counter = counter + 1

# Print the number of wordy comments.
print(f'There are {counter} wordy comments.')

There are 128 wordy comments.
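The counting loop can also be written as a single `sum` over a generator expression. This is only an alternative sketch on made-up comments (`sample_comments` and `word_limit` are illustrative names), not a required solution:

```python
sample_comments = [
    "We stayed here and we liked it",   # 7 words
    "word " * 150,                      # 150 words
    "Too short",                        # 2 words
]
word_limit = 100

# Add 1 for every comment whose word count exceeds the limit.
wordy_total = sum(1 for comment in sample_comments if len(comment.split()) > word_limit)

print(f"There are {wordy_total} wordy comments.")  # There are 1 wordy comments.
```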


We want to iterate over the `reviews` list again, this time saving the name of each reviewer together with their comments (in a dictionary).

<div class="alert alert-block alert-success"><b>Problem 4 (5 pts.)</b>: Create a list of dictionaries (<b><i>ar_lst</i></b>), where each dictionary has two keys: "<b><i>Author</i></b>" and "<b><i>Comments</i></b>". Do this by iterating over the list of reviews: for each review, construct a dictionary (<b><i>ar_dict</i></b>) containing the author's name and comments, and then append it to the list of dictionaries we are creating.</div>

In [36]:
ar_lst = []
for review in reviews:
    # TO DO: (1) Create a dictionary, ar_dict, containing just two keys, "Author" and "Comments". 
    #            Their values should be obtained from the current review dictionary stored in the "review" variable.
    #        (2) Append this newly constructed dictionary, ar_dict, to the ar_lst list.
    ar_dict= {"Author":review.get('Author'),"Comments":review.get('Content')}
    ar_lst.append(ar_dict)

# Let us check that we have 233 elements in the ar_lst list.
print(f'There are {len(ar_lst)} elements in the list.')

There are 233 elements in the list.


In [37]:
# Let us display the data from the first dictionary in the list.

first_review = ar_lst[0]
print(f"{first_review['Author']}, said this",'\n')
#print(first_review['Author'],"said this:",'\n')
print(first_review['Comments'])

gowharr32, said this 

We enjoyed the Best Western Pioneer Square. My husband and I had a room with a king bed and it was clean, quiet, and attractive. Our sons were in a room with twin beds. Their room was in the corner on the main street and they said it was a little noisier and the neon light shone in. But later hotels on the trip made them appreciate this one more. We loved the old wood center staircase. Breakfast was included and everyone was happy with waffles, toast, cereal, and an egg meal. Location was great. We could walk to shops and restaurants as well as transportation. Pike Market was a reasonable walk. We enjoyed the nearby Gold Rush Museum. Very, very happy with our stay. Staff was helpful and knowledgeable.


### The Counter Class

In the following exercise we want to count the number of *different*, i.e. *unique*, words in each of the comments. We previously learned how to split a string to create a list of words. We can write code from scratch to count the number of different words in the list. Alternatively, we can convert the list to another container data type that makes it easier to obtain this information. The `collections` module defines the `Counter` class. A `Counter` is basically a "special type" of dictionary whose keys are the items being counted and whose values are their counts. Given a list object `my_list` we can turn it into a counter object as follows: `Counter(my_list)`. This assumes we have already imported the class from the `collections` module: `from collections import Counter`. See __[Counter Module](http://rahmonov.me/posts/python-collections-counter/)__. 

For example,

```python
from collections import Counter
my_list = ['a', 'b', 'c', 'c', 'a', 'd', 'b', 'e', 'a']
Counter(my_list)
```
creates the Counter object:

```python
Counter({'a': 3, 'b': 2, 'c': 2, 'd': 1, 'e': 1})
```

This tells you that the letter `a` appears `3` times in the list, etc.

We can get the keys and values just like with dictionaries:

```python
list(Counter(my_list).keys())
```

returns

```python
['a', 'b', 'c', 'd', 'e']
```

and

```python
list(Counter(my_list).values())
```

returns

```python
[3, 2, 2, 1, 1]
```
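Since a `Counter` has one key per distinct item, its length is the number of unique items. The sketch below also shows `most_common` and the fact that a missing item has a count of zero rather than raising a `KeyError`; the variable names are illustrative.

```python
from collections import Counter

letters = ['a', 'b', 'c', 'c', 'a', 'd', 'b', 'e', 'a']
letter_counts = Counter(letters)

unique_total = len(letter_counts)        # number of distinct items
top_two = letter_counts.most_common(2)   # ties keep first-encountered order
missing_count = letter_counts['z']       # items never seen count as 0

print(unique_total)    # 5
print(top_two)         # [('a', 3), ('b', 2)]
print(missing_count)   # 0
```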

<div class="alert alert-block alert-success"><b>Problem 5 (4 pts.)</b>: Iterate over <b><i>ar_lst</i></b> and print the name of each reviewer (author) and the total number of <i>different</i>, i.e. <i>unique</i>, words in their review. For example, "We stayed here and we liked it" contains 6 <i>unique</i> words (ignoring case) since 'we' is repeated.</div>

In [40]:
# Approach 3 (described below) is used; results differ slightly from the given output.
from collections import Counter

# Characters treated as punctuation and removed before counting.
punctuation = '.,;:@#$?><%\"/&^!'
for review in ar_lst:
    # TO DO: (1) Get the number of words in the current review variable.
    author = review['Author']
    comment = review['Comments'].lower()
    # Drop punctuation characters, then split on whitespace to get the words.
    comment_no_punct = "".join(char for char in comment if char not in punctuation)
    word_counts = Counter(comment_no_punct.split())
    unique_count = len(word_counts)
    #        (2) Print the author's name and the number of (unique) words in the review.
    #            Below are three approaches listed in increasing "sophistication".
    print(f'{author} used {unique_count} unique words.')

gowharr32 used 87 unique words.
Nancy W used 85 unique words.
Janet H used 43 unique words.
TimothyFlorida used 69 unique words.
KarenArmstrong_BC used 105 unique words.
Shane33333 used 51 unique words.
Bnkruzn used 15 unique words.
Teacherbear used 54 unique words.
CandyGnomad used 49 unique words.
idahosandy used 16 unique words.
CW2S used 143 unique words.
jimmy62_11 used 78 unique words.
BoulderIllini used 128 unique words.
funlovingdad used 103 unique words.
suntraveler222 used 207 unique words.
rosariodurao used 39 unique words.
Jody R used 86 unique words.
Tasha M used 25 unique words.
Roy C used 87 unique words.
MikeGB2 used 107 unique words.
Jennie S used 54 unique words.
mcdonothing used 80 unique words.
SWanjiru used 104 unique words.
trish0 used 50 unique words.
txlnstr used 25 unique words.
mydogisfat used 46 unique words.
DaddyHoward used 94 unique words.
BCisBeautiful used 119 unique words.
wildorchid416 used 174 unique words.
JCPCG used 81 unique words.
quilter1975 used

##### Three possible approaches...

1) *Splitting on the spaces between words and then counting uniques.*

2) *Converting to lower case, splitting on spaces and then counting uniques.*

3) *Removing punctuation, converting to lower case, splitting on spaces and counting uniques.* 

**Note**: The symbols that were considered "punctuation" and removed to generate the output were:

`. , ; : @ # $ ? > < % \ "/ & ^ !`
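To make the difference between the three approaches concrete, here is a sketch that applies each one to a made-up sentence. The `sample` text is illustrative, and the `punctuation` string is an assumption based on the symbols listed above.

```python
from collections import Counter

sample = "We stayed here, and we liked it. We liked it!"
# Assumed punctuation set, taken from the symbols listed in the note above.
punctuation = '.,;:@#$?><%\\"/&^!'

# Approach 1: split on whitespace only ("We" and "we" count as different words).
approach_1 = len(Counter(sample.split()))

# Approach 2: lower-case first, then split ("it." and "it!" still differ).
approach_2 = len(Counter(sample.lower().split()))

# Approach 3: lower-case, strip punctuation, then split.
cleaned = "".join(ch for ch in sample.lower() if ch not in punctuation)
approach_3 = len(Counter(cleaned.split()))

print(approach_1, approach_2, approach_3)  # 8 7 6
```

Each step merges tokens that the previous approach treated as distinct, so the unique-word count can only stay the same or go down.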