## Question 5
I set up an SSH tunnel to work with the sample_airbnb database in Python.

#### Server setup

In [10]:
from sshtunnel import SSHTunnelForwarder
from pymongo import MongoClient
from random import sample
from pprint import pprint
from getpass import getpass

In [11]:
MONGO_HOST = "10.10.11.10"
MONGO_DB = "sample_airbnb"
MONGO_USER = "mtweed"
MONGO_PASS = getpass("Enter your password: ")

Enter your password: ········


In [12]:
server = SSHTunnelForwarder(
    MONGO_HOST,
    ssh_username=MONGO_USER,
    ssh_password=MONGO_PASS,
    remote_bind_address=('127.0.0.1', 27017)
)

In [13]:
server.start()

In [14]:
client = MongoClient('127.0.0.1', server.local_bind_port)

In [15]:
db = client[MONGO_DB]

In [16]:
listings = db.listingsAndReviews

### Method One
I initially wrote the following JavaScript to perform this query solely within MongoDB. I reviewed several highly rated listings and selected several keywords from their comments to use in my query.

```
>  var listings = db.listingsAndReviews
>  
>  var cursor = listings.find({$and: [
...                                 {$or:[ {"reviews.comments": {$regex : /easy/i}},
...                                 {"reviews.comments": {$regex : /best/i}},
...                                 {"reviews.comments": {$regex : /great/i}},
...                                 {"reviews.comments": {$regex : /excellent/i}},
...                                 {"reviews.comments": {$regex : /helpful/i}},
...                                 {"reviews.comments": {$regex : /amazing/i}},
...                                 {"reviews.comments": {$regex : /pleasent/i}},
...                                 {"reviews.comments": {$regex : /perfect/i}}]},                 
...                                 {"review_scores.review_scores_rating": {$exists:true}}]}, 
...                                 {_id:1, "review_scores.review_scores_rating":1})
> var total = 0
> var count = 0
> 
> for (i=0;i<cursor.count();i++){
...         var temp = cursor.next()
...         total = total + temp.review_scores.review_scores_rating
...         count = count + 1
...         }
3403
> let average_rating = total/count
> 
> print(average_rating)
93.84014105201292
> 
```
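As a rough sketch, the filter and the averaging loop above could probably be expressed as a single aggregation pipeline so that the server computes the average. The helper name `build_rating_pipeline` and the `KEYWORDS` list are my own illustration, not part of the original query. I also spell "pleasant" correctly here, so the result would not necessarily equal the number printed above. The block only builds the pipeline and does not contact the database.

```python
import re

# Keywords from the shell query above, with "pleasant" spelled correctly
# (an assumption on my part; the original query used "pleasent").
KEYWORDS = ["easy", "best", "great", "excellent",
            "helpful", "amazing", "pleasant", "perfect"]


def build_rating_pipeline(keywords):
    """Build a pipeline that averages the rating of listings whose review
    comments contain any keyword (case-insensitive substring match)."""
    # One alternation regex stands in for the separate $or clauses.
    pattern = "|".join(re.escape(word) for word in keywords)
    return [
        {"$match": {
            "reviews.comments": {"$regex": pattern, "$options": "i"},
            "review_scores.review_scores_rating": {"$exists": True},
        }},
        {"$group": {
            "_id": None,
            "average_rating": {"$avg": "$review_scores.review_scores_rating"},
            "count": {"$sum": 1},
        }},
    ]


pipeline = build_rating_pipeline(KEYWORDS)
# With a live connection this would be run as (not executed here):
#     result = next(listings.aggregate(pipeline))
```

Grouping on `_id: None` collapses all matching listings into one result document, so only the average and the count cross the tunnel instead of every matching listing.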

### Method Two
Then I ran a similar query from Python to try to optimize it. I first cached the id, review_scores_rating, and comments for every listing in memory. This worked better for me because many small queries took longer than one large query, although the same analysis could also have been done with many separate queries to the database.

In [17]:
cursor = listings.find({}, {"_id" : 1, "review_scores.review_scores_rating" : 1, "reviews.comments":1})

The following code creates dictionary objects that allow quick reference of rating and comments by using the id. This step takes a while.

In [18]:
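# Cursor.count() was removed in PyMongo 4; count_documents() is the supported replacement.
length = listings.count_documents({})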

In [19]:
scores = {}   # listing id -> overall rating
reviews = {}  # listing id -> list of review comment strings
for _ in range(length):
    doc = next(cursor)
    try:
        scores[doc["_id"]] = doc["review_scores"]["review_scores_rating"]
    except KeyError:
        # Listings without a rating get 0, which pulls down any average they are included in.
        scores[doc["_id"]] = 0
    try:
        reviews[doc["_id"]] = [review["comments"] for review in doc["reviews"]]
    except KeyError:
        reviews[doc["_id"]] = []

The following is the set of keywords that I determined would best correlate to highly rated listings.

In [20]:
all_pos_words = {"easy", "super", "amazing", "bright", "quiet", 
             "clean", "great", "nice", "enjoy", "enjoyed", 
             "friendly", "convenient", "flexible", "good",
             "fantastic", "best", "perfect", "helpful",
             "pleasent", "pleasing", "calm", "fun", "excellent",
             "prime", "spacious"}

The following code lets the user select a random sample of the positive words in order to determine the effect they have on the average rating.

In [21]:
num_words = int(input("Enter the number of positive words to check against ({} total):".format(len(all_pos_words))))
if num_words < len(all_pos_words):
    # random.sample() needs a sequence; passing a set raises TypeError on Python 3.11+.
    pos_words = set(sample(sorted(all_pos_words), num_words))
else:
    pos_words = all_pos_words.copy()

Enter the number of positive words to check against (25 total):25


In [22]:
pos_ids = []
for _id in scores:
    for comment in reviews[_id]:
        # Case-sensitive match on whitespace-separated tokens. The id is appended
        # once per matching comment, so a listing can appear several times.
        if pos_words.intersection(comment.split()):
            pos_ids.append(_id)

pos_ratings = [scores[_id] for _id in pos_ids]
ave_rating = sum(pos_ratings)/len(pos_ratings)

In [23]:
ave_rating

94.2707122290433

In [24]:
len(pos_ids)

92035
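The matching step above compares whitespace-separated tokens exactly, so "Great" or "clean!" would not match "great" or "clean", and a listing is counted once for every matching comment. As a hedged sketch of an alternative, the block below lowercases and strips punctuation before matching and counts each listing at most once. The function names and the toy data are invented for illustration and are not taken from sample_airbnb.

```python
import re

# Toy data, made up for this example only.
toy_scores = {"a": 98, "b": 80, "c": 90}
toy_reviews = {
    "a": ["Great place, super clean!", "Great host."],
    "b": ["It was fine."],
    "c": ["Very QUIET street."],
}


def tokenize(comment):
    """Lowercase a comment and return its set of words without punctuation."""
    return set(re.findall(r"[a-z']+", comment.lower()))


def matching_listings(pos_words, reviews):
    """Return ids of listings with at least one comment containing a
    positive word. Each listing appears at most once."""
    return [
        _id for _id, comments in reviews.items()
        if any(pos_words & tokenize(comment) for comment in comments)
    ]


matched = matching_listings({"great", "quiet"}, toy_reviews)
average = sum(toy_scores[_id] for _id in matched) / len(matched)
print(matched, average)  # ['a', 'c'] 94.0
```

With the exact-token approach none of these toy comments would match, because of the capital letters. Counting per comment rather than per listing would also include listing "a" twice, weighting the average toward listings with many positive reviews.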

### Summary
I found that searching for fewer keywords could lead to a higher average rating, but it also returned fewer results. Therefore, I wrote the following function to see how strongly each keyword correlates with a high average rating by searching for each one individually.

In [25]:
def ave_rat(pos_words, reviews, scores):
    """Return (average rating, match count) for comments containing any word in pos_words."""
    # A listing is included once per matching comment, so the count is the number
    # of matching comments and the average is weighted toward listings with many matches.
    pos_ratings = [
        scores[_id]
        for _id in scores
        for comment in reviews[_id]
        if pos_words.intersection(comment.split())
    ]
    if not pos_ratings:
        # No comment matched (for example, a misspelled keyword); avoid dividing by zero.
        return None, 0
    return sum(pos_ratings)/len(pos_ratings), len(pos_ratings)

This showed that every keyword chosen corresponded to a high average rating (>93). Since I consider an average score greater than 93 to be quite good, I decided that including every one of these search terms was optimal.

In [26]:
pos_list = list(all_pos_words)
for x in pos_list:
    pos_word = {x}
    ar, length = ave_rat(pos_word, reviews, scores)
    print("The word",pos_word,"was found in",length,"documents with an average rating of",ar)

The word {'perfect'} was found in 11641 documents with an average rating of 94.807662571944
The word {'amazing'} was found in 6222 documents with an average rating of 95.32642237222758
The word {'easy'} was found in 11720 documents with an average rating of 94.28327645051195
The word {'great'} was found in 35234 documents with an average rating of 94.42864846455129
The word {'pleasing'} was found in 24 documents with an average rating of 95.45833333333333
The word {'fun'} was found in 1144 documents with an average rating of 94.45716783216783
The word {'enjoyed'} was found in 7943 documents with an average rating of 94.71635402240967
The word {'spacious'} was found in 4211 documents with an average rating of 94.63927808121586
The word {'friendly'} was found in 6885 documents with an average rating of 94.39782135076253
The word {'helpful'} was found in 8423 documents with an average rating of 94.3153270806126
The word {'excellent'} was found in 4286 documents with an average 

In [28]:
server.stop()

In [29]:
server.is_active

False