
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|


In [35]:
from pymongo import MongoClient
from datetime import datetime
import json

You have to change this variable each time the EC2 server stops or restarts. Please email/text me to get the new IP address.

In [None]:
ip = '54.210.15.250'

Create the connection to the MongoDB server. The first argument is the IP we've supplied above and the second is the port (TCP) through which we'll be talking to the EC2 server and the MongoDB instance running inside it.

In [10]:
conn = MongoClient(ip, 27017)

Take a look at the databases available in our MongoDB instance.

In [12]:
conn.database_names()

[u'local', u'cleaned_data']

In [13]:
db = conn.get_database('cleaned_data')

Print the collection names

In [14]:
db.collection_names()

[u'academic_biz', u'academic_reviews', u'dc_reviews', u'system.indexes']

Let's grab a subset of reviews from the academic reviews collection. Suppose we want up to 50,000 reviews, all from 2010 onward, from each state in our dataset.

In [15]:
collection = db.get_collection('academic_reviews')

In [16]:
# I cheated and just had a list of all the states.
# You should try to find a unique list of all the states from MongoDB as an exercise.
states = [u'OH', u'NC', u'WI', u'IL', u'AZ', u'NV']
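One possible approach to the exercise, as a minimal sketch: PyMongo collections expose a `distinct` method that returns the unique values of a field. The helper name `distinct_states` is made up for illustration, and the result on the real collection may contain more states than the six listed above.

```python
def distinct_states(collection):
    """Return the sorted unique values of the 'state' field in a collection."""
    # distinct() asks the server for the unique values of a single field,
    # so no full review documents are transferred to the client.
    return sorted(collection.distinct('state'))

# Usage against the notebook's collection (requires the live connection):
# states = distinct_states(collection)
```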

First, I'm going to take a look at what one of the reviews looks like. I totally could have done something wrong earlier, in which case the output would be pure garbage. This is a good sanity check to make.

In [17]:
collection.find()[0]

{u'_id': ObjectId('58e2e9d4decef619d1cfdff0'),
 u'business_id': u'4P-vTvE6cncJyUyLh73pxw',
 u'cool': 0,
 u'date': u'2014-08-14',
 u'funny': 0,
 u'review_id': u'tRd0-mPa9O1TMJp_dw5khQ',
 u'stars': 4,
 u'state': u'OH',
 u'text': u'Got my mojo back after having a few of their appetite teasers. Love LPW for a no-frills bite to eat.',
 u'type': u'review',
 u'useful': 0,
 u'user_id': u'kXUySHSlRgVrcR4Aa0HtGQ'}

Sweet, this is pretty much what we were expecting. Let's pull out the date field from this entry. We're going to filter on this in a second. Depending on its type, we're going to need to develop different strategies in constructing the logical statements that filter for the date.

In [18]:
print collection.find()[0]['date']
print type(collection.find()[0]['date'])

2014-08-14
<type 'unicode'>


Dang, it's unicode. That's Python 2's text string type, which means the date is stored as plain text rather than as a date. Let's try converting it to a more usable Python type (datetime). We care about whether one date comes before or after another, and a real date type makes that comparison explicit and raises an error on malformed values, so we transform the string into a datetime.

In [19]:
string_year = collection.find()[0]['date'][0:4]
year = datetime.strptime(string_year, '%Y')
year

datetime.datetime(2014, 1, 1, 0, 0)

Note that the datetime above is January 1st, 2014. We only gave it a year, so the month and day default to the first day of that year. That's fine: we only want reviews from 2010 onward, so we define the threshold as January 1st, 2010.

In [20]:
threshold_year = datetime.strptime('2010', '%Y')
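To make the comparison concrete, here is a small sketch that can be run without the database. The helper name `review_year` is made up for illustration, and the first two sample dates are invented; only `2014-08-14` comes from the review shown earlier.

```python
from datetime import datetime

def review_year(date_string):
    """Parse the year of a 'YYYY-MM-DD' string into a datetime (January 1st)."""
    return datetime.strptime(date_string[0:4], '%Y')

threshold_year = datetime.strptime('2010', '%Y')

# Hypothetical dates on either side of the threshold, plus the real sample.
for date_string in ['2009-12-31', '2010-01-01', '2014-08-14']:
    keep = review_year(date_string) >= threshold_year
    print('%s -> keep: %s' % (date_string, keep))
```

Because only the year is parsed, every review from 2010 compares equal to the threshold and is kept by the `>=` test.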

Running the code below is going to take a little while, but it's essentially doing the following:

        For each review in the reviews collection:  
            If the review comes from one of our states:  
                If we already have enough reviews for that state, skip it.  
                Check to see if the review was written in 2010 or later:  
                  If it was, append it to the overall reviews dictionary. 
                  If it wasn't, proceed to the next review.

In [51]:
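reviews_dict = {}
num_reviews = 50000  # maximum number of reviews to keep per state

for obj in collection.find():
    state = obj['state']
    # Skip states we are not interested in
    if state not in states:
        continue
    # Skip states that have already reached the cap
    if len(reviews_dict.get(state, [])) >= num_reviews:
        continue
    # Keep only reviews from the threshold year onward
    if datetime.strptime(obj['date'][0:4], '%Y') >= threshold_year:
        # ObjectId is not JSON serializable, so drop it before saving
        del obj['_id']
        reviews_dict.setdefault(state, []).append(obj)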
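The loop above pulls every document across the network and filters on the client. As an alternative sketch (not what this notebook ran), the same conditions can be expressed as a MongoDB filter so the server does the filtering. This assumes every `date` value is a `YYYY-MM-DD` string like the sample review; the helper name `build_review_query` is made up for illustration.

```python
def build_review_query(states, start_year):
    """Build a MongoDB filter for reviews from the given states, start_year onward."""
    return {
        'state': {'$in': list(states)},
        # ISO 'YYYY-MM-DD' strings sort in date order, so a plain string
        # comparison works as long as every date uses that format.
        'date': {'$gte': '%04d-01-01' % start_year},
    }

query = build_review_query([u'OH', u'NC', u'WI', u'IL', u'AZ', u'NV'], 2010)

# Usage against the notebook's collection (requires the live connection);
# the second argument is a projection that leaves out the '_id' field:
# for obj in collection.find(query, {'_id': False}):
#     ...
```

The per-state cap would still have to be applied on the client, or by issuing one query per state with a limit.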
                

So the new dictionary we created is structured with each state being a key and each entry being a list of reviews. Let's take a look at what Ohio looks like:

In [31]:
reviews_dict['OH'][0:50]

[{u'_id': ObjectId('58e2e9d4decef619d1cfe0eb'),
  u'business_id': u'1DedueD53YsKcpqMWPIe9w',
  u'cool': 0,
  u'date': u'2013-12-30',
  u'funny': 0,
  u'review_id': u'3xGR24wD5ILntyX2UXZWTA',
  u'stars': 3,
  u'state': u'IL',
  u'text': u"We were passing town through and stopped based on the Yelp reviews. Mediocre suburban tex mex; not authentic Mexican which we prefer. Chips and salsa nothing special. Others in my group ordered burritos, which were ok. I ordered the Mexican salad and asked the waitress to add steak. It came as a nice mix of greens, covered in warm, melted Kraft cheddar cheese, no steak. The waitress that brought our food was different from the one who took our order, so seemed confused when I pointed out the steak was missing. However, she came back promptly with a big plate of steak to add to my salad and said it was ok, because my server had it down right but the kitchen missed it. The way she said it made me wonder what would have happened if the waitress had gotten

It's good practice to save whatever data you're using in a more permanent location if you plan on using it again. That way, we don't have to load up the EC2 server and wait for our local machines to run the above filtering process.

In [54]:
with open('cleaned_reviews_states_2010.json', 'w') as outfile:
    # json.dump writes to the file object; json.dumps only returns a string
    json.dump(reviews_dict, outfile)
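To reuse a saved file later without touching the server, it can be read back with `json.load`. The sketch below round-trips a tiny made-up dictionary through a temporary file; `sample_reviews` and the file name are stand-ins for illustration, not the real data.

```python
import json
import os
import tempfile

# Hypothetical stand-in for reviews_dict, small enough to check by eye.
sample_reviews = {'OH': [{'stars': 4, 'date': '2014-08-14', 'state': 'OH'}]}

# Write to a throwaway directory so nothing in the working folder is touched.
path = os.path.join(tempfile.mkdtemp(), 'sample_reviews.json')

with open(path, 'w') as outfile:
    json.dump(sample_reviews, outfile)

with open(path) as infile:
    loaded = json.load(infile)
```

After the round trip, `loaded` has the same keys and values as `sample_reviews`.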

__Congratulations!__ You just finished downloading and filtering data from a MongoDB instance hosted on EC2.