## Review Requests Notebook

The code below is a series of experiments against the Review Board (RB) REST API.

The API is documented [here](https://www.reviewboard.org/docs/manual/2.5/webapi/2.0/resources/review-request-list/).

### Imports

In [4]:
from __future__ import print_function
          
import json
import requests
import sys

# Hack to import our code from the Notebook
sys.path.insert(0, '..')

from datastore.reviews_db import Reviews
from rb_etl.fetcher import ReviewFetcher

### Fetching the list of RRs from a given date

Dates must be in `yyyy-mm-dd` format to comply with the API.

The `max-results` argument sets the chunk size of the response (by default, at most 25 results are returned), while the returned `total_results` field gives the total number of matches for the query.

The `links` field of the response also contains the URL for the `next` chunk of results: we don't use it here, but it is useful in the actual code to paginate requests.
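As a rough illustration of that pagination, the sketch below follows `links.next.href` until it runs out. It is only a sketch: `iter_review_requests` and the `get_json` callable are hypothetical names introduced here, and it assumes the last page simply omits the `next` link.

```python
def iter_review_requests(first_url, get_json):
    """Yield every review request, following `links.next.href` page by page.

    `get_json` is a hypothetical callable mapping a URL to the decoded JSON
    response (a dict); with `requests` it could be
    `lambda url: requests.get(url).json()`.
    """
    url = first_url
    while url:
        page = get_json(url)
        for rr in page.get('review_requests', []):
            yield rr
        # Assumption: the last page has no `next` link, which ends the loop.
        url = page.get('links', {}).get('next', {}).get('href')
```

Passing the fetch function in as an argument keeps the loop independent of the HTTP library, so it can be exercised with canned responses.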

#### Building the code

The code in the following section is how we arrived at the `ReviewFetcher` class in the `rb_etl.fetcher` module; an example usage of the actual code is further below.

In [5]:
ARGS = {
    'last-updated-from': '2015-06-23',
    'to-groups': 'mesos',
    'status': 'pending',
    'max-results': 20,
    'start': 0
}

RB_URL = 'http://reviews.apache.org/api/review-requests/?'

def build_url(url, args):
    """ Builds up the URL for the RB request
    
        @param url: the base URL, ending in `?`
        @type url: str
        @param args: the arguments for the request
        @type args: dict
    """
    base = url
    for arg_name, value in args.iteritems():
        base = "{base}&{arg}={val}".format(base=base, arg=arg_name, val=value)
    return base

rb = build_url(RB_URL, ARGS)
print("Retrieving RRs from {}".format(rb))
reviews = requests.get(rb)

def parse_results(results):
    rrs = results.get('review_requests')
    return rrs

if reviews.status_code == 200:
    result = reviews.json()
    for key in result:
        print(key)
    rrs = parse_results(result)
    count = result.get('total_results', 0)

    print("Found {} pending Review Requests ({} in response)".format(count, len(rrs)))
    print("The next chunk can be retrieved from: {}\n".format(
        result.get('links').get('next', {}).get('href')))
    

Retrieving RRs from http://reviews.apache.org/api/review-requests/?&status=pending&max-results=20&last-updated-from=2015-06-23&to-groups=mesos&start=0
total_results
stat
review_requests
links
Found 74 pending Review Requests (20 in response)
The next chunk can be retrieved from: https://reviews.apache.org/api/review-requests/?start=20&max-results=20&status=pending&last-updated-from=2015-06-23&to-groups=mesos
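Note that the hand-rolled `build_url` leaves a stray `&` right after the `?` and does not escape values. One possible alternative, shown only as a sketch, is to let the standard library's `urlencode` build the query string; `build_query_url` is a hypothetical name, not part of `rb_etl`.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_query_url(base_url, args):
    """Return `base_url` with `args` appended as an encoded query string."""
    # Sorting the items keeps the generated URL stable across runs.
    query = urlencode(sorted(args.items()))
    # Tolerate a base URL that already ends in `?`, like RB_URL does.
    return "{}?{}".format(base_url.rstrip('?'), query)
```

For example, `build_query_url(RB_URL, ARGS)` would produce the same parameters as before, in alphabetical order and without the leading `&`.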





### Saving to MongoDB

This is trivial: every returned RR is a `dict`, so it can be saved directly to MongoDB using the connector implemented in the base class `datastore.base_db.MongoDB`, via the `Reviews` implementation.

In [6]:
mongo_conf = {
    'db.host': 'localhost',
    'db.port': 27017,
    'db.name': 'mesos-reviews'
    # Authentication is not used here, we would otherwise set:
    # 'db.user': 'foobar',
    # 'db.passwd': 'zekret'
}

# save just a few RRs:
revs_db = Reviews(mongo_conf)

# clean up first
revs_db.drop()

# slicing avoids an IndexError if fewer than 5 RRs were returned
for rr in rrs[:5]:
    revs_db.save(rr)



At this point, the first 5 RRs have been saved to the MongoDB instance running on the local machine; this can be verified from the Mongo shell:
```
test> show dbs
...
mesos-reviews  0.078GB
...
test> use mesos-reviews
mesos-reviews> show collections
reviews
system.indexes

mesos-reviews> db.reviews.find().pretty()
```

We drop the `links` field, as it is too verbose, and just extract the `links.submitter.href` field into a `submitter` field.
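The transformation just described could look roughly like the sketch below. `simplify_review` is a hypothetical helper written for illustration; the actual code in `datastore` may do this differently.

```python
def simplify_review(rr):
    """Return a copy of `rr` without `links`, keeping only the submitter URL."""
    simplified = dict(rr)  # shallow copy, so the caller's dict is untouched
    links = simplified.pop('links', None) or {}
    # `submitter` is None when the response carried no submitter link.
    simplified['submitter'] = links.get('submitter', {}).get('href')
    return simplified
```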

We currently use the following fields as indexes:
```
    coll.ensure_index('last_updated')
    coll.ensure_index('time_added')
    coll.ensure_index('submitter')
```
Others can easily be added.
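One caveat if this code is run against a newer driver: `ensure_index` was deprecated in PyMongo 3.0 and removed in 4.0, with `create_index` as its replacement. A minimal sketch of the equivalent calls follows; `create_indexes` and `INDEXED_FIELDS` are hypothetical names, and `coll` is assumed to be a PyMongo `Collection`.

```python
INDEXED_FIELDS = ('last_updated', 'time_added', 'submitter')


def create_indexes(coll, fields=INDEXED_FIELDS):
    """Create a single-field ascending index on `coll` for each name in `fields`.

    Returns the list of index names reported by `create_index`.
    """
    # Re-creating an index that already exists with the same spec is harmless.
    return [coll.create_index(field) for field in fields]
```

Adding another index is then a matter of appending a field name to the tuple.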

### Retrieving RRs from the Database

This is fairly straightforward: just use the `get(id)` method of the `Reviews` class.
Alternatively, first retrieve **all** the IDs, then pick the ones of interest.

In [7]:
# First get ALL the ids in the collection
all_ids = revs_db.get_all_ids()

assert(len(all_ids) == 5)
print("There are {} RRs in the DB".format(len(all_ids)))

rid = all_ids[2]
review = revs_db.get(rid)

print("{}\n".format(review.pop('description')))
for k in review.keys():
    print("{:20}  {:<60}".format(k, review[k]))

There are 5 RRs in the DB
Moved filesystem/linux from review https://reviews.apache.org/r/34135/

status                pending                                                     
commit_id             None                                                        
last_updated          2015-07-12T05:56:05Z                                        
url                   /r/36429/                                                   
reviewers             [u'jieyu', u'tnachen', u'vinodkone']                        
absolute_url          https://reviews.apache.org/r/36429/                         
issue_resolved_count  0                                                           
bugs_closed           []                                                          
issue_open_count      0                                                           
issue_dropped_count   0                                                           
dependencies          {u'depends_on': [34135], u'blocked_by': []}       

## ETL Process

### Extracting the data from RB

The `ReviewFetcher` puts all of the above together: it retrieves the RRs via the API in a loop until **all** results have been fetched, transforms the JSON objects by retaining only the fields of interest, and then loads the data into the DB.

*Note: Make sure to edit this notebook and update the date in `mongo_conf['since']`, in order to limit the amount of data fetched from ReviewBoard*

In [8]:
mongo_conf['since'] = '2015-06-15'
ff = ReviewFetcher(mongo_conf)

all_rrs = ff.fetch_all()
ff.store_data()
print("We retrieved and saved {} Reviews".format(len(all_rrs)))

# Let's verify it all went to plan:

assert(len(all_rrs) == len(revs_db.get_all_ids()))



We retrieved and saved 83 Reviews
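The extract, transform, and load steps described for `ReviewFetcher` can be sketched in a few lines. This is not the actual implementation: `run_etl`, `save`, and `keep_fields` are hypothetical names, and the real class may select fields differently.

```python
def run_etl(raw_reviews, save, keep_fields):
    """Transform each raw review request and pass it to `save`.

    `raw_reviews` is any iterable of dicts (extract), `keep_fields` lists the
    keys to retain (transform), and `save` stores one document (load).
    Returns the number of documents saved.
    """
    count = 0
    for raw in raw_reviews:
        # Keep only the fields of interest; missing keys are skipped.
        doc = {key: raw[key] for key in keep_fields if key in raw}
        save(doc)
        count += 1
    return count
```

With the objects from this notebook, a call might look like `run_etl(rrs, revs_db.save, ['status', 'last_updated'])`.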


In [10]:
import pprint

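# json.load reads straight from the file object; the distinct name also
# avoids shadowing the `review` dict retrieved earlier
with open('../tests/data/review.json') as review_file:
    rv_d = json.load(review_file)

pprint.pprint(rv_d, indent=4, width=40)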

{   u'absolute_url': u'https://reviews.apache.org/r/36037/',
    u'approval_failure': u'The review request has open issues.',
    u'approved': False,
    u'blocks': [   {   u'href': u'https://reviews.apache.org/api/review-requests/36099/',
                       u'method': u'GET',
                       u'title': u'bogus'},
                   {   u'href': u'https://reviews.apache.org/api/review-requests/45568/',
                       u'method': u'GET',
                       u'title': u'bogus-2'}],
    u'branch': u'',
    u'bugs_closed': [u'MESOS-2860'],
    u'changenum': None,
    u'close_description': None,
    u'close_description_text_type': u'plain',
    u'commit_id': None,
    u'depends_on': [   {   u'href': u'https://reviews.apache.org/api/review-requests/36073/',
                           u'method': u'GET',
                           u'title': u'New MethodNotAllowed HTTP response type'}],
    u'description': u'Adding a call route with HTTP request header validations',
    u'de