## Review Requests Notebook

The code below is a series of experiments against the Review Board (RB) REST API, described [here](https://www.reviewboard.org/docs/manual/2.5/webapi/2.0/resources/review-request-list/).

### Imports

In [1]:
import json
import requests
import sys

# Hack to import our code from the Notebook
sys.path.insert(0, '..')
from datastore.reviews_db import Reviews

### Fetching the list of RRs from a given date

Dates must be given in `yyyy-mm-dd` format to comply with the API.

The `max-results` argument sets the size of each chunk of the response (by default, at most 25 results are returned), while the `total_results` field of the response gives the total number of matches for the query.

The `links` field of the response also carries the URL of the `next` chunk of results: we don't use it here, but real code would follow it to paginate requests.
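
As an illustration of that pagination, here is a minimal sketch that keeps following `links.next.href` until the API stops returning one. It is written in Python 3 syntax (unlike the Python 2 cells in this notebook) and uses only the standard library; the names `fetch_json` and `fetch_all_review_requests` are hypothetical and not part of the notebook's code.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch `url` and decode the JSON body (standard library only)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


def fetch_all_review_requests(first_url, fetch=fetch_json):
    """Collect the `review_requests` of every page, following `next` links.

    `fetch` is any callable that maps a URL to the decoded JSON response;
    it is a parameter so the loop can be exercised without network access.
    """
    all_rrs = []
    url = first_url
    while url:
        page = fetch(url)
        all_rrs.extend(page.get('review_requests', []))
        # The last page carries no `next` link, so `url` becomes None
        # and the loop ends.
        url = page.get('links', {}).get('next', {}).get('href')
    return all_rrs
```

Calling `fetch_all_review_requests(rb)` with the URL built in the next cell would return the review requests from all chunks as one list.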

In [2]:
ARGS = {
    'last-updated-from': '2015-06-23',
    'to-groups': 'mesos',
    'status': 'pending',
    'max-results': 200,
    'start': 0
}

RB_URL = 'http://reviews.apache.org/api/review-requests/?'

def build_url(url, args):
    """ Builds up the URL for the RB request

        @param url: the base URL, expected to end with `?`
        @type url: str
        @param args: the arguments for the request
        @type args: dict
    """
    base = url
    for arg_name, value in args.iteritems():
        base = "{base}&{arg}={val}".format(base=base, arg=arg_name, val=value)
    return base

rb = build_url(RB_URL, ARGS)
print rb
reviews = requests.get(rb)

def parse_results(results):
    count = results.get('total_results', 0)
    rrs = results.get('review_requests')
    print 'Found {} pending Review Requests ({} in response)'.format(count, len(rrs))
    print 'The next chunk can be retrieved from: {}\n'.format(
        results.get('links').get('next', {}).get('href'))
    return rrs

if reviews.status_code == 200:
    result = reviews.json()
    for key in result:
        print key
    print result.get('links').get('next', {}).get('href')
    rrs = parse_results(result)
    print rrs[0].keys()  

http://reviews.apache.org/api/review-requests/?&status=pending&max-results=200&last-updated-from=2015-06-23&to-groups=mesos&start=0
total_results
stat
review_requests
links
None
Found 53 pending Review Requests (53 in response)
The next chunk can be retrieved from: None

[u'status', u'last_updated', u'target_people', u'depends_on', u'description_text_type', u'issue_resolved_count', u'ship_it_count', u'close_description_text_type', u'id', u'description', u'links', u'changenum', u'bugs_closed', u'testing_done_text_type', u'testing_done', u'close_description', u'time_added', u'extra_data', u'public', u'commit_id', u'blocks', u'branch', u'text_type', u'issue_open_count', u'approved', u'url', u'absolute_url', u'target_groups', u'summary', u'issue_dropped_count', u'approval_failure']
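
The URL printed above starts its query string with `?&`, because `build_url` prefixes every argument with `&`. The server accepts it, but the query string can also be built with `urlencode`, which escapes values as well. A minimal sketch in Python 3 syntax (in Python 2 the function lives in `urllib`); the name `build_query_url` is hypothetical:

```python
from urllib.parse import urlencode


def build_query_url(base_url, args):
    """Return `base_url` followed by `args` encoded as a query string."""
    # Strip any trailing '?' or '&' so that exactly one '?' separates
    # the path from the query string.
    base = base_url.rstrip('?&')
    # Sorting the arguments makes the resulting URL deterministic.
    return '{}?{}'.format(base, urlencode(sorted(args.items())))
```

With the `RB_URL` and `ARGS` defined above, `build_query_url(RB_URL, ARGS)` yields the same request without the stray `&`.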




### Saving to MongoDB

This is trivial: every returned RR is a `dict`, so it can be saved directly to MongoDB using the connector implemented in the base class `datastore.base_db.MongoDB`, via the `Reviews` implementation.

In [3]:
mongo_conf = {
    'db.host': 'localhost',
    'db.port': 27017,
    'db.name': 'mesos-reviews'
    # Authentication is not used here, we would otherwise set:
    # 'db.user': 'foobar',
    # 'db.passwd': 'zekret'
}

# save just a few RRs:
revs_db = Reviews(mongo_conf)

# clean up first
revs_db.drop()

for i in xrange(5):
    revs_db.save(rrs[i])



At this point, the first 5 RRs have been saved to the MongoDB instance running on the local machine; this can be verified from the MongoDB shell:
```
test> show dbs
...
mesos-reviews  0.078GB

test> use mesos-reviews
switched to db mesos-reviews

mesos-reviews> show collections
reviews
system.indexes

mesos-reviews> db.reviews.find().pretty()

{
        "_id" : 36113,
        "status" : "pending",
        "last_updated" : "2015-07-02T00:28:35Z",
        "issue_resolved_count" : 0,
        "ship_it_count" : 0,
        "description" : "perf: refactored parse to allow determining an output parsing function based on the runtime version.",
        "bugs_closed" : [
                "mesos-2834"
        ],
        "close_description" : null,
        "reviewers" : [
                "idownes",
                "pbrett",
                "wangcong"
        ],
        "time_added" : "2015-07-01T22:44:16Z",
        "commit_id" : null,
        "issue_open_count" : 5,
        "approved" : false,
        "url" : "/r/36113/",
        "absolute_url" : "https://reviews.apache.org/r/36113/",
        "summary" : "perf: refactored parse to allow determining an output parsing function based on the runtime version.",
        "issue_dropped_count" : 0,
        "deps" : {
                "depends_on" : [
                        "36112"
                ],
                "blocked_by" : [
                        "36114"
                ]
        },
        "submitter" : "chzhcn"
}
...

```

We drop the `links` field, as it is too verbose, and keep only `links.submitter.href`, which is extracted into a `submitter` field.
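
The code of `Reviews.save` is not shown in this notebook, so the following is only a sketch of what such a transformation could look like. It assumes that the submitter `href` ends in `.../users/<username>/` and keeps the last path segment, which would match the `"submitter" : "chzhcn"` value seen above; the function name `simplify_review` is hypothetical.

```python
def simplify_review(rr):
    """Return a copy of `rr` without `links`, keeping only the submitter."""
    doc = dict(rr)  # shallow copy: the original RR is left untouched
    links = doc.pop('links', {})
    href = links.get('submitter', {}).get('href', '')
    # Assumption: the href looks like `.../users/<username>/`, so the
    # last non-empty path segment is the username.
    doc['submitter'] = href.rstrip('/').rsplit('/', 1)[-1] or None
    return doc
```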

We currently use the following fields as indexes:
```
    coll.ensure_index('last_updated')
    coll.ensure_index('time_added')
    coll.ensure_index('submitter')
```
Others can easily be added.

### Retrieving RRs

This is fairly straightforward: just use the `get(id)` method of the `Reviews` class:

In [4]:
# First get ALL the ids in the collection
all_ids = revs_db.get_all_ids()

assert(len(all_ids) == 5)
print "There are {} RRs in the DB".format(len(all_ids))

rid = all_ids[2]
review = revs_db.get(rid)

print "{}\n".format(review.pop('description'))
for k in review.keys():
    print "{:20}  {:<60}".format(k, review[k])

There are 5 RRs in the DB
Adding the possibility to 'keep-alive' the connection

status                pending                                                     
commit_id             None                                                        
last_updated          2015-07-02T01:49:01Z                                        
url                   /r/36040/                                                   
reviewers             [u'anandmazumdar', u'marco', u'vinodkone']                  
absolute_url          https://reviews.apache.org/r/36040/                         
issue_resolved_count  1                                                           
bugs_closed           []                                                          
issue_open_count      1                                                           
issue_dropped_count   0                                                           
dependencies          {u'depends_on': [], u'blocked_by': [u'35934']}              
summar

### Processing data

`ReviewFetcher` does all of the above in a loop, to make sure we retrieve all results, and then stores the data in the DB.

*Note: before running this notebook, update the date below to limit the amount of data retrieved.*
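
The internals of `ReviewFetcher` are not shown here. As a rough idea of what a "fetch everything" loop can look like, the sketch below uses the `start` and `max-results` arguments together with the `total_results` field of the first response. It is written in Python 3 syntax; `page_offsets`, `fetch_in_chunks` and the `fetch_page` callable are hypothetical names, not the project's API.

```python
def page_offsets(total_results, page_size):
    """Return the `start` values needed to cover `total_results` items."""
    return list(range(0, total_results, page_size))


def fetch_in_chunks(fetch_page, page_size=200):
    """Retrieve all review requests, one chunk of `page_size` at a time.

    `fetch_page(start, max_results)` is assumed to issue the request with
    those two arguments and to return the decoded JSON response.
    """
    first = fetch_page(0, page_size)
    total = first.get('total_results', 0)
    rrs = list(first.get('review_requests', []))
    # The first chunk is already in hand: skip offset 0.
    for start in page_offsets(total, page_size)[1:]:
        chunk = fetch_page(start, page_size)
        rrs.extend(chunk.get('review_requests', []))
    return rrs
```

For example, 53 results with a page size of 25 would need requests at offsets 0, 25 and 50.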

In [8]:
from rb_etl.fetcher import ReviewFetcher

mongo_conf['since'] = '2015-06-15'
ff = ReviewFetcher(mongo_conf)

all_rrs = ff.fetch_all()
ff.store_data()
print "We retrieved and saved {} Reviews".format(len(all_rrs))

# Let's verify it all went to plan:
assert(len(all_rrs) == len(revs_db.get_all_ids()))



We retrieved and saved 76 Reviews
