# Building a Dataset

This notebook gives a quick overview of how we can build and preprocess our training and test sets.

We start by reading in the first fifty lines of the file as text:

In [14]:
with open('../data/reviews.json') as file:
    head = [next(file) for _ in range(50)]
    
head[0]

'{"votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything i look for in a general practitioner.  he\'s nice and easy to talk to without being patronizing; he\'s always on time in seeing his patients; he\'s affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i\'m sitting here trying to think of any complaints i have about him, but i\'m really drawing a blank.", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}\n'
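Calling `next(file)` a fixed number of times raises `StopIteration` if the file has fewer lines than requested. A minimal sketch of an alternative using `itertools.islice`, which simply stops at the end of the file; the helper name `read_head` is hypothetical and not part of the original notebook:

```python
from itertools import islice


def read_head(path, n=50):
    """Return up to the first n lines of a text file (hypothetical helper)."""
    with open(path) as f:
        # islice stops after n lines or at end of file, whichever comes first
        return list(islice(f, n))
```

With this helper, `head = read_head('../data/reviews.json')` would behave like the cell above for files with at least fifty lines.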

Next, we parse the lines into dicts:

In [15]:
import json

parsed = [json.loads(review) for review in head]

parsed[0]

{u'business_id': u'vcNAWiLM4dR7D2nwwJ7nCA',
 u'date': u'2007-05-17',
 u'review_id': u'15SdjuK7DmYqUAj6rjGowg',
 u'stars': 5,
 u'text': u"dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
 u'type': u'review',
 u'user_id': u'Xqd0DzHaiyRqVH3WRG7hzg',
 u'votes': {u'cool': 1, u'funny': 0, u'useful': 2}}

We can extract the necessary information into an `Example` object:

In [18]:
class Example():
    def __init__(self, review, votes):
        self.review = review
        self.votes = votes
        
    def __str__(self):
        return "Review: \"" + self.review[0:100] + "...\""
    
    __repr__ = __str__
    
e = [Example(ex['text'], ex['votes']) for ex in parsed]

e[0:5]

[Review: "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to ...",
 Review: "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've ha...",
 Review: "Dr. Goldberg has been my doctor for years and I like him.  I've found his office to be fairly effici...",
 Review: "Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started ...",
 Review: "Got a letter in the mail last week that said Dr. Goldberg is moving to Arizona to take a new positio..."]

Split the data into training and test sets (a roughly 50/50 split, as a proof of concept):

In [19]:
training = e[0:25]

# Note: this slice starts at index 26, so e[25] falls in neither set (24 test examples)
testing = e[26:50]
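The two slices above use different boundaries (25 and 26), so `e[25]` ends up in neither set. A sketch of a split that uses a single cut point, so that every example lands in exactly one part; the helper name `split_at` is hypothetical, and the `print` call assumes Python 3:

```python
def split_at(items, fraction=0.5):
    """Split a list into two contiguous parts at a single cut point."""
    cut = int(len(items) * fraction)
    # Both slices share the same boundary, so no item is dropped or duplicated
    return items[:cut], items[cut:]


train_part, test_part = split_at(list(range(50)))
print(len(train_part), len(test_part))  # 25 25
```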

Build a quick PyBrain dataset from the data, using only usefulness as the target:

In [32]:
from pybrain.datasets import SupervisedDataSet

# two inputs, one output
ds = SupervisedDataSet(2, 1)

for example in training:
    ds.addSample((len(example.review), "good" in example.review), (example.votes['useful'] > 0,))
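The same two features (review length and whether the text contains "good") are needed again at test time. A minimal sketch that factors them into one function so both stages stay in sync; the name `extract_features` is hypothetical, and the `print` call assumes Python 3:

```python
def extract_features(review_text):
    """Map a review's text to the two numeric inputs used above (hypothetical helper)."""
    length = len(review_text)
    # Encode the boolean as 0.0/1.0 explicitly instead of relying on implicit conversion
    mentions_good = 1.0 if "good" in review_text else 0.0
    return (length, mentions_good)


print(extract_features("a good doctor"))  # (13, 1.0)
```

The loop above could then call `ds.addSample(extract_features(example.review), ...)` with the target unchanged.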

Set up a trainer:

In [33]:
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer

# Two inputs, one hidden layer with three units, one output
net = buildNetwork(2, 3, 1)

trainer = BackpropTrainer(net, ds)

In [34]:
trainer.train()

0.27558358818266793

In [48]:
trainer.trainUntilConvergence()

# Suppress long output
""

''

Test the model on our test set:

In [49]:
results = [net.activate([len(ex.review), "good" in ex.review])[0] > 0 for ex in testing]

results[0:5]

[True, True, True, True, True]
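The targets in the dataset are booleans, which become 0 and 1 when stored numerically, so a cut-off of 0 labels every positive activation as useful. A sketch of the thresholding step, under the assumption that the midpoint 0.5 is a more natural cut-off for 0/1 targets; the name `to_label` is hypothetical, and the activation values are made up for illustration rather than taken from the trained network:

```python
def to_label(activation, threshold=0.5):
    """Convert a raw network output to a boolean prediction."""
    return activation > threshold


# Made-up activations, for illustration only
sample_activations = [0.12, 0.48, 0.51, 0.97]
print([to_label(a) for a in sample_activations])  # [False, False, True, True]
```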

Build a list recording which of our predictions were correct:

In [51]:
actual = [ex.votes['useful'] > 0 for ex in testing]

correct = [a == b for a, b in zip(results, actual)]

correct[0:5]

[False, True, False, False, False]

Calculate the accuracy:

In [52]:
from __future__ import division

accuracy = correct.count(True) / len(correct)

accuracy

0.125
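An accuracy figure is easier to interpret next to a trivial baseline. A minimal sketch, assuming Python 3 (where `/` is true division), that computes accuracy alongside the score obtained by always predicting the most common label; the function names and the labels below are made up for illustration and are not the notebook's data:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


def majority_baseline(actual):
    """Accuracy obtained by always predicting the most common label."""
    positives = sum(1 for a in actual if a)
    return max(positives, len(actual) - positives) / len(actual)


# Made-up labels, for illustration only
predicted = [True, True, True, True]
actual = [False, True, False, False]
print(accuracy(predicted, actual))   # 0.25
print(majority_baseline(actual))     # 0.75
```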