# Amazon review corpus

The purpose of this notebook is to transform the raw corpus into a [Pandas](http://pandas.pydata.org/) `DataFrame` with a standardized format. The format is fairly simple:
* Each row contains a review
* Two columns, `text` and `category`, hold the review text and its category label

Every data set obeying these simple rules can be plugged into the forthcoming pipeline.
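
To make the format concrete, here is a minimal sketch of a check that a `DataFrame` follows these rules. The helper name `check_corpus_format` and the toy rows are hypothetical and not part of the pipeline. Treating missing values as a problem is an added assumption; extra columns are allowed.

```python
import pandas as pd

# Column names taken from the format described above.
REQUIRED_COLUMNS = ("text", "category")


def check_corpus_format(df):
    """Return a list of problems; an empty list means the frame fits the format."""
    problems = []
    for column in REQUIRED_COLUMNS:
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif df[column].isna().any():
            # Assumption: a review without text or category is not usable.
            problems.append(f"column {column} contains missing values")
    return problems


# Toy data for illustration only, not taken from the corpus.
example = pd.DataFrame({
    "text": ["Great blender.", "The plot was thin."],
    "category": ["Home_and_Kitchen", "Movies_and_TV"],
})
print(check_corpus_format(example))
```

Running the same check on any other data set before handing it to the pipeline should catch a misnamed or missing column early.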

The data set was downloaded from [this](http://jmcauley.ucsd.edu/data/amazon/) location. We chose the following files:
* `reviews_Books_5.json`
* `reviews_Electronics_5.json`
* `reviews_Home_and_Kitchen_5.json`
* `reviews_Movies_and_TV_5.json`

As the `5` at the end of the file name indicates, we selected the 5-core versions. These are subsets of the data in which every remaining user and item has at least 5 reviews.

Each line of each file contains one JSON object. For example, the first line of `reviews_Home_and_Kitchen_5.json` looks like this:

`{"reviewerID": "APYOBQE6M18AA", "asin": "0615391206", "reviewerName": "Martin Schwartz", "helpful": [0, 0], "reviewText": "My daughter wanted this book and the price on Amazon was the best.  She has already tried one recipe a day after receiving the book.  She seems happy with it.", "overall": 5.0, "summary": "Best Price", "unixReviewTime": 1382140800, "reviewTime": "10 19, 2013"}`

The data set contains a lot of useful information, but in this work we are only interested in the `reviewText` field and the assigned class, which is derived from the file name by stripping the `reviews_` prefix and the `_5.json` suffix (e.g. `reviews_Movies_and_TV_5.json` -> `Movies_and_TV`).
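
The mapping from one raw line to one row can be sketched as follows. The helper names `category_from_filename` and `to_row` are hypothetical, and `sample_line` is an abridged version of the example record above. Unlike the conversion cell in this notebook, which keeps all other fields as additional columns, this sketch keeps only the two fields of interest.

```python
import json


def category_from_filename(filename):
    """Strip the 'reviews_' prefix and the '_5.json' suffix from a file name."""
    return filename.replace("reviews_", "").replace("_5.json", "")


def to_row(line, filename):
    """Turn one JSON line into a dict with the two standardized fields."""
    record = json.loads(line)
    return {
        "text": record["reviewText"],
        "category": category_from_filename(filename),
    }


# Abridged example record; the real lines carry more fields.
sample_line = (
    '{"reviewerID": "APYOBQE6M18AA", "asin": "0615391206", '
    '"reviewText": "She seems happy with it.", "overall": 5.0}'
)
row = to_row(sample_line, "reviews_Home_and_Kitchen_5.json")
print(row)
```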

The variables `raw_corpus_path` and `pd_corpus_path` may need to be adjusted to your setup.

* `raw_corpus_path` is expected to contain a path to a folder that contains the raw data set files (like `reviews_Books_5.json`) and nothing else.
* `pd_corpus_path` is the directory in which the resulting Pandas `DataFrame` will be stored.
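
Because every file in the raw folder is read as a review file, it can help to check the folder contents first. The following is a minimal sketch; the helper name `unexpected_files` is hypothetical, and it only inspects the top level of the folder.

```python
import os


def unexpected_files(folder):
    """List entries in `folder` that do not look like `reviews_<Category>_5.json`."""
    return sorted(
        name
        for name in os.listdir(folder)
        if not (name.startswith("reviews_") and name.endswith("_5.json"))
    )
```

Calling `unexpected_files(raw_corpus_path)` once the paths are set should return an empty list; anything it reports would otherwise be parsed as review data.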

In [None]:
raw_corpus_path = 'data/AMAZON/raw'
pd_corpus_path = 'data/AMAZON/dataframes'

In [None]:
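import json
import os

import pandas as pd
from tqdm import tqdm

reviews = []
for root, dirs, files in os.walk(raw_corpus_path):
    for file in files:
        # The category is encoded in the file name,
        # e.g. reviews_Movies_and_TV_5.json -> Movies_and_TV
        category = file.replace('reviews_', '').replace('_5.json', '')
        with open(os.path.join(root, file)) as fh:
            # Each line holds one JSON object, i.e. one review
            for line in tqdm(fh, desc=category):
                datapoint = json.loads(line)
                datapoint['category'] = category
                datapoint['text'] = datapoint.pop('reviewText')
                # Plain dicts are cheaper to collect than one pd.Series per review
                reviews.append(datapoint)
print('Processing finished. Start compiling DataFrame.')
df = pd.DataFrame(reviews)
# Create the output directory if it does not exist yet
os.makedirs(pd_corpus_path, exist_ok=True)
df.to_pickle(os.path.join(pd_corpus_path, 'amazon.pkl'))
print('finished')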
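
As a quick sanity check on the stored corpus, the number of reviews per category can be inspected. This is a minimal sketch: the helper name `summarize_corpus` and the toy frame are hypothetical, and the commented-out path assumes the default `pd_corpus_path` used in this notebook.

```python
import pandas as pd


def summarize_corpus(df):
    """Return the number of reviews per category, largest first."""
    return df["category"].value_counts()


# Toy frame for illustration only; with the real corpus one would instead use
# df = pd.read_pickle('data/AMAZON/dataframes/amazon.pkl')
toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "category": ["Books", "Books", "Electronics"],
})
counts = summarize_corpus(toy)
print(counts)
```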