# Summary

We will be working with Amazon review data again, but this time the data will come from files, not from scraping web pages.
Relevant data is split across two files / schemas:

* Product data -- information about a product, keyed by ASIN.  Has information like the product's name, price, browse categories, and sales rank 
* Review data -- information about a single review.  Has information like the reviewer's name and ID, the product ASIN, the review's summary, body, score, and "helpfulness"

The two files are linked by ASIN. Every ASIN that appears in a review is guaranteed to appear in the product data file, but a product in the product data file is not guaranteed to have any reviews.

This notebook will guide you through the process of 

-----------------------------------
### Fields and Their Types

#### Reviews

| Field | Type | Note |
|-------|------|------|
| id | UUID | Not in the input file; supplied by Solr |
| reviewerID | ignore |  |
| asin | string | Joins with the asin field in the product file |
| reviewerName | string | ignore |
| helpful | ignore |  |
| reviewText | text | Full review body |
| overall | integer | Truncate the floating-point value |
| summary | text | Review summary text |
| unixReviewTime | ignore |  |
| reviewTime | ignore |  |

#### Products
 
| Field | Type | Note |
|-------|------|------|
| asin | string | Unique ID for products.  Joins with the asin field in the reviews file. |
| description | string | Stored but not indexed.  Shown on product detail page. |
| title | string | Stored but not indexed.  Shown on the product detail page and also on review search result and detail pages |
| imUrl | ignore |  |
| price | float | Displayed in currency format on the product detail page.|
| salesRank | ignore |  |
| categories | ignore | |
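
The two conversions named in the tables above (truncating `overall` and showing `price` as currency) can be sketched as follows. The helper names `review_score` and `format_price` are hypothetical, and the US-style dollar format is an assumption, since the tables only say "currency format."

```python
def review_score(overall):
    """Truncate the floating-point 'overall' value to an integer."""
    # int() drops the fractional part rather than rounding.
    return int(overall)


def format_price(price):
    """Render a float price as a currency string (assumed US-style format)."""
    # Thousands separator and two decimal places; the '$' prefix is an assumption.
    return f"${price:,.2f}"


print(review_score(4.0))     # 4
print(format_price(1234.5))  # $1,234.50
```

The integer conversion happens at index time, while the currency formatting would only be applied when a product is displayed, so the stored `price` stays a float.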



## Data Elements

### Loading the Data Files

Each data file holds one "data row" per line.  Each line can be converted to a Python dictionary using this code:

``eval('(' + line + ')')``
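
Because `eval` executes whatever expression it is given, a more defensive variant is sketched below using `ast.literal_eval` from the standard library. This assumes every line is a plain Python dictionary literal (strings, numbers, lists, nested dictionaries); if a line contains anything else, `literal_eval` raises an error instead of evaluating it. The helper name `parse_line` and the sample line are made up for illustration.

```python
import ast


def parse_line(line):
    """Parse one data-file line (a Python dict literal) into a dictionary."""
    # literal_eval accepts only literals, so it will not run code
    # that happens to be embedded in the input file.
    return ast.literal_eval(line.strip())


# Hypothetical sample line, not taken from the real data files.
sample = "{'asin': 'B00EXAMPLE', 'title': 'Sample product', 'price': 9.99}\n"
record = parse_line(sample)
print(record['asin'], record['price'])
```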

## Loading Files and Indexing

This code will look very similar to Assignment 1.  It should parse and index all review and product records.  Notice that the code takes two file names, one for products and one for reviews.  I may run your code on different data sets, but they will be in the same format as the files provided to you.

In [9]:
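import subprocess

SOLR_EXECUTABLE = '/usr/local/Cellar/solr/8.0.0/bin/solr'

def solr_command(*args):
    # Run the Solr command-line tool with the given arguments and return its output
    return subprocess.check_output([SOLR_EXECUTABLE] + list(args))

def create_collection(config_dir, collection_name):
    # Create a Solr core named collection_name from the configuration in config_dir
    solr_command('create_core', '-c', collection_name, '-d', config_dir)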

In [10]:
create_collection('/Users/jaredalonzo/Computer Science/CS4910 Text Processing & Search/hw2-submission/amazon-reviews', 'amazon-reviews')
create_collection('/Users/jaredalonzo/Computer Science/CS4910 Text Processing & Search/hw2-submission/amazon-products', 'amazon-products')

In [1]:
#  These two functions create a list of documents to be indexed -- each a dictionary 
#  that can be passed to your indexing functions below.  You should do the following checks:
#   -- skip any product that does not have a title
#   -- skip any review that does not have a product 
#  Notice that these have to be done in sequence, as you won't know your list of products until 
#  the first step is finished
def product_json(filename):
    list_of_dictionaries = []
    with open(filename) as f:
        for item in f:
            line = eval('(' + item + ')')
            # Skip any product that does not have a title
            if 'title' not in line:
                continue
            doc = {'asin': line['asin'], 'title': line['title']}
            # description and price are optional in the input
            if 'description' in line:
                doc['description'] = line['description']
            if 'price' in line:
                doc['price'] = line['price']
            list_of_dictionaries.append(doc)
    return list_of_dictionaries

def review_json(filename, products):
    # products is a collection of ASINs; a set makes each membership test constant time
    known_asins = set(products)
    list_of_dictionaries = []
    with open(filename) as f:
        for item in f:
            line = eval('(' + item + ')')
            # Skip any review whose product is not in the product list
            if line['asin'] in known_asins:
                list_of_dictionaries.append({
                    'asin': line['asin'],
                    'reviewText': line['reviewText'],
                    'overall': int(line['overall']),  # truncate the float score
                    'summary': line['summary'],
                })
    return list_of_dictionaries

In [2]:
products = product_json("medium_asin_data.txt")
reviews = review_json("medium_review_data.txt", {x['asin'] for x in products})
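
Before indexing, it can help to confirm that the two document lists join the way the summary describes. The sketch below is one possible check, not part of the assignment code: `join_report` is a hypothetical helper, and the sample documents are invented so the block runs on its own.

```python
from collections import Counter


def join_report(products, reviews):
    """Summarize how review documents line up with product documents by ASIN."""
    product_asins = {p['asin'] for p in products}
    reviews_per_asin = Counter(r['asin'] for r in reviews)
    reviewed_asins = set(reviews_per_asin)
    return {
        # Reviews pointing at a product that was skipped or is missing
        'orphan_review_asins': sorted(reviewed_asins - product_asins),
        # Products with no reviews at all (allowed by the data format)
        'products_without_reviews': sorted(product_asins - reviewed_asins),
        'reviews_per_asin': dict(reviews_per_asin),
    }


# Invented sample data for illustration only.
sample_products = [{'asin': 'A1', 'title': 'First'}, {'asin': 'A2', 'title': 'Second'}]
sample_reviews = [{'asin': 'A1', 'overall': 5}, {'asin': 'A1', 'overall': 3}]
report = join_report(sample_products, sample_reviews)
print(report)
```

With the real data, `join_report(products, reviews)` should list no orphan review ASINs, because `review_json` already drops reviews whose product was skipped.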

In [13]:
import pysolr
def index(data, port=8983, collection_name=''):
    solr = pysolr.Solr(f'http://localhost:{port}/solr/{collection_name}')
    solr.add(data, commit=True)

In [14]:
index(reviews, collection_name='amazon-reviews')

In [15]:
index(products, collection_name='amazon-products')
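
For larger data sets, sending every document in a single `add` call may produce a very large request. A possible alternative, sketched here under that assumption, sends the documents in chunks and commits once at the end. `batches` and `index_in_batches` are hypothetical helpers, and the batch size of 1000 is an arbitrary choice. The `solr` argument is expected to be a `pysolr.Solr` client, of which only the `add` and `commit` methods are used.

```python
def batches(docs, size):
    """Yield consecutive slices of docs, each with at most `size` items."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def index_in_batches(solr, docs, batch_size=1000):
    """Send docs to Solr in chunks and commit once after the last chunk.

    `solr` is any object with add() and commit() methods, such as pysolr.Solr.
    Returns the number of documents sent.
    """
    sent = 0
    for chunk in batches(docs, batch_size):
        # Defer the commit so Solr does not open a new searcher per chunk.
        solr.add(chunk, commit=False)
        sent += len(chunk)
    solr.commit()
    return sent


# Usage with a running Solr instance (not executed here):
#   import pysolr
#   solr = pysolr.Solr('http://localhost:8983/solr/amazon-reviews')
#   index_in_batches(solr, reviews)
```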