## Assignment 1 -- Page Scrape to SOLR

For this assignment we will be working with Amazon product reviews.  We have already downloaded ~500 product detail pages for movies; they are in the *pages* directory.  

For this assignment you will: 

1. Parse the HTML pages to get attributes of interest about the product and reviews for the product (each review will be a *document*)
1. Create a SOLR collection to hold the documents
1. Index the documents
1. Write an interface (Python) allowing an end user to query the documents and get back useful results

You will write the code that interacts with SOLR so that I can take your notebook and run it end to end: the code will create your SOLR collection, then parse and load some documents.  I will then run a set of queries against your collection, using Python query functions that you also supply.

### Product Review Data Attributes

You will collect the following attributes from a review.  Please make sure in your implementation, you use *exactly these names* for the attributes.

1. ***productID*** -- a string.  The Amazon ASIN for the product being reviewed.
2. ***productName*** -- a string.  The name of the product being reviewed.
3. ***reviewSummary*** -- a string.  Short text the reviewer supplies to introduce the review.
4. ***reviewBody*** -- a string.  Text of the review.   Note that although empty reviews on Amazon are possible, you will not index them.
5. ***reviewDate*** -- a date.  The web page has a string for the date which must be parsable into a SOLR date.
6. ***reviewScore*** -- an integer.  The value must be between 1 and 5 inclusive.
7. ***id*** -- a UUID that is unique to the review.  This attribute is not on the web page;  you will make SOLR supply it (see below).
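
As an illustration of the *reviewDate* requirement, here is a minimal sketch of converting a scraped date string into the UTC timestamp form SOLR date fields accept.  The helper name `to_solr_date` and the input format `"%B %d, %Y"` (for example "January 1, 2018") are assumptions;  check the actual pages for the format they use.

```python
from datetime import datetime


def to_solr_date(date_text, fmt="%B %d, %Y"):
    """Convert a scraped date string to SOLR's date syntax.

    Both the function name and the default format are assumptions made
    for illustration; adjust fmt to whatever the pages actually contain.
    """
    parsed = datetime.strptime(date_text.strip(), fmt)
    # SOLR date fields expect ISO 8601 in UTC with a trailing 'Z'.
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
```

A date with no time component ends up at midnight UTC, which is sufficient for day-level range queries.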

### Scraping Reviews for a Product
Your solution to this problem is a function
<pre>
scrape_pages_for_reviews(dirname)-> JSON array
</pre>

The function takes as input the absolute pathname of a directory containing HTML files.  Each file contains the HTML for an Amazon product detail page, and contains 0 or more reviews.  The output of this function is an array of JSON objects, each containing data for a single review.  Note that there may be more than one JSON object produced for a file (if the product has multiple reviews) or no objects (if the product on the page has no reviews). 
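
One possible way to organize this function is to separate the directory walk from the HTML parsing.  The sketch below shows only the walk;  `collect_reviews` and the caller-supplied `extract_reviews` callback are hypothetical names, and the actual HTML parsing is deliberately left out.

```python
import os


def collect_reviews(dirname, extract_reviews):
    """Pool the reviews found in every file of dirname.

    extract_reviews is a hypothetical callback that takes one page's HTML
    and returns a list of review dictionaries (possibly empty).
    """
    reviews = []
    for filename in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, filename)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            html = handle.read()
        for review in extract_reviews(html):
            # Empty reviews are possible on Amazon but are not indexed.
            if (review.get("reviewBody") or "").strip():
                reviews.append(review)
    return reviews
```

A page with no reviews simply contributes nothing to the result, and a page with several reviews contributes one dictionary per review.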


In [3]:
# Your Code Here -- REQUIRED FUNCTION
def scrape_pages_for_reviews(path):
    return None


In [None]:
reviews_json = scrape_pages_for_reviews("pages")

### Schema Design and Search Scenarios

In designing your schema you first have to understand what search/query patterns will be implemented, and also any other data transformations you need to make.

#### Data Transformation

You might notice in looking at product names, that the name often contains *VHS* or *DVD* and in a few cases *Region 2* or similar strings, all of which are not part of the product name, but rather are about the format of the product.  These are thought to be distracting in the search context, so they should not appear in the product name and should not be indexed for search.  Keeping terms like VHS or DVD in the review summary or body is fine.  Since you want to omit VHS or DVD from the *field text* rather than just from the index, you will have to make this transformation in code prior to indexing the document.

(If you are feeling ambitious and you look at all the product names, you will notice that in a very few cases, simply removing *VHS* or *DVD* or *Region d* results in an ugly product name string.  It is acceptable to leave the few edge cases as they are, but wouldn't it be so much better to come up with a transformation that would work for them too?)
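
A minimal sketch of the basic transformation, using a regular expression.  The function name `clean_product_name` is hypothetical, and the pattern is a first approximation that assumes the format markers appear as standalone words, optionally wrapped in brackets or parentheses;  it does not attempt the ugly edge cases mentioned above.

```python
import re

# Format markers to drop, with optional surrounding brackets or parentheses.
# This pattern is an assumption about how the markers appear in the data.
_FORMAT_RE = re.compile(r"[\[\(]?\b(?:VHS|DVD|Region\s*\d)\b[\]\)]?", re.IGNORECASE)


def clean_product_name(name):
    """Remove VHS / DVD / 'Region N' markers from a product name."""
    cleaned = _FORMAT_RE.sub(" ", name)
    # Collapse the whitespace left behind by the removals.
    cleaned = re.sub(r"\s+", " ", cleaned)
    # Trim leftover separators at either end.
    return cleaned.strip(" -:,")
```

For example, `clean_product_name("Shakes the Clown [VHS]")` yields `"Shakes the Clown"`.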

#### Search Scenarios

You will implement four separate search scenarios.  The interface for the four searches is similar, but the scenarios differ in
* the parameters they take
* how they search on those parameters
* the document attributes they return

There will be more information below about how to implement the search scenarios;  this cell contains information you need to define your schema.  The information about how the search does retrieval based on its parameters is crucial to you making schema design decisions.

1.  ***Review Search***:  this is a search on the review summary and body fields, as well as the review date and score.  For this search you should support (a) standard keyword querying over the review summary and review body fields, (b) range queries over the date and score.
1. ***ASIN Search***:  this does a case-independent exact match on the product ID field
1. ***Product Name Search***:  this searches the product name field only, and succeeds only if all keywords in the query string match a word in the stored string, *in order*.   Only the following processing should be done
  * Stopwords should be removed
  * The search should be case insensitive
  * Possessives should be removed
  * (Tokens should not otherwise be stemmed)
1. ***ID Search***:  this is a direct lookup on the document's *id* field (its unique UUID)
  
Here are some examples of product name searches 


| Document Value | Query String | Matches? |
|---|---|---|
| Shakes the Clown | Shakes Clown | Yes |
| Shakes the Clown | Clown Shakes | No  |
| The Young Poisoner's Notebook | poisoner notebook | Yes |
| The Young Poisoner's Notebook  | poison notebook | No |
| Creature From the Black Lagoon | creature Lagoon | Yes |
| Creature From the Black Lagoon | creature black lagoon | Yes |
| Creature From the Black Lagoon | creature lagoon black | No |
| Creature From the Black Lagoon | monster lagoon | No |
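
The matching rule in the table can be written down as a small pure-Python reference check, which may be handy for testing your SOLR results.  This is a sketch of the expected semantics only, not of how SOLR should be configured;  the names `normalize` and `name_matches` and the short stopword list are assumptions.

```python
import re

# A tiny illustrative stopword list; your SOLR stopwords file may differ.
STOPWORDS = {"a", "an", "and", "from", "in", "of", "the", "to"}


def normalize(text):
    """Lowercase, strip possessives, and drop stopwords (no stemming)."""
    tokens = []
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        if token.endswith("'s"):
            token = token[:-2]
        token = token.strip("'")
        if token and token not in STOPWORDS:
            tokens.append(token)
    return tokens


def name_matches(document_value, query_string):
    """True if every query token occurs in the document, in the same order."""
    remaining = iter(normalize(document_value))
    # 'token in remaining' consumes the iterator up to the match, so each
    # later query token must be found after the previous one.
    return all(token in remaining for token in normalize(query_string))
```

Running the rows of the table through `name_matches` reproduces the Yes/No column.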

------------------------------------------

Here is a summary of the four search types

| Search Type | Parameters Accepted | Document Attributes Returned |
|---|---|---|
| Review| Keywords from summary and body;  time threshold; score threshold | productName, productID, reviewSummary, reviewDate, reviewScore|
| ASIN | Product ID / ASIN | reviewSummary, reviewBody, reviewDate, reviewScore, id |
| Product Name | Keywords from product name | productName, productID, reviewSummary, reviewDate, reviewScore |
| ID | ID UUID | all stored attributes in the document |



--------------------------------------------
#### Creating the SOLR Collection and Schema

Start by defining a directory *conf* in the directory holding your notebook, and copy in standard SOLR configuration files. You will probably want to start with the default configuration and edit out everything you don't need.  You will edit the *schema.xml* file and any other files you need -- like stopwords or synonyms.   Your *conf* directory should have only files that are necessary for your SOLR collection to run, and your *schema.xml* file should contain only the elements required to implement your schema.

Now that you have defined your schema and support files like your stopwords list, you will write a little code that will call SOLR and have it create the collection for you.

For this section you supply a function <pre>create_collection(config_dir, solr_port=8983)</pre> which creates the collection in SOLR.  The *config_dir* parameter is the absolute pathname of your configuration directory.

You may assume that when this function is run 
1.  SOLR is running
2.  There is no existing collection with the name *amzn-reviews*
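
As a rough sketch, the arguments for the SOLR command-line `create` action could be assembled as below and then passed to the *solr_command* helper defined in this notebook.  The helper name `create_collection_args` is hypothetical;  `-c`, `-d` and `-p` are the collection name, configuration directory and port options of `bin/solr create`.

```python
def create_collection_args(config_dir, collection_name="amzn-reviews", solr_port=8983):
    """Build the argument list for 'solr create'.

    Hypothetical helper for illustration; it only assembles arguments
    and does not call SOLR itself.
    """
    return [
        "create",
        "-c", collection_name,   # name of the new collection
        "-d", config_dir,        # directory holding the configuration files
        "-p", str(solr_port),    # port of the running SOLR instance
    ]
```

The resulting list would be unpacked into the helper, for example `solr_command(*create_collection_args(config_dir))`.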


In [4]:
# This is a function allowing us to do process execution in a platform-independent way.  
# You might have to change the location of the SOLR executable.  

import subprocess

SOLR_EXECUTABLE = 'C:\\solr-8.0.0\\bin\\solr.cmd'

def solr_command(*args):
    return subprocess.check_output([SOLR_EXECUTABLE] + list(args))

In [5]:
# Your Code Here -- REQUIRED FUNCTION
#  Call to SOLR that creates a new collection with the given name and 
#  whose configuration is located at the given directory.  Function returns no value

def create_collection(config_dir, collection_name = 'amzn-reviews'):
    return None

In [6]:
#create_collection('conf')

#### Indexing

In this step you will index your reviews.  At this point the SOLR server should be running, and should have created the collection **amzn-reviews** to accept the reviews.  

You should produce a function
<pre>
index_reviews(json_reviews, solr_port=8983)
</pre>
whose argument is a JSON array as produced by <pre>scrape_pages_for_reviews()</pre> above.  The function returns no result.

Please be sure to set the **commit parameter** on the indexing so the documents will be available immediately after this function completes.
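
To make the commit parameter concrete, here is a minimal sketch of the pieces of a JSON update request.  The helper name `build_update_request` and the `localhost` host are assumptions;  the function only assembles the request and leaves the actual HTTP POST to you.

```python
import json


def build_update_request(reviews, solr_port=8983, collection_name="amzn-reviews"):
    """Assemble URL, query parameters, headers and body for a JSON update.

    Hypothetical helper; assumes SOLR is reachable on localhost.
    """
    url = f"http://localhost:{solr_port}/solr/{collection_name}/update"
    # commit=true makes the documents searchable as soon as the call returns.
    params = {"commit": "true"}
    headers = {"Content-Type": "application/json"}
    body = json.dumps(reviews)
    return url, params, headers, body
```

With the *requests* library, these four values map directly onto `requests.post(url, params=params, headers=headers, data=body)`.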

In [7]:
# reviews is a JSON array of review documents, as produced by 
# scrape_pages_for_reviews(), that conform to the configured schema.  
# Call SOLR to index all documents in the array

# Your Code Here -- REQUIRED FUNCTION
def index_reviews(reviews, solr_port=8983, collection_name='amzn-reviews'):
    return None

In [8]:
#index_reviews(reviews_json)

----------------------------------------------
### Queries and Retrieval

Now you will write four query functions, one for each query type.  Each will take some query parameters as input, and return 
a dictionary of query parameters that can be submitted to SOLR.  

In addition there will be a simple function *do_query(params)* which will take a dictionary, convert the dictionary into query parameters, do the query to SOLR, and return a list of 0 or more documents (JSON dictionaries).  

Each of the four query types will be implemented by a function that returns a dictionary that can be passed to *do_query()*
See below for the definition of *do_query*.


### The Four Search Types

Refer to the cell above about what the four search types are supposed to do
1. Review Search
1. Product Name Search
1. ASIN Search
1. ID Search

----------------------

### Review Search

<pre>
review_query_dictionary(keywords='', date_str='', score_str='') -> dictionary
</pre>

#### Keywords parameter

The *keywords* parameter is a query string to be sent to SOLR -- you don't need to do any additional processing on it.

#### Date parameter

The *date_str* is of the form
<pre>
&lt;direction&gt; &lt;yyyy&gt;-&lt;mm&gt;-&lt;dd&gt;
</pre>
where &lt;direction&gt; is either '&lt;=' or '&gt;=' and the other components are the year, month and day of the user-supplied date.  For example, if this component is 
<pre>
&gt;= 2018-01-01
</pre>
only reviews with a date on or after Jan 1 2018 are returned.  If this string is omitted, no date filtering will be done.

#### Score parameter

The *score_str* is of the form
<pre>
&lt;direction&gt; &lt;score&gt;
</pre>
where &lt;direction&gt; is either '&lt;=' or '&gt;=' and *score* is an integer between 1 and 5.  For example, if this component is 
<pre>
&lt;= 3
</pre>
only reviews with a score of three or less are returned.  If this string is omitted, no score filtering will be done.

This query returns all documents that meet *all* of the search criteria that are supplied.  If it is called with no arguments, it should return no results.
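
One way to handle the date and score strings is to translate each into a SOLR range expression, which could then be sent as a filter query (`fq`).  The sketch below assumes the field names *reviewDate* and *reviewScore* from the attribute list;  the helper names are hypothetical, and treating "on or before" as the end of that day in UTC is an assumption.

```python
from datetime import datetime


def parse_threshold(text):
    """Split '<direction> <value>' and check the direction."""
    direction, value = text.split()
    if direction not in ("<=", ">="):
        raise ValueError(f"unsupported direction: {direction!r}")
    return direction, value


def date_filter(date_str):
    """Turn e.g. '>= 2018-01-01' into a SOLR range on reviewDate."""
    direction, day = parse_threshold(date_str)
    datetime.strptime(day, "%Y-%m-%d")  # raises ValueError if malformed
    if direction == ">=":
        return f"reviewDate:[{day}T00:00:00Z TO *]"
    # Assumption: 'on or before' includes the whole of that day.
    return f"reviewDate:[* TO {day}T23:59:59Z]"


def score_filter(score_str):
    """Turn e.g. '<= 3' into a SOLR range on reviewScore."""
    direction, raw = parse_threshold(score_str)
    score = int(raw)
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    if direction == ">=":
        return f"reviewScore:[{score} TO *]"
    return f"reviewScore:[* TO {score}]"
```

Square brackets in SOLR range syntax are inclusive, which matches the "on or after" and "three or less" wording above.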

#### Attributes returned
The following document attributes should be returned when *do_query* is called on a dictionary provided by *review_query_dictionary*
* productName
* productID
* reviewSummary
* reviewDate
* reviewScore


In [10]:
# Takes three input strings as defined above, and produces a dictionary
# with key/value pairs that can be transformed into the SOLR select query 
# that implements review search.


# Your Code Here -- REQUIRED FUNCTION
def review_query_dictionary(kwd_str="", date_str="", score_str=""):
    return None


### ASIN Search

<pre>
asin_query_dictionary(asin) -> dictionary
</pre>

This search returns all reviews associated with a particular ASIN (product ID).  Remember that the match should be case insensitive, so for ASINs with letters, the corresponding lower-cased ASIN string should match.

#### Attributes returned
The following document attributes are returned
* reviewSummary
* reviewBody
* reviewDate
* reviewScore

In [15]:
# Takes a string storing a possible ASIN, and returns a dictionary
# with key/value pairs that can be transformed into the SOLR select query 
# that implements asin search.

# Your Code Here -- REQUIRED FUNCTION
def asin_query_dictionary(asin):
    return None

### Product Name Search

<pre>
product_name_query_dictionary(name_terms) -> dictionary
</pre>

This search returns all reviews associated with all products matching the name terms.  Remember the matching criterion is that all words in the name_terms must appear in that order in the product name, but the match should be case insensitive, ignore stop words, and remove possessives.

#### Attributes returned
The following document attributes are returned
* productName
* productID
* reviewSummary
* reviewDate
* reviewScore
* id


In [20]:
# Takes a string with product name terms, and returns a dictionary
# with key/value pairs that can be transformed into the SOLR select query 
# that implements product name search.

# Your Code Here -- REQUIRED FUNCTION
def product_name_query_dictionary(name_terms):
    return None

### ID Search

<pre>
id_query_dictionary(uuid) -> dictionary
</pre>

This search returns the review associated with the UUID provided as input, if such document exists.

#### Attributes returned
The following document attributes are returned
* productName
* productID
* reviewSummary
* reviewDate
* reviewScore
* id

In [23]:
# Takes a string storing a possible ID (UUID), and returns a dictionary
# with key/value pairs that can be transformed into the SOLR select query 
# that implements ID search.

# Your Code Here -- REQUIRED FUNCTION

def id_query_dictionary(id):
    return None

#### This is the main entry point to your search engine.  

Its first argument will be the result of a call to one of the four "dictionary" functions above
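
To illustrate what "convert the dictionary into query parameters" means, here is a minimal sketch that builds the select URL from a dictionary and pulls the document list out of a SOLR JSON response.  The names `build_select_url` and `extract_docs` are hypothetical, and the `localhost` host is an assumption;  no request is actually sent.

```python
from urllib.parse import urlencode


def build_select_url(params, port="8983", collection="amzn-reviews"):
    """Format a parameter dictionary as a SOLR select URL.

    Hypothetical helper; assumes SOLR is reachable on localhost.
    """
    base = f"http://localhost:{port}/solr/{collection}/select"
    # urlencode escapes characters such as ':' and spaces in the values.
    return f"{base}?{urlencode(params)}"


def extract_docs(payload):
    """Return the list of documents from a parsed SOLR JSON response."""
    return payload.get("response", {}).get("docs", [])
```

When using the *requests* library, passing the dictionary as `params=` to `requests.get` performs the same encoding, and `extract_docs` would be applied to the parsed JSON body.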

In [9]:
import requests

# Takes as input a dictionary as returned by the functions above, formats the 
# dictionary as a SOLR GET call. It makes the select request to SOLR.  If the return
# status is 200, returns the JSON response -- a list of documents.  If the return status is different from 200, 
# raise an exception that includes the status code. 

# Your Code Here -- REQUIRED FUNCTION
def do_query(params, port="8983", collection="amzn-reviews"):
    return None
