# Metrics to consider

1. **The ratio `(#different products)/(total #reviews)`.** Ideally we would like to have several reviews per product, so this ratio should be low.

1. **Similar products should be viable alternatives for the buyers.** E.g. a hair lotion for dry hair would not replace a hair lotion for greasy hair, whereas different shovels could be considered alternative choices. Unfortunately this is hard to extract automatically, so it has to be decided manually.

1. **Data skewness in favor of the ratings 1 and 5.** It would be easier to answer our questions when we have many 5's and 1's, so the share of these two ratings should be high.

1. **The dataset should load quickly.** To keep feedback loops short, at least in the beginning, we need to pick small datasets. We can still consider the large datasets, as long as we use only a random sample of them.

1. **Existing bibliography.** Datasets which have already been used by others are preferred, since we can get benchmarks, exploratory analysis data and notebook kernels that we can reuse and extend.
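
To make metrics 1 and 3 concrete, here is a minimal sketch in plain Python. The review tuples are made up purely for illustration; only the field names `asin` (product id) and `overall` (rating) come from the dataset, and the variable names are hypothetical.

```python
from collections import Counter

# Made-up (asin, overall) pairs, for illustration only.
reviews = [
    ("A1", 5.0), ("A1", 1.0), ("A1", 5.0), ("A1", 4.0),
    ("B2", 5.0), ("B2", 3.0), ("B2", 1.0), ("B2", 5.0),
]

total = len(reviews)
distinct_products = len({asin for asin, _ in reviews})

# Metric 1: distinct products over total reviews.
# A smaller value means more reviews per product.
products_per_review = distinct_products / total
reviews_per_product = total / distinct_products

# Metric 3: share of reviews rated 1 or 5.
rating_counts = Counter(rating for _, rating in reviews)
extreme_share = (rating_counts[1.0] + rating_counts[5.0]) / total

print(products_per_review, reviews_per_product, extreme_share)
```

With this toy data there are 2 products and 8 reviews, so the ratio is 0.25 (4 reviews per product), and 6 of the 8 reviews are rated 1 or 5.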

## Queries to extract the metrics

In [18]:
def percentage(some_float):
    return '%i%%' % int(100 * some_float)
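
One thing to keep in mind when reading the resulting table: `percentage` truncates instead of rounding, so floating-point error can shift a value down by one point. The sketch below compares it with a rounding variant; `rounded_percentage` is a hypothetical helper that is not part of the notebook.

```python
def percentage(some_float):
    # Same helper as in the cell above: int() truncates toward zero.
    return '%i%%' % int(100 * some_float)

def rounded_percentage(some_float):
    # Hypothetical variant: rounds to the nearest whole percent instead.
    return '%i%%' % int(round(100 * some_float))

# 100 * 0.29 evaluates to 28.999999999999996 in binary floating point.
truncated = percentage(0.29)
rounded = rounded_percentage(0.29)
print(truncated, rounded)
```

Here the truncating helper yields `28%` while the rounding variant yields `29%`.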

In [19]:
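def average_review_per_product(dataframe, total_count):
    # Each product is identified by its 'asin'; one group per distinct product.
    number_of_distinct_products = dataframe.groupBy('asin').count().count()
    
    # Note: despite the key name, this is distinct products / total reviews
    # (metric 1), i.e. the inverse of the average number of reviews per product.
    return [('reviews_per_product', percentage(number_of_distinct_products / float(total_count)))]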

In [20]:
def percentages_per_rating(dataframe, total_count):
    # One row per rating value, with columns 'overall' and 'count'.
    rating_counts = (dataframe
         .groupBy('overall')
         .count()
         .collect())
    
    # Access the columns by name instead of relying on dict value order.
    return [ (str(int(row['overall'])), percentage(row['count'] / float(total_count)))
        for row in rating_counts ]

In [21]:
import re

def evaluate_metrics(dataframe, filename):
    # Raw string avoids invalid-escape warnings; the '.gz' suffix is optional.
    name = (re
      .search(r'^reviews_(.+)_5\.json(?:\.gz)?$', filename)
      .group(1)
      .replace('_', ' '))
    
    print("Evaluating '%s'" % name)
    
    total_count = dataframe.count()
    
    return dict([('dataset_name', name)] 
      + average_review_per_product(dataframe, total_count) 
      + percentages_per_rating(dataframe, total_count)
      + [('number_of_reviews', total_count)])
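
The name extraction in `evaluate_metrics` can be sketched in isolation as follows. Since `re.search` returns `None` when a filename does not match, calling `.group(1)` directly would raise an `AttributeError` for any stray file in the data directory. The helper `dataset_name` and the sample filenames are hypothetical; the filename pattern is an assumption inferred from the regular expression used above.

```python
import re

# Assumed filename pattern: reviews_<Dataset_Name>_5.json, optionally gzipped.
FILENAME_PATTERN = re.compile(r'^reviews_(.+)_5\.json(?:\.gz)?$')

def dataset_name(filename):
    """Return the readable dataset name, or None if the filename does not match."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        return None
    return match.group(1).replace('_', ' ')

# Hypothetical directory listing, including a file that should be skipped.
filenames = ['reviews_Amazon_Instant_Video_5.json.gz', 'reviews_Baby_5.json', '.DS_Store']
names = [dataset_name(f) for f in filenames]
print(names)
```

Filtering out the `None` entries before loading would let the comparison skip files that are not review datasets.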

In [22]:
import os
import pandas as pd

def show_metrics_comparison_matrix(data_directory):
    return (pd
        .DataFrame
        .from_dict(
            [ evaluate_metrics(
                    (spark
                         .read
                         .json(os.path.join(data_directory, filename))), 
                    filename)
                for filename in sorted(os.listdir(data_directory)) ])
        .set_index('dataset_name'))

In [None]:
show_metrics_comparison_matrix('./data/raw_data')

Evaluating 'Amazon Instant Video'
Evaluating 'Apps for Android'
Evaluating 'Automotive'
Evaluating 'Baby'
Evaluating 'Beauty'
Evaluating 'Cell Phones and Accessories'
Evaluating 'Clothing Shoes and Jewelry'
Evaluating 'Digital Music'
Evaluating 'Grocery and Gourmet Food'
Evaluating 'Health and Personal Care'
Evaluating 'Home and Kitchen'
