# Metrics to consider

1. **The ratio `(#different products)/(total #reviews)`.** Ideally we would like to have several reviews per product, so this ratio should be low.

1. **Similar products should be plausible alternatives for the buyers.** For example, a hair lotion for dry hair would not replace a hair lotion for greasy hair, whereas different shovels could be considered alternative choices. Unfortunately this is hard to extract automatically, so it has to be decided manually.

1. **Data skewness in favor of the ratings 1 and 5.** Our questions are easier to answer when there are many 5's and 1's, so the share of these two ratings should be high.

1. **The dataset should load instantly.** To keep feedback loops short, at least in the beginning, we need to pick small datasets. We can still consider the large datasets, as long as we use only a random sample of them.

1. **Existing bibliography.** Datasets that have already been used by others are preferred, since we get benchmarks, exploratory analysis data and notebook kernels that we can reuse and extend.
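
The two quantitative metrics in the list above can be illustrated on a handful of records before running anything on Spark. This is only a minimal sketch in plain Python: the toy `reviews` list and the helper names `products_to_reviews_ratio` and `extreme_rating_share` are hypothetical, and only the field names `asin` and `overall` are taken from the queries in this notebook.

```python
from collections import Counter

# Hypothetical toy reviews; 'asin' (product id) and 'overall' (rating)
# mirror the field names used by the queries in this notebook.
reviews = [
    {'asin': 'A1', 'overall': 5.0},
    {'asin': 'A1', 'overall': 1.0},
    {'asin': 'A1', 'overall': 4.0},
    {'asin': 'B2', 'overall': 5.0},
    {'asin': 'B2', 'overall': 3.0},
    {'asin': 'C3', 'overall': 5.0},
    {'asin': 'C3', 'overall': 5.0},
    {'asin': 'C3', 'overall': 2.0},
]

def products_to_reviews_ratio(reviews):
    # (#different products) / (total #reviews)
    distinct_products = len({review['asin'] for review in reviews})
    return distinct_products / float(len(reviews))

def extreme_rating_share(reviews):
    # Share of reviews rated exactly 1 or exactly 5.
    counts = Counter(int(review['overall']) for review in reviews)
    return (counts[1] + counts[5]) / float(len(reviews))

ratio = products_to_reviews_ratio(reviews)  # 3 products / 8 reviews = 0.375
share = extreme_rating_share(reviews)       # (1 + 4) / 8 = 0.625
```

Here a ratio of 0.375 corresponds to roughly 2.7 reviews per product, and 62.5% of the toy reviews carry a rating of 1 or 5.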

## Queries to extract the metrics

In [70]:
def percentage(some_float):
    return '%i%%' % int(100 * some_float)
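
One detail worth keeping in mind when reading the results: `percentage` converts with `int()`, which truncates rather than rounds, so the per-rating percentages of a dataset need not add up to 100%. The sketch below repeats the helper to stay self-contained and compares it with `rounded_percentage`, a hypothetical alternative that is not part of the notebook.

```python
def percentage(some_float):
    # Same formatting as the notebook helper: int() truncates toward zero.
    return '%i%%' % int(100 * some_float)

def rounded_percentage(some_float):
    # Hypothetical alternative: round to the nearest integer instead.
    return '%i%%' % round(100 * some_float)

truncated = percentage(0.999)        # '99%'
rounded = rounded_percentage(0.999)  # '100%'
```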

In [71]:
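def average_review_per_product(dataframe, total_count):
    # Despite the name, this is the share of distinct products among all
    # reviews (#products / #reviews): lower means more reviews per product.
    number_of_distinct_products = dataframe.select('asin').distinct().count()

    return [('reviews_per_product', percentage(number_of_distinct_products / float(total_count)))]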

In [72]:
def percentages_per_rating(dataframe, total_count):
    # Read both fields by name so that the (rating, count) pairing
    # does not depend on dict ordering.
    rating_counts = (dataframe
         .groupBy('overall')
         .count()
         .rdd
         .map(lambda row: (row['overall'], row['count']))
         .collect())

    return [ (str(int(rating)), percentage(rating_count / float(total_count)))
        for rating, rating_count
        in rating_counts ]

In [73]:
def evaluate_metrics(dataframe, filename):
    print('Evaluating \'%s\'' % filename.split('.')[0])
    
    total_count = dataframe.count()
    
    return dict([('dataset_name', filename)] 
      + average_review_per_product(dataframe, total_count) 
      + percentages_per_rating(dataframe, total_count)
      + [('number_of_reviews', total_count)])

In [74]:
import os
import pandas as pd

def show_metrics_comparison_matrix(data_directory):
    # `spark` is assumed to be an existing SparkSession.
    # Builds one row of metrics per dataset file, indexed by file name.
    return (pd
        .DataFrame(
            [ evaluate_metrics(
                    (spark
                         .read
                         .json(os.path.join(data_directory, filename))),
                    filename)
                for filename in sorted(os.listdir(data_directory)) ])
        .set_index('dataset_name'))
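
The metrics list asks for short feedback loops and allows large datasets as long as only a random sample of them is used. One possible way to draw such a sample without loading a whole file is reservoir sampling, sketched below with the standard library only. The helper `sample_reviews` is hypothetical, not part of this notebook, and it assumes that each `.json.gz` file holds one JSON object per line.

```python
import gzip
import json
import random

def sample_reviews(path, sample_size, seed=0):
    """Return up to `sample_size` reviews drawn uniformly at random.

    Assumes `path` is a gzip-compressed file with one JSON object per line.
    """
    rng = random.Random(seed)
    sample = []
    with gzip.open(path, 'rt', encoding='utf-8') as lines:
        for index, line in enumerate(lines):
            if index < sample_size:
                # Fill the reservoir with the first `sample_size` reviews.
                sample.append(json.loads(line))
            else:
                # Reservoir sampling (Algorithm R): the current review
                # replaces a kept one with probability
                # sample_size / (index + 1).
                slot = rng.randint(0, index)
                if slot < sample_size:
                    sample[slot] = json.loads(line)
    return sample

# Example call (path assumed from the directory and file names used here):
# sample = sample_reviews('./data/raw_data/reviews_Automotive_5.json.gz', 1000)
```

The file is read once and only `sample_size` reviews are kept in memory. Within Spark, `DataFrame.sample` provides a comparable fraction-based sample.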

In [75]:
show_metrics_comparison_matrix('./data/raw_data')

Evaluating 'reviews_Amazon_Instant_Video_5'
Evaluating 'reviews_Apps_for_Android_5'
Evaluating 'reviews_Automotive_5'
Evaluating 'reviews_Baby_5'
Evaluating 'reviews_Beauty_5'
Evaluating 'reviews_Cell_Phones_and_Accessories_5'
Evaluating 'reviews_Clothing_Shoes_and_Jewelry_5'
Evaluating 'reviews_Digital_Music_5'
Evaluating 'reviews_Grocery_and_Gourmet_Food_5'
Evaluating 'reviews_Health_and_Personal_Care_5'
Evaluating 'reviews_Home_and_Kitchen_5'
Evaluating 'reviews_Kindle_Store_5'
Evaluating 'reviews_Musical_Instruments_5'
Evaluating 'reviews_Office_Products_5'
Evaluating 'reviews_Patio_Lawn_and_Garden_5'
Evaluating 'reviews_Pet_Supplies_5'
Evaluating 'reviews_Sports_and_Outdoors_5'
Evaluating 'reviews_Tools_and_Home_Improvement_5'
Evaluating 'reviews_Toys_and_Games_5'
Evaluating 'reviews_Video_Games_5'


dataset_name,1,2,3,4,5,dataset_size,reviews_per_product
reviews_Amazon_Instant_Video_5.json.gz,4%,5%,11%,22%,56%,37126,4%
reviews_Apps_for_Android_5.json.gz,10%,5%,11%,20%,51%,752937,1%
reviews_Automotive_5.json.gz,2%,2%,6%,19%,68%,20473,8%
reviews_Baby_5.json.gz,4%,5%,10%,20%,58%,160792,4%
reviews_Beauty_5.json.gz,5%,5%,11%,20%,57%,198502,6%
reviews_Cell_Phones_and_Accessories_5.json.gz,6%,5%,11%,20%,55%,194439,5%
reviews_Clothing_Shoes_and_Jewelry_5.json.gz,4%,5%,10%,20%,58%,278677,8%
reviews_Digital_Music_5.json.gz,4%,4%,10%,25%,54%,64706,5%
reviews_Grocery_and_Gourmet_Food_5.json.gz,3%,5%,11%,21%,57%,151254,5%
reviews_Health_and_Personal_Care_5.json.gz,4%,4%,9%,19%,61%,346355,5%
