# Metrics to consider

1. **The ratio `(#different products)/(total #reviews)`.** Ideally we would like to have several reviews per product, so this ratio should be low.

1. **Similar products should be plausible alternatives for the buyer.** E.g. a hair lotion for dry hair would not replace a hair lotion for greasy hair, whereas different shovels could be considered alternative choices. Unfortunately this is hard to extract automatically, so it has to be decided manually.

1. **Data skewness in favor of the ratings 1 and 5.** Our questions are easier to answer when we have many 5's and 1's, so the combined share of these two ratings should be high.

1. **The dataset should load quickly.** To keep feedback loops short, at least in the beginning, we need to pick small datasets. We can still consider the large datasets, as long as we use only a random sample of them.

1. **Existing bibliography.** Datasets that have already been used by others are preferred, since they give us benchmarks, exploratory analyses and notebook kernels that we can reuse and extend.
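The random-sample idea for large datasets can be sketched with reservoir sampling, which draws a fixed-size uniform sample in a single pass without holding the whole file in memory. This is only one possible approach and not something the notebook itself does; `reservoir_sample` and its parameters are hypothetical names introduced here for illustration. Once the data is already in Spark, `DataFrame.sample` is an alternative.

```python
import random

def reservoir_sample(items, sample_size, seed=0):
    """Return up to `sample_size` items drawn uniformly at random from an iterable.

    Hypothetical helper (not part of the original notebook). It reads the
    input once and keeps only `sample_size` items in memory.
    """
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(items):
        if index < sample_size:
            # Fill the reservoir with the first `sample_size` items.
            sample.append(item)
        else:
            # Keep the new item with probability sample_size / (index + 1),
            # replacing a uniformly chosen item of the reservoir.
            slot = rng.randint(0, index)
            if slot < sample_size:
                sample[slot] = item
    return sample

# Stand-in for the lines of a large review file.
sample = reservoir_sample(range(1000), 10)
```

The same function could be fed the lines of a gzipped review file opened with `gzip.open(path, 'rt')`, so only the sampled lines need to be parsed.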

## Queries to extract the metrics

In [1]:
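def average_review_per_product(dataframe, total_count):
    # Count the distinct product ids (`asin`).
    number_of_distinct_products = dataframe.select('asin').distinct().count()
    
    # Despite its name, this metric is the ratio of distinct products to
    # reviews, so a lower value means more reviews per product.
    return [('reviews_per_product', number_of_distinct_products / float(total_count))]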

In [2]:
def percentages_per_rating(dataframe, total_count):
    # Read each column by name instead of relying on the order of
    # `row.asDict().values()`, which is not guaranteed.
    rating_counts = (dataframe
         .groupBy('overall')
         .count()
         .rdd
         .map(lambda row: (row['overall'], row['count']))
         .collect())
    
    return [ (str(int(rating)), rating_count / float(total_count))
        for rating, rating_count
        in rating_counts ]
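
To make the two metrics concrete, here is a minimal plain-Python sketch that computes the same quantities on a tiny in-memory list, without Spark. The reviews are invented for illustration and `toy_metrics` is a hypothetical helper; only the field names `asin` and `overall` come from the queries above.

```python
from collections import Counter

# Toy reviews, invented for illustration; `asin` and `overall` mirror the
# field names used in the Spark queries.
toy_reviews = [
    {'asin': 'A1', 'overall': 5.0},
    {'asin': 'A1', 'overall': 1.0},
    {'asin': 'A2', 'overall': 5.0},
    {'asin': 'A2', 'overall': 4.0},
]

def toy_metrics(reviews):
    total_count = len(reviews)
    # Distinct products divided by reviews (the 'reviews_per_product' key).
    distinct_products = len({review['asin'] for review in reviews})
    metrics = {'reviews_per_product': distinct_products / float(total_count)}
    # Share of each rating among all reviews.
    rating_counts = Counter(int(review['overall']) for review in reviews)
    for rating, rating_count in sorted(rating_counts.items()):
        metrics[str(rating)] = rating_count / float(total_count)
    return metrics

toy_result = toy_metrics(toy_reviews)
# toy_result == {'reviews_per_product': 0.5, '1': 0.25, '4': 0.25, '5': 0.5}
```

Two products over four reviews give a ratio of 0.5, i.e. two reviews per product on average. Ratings that never occur (2 and 3 here) produce no key at all, which is also how the `groupBy` version behaves.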

In [3]:
import re

def evaluate_metrics(dataframe, filename):
    name = (re
      .search(r'^reviews_(.+)_5\.json\.gz$', filename)
      .group(1)
      .replace('_', ' '))
    
    print(name)
    
    total_count = dataframe.count()
    
    return dict([('dataset_name', name)] 
      + average_review_per_product(dataframe, total_count) 
      + percentages_per_rating(dataframe, total_count)
      + [('number_of_reviews', total_count)])
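
The name extraction can be tried on its own. The sketch below assumes file names of the form `reviews_<Category>_5.json.gz`; the example file name is a guess reconstructed from the printed dataset names, and `dataset_name_from_filename` is a hypothetical helper using a slightly stricter pattern than the cell above.

```python
import re

def dataset_name_from_filename(filename):
    # Capture the category between 'reviews_' and '_5.json.gz'.
    match = re.search(r'^reviews_(.+)_5\.json\.gz$', filename)
    if match is None:
        # Not a review file (for example a hidden system file).
        return None
    return match.group(1).replace('_', ' ')

# Assumed file name, for illustration only.
example_name = dataset_name_from_filename('reviews_Amazon_Instant_Video_5.json.gz')
# example_name == 'Amazon Instant Video'
```

Returning `None` for a non-matching name is a design choice of this sketch. In `evaluate_metrics`, `re.search` returns `None` for such a file and the chained `.group(1)` call raises an `AttributeError`, so the data directory should contain only review files.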

## Extract the metrics from all the data files of a given directory into a pandas dataframe

In [4]:
import os
import pandas as pd

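def extract_metrics_from_directory(data_directory):
    # `spark` is assumed to be the SparkSession provided by the notebook environment.
    # Each file yields one dict of metrics, i.e. one row of the result.
    return (pd
        .DataFrame
        .from_records(
            [ evaluate_metrics(
                    (spark
                         .read
                         .json(os.path.join(data_directory, filename))), 
                    filename)
                for filename in sorted(os.listdir(data_directory)) ])
        .set_index('dataset_name'))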

metrics = extract_metrics_from_directory('./data/raw_data')
metrics.to_csv('./metadata/initial-data-evaluation-metrics.csv')

Amazon Instant Video
Apps for Android
Automotive
Baby
Beauty
Cell Phones and Accessories
Clothing Shoes and Jewelry
Digital Music
Grocery and Gourmet Food
Health and Personal Care
Home and Kitchen
Kindle Store
Musical Instruments
Office Products
Patio Lawn and Garden
Pet Supplies
Sports and Outdoors
Tools and Home Improvement
Toys and Games
Video Games


## Print a metrics comparison matrix

In [5]:
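def percentage(some_float):
    # Truncate (not round) to a whole percentage, e.g. 0.567 -> '56%'.
    return '%i%%' % int(100 * some_float)

def show_metrics_comparison_matrix(dataframe):
    # Select columns by label so the result does not depend on column order.
    rating_columns = [ str(rating) for rating in range(1, 6) ]
    return pd.DataFrame(dict(
        [ (column, dataframe[column].map(percentage)) for column in rating_columns ]
        + [ ('number_of_reviews', dataframe['number_of_reviews'].astype(int)),
            ('reviews_per_product', dataframe['reviews_per_product'].map(percentage)) ]))

show_metrics_comparison_matrix(metrics)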

| dataset_name | 1 | 2 | 3 | 4 | 5 | number_of_reviews | reviews_per_product |
|---|---|---|---|---|---|---|---|
| Amazon Instant Video | 4% | 5% | 11% | 22% | 56% | 37126 | 4% |
| Apps for Android | 10% | 5% | 11% | 20% | 51% | 752937 | 1% |
| Automotive | 2% | 2% | 6% | 19% | 68% | 20473 | 8% |
| Baby | 4% | 5% | 10% | 20% | 58% | 160792 | 4% |
| Beauty | 5% | 5% | 11% | 20% | 57% | 198502 | 6% |
| Cell Phones and Accessories | 6% | 5% | 11% | 20% | 55% | 194439 | 5% |
| Clothing Shoes and Jewelry | 4% | 5% | 10% | 20% | 58% | 278677 | 8% |
| Digital Music | 4% | 4% | 10% | 25% | 54% | 64706 | 5% |
| Grocery and Gourmet Food | 3% | 5% | 11% | 21% | 57% | 151254 | 5% |
| Health and Personal Care | 4% | 4% | 9% | 19% | 61% | 346355 | 5% |
