# Importing the large datasets to a postgresql server

It is not possible to load the larger data sets in the memory of a local machine therefeore an alternative is to import them to a psql table and query them from there. By adding the right indices this can make the queries fast enough. After this import one can extract some basic statistics using sql and also export smaller portions of the data which can be handled by spark or pandas on a local machine.

## Unzipping the data and converting it to csv format

Unfortunately psql does not support an import of record json files therefore we need to convert the data sets to csv. We use here the command line tool [json2csv](https://github.com/jehiah/json2csv).

**WARNING:** The following two commands will run for a while, especially the second one. You can expect approximately 1 minute per GB of unzipped data.

In [None]:
!ls ./data/large-datasets/*.gz | grep -Po '.*(?=.gz)' | xargs -I {} gunzip {}.gz {}.json

In [130]:
!ls ./data/large-datasets/*.json | grep -Po '.*(?=.json)' | xargs -I {} json2csv -p -d '|' -k asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime -i {}.json -o {}.csv
!rm ./data/large-datasets/*.json

[]

## Importing the data in psql

To import the data in psql we create a table with the appropriate shape and import form the csv files generated above.

### Some preparation to run psql transactions and queries in python

In [142]:
import psycopg2 as pg
import pandas as pd

db_conf = { 
    'user': 'mariosk',
    'database': 'amazon_reviews'
}

connection_factory = lambda: pg.connect(user=db_conf['user'], database=db_conf['database'])

def transaction(*statements):
    try:
        connection = connection_factory()
        cursor = connection.cursor()
        for statement in statements:
            cursor.execute(statement)
        connection.commit()
        cursor.close()
    except pg.DatabaseError as error:
        print(error)
    finally:
        if connection is not None:
            connection.close()
    
def query(statement):
    try:
        connection = connection_factory()
        cursor = connection.cursor()
        cursor.execute(statement)
        
        header = [ description[0] for description in cursor.description ]
        rows = cursor.fetchall()
        
        cursor.close()
        return pd.DataFrame.from_records(rows, columns=header)
    except (Exception, pg.DatabaseError) as error:
        print(error)
        return None
    finally:
        if connection is not None:
            connection.close()

### Creating tables for with indices for the large datasets

In [167]:
import re

filenames = [ re.search('reviews_(.*)_5.csv', filename).group(1) 
    for filename 
    in sorted(os.listdir('./data/large-datasets'))]

In [166]:
def create_table(name):
    transaction(
        'create table %s (asin text, helpful text, overall double precision, reviewText text, reviewTime text, reviewerID text, reviewerName text, summary text, unixReviewTime int);' % name,
        'create index {0}asin ON {0} (asin);'.format('name'),
        'create index {0}overall ON {0} (overall);'.format('name'),
        'create index {0}reviewerID ON {0} (reviewerID);'.format('name'),
        'create index {0}unixReviewTime ON {0} (unixReviewTime);'.format('name'))

for filename in filenames:
    create_table(filename)

relation "books" already exists

relation "cds_and_vinyl" already exists

relation "electronics" already exists

relation "movies_and_tv" already exists



### Importing the datasets to psql

**WARNING:** The following command will take long time to complete. Estimate ~1 minute for each GB of csv data.

In [148]:
!ls ./data/large-datasets | grep -Po '(?<=reviews_).*(?=_5.csv)' | xargs -I {} psql -U mariosk -d amazon_reviews -c "\copy {} from './data/large-datasets/reviews_{}_5.csv' with (format csv, delimiter '|', header true);"

ERROR:  extra data after last expected column
CONTEXT:  COPY books, line 8791050: "B00JL1H75A|[0 0]|5|Love this series! I even like Elijah. He has been through a lot and it sounds lik..."
COPY 1097592
COPY 1689188
COPY 1697533


## Querying the metrics

In [165]:
def average_reviews_per_product(table_name):
    return (query('''
        with distinct_products as (select count(distinct asin) as products from {0}),
             reviews_count as (select cast(count(*) as double precision) as reviews from {0})
        select reviews / products as reviews_per_product
        from distinct_products cross join reviews_count
    '''.format(table_name))
    .rename(index={0: table_name}))

In [162]:
def average_reviews_per_reviewer(table_name):
    return (query('''
        with distinct_reviewers as (select count(distinct reviewerID) as reviewers from {0}),
             reviews_count as (select cast(count(*) as double precision) as reviews from {0})
        select reviews / reviewers as reviews_per_reviewer
        from distinct_reviewers cross join reviews_count
    '''.format(table_name))
    .rename(index={ 0: table_name}))

In [161]:
def percentages_per_rating(table_name):
    return (query('''
            with rating_counts as (select count(overall) as rating_count from {0} group by overall),
                 reviews_count as (select cast(count(*) as double precision) as reviews from {0})
            select rating_count / reviews as row
            from rating_counts cross join reviews_count
        '''.format(table_name))
        .transpose()
        .rename(index={'row': table_name}, columns=lambda x: str(x + 1)))

In [164]:
def all_metrics(table_name):
    print(table_name)
    
    return pd.concat(
        [ f(table_name) 
            for f
            in [ percentages_per_rating, average_reviews_per_product, average_reviews_per_reviewer ]], 
        axis=1)

In [170]:
metrics = pd.concat([ all_metrics(table) for table in filenames if table != 'Books' ])

CDs_and_Vinyl
Electronics
Movies_and_TV


In [171]:
metrics.to_csv('./metadata/large-datasets-evaluation-metrics.csv')

In [172]:
metrics

Unnamed: 0,1,2,3,4,5,reviews_per_product,reviews_per_reviewer
CDs_and_Vinyl,0.224424,0.04243,0.042088,0.09277,0.598288,17.031982,14.58439
Electronics,0.048626,0.205448,0.064365,0.084216,0.597344,26.812082,8.779427
Movies_and_TV,0.225618,0.060329,0.118585,0.061394,0.534074,33.915388,13.6942
