# Importing the large datasets into a PostgreSQL server

The larger datasets do not fit into the memory of a local machine, so an alternative is to import them into a PostgreSQL table and query them there. With the right indices, the queries can be made fast enough. After the import, one can extract basic statistics with SQL and export smaller portions of the data that Spark or pandas can handle on a local machine.

## Unzipping the data and converting it to csv format

Unfortunately, PostgreSQL cannot import record-per-line JSON files directly, so we first convert the datasets to CSV. Here we use the command-line tool [json2csv](https://github.com/jehiah/json2csv).

In [2]:
!gunzip ./data/large-datasets/reviews_CDs_and_Vinyl_5.json.gz


In [6]:
!json2csv -p -d '|' -k asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime -i ./data/large-datasets/reviews_CDs_and_Vinyl_5.json -o ./data/large-datasets/reviews_CDs_and_Vinyl_5.csv
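
If json2csv is not installed, roughly the same conversion can be done with Python's standard library. The following is a minimal sketch, not the tool used above: the function name `json_lines_to_csv` is hypothetical, it assumes one JSON object per line, and it serializes list values such as `helpful` as JSON text (`[14, 15]`), so that column may look slightly different from what json2csv writes.

```python
import csv
import json

KEYS = ['asin', 'helpful', 'overall', 'reviewText', 'reviewTime',
        'reviewerID', 'reviewerName', 'summary', 'unixReviewTime']

def json_lines_to_csv(src_path, dst_path, keys=KEYS, delimiter='|'):
    """Convert a file with one JSON object per line to a delimited CSV file.

    Hypothetical helper for illustration; returns the number of records written.
    """
    count = 0
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        writer.writerow(keys)  # header row, so the import can skip it
        for line in src:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            row = []
            for key in keys:
                value = record.get(key, '')  # missing keys become empty fields
                if isinstance(value, (list, dict)):
                    value = json.dumps(value)  # assumption: nested values kept as JSON text
                row.append(value)
            writer.writerow(row)
            count += 1
    return count
```

The `csv` module quotes any field that contains the delimiter, a quote character, or a newline, which matters here because review texts are free-form.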

## Importing the data into PostgreSQL

To import the data into PostgreSQL, we create a table with the appropriate columns and import from the CSV files generated above.

### Preparation for running PostgreSQL transactions and queries from Python

In [61]:
import psycopg2 as pg
import pandas as pd

db_conf = { 
    'user': 'mariosk',
    'database': 'amazon_reviews'
}

connection_factory = lambda: pg.connect(user=db_conf['user'], database=db_conf['database'])

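def transaction(*statements):
    """Run all statements in a single transaction and commit at the end."""
    connection = None  # defined up front so `finally` works if connecting fails
    try:
        connection = connection_factory()
        cursor = connection.cursor()
        for statement in statements:
            cursor.execute(statement)
        connection.commit()
        cursor.close()
    except pg.DatabaseError as error:
        # Nothing was committed; closing the connection discards the changes.
        print(error)
    finally:
        if connection is not None:
            connection.close()

def query(statement):
    """Run a single query and return the result as a DataFrame (None on error)."""
    connection = None  # defined up front so `finally` works if connecting fails
    try:
        connection = connection_factory()
        cursor = connection.cursor()
        cursor.execute(statement)

        # Column names are the first field of each cursor description entry.
        header = [description[0] for description in cursor.description]
        rows = cursor.fetchall()

        cursor.close()
        return pd.DataFrame.from_records(rows, columns=header)
    except Exception as error:  # also covers pg.DatabaseError
        print(error)
        return None
    finally:
        if connection is not None:
            connection.close()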

### Creating tables with indices for the large datasets

In [30]:
transaction(
    'create table cds (asin text, helpful text, overall double precision, reviewText text, reviewTime text, reviewerID text, reviewerName text, summary text, unixReviewTime int);',
    'create index asin ON cds (asin);',
    'create index overall ON cds (overall);',
    'create index reviewerID ON cds (reviewerID);',
    'create index unixReviewTime ON cds (unixReviewTime);')
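
In PostgreSQL, indexes share a namespace with tables and other relations in the same schema, so short index names such as `asin` or `overall` can clash once a second dataset table is indexed the same way. The following is a minimal sketch of generating prefixed names instead; the helper `index_statements` and the `<table>_<column>_idx` naming scheme are assumptions for illustration, not part of the cell above.

```python
def index_statements(table, columns):
    """Build one `create index` statement per column, named <table>_<column>_idx.

    Column names are lowercased because PostgreSQL folds unquoted
    identifiers to lowercase anyway.
    """
    return [
        'create index if not exists {0}_{1}_idx on {0} ({1});'.format(table, column.lower())
        for column in columns
    ]

statements = index_statements('cds', ['asin', 'overall', 'reviewerID', 'unixReviewTime'])
# transaction(*statements)  # would run them with the helper defined earlier
```

For bulk loads it is usually faster to create the indices after the data has been imported, so these statements could also be run once the `\copy` step has finished.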

### Importing the datasets to psql

In [46]:
!psql -U mariosk -d amazon_reviews -c "\copy cds from './data/large-datasets/reviews_CDs_and_Vinyl_5.csv' with (format csv, delimiter '|', header true);"

COPY 1097592


## Querying the metrics

In [62]:
query('select * from cds limit 10')

| | asin | helpful | overall | reviewtext | reviewtime | reviewerid | reviewername | summary | unixreviewtime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0307141985 | [14 15] | 5.0 | I don't know who owns the rights to this wonde... | 10 6, 2005 | A3IEV6R2B7VW5Z | J. Anderson | LISTEN TO THE PUBLIC!!! | 1128556800 |
| 1 | 0307141985 | [2 2] | 4.0 | Thanksgiving is devoid of icons to make it a f... | 11 23, 2011 | A2H3ISQ4QB95XN | Joseph Brando | Rankin/Bass Does Thanksgiving!! | 1322006400 |
| 2 | 0307141985 | [38 38] | 5.0 | This is a Thanksgiving tale that begins with t... | 07 14, 2003 | A6GMEO3VRY51S | microjoe | Thanksgiving Holiday fun from Rankin/Bass | 1058140800 |
| 3 | 0307141985 | [15 16] | 5.0 | This is the BEST THANKSGIVING special around..... | 11 6, 2003 | A3E102F6LPUF1J | Richard J. Goldschmidt "Rick Goldschmidt" | BEST THANKSGIVING special out there! | 1068076800 |
| 4 | 0307141985 | [11 12] | 5.0 | It's been a number of years since I've seen Mo... | 03 1, 2006 | A2JP0URFHXP6DO | Tim Janson | A THANKSGIVING TRADITION | 1141171200 |
| 5 | 073890015X | [0 0] | 5.0 | ok I guess a little over 2 hours was not enoug... | 09 11, 2013 | A31GBCW6YPY9OW | Dave Childress | great late 90's concert | 1378857600 |
| 6 | 073890015X | [1 16] | 5.0 | I read one of the review's on here for woodsto... | 01 23, 2003 | A3QAV7LALVG1F7 | Dianne Papineau "Brock Papineau" | Future woodstock's will be better than the first | 1043280000 |
| 7 | 073890015X | [10 12] | 2.0 | I paid, I went, I saw, I experienced. Unfortu... | 03 10, 2000 | A1BFRIT70VHDF8 | Doogie the Audio Junkie "dackley" | 2% Of The Real Woodstock 99 | 952646400 |
| 8 | 073890015X | [0 0] | 5.0 | As a fan of filmed concerts, I avoided this on... | 04 13, 2009 | AEFGR6NFTNWH0 | Emile Pinsonneault | Don't Get Hung Up On The "Woodstock" Thing... | 1239580800 |
| 9 | 073890015X | [11 13] | 4.0 | I dock this DVD one star only because I wanted... | 06 14, 2000 | A24GD1AWG77IDJ | E. Uthman "Ed" | Many great performances; superb audio and video | 960940800 |
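
Basic statistics of the kind mentioned in the introduction can be computed with aggregate queries. The statement below counts reviews per star rating. So that the sketch runs without a server, it executes the statement against an in-memory SQLite table filled with three columns of the ten rows shown above; this stand-in is an assumption for illustration only. Against the PostgreSQL table, the same string would be passed to `query`.

```python
import sqlite3

# Number of reviews per star rating; plain SQL that PostgreSQL also accepts.
RATING_COUNTS = (
    'select overall, count(*) as n_reviews '
    'from cds group by overall order by overall'
)

# (asin, overall, unixReviewTime) taken from the ten rows printed above.
SAMPLE_ROWS = [
    ('0307141985', 5.0, 1128556800),
    ('0307141985', 4.0, 1322006400),
    ('0307141985', 5.0, 1058140800),
    ('0307141985', 5.0, 1068076800),
    ('0307141985', 5.0, 1141171200),
    ('073890015X', 5.0, 1378857600),
    ('073890015X', 5.0, 1043280000),
    ('073890015X', 2.0, 952646400),
    ('073890015X', 5.0, 1239580800),
    ('073890015X', 4.0, 960940800),
]

# Stand-in database so the example is runnable locally (not the real server).
connection = sqlite3.connect(':memory:')
connection.execute('create table cds (asin text, overall real, unixReviewTime integer)')
connection.executemany('insert into cds values (?, ?, ?)', SAMPLE_ROWS)
rating_counts = connection.execute(RATING_COUNTS).fetchall()
connection.close()

print(rating_counts)
# On the server: query(RATING_COUNTS)
```

Because `overall` is indexed, a grouping like this is the sort of query the indices created earlier are meant to support on the full table.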
