<div style="font-size: 200%; font-weight: bold; color: gray; padding-bottom: 20px">Loading Data into Hive</div>
Our Yelp data sets are stored in JSON format. They include nested structures which cannot be directly translated into SQL/Hive tables.

In some cases we have to produce multiple tables from a single data set and then join them in queries. Alternatively, we may have to replicate certain values across rows to generate a "flat" table. Sometimes *proper database normalization* and *analysis tools* are at odds...

To learn more about database normalization go to https://en.wikipedia.org/wiki/Database_normalization
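
The two options described above can be sketched in a few lines of Python. This is only an illustration: the sample record and the helper names `normalized_rows` and `flat_rows` are hypothetical, and the `friends` list stands in for any nested structure in the data.

```python
import json

# Hypothetical record; the values are made up for illustration only.
record = json.loads('{"user_id": "u1", "name": "Ann", "friends": ["u2", "u3"]}')

def normalized_rows(r):
    """Return (user_rows, friend_rows): two tables linked by user_id."""
    user_rows = [(r["user_id"], r["name"])]
    friend_rows = [(r["user_id"], f) for f in r["friends"]]
    return user_rows, friend_rows

def flat_rows(r):
    """Return one row per friend, repeating the user's own values."""
    return [(r["user_id"], r["name"], f) for f in r["friends"]]

users, friends = normalized_rows(record)
flat = flat_rows(record)
```

The normalized form stores each user's name once and needs a join on `user_id` at query time. The flat form needs no join, but it repeats the name on every row, and a user without friends would produce no row at all.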

# 'user' data set

The JSON schema

We want to create the following Hive table:

In [None]:
%%sql
CREATE TABLE IF NOT EXISTS yelp_user (
    user_id STRING,
    name STRING,
    review_count INT,
    average_stars DOUBLE,
    votes INT,
    yelping_since DATE,
    fans INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/pmolnar/yelp/hivetables/user'
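
To make the row format concrete, here is a small Python sketch of how a comma-delimited line lines up with the table's columns. It is not Hive code. It mimics the behavior as we understand it: fields are assigned by position, and missing trailing columns read as NULL (shown here as `None`). The helper name `split_row` is hypothetical.

```python
# Column names copied from the CREATE TABLE statement above.
COLUMNS = ["user_id", "name", "review_count", "average_stars",
           "votes", "yelping_since", "fans"]

def split_row(line, delimiter=","):
    """Map one delimited text line onto the table's columns by position."""
    values = line.rstrip("\n").split(delimiter)
    # pad short rows with None (standing in for NULL), drop surplus fields
    values = (values + [None] * len(COLUMNS))[:len(COLUMNS)]
    return dict(zip(COLUMNS, values))

row = split_row("u1,Ann,12,4.5")
shifted = split_row("u1,Smith, Ann,12,4.5")
```

The second call shows why a comma inside a value is a problem: the name is cut in two and every later column shifts one position to the right.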

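We need to write a MapReduce mapper script that transforms records from the 'user' data set into the above format. The script below emits only the first four columns (`user_id`, `name`, `review_count`, `average_stars`).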

In [None]:
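# %load user2csv_mpr.py
#!/usr/bin/env python

import sys
import json

# input comes from STDIN (standard input), one JSON record per line
for line in sys.stdin:
    try:
        r = json.loads(line)
        ##for f in r['friends']:
        ## printing COMMA separated lines from a list of values;
        ## numeric values must be converted to strings before joining
        fields = [r['user_id'], r['name'], r['review_count'], r['average_stars']]
        print(','.join(str(f) for f in fields))
    except (ValueError, KeyError):
        # skip lines that are not valid JSON or lack an expected field
        continue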
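
The per-record transformation can also be written as a function, which makes it easy to try on a single line before running a job. The sketch below is one possible variant, not the script used in this notebook: the helper name `record_to_row` and the sample record are hypothetical. It replaces commas inside values with spaces, on the assumption that the table is read with a plain `','` delimiter and no quoting.

```python
import json

# Fields emitted by the mapper, in table column order.
FIELDS = ["user_id", "name", "review_count", "average_stars"]

def record_to_row(line, delimiter=","):
    """Turn one JSON line into a delimited row, or None if it is unusable."""
    try:
        r = json.loads(line)
        values = [r[f] for f in FIELDS]
    except (ValueError, KeyError, TypeError):
        # invalid JSON, a missing field, or a record that is not an object
        return None
    # replace the delimiter inside values so the column count stays fixed
    return delimiter.join(str(v).replace(delimiter, " ") for v in values)

# Made-up sample record for illustration only.
sample = '{"user_id": "u1", "name": "Smith, Ann", "review_count": 12, "average_stars": 4.5}'
row = record_to_row(sample)
```

Replacing the delimiter changes the stored name slightly. Choosing a delimiter that cannot occur in the data, such as a tab, would be an alternative, but the `CREATE TABLE` statement would then have to use the same delimiter.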



... run MapReduce

In [None]:
%%sh
# use the current directory as location for program files
WD=`pwd`

OUTDIR=/user/$USER/yelp/output
OUTPUT=$OUTDIR/user2csv

# make sure output directory exists
hdfs dfs -mkdir -p $OUTDIR 

# make sure the output files don't exist
hdfs dfs -rm -r -f -skipTrash $OUTPUT

INPUT=/user/$USER/yelp/data/user/*
yarn \
    jar /usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar \
    -mapper "$WD/user2csv_mpr.py" \
    -input $INPUT \
    -output $OUTPUT

Now we can load the output of the MapReduce job into the Hive table.

In [None]:
%%sql
LOAD DATA INPATH '/user/pmolnar/yelp/output/user2csv/part-*' INTO TABLE yelp_user

Check it out

In [None]:
%%sql
SELECT * FROM yelp_user LIMIT 10