<div style="font-size: 200%; font-weight: bold; color: gray; padding-bottom: 20px">Loading Data into Hive</div>
Our Yelp data sets are stored in JSON format. They include nested structures which cannot be directly translated into SQL/Hive tables.

In some cases we have to produce multiple tables from a single data set and then join them in queries. Alternatively, we may have to replicate certain values across rows to generate a "flat" table. Sometimes *proper database normalization* and *analysis tools* are at odds...
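
To make the trade-off concrete, here is a minimal sketch of both options for a single nested record. The sample record and the `votes` field name are hypothetical, chosen to mirror the table names used later in this notebook; the real Yelp schema may differ.

```python
# Hypothetical nested record; field names are assumptions for illustration.
record = {
    "user_id": "u1",
    "name": "Kay",
    "votes": {"cool": 2, "funny": 0, "useful": 5},
}

def normalize(rec):
    """Split one nested record into two row sets that share user_id."""
    users = [(rec["user_id"], rec["name"])]
    user_votes = [(rec["user_id"], vote_type, count)
                  for vote_type, count in sorted(rec["votes"].items())]
    return users, user_votes

def flatten(rec):
    """Produce one flat row set; the user's name is replicated on every row."""
    return [(rec["user_id"], rec["name"], vote_type, count)
            for vote_type, count in sorted(rec["votes"].items())]

users, user_votes = normalize(record)
flat = flatten(record)
print(users)
print(user_votes)
print(flat)
```

The normalized form stores each name once but needs a join on `user_id` at query time. The flat form avoids the join at the cost of repeating `name` on every row.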

To learn more about database normalization go to https://en.wikipedia.org/wiki/Database_normalization

In [52]:
import numpy as np
import pandas as pd
import sqlalchemy as sa
%matplotlib inline
import matplotlib.pyplot as plt
%load_ext sql
%config SqlMagic.autolimit=200
%config SqlMagic.displaylimit=20

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [4]:
%%sql hive://backend-0-1:10000/pmolnar
SHOW TABLES

Done.


tab_name
users


# 'user' data set

The JSON schema

## Normalized tables
### Table: **users**

|user_id|name|review_count|average_stars|yelping_since|fans|
|-------|----|------------|-------------|-------------|----| 
| x | x |x  | x | x | x |
| x | x |x  | x | x | x |
| x | x |x  | x | x | x |

### Table: **user_votes**

|user_id|vote_type|count|
|-------|---------|-----|
| x| x | x |
| x| x | x |
| x| x | x |

### Table: **user_friends**

|user_id|friends_user_id|
|-------|---------------|
| x |x  |
| x |x  |
| x |x  |

### Table: **user_years_elite**

|user_id|year|
|-------|----|
| x |x  |
| x |x  |
| x |x  |

### Table: **user_compliments**

|user_id|compliment_type|count|
|-------|---------------|-----|
| x | x |x |
| x | x |x |
| x | x |x |

We want to create the following Hive table:

In [19]:
%%sql
CREATE TABLE IF NOT EXISTS users (
    user_id STRING,
    name STRING,
    review_count INT,
    average_stars DOUBLE,
    yelping_since DATE,
    fans INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

In [49]:
%%sh
hdfs dfs -ls -R /apps/hive/warehouse/pmolnar.db/

drwxr-xr-x   - pmolnar hdfs          0 2017-01-21 08:41 /apps/hive/warehouse/pmolnar.db/user_compliments
drwxr-xr-x   - pmolnar hdfs          0 2017-01-21 08:41 /apps/hive/warehouse/pmolnar.db/user_friends
drwxr-xr-x   - pmolnar hdfs          0 2017-01-21 08:41 /apps/hive/warehouse/pmolnar.db/user_votes
drwxr-xr-x   - pmolnar hdfs          0 2017-01-21 08:41 /apps/hive/warehouse/pmolnar.db/user_years_elite
drwxr-xr-x   - pmolnar hdfs          0 2017-01-21 08:52 /apps/hive/warehouse/pmolnar.db/users
-rwxr-xr-x   3 pmolnar hadoop    3450808 2017-01-21 00:49 /apps/hive/warehouse/pmolnar.db/users/part-00000
-rwxr-xr-x   3 pmolnar hadoop    2907271 2017-01-21 08:46 /apps/hive/warehouse/pmolnar.db/users/part-00000_copy_1
-rwxr-xr-x   3 pmolnar hadoop   24760070 2017-01-21 08:47 /apps/hive/warehouse/pmolnar.db/users/part-00000_copy_2
-rwxr-xr-x   3 pmolnar hadoop    6658543 2017-01-21 08:47 /apps/hive/warehouse/pmolnar.db/users/part-00000_copy_3
-rwxr-xr-x   3 pmolnar hadoop     500685 2017-0

In [50]:
%%sh
hdfs dfs -cat /apps/hive/warehouse/pmolnar.db/users/part-00000_copy_4 | head



--qhwKkTzgBeCH3wEJjg2g,2014	
--qhwKkTzgBeCH3wEJjg2g,2015	
--qhwKkTzgBeCH3wEJjg2g,2016	
-0nLkzsZsFiX3nE4UKw5vg,2011	
-0nLkzsZsFiX3nE4UKw5vg,2012	
-1Q1s_NMGjBLBULA8z_npg,2005	
-1Q1s_NMGjBLBULA8z_npg,2006	
-1Q1s_NMGjBLBULA8z_npg,2007	
-1ZSWpyW6Qf5gKeHUOLv6Q,2010	
-1zY3QZ4vS2wdqTfnFLs1Q,2006	


cat: Unable to write to output stream.


In [51]:
%%sql
SELECT * FROM users LIMIT 10

Done.


users.user_id,users.name,users.review_count,users.average_stars,users.yelping_since,users.fans
--2QZsyXGz1OhiD4-0FQLQ,Kay,7,4.86,2014-04-01,
--519Rh5sTtkoUraGzAaKQ,Eric,8,4.5,2014-12-01,
--80yFOfe6nZKLhxTMZjEg,Moe,8,4.12,2009-07-01,
--K8RaywcHmmFtIXIHKZJg,Susan,1,5.0,2013-04-01,
--LzFD0UDbYE-Oho3AhsOg,Shumai,133,3.9,2011-01-01,
--MJXewYKgIGpKvtfwBkfg,Jen,1,2.0,2014-04-01,
--VxRvXk3b8FwsSbC2Zpxw,B,41,4.44,2010-07-01,
--WHJIfhj7M-ntd65kUy7Q,Kadie,13,4.23,2010-07-01,
--ZBhtxi8VwI-x9GzCIyxw,Sharon,5,3.0,2012-04-01,
--ZNzQbjx8FdCuJAkjl_vA,Anita,2,5.0,2012-07-01,


We need to write a MapReduce mapper script that transforms records from the 'user' data set into the above format:

In [None]:
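# %load users2csv_mpr.py
#!/usr/bin/env python

import sys
import json

# input comes from STDIN (standard input), one JSON record per line
for line in sys.stdin:
    try:
        r = json.loads(line.strip())
        # numeric fields must be converted to strings before joining
        print(','.join([r['user_id'], r['name'], str(r['review_count']),
                        str(r['average_stars']), r['yelping_since'], str(r['fans'])]))
    except Exception:
        # skip records that cannot be parsed or that lack a field
        pass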
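
One caveat with a comma-delimited table: as far as we know, `ROW FORMAT DELIMITED` does not interpret quoted fields, so a comma inside `name` would shift every later column. Below is a minimal sketch of one way to guard against that, assuming it is acceptable to replace stray delimiters with spaces. `to_csv_row` and the sample record are hypothetical and not part of the original script.

```python
def to_csv_row(record, fields, delimiter=","):
    """Join the selected fields into one delimited line.

    Delimiters, tabs and line breaks inside a value are replaced with
    spaces so that every output line has exactly len(fields) columns.
    """
    cells = []
    for field in fields:
        text = str(record[field])
        for ch in (delimiter, "\n", "\r", "\t"):
            text = text.replace(ch, " ")
        cells.append(text)
    return delimiter.join(cells)

USER_FIELDS = ["user_id", "name", "review_count",
               "average_stars", "yelping_since", "fans"]

# Hypothetical record with a comma in the name.
sample = {"user_id": "u1", "name": "Smith, J", "review_count": 7,
          "average_stars": 4.86, "yelping_since": "2014-04", "fans": 0}
row = to_csv_row(sample, USER_FIELDS)
print(row)
```

Whatever the content of `name`, the resulting line always has exactly five commas, so the columns stay aligned with the table definition.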




... run MapReduce

In [53]:
%%sh
# use the current directory as location for program files
WD=`pwd`

OUTDIR=/user/$USER/yelp/output
OUTPUT=$OUTDIR/users2csv

# make sure output directory exists
hdfs dfs -mkdir -p $OUTDIR 

# make sure the output files don't exist
hdfs dfs -rm -r -f -skipTrash $OUTPUT

INPUT=/user/pmolnar/yelp/data/user/*
yarn \
    jar /usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar \
    -mapper "$WD/users2csv_mpr.py" \
    -input $INPUT \
    -output $OUTPUT

Deleted /user/pmolnar/yelp/output/users2csv
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob2001893774028961093.jar tmpDir=null


17/01/21 12:39:59 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/21 12:39:59 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/21 12:39:59 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/21 12:39:59 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/21 12:39:59 INFO mapred.FileInputFormat: Total input paths to process : 1
17/01/21 12:39:59 INFO mapreduce.JobSubmitter: number of splits:1
17/01/21 12:40:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1484597252711_0167
17/01/21 12:40:00 INFO impl.YarnClientImpl: Submitted application application_1484597252711_0167
17/01/21 12:40:00 INFO mapreduce.Job: The url to track the job: http://backend-0-1.insight.gsu.edu:8088/proxy/application_1484597252711_0167/
17/01/21 12:40:00 IN

Now we can load the output of the MapReduce job into Hive.

In [23]:
%%sh
hdfs dfs -ls /user/pmolnar/yelp/output/users2csv

Found 2 items
-rw-r--r--   3 pmolnar hadoop          0 2017-01-21 00:49 /user/pmolnar/yelp/output/users2csv/_SUCCESS
-rw-r--r--   3 pmolnar hadoop    3450808 2017-01-21 00:49 /user/pmolnar/yelp/output/users2csv/part-00000


In [24]:
%%sql
LOAD DATA INPATH '/user/pmolnar/yelp/output/users2csv/part-*' INTO TABLE users

Done.


[]

Check it out

In [25]:
%%sql
SELECT * FROM pmolnar.users LIMIT 10

Done.


users.user_id,users.name,users.review_count,users.average_stars,users.yelping_since,users.fans
--2QZsyXGz1OhiD4-0FQLQ,Kay,7,4.86,2014-04-01,
--519Rh5sTtkoUraGzAaKQ,Eric,8,4.5,2014-12-01,
--80yFOfe6nZKLhxTMZjEg,Moe,8,4.12,2009-07-01,
--K8RaywcHmmFtIXIHKZJg,Susan,1,5.0,2013-04-01,
--LzFD0UDbYE-Oho3AhsOg,Shumai,133,3.9,2011-01-01,
--MJXewYKgIGpKvtfwBkfg,Jen,1,2.0,2014-04-01,
--VxRvXk3b8FwsSbC2Zpxw,B,41,4.44,2010-07-01,
--WHJIfhj7M-ntd65kUy7Q,Kadie,13,4.23,2010-07-01,
--ZBhtxi8VwI-x9GzCIyxw,Sharon,5,3.0,2012-04-01,
--ZNzQbjx8FdCuJAkjl_vA,Anita,2,5.0,2012-07-01,


Now, let's create the remaining tables...

In [27]:
%%sql
CREATE TABLE IF NOT EXISTS user_votes (
    user_id STRING,
    vote_type STRING,
    count INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

In [28]:
%%sql
CREATE TABLE IF NOT EXISTS user_friends (
    user_id STRING,
    friends_user_id STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

In [29]:
%%sql
CREATE TABLE IF NOT EXISTS user_years_elite (
    user_id STRING,
    year INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

In [30]:
%%sql
CREATE TABLE IF NOT EXISTS user_compliments (
    user_id STRING,
    compliment_type STRING,
    count INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

Let's run MapReduce with the following mappers:

In [31]:
%ls -l user*2csv_mpr.py

-rwxrwxr-x 1 pmolnar pmolnar 305 Jan 21 00:18 user_compliments2csv_mpr.py*
-rwxrwxr-x 1 pmolnar pmolnar 259 Jan 21 00:11 user_friends2csv_mpr.py*
-rwxrwxr-x 1 pmolnar pmolnar 343 Jan 21 00:48 users2csv_mpr.py*
-rwxrwxr-x 1 pmolnar pmolnar 293 Jan 21 00:07 user_votes2csv_mpr.py*
-rwxrwxr-x 1 pmolnar pmolnar 262 Jan 21 00:15 user_years_elite2csv_mpr.py*


In [None]:
# %load user_compliments2csv_mpr.py
#!/usr/bin/env python

import sys
import json
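# input comes from STDIN (standard input), one JSON record per line
for line in sys.stdin:
    try:
        r = json.loads(line.strip())
        # emit one row per compliment type: user_id,compliment_type,count
        for c in r['compliments'].keys():
            print(','.join([r['user_id'], c, str(r['compliments'][c])]))
    except Exception:
        # skip records that cannot be parsed or that have no compliments
        pass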
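
The other mappers are listed above but not shown. For list-valued fields the pattern is presumably the same, with a list in place of the dictionary. The following is a sketch under that assumption; the field names `friends` and `elite` are guesses based on the table names, not taken from the original scripts, and `explode_list` is a hypothetical helper.

```python
import json

def explode_list(line, field):
    """Yield one 'user_id,item' CSV line per element of a list-valued field."""
    try:
        r = json.loads(line.strip())
        for item in r[field]:
            yield ",".join([r["user_id"], str(item)])
    except (ValueError, KeyError):
        # skip malformed lines and records without the field
        return

# Hypothetical input line for illustration.
line = '{"user_id": "u1", "elite": [2014, 2015]}'
rows = list(explode_list(line, "elite"))
print(rows)
```

A mapper script would call this for every line of `sys.stdin` and print each yielded row.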



In [54]:
%%sh
# use the current directory as location for program files
WD=`pwd`

OUTDIR=/user/$USER/yelp/output
OUTPUT=$OUTDIR/users2csv

# make sure output directory exists
hdfs dfs -mkdir -p $OUTDIR 

for TAB in user_compliments user_friends user_votes user_years_elite; do
    echo "Creating table '$TAB'"
    
    OUTPUT=$OUTDIR/${TAB}2csv
    # make sure the output files don't exist
    hdfs dfs -rm -r -f -skipTrash $OUTPUT

    INPUT=/user/pmolnar/yelp/data/user/*
    yarn \
        jar /usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar \
        -mapper "$WD/${TAB}2csv_mpr.py" \
        -input $INPUT \
        -output $OUTPUT
done

Creating table 'user_compliments'
Deleted /user/pmolnar/yelp/output/user_compliments2csv
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob3550170667214430559.jar tmpDir=null
Creating table 'user_friends'
Deleted /user/pmolnar/yelp/output/user_friends2csv
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob382411696673713235.jar tmpDir=null
Creating table 'user_votes'
Deleted /user/pmolnar/yelp/output/user_votes2csv
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob5881628468780169706.jar tmpDir=null
Creating table 'user_years_elite'
Deleted /user/pmolnar/yelp/output/user_years_elite2csv
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar

17/01/21 12:41:06 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/21 12:41:06 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/21 12:41:06 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/21 12:41:06 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/21 12:41:07 INFO mapred.FileInputFormat: Total input paths to process : 1
17/01/21 12:41:08 INFO mapreduce.JobSubmitter: number of splits:1
17/01/21 12:41:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1484597252711_0168
17/01/21 12:41:08 INFO impl.YarnClientImpl: Submitted application application_1484597252711_0168
17/01/21 12:41:08 INFO mapreduce.Job: The url to track the job: http://backend-0-1.insight.gsu.edu:8088/proxy/application_1484597252711_0168/
17/01/21 12:41:08 IN

In [45]:
%%sh
hdfs dfs -ls -R /user/pmolnar/yelp/output/

drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 14:35 /user/pmolnar/yelp/output/business_by_city
-rw-r--r--   3 pmolnar hadoop          0 2017-01-15 14:35 /user/pmolnar/yelp/output/business_by_city/_SUCCESS
-rw-r--r--   3 pmolnar hadoop      10636 2017-01-15 14:35 /user/pmolnar/yelp/output/business_by_city/part-00000
drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 21:22 /user/pmolnar/yelp/output/checkin_by_city
-rw-r--r--   3 pmolnar hadoop          0 2017-01-15 21:22 /user/pmolnar/yelp/output/checkin_by_city/_SUCCESS
-rw-r--r--   3 pmolnar hadoop     413406 2017-01-15 21:22 /user/pmolnar/yelp/output/checkin_by_city/part-00000
drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 21:21 /user/pmolnar/yelp/output/checkin_join
-rw-r--r--   3 pmolnar hadoop          0 2017-01-15 21:21 /user/pmolnar/yelp/output/checkin_join/_SUCCESS
-rw-r--r--   3 pmolnar hadoop   89505143 2017-01-15 21:21 /user/pmolnar/yelp/output/checkin_join/part-00000
drwxr-xr-x   - pmolnar hadoop          0 2017-

...and load into Hive

In [42]:
conn = sa.create_engine('hive://backend-0-1:10000/pmolnar')

In [55]:
for TAB in ['user_compliments', 'user_friends', 'user_votes', 'user_years_elite']:
    q = "LOAD DATA INPATH '/user/pmolnar/yelp/output/%s2csv/part-*' INTO TABLE pmolnar.%s"%(TAB, TAB)
    print q
    #res = conn.execute(q)
    #print '\n'.join(res)

LOAD DATA INPATH '/user/pmolnar/yelp/output/user_compliments2csv/part-*' INTO TABLE pmolnar.user_compliments
LOAD DATA INPATH '/user/pmolnar/yelp/output/user_friends2csv/part-*' INTO TABLE pmolnar.user_friends
LOAD DATA INPATH '/user/pmolnar/yelp/output/user_votes2csv/part-*' INTO TABLE pmolnar.user_votes
LOAD DATA INPATH '/user/pmolnar/yelp/output/user_years_elite2csv/part-*' INTO TABLE pmolnar.user_years_elite
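
Since these statements are built by string interpolation, it may be worth checking the table names before they reach Hive. Here is a minimal sketch; `load_statement` is a hypothetical helper, and the identifier rule (letters, digits, underscore) is an assumption that happens to fit the table names used here.

```python
import re

# Assumed rule for acceptable identifiers; adjust if your names differ.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def load_statement(table, database="pmolnar",
                   base="/user/pmolnar/yelp/output"):
    """Build the LOAD DATA statement for one table, rejecting odd names."""
    for name in (table, database):
        if not IDENTIFIER.match(name):
            raise ValueError("unexpected identifier: %r" % name)
    return ("LOAD DATA INPATH '%s/%s2csv/part-*' INTO TABLE %s.%s"
            % (base, table, database, table))

for tab in ["user_compliments", "user_friends", "user_votes", "user_years_elite"]:
    print(load_statement(tab))
```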


In [57]:
# q still holds the last statement built in the loop above (user_years_elite)
r = conn.execute(q)
print str(r)

<sqlalchemy.engine.result.ResultProxy object at 0x5e69690>


In [59]:
%%sql
SELECT * FROM user_years_elite LIMIT 20

Done.


user_years_elite.user_id,user_years_elite.year
--qhwKkTzgBeCH3wEJjg2g,
--qhwKkTzgBeCH3wEJjg2g,
--qhwKkTzgBeCH3wEJjg2g,
-0nLkzsZsFiX3nE4UKw5vg,
-0nLkzsZsFiX3nE4UKw5vg,
-1Q1s_NMGjBLBULA8z_npg,
-1Q1s_NMGjBLBULA8z_npg,
-1Q1s_NMGjBLBULA8z_npg,
-1ZSWpyW6Qf5gKeHUOLv6Q,
-1zY3QZ4vS2wdqTfnFLs1Q,
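
The `year` column comes back empty here, and `fans` in `users` shows the same symptom. One likely cause is visible in the raw part file printed earlier: each line ends with a tab character, which Hadoop streaming appends when a mapper line carries no tab-separated value. Hive then sees `2014<TAB>` and cannot parse it as `INT`. Below is a minimal sketch of stripping that trailing separator, assuming the part files can be post-processed line by line. `strip_streaming_tab` and `looks_like_int` are hypothetical helpers.

```python
def strip_streaming_tab(line):
    """Remove the line break and any trailing tab left by Hadoop streaming."""
    return line.rstrip("\r\n").rstrip("\t")

def looks_like_int(text):
    """Digits only; a rough stand-in for what an INT column can parse."""
    return text.isdigit()

raw = "--qhwKkTzgBeCH3wEJjg2g,2014\t\n"
year_before = raw.rstrip("\n").split(",")[1]
year_after = strip_streaming_tab(raw).split(",")[1]
print(repr(year_before), looks_like_int(year_before))
print(repr(year_after), looks_like_int(year_after))
```

The same effect could probably be achieved inside the job itself by configuring the streaming key/value separators, but that is not tested here.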


In [60]:
for TAB in ['user_compliments', 'user_friends', 'user_votes']:
    q = "LOAD DATA INPATH '/user/pmolnar/yelp/output/%s2csv/part-*' INTO TABLE pmolnar.%s"%(TAB, TAB)
    print q
    conn.execute(q)
    #print '\n'.join(res)

LOAD DATA INPATH '/user/pmolnar/yelp/output/user_compliments2csv/part-*' INTO TABLE pmolnar.user_compliments
LOAD DATA INPATH '/user/pmolnar/yelp/output/user_friends2csv/part-*' INTO TABLE pmolnar.user_friends
LOAD DATA INPATH '/user/pmolnar/yelp/output/user_votes2csv/part-*' INTO TABLE pmolnar.user_votes


In [65]:
%%sql
SELECT * FROM pmolnar.user_votes

Done.


user_votes.user_id,user_votes.vote_type,user_votes.count
--2QZsyXGz1OhiD4-0FQLQ,cool,
--2QZsyXGz1OhiD4-0FQLQ,funny,
--2QZsyXGz1OhiD4-0FQLQ,useful,
--519Rh5sTtkoUraGzAaKQ,cool,
--519Rh5sTtkoUraGzAaKQ,funny,
--519Rh5sTtkoUraGzAaKQ,useful,
--80yFOfe6nZKLhxTMZjEg,cool,
--80yFOfe6nZKLhxTMZjEg,funny,
--80yFOfe6nZKLhxTMZjEg,useful,
--K8RaywcHmmFtIXIHKZJg,cool,


In [68]:
%%sql
SHOW TABLES

Done.


tab_name
user_compliments
user_friends
user_votes
user_years_elite
users


In [69]:
%%sql
USE yelp

Done.


[]

In [70]:
%%sql
SHOW TABLES

Done.


tab_name
review
tip


In [79]:
%%sql

CREATE TABLE IF NOT EXISTS users (
    user_id STRING,
    name STRING,
    review_count INT,
    average_stars DOUBLE,
    yelping_since DATE,
    fans INT
)

Done.


[]

In [80]:
%%sql
INSERT OVERWRITE TABLE yelp.users
SELECT * FROM pmolnar.users

Done.


[]

In [81]:
%%sql
DESCRIBE yelp.users

Done.


col_name,data_type,comment
user_id,string,
name,string,
review_count,int,
average_stars,double,
yelping_since,date,
fans,int,
