<div style="font-size: 200%; font-weight: bold; color: gray; padding-bottom: 20px">Loading Data into Hive</div>
Our Yelp data sets are stored in JSON format. They include nested structures which cannot be directly translated into SQL/Hive tables.

In some cases we have to produce multiple tables from a single data set and then join them in queries. Alternatively, we may have to replicate certain values across rows to generate a "flat" table. Sometimes *proper database normalization* and *analysis tools* are at odds...

To learn more about database normalization go to https://en.wikipedia.org/wiki/Database_normalization
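
The trade-off described above can be sketched in plain Python. This is only an illustration: the sample record is made up, and the key names `votes`, `friends`, `elite` and `compliments` are assumptions about how the nested parts of a 'user' record are named. The function splits one nested record into rows for several normalized tables.

```python
import json

# Hypothetical record; the keys 'votes', 'friends', 'elite' and 'compliments'
# are assumed names for the nested parts of a 'user' record.
record = json.loads('''{
  "user_id": "u1", "name": "Kay", "review_count": 7,
  "average_stars": 4.86, "yelping_since": "2014-04", "fans": 0,
  "votes": {"funny": 1, "useful": 3},
  "friends": ["u2", "u3"],
  "elite": [2015],
  "compliments": {"cool": 2}
}''')

def normalize_user(r):
    """Split one nested record into rows for five normalized tables."""
    uid = r['user_id']
    return {
        'users': [(uid, r['name'], r['review_count'], r['average_stars'],
                   r['yelping_since'], r['fans'])],
        'user_votes': [(uid, k, v) for k, v in sorted(r['votes'].items())],
        'user_friends': [(uid, f) for f in r['friends']],
        'user_years_elite': [(uid, y) for y in r['elite']],
        'user_compliments': [(uid, k, v)
                             for k, v in sorted(r['compliments'].items())],
    }

tables = normalize_user(record)
for name in sorted(tables):
    print(name, tables[name])
```

Each nested dictionary or list becomes its own table keyed by `user_id`, so the scalar attributes are stored once per user and never repeated across rows.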

In [39]:
import numpy as np
import pandas as pd
import sqlalchemy as sa
%matplotlib inline
import matplotlib.pyplot as plt
%load_ext sql
%config SqlMagic.autolimit=200
%config SqlMagic.displaylimit=20

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [4]:
%%sql hive://backend-0-1:10000/pmolnar
SHOW TABLES

Done.


tab_name
users


# 'user' data set

The JSON schema

## Normalized tables
### Table: **users**

|user_id|name|review_count|average_stars|yelping_since|fans|
|-------|----|------------|-------------|-------------|----| 
| x | x |x  | x | x | x |
| x | x |x  | x | x | x |
| x | x |x  | x | x | x |

### Table: **user_votes**

|user_id|vote_type|count|
|-------|---------|-----|
| x| x | x |
| x| x | x |
| x| x | x |

### Table: **user_friends**

|user_id|friends_user_id|
|-------|---------------|
| x |x  |
| x |x  |
| x |x  |

### Table: **user_years_elite**

|user_id|year|
|-------|----|
| x |x  |
| x |x  |
| x |x  |

### Table: **user_compliments**

|user_id|compliment_type|count|
|-------|---------------|-----|
| x | x |x |
| x | x |x |
| x | x |x |

We want to create the following Hive table (`yelping_since` is kept as a plain `STRING`):

In [19]:
%%sql
CREATE TABLE IF NOT EXISTS users (
    user_id STRING,
    name STRING,
    review_count INT,
    average_stars DOUBLE,
    yelping_since STRING,
    fans INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]
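
A table declared with `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','` splits each line on every comma and does not interpret quotes, so a value such as a name containing a comma would shift all following columns. The helper below is a minimal sketch of one way to guard against that. `to_csv_row` is a hypothetical name, it is written for Python 3, and replacing separators with spaces is just one possible policy.

```python
def to_csv_row(values, sep=','):
    """Join values into one delimited line, replacing characters that
    would be misread as field or record separators."""
    cleaned = []
    for v in values:
        s = v if isinstance(v, str) else str(v)
        for ch in (sep, '\n', '\r', '\t'):
            s = s.replace(ch, ' ')
        cleaned.append(s)
    return sep.join(cleaned)

# Hypothetical values in the column order of the `users` table
row = to_csv_row(['u1', 'Smith, John', 7, 4.86, '2014-04', 0])
print(row)
```

With six input values the resulting line always contains exactly five separators, which is what the six-column table expects.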

We need to write a MapReduce mapper script that transforms records from the 'user' data set into the above format:

In [None]:
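# %load users2csv_mpr.py
#!/usr/bin/env python

import sys
import json

# columns in the order expected by the Hive table `users`
FIELDS = ['user_id', 'name', 'review_count', 'average_stars', 'yelping_since', 'fans']

# input comes from STDIN (standard input), one JSON record per line
for line in sys.stdin:
    try:
        r = json.loads(line.strip())
        # join() only accepts strings, so convert every value first
        print(','.join([str(r[k]) for k in FIELDS]))
    except (ValueError, KeyError):
        # skip records that are not valid JSON, lack a field, or cannot be encoded
        continue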




... run MapReduce

In [22]:
%%sh
# use the current directory as location for program files
WD=`pwd`

OUTDIR=/user/$USER/yelp/output
OUTPUT=$OUTDIR/users2csv

# make sure output directory exists
hdfs dfs -mkdir -p $OUTDIR 

# make sure the output files don't exist
hdfs dfs -rm -r -f -skipTrash $OUTPUT

INPUT=/user/pmolnar/yelp/data/user/*
yarn \
    jar /usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar \
    -mapper "$WD/users2csv_mpr.py" \
    -input $INPUT \
    -output $OUTPUT

Deleted /user/pmolnar/yelp/output/users2csv
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob7721050826789992328.jar tmpDir=null


17/01/21 00:48:59 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/21 00:48:59 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/21 00:49:00 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/21 00:49:00 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/21 00:49:00 INFO mapred.FileInputFormat: Total input paths to process : 1
17/01/21 00:49:00 INFO mapreduce.JobSubmitter: number of splits:1
17/01/21 00:49:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1484597252711_0134
17/01/21 00:49:01 INFO impl.YarnClientImpl: Submitted application application_1484597252711_0134
17/01/21 00:49:01 INFO mapreduce.Job: The url to track the job: http://backend-0-1.insight.gsu.edu:8088/proxy/application_1484597252711_0134/
17/01/21 00:49:01 IN

Now we can load the MapReduce output into the Hive table:

In [23]:
%%sh
hdfs dfs -ls /user/pmolnar/yelp/output/users2csv

Found 2 items
-rw-r--r--   3 pmolnar hadoop          0 2017-01-21 00:49 /user/pmolnar/yelp/output/users2csv/_SUCCESS
-rw-r--r--   3 pmolnar hadoop    3450808 2017-01-21 00:49 /user/pmolnar/yelp/output/users2csv/part-00000


In [24]:
%%sql
LOAD DATA INPATH '/user/pmolnar/yelp/output/users2csv/part-*' INTO TABLE users

Done.


[]

Check the result:

In [25]:
%%sql
SELECT * FROM pmolnar.users LIMIT 10

Done.


users.user_id,users.name,users.review_count,users.average_stars,users.yelping_since,users.fans
--2QZsyXGz1OhiD4-0FQLQ,Kay,7,4.86,2014-04-01,
--519Rh5sTtkoUraGzAaKQ,Eric,8,4.5,2014-12-01,
--80yFOfe6nZKLhxTMZjEg,Moe,8,4.12,2009-07-01,
--K8RaywcHmmFtIXIHKZJg,Susan,1,5.0,2013-04-01,
--LzFD0UDbYE-Oho3AhsOg,Shumai,133,3.9,2011-01-01,
--MJXewYKgIGpKvtfwBkfg,Jen,1,2.0,2014-04-01,
--VxRvXk3b8FwsSbC2Zpxw,B,41,4.44,2010-07-01,
--WHJIfhj7M-ntd65kUy7Q,Kadie,13,4.23,2010-07-01,
--ZBhtxi8VwI-x9GzCIyxw,Sharon,5,3.0,2012-04-01,
--ZNzQbjx8FdCuJAkjl_vA,Anita,2,5.0,2012-07-01,


Now, let's create the remaining tables...

In [27]:
%%sql
CREATE TABLE IF NOT EXISTS user_votes (
    user_id STRING,
    vote_type STRING,
    count INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

In [28]:
%%sql
CREATE TABLE IF NOT EXISTS user_friends (
    user_id STRING,
    friends_user_id STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

In [29]:
%%sql
CREATE TABLE IF NOT EXISTS user_years_elite (
    user_id STRING,
    year INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

In [30]:
%%sql
CREATE TABLE IF NOT EXISTS user_compliments (
    user_id STRING,
    compliment_type STRING,
    count INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Done.


[]

Let's run MapReduce with the following mappers:

In [31]:
%ls -l user*2csv_mpr.py

-rwxrwxr-x 1 pmolnar pmolnar 305 Jan 21 00:18 user_compliments2csv_mpr.py*
-rwxrwxr-x 1 pmolnar pmolnar 259 Jan 21 00:11 user_friends2csv_mpr.py*
-rwxrwxr-x 1 pmolnar pmolnar 343 Jan 21 00:48 users2csv_mpr.py*
-rwxrwxr-x 1 pmolnar pmolnar 293 Jan 21 00:07 user_votes2csv_mpr.py*
-rwxrwxr-x 1 pmolnar pmolnar 262 Jan 21 00:15 user_years_elite2csv_mpr.py*
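
The contents of these mapper scripts are not shown here. As a rough sketch of what a mapper for the `user_votes` table might look like, the code below emits one `user_id,vote_type,count` line per entry of a nested dictionary. The key name `votes`, the function names and the sample record are assumptions made for illustration, not the actual script.

```python
import json
import sys

def votes_rows(line):
    """Return CSV lines (user_id,vote_type,count) for one JSON record."""
    r = json.loads(line)
    return ['%s,%s,%s' % (r['user_id'], vote_type, count)
            for vote_type, count in sorted(r['votes'].items())]

def main(stream=sys.stdin):
    """Read JSON records line by line and print the resulting CSV rows."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            for row in votes_rows(line):
                print(row)
        except (ValueError, KeyError):
            # skip records that are not valid JSON or lack the expected keys
            continue

# Hypothetical sample record
sample = '{"user_id": "u1", "votes": {"funny": 1, "useful": 3, "cool": 0}}'
print(votes_rows(sample))
```

When used as a streaming mapper, the script would call `main()` instead of printing the sample. The other one-to-many tables (`user_friends`, `user_years_elite`, `user_compliments`) could follow the same pattern with a list or dictionary in place of `votes`.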


In [33]:
%%sh
# use the current directory as location for program files
WD=`pwd`

OUTDIR=/user/$USER/yelp/output
OUTPUT=$OUTDIR/users2csv

# make sure output directory exists
hdfs dfs -mkdir -p $OUTDIR 

for TAB in user_compliments user_friends user_votes user_years_elite; do
    echo "Creating table '$TAB'"
    
    OUTPUT=$OUTDIR/${TAB}2csv
    # make sure the output files don't exist
    hdfs dfs -rm -r -f -skipTrash $OUTPUT

    INPUT=/user/pmolnar/yelp/data/user/*
    yarn \
        jar /usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar \
        -mapper "$WD/${TAB}2csv_mpr.py" \
        -input $INPUT \
        -output $OUTPUT
done

Creating table 'user_compliments'
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob6182295092184223707.jar tmpDir=null
Creating table 'user_friends'
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob7413506657821228189.jar tmpDir=null
Creating table 'user_votes'
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob1837506799785948537.jar tmpDir=null
Creating table 'user_years_elite'
packageJobJar: [] [/usr/hdp/2.4.2.0-258/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.2.0-258.jar] /var/lib/ambari-agent/tmp/hadoop_java_io_tmpdir/streamjob5081912236038291172.jar tmpDir=null


17/01/21 08:46:37 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/21 08:46:37 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/21 08:46:37 INFO impl.TimelineClientImpl: Timeline service address: http://backend-0-2.insight.gsu.edu:8188/ws/v1/timeline/
17/01/21 08:46:37 INFO client.RMProxy: Connecting to ResourceManager at backend-0-1.insight.gsu.edu/192.168.1.253:8050
17/01/21 08:46:38 INFO mapred.FileInputFormat: Total input paths to process : 1
17/01/21 08:46:38 INFO mapreduce.JobSubmitter: number of splits:1
17/01/21 08:46:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1484597252711_0135
17/01/21 08:46:38 INFO impl.YarnClientImpl: Submitted application application_1484597252711_0135
17/01/21 08:46:38 INFO mapreduce.Job: The url to track the job: http://backend-0-1.insight.gsu.edu:8088/proxy/application_1484597252711_0135/
17/01/21 08:46:38 IN

In [45]:
%%sh
hdfs dfs -ls -R /user/pmolnar/yelp/output/

drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 14:35 /user/pmolnar/yelp/output/business_by_city
-rw-r--r--   3 pmolnar hadoop          0 2017-01-15 14:35 /user/pmolnar/yelp/output/business_by_city/_SUCCESS
-rw-r--r--   3 pmolnar hadoop      10636 2017-01-15 14:35 /user/pmolnar/yelp/output/business_by_city/part-00000
drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 21:22 /user/pmolnar/yelp/output/checkin_by_city
-rw-r--r--   3 pmolnar hadoop          0 2017-01-15 21:22 /user/pmolnar/yelp/output/checkin_by_city/_SUCCESS
-rw-r--r--   3 pmolnar hadoop     413406 2017-01-15 21:22 /user/pmolnar/yelp/output/checkin_by_city/part-00000
drwxr-xr-x   - pmolnar hadoop          0 2017-01-15 21:21 /user/pmolnar/yelp/output/checkin_join
-rw-r--r--   3 pmolnar hadoop          0 2017-01-15 21:21 /user/pmolnar/yelp/output/checkin_join/_SUCCESS
-rw-r--r--   3 pmolnar hadoop   89505143 2017-01-15 21:21 /user/pmolnar/yelp/output/checkin_join/part-00000
drwxr-xr-x   - pmolnar hadoop          0 2017-

...and load into Hive

In [42]:
conn = sa.create_engine('hive://backend-0-1:10000/pmolnar')

In [43]:
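for TAB in ['user_compliments', 'user_friends', 'user_votes', 'user_years_elite']:
    # LOAD DATA INPATH moves the HDFS files into the table's directory
    q = "LOAD DATA INPATH '/user/pmolnar/yelp/output/%s2csv/part-*' INTO TABLE pmolnar.%s" % (TAB, TAB)
    print(q)
    # LOAD DATA returns no result rows, so there is nothing to fetch or print
    conn.execute(q)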

LOAD DATA INPATH '/user/pmolnar/yelp/output/user_compliments2csv/part-*' INTO TABLE pmolnar.user_compliments


OperationalError: (pyhive.exc.OperationalError) TExecuteStatementResp(status=TStatus(errorCode=40000, errorMessage="Error while compiling statement: FAILED: SemanticException Line 1:17 Invalid path ''/user/pmolnar/yelp/output/user_compliments2csv/part-*'': No files matching path hdfs://backend-0-2.insight.gsu.edu:8020/user/pmolnar/yelp/output/user_compliments2csv/part-*", sqlState='42000', infoMessages=["*org.apache.hive.service.cli.HiveSQLException:Error while compiling statement: FAILED: SemanticException Line 1:17 Invalid path ''/user/pmolnar/yelp/output/user_compliments2csv/part-*'': No files matching path hdfs://backend-0-2.insight.gsu.edu:8020/user/pmolnar/yelp/output/user_compliments2csv/part-*:28:27", 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:315', 'org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:112', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:181', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:419', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:400', 'sun.reflect.GeneratedMethodAccessor23:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:497', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1709', 
'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy22:executeStatement::-1', 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:263', 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:486', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1317', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1302', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:285', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617', 'java.lang.Thread:run:Thread.java:745', "*org.apache.hadoop.hive.ql.parse.SemanticException:Line 1:17 Invalid path ''/user/pmolnar/yelp/output/user_compliments2csv/part-*'': No files matching path hdfs://backend-0-2.insight.gsu.edu:8020/user/pmolnar/yelp/output/user_compliments2csv/part-*:34:7", 'org.apache.hadoop.hive.ql.parse.LoadSemanticAnalyzer:applyConstraintsAndGetFiles:LoadSemanticAnalyzer.java:146', 'org.apache.hadoop.hive.ql.parse.LoadSemanticAnalyzer:analyzeInternal:LoadSemanticAnalyzer.java:227', 'org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer:analyze:BaseSemanticAnalyzer.java:227', 'org.apache.hadoop.hive.ql.Driver:compile:Driver.java:459', 'org.apache.hadoop.hive.ql.Driver:compile:Driver.java:316', 'org.apache.hadoop.hive.ql.Driver:compileInternal:Driver.java:1189', 'org.apache.hadoop.hive.ql.Driver:compileAndRespond:Driver.java:1183', 'org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:110'], statusCode=3), 
operationHandle=None) [SQL: "LOAD DATA INPATH '/user/pmolnar/yelp/output/user_compliments2csv/part-*' INTO TABLE pmolnar.user_compliments"]
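
The error reports that no files match the `part-*` pattern. One possible cause (an assumption, since the directory listing above is cut off) is that the files were already consumed by an earlier load: `LOAD DATA INPATH` moves HDFS files into the table's directory rather than copying them, so a second run finds nothing. A minimal sketch of a guard is shown below. `part_files` and `load_statement` are hypothetical helpers that work on the text printed by `hdfs dfs -ls`.

```python
def part_files(ls_output):
    """Extract paths of part-* files from the text printed by `hdfs dfs -ls`."""
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # the path is the last whitespace-separated field of each entry
        if fields and fields[-1].rsplit('/', 1)[-1].startswith('part-'):
            paths.append(fields[-1])
    return paths

def load_statement(tab, outdir='/user/pmolnar/yelp/output', db='pmolnar'):
    """Build the LOAD DATA statement for one table."""
    return ("LOAD DATA INPATH '%s/%s2csv/part-*' INTO TABLE %s.%s"
            % (outdir, tab, db, tab))

# Listing copied from the `hdfs dfs -ls` output for users2csv shown earlier
listing = """Found 2 items
-rw-r--r--   3 pmolnar hadoop          0 2017-01-21 00:49 /user/pmolnar/yelp/output/users2csv/_SUCCESS
-rw-r--r--   3 pmolnar hadoop    3450808 2017-01-21 00:49 /user/pmolnar/yelp/output/users2csv/part-00000
"""

if part_files(listing):
    print(load_statement('users'))
else:
    print('nothing to load')
```

In the notebook, the listing text could come from a `%%sh` cell or a subprocess call, and the statement would only be passed to `conn.execute` when at least one part file is present.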

In [None]:
%%sql
LOAD DATA INPATH '/user/pmolnar/yelp/output/users2csv/part-*' INTO TABLE pmolnar.users

In [38]:
%%sql
SELECT * FROM pmolnar.user_years_elite

Done.


user_years_elite.user_id,user_years_elite.year
