## JEG Base + Hive - Quick Dataframe exploration

In [1]:
spark

In [2]:
spark.sql("show tables").show(10, truncate = False)

+--------+------------------------+-----------+
|database|tableName               |isTemporary|
+--------+------------------------+-----------+
|default |atlas_higgs_100x        |false      |
|default |atlas_higgs_100x_parquet|false      |
|default |customer_history        |false      |
|default |customers               |false      |
|default |hive_test               |false      |
|default |sample_07               |false      |
|default |sample_071              |false      |
|default |sample_0711             |false      |
|default |sample_072              |false      |
|default |sample_073              |false      |
+--------+------------------------+-----------+
only showing top 10 rows



## Native Hadoop Jupyter Shell Actions

In [3]:
# Cloudera only: reuse the client configuration directory as HADOOP_CONF_DIR
# for the shell commands below
import os
os.environ['HADOOP_CONF_DIR'] = os.environ['HADOOP_CLIENT_CONF_DIR']

In [4]:
!hostname -f

cdhdemo4.fyre.ibm.com


In [5]:
!hdfs dfs -ls /user/user1/atl*.csv

-rw-r--r--   3 user1 user1   55253673 2018-10-01 15:22 /user/user1/atlas_higgs.csv
-rw-r--r--   3 user1 user1   55253673 2018-10-02 23:35 /user/user1/atlas_higgsx.csv


<br>

---

### **SECURE.** hdfs shell access

**Kerberos + Hadoop Access Controls enforced natively.** 

Ex: `user1` should NOT be allowed to read the home directory of `user2`

In [6]:
!hdfs dfs -ls /user/user2

ls: Permission denied: user=user1, access=READ_EXECUTE, inode="/user/user2":user2:user2:drwx------
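
The denial above follows from the mode string in the error message: `drwx------` grants read, write, and execute to the owner (`user2`) and nothing to anyone else. The sketch below is a simplified model of that owner/group/other check. It is an illustration only: `is_allowed` is a hypothetical helper, and it deliberately ignores the superuser, ACLs, and parent-directory traversal checks that a real cluster also applies.

```python
def is_allowed(mode: str, owner: str, group: str,
               user: str, user_groups: set, wanted: str) -> bool:
    """Simplified POSIX-style check (assumption: no ACLs, no superuser).

    mode   -- listing-style string such as 'drwx------'
    wanted -- requested access as a triplet, e.g. 'r-x' for READ_EXECUTE
    """
    bits = mode[1:]  # drop the leading file-type character
    if user == owner:
        triplet = bits[0:3]   # owner class
    elif group in user_groups:
        triplet = bits[3:6]   # group class
    else:
        triplet = bits[6:9]   # everyone else
    # Every requested permission must be present in the selected triplet.
    return all(w == "-" or w == t for w, t in zip(wanted, triplet))


# Values taken from the error message above; the group membership of
# each user is an assumption made for the example.
print(is_allowed("drwx------", "user2", "user2", "user1", {"user1"}, "r-x"))
print(is_allowed("drwx------", "user2", "user2", "user2", {"user2"}, "r-x"))
```

The first call models `user1` listing `/user/user2` and the second models `user2` listing its own directory.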


---
## Quickly and securely wrangle data between HDFS<->DF<->Hive

In [7]:
# Read raw data from hdfs
df1 = spark.read.csv("/user/user1/atlas_higgs.csv",header=True)

**Tab completion works through JEG**

In [None]:
df1.

In [8]:
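# createOrReplaceTempView lets this cell be re-run; createTempView fails if the view already exists
df1.createOrReplaceTempView("higgs_tmp")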

In [9]:
# spark.sql("drop table atlas_higgs_demo")

In [10]:
# Views can be persisted back to Hive as a new table
spark.sql("create table atlas_higgs_demo as select * from higgs_tmp")

DataFrame[]

In [11]:
df1.count()

250000

## Quickly Benchmark different Hive formats

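Timing the same query against different storage formats is a quick way to compare them on large datasets.

**Example:**

A 25-million-row dataset, using `%%time` to time Spark reads of the Parquet table and of the table created without a `stored as` clause.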

In [12]:
%%time
df_lg_parquet = spark.sql('select * from atlas_higgs_100x_parquet')
print(df_lg_parquet.count())

25000000
CPU times: user 6.2 ms, sys: 1.94 ms, total: 8.15 ms
Wall time: 24.2 s


In [13]:
%%time
df_lg = spark.sql('select * from atlas_higgs_100x')
print(df_lg.count())

25000000
CPU times: user 6.09 ms, sys: 15.3 ms, total: 21.4 ms
Wall time: 22.2 s
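
The two single runs above differ by about 2 seconds, which may well be within run-to-run noise, so repeated timings give a steadier comparison. The CPU times reported by `%%time` cover only the notebook kernel process, which is why they stay in milliseconds while the wall time is tens of seconds. Below is a minimal sketch of a repeat-and-take-the-median helper. `time_call` is a hypothetical name, and the workload is a plain Python stand-in; in the notebook it could be something like `lambda: spark.sql('select * from atlas_higgs_100x').count()`.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Run fn() `repeats` times; return (last result, wall-clock seconds per run)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        timings.append(time.perf_counter() - start)
    return result, timings


# Stand-in workload (assumption): replace with a Spark action in the notebook.
result, timings = time_call(lambda: sum(range(1_000_000)), repeats=5)
print(result, f"median {statistics.median(timings):.4f}s")
```

The median is less sensitive than the mean to a slow first run, which is common when caches are cold.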



## Summary

- Quick data wrangling between HDFS, DataFrames, and Hive
- Native Jupyter magics and shell actions work against HDFS with Kerberos and Hadoop access controls enforced


---

#### Additional Resources/DataGen Notes 

- Creating 100x table with 25M rows

In [11]:
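# Seed the table with one copy of the data (250,000 rows)
spark.sql("create table atlas_higgs_100x as (select * from higgs_tmp)")
# range(1,99) runs 98 inserts, giving 99 copies (24,750,000 rows); use range(99) for a full 100x
for x in range(1,99):
    spark.sql("insert into atlas_higgs_100x ( select * from atlas_higgs)")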

In [12]:
%%time
df_sm = spark.sql('select * from atlas_higgs')
print(df_sm.count())

250000
CPU times: user 1.31 ms, sys: 937 µs, total: 2.25 ms
Wall time: 480 ms


In [13]:
%%time
df_lg = spark.sql('select * from atlas_higgs_100x')
print(df_lg.count())

24750000
CPU times: user 3.5 ms, sys: 1.78 ms, total: 5.27 ms
Wall time: 23.2 s
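
The count of 24,750,000 rather than 25,000,000 follows from the loop bounds in the data generation cell. The arithmetic can be sketched as below, assuming the seed and every insert contribute 250,000 rows each, as the counts in this notebook suggest. `rows_after_loop` is a hypothetical helper used only for this check.

```python
BASE_ROWS = 250_000  # rows in the source data, per the counts shown in this notebook


def rows_after_loop(base_rows, loop):
    """Rows after one seeding `create table ... as select` plus one insert per loop step."""
    inserts = len(loop)
    return base_rows * (1 + inserts)


print(rows_after_loop(BASE_ROWS, range(1, 99)))  # 98 inserts -> 24750000
print(rows_after_loop(BASE_ROWS, range(99)))     # 99 inserts -> 25000000
```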


#### Create same table, stored as parquet

In [14]:
spark.sql("create table atlas_higgs_100x_parquet stored as parquet as select * from atlas_higgs_100x")

DataFrame[]

In [None]:
spark.sql("drop table atlas_higgs_demo")