## How to Easily Write a CSV File to the Data Lake with Spark

In [1]:
# If you are starting from this notebook, uncomment and run the command below first.
# If you need help running it, see the notebook "1_CML_Session_Basics.ipynb".
#!pip3 install -r requirements.txt

#### This notebook showcases how to write a CSV file into a Hive or Impala table with PySpark. There are three simple steps, plus one optional validation step:
##### 1. Save Storage Environment Variable (One Off)
##### 2. Read the CSV file from a local CML Project Folder into Pandas
##### 3. Transform the Pandas Dataframe to a Spark Dataframe and write that as a Hive table to the Data Lake
##### 4. Optional: Validate in CDW

### Step 1: Save Storage Environment Variable 

This step only needs to be executed once per CML Project. The storage location is saved as a CML Project environment variable.

Optionally, you could save this in a script and build automation with [APIv2](https://github.com/pdefusco/CML_AMP_APIv2) to execute it upon project creation.

In [2]:
from cmlbootstrap import CMLBootstrap
from IPython.display import Javascript, HTML
import os
import time
import json
import requests
import xml.etree.ElementTree as ET
import datetime
import subprocess

In [3]:
def set_storage():
    # Set the setup variables needed by CMLBootstrap
    HOST = os.getenv("CDSW_API_URL").split(":")[0] + "://" + os.getenv("CDSW_DOMAIN")
    USERNAME = os.getenv("CDSW_PROJECT_URL").split("/")[6]
    API_KEY = os.getenv("CDSW_API_KEY")
    PROJECT_NAME = os.getenv("CDSW_PROJECT")

    # Instantiate API Wrapper
    cml = CMLBootstrap(HOST, USERNAME, API_KEY, PROJECT_NAME)

    # Set the STORAGE environment variable
    try:
        storage = os.environ["STORAGE"]
    except KeyError:
        # Default used when hive-site.xml is missing or has no warehouse dir entry
        storage = "/user/" + os.getenv("HADOOP_USER_NAME", "")
        if os.path.exists("/etc/hadoop/conf/hive-site.xml"):
            tree = ET.parse("/etc/hadoop/conf/hive-site.xml")
            root = tree.getroot()
            for prop in root.findall("property"):
                if prop.find("name").text == "hive.metastore.warehouse.dir":
                    # Keep only the scheme and the first path component after "//"
                    parts = prop.find("value").text.split("/")
                    storage = parts[0] + "//" + parts[2]
        storage_environment_params = {"STORAGE": storage}
        storage_environment = cml.create_environment_variable(storage_environment_params)
        os.environ["STORAGE"] = storage
    print("Storage Var Was Saved Permanently to the Project")

In [4]:
set_storage()

Storage Var Was Saved Permanently to the Project
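
The warehouse lookup inside `set_storage()` can be tried in isolation. The sketch below applies the same split-on-`/` logic to an in-memory XML string. The helper name `storage_root_from_hive_site` and the sample warehouse path are made up for illustration; a real `hive-site.xml` may hold a different value.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the notebook reads /etc/hadoop/conf/hive-site.xml instead
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>s3a://example-bucket/warehouse/tablespace/managed/hive</value>
  </property>
</configuration>"""


def storage_root_from_hive_site(xml_text):
    """Return the scheme and bucket of the warehouse dir, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "hive.metastore.warehouse.dir":
            # "s3a://bucket/path" splits into ["s3a:", "", "bucket", "path", ...]
            parts = prop.findtext("value").split("/")
            return parts[0] + "//" + parts[2]
    return None


print(storage_root_from_hive_site(SAMPLE_HIVE_SITE))
```

Returning `None` when the property is missing lets the caller decide on a fallback, such as the `/user/<name>` path used in the function above.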


### Step 2: Read the CSV file from a local CML Project Folder into Pandas

In [16]:
import pandas as pd

In [17]:
df = pd.read_csv("data/LoanStats_2015_subset_071821.csv")

In [31]:
print("Pandas Dataframe Shape")
df.shape

Pandas Dataframe Shape


(18656, 79)

In [32]:
df.dtypes

all_util                               float64
annual_inc                             float64
annual_inc_joint                       float64
bc_open_to_buy                         float64
bc_util                                float64
                                        ...   
sec_app_open_act_il                    float64
sec_app_num_rev_accts                  float64
sec_app_chargeoff_within_12_mths       float64
sec_app_collections_12_mths_ex_med     float64
sec_app_mths_since_last_major_derog    float64
Length: 79, dtype: object
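
The dtypes listed above determine the column types Spark infers in Step 3. As a rough guide, the sketch below maps common pandas dtype names to the Spark SQL types they typically become with `spark.createDataFrame`. The mapping table and the helper `preview_spark_types` are illustrative assumptions, not part of the original notebook, and `object` columns with mixed or all-null values may behave differently.

```python
# Assumed mapping for illustration; confirm with sparkDf.printSchema() after conversion
PANDAS_TO_SPARK = {
    "int64": "bigint",
    "float64": "double",
    "bool": "boolean",
    "object": "string",
    "datetime64[ns]": "timestamp",
}


def preview_spark_types(dtypes):
    """Map {column: pandas dtype name} to {column: expected Spark SQL type}."""
    return {
        col: PANDAS_TO_SPARK.get(str(dtype), "unknown")
        for col, dtype in dtypes.items()
    }


# Two columns taken from the df.dtypes output above
print(preview_spark_types({"all_util": "float64", "annual_inc": "float64"}))
```

With the real DataFrame, `preview_spark_types(df.dtypes)` should also work, since `df.dtypes` supports `.items()` and `str()` of a pandas dtype yields its name.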

### Step 3: Transform the Pandas Dataframe to a Spark Dataframe and write that as a Hive table to the Data Lake

In [19]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [20]:
spark = SparkSession\
    .builder\
    .appName("PythonSQL")\
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-2")\
    .config("spark.yarn.access.hadoopFileSystems", os.environ["STORAGE"])\
    .getOrCreate()
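
The session config above reads `os.environ["STORAGE"]` directly, so it raises a bare `KeyError` if Step 1 has not been run for this project. A minimal sketch of a friendlier check follows; the helper name `require_storage` is hypothetical.

```python
import os


def require_storage(env=os.environ):
    """Return the STORAGE value, or raise a descriptive error if it is missing."""
    storage = env.get("STORAGE")
    if not storage:
        raise RuntimeError("STORAGE is not set; run set_storage() from Step 1 first")
    return storage
```

The builder line could then read `.config("spark.yarn.access.hadoopFileSystems", require_storage())`.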

In [21]:
# Transform Pandas DF to Spark DF

In [25]:
sparkDf = spark.createDataFrame(df)

In [28]:
sparkDf.write.format('parquet').mode("overwrite").saveAsTable('default.my_table')

Hive Session ID = 9faa6309-892f-4773-98f0-658a6fdc4b96
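
Before stopping the session, it can be worth confirming that the table holds as many rows as the source DataFrame. The sketch below is one way to do that with `spark.table(...).count()`; the helper name `verify_row_count` is an assumption for illustration and is not part of the original notebook.

```python
def verify_row_count(spark, table_name, expected_rows):
    """Compare the saved table's row count with the expected number of rows."""
    actual_rows = spark.table(table_name).count()
    if actual_rows != expected_rows:
        raise ValueError(
            f"{table_name}: expected {expected_rows} rows, found {actual_rows}"
        )
    return actual_rows
```

In this notebook the call would be `verify_row_count(spark, "default.my_table", len(df))`, run while the Spark session is still active.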
                                                                                

In [29]:
spark.stop()

### Step 4 (Optional): Validate in CDW

##### Navigate to CDW from the CDP Home Page. Open the CDW Virtual Warehouse associated with the same Data Lake as this Workspace.

![alt text](img/cml_howo_7A.png)

##### Query the Data

![alt text](img/cml_howto_7B.png)