## 2. Data Engineering - Process CSV files into BQ Tables

### Create Spark session with BQ connector

Create a Spark session with the spark-bigquery connector JAR loaded.

In [32]:
# Run this with the Python kernel, not the PySpark kernel
# https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/examples/notebooks/Top%20words%20in%20Shakespeare%20by%20work.ipynb
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, IntegerType, StructField, StructType

# The connector JAR lets Spark read from and write to BigQuery
spark = SparkSession.builder \
    .appName('Spark - Data Eng Demo') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
    .getOrCreate()

Check the first 1000 bytes of a file on GCS

In [33]:
!gsutil cat -h -r 0-1000 gs://datalake-vol2-data/bank-marketing-train.csv

CommandException: No URLs matched: gs://datalake-vol2-data/bank-marketing-train.csv


In [37]:
path_to_train_csv = "gs://datalake-vol2-data/banking_train_set.csv"

### Get Spark application ID

This makes it easy to find the application in the Spark History UI.

In [38]:
spark.conf.get("spark.app.id")

'application_1610100240292_0004'

Load the CSV file into a Spark DataFrame

In [39]:
df_bank_marketing_from_csv = spark \
    .read \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .csv(path_to_train_csv)

In [40]:
df_bank_marketing_from_csv.printSchema()

root
 |-- call_id: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Job: string (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- Education: string (nullable = true)
 |-- Default: boolean (nullable = true)
 |-- Balance: integer (nullable = true)
 |-- Housing: boolean (nullable = true)
 |-- Loan: boolean (nullable = true)
 |-- Contact: string (nullable = true)
 |-- Day: integer (nullable = true)
 |-- Month: string (nullable = true)
 |-- Duration: integer (nullable = true)
 |-- Campaign: integer (nullable = true)
 |-- PDays: integer (nullable = true)
 |-- Previous: integer (nullable = true)
 |-- POutcome: string (nullable = true)
 |-- Deposit: integer (nullable = true)
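
Schema inference needs an extra pass over the file and can guess types differently from one load to the next. As a minimal sketch, the inferred schema above could instead be passed explicitly as a DDL-formatted string, which `DataFrameReader.schema` accepts. The helper name `build_ddl_schema` is hypothetical; the column names and types are copied from the `printSchema()` output.

```python
# Column names and Spark SQL types copied from the printSchema() output above.
BANK_MARKETING_COLUMNS = [
    ("call_id", "STRING"), ("Age", "INT"), ("Job", "STRING"),
    ("MaritalStatus", "STRING"), ("Education", "STRING"), ("Default", "BOOLEAN"),
    ("Balance", "INT"), ("Housing", "BOOLEAN"), ("Loan", "BOOLEAN"),
    ("Contact", "STRING"), ("Day", "INT"), ("Month", "STRING"),
    ("Duration", "INT"), ("Campaign", "INT"), ("PDays", "INT"),
    ("Previous", "INT"), ("POutcome", "STRING"), ("Deposit", "INT"),
]


def build_ddl_schema(columns):
    """Return a DDL-formatted schema string such as '`a` STRING, `b` INT'."""
    # Backticks keep column names safe even if they clash with SQL keywords.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


ddl_schema = build_ddl_schema(BANK_MARKETING_COLUMNS)
print(ddl_schema)

# Hypothetical usage with the session and path defined earlier in the notebook:
# df = spark.read.option("header", "true").schema(ddl_schema).csv(path_to_train_csv)
```

With an explicit schema the `inferSchema` option is no longer needed, and a type mismatch in the file shows up as nulls or a read error rather than a silently different schema.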



In [41]:
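# Spark to BigQuery data types -> https://github.com/GoogleCloudDataproc/spark-bigquery-connector#data-types
# Note: the plain 'int' -> 'int64' replace also rewrites any other occurrence of 'int'
# (e.g. 'bigint' or a column name containing 'int'); it is safe for this schema only.
schema_inline = df_bank_marketing_from_csv.schema.simpleString().replace('struct<', '').replace('>', '').replace('int', 'int64')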
schema_inline

'call_id:string,Age:int64,Job:string,MaritalStatus:string,Education:string,Default:boolean,Balance:int64,Housing:boolean,Loan:boolean,Contact:string,Day:int64,Month:string,Duration:int64,Campaign:int64,PDays:int64,Previous:int64,POutcome:string,Deposit:int64'
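
The string replacement works for this file, but it would also rewrite `int` inside a type such as `bigint` or inside a column name. A more defensive sketch maps each column type through a lookup table. The helper name `to_bq_inline_schema`, the mapping itself, and the column `interest_points` are assumptions made for illustration; the input has the shape of `df.dtypes`, a list of `(column, type)` pairs.

```python
# Assumed mapping from Spark simpleString type names to the type names used in
# the inline schema above; extend it for other types as needed.
SPARK_TO_BQ = {
    "string": "string",
    "boolean": "boolean",
    "tinyint": "int64",
    "smallint": "int64",
    "int": "int64",
    "bigint": "int64",
    "float": "float64",
    "double": "float64",
    "date": "date",
    "timestamp": "timestamp",
}


def to_bq_inline_schema(dtypes):
    """Build 'name:type,...' from (column, spark_type) pairs, e.g. df.dtypes."""
    fields = []
    for name, spark_type in dtypes:
        if spark_type not in SPARK_TO_BQ:
            # Fail loudly instead of emitting a type BigQuery may reject.
            raise ValueError(f"No BigQuery mapping for {name}: {spark_type}")
        fields.append(f"{name}:{SPARK_TO_BQ[spark_type]}")
    return ",".join(fields)


# 'interest_points' is a made-up column that the plain replace would corrupt.
example_dtypes = [
    ("call_id", "string"),
    ("Age", "int"),
    ("Default", "boolean"),
    ("interest_points", "bigint"),
]
print(to_bq_inline_schema(example_dtypes))
```

In the notebook this would be called as `to_bq_inline_schema(df_bank_marketing_from_csv.dtypes)`.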

In [42]:
df_bank_marketing_from_csv.show(5)

+--------------------+---+------+-------------+---------+-------+-------+-------+-----+--------+---+-----+--------+--------+-----+--------+--------+-------+
|             call_id|Age|   Job|MaritalStatus|Education|Default|Balance|Housing| Loan| Contact|Day|Month|Duration|Campaign|PDays|Previous|POutcome|Deposit|
+--------------------+---+------+-------------+---------+-------+-------+-------+-----+--------+---+-----+--------+--------+-----+--------+--------+-------+
|dafaab78-8b86-43d...| 31|admin.|       single|secondary|  false|    410|  false|false|cellular| 23|  apr|     342|       1|   -1|       0| unknown|      2|
|918c328b-b08c-479...| 30|admin.|       single|secondary|  false|    213|  false|false|cellular| 30|  apr|     168|       1|   -1|       0| unknown|      1|
|096ec89c-c034-417...| 34|admin.|      married|secondary|  false|   2984|   true|false|cellular| 20|  apr|      11|       3|   -1|       0| unknown|      1|
|2a47ba0f-5f7f-4b1...| 31|admin.|      married|secondary| 

Run transformations on the data

In [43]:
# Apply any transformations to the data at this point

In [44]:
# Derive the BigQuery dataset name from the active gcloud project ID
project_id = !gcloud config list --format 'value(core.project)' 2>/dev/null
dataset_raw_name = project_id[0] + '-raw'
dataset_raw_name = dataset_raw_name.replace('-', '_')
dataset_raw_name

'datalake_vol2_raw'
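
BigQuery dataset IDs may contain only letters, digits, and underscores, which is why the hyphens are replaced. A small sketch of the same derivation as a reusable function that also handles other invalid characters; the name `raw_dataset_name` and its `suffix` parameter are hypothetical.

```python
import re


def raw_dataset_name(project_id, suffix="raw"):
    """Derive a dataset ID such as 'datalake_vol2_raw' from a project ID."""
    candidate = f"{project_id}_{suffix}"
    # Replace every character that is not a letter, digit, or underscore.
    return re.sub(r"[^A-Za-z0-9_]", "_", candidate)


print(raw_dataset_name("datalake-vol2"))
```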

Create BQ dataset

In [45]:
!bq --location=europe-west3 mk -d \
{dataset_raw_name}

BigQuery error in mk operation: Dataset 'datalake-vol2:datalake_vol2_raw'
already exists.
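
The "already exists" error above is harmless when the notebook is rerun. As a sketch, the same call can be built with the `--force` flag, which according to the bq CLI reference makes `bq mk` exit successfully when the resource already exists. The helper name `bq_mk_dataset_command` is hypothetical, and the flag's behavior should be confirmed against the installed bq version.

```python
import shlex


def bq_mk_dataset_command(dataset, location="europe-west3", force=True):
    """Return the argument list for creating a dataset with the bq CLI."""
    args = ["bq", f"--location={location}", "mk"]
    if force:
        # Assumption: --force turns "already exists" into a successful no-op.
        args.append("--force")
    args += ["--dataset", dataset]
    return args


cmd = bq_mk_dataset_command("datalake_vol2_raw")
print(shlex.join(cmd))

# To execute it outside a notebook cell:
# import subprocess; subprocess.run(cmd, check=True)
```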


In [46]:
# Path (dataset.table) of the BigQuery table to create
bq_table_path = 'datalake_vol2_raw.banking_marketing_train'

In [47]:
!bq mk --table \
{bq_table_path} \
{schema_inline}

BigQuery error in mk operation: Table 'datalake-
vol2:datalake_vol2_raw.banking_marketing_train' could not be created; a table
with this name already exists.


#### Check that table was created

In [48]:
# Read the table back through the BigQuery connector
table = "datalake-vol2:datalake_vol2_raw.banking_marketing_train"
df_bank_marketing_from_bq_table = spark.read \
    .format("bigquery") \
    .option("table", table) \
    .load()

In [49]:
df_bank_marketing_from_bq_table.printSchema()

root
 |-- call_id: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Job: string (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- Education: string (nullable = true)
 |-- Default: boolean (nullable = true)
 |-- Balance: long (nullable = true)
 |-- Housing: boolean (nullable = true)
 |-- Loan: boolean (nullable = true)
 |-- Contact: string (nullable = true)
 |-- Day: long (nullable = true)
 |-- Month: string (nullable = true)
 |-- Duration: long (nullable = true)
 |-- Campaign: long (nullable = true)
 |-- PDays: long (nullable = true)
 |-- Previous: long (nullable = true)
 |-- POutcome: string (nullable = true)
 |-- Deposit: long (nullable = true)



In [50]:
df_bank_marketing_from_bq_table.show()

+--------------------+---+-------------+-------------+---------+-------+-------+-------+-----+---------+---+-----+--------+--------+-----+--------+--------+-------+
|             call_id|Age|          Job|MaritalStatus|Education|Default|Balance|Housing| Loan|  Contact|Day|Month|Duration|Campaign|PDays|Previous|POutcome|Deposit|
+--------------------+---+-------------+-------------+---------+-------+-------+-------+-----+---------+---+-----+--------+--------+-----+--------+--------+-------+
|b7962146-4ab9-4fa...| 33|       admin.|       single|secondary|  false|    285|   true|false| cellular|  1|  apr|     427|       1|  329|       1|   other|      1|
|eb2315f2-6f0c-4b6...| 49|       admin.|      married|  primary|  false|    686|   true| true| cellular|  1|  apr|     286|       2|  225|       4|   other|      1|
|4e62cd36-1db9-4fe...| 34|       admin.|       single|secondary|  false|    528|  false|false| cellular|  1|  jun|     165|       1|  124|       1|   other|      1|
|78f207ba-

In [51]:
# GCS bucket the connector uses to stage data when writing the Spark DataFrame to BigQuery
gcs_bucket = project_id[0] + '-data'
gcs_bucket

'datalake-vol2-data'

In [52]:
# Overwrite the BigQuery table with the DataFrame loaded from the CSV file
df_bank_marketing_from_csv.write \
    .format("bigquery") \
    .option("table", table) \
    .option("temporaryGcsBucket", gcs_bucket) \
    .mode('overwrite') \
    .save()

In [53]:
df_bank_marketing_from_bq_table.show()

+--------------------+---+-------------+-------------+---------+-------+-------+-------+-----+---------+---+-----+--------+--------+-----+--------+--------+-------+
|             call_id|Age|          Job|MaritalStatus|Education|Default|Balance|Housing| Loan|  Contact|Day|Month|Duration|Campaign|PDays|Previous|POutcome|Deposit|
+--------------------+---+-------------+-------------+---------+-------+-------+-------+-----+---------+---+-----+--------+--------+-----+--------+--------+-------+
|d763cf02-26cc-4c9...| 37|  blue-collar|      married|  unknown|  false|    444|  false|false| cellular|  1|  oct|     143|       1|   94|       9|   other|      1|
|562c50b8-7aa5-44d...| 30|self-employed|      married| tertiary|  false|    805|   true|false|  unknown|  1|  sep|      20|       1|  478|       2|   other|      1|
|b7962146-4ab9-4fa...| 33|       admin.|       single|secondary|  false|    285|   true|false| cellular|  1|  apr|     427|       1|  329|       1|   other|      1|
|eb2315f2-

In [54]:
%%bigquery
SELECT *
FROM `datalake-vol2.datalake_vol2_raw.banking_marketing_train`
LIMIT 10



Unnamed: 0,call_id,Age,Job,MaritalStatus,Education,Default,Balance,Housing,Loan,Contact,Day,Month,Duration,Campaign,PDays,Previous,POutcome,Deposit
0,d763cf02-26cc-4c9c-a8f3-ff85a0b15109,37,blue-collar,married,unknown,False,444,False,False,cellular,1,oct,143,1,94,9,other,1
1,562c50b8-7aa5-44de-92b4-beaed66aed07,30,self-employed,married,tertiary,False,805,True,False,unknown,1,sep,20,1,478,2,other,1
2,b7962146-4ab9-4fa7-b143-c37e51896082,33,admin.,single,secondary,False,285,True,False,cellular,1,apr,427,1,329,1,other,1
3,eb2315f2-6f0c-4b63-87da-b081d8d00fde,49,admin.,married,primary,False,686,True,True,cellular,1,apr,286,2,225,4,other,1
4,4e62cd36-1db9-4fee-b75e-125c4aced19a,34,admin.,single,secondary,False,528,False,False,cellular,1,jun,165,1,124,1,other,1
5,78f207ba-0443-4a78-b862-967a5dccafc2,46,admin.,divorced,secondary,False,2087,False,False,cellular,1,jun,111,1,119,4,other,1
6,a0294a19-b604-40ba-8d62-74417e531b32,60,admin.,married,secondary,False,4348,True,False,cellular,1,oct,131,2,98,12,other,1
7,e7951c43-3fe7-4492-b158-6be248ccd10d,41,admin.,married,secondary,False,158,True,False,cellular,1,oct,250,2,120,4,other,1
8,7eae3ca6-b2ef-4502-83d1-d326a620c739,77,retired,married,primary,False,1492,False,False,telephone,1,sep,663,1,208,2,other,1
9,6994d89b-a2d1-418a-ad1b-fc7a02d2ec1c,74,retired,married,primary,False,2894,False,False,telephone,1,sep,97,5,204,2,other,1
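
The query above names the table as `project.dataset.table` in backticks, while the Spark connector's `table` option earlier used the `project:dataset.table` form. A minimal sketch of converting one into the other, so that a single variable can drive both; the helper name `to_standard_sql_ref` is hypothetical.

```python
def to_standard_sql_ref(table):
    """Convert 'project:dataset.table' into '`project.dataset.table`'."""
    project, sep, rest = table.partition(":")
    # Without a colon the reference is already dotted, so keep it unchanged.
    dotted = f"{project}.{rest}" if sep else table
    return f"`{dotted}`"


table = "datalake-vol2:datalake_vol2_raw.banking_marketing_train"
print(f"SELECT * FROM {to_standard_sql_ref(table)} LIMIT 10")
```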


### Compute statistics for columns in table

In [55]:
# describe() reports count, mean, stddev, min and max for numeric and string columns;
# the boolean columns (Default, Housing, Loan) are not included in the summary
df_bank_marketing_from_bq_table.describe().show()

+-------+--------------------+-----------------+-------+-------------+---------+------------------+--------+------------------+-----+------------------+-----------------+------------------+------------------+--------+-------------------+
|summary|             call_id|              Age|    Job|MaritalStatus|Education|           Balance| Contact|               Day|Month|          Duration|         Campaign|             PDays|          Previous|POutcome|            Deposit|
+-------+--------------------+-----------------+-------+-------------+---------+------------------+--------+------------------+-----+------------------+-----------------+------------------+------------------+--------+-------------------+
|  count|               40780|            40780|  40780|        40780|    40780|             40780|   40780|             40780|40780|             40780|            40780|             40780|             40780|   40780|              40780|
|   mean|                null|40.94936243256498|