## Ingestion Step

**Input**: tblName -> name of the table to ingest into HDFS, executionDate -> date used to partition the data in the HDFS data lake <br>
**Output**: the data in the HDFS data lake is updated
1. Load data from the **tblName** table in PostgreSQL.
2. Update the data in the **tblName** folder with the steps below:  
 - 2.1 Get the latest record_id in the data lake (if the **tblName** folder isn't empty)
 - 2.2 Get the latest records in PostgreSQL
 - 2.3 Append the PostgreSQL records that come after the latest record_id in the HDFS data lake
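
The update logic in step 2 can be illustrated without Spark. The following is a minimal sketch, assuming rows are plain dictionaries with an integer `id` key; `incremental_append` is a hypothetical helper used only for illustration and is not part of this notebook.

```python
import builtins  # the notebook imports pyspark's max, so the built-in is referenced explicitly


def incremental_append(source_rows, lake_rows):
    """Return lake_rows plus every source row whose id is newer than the lake's largest id."""
    # 2.1: latest id already stored in the data lake (None when the lake is empty)
    last_id = builtins.max((row["id"] for row in lake_rows), default=None)
    # 2.2: rows in the source that are newer than last_id
    if last_id is None:
        new_rows = list(source_rows)
    else:
        new_rows = [row for row in source_rows if row["id"] > last_id]
    # 2.3: append only the new rows
    return lake_rows + new_rows


source = [{"id": 1}, {"id": 2}, {"id": 3}]
lake = [{"id": 1}]
updated = incremental_append(source, lake)
```

Running the same call twice on the updated lake adds nothing, which is the property the incremental load relies on to avoid duplicate rows.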

### Import Necessary Libraries

In [76]:
import pyspark
from pyspark import SparkContext, SQLContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, max
import sys
import subprocess

### Receive 2 arguments: tblName, executionDate

In [77]:
tblName = input("Name of the PostgreSQL table to load into HDFS: ")
executionDate = input("Execution date (YYYY-MM-DD) used to partition the data in HDFS: ")

In [78]:
executionDate

'2023-07-25'

In [79]:
runTime = executionDate.split("-")
year = runTime[0]
month = runTime[1]
day = runTime[2]
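
Splitting on `-` accepts any string, so a malformed date would only surface later as an odd partition name. A minimal sketch of a stricter alternative, assuming the date is always given as `YYYY-MM-DD`; `split_execution_date` is a hypothetical helper name.

```python
from datetime import datetime


def split_execution_date(execution_date):
    """Validate a YYYY-MM-DD string and return (year, month, day) as zero-padded strings."""
    # strptime raises ValueError when the string is not a valid calendar date
    parsed = datetime.strptime(execution_date, "%Y-%m-%d")
    return parsed.strftime("%Y"), parsed.strftime("%m"), parsed.strftime("%d")


year, month, day = split_execution_date("2023-07-25")
```

The returned values are strings, so they can be passed to `lit()` in the same way as the values produced by `split("-")`.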

### Load data from tblName table in PostgreSQL

In [80]:
# create spark session
spark = pyspark.sql.SparkSession \
   .builder \
   .appName("Ingestion - from Postgres to HDFS") \
   .config('spark.driver.extraClassPath', "postgresql-42.6.0.jar") \
   .getOrCreate()

In [81]:
# read table from db using spark jdbc
df = spark.read \
   .format("jdbc") \
   .option("url", "jdbc:postgresql://localhost:5432/my_company") \
   .option("dbtable", tblName) \
   .option("user", "postgres") \
   .option("password", "loc//14122000") \
   .option("driver", "org.postgresql.Driver") \
   .load()
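
The connection settings above are written directly into the cell. One possible alternative, sketched here under the assumption that the credentials are supplied through environment variables, builds the options once and reuses them. The variable names `PG_URL`, `PG_USER`, and `PG_PASSWORD` and the helper `jdbc_options` are hypothetical.

```python
import os


def jdbc_options(table, env=os.environ):
    """Build the options dict for spark.read.format("jdbc").options(**opts)."""
    return {
        "url": env.get("PG_URL", "jdbc:postgresql://localhost:5432/my_company"),
        "dbtable": table,
        "user": env.get("PG_USER", "postgres"),
        "password": env["PG_PASSWORD"],  # no default: raises KeyError if unset
        "driver": "org.postgresql.Driver",
    }


# Usage with the notebook's session (not executed here):
# df = spark.read.format("jdbc").options(**jdbc_options(tblName)).load()
```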

In [82]:
df.show(10)

+---+------+-----------+--------+-------+
| id| total|    payment|order_id|user_id|
+---+------+-----------+--------+-------+
|  1|710051|credit_card|       1| 209279|
|  2|375643|       cash|       2| 242546|
|  3|975362|       cash|       3| 135215|
|  4|417644|credit_card|       4| 111433|
|  5|481473|credit_card|       5|  44346|
|  6|389161| instalment|       6| 112586|
|  7|376682|credit_card|       7| 133477|
|  8|551975|credit_card|       8| 232025|
|  9|263441|       cash|       9| 177652|
| 10|908849|       cash|      10| 179390|
+---+------+-----------+--------+-------+
only showing top 10 rows


### Update data in tblName folder in HDFS

#### Get the latest record_id in the data lake (if the **tblName** folder isn't empty)

In [83]:
# function to interact with hdfs storage
def run_cmd(args_list):
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err

In [84]:
tblLocation = f'hdfs://localhost:9000/datalake/{tblName}'

In [85]:
# check whether the folder exists:
# `-du -s` prints size columns plus the path only when the folder is found
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-du', '-s', tblLocation])
exists = len(out.decode().split()) > 1
print(exists)

Running system command: hdfs dfs -du -s hdfs://localhost:9000/datalake/order_detail


False
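
Parsing the output of `-du -s` works, but the HDFS shell also provides `hdfs dfs -test -d <path>`, which reports through its exit status whether a directory exists. A minimal sketch, assuming the `hdfs` command is on the `PATH`; `hdfs_dir_exists` and its `runner` parameter are hypothetical names, with `runner` included only so the function can be exercised without a cluster.

```python
import subprocess


def hdfs_dir_exists(path, runner=subprocess.run):
    """Return True when `hdfs dfs -test -d path` exits with status 0."""
    result = runner(["hdfs", "dfs", "-test", "-d", path])
    return result.returncode == 0


# Usage in the notebook (not executed here):
# exists = hdfs_dir_exists(tblLocation)
```

Note that an existing but empty folder also counts as existing, so reading Parquet from it can still fail; that case would need separate handling.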


In [86]:
tblQuery = ""
if exists:
    datalake_df = spark.read.format('parquet').load(tblLocation)
    record_id = datalake_df.agg(max("id")).head()[0]
    # dbtable accepts a parenthesized subquery with an alias
    tblQuery = f"(SELECT * FROM {tblName} WHERE id > {record_id}) AS tmp"
else:
    tblQuery = f"(SELECT * FROM {tblName}) AS tmp"

In [87]:
tblQuery

'(SELECT * FROM order_detail) AS tmp'

#### Get the latest records in PostgreSQL

In [88]:
# pass the subquery as dbtable so that only the new records are read;
# the JDBC source ignores a path argument given to load()
jdbc_df = spark.read \
   .format("jdbc") \
   .option("url", "jdbc:postgresql://localhost:5432/my_company") \
   .option("dbtable", tblQuery) \
   .option("user", "postgres") \
   .option("password", "loc//14122000") \
   .option("driver", "org.postgresql.Driver") \
   .load()

In [89]:
jdbc_df.show(5)

+---+------+-----------+--------+-------+
| id| total|    payment|order_id|user_id|
+---+------+-----------+--------+-------+
|  1|710051|credit_card|       1| 209279|
|  2|375643|       cash|       2| 242546|
|  3|975362|       cash|       3| 135215|
|  4|417644|credit_card|       4| 111433|
|  5|481473|credit_card|       5|  44346|
+---+------+-----------+--------+-------+
only showing top 5 rows
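
Spark's JDBC `dbtable` option accepts either a table name or a parenthesized subquery with an alias. The following is a minimal sketch of a helper that builds such a subquery for the incremental read, assuming `id` is an integer column; `build_dbtable` is a hypothetical name. It also covers the case where the aggregate returns `None` because the data lake folder holds no rows.

```python
def build_dbtable(table, last_id=None):
    """Return a parenthesized subquery usable as the JDBC `dbtable` option."""
    if last_id is None:
        # nothing ingested yet: read the whole table
        return f"(SELECT * FROM {table}) AS tmp"
    # int() rejects non-numeric values before they are interpolated into the SQL
    return f"(SELECT * FROM {table} WHERE id > {int(last_id)}) AS tmp"


full_load = build_dbtable("order_detail")
incremental = build_dbtable("order_detail", 10)
```

The table name is still interpolated as-is, so it should come from a trusted source rather than free-form user input.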


#### Append the PostgreSQL records that come after the latest record_id in the HDFS data lake

In [90]:
output_df = jdbc_df.withColumn("year", lit(year)).withColumn("month", lit(month)).withColumn("day", lit(day))
output_df.write.partitionBy("year", "month", "day").mode("append").parquet(tblLocation)
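
`partitionBy` writes each combination of partition values into a nested `column=value` directory under the target location. As a small illustration of the resulting layout, the sketch below builds the directory for one execution date; `partition_path` is a hypothetical helper and does not touch HDFS.

```python
def partition_path(table_location, year, month, day):
    """Return the directory that partitionBy("year", "month", "day") writes into."""
    return f"{table_location}/year={year}/month={month}/day={day}"


path = partition_path("hdfs://localhost:9000/datalake/order_detail", "2023", "07", "25")
```

When the folder is read back, Spark by default infers the types of partition columns from these directory names, so a value such as `07` may come back as the integer `7`.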