UserTransactions_DataPipeline

Objective

To solve some common data-pipeline use cases and implement SCD Type 1 (SCD1) in Hive.

Scenario

You are given data stored in multiple CSV files describing user transactions.

  1. Load the data from the CSV files into a SQL table.
  2. Using a Sqoop job (incremental load), move the data from SQL to HDFS.
  3. In Hive, first load the data from HDFS into a managed table, then load it into an external table partitioned year-wise and then month-wise, implementing SCD1 along the way (see the HiveQL sketch after this list).
  4. Finally, load the data back into another SQL table to cross-verify the data from source to destination.
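As a rough sketch of the Hive layer in step 3 (all table names are illustrative and the column types are guessed from the CSV header, not taken from the repo): the consolidated managed table is created as a transactional ORC table so that SCD1 updates can be applied, and the external table is partitioned by year and then month. For step 2, a Sqoop job would typically drive the incremental load with `--incremental append` plus `--check-column`/`--last-value` on a monotonically increasing column.

```sql
-- Managed, transactional table: ACID operations (needed for SCD1 updates)
-- work on managed ORC tables, not on external tables.
CREATE TABLE user_transactions_master (
  custid INT, username STRING, quote_count INT, ip STRING,
  entry_time TIMESTAMP, prp_1 STRING, prp_2 STRING, prp_3 STRING,
  ms BIGINT, http_type STRING, purchase_category STRING, total_count INT,
  purchase_sub_category STRING, http_info STRING, status_code INT
)
CLUSTERED BY (custid) INTO 4 BUCKETS   -- bucketing is required for ACID on Hive 2.x
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- External table, partitioned year-wise and then month-wise,
-- rebuilt from the managed table once SCD1 has been applied.
CREATE EXTERNAL TABLE user_transactions_ext (
  custid INT, username STRING, quote_count INT, ip STRING,
  entry_time TIMESTAMP, prp_1 STRING, prp_2 STRING, prp_3 STRING,
  ms BIGINT, http_type STRING, purchase_category STRING, total_count INT,
  purchase_sub_category STRING, http_info STRING, status_code INT
)
PARTITIONED BY (txn_year INT, txn_month INT)
STORED AS ORC
LOCATION '/user/hive/warehouse/external/user_transactions_ext';
```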

Input

Three CSVs are given, each stored with the following header:

"custid","username","quote_count","ip","entry_time","prp_1","prp_2","prp_3","ms","http_type","purchase_category","total_count","purchase_sub_category","http_info","status_code"

My Approach

  • Using Python, load the data into two SQL tables: one for validation and one for moving the data from SQL to HDFS (via the Sqoop job).
  • Using the Sqoop job, load the data from the SQL table into HDFS.
  • In Hive, create two managed tables:
    one for loading each CSV file, truncated after its data has been moved on;
    another to store the entire dataset with SCD1 implemented (since ACID properties cannot be applied to an external table).
  • Overwrite the external table from the managed table on which SCD1 was implemented (see the HiveQL sketch after this list).
  • Load the data from the managed table that is truncated after every file.
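Below is a hedged sketch of the SCD1 merge and the final overwrite into the external table, reusing the illustrative names from the DDL sketch above. `user_transactions_stg` stands for the per-file managed staging table, and `custid` is assumed to be the business key; Hive's MERGE statement (available from Hive 2.2 with an ACID target) performs the type-1 overwrite.

```sql
-- SCD Type 1: overwrite the attributes of an existing custid with the
-- newest values, and insert rows for custids not seen before.
MERGE INTO user_transactions_master AS t
USING user_transactions_stg AS s
ON t.custid = s.custid
WHEN MATCHED THEN UPDATE SET
  username = s.username, quote_count = s.quote_count, ip = s.ip,
  entry_time = s.entry_time, prp_1 = s.prp_1, prp_2 = s.prp_2,
  prp_3 = s.prp_3, ms = s.ms, http_type = s.http_type,
  purchase_category = s.purchase_category, total_count = s.total_count,
  purchase_sub_category = s.purchase_sub_category,
  http_info = s.http_info, status_code = s.status_code
WHEN NOT MATCHED THEN INSERT VALUES (
  s.custid, s.username, s.quote_count, s.ip, s.entry_time, s.prp_1,
  s.prp_2, s.prp_3, s.ms, s.http_type, s.purchase_category,
  s.total_count, s.purchase_sub_category, s.http_info, s.status_code
);

-- Truncate the staging table once its file has been merged.
TRUNCATE TABLE user_transactions_stg;

-- Rebuild the external table from the consolidated managed table,
-- deriving the year and month partitions from entry_time.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE user_transactions_ext
PARTITION (txn_year, txn_month)
SELECT m.*, year(m.entry_time), month(m.entry_time)
FROM user_transactions_master m;
```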
