# ETL-json-to-SQL

Extract data from a JSON file, transform it, and load it into PostgreSQL.

## Description

This repository contains two solutions for extracting, transforming and loading data into PostgreSQL: one written in Node.js and the other in Go.

The focus is on loading the data in a performant way (via streaming) and, after transforming it, saving it to PostgreSQL with bulk inserts.

In the Node.js solution I used multi-row inserts; I couldn't get the COPY-based approach with pg-copy-streams to work.
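
As an illustration of the multi-row insert technique, here is a hypothetical statement builder (shown in Go for consistency with the other examples here; the actual Node.js code differs) that packs many rows into one `INSERT`:

```go
package main

import (
	"fmt"
	"strings"
)

// buildMultiRowInsert produces a single INSERT statement with one
// placeholder group per row, e.g. ($1,$2),($3,$4), so many rows reach
// PostgreSQL in one round trip instead of one INSERT per row.
func buildMultiRowInsert(table string, columns []string, rowCount int) string {
	groups := make([]string, rowCount)
	p := 1
	for i := range groups {
		ph := make([]string, len(columns))
		for j := range ph {
			ph[j] = fmt.Sprintf("$%d", p)
			p++
		}
		groups[i] = "(" + strings.Join(ph, ",") + ")"
	}
	return fmt.Sprintf("INSERT INTO %s(%s) VALUES %s",
		table, strings.Join(columns, ","), strings.Join(groups, ","))
}

func main() {
	fmt.Println(buildMultiRowInsert("matchups", []string{"champion", "kda"}, 2))
	// INSERT INTO matchups(champion,kda) VALUES ($1,$2),($3,$4)
}
```

The flattened row values are then bound to the placeholders when the statement is executed.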

In the Go solution I was able to use the COPY command that PostgreSQL supports.

## Dataset

The file used for running the ETL pipeline was taken from here. If it's not available, you can also find it in this Google Drive. The default path the file is loaded from is `./data-to-load`, but you can specify another one by setting the `FILE_PATH` environment variable.

Number of records inside `matchups.json`: 1,312,252.

This is the original record format:

```json
{
  "p_match_id": "TR1_1201957752_top",
  "goldearned": 14425,
  "totalminionskilled": 194,
  "win": "false",
  "kills": 14,
  "assists": 5,
  "deaths": 7,
  "champion": "Kassadin",
  "visionscore": 17,
  "puuid": "phduyQLB8gBjUerFwiVOtyLLHE9jxw7Jq7dwab_CtRddAvzJ7L1uo5kWzLTKSqStAzml_3yGHiNPFA",
  "totaldamagedealttochampions": 33426,
  "summonername": "Borke",
  "gameversion": "11.14.384.6677"
}
```

and we load it into PostgreSQL as:

```sql
CREATE TABLE matchups (
  id UUID NOT NULL DEFAULT uuid_generate_v4() PRIMARY KEY,
  gold_earned INTEGER,
  minions_killed INTEGER,
  kda INTEGER,
  champion VARCHAR,
  vision_score INTEGER,
  summoner_name VARCHAR,
  win BOOLEAN,
  game_version VARCHAR,
  damage_dealt_to_champions INTEGER,
  lane VARCHAR,
  region VARCHAR
);
```
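
The `lane`, `region` and `kda` columns have no direct counterpart in the raw record, so they must come out of the transform step. A minimal Go sketch follows; note that splitting `p_match_id` into region and lane, and the `(kills + assists) / deaths` formula, are assumptions based on the sample record (`"TR1_1201957752_top"`) and the `INTEGER` kda column — the repository may compute them differently:

```go
package main

import (
	"fmt"
	"strings"
)

// transform derives the extra columns from a raw record.
// Assumed: p_match_id is "<region>_<game id>_<lane>", and kda is the
// integer division (kills + assists) / deaths.
func transform(pMatchID string, kills, assists, deaths int) (region, lane string, kda int) {
	parts := strings.Split(pMatchID, "_")
	if len(parts) == 3 {
		region, lane = parts[0], parts[2]
	}
	if deaths == 0 {
		deaths = 1 // avoid division by zero on deathless games
	}
	kda = (kills + assists) / deaths
	return
}

func main() {
	region, lane, kda := transform("TR1_1201957752_top", 14, 5, 7)
	fmt.Println(region, lane, kda) // TR1 top 2
}
```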

## Queries

Here are some queries you can run after loading the data:

```sql
-- Get the gold earned for each lane
SELECT sum(gold_earned) AS total_gold, lane
FROM matchups
GROUP BY lane
ORDER BY total_gold DESC;

-- Get the champions and number of games with the average kda
SELECT count(*) AS games, champion, round(AVG(kda), 0) AS avg_kda
FROM matchups
GROUP BY champion
ORDER BY avg_kda DESC;

-- Get the number of unique records in the table
SELECT COUNT(*)
FROM (
  SELECT DISTINCT *
  FROM matchups
) AS unique_rows;
```

## Useful commands

Insert data into Postgres from a CSV file:

```sql
\copy matchups(champion,damage_dealt_to_champions,game_version,gold_earned,win,minions_killed,kda,lane,region,summoner_name,vision_score) from '/usr/log.csv' (FORMAT csv, DELIMITER ',');
```
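
The CSV fed to `\copy` has to match that column order exactly. A minimal Go sketch producing such rows (the values here are taken from the sample record, with the assumed kda/lane/region derivations):

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// writeMatchupCSV emits rows in the same column order as the \copy
// command above: champion, damage_dealt_to_champions, game_version,
// gold_earned, win, minions_killed, kda, lane, region, summoner_name,
// vision_score.
func writeMatchupCSV(rows [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.WriteAll(rows); err != nil { // WriteAll also flushes
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, _ := writeMatchupCSV([][]string{
		{"Kassadin", "33426", "11.14.384.6677", "14425", "false",
			"194", "2", "top", "TR1", "Borke", "17"},
	})
	fmt.Print(out)
}
```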

Connect to the PostgreSQL container:

```shell
docker exec -it postgres psql -U ETL_user -d ETL_db
```

## Running the project

To run the ETL in either version:

- you need to have set up the `.env` file or provide the correct environment variables
- Postgres must be running; a `docker-compose.yaml` is provided in the repo, and you can start it with `make docker-start`

### Node

- Build the code: `make node-build`
- Run the code (after building): `make node-run`

### Golang

- Build the code: `make go-build`
- Run the code (after building): `make go-run`
- Run the linter: `make go-lint`

## Resources
