Improve data saving pipeline #43

mithi · 2020-10-27T05:37:55Z

Allow updates in the `seed` function not just `inserts`

`seed` should update data if it already exists in the database.
The current seed function only inserts new data. If a record already exists, it is not updated: `seed/populate.js` supports inserts but not updates.
The current implementation of `db:reset` drops and recreates the user, database, and tables, then populates all the data all over again. This is VERY inefficient. https://github.com/mithi/kingdom-rush-graphql/blob/main/src/seed
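
One way to let the seed step update existing rows is an upsert (`INSERT ... ON CONFLICT ... DO UPDATE`), which PostgreSQL supports. The sketch below is only an illustration, not the project's seed code: it is written in Python against the standard-library `sqlite3` module so that it runs without a database server (SQLite accepts the same clause). The `towers` table, its columns, the `(name, kingdom)` uniqueness key, and the sample row are all assumptions.

```python
import sqlite3

# Hypothetical schema: the real project uses PostgreSQL and its own columns.
SCHEMA = """
CREATE TABLE towers (
    name TEXT NOT NULL,
    kingdom TEXT NOT NULL,
    tower_type TEXT NOT NULL,
    level INTEGER NOT NULL,
    UNIQUE (name, kingdom)
)
"""

# Insert a row, or update its non-key columns if (name, kingdom) already exists.
UPSERT = """
INSERT INTO towers (name, kingdom, tower_type, level)
VALUES (:name, :kingdom, :tower_type, :level)
ON CONFLICT (name, kingdom) DO UPDATE SET
    tower_type = excluded.tower_type,
    level = excluded.level
"""


def seed_towers(conn, towers):
    """Upsert every tower dict; can be called again with changed data."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, towers)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Made-up sample row, seeded twice with a different level.
    row = {"name": "archer tower", "kingdom": "kr", "tower_type": "archer", "level": 1}
    seed_towers(conn, [row])
    seed_towers(conn, [dict(row, level=2)])  # updates instead of failing
    return conn.execute("SELECT name, level FROM towers").fetchall()
```

With an upsert like this, re-running the seed would not require dropping and recreating the tables first.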

Refactor python scripts that generate json.

Refactor the python scripts because they contain a lot of duplicate code.
https://github.com/mithi/kingdom-rush-graphql/tree/main/scripts

Add new scripts for seeding

Write a python script to convert generated json to csv

Update pipeline

Update data saving pipeline

The current pipeline is like this:

1. Update yaml
2. Python scripts generate json files from yaml
3. Populate the database using the json files with populate.js
4. Populate the csv

It might be so much faster if the pipeline is like this:

1. Update yaml
2. Python scripts generate json files from yaml
3. Use a python script to generate csv files from the json files
4. Populate the database using the csv files with the psql command COPY
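
The JSON-to-CSV step of the proposed pipeline could look roughly like the following sketch. It assumes each generated JSON file holds a list of flat objects under one top-level key; the function name, key, field names, and file paths are placeholders, not the project's actual ones.

```python
import csv
import json


def json_to_csv(json_path, csv_path, key, fields):
    """Write the list stored under `key` in a JSON file to a CSV file.

    Only the columns named in `fields` are written; other keys are ignored.
    Returns the number of data rows written.
    """
    with open(json_path, encoding="utf-8") as f:
        rows = json.load(f)[key]
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A file produced this way could then be loaded with something like `COPY towers (name, kingdom) FROM '/path/to/towers.csv' WITH (FORMAT csv, HEADER true);`, or with psql's `\copy` when the file lives on the client machine. The table and column names there are again assumptions.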


mithi · 2020-11-02T11:12:37Z

optimize: Bulk insert into tables

We currently insert one entry into the database at a time.
To be more efficient, we should instead try to insert in bulk.

EDIT: Converting to bulk inserts did not speed up the process significantly. The effect is negligible.

import { getConnection, getRepository } from "typeorm"
// Tower, TowerData, towerJson, and the map* lookup tables come from the project's own modules.

const populateTowers = async ({ dbName = "default", verbose = true } = {}) => {
    // Refuse to seed a table that already has rows (inserts only, no updates).
    const count = await getRepository(Tower, dbName).count()
    console.log("towers in this database:", count)
    if (count !== 0) {
        throw new Error(
            `There are already towers in this database. Number of towers: ${count}`
        )
    }

    const towersData: TowerData[] = (towerJson as any).towers
    // Map each raw JSON entry to the column values of a Tower row.
    const newTowers = towersData.map(towerData => {
        if (verbose) {
            console.log("...")
            console.log(towerData.name, ",", towerData.kingdom)
        }
        return {
            name: towerData.name,
            towerType: mapStringToTowerType[towerData.towerType],
            level: mapIntToLevel[towerData.level],
            kingdom: mapStringToKingdom[towerData.kingdom],
        }
    })

    // Insert all rows with one bulk INSERT instead of one query per row.
    await getConnection(dbName)
        .createQueryBuilder()
        .insert()
        .into(Tower)
        .values(newTowers)
        .execute()
}

…data pipeline

1. Tower names, ability names, and descriptions are now all stored in lowercase in the database. The graphql query arguments will be converted to lowercase first before we use them to query the database. Updated all tests accordingly. Solves: #79
2. Fix data saving pipeline
- Improve code format of scripts
- Drop database and user if they exist before creating them
- The data saving pipeline was incorrect: we won't be able to resave the data since the current `seed/populate.js` only allows for `inserts` and not `updates`. In the future, I might change this, but since the data isn't that much anyway, the current implementation is that the user is dropped and recreated and all the data are populated all over again #43

mithi added the low priority label Oct 29, 2020

mithi changed the title ~~Refactor python scripts that generate json~~ Refactor python scripts that generate json etal Nov 2, 2020

mithi changed the title ~~Refactor python scripts that generate json etal~~ Refactor python scripts that generate json et al Nov 2, 2020

mithi changed the title ~~Refactor python scripts that generate json et al~~ Refactor python scripts that generate json and update data saving pipeline Nov 2, 2020

mithi changed the title ~~Refactor python scripts that generate json and update data saving pipeline~~ Improve data saving pipeline Nov 2, 2020

mithi mentioned this issue Nov 3, 2020

Data would all be lowercase + fix setup + update data pipeline #86

Merged

mithi removed the low priority label Nov 5, 2020
