This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

Improve data saving pipeline #43

Open
4 tasks
mithi opened this issue Oct 27, 2020 · 1 comment

mithi commented Oct 27, 2020

Allow updates in the seed function not just inserts

  • seed should update data if it already exists in the database
    The current seed function, seed/populate.js, only inserts new data. If a record already exists in the database, it is not updated.
    Because of this, db:reset currently drops and recreates the user, database, and tables, then populates all the data all over again. This is VERY inefficient. https://github.com/mithi/kingdom-rush-graphql/blob/main/src/seed
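
One possible way to support updates is a PostgreSQL upsert (`INSERT ... ON CONFLICT ... DO UPDATE`). The sketch below only builds the parameterized SQL string. `buildUpsertSql` is a hypothetical helper, the table and column names are illustrative and not taken from the actual schema, and it assumes a unique constraint exists on the conflict columns.

```typescript
// Hypothetical helper (not part of seed/populate.js): builds a parameterized
// PostgreSQL upsert statement. Assumes a unique constraint on conflictColumns.
function buildUpsertSql(
    table: string,
    columns: string[],
    conflictColumns: string[]
): string {
    // Double-quote identifiers and escape embedded quotes.
    const quote = (id: string): string => `"${id.replace(/"/g, '""')}"`

    // $1, $2, ... one placeholder per column, in column order.
    const placeholders = columns.map((_, i) => `$${i + 1}`).join(", ")

    // Every column that is not part of the conflict target gets overwritten
    // with the incoming value (EXCLUDED refers to the row proposed for insert).
    const updates = columns
        .filter(c => conflictColumns.indexOf(c) === -1)
        .map(c => `${quote(c)} = EXCLUDED.${quote(c)}`)
        .join(", ")

    const target = conflictColumns.map(quote).join(", ")
    const action = updates.length > 0 ? `DO UPDATE SET ${updates}` : "DO NOTHING"

    return (
        `INSERT INTO ${quote(table)} (${columns.map(quote).join(", ")}) ` +
        `VALUES (${placeholders}) ON CONFLICT (${target}) ${action}`
    )
}
```

With something like this, re-running the seed would overwrite changed rows instead of failing or requiring a full drop and recreate. The values themselves would still be passed separately as query parameters.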

Refactor python scripts that generate json.

Add new scripts for seeding

  • Write a python script to convert generated json to csv

Update pipeline

  • Update data saving pipeline

The current pipeline is like this:

  1. update yaml
  2. python scripts generate json file from yaml
  3. Populate the database using the json files with populate.js
  4. Populate the csv

It might be so much faster if the pipeline is like this:

  1. update yaml
  2. python scripts generate json files from yaml
  3. use python script to generate csv files from json files
  4. populate the database using the csv with the psql command COPY
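
Steps 3 and 4 of the proposed pipeline could look roughly like the sketch below. It is written in TypeScript only to match the snippet in this thread; the proposed python script would follow the same logic. `toCsv` and `copyCommand` are hypothetical helpers, and the table, column, and file names are illustrative assumptions.

```typescript
type Row = Record<string, string | number | null>

// Serialize one field. In PostgreSQL's CSV format an unquoted empty field is
// read as NULL by default, so empty strings are quoted to keep them distinct.
function csvField(value: string | number | null | undefined): string {
    if (value === null || value === undefined) return ""
    const text = String(value)
    if (text === "" || /[",\r\n]/.test(text)) {
        return `"${text.replace(/"/g, '""')}"`
    }
    return text
}

// Turn parsed json rows into CSV text with a header line.
function toCsv(rows: Row[], columns: string[]): string {
    const header = columns.map(csvField).join(",")
    const body = rows.map(row => columns.map(c => csvField(row[c])).join(","))
    return [header].concat(body).join("\n") + "\n"
}

// Build the psql meta-command that loads the file client-side.
// Mixed-case column names would need to be double-quoted by the caller.
function copyCommand(table: string, columns: string[], csvPath: string): string {
    return `\\copy ${table} (${columns.join(", ")}) FROM '${csvPath}' WITH (FORMAT csv, HEADER true)`
}
```

The resulting command string is meant to be run inside psql (for example via `psql -c`). `\copy` reads the file from the client machine, while a plain `COPY ... FROM 'file'` reads it on the database server.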

mithi commented Nov 2, 2020

optimize: Bulk insert into tables

We currently insert one entry into the database at a time.
To be more efficient, we should instead try to insert in bulk.

EDIT: Converting to a bulk insert did not speed up the process significantly. The effect is negligible.

const populateTowers = async ({ dbName = "default", verbose = true } = {}) => {
    const count = await getRepository(Tower, dbName).count()
    console.log("towers in this database:", count)
    if (count !== 0) {
        throw new Error(
            `There are already towers in this database. Number of towers: ${count}`
        )
    }

    const towersData: TowerData[] = (towerJson as any).towers
    const newTowers = towersData.map(towerData => {
        if (verbose) {
            console.log("...")
            console.log(towerData.name, ",", towerData.kingdom)
        }
        return {
            name: towerData.name,
            towerType: mapStringToTowerType[towerData.towerType],
            level: mapIntToLevel[towerData.level],
            kingdom: mapStringToKingdom[towerData.kingdom],
        }
    })

    await getConnection(dbName)
        .createQueryBuilder()
        .insert()
        .into(Tower)
        .values(newTowers)
        .execute()
}

@mithi mithi changed the title Refactor python scripts that generate json Refactor python scripts that generate json etal Nov 2, 2020
@mithi mithi changed the title Refactor python scripts that generate json etal Refactor python scripts that generate json et al Nov 2, 2020
@mithi mithi changed the title Refactor python scripts that generate json et al Refactor python scripts that generate json and update data saving pipeline Nov 2, 2020
@mithi mithi changed the title Refactor python scripts that generate json and update data saving pipeline Improve data saving pipeline Nov 2, 2020
mithi added a commit that referenced this issue Nov 3, 2020
…data pipeline

1. Tower names, ability names, and descriptions are now all stored in lowercase in the database.
The graphql query arguments are converted to lowercase before we use them to query
the database. Updated all tests accordingly.
Solves: #79

2. Fix data saving pipeline
- Improve code format of scripts
- Drop database and user if they exist before creating them
- The data saving pipeline was incorrect: we couldn't resave the data, since the current `seed/populate.js` only allows for `inserts` and not `updates`. In the future, I might change this, but since there isn't that much data anyway, the current implementation is that the user is dropped and recreated and all the data are populated all over again
#43
@mithi mithi removed the low priority label Nov 5, 2020