Import Data #176

Open
3 tasks
benloh opened this issue Nov 18, 2021 · 20 comments

Comments

@benloh
Collaborator

benloh commented Nov 18, 2021

For the Feb 2022 pilots/tests, we want to prioritize:

  1. Importing nodes/edges to an existing database

The node/edge data might be created from scratch or created by first exporting existing nodes/edges. (e.g. Main Use Model and Secondary Use Model, above)

  2. Importing nodes/edges to a NEW database

This may be addressed in the future. It will not be implemented at this moment as there is a workaround via manual template editing and nc-multiplex.

To Do

  • Node validation -- replace existing nodes
  • Edge validation -- replace existing edges if the same id exists
  • Make sure tables and filters are updated after import
@benloh
Collaborator Author

benloh commented Dec 21, 2021

@jdanish @kalanicraig I have some questions about how to handle imports.

Currently NetCreate is designed to work with a single starting database. When you start up the app, you have to specify a database file (e.g. ./nc.js --dataset=tacitus).

So when you import new data, do you intend to:

a) Add new records to the existing database? And if there is an existing record, do you want to overwrite it?

or

b) Replace all existing records in the current database with the imported records?

or

c) Create a new database with the imported records, giving the database a new name (or really starting with a new empty database)?

Each one of the three would require a slightly different use model and workflow. Or do you need to support all three different use models?
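
To make the difference between (a) and (b) concrete, here is a minimal sketch. It assumes records are plain objects with a unique `id`; the function names `mergeRecords` and `replaceRecords` are hypothetical and are not part of NetCreate. Option (c) is not shown because it needs a new database rather than a different merge rule.

```javascript
// Hypothetical helpers; records are assumed to be plain objects keyed by `id`.

// (a) Merge: add new records and overwrite existing ones that share an id.
function mergeRecords(existing, imported) {
  const byId = new Map(existing.map(r => [r.id, r]));
  for (const rec of imported) byId.set(rec.id, rec);
  return [...byId.values()];
}

// (b) Replace: discard everything currently in the database.
function replaceRecords(existing, imported) {
  return imported.map(r => ({ ...r }));
}

const existing = [{ id: 1, label: "A" }, { id: 2, label: "B" }];
const imported = [{ id: 2, label: "B2" }, { id: 3, label: "C" }];

const merged = mergeRecords(existing, imported);     // ids 1, 2 (overwritten), 3
const replaced = replaceRecords(existing, imported); // ids 2, 3 only
```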

@kalanicraig
Collaborator

kalanicraig commented Dec 22, 2021 via email

@kalanicraig
Collaborator

kalanicraig commented Dec 22, 2021 via email

@benloh
Collaborator Author

benloh commented Dec 22, 2021

  1. Actually A) and B) are relatively easy to do. C) is much harder because of the way the system is currently designed. We'd need to build a new framework for creating and loading a database AFTER the app has already started. It's possible it'll be more straightforward to build it as a new parameter when starting the app from the command line, e.g.:
    ./nc.js --import_nodes=tacitus_nodes.csv --import_edges=tacitus_edges.csv

If C) is the priority, then importing is much more complex.

  2. One more question: Which fields are required and which are optional when importing? Right now we are blindly requiring ALL of the fields in the import data table in order to be valid:
// For Nodes
id,label,attributes:Node_Type,attributes:Extra Info,attributes:Notes,degrees,meta:created,meta:updated

// For Edges
id,source,target,attributes:Relationship,attributes:Info,attributes:Citations,attributes:Category,attributes:Notes,meta:created,meta:updated

For any given record, you can have empty fields, but we expect the table format to have all of these fields defined. I'm guessing you probably need more flexibility than that?
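
A more flexible check could separate required from optional headers. The sketch below is only an illustration: the split (only `id` and `label` required) is an assumption and has not been decided in this thread, `checkHeaders` is a hypothetical name, and the naive comma split does not handle quoted header names.

```javascript
// Assumption: only `id` and `label` are required for nodes (illustrative only).
const NODE_REQUIRED = ["id", "label"];
const NODE_OPTIONAL = [
  "attributes:Node_Type", "attributes:Extra Info", "attributes:Notes",
  "degrees", "meta:created", "meta:updated"
];

// Hypothetical helper: report missing required headers and unrecognized ones.
function checkHeaders(headerLine, required, optional) {
  const headers = headerLine.split(",").map(h => h.trim());
  const missing = required.filter(h => !headers.includes(h));
  const known = new Set([...required, ...optional]);
  const unknown = headers.filter(h => !known.has(h));
  return { valid: missing.length === 0, missing, unknown };
}

const result = checkHeaders("id,label,attributes:Notes,color", NODE_REQUIRED, NODE_OPTIONAL);
// result.valid is true; "color" is reported in result.unknown
```

Reporting unknown headers instead of rejecting them would let the importer warn about typos without blocking the import.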

@jdanish
Collaborator

jdanish commented Dec 22, 2021

If we use the nc-multiplex to create the new file does that make it easier? Not sure that fits Kalani’s use case but figured I’d ask.

@benloh
Collaborator Author

benloh commented Dec 22, 2021

If we added import to the regular nc.js startup script, we'd probably have to make a corresponding change with nc-multiplex to make it work. What might be slightly easier would be to figure out a way to initiate a new blank db with a new name, then allow the upload via the web interface (otherwise you'd have to have direct access to the server to upload files there and import files directly, which now that I think about it, sounds like a terrible solution).

@kalanicraig
Collaborator

kalanicraig commented Dec 22, 2021 via email

@kalanicraig
Collaborator

kalanicraig commented Dec 22, 2021 via email

@benloh
Collaborator Author

benloh commented Jan 11, 2022

@kalanicraig This is getting complicated. It sounds like we should at least take a pass through the template editing design before we fully address this. But is this correct:

Main Use Model

  1. Export nodes and edges from an existing database
  2. Modify them externally (in Excel or another application)
  3. Re-import nodes and edges, replacing existing nodes and edges, and adding new ones

Secondary Use Model

  1. Create new nodes and edges in another application or Excel
  2. Import the nodes and edges into an existing database, replacing any existing nodes and edges

Tertiary Use Model

  1. Create new node and edge files, either by exporting them or by creating them in Excel or another application
  2. Create, copy, or edit a template
  3. Create a new database with the new template, node file, and edge file.

In all cases, I imagine it might be useful to have a Dry Run feature where you can test the import and get a report that lists the nodes and edges that would be added or replaced? If you like the Dry Run, you can then press Import to do the actual import?
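
As a rough sketch of what a Dry Run report could compute, assuming records are matched by `id` (the function name `dryRunImport` and the report shape are hypothetical):

```javascript
// Hypothetical Dry Run: classify imported records without touching the database.
function dryRunImport(existing, imported) {
  const existingIds = new Set(existing.map(r => r.id));
  const report = { added: [], replaced: [] };
  for (const rec of imported) {
    if (existingIds.has(rec.id)) {
      report.replaced.push(rec.id); // same id already exists
    } else {
      report.added.push(rec.id);    // id not seen before
    }
  }
  return report;
}

const dbNodes = [{ id: 1, label: "A" }, { id: 2, label: "B" }];
const fileNodes = [{ id: 2, label: "B2" }, { id: 3, label: "C" }];
const report = dryRunImport(dbNodes, fileNodes);
// report.added holds id 3; report.replaced holds id 2; dbNodes is unchanged
```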

@jdanish
Collaborator

jdanish commented Jan 11, 2022

I'll defer to Kalani but wanted to clarify: if we have 2 and 3 from the tertiary model, then really the only difference between the tertiary and secondary would be doing it "all at once" in which case I think we can drop the tertiary? Or am I missing something? Thanks!

@benloh
Collaborator Author

benloh commented Jan 11, 2022

I think the main difference in the tertiary is the addition of the template file and not modifying an existing database.
Part of the reason I'm teasing these all out is to make sure that the workflow is supported by whatever scheme we come up with, especially if they require slightly different methods (e.g. creating a new db), and biasing the design towards one model vs another (e.g. if you only rarely do the tertiary model, then it's OK if it's a little more difficult to do).

@kalanicraig
Collaborator

kalanicraig commented Jan 11, 2022 via email

@benloh
Collaborator Author

benloh commented Jan 17, 2022

@kalanicraig
Just confirming then, that with the emphasis on importing for the Feb 2022 pilots/tests, we want to prioritize:

  1. Importing nodes/edges to an existing database

The node/edge data might be created from scratch or created by first exporting existing nodes/edges. (e.g. Main Use Model and Secondary Use Model, above)

  2. Importing nodes/edges to a NEW database

It sounds like this is not as urgent and can be handled via other existing means for editing templates and creating new databases. (e.g. Tertiary Use Model above). While this would be a nice addition, it requires a substantial amount of rework of both netcreate and nc-multiplex.

@kalanicraig
Collaborator

kalanicraig commented Jan 17, 2022 via email

@benloh
Collaborator Author

benloh commented Mar 5, 2022

@kalanicraig @jdanish One more question about importing:

What kind of restrictions should we place on importing?

  • Should ANYONE be able to import?
  • Or should only admins be able to import?
  • Or should only people who are logged in (on projects requiring login) be able to import?

On a related note, I was assuming that we don't need to place similar restrictions on exporting. But perhaps you do want a way to lock a database too so that people can't arbitrarily export data?

@jdanish
Collaborator

jdanish commented Mar 5, 2022

If login is required to see the network, doesn't that mean you can't get to the import tab without logging in?
Either way, let's say you need to be logged in and be in admin mode to import.
My inclination is to say that if you can see the network, you can export it.

Also, by the way, what are you using to edit the csv files in your testing? The reason I ask is that we did a quick test and Excel appears to cause problems. Literally opening and saving a csv in Excel seems to break the import even without intentionally editing.

@benloh
Collaborator Author

benloh commented Mar 5, 2022

"If login is required to see the network, doesn't that mean you can't get to the import tab without logging in?"

For a project that requires login, yes, you wouldn't see the tab.
But for a project that doesn't require login, you see the tab immediately. However, we can still restrict it so that you still have to log in to be able to import. We would just add an extra level of hiding: e.g. if you're not logged in, the import buttons are grayed out or missing.

So for example, you could allow users to import if they're logged in even if they are NOT admins.

"what are you using to edit the csv files in your testing?"

Export was broken -- it was not properly accounting for missing data, so the fields were getting shifted. It's fixed in the latest branch (import), but there's lots of other stuff that is still broken.

I sometimes open the file directly in VSCode, other times I use Numbers and Excel.
But if you look at the exported data directly in VSCode and count the number of data points vs the headers, you'll probably find that you're missing a few data points -- that's causing the data corruption.
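
The same count can be done mechanically. This is a minimal sketch, not part of NetCreate: `findShiftedRows` is a hypothetical name, and the naive comma split assumes no field contains a quoted comma.

```javascript
// Hypothetical check: list rows whose field count differs from the header's.
function findShiftedRows(csvText) {
  // Accept both \n and \r\n line endings, and skip empty lines.
  const lines = csvText.split(/\r?\n/).filter(line => line.length > 0);
  const expected = lines[0].split(",").length;
  const bad = [];
  lines.slice(1).forEach((line, i) => {
    const count = line.split(",").length;
    // i + 2: row numbers are 1-based and the header is row 1.
    if (count !== expected) bad.push({ row: i + 2, count, expected });
  });
  return bad;
}

const sample = "id,label,notes\n1,A,\n2,B\n";
const bad = findShiftedRows(sample);
// Row 2 ("1,A,") has an empty but present third field, so it passes;
// row 3 ("2,B") is one field short and is reported.
```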

@jdanish
Collaborator

jdanish commented Mar 5, 2022 via email

@benloh
Collaborator Author

benloh commented Mar 5, 2022

Sorry, thinking this through some more, I can see a situation where you don't want any old user doing imports: e.g. it's the first class with 50 students, everyone's working on a shared network. You don't want some wiseass to clobber the whole network.

But later on you might want to allow students to import mini-networks.

This suggests that we add a Template option allowImport. By default, it's false and only admins can import. If it's true, then anyone logged in can import. For a network that does not require login, the Import section is hidden or grayed out.
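
A sketch of that rule, combined with the earlier suggestion that you need to be logged in to import. The object shapes (`template.allowImport`, `session.isLoggedIn`, `session.isAdmin`) and the name `canImport` are assumptions for illustration, not existing NetCreate code.

```javascript
// Hypothetical permission check for the proposed `allowImport` template option.
// Assumed shapes: template = { allowImport }, session = { isLoggedIn, isAdmin }.
function canImport(template, session) {
  if (!session.isLoggedIn) return false;  // anonymous visitors never import
  if (session.isAdmin) return true;       // admins can always import
  return template.allowImport === true;   // missing or false: admins only
}

const student = { isLoggedIn: true, isAdmin: false };
const admin = { isLoggedIn: true, isAdmin: true };
const visitor = { isLoggedIn: false, isAdmin: false };

const lockedTemplate = {};                  // allowImport defaults to false
const openTemplate = { allowImport: true };
```

The UI could then hide or gray out the Import section whenever `canImport` returns false.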

Or maybe I'm overthinking it?

@kalanicraig
Collaborator

kalanicraig commented Mar 5, 2022 via email

3 participants