
MySQL Support Would be Amazing #12

Closed
tfmorris opened this issue Oct 15, 2012 · 9 comments
Labels
- imported from old code repo - Issue imported from Google Code in 2010
- logic - Changes to the data model, to the way operations, expressions work
- persistence - Issues about the way user data is saved on disk
- Priority: Low - Indicates less critical issues that can be dealt with at a later stage
- Theme: UX/Usability - Focuses on issues related to improving the overall user experience and interaction flow
- Type: Feature Request - Identifies requests for new features or enhancements
Milestone
3.0

Comments

@tfmorris
Member

tfmorris commented Oct 15, 2012

Original author: mjlissner (May 10, 2010 17:39:53)

What steps will reproduce the problem?

  1. Try to connect to a MySQL database

As a DB administrator, I have all of my data locked up in a MySQL database. I
COULD export it to CSV, then import it into Gridworks, then do some
cleanup, then export it to CSV, then import it to MySQL again, overwriting
the old data, but that seems very complicated.

Since this works in the browser anyway, it would be amazing if it could be
connected to MySQL databases, and if data manipulation could happen from there.

This would unlock a TON of new data, though I'm unsure how the program
would scale to the quantities of information that would be pulled in.

Original issue: http://code.google.com/p/google-refine/issues/detail?id=12
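
For context, the round trip being described would look something like the sketch below. This is a minimal illustration only, assuming a local MySQL instance, the mysql-connector-j driver on the classpath, and a hypothetical `contacts` table; it is not anything Gridworks actually ships.

```java
import java.io.PrintWriter;
import java.sql.*;

// Export half of the CSV round trip: dump a table to a file that can
// then be imported into Gridworks. Connection details and the table
// name are placeholders.
public class ExportToCsv {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM contacts");
             PrintWriter out = new PrintWriter("contacts.csv")) {
            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();
            for (int i = 1; i <= cols; i++) {
                out.print(md.getColumnName(i));
                out.print(i < cols ? "," : "\n");
            }
            while (rs.next()) {
                for (int i = 1; i <= cols; i++) {
                    out.print(rs.getString(i)); // no quoting/escaping: sketch only
                    out.print(i < cols ? "," : "\n");
                }
            }
        }
    }
}
```

The reverse leg (loading the cleaned CSV back into MySQL) is what makes the manual workflow feel so heavyweight, which is the pain point behind this request.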

@tfmorris
Member Author

From tfmorris on May 10, 2010 19:26:52:
That sounds more like an enhancement request than a defect report.

To generalize things a bit, support for a standard data access API would allow people
to plug in multiple DB backends.
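
To sketch what such a standard data access API might look like (all names below are illustrative, not actual Gridworks code), a thin interface over JDBC would let each database plug in behind the same contract:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical pluggable-backend contract: the UI talks only to
// DatabaseBackend, and each database ships its own implementation.
interface DatabaseBackend {
    Connection connect() throws SQLException;
    String displayName();
}

class MySqlBackend implements DatabaseBackend {
    private final String url, user, password;

    MySqlBackend(String url, String user, String password) {
        this.url = url;
        this.user = user;
        this.password = password;
    }

    public Connection connect() throws SQLException {
        return DriverManager.getConnection(url, user, password);
    }

    public String displayName() { return "MySQL"; }
}
// A PostgresBackend, OracleBackend, etc. would implement the same
// interface, so nothing above this layer cares which engine is used.
```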

@tfmorris
Member Author

From dfhu...@gmail.com on May 10, 2010 19:32:58:
One of the major challenges here is how to support undo/redo when changes can get into
the back-end database without going through Gridworks. Another major challenge is where
to store metadata (such as reconciliation records) that is specific to Gridworks and
not native to any existing back-end database.
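
One illustrative answer to the second challenge (nothing below is a proposed schema; every table and column name is invented) is a side table keyed by the source table's primary key, so Gridworks-specific records live alongside, but never inside, the user's own tables:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Invented sketch: a side table that stores reconciliation records
// and other Gridworks-only metadata per (table, row, column).
public class MetadataSideTable {
    public static void create(String url, String user, String pass)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, pass);
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "CREATE TABLE IF NOT EXISTS gridworks_metadata ("
                + " source_table VARCHAR(64) NOT NULL,"
                + " source_pk    BIGINT      NOT NULL,"
                + " column_name  VARCHAR(64) NOT NULL,"
                + " recon_json   TEXT,"                   // reconciliation record
                + " PRIMARY KEY (source_table, source_pk, column_name))");
        }
    }
}
```

This does nothing for the first challenge (out-of-band changes to the backing tables), which is the harder of the two.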

@tfmorris
Member Author

From mjlissner on May 10, 2010 19:36:53:
Hmm... I presume a new project would have to be created when doing this, and that
could hold the metadata.

As for undo/redo, maybe adding a commit button would make that easier. So changes can
be made to a snapshot of the data, but then no changes are made to the DB itself
until commit is pressed?
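
A minimal sketch of that commit-button model, with hypothetical names throughout: edits accumulate against the in-memory snapshot, and nothing touches the database until commit() replays them inside a single JDBC transaction, which also gives a natural rollback point.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical deferred-commit buffer: edit() only records the change;
// commit() applies all pending edits atomically.
public class DeferredCommit {
    record CellEdit(long rowId, String column, String newValue) {}

    private final List<CellEdit> pending = new ArrayList<>();

    public void edit(long rowId, String column, String newValue) {
        pending.add(new CellEdit(rowId, column, newValue)); // snapshot only
    }

    public void commit(Connection conn, String table) throws SQLException {
        conn.setAutoCommit(false);
        try {
            for (CellEdit e : pending) {
                // Column names cannot be bind parameters; assume they were
                // validated against the table's metadata beforehand.
                String sql = "UPDATE " + table + " SET " + e.column()
                        + " = ? WHERE id = ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setString(1, e.newValue());
                    ps.setLong(2, e.rowId());
                    ps.executeUpdate();
                }
            }
            conn.commit();
            pending.clear();
        } catch (SQLException ex) {
            conn.rollback(); // leave the database untouched on failure
            throw ex;
        }
    }
}
```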

@tfmorris
Member Author

From iainsproat on May 10, 2010 19:46:01:
One option is to keep it synchronous with the database (effectively using the database
as the backend), but you then lose undo/redo support and reconciliation with Freebase
(unless you add suitable tables to the database). You're then using Gridworks for just
the facets, really.

The other way, as you suggest, is to make it similar to a disconnected session with a
commit transaction. The issue is then ensuring consistency between the remote
database and the snapshot held in Gridworks; merging the two back together would be a
challenge. You'd also need to hold keys from the remote database in Gridworks for
updating records.

Frameworks such as Hibernate or Spring would be worth considering for their database
abstraction layers.
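
The consistency problem can be made concrete with optimistic write-back: hold the remote primary key in the snapshot, and include the original value in the UPDATE's WHERE clause, so a row that changed behind Gridworks' back is detected rather than silently overwritten. A sketch with invented table and column names:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Invented example: the update succeeds only if the row still holds the
// value the snapshot was taken from; otherwise the caller must re-merge.
public class OptimisticWriteBack {
    public static boolean updateName(Connection conn, long pk,
                                     String original, String edited)
            throws SQLException {
        String sql = "UPDATE contacts SET name = ? WHERE id = ? AND name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, edited);
            ps.setLong(2, pk);
            ps.setString(3, original);
            return ps.executeUpdate() == 1; // false -> the row drifted
        }
    }
}
```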

@tfmorris
Member Author

From thadguidry on May 10, 2010 20:05:13:
I have read that good ORMs such as Hibernate support ordered lists and other
features incorporated into JPA 2.0 as of Dec 2009.
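
For reference, the JPA 2.0 feature in question is @OrderColumn, which has the provider maintain a dedicated index column so list order survives the round trip to the database. A minimal sketch with invented entity names:

```java
import javax.persistence.*;
import java.util.List;

@Entity
class DataRow {
    @Id @GeneratedValue Long id;
    String value;
}

@Entity
public class ProjectSnapshot {
    @Id @GeneratedValue Long id;

    @OneToMany(cascade = CascadeType.ALL)
    @OrderColumn(name = "row_index") // new in JPA 2.0 (Dec 2009)
    List<DataRow> rows;
}
```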

@tfmorris
Member Author

From 7...@ericjarvies.com on November 11, 2010 07:08:54:
Initially, what I believe is most important and useful is simply having the ability to direct-connect to MySQL/PostgreSQL/etc. data sources from the get-go (when creating a new project), and, also initially, being able to set and save multiple data sources, and multiple DBs within those sources. When the user creates a new project, this would in effect create a disconnected session (as mentioned above), wherein the data is treated just as imported data is now. Adding 'commit' features can come next, followed by more advanced connectivity options, until synchronous functionality is in place to one degree or another. But for now, it would certainly be nice to add data sources and pull data from those sources!

Eric Jarvies
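
A sketch of the "set and save multiple data sources" part, with hypothetical names; a small registry like this is roughly all the new-project dialog would need to read from:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Invented sketch: named connection entries the user can save once
// and pick from when creating a project.
public class DataSourceRegistry {
    public record DataSource(String jdbcUrl, String user, String password) {}

    private final Map<String, DataSource> sources = new LinkedHashMap<>();

    public void save(String name, DataSource ds) { sources.put(name, ds); }
    public DataSource get(String name)           { return sources.get(name); }
}
```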

@tfmorris
Member Author

From thadguidry on November 11, 2010 14:36:42:
Eric, a cleanup tool using industry best practices is best used offline within a process. There are existing ETL tools that easily consume from MySQL/PostgreSQL etc. and offer excellent flow control, exporting, and connectivity to produce delimited files with relative ease. Talend is one such product that I use along with Google Refine. Talend (Open Source Community edition) does my scheduled daily gathering from 3 databases (MySQL and Oracle) and then dumps a customized TSV file that I open with Google Refine for further analysis and sometimes cleanup. There are other tools that provide ETL (Extract, Transform, Load) like Talend. I'm not 100% sure the team really feels the need to copy that and flesh out a full ETL platform, since Talend and other tools fill that need very nicely.

Incidentally, using Google Refine and a bit of clustering, I was able to find a few loopholes in our data storage processing that we fixed with a few stored procedures within Oracle. Google Refine was instrumental as a discovery tool for that. Talend does have an MDM component but does not have the interactivity of a discovery tool like Google Refine.

If you do NOT need a daily process, but only a one-time cleanup, just dumping with MySQL or PostgreSQL would offer about the same, and depending on the size of the database takes only seconds to minutes. Dumping also avoids potential live database locks that Refine, if it supported direct connectivity, might have to tip-toe around, depending on the team's chosen implementation.

If you have large database size needs, give Talend or another ETL tool a try with Google Refine, and you'll soon see the powerful left-right combination. I'm not sure how far the team will ultimately decide to absorb direct connectivity support within Google Refine. I'd like to hear other opinions as well on this Issue-12.

@tfmorris
Member Author

From Chris.Go...@gmail.com on April 14, 2011 14:23:04:
I agree with thadguidry. Let the Refine team focus on bringing data quality issues to light, and let Talend focus on data quality (they do have a data profiling tool that can identify some of this stuff). Talend is what we use for basic ETL. You could write some SQL to get the data out of MySQL anyway. And if you run data quality analysis directly against an entire large table on the DB server, your DBA might become angry too.

C

@thadguidry
Member

#1277 is being worked on to support this issue!

@thadguidry thadguidry added this to the 3.0 milestone May 29, 2018