
MySQL Support Would be Amazing #12

Closed
tfmorris opened this issue Oct 15, 2012 · 9 comments
Labels
- imported from old code repo - Issue imported from Google Code in 2010
- logic - Changes to the data model, to the way operations, expressions work
- persistence - Issues about the way user data is saved on disk
- Priority: Low - Indicates less critical issues that can be dealt with at a later stage
- Theme: UX/Usability - Focuses on issues related to improving the overall user experience and interaction flow
- Type: Feature Request - Identifies requests for new features or enhancements
Milestone
3.0

Comments

@tfmorris
Member

tfmorris commented Oct 15, 2012

Original author: mjlissner (May 10, 2010 17:39:53)

What steps will reproduce the problem?

  1. Try to connect to a MySQL database

As a DB administrator, I have all of my data locked up in a MySQL database. I
COULD export it to CSV, then import it into Gridworks, then do some
cleanup, then export it to CSV, then import it to MySQL again, overwriting
the old data, but that seems very complicated.

Since this works in the browser anyway, it would be amazing if it could be
connected to MySQL databases, and if data manipulation could happen from there.

This would unlock a TON of new data, though I'm unsure how the program
would scale to the quantities of information that would be pulled in.

Original issue: http://code.google.com/p/google-refine/issues/detail?id=12
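
For context, the round trip being described would look something like the sketch below. This is a minimal illustration only, assuming a local MySQL instance, the mysql-connector-j driver on the classpath, and a hypothetical `contacts` table; it is not anything Gridworks actually ships.

```java
import java.io.PrintWriter;
import java.sql.*;

// Export half of the CSV round trip: dump a table to a file that can
// then be imported into Gridworks. Connection details and the table
// name are placeholders.
public class ExportToCsv {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM contacts");
             PrintWriter out = new PrintWriter("contacts.csv")) {
            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();
            for (int i = 1; i <= cols; i++) {
                out.print(md.getColumnName(i));
                out.print(i < cols ? "," : "\n");
            }
            while (rs.next()) {
                for (int i = 1; i <= cols; i++) {
                    out.print(rs.getString(i)); // no quoting/escaping: sketch only
                    out.print(i < cols ? "," : "\n");
                }
            }
        }
    }
}
```

The reverse leg (loading the cleaned CSV back into MySQL) is what makes the manual workflow feel so heavyweight, which is the pain point behind this request.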

@tfmorris
Member Author

From tfmorris on May 10, 2010 19:26:52:
That sounds more like an enhancement request than a defect report.

To generalize things a bit, support for a standard data access API would allow people
to plug in multiple DB backends.
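
To sketch what such a standard data access API might look like (all names below are illustrative, not actual Gridworks code), a thin interface over JDBC would let each database plug in behind the same contract:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical pluggable-backend contract: the UI talks only to
// DatabaseBackend, and each database ships its own implementation.
interface DatabaseBackend {
    Connection connect() throws SQLException;
    String displayName();
}

class MySqlBackend implements DatabaseBackend {
    private final String url, user, password;

    MySqlBackend(String url, String user, String password) {
        this.url = url;
        this.user = user;
        this.password = password;
    }

    public Connection connect() throws SQLException {
        return DriverManager.getConnection(url, user, password);
    }

    public String displayName() { return "MySQL"; }
}
// A PostgresBackend, OracleBackend, etc. would implement the same
// interface, so nothing above this layer cares which engine is used.
```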

@tfmorris
Member Author

From dfhu...@gmail.com on May 10, 2010 19:32:58:
One of the major challenges here is how to support undo/redo when changes can get into
the back-end database without going through Gridworks. Another major challenge is where
to store metadata (such as reconciliation records) that is specific to Gridworks and
not native to any existing back-end database.
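
One illustrative answer to the second challenge (nothing below is a proposed schema; every table and column name is invented) is a side table keyed by the source table's primary key, so Gridworks-specific records live alongside, but never inside, the user's own tables:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Invented sketch: a side table that stores reconciliation records
// and other Gridworks-only metadata per (table, row, column).
public class MetadataSideTable {
    public static void create(String url, String user, String pass)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, pass);
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "CREATE TABLE IF NOT EXISTS gridworks_metadata ("
                + " source_table VARCHAR(64) NOT NULL,"
                + " source_pk    BIGINT      NOT NULL,"
                + " column_name  VARCHAR(64) NOT NULL,"
                + " recon_json   TEXT,"                   // reconciliation record
                + " PRIMARY KEY (source_table, source_pk, column_name))");
        }
    }
}
```

This does nothing for the first challenge (out-of-band changes to the backing tables), which is the harder of the two.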

@tfmorris
Member Author

From mjlissner on May 10, 2010 19:36:53:
Hmm... I presume a new project would have to be created when doing this, and that
could hold the metadata.

As for undo/redo, maybe adding a commit button would make that easier. So changes can
be made to a snapshot of the data, but then no changes are made to the DB itself
until commit is pressed?
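
A minimal sketch of that commit-button model, with hypothetical names throughout: edits accumulate against the in-memory snapshot, and nothing touches the database until commit() replays them inside a single JDBC transaction, which also gives a natural rollback point.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical deferred-commit buffer: edit() only records the change;
// commit() applies all pending edits atomically.
public class DeferredCommit {
    record CellEdit(long rowId, String column, String newValue) {}

    private final List<CellEdit> pending = new ArrayList<>();

    public void edit(long rowId, String column, String newValue) {
        pending.add(new CellEdit(rowId, column, newValue)); // snapshot only
    }

    public void commit(Connection conn, String table) throws SQLException {
        conn.setAutoCommit(false);
        try {
            for (CellEdit e : pending) {
                // Column names cannot be bind parameters; assume they were
                // validated against the table's metadata beforehand.
                String sql = "UPDATE " + table + " SET " + e.column()
                        + " = ? WHERE id = ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setString(1, e.newValue());
                    ps.setLong(2, e.rowId());
                    ps.executeUpdate();
                }
            }
            conn.commit();
            pending.clear();
        } catch (SQLException ex) {
            conn.rollback(); // leave the database untouched on failure
            throw ex;
        }
    }
}
```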

@tfmorris
Member Author

From iainsproat on May 10, 2010 19:46:01:
One option is to keep it synchronous with the database (effectively using the database
as the backend), but you then lose undo/redo support and reconciliation with Freebase
(unless you add suitable tables to the database). You're then using Gridworks for just
the facets, really.

The other way, as you suggest, is to make it similar to a disconnected session with a
commit transaction. The issue is then ensuring consistency between the remote
database and the snapshot held in Gridworks; merging the two back together would be a
challenge. You'd also need to hold keys from the remote database in Gridworks for
updating records.

Frameworks such as Hibernate or Spring would be worth considering for their database
abstraction layers.
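
The consistency problem can be made concrete with optimistic write-back: hold the remote primary key in the snapshot, and include the original value in the UPDATE's WHERE clause, so a row that changed behind Gridworks' back is detected rather than silently overwritten. A sketch with invented table and column names:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Invented example: the update succeeds only if the row still holds the
// value the snapshot was taken from; otherwise the caller must re-merge.
public class OptimisticWriteBack {
    public static boolean updateName(Connection conn, long pk,
                                     String original, String edited)
            throws SQLException {
        String sql = "UPDATE contacts SET name = ? WHERE id = ? AND name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, edited);
            ps.setLong(2, pk);
            ps.setString(3, original);
            return ps.executeUpdate() == 1; // false -> the row drifted
        }
    }
}
```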

@tfmorris
Member Author

From thadguidry on May 10, 2010 20:05:13:
I have read that good ORMs such as Hibernate support ordered lists and other
features incorporated into JPA 2.0 as of Dec 2009.
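
For reference, the JPA 2.0 feature in question is @OrderColumn, which has the provider maintain a dedicated index column so list order survives the round trip to the database. A minimal sketch with invented entity names:

```java
import javax.persistence.*;
import java.util.List;

@Entity
class DataRow {
    @Id @GeneratedValue Long id;
    String value;
}

@Entity
public class ProjectSnapshot {
    @Id @GeneratedValue Long id;

    @OneToMany(cascade = CascadeType.ALL)
    @OrderColumn(name = "row_index") // new in JPA 2.0 (Dec 2009)
    List<DataRow> rows;
}
```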

@tfmorris
Member Author

From 7...@ericjarvies.com on November 11, 2010 07:08:54:
Initially, what I believe is most important and useful is simply having the ability to direct-connect to MySQL/PostgreSQL/etc. data sources from the get-go (when creating a new project), and, also initially, being able to set and save multiple data sources, and multiple DBs within those sources. When the user creates a new project, this would in effect create a disconnected session (as mentioned above), wherein the data is treated just as imported data is now. Adding 'commit' features can come next, followed by more advanced connectivity options, until synchronous functionality is in place to one degree or another. But for now, it would certainly be nice to add data sources and pull data from those sources!

Eric Jarvies
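
A sketch of the "set and save multiple data sources" part, with hypothetical names; a small registry like this is roughly all the new-project dialog would need to read from:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Invented sketch: named connection entries the user can save once
// and pick from when creating a project.
public class DataSourceRegistry {
    public record DataSource(String jdbcUrl, String user, String password) {}

    private final Map<String, DataSource> sources = new LinkedHashMap<>();

    public void save(String name, DataSource ds) { sources.put(name, ds); }
    public DataSource get(String name)           { return sources.get(name); }
}
```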

@tfmorris
Member Author

From thadguidry on November 11, 2010 14:36:42:
Eric, a cleanup tool using industry best practices is best used offline within a process. There are existing ETL tools that easily consume from MySQL/PostgreSQL etc. and offer excellent flow control, exporting, and connectivity to produce delimited files with relative ease. Talend is one such product that I use along with Google Refine. Talend (Open Source Community edition) does my scheduled daily gathering from 3 databases (MySQL and Oracle) and then dumps a customized TSV file that I open with Google Refine for further analysis and sometimes cleanup. There are other tools that provide ETL (Extract, Transform, Load) like Talend. I'm not 100% sure the team really feels the need to copy that and flesh out a full ETL platform, since Talend and other tools fill that need very nicely.

Incidentally, using Google Refine and a bit of clustering, I was able to find a few loopholes in our data storage processing that we fixed with a few stored procedures within Oracle. Google Refine was instrumental as a discovery tool for that. Talend does have an MDM component but does not have the interactivity of a discovery tool like Google Refine.

If you do NOT need a daily process, but only a one-time cleanup, just dumping with MySQL or PostgreSQL would offer about the same, and depending on the size of the database takes only seconds to minutes. Dumping also avoids potential live database locks that Refine, if it supported direct connectivity, might have to tip-toe around, depending on the team's chosen implementation.

If you have large database size needs, give Talend or another ETL tool a try with Google Refine, and you'll soon see the powerful left-right combination. I'm not sure how far the team will ultimately decide to absorb direct connectivity support within Google Refine. I'd like to hear other opinions as well on this Issue-12.

@tfmorris
Member Author

From Chris.Go...@gmail.com on April 14, 2011 14:23:04:
I agree with thadguidry. Let the Refine team focus on bringing data quality issues to light, and let Talend focus on data quality (they do have a data profiling tool that can identify some of this stuff). Talend is what we use for basic ETL. You could write some SQL to get the data out of MySQL anyway. And if you run data quality analysis directly against an entire large table on the DB server, your DBA might become angry too.

C

@thadguidry
Member

#1277 is being worked on to support this issue!

@thadguidry thadguidry added this to the 3.0 milestone May 29, 2018