
Enhancement: DataTable or IDataReader support for Bulk Insert #737

Closed
Grauenwolf opened this issue Nov 9, 2019 · 25 comments


@Grauenwolf

It would be really nice if we could just pass in a DataTable or IDataReader like we do with SQL Server and let the driver handle the CSV conversion and stream management.

This way we don't have people guessing about things like how to encode strings, dates, and nulls.

@Grauenwolf
Author

To do it 'right' I'm thinking a subclass of IO.Stream is needed so that it can consume the IDataReader at the same time MySqlBulkLoader consumes the stream. But I don't know the implementation details well enough to say that with certainty.

In my own code I just dump everything into a massive MemoryStream, which isn't great for performance.
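Something along these lines is what I have in mind; a rough sketch only (DataReaderStream is a hypothetical name, and the tab-separated formatting here is naive, not MySqlBulkLoader's actual escaping rules):

using System;
using System.Data;
using System.IO;
using System.Text;

// Hypothetical adapter: renders IDataReader rows as tab-separated lines on demand,
// so the consumer pulls data row by row instead of the producer materializing
// everything in a MemoryStream first.
public sealed class DataReaderStream : Stream
{
	private readonly IDataReader m_reader;
	private byte[] m_buffer = Array.Empty<byte>();
	private int m_offset;

	public DataReaderStream(IDataReader reader) => m_reader = reader;

	public override int Read(byte[] buffer, int offset, int count)
	{
		// refill the row buffer once the previous row has been fully consumed
		if (m_offset == m_buffer.Length)
		{
			if (!m_reader.Read())
				return 0; // no more rows: end of stream

			var row = new StringBuilder();
			for (var i = 0; i < m_reader.FieldCount; i++)
			{
				if (i > 0)
					row.Append('\t');
				// naive conversion; real code must escape \t, \n, \\ and encode NULL as \N
				row.Append(m_reader.IsDBNull(i) ? "\\N" : Convert.ToString(m_reader.GetValue(i)));
			}
			row.Append('\n');
			m_buffer = Encoding.UTF8.GetBytes(row.ToString());
			m_offset = 0;
		}

		var bytesToCopy = Math.Min(count, m_buffer.Length - m_offset);
		Array.Copy(m_buffer, m_offset, buffer, offset, bytesToCopy);
		m_offset += bytesToCopy;
		return bytesToCopy;
	}

	protected override void Dispose(bool disposing)
	{
		if (disposing)
			m_reader.Dispose();
		base.Dispose(disposing);
	}

	public override bool CanRead => true;
	public override bool CanSeek => false;
	public override bool CanWrite => false;
	public override long Length => throw new NotSupportedException();
	public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
	public override void Flush() { }
	public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
	public override void SetLength(long value) => throw new NotSupportedException();
	public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}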

@bgrainger
Member

Are you talking about the SqlBulkCopy.WriteToServer method?

It seems like a helper class could be written that sets up a MySqlBulkLoader and sets the SourceStream property to a custom Stream implementation that reads from a DataTable (or IDataReader, but there are better options for inserting directly from one table to another if that IDataReader is from the same database).
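For illustration, the wiring could look roughly like this (using the hypothetical DataReaderStream adapter sketched above; the terminators must match whatever format the adapter emits):

// hypothetical helper: feed a DataTable to MySqlBulkLoader through an
// IDataReader-backed stream (DataTableReader implements IDataReader)
using var stream = new DataReaderStream(dataTable.CreateDataReader());
var bulkLoader = new MySqlBulkLoader(connection)
{
	TableName = "destination_table",
	SourceStream = stream,
	FieldTerminator = "\t",
	LineTerminator = "\n",
	Local = true,
};
bulkLoader.Load();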

@Grauenwolf
Author

Essentially yes.

As for IDataReader, it makes for a useful abstraction. For SQL Server I'll wrap a collection of objects, an enumeration of objects, or a file parser in an IDataReader and stream it in instead of building a fat DataTable.

In fact, DataTable exposes an IDataReader so we'd only need one implementation.
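For reference, the built-in bridge is DataTable.CreateDataReader():

// DataTableReader implements IDataReader, so one implementation covers both inputs
IDataReader reader = dataTable.CreateDataReader();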

@Grauenwolf
Author

Whether it is part of MySqlBulkLoader or a separate extension method doesn't matter to me. Though it should be easily discoverable.

@bgrainger
Member

I've developed a prototype version of this at e0a2de5. Current API (subject to change):

using var connection = new MySqlConnection("...;AllowLoadLocalInfile=true");
var bulkLoader = new MySqlBulkLoader(connection, dataTableOrDataReader)
{
    TableName = "destination_table",
};
bulkLoader.Load();

@Grauenwolf
Author

That's not at all like I expected it to look. It's certainly a lot cleaner than what I could have done from outside the library.

@bgrainger
Member

Updated API proposal: 98db02d#diff-e5e5c76f457fba4a79beedb1995a9f5f

using (var connection = new MySqlConnection("...;AllowLoadLocalInfile=True"))
{
	await connection.OpenAsync();
	var bulkCopy = new MySqlBulkCopy(connection);
	bulkCopy.DestinationTableName = "some_table_name";
	await bulkCopy.WriteToServerAsync(dataTable);
}

@Grauenwolf how important/useful to you would it be to have the capability to load from IEnumerable<object[]> (or IAsyncEnumerable<object[]>)?

@Grauenwolf
Author

Do you mean just passing in stuff like new object[] { "Tom", "Jones", 15, true }?

If so, I don't need it personally, but I think it would be useful to others.

@bgrainger
Member

Yes, I was just referring to the latter (which would be a simple extension to the current code); I didn't have any plans to add reflection, matching property names with field names, etc.
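In the meantime, callers can bridge that themselves by copying object[] rows into a DataTable; a minimal sketch (the column names and types here are made up for illustration):

// hypothetical workaround: materialize IEnumerable<object[]> rows into a DataTable
// for the existing WriteToServer(DataTable) overload
var table = new DataTable();
table.Columns.Add("first_name", typeof(string));
table.Columns.Add("last_name", typeof(string));
table.Columns.Add("age", typeof(int));
table.Columns.Add("is_active", typeof(bool));
foreach (var row in rows) // rows is an IEnumerable<object[]>
	table.Rows.Add(row);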

@dennis-gr
Contributor

I've been testing this and it's working nicely except for text columns that contain newlines:

Testcase:

CREATE TABLE `test` (
  `ID` INT NOT NULL,
  `Content` TEXT NOT NULL,
  `AnotherID` INT NOT NULL,
  PRIMARY KEY (`ID`));

  private const string ContentWithNewLine = @"foo
                                              bar";
  [Fact]
  public void MySqlBulkCopy_With_NewLines()
  {
      var table = new DataTable();
      table.Columns.AddRange(new[] { new DataColumn("ID", typeof(int)), new DataColumn("Content", typeof(string)), new DataColumn("AnotherID", typeof(int)) });
      
      table.Rows.Add(1, "text", 2);
      table.Rows.Add(2, ContentWithNewLine, 3);
      table.Rows.Add(3, "just some text", 4);

      using (var connection = new MySqlConnection("...;AllowLoadLocalInfile=True"))
      {
          connection.Open();

          var bulkCopy = new MySqlBulkCopy(connection) { DestinationTableName = "test" };
          bulkCopy.WriteToServer(table);
      }
  }   

Produces this:

mysql> select * from test;
+----+----------------+-----------+
| ID | Content        | AnotherID |
+----+----------------+-----------+
|  0 | 3              |         0 |
|  1 | text           |         2 |
           |         0 |
|  3 | just some text |         4 |
+----+----------------+-----------+
4 rows in set (0.00 sec)

@bgrainger
Member

it's working nicely except for text columns that contain newlines

The string-escaping routine is missing a check for \n:

if (value[index] == '\t' || value[index] == '\\')
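Roughly speaking, the corrected check would also need to cover '\n' (and likely '\r'), e.g.:

if (value[index] == '\t' || value[index] == '\n' || value[index] == '\r' || value[index] == '\\')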

Thanks for pointing this out; a fix will be in the next release.

@bgrainger
Member

@dennis-gr 0.62.0-beta3 is available and should fix this problem; thanks for reporting it!

@Grauenwolf
Author

Thank you both for this!

@dennis-gr
Contributor

Thanks for the quick fix, working as expected now.

@mikeTWC1984

There are a few more useful items missing (compared to SqlBulkCopy):

  • A BatchSize property (is there any default value?)
  • A total rows copied count (if a data reader is used). The original SqlBulkCopy has a private _rowsCopied property which can be retrieved through reflection. There is also a NotifyAfter property to fire events after every N rows are copied.
  • The original WriteToServer also accepts DataRow[], which could be the product of filtering a DataTable.

@bgrainger
Member

BatchSize

For SqlClient, this is documented as "Number of rows in each batch. At the end of each batch, the rows in the batch are sent to the server. ... Zero (the default) indicates that each WriteToServer operation is a single batch."

The current implementation streams all the rows to the MySQL server, although there are implicit batches formed every 16MiB since that's the maximum size of a single MySQL network packet.

Is there any specific outcome you want to accomplish by setting a BatchSize?

private property which can be retrieved through reflection.

😱

Total rows copied count

It seems reasonable that a MySqlBulkCopy.RowsCopied read-only property could be added (that would be populated after WriteToServer completes).

NotifyAfter

Also seems reasonable to implement (along with the SqlRowsCopied event).

also accept DataRow[]

Accepting IReadOnlyList<DataRow> (or even IEnumerable<DataRow>?) seems like a reasonable addition.
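Put together, a caller using those proposed additions might look like this (all names are provisional, sketched from the SqlBulkCopy equivalents):

// hypothetical shape for the proposed API; names are provisional
var bulkCopy = new MySqlBulkCopy(connection)
{
	DestinationTableName = "some_table_name",
	NotifyAfter = 10_000, // raise an event after every 10,000 rows
};
bulkCopy.MySqlRowsCopied += (sender, e) => Console.WriteLine($"{e.RowsCopied} rows copied so far");
await bulkCopy.WriteToServerAsync(dataTable);
Console.WriteLine($"finished; {bulkCopy.RowsCopied} rows copied in total");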

@Grauenwolf
Author

BatchSize is important for SQL Server because it has a measurable impact on performance.

If that's not the case for MySQL, I think it's safe to omit it.

@mikeTWC1984

I think BatchSize in SQL Server will limit memory/network usage. I guess the 16MB packet limit would play a similar role, but is it configurable?

One more item to consider is the ColumnMappings property.
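For reference, in SqlBulkCopy that property maps source columns to destination columns when names or ordering differ:

// SqlBulkCopy reference usage: explicit source-to-destination column mapping
bulkCopy.ColumnMappings.Add("SourceColumn", "DestinationColumn");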

@dennis-gr
Contributor

dennis-gr commented Dec 18, 2019

One more issue that I've run into: If the data size exceeds the maximum network packet size of 16 MB, the copy operation fails:

MySql.Data.MySqlClient.MySqlException (0x80004005): Error during LOAD DATA LOCAL INFILE
System.ArgumentException: The output byte buffer is too small to contain the encoded data, encoding 'Unicode (UTF-8)' fallback 'System.Text.EncoderReplacementFallback'. (Parameter 'bytes')
   at System.Text.Encoding.ThrowBytesOverflow()
   at System.Text.Encoding.ThrowBytesOverflow(EncoderNLS encoder, Boolean nothingEncoded)
   at System.Text.Encoding.GetBytesWithFallback(ReadOnlySpan`1 chars, Int32 originalCharsLength, Span`1 bytes, Int32 originalBytesLength, EncoderNLS encoder)
   at System.Text.Encoding.GetBytesWithFallback(Char* pOriginalChars, Int32 originalCharCount, Byte* pOriginalBytes, Int32 originalByteCount, Int32 charsConsumedSoFar, Int32 bytesWrittenSoFar)
   at System.Text.UTF8Encoding.GetBytes(ReadOnlySpan`1 chars, Span`1 bytes)
   at MySql.Data.MySqlClient.MySqlBulkCopy.<SendDataReaderAsync>g__WriteString|15_1(String value, Span`1 output, Int32& bytesWritten) in C:\projects\mysqlconnector\src\MySqlConnector\MySql.Data.MySqlClient\MySqlBulkCopy.cs:line 355
   at MySql.Data.MySqlClient.MySqlBulkCopy.<SendDataReaderAsync>g__WriteValue|15_0(MySqlConnection connection, Object value, Span`1 output, Int32& bytesWritten) in C:\projects\mysqlconnector\src\MySqlConnector\MySql.Data.MySqlClient\MySqlBulkCopy.cs:line 212
   at MySql.Data.MySqlClient.MySqlBulkCopy.SendDataReaderAsync(MySqlConnection connection, IDataReader dataReader, IOBehavior ioBehavior, CancellationToken cancellationToken)

Test to reproduce:

CREATE TABLE `test` (
`ID` INT NOT NULL,
`Content` TEXT NOT NULL,
PRIMARY KEY (`ID`));

[Fact]
public void Data_Size_Bigger_16MB()
{
    var table = new DataTable();
    table.Columns.AddRange(new[] { new DataColumn("ID", typeof(int)), new DataColumn("Content", typeof(string)) });
    
    var stringBuilder = new StringBuilder();
    for (int i = 0; i < Math.Pow(2, 24); i++)
        stringBuilder.Append(i);
    
    table.Rows.Add(1, stringBuilder.ToString());
    
    using (var connection = new MySqlConnection("...;AllowLoadLocalInfile=True"))
    {
        connection.Open();
    
        var bulkCopy = new MySqlBulkCopy(connection) { DestinationTableName = "test" };
        bulkCopy.WriteToServer(table);
    }
}

@bgrainger
Member

One more issue that I've run into: If the data size exceeds the maximum network packet size of 16 MB, the copy operation fails

Yes; this is not currently supported.

@Grauenwolf
Author

I've been giving this some more thought. If we support batch size, then we could also support notifications.

In SQL Server, you can ask it to trigger an event after every N records are uploaded.

ref: https://docs.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlbulkcopy.sqlrowscopied?view=netframework-4.8
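For reference, the SqlBulkCopy pattern looks like this (connection here being a SqlConnection):

// SqlBulkCopy raises SqlRowsCopied after every NotifyAfter rows
using var bulkCopy = new SqlBulkCopy(connection)
{
	DestinationTableName = "some_table_name",
	NotifyAfter = 5000,
};
bulkCopy.SqlRowsCopied += (sender, e) => Console.WriteLine($"{e.RowsCopied} rows uploaded");
bulkCopy.WriteToServer(dataTable);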

@Grauenwolf
Author

Quick question: will this support transactions? I can't find any information on it one way or the other. (Back in 2006 it would silently commit any pending transactions, but that was a long time ago.)

@bgrainger
Member

Transaction support would be up to MySQL Server. I haven't found any definitive documentation about it, so I would assume LOAD DATA can be used inside a transaction (and rolled back, etc.).
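One way to probe it, assuming MySqlBulkCopy picks up the connection's active transaction (an assumption, not something I've verified):

// sketch: check whether the bulk load rolls back with the transaction
using var transaction = connection.BeginTransaction();
var bulkCopy = new MySqlBulkCopy(connection) { DestinationTableName = "test" };
bulkCopy.WriteToServer(table);
transaction.Rollback(); // if LOAD DATA is transactional, "test" should remain empty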

@bgrainger
Member

If the data size exceeds the maximum network packet size of 16 MB, the copy operation fails

You cannot currently send a single string that is longer than 16MiB (this is a known limitation). However, if you were reporting that any string that crosses the 16 MiB packet boundary fails to be sent, this is indeed a bug that should be fixed.

@bgrainger
Member

"v1" of this feature is implemented in 0.62.0 (documentation: https://mysqlconnector.net/api/mysql-bulk-copy/); please open new issues for bugs or feature requests to MySqlBulkCopy.
