
INSERT statements get lost #262

Closed
stolpovsky opened this issue Aug 3, 2017 · 16 comments

@stolpovsky

stolpovsky commented Aug 3, 2017

Environment

  • Python: 3.5.1
  • pyodbc: 4.0.17
  • OS: Windows 7 SP1 64-bit
  • DB: MS SQL Server 2012
  • driver: ODBC driver 11 for sql server

Issue

I am executing around 1289 INSERT statements in a single batch, but only about half of them are reflected in the target table. No exception is raised; part of the input is silently ignored. The combined SQL string is 427118 characters long.

        # insert_statements is a list of individual INSERT statements
        with pyodbc.connect(connection_string) as con:
            cursor = con.cursor()
            cursor.execute('\n'.join(insert_statements))
            cursor.commit()

Interestingly, the number of INSERT statements that get executed changes from run to run (from around 600 to 1000). There are no SQL errors: if I execute the generated string in the GUI, all rows are inserted without error.

If I execute one INSERT at a time, all rows are inserted, but it runs noticeably slower (around 30 s vs 1 s).

        with pyodbc.connect(connection_string) as con:
            cursor = con.cursor()
            for s in insert_statements:
                cursor.execute(s)
            cursor.commit()
@v-chojas
Contributor

v-chojas commented Aug 4, 2017

Each INSERT statement generates one resultset (containing the row count), which you are not consuming. SQL Server will block processing of further statements in a batch if there are pending resultsets filling up its write buffer. Try this instead:

        with pyodbc.connect(connection_string) as con:
            cursor = con.cursor()
            cursor.execute('\n'.join(insert_statements))
            while cursor.nextset():
                pass
            cursor.commit()

Please see https://support.microsoft.com/en-us/help/827575/sql-server-does-not-finish-execution-of-a-large-batch-of-sql-statement for more information.

@stolpovsky
Author

stolpovsky commented Aug 4, 2017

Thanks a lot for the link to the issue discussion and for suggesting a workaround. My first reaction was to blame the API, but on reflection, pausing execution of further statements until the caller consumes the output seems reasonable.

Can this situation be handled in pyodbc? For example, if there are any pending executions when commit() is called, throw.

@v-chojas
Contributor

v-chojas commented Aug 4, 2017

Although I don't know the specifics on what SQL Server does, I can understand and explain why this behaviour is by design and quite natural: the protocol works on a request-response model, where the client sends the request to the server and then waits for a reply. Multiple SQL statements can be submitted in one batch in order to minimise the number of network round-trips, but it is important that the client still know the status/result of each statement in a batch. Thus, the server's main loop is: read a statement; execute it; write the result. If the server can't write, it (implicitly) waits until it can.

This is where it would be appropriate to throw on buffer overflow, or to throw on timeout (while waiting for client to consume from output buffer).

This is a network buffer whose fullness depends heavily on network conditions, which explains your observation that the number of inserts executed varies somewhat randomly, and also why throwing doesn't make much sense in this context: I suppose the server could time out and close the connection, but that would have a huge negative effect on reliability and make it essentially unusable over slower links, or for interactive applications and clients that take a long time to process resultsets.

Regardless, the server expects you to read (and error-check) the replies it sends. Hence:

Microsoft recommends that you always consume all result sets from SQL Server regardless of the size of the batch that you are executing. If you do not flush this data and there are successful result sets to be returned ahead of the error in the result set batch, the client might not discover the server errors.

Try to clear the pending results first, and if that does not fix this issue, we can investigate further.

@gordthompson
Collaborator

Perhaps this might help. If I execute the code

# crsr is an open pyodbc cursor and #tmp is an existing temp table
sql = """\
INSERT INTO #tmp (id, txt) VALUES (1, 'FOO');
INSERT INTO #tmp (id, txt) VALUES (2, 'BAR');
INSERT INTO #tmp (id, txt) VALUES (3, 'BAZ');
"""

crsr.execute(sql)
row_count = crsr.rowcount
status_msg = "result returned. {} row(s) affected."
print(status_msg.format(row_count))
while crsr.nextset():
    row_count = crsr.rowcount
    print(status_msg.format(row_count))

I get three (3) results returned, one for each of the individual INSERT statements

result returned. 1 row(s) affected.
result returned. 1 row(s) affected.
result returned. 1 row(s) affected.

However, if I add a SET NOCOUNT ON; statement at the beginning of the anonymous code block

sql = """\
SET NOCOUNT ON;
INSERT INTO #tmp (id, txt) VALUES (1, 'FOO');
INSERT INTO #tmp (id, txt) VALUES (2, 'BAR');
INSERT INTO #tmp (id, txt) VALUES (3, 'BAZ');
"""

then I only get one result returned for the entire batch

result returned. -1 row(s) affected.
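
Applied to the original report, these two observations suggest prefixing the batch with SET NOCOUNT ON and still draining whatever results remain before committing. Here is a minimal sketch (not code from this thread), assuming connection_string and insert_statements as in the first post:

import pyodbc

# Sketch only: SET NOCOUNT ON suppresses the per-INSERT row-count results,
# and the nextset() loop drains anything still pending so the server can
# finish the whole batch (and any error in it surfaces) before the commit.
with pyodbc.connect(connection_string) as con:
    cursor = con.cursor()
    cursor.execute('SET NOCOUNT ON;\n' + '\n'.join(insert_statements))
    while cursor.nextset():
        pass
    cursor.commit()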

@bjquast

bjquast commented Aug 7, 2017

If the insert statement is always the same, it should be much faster to use a multi-row insert:

sql = """\
-- SET NOCOUNT ON; not needed as only one result returned
INSERT INTO #tmp (id, txt) 
VALUES (1, 'FOO'), (2, 'BAR'), (3, 'BAZ');
"""

@gordthompson
Collaborator

@stolpovsky - If you saw fit to close the pypyodbc issue then perhaps you might want to close this one, too.

@stolpovsky
Author

pypyodbc is practically unsupported; there is nobody there to look into the issue. I believe the behavior is dangerous and unintuitive: a random number of statements does not get executed even though commit is called, no error is raised, and pyodbc allows it. I am not sure what the best way to solve it would be. Maybe read the cursor inside commit, if there is anything left to execute?

while cursor.nextset():
    pass

Maybe throw from commit if the cursor has unread results?

@gordthompson
Collaborator

You raise a very good point, if a subset of the statements really does fail silently. Are you sure that no error is returned? That is, have you looped through all of the results using nextset and verified that there isn't an error result somewhere in there?
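
For reference, one way to run that check is to walk every result in the batch; pyodbc will generally raise an exception from nextset() when it reaches an error result. A minimal sketch, assuming cursor has just executed the joined batch of INSERT statements:

import pyodbc

# Sketch only: report each result's row count and stop at the first error.
result_index = 1
try:
    while True:
        print("result {}: {} row(s) affected".format(result_index, cursor.rowcount))
        if not cursor.nextset():
            break
        result_index += 1
except pyodbc.Error as exc:
    print("error after result {}: {}".format(result_index, exc))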

@stolpovsky
Author

stolpovsky commented Aug 11, 2017 via email

@mkleehammer
Owner

This is very interesting. I'd seen the problem before, but did not know the underlying cause.

We all agree it seems like a very dangerous default, and adding a loop to commit should be harmless. I would not expect any significant performance problems - no added network trips - for the "normal" case when there is a single statement.

I would expect any errors to raise an exception in commit. Is everyone OK with that? That means code would look like this:

cursor.execute(sql)
stuff_to_do_when_successful()
cursor.commit()  # <- exception raised here

I think it is fine, because an error could happen here anyway, such as a network disconnect from the server.

@stolpovsky
Author

From the user's perspective, this makes sense: commit will first execute any remaining statements, ignoring their results and throwing on error. Not returning results is OK, since the user did not write their own result handling.

It seems like an easy fix for the pyodbc maintainers, too: just add a loop at the beginning of commit.
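
As a user-side stopgap until something like that lands in pyodbc, the loop can live in a small wrapper. A minimal sketch; the helper name is hypothetical and not part of pyodbc:

def drain_and_commit(cursor):
    # Consume any result sets left over from a multi-statement batch (an
    # error result in the batch surfaces here as a pyodbc exception from
    # nextset()), then commit the transaction.
    while cursor.nextset():
        pass
    cursor.commit()

# usage with the batch from the original report:
cursor.execute('\n'.join(insert_statements))
drain_and_commit(cursor)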

@MartinRiddar

Did this ever get implemented or is this something I have to think about in my code?

@v-chojas
Contributor

Microsoft recommends that you always consume all result sets from SQL Server regardless of the size of the batch that you are executing.

@mneira10

This seems to be affecting inserts done through sqlalchemy and pandas, using:

engine = sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams))
dataFrame.to_sql(
            tableName,
            con=engine, ... )

Is there a workaround in this case?

@gordthompson
Collaborator

gordthompson commented Jan 30, 2021

@mneira10 - If you are talking about this Stack Overflow question then no, pandas to_sql() will almost certainly not be failing because of this issue. See my answer here for details.

@keitherskine
Collaborator

Closed due to inactivity. Feel free to re-open with current information if necessary.
