
Possibility to explicitly set connections to return ANSI bytestrings in... #30

Closed
wants to merge 1 commit

Conversation


@Lexcon Lexcon commented Dec 18, 2014

Added the possibility to explicitly set connections to return ANSI bytestrings in Py3 using unicode_results = False (it defaults to True in Py3). This allows the use of bytearrays for databases that are not properly Unicode encoded.

This patch adds functionality but does not affect any other code paths, so it is fully backward compatible. Arguably it makes the module's behavior more congruent with the parameters you can pass. More importantly, it might help adoption of Py3 by easing certain Py2-to-Py3 legacy issues.

@mkleehammer mkleehammer self-assigned this Dec 18, 2014
@mkleehammer
Owner

I'm not sure I see the reasoning here. In Python 3 all strings are Unicode - why wouldn't you want strings in Unicode even if the DB isn't set up right? Python's bytes is like a binary encoding.

I could certainly see having per-query encodings for certain columns if the DB is all ASCII and you want to treat a couple as binary.

Can you explain some more?

@Lexcon
Author

Lexcon commented Apr 30, 2015

Mike,

In certain cases one can have different encodings on a per-record basis.

Consider the case of a CMS with these columns:
RECID bigint, CONTENT varchar, LANGUAGEID tinyint

where each record might have a different encoding (for the Chinese page, Korean page, etc.). With a CMS serving webpages in different countries/encodings you can easily end up there. Of course, fixing the db content would be the ideal solution, but

  • other software depends on this data too
  • I'd probably like to fix all that content to the proper encodings
    using a tool like Python, in which case I'd need bytearray retrieval :)
  • SQL Server has varchar and nvarchar; almost by definition, what's in
    varchar is not guaranteed to be Unicode, as opposed to nvarchar

Basically I think there should be the option to use a database "garbage
in, garbage out", especially since that won't break anything: all
defaults remain Unicode, and if you don't deliberately light the match,
you can't burn your fingers.
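The repair scenario from the bullet list above comes down to a single decode step once the raw bytes are available: decode with the record's true codec, and you have proper Unicode to write back (a minimal sketch; the euc_kr codec is an assumed example, not something stated in this thread):

```python
# Sketch: repair a varchar value whose real encoding is known per record.
# The source codec ('euc_kr') is an assumed example for illustration.
def repair(raw: bytes, source_codec: str) -> str:
    """Decode raw database bytes with their true codec into proper Unicode."""
    return raw.decode(source_codec)

raw = '한국'.encode('euc_kr')   # bytes as they sit in a varchar column
fixed = repair(raw, 'euc_kr')   # correct str, ready to store as nvarchar
```

This is exactly why bytes retrieval matters: if the driver has already decoded the column with the wrong codec, the original byte values may be lost before any repair can happen.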

I like your idea about per-column indicators although that might bloat
the library.

In my personal case I've even got to deal with stuff in (binary)
encrypted or compressed format due to field width limitations in the db
or encryption requirements not natively supported by the db. That means
unpacking in the client (python). I would go through the effort of
cleaning it all up if the fix in pyODBC wasn't so trivial.

Thanks.

Robert


@mkleehammer
Owner

Fair enough.

I'm going to need to work this one manually. First, I had removed unicode_results from Python 3 since it wasn't needed. Second, I need to add unit tests for a couple of databases.

@mkleehammer
Owner

While working on this it occurred to me you might already have the tools you need. I have already implemented per-type converters. Can you use something like this?

    def test_varchar_converter(self):
        """
        Ensure a conversion function can be used to force CHAR and VARCHAR columns
        to be returned as bytes instead of Unicode strings.
        """
        def tobyte(val):
            if val is None:
                return None
            return val.encode('ascii')

        self.cnxn.add_output_converter(pyodbc.SQL_CHAR, tobyte)
        self.cnxn.add_output_converter(pyodbc.SQL_VARCHAR, tobyte)
        self.cnxn.add_output_converter(pyodbc.SQL_LONGVARCHAR, tobyte)

        value = b'testing'
        self.cursor.execute("create table t1(s varchar(20))")
        self.cursor.execute("insert into t1 values (?)", value)
        v = self.cursor.execute("select * from t1").fetchone()[0]
        self.assertEqual(type(v), bytes)
        self.assertEqual(v, value)

The data is converted to Unicode, regardless of the data type, and passed to the converter. (I couldn't come up with a way to pass "raw" data in Python. Passing a 4-byte array instead of a "long" didn't seem like the right way to go.) I'm open to suggestions on improving the converters too.
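One general observation on that point (not from the thread itself): when single-byte data has been decoded with latin-1, whose 256 code points map one-to-one onto byte values, a converter can recover the exact original bytes by re-encoding. Whether the driver actually decoded with latin-1 is an assumption that must hold for this trick to work:

```python
# Sketch: latin-1 round-trips every byte value 0-255, so re-encoding with it
# recovers the original bytes unchanged. This only works if the incoming str
# was itself decoded with latin-1 -- an assumption, not a pyodbc guarantee.
def tobytes(val):
    if val is None:
        return None
    return val.encode('latin-1')

raw = b'AB\xcb'                  # original ANSI bytes from the database
as_str = raw.decode('latin-1')   # what a converter might receive
assert tobytes(as_str) == raw    # the exact original bytes are recovered
```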

@Lexcon
Author

Lexcon commented Apr 30, 2015

Mike,

Actually I'm not using the output converters but I understand the mechanism.

The problem is that the actual content isn't correct Unicode to start
with, so interpreting it as Unicode and then trying to encode it to
another charset doesn't make sense.

If any converter were used, its purpose would be to assemble correct
Unicode from whatever comes in, instead of creating bytes from what is
supposedly Unicode.

By the way, is our current mail exchange a reaction to the request I
submitted quite a while ago, or to the patch I submitted on GitHub? I've
already implemented the change needed and created a pull request for it.

Robert


@mkleehammer
Owner

Both, but I'd rather use converters if they would do the trick. I'm considering making the converter parameter bytes or bytearray for varchar, int, etc. Would that help?

@Lexcon
Author

Lexcon commented Apr 30, 2015

Mike,

That would be perfect. Wouldn't that break existing functionality?

BTW I'd be happy to make a financial contribution for this.

Robert


@Lexcon
Author

Lexcon commented Apr 30, 2015

Michael,

Actually I was thinking.

When you pass in just the to-be-converted field's value, you have no
context of the rest of the row, right? So for this table setup to work:

    Content (varchar)   LanguageISO (char(5))
    ABË                 En-en
    ABË                 Euc-Kr

one would need, while processing the Content field, access to the
LanguageISO field.
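Lacking per-row context inside a converter, one workable pattern is to fetch the varchar column as bytes and decode after the fact, keyed on the language column (a sketch only; the language-to-codec mapping is an assumption for illustration):

```python
# Sketch: decode per-record encoded content after fetching, using a
# hypothetical language-to-codec mapping.
CODECS = {
    'En-en': 'cp1252',
    'Euc-Kr': 'euc_kr',
}

def decode_rows(rows):
    """rows: iterable of (content_bytes, language_iso) pairs."""
    for content, language in rows:
        codec = CODECS.get(language.strip(), 'utf-8')  # fall back to UTF-8
        yield content.decode(codec), language

# Simulated fetch results (raw bytes as they would come from the varchar).
rows = [
    ('ABË'.encode('cp1252'), 'En-en'),
    ('안녕'.encode('euc_kr'), 'Euc-Kr'),
]
decoded = list(decode_rows(rows))
```

The decoding happens entirely in the application, so it sidesteps the converter's lack of row context at the cost of a post-processing pass over the result set.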

Robert


@mkleehammer
Owner

It would break, but I would actually be willing to do that and call it 3.1 or 4.0. I've been uncomfortable with passing Unicode to them.

Until then, I'm going to merge your patch, but I'll need to hand touch it up because I removed unicode_results from Python 3 with an #ifdef, etc.

Also, I thought I asked this earlier but I guess I deleted it. Currently the patch you provided would also affect GUIDs and XML. I really think XML should remain Unicode since most people would expect it. I don't have an opinion on GUIDs, but I think the option makes more sense if it is simply "the flag causes CHAR and VARCHAR to be returned as bytes". So I would want to make a small change for that.

Also, is "unicode_results" really right? Perhaps a new parameter name is better such as "char_as_bytes"? It is a bit more work and makes the C code a little messier, but I think it would be better to have a more self-describing parameter.

Thoughts?

@mkleehammer
Owner

And I forgot to answer the last question - I don't pass in context since I'm calling the converter as I process each column. It would be possible to do so afterwards, but it wouldn't be as efficient. If more than one column needs conversion, you would not know in each converter whether the other column had already been converted or not. (Not important in your example, but something that could trip someone up.)

Which is why I'm leaning towards the flag now.

@Lexcon
Author

Lexcon commented May 1, 2015

I think unicode_results is understandable: it instructs pyODBC to return
everything as Unicode, which is understandable, logical, and expected
behavior, especially in Python 2.x.

For Python 3.x it's an understandable flag because, although it should
default to True, setting it to False clearly indicates that the library
will no longer return Unicode. That's about as descriptive and
appropriate as it gets. My personal opinion is that tweaking the names
any further would complicate things: you'd get discussions about the
unicode_results flag as opposed to the char_as_bytes flag, and people
would have whole philosophies about the difference.

As for the GUID and XML types: I expect there would never be an
incompatibility issue with GUIDs, because they are all convertible to
ASCII and back no matter how you present them, so either Unicode or
bytes would be 100% usable.

I'm not sure about the XML types (I'm no expert), but it seems that at
least SQL Server supports different encodings: if you pass in UTF-8 you
receive UTF-8, and if you pass in ANSI you receive ANSI:

https://msdn.microsoft.com/en-us/library/ms131375.aspx

This means you'd have to make an assumption about the encoding.

If I had to decide, I would send both GUIDs and XML as bytes as well:
XML because it just might be what people need in a certain scenario,
and GUIDs simply for consistency. It would make the flag more
understandable: you set unicode_results to False and you know what to
expect, without an exception here and an exception there.

Robert


@Lexcon
Author

Lexcon commented May 1, 2015

One more addition to my previous mail:

At least SQL Server has the distinction between varchar and nvarchar.
Varchar can contain bytestreams of arbitrary encoding, whereas nvarchar
is by definition guaranteed to be Unicode.
If you wanted to make distinctions between field types, you could choose
to return Unicode only when it's guaranteed no encoding conflict exists.
That would mean returning nvarchar and GUIDs as Unicode regardless of
the flag, but XML and varchars as bytestreams.

What you're facing in that scenario is that you'd need a full list of
all field types, across all db systems, that are guaranteed to contain Unicode.

For instance I also use Oracle for the same application and also there
I'm facing the same challenges but I don't know if Oracle has field
types that enforce Unicode.

For Oracle a 'complication' is that number datatypes could be returned
as long whenever there are no decimal places. That would reduce the need
for decimal.Decimal objects. But that's a whole different discussion.

Robert


@Lexcon
Author

Lexcon commented Sep 21, 2016

Mike,

I was just wondering, do you have any official release scheduled? For now pyodbc is keeping me from moving to Py3 unless I use my manually patched source to facilitate fallback to bytes instead of unicode for varchar fields (versus nvarchar fields). It would be nice to have an 'official' solution.

@mkleehammer
Owner

I don't think this is necessary anymore after the complete Unicode rewrite. You can now use something like: Connection.setdecoding(pyodbc.SQL_VARCHAR, encoding='utf-8', to=str)
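For reference, that pyodbc 4 mechanism might look roughly like this (a sketch only: the DSN is a placeholder, and the exact setdecoding signature, including the to= parameter, should be checked against the documentation of the installed pyodbc version):

```python
import pyodbc

# DSN is a placeholder; supply a real connection string.
cnxn = pyodbc.connect('DSN=mydsn')

# Decode SQL_VARCHAR data as UTF-8 str, as suggested above...
cnxn.setdecoding(pyodbc.SQL_VARCHAR, encoding='utf-8', to=str)

# ...or, for data whose encoding varies per record, keep the raw bytes
# (assuming the installed version supports to=bytes).
cnxn.setdecoding(pyodbc.SQL_VARCHAR, encoding='utf-8', to=bytes)
```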

Thanks though.
