
to_gbq result in UnicodeEncodeError #106

Closed
2legit opened this issue Jan 10, 2018 · 24 comments

@2legit

2legit commented Jan 10, 2018

Hi, I'm using Heroku to run a Python-based ETL process where I push the contents of a pandas DataFrame into Google BigQuery using to_gbq. However, it raises a UnicodeEncodeError with the stack trace below, due to some non-Latin characters.

Strangely, this works fine on my Mac, but fails when I run it on Heroku. For some reason, http.client is receiving an unencoded str rather than bytes, so it tries to encode the body with latin-1, which is the default but obviously chokes on anything non-Latin, like Chinese characters.

Load is 100.0% Complete
Traceback (most recent call last):
  File "AllCostAndRev.py", line 534, in <module>
    main(yaml.dump(data=ads_dict))
  File "AllCostAndRev.py", line 475, in main
    private_key=environ['skynet_bq_pk']
  File "/app/.heroku/python/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 989, in to_gbq
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
  File "/app/.heroku/python/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 590, in load_data
    job_config=job_config).result()
  File "/app/.heroku/python/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 748, in load_table_from_file
    file_obj, job_resource, num_retries)
  File "/app/.heroku/python/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 777, in _do_resumable_upload
    response = upload.transmit_next_chunk(transport)
  File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/requests/upload.py", line 395, in transmit_next_chunk
    retry_strategy=self._retry_strategy)
  File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/requests/_helpers.py", line 101, in http_request
    func, RequestsMixin._get_status_code, retry_strategy)
  File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/_helpers.py", line 146, in wait_and_retry
    response = func()
  File "/app/.heroku/python/lib/python3.6/site-packages/google/auth/transport/requests.py", line 186, in request
    method, url, data=data, headers=request_headers, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/app/.heroku/python/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/app/.heroku/python/lib/python3.6/http/client.py", line 1284, in _send_request
    body = _encode(body, 'body')
  File "/app/.heroku/python/lib/python3.6/http/client.py", line 161, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 553626-553628: Body ('信用卡') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.
heroku[run.2251]: Process exited with status 1
heroku[run.2251]: State changed from up to complete

@max-sixty

Can you slim the dataframe down to a small subset and post it?

@2legit

2legit commented Jan 10, 2018

Sure, the following fails when I try to post to BigQuery using to_gbq() on Heroku, but seems to run fine on my Mac.

df = pd.DataFrame(np.random.randn(6, 4), index=range(6), columns=list('ABCD'))
df.A = '信用卡'
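For reference, a minimal sketch of why this frame trips the error, assuming (as the stack trace suggests) that pandas-gbq serializes the frame to CSV text before upload:

```python
import numpy as np
import pandas as pd

# Same frame as above: column A holds non-Latin (Chinese) characters.
df = pd.DataFrame(np.random.randn(6, 4), index=range(6), columns=list('ABCD'))
df.A = '信用卡'

# to_csv() returns a str; if that str reaches http.client unencoded,
# http.client falls back to latin-1 and raises.
body = df.to_csv(index=False)
assert isinstance(body, str)

try:
    body.encode('latin-1')
except UnicodeEncodeError:
    print("latin-1 cannot represent 信用卡")

# Encoding explicitly as UTF-8 succeeds, which is what the upload path
# would need to do before handing the body to http.client.
payload = body.encode('utf-8')
assert isinstance(payload, bytes)
```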

@max-sixty

It works fine on Linux for me (though oddly it won't print the characters to the screen, whereas my Mac will).
I'm not strong on Linux. You could look at locales, but that's a low-confidence suggestion. You may get more help on SO too.

@2legit

2legit commented Jan 11, 2018

Thanks. I've checked SO and even tried to set the locale. I think the issue is earlier: I'm not sure why the dataframe is still a str by the time it reaches http.client. Shouldn't it have been converted to bytes much earlier? The error is a result of http.client trying to encode the str with latin-1, which fails on Chinese chars.
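The str-vs-bytes rule in question can be sketched in isolation (this is not pandas-gbq code, just an illustration of http.client's documented fallback: a str body is encoded as latin-1 by default, while a bytes body is sent untouched):

```python
body = '信用卡'  # a str body, as http.client apparently receives it

# http.client tries latin-1 on str bodies; these characters are outside
# latin-1's 0x00-0xFF range, so encoding raises.
try:
    body.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
assert not latin1_ok

# If the caller converts to bytes first, http.client sends them as-is.
payload = body.encode('utf-8')
assert payload == b'\xe4\xbf\xa1\xe7\x94\xa8\xe5\x8d\xa1'
assert payload.decode('utf-8') == '信用卡'
```

So the fix is to hand http.client bytes (UTF-8-encoded) instead of a raw str.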

@northlaender

Same issue here with version 0.3.0; it works on version 0.2.1 with exactly the same data. I tried to push the same df loaded from a CSV with encoding='UTF-8' and still got the same error.

@max-sixty

Are you on Py2 or Py3?

@max-sixty

@2legit that makes sense. Do you happen to know where it's attempting to convert to latin-1?

@2legit

2legit commented Jan 11, 2018

Hi, yes, in the stack trace you can see it happens during _encode() on line 161 of http/client.py. I'm using Python 3.6.3.

python3.6/http/client.py", line 161, in _encode
2018-01-08T04:54:20.448405+00:00 app[run.2251]: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 553626-553628: Body ('信用卡') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

@2legit

2legit commented Jan 11, 2018

I think @northlaender is right. The Mac vs. Linux difference is a red herring. Something is broken in pandas-gbq 0.3.0 that was working fine in 0.2.1. I'm using the latter on my Mac, but on Heroku it's pulling the latest pandas-gbq since I didn't pin the version. I just checked, and Heroku is running 0.3.0, which is breaking on non-Latin chars.

@northlaender

@maxim-lian I'm on Python 3.6.2

@max-sixty

If you want to try the fix, do pip3 install git+https://github.com/maxim-lian/pandas-gbq.git#encoding and see if that works

@northlaender

Hi @maxim-lian, I just did a quick test with the same dataset and still no success.
The complete error is:

gbq.__version '0.1.2+66.g61bc28f'

Load is 100.0% Complete

UnicodeEncodeError Traceback (most recent call last)
in ()
----> 1 gbq.to_gbq(dataframes, 'test.Checkout_Web', project_id=projectId, private_key=creds, if_exists='replace')

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\pandas_gbq\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver)
987 table.create(table_id, table_schema)
988
--> 989 connector.load_data(dataframe, dataset_id, table_id, chunksize)
990
991

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\pandas_gbq\gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize)
588 body,
589 destination_table,
--> 590 job_config=job_config).result()
591 except self.http_error as ex:
592 self.process_http_error(ex)

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\cloud\bigquery\client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, job_config)
770 if size is None or size >= _MAX_MULTIPART_SIZE:
771 response = self._do_resumable_upload(
--> 772 file_obj, job_resource, num_retries)
773 else:
774 response = self._do_multipart_upload(

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\cloud\bigquery\client.py in _do_resumable_upload(self, stream, metadata, num_retries)
799
800 while not upload.finished:
--> 801 response = upload.transmit_next_chunk(transport)
802
803 return response

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\requests\upload.py in transmit_next_chunk(self, transport)
393 result = _helpers.http_request(
394 transport, method, url, data=payload, headers=headers,
--> 395 retry_strategy=self._retry_strategy)
396 self._process_response(result, len(payload))
397 return result

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\requests\_helpers.py in http_request(transport, method, url, data, headers, retry_strategy, **transport_kwargs)
99 **transport_kwargs)
100 return _helpers.wait_and_retry(
--> 101 func, RequestsMixin._get_status_code, retry_strategy)

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\_helpers.py in wait_and_retry(func, get_status_code, retry_strategy)
144 object: The return value of func.
145 """
--> 146 response = func()
147 if get_status_code(response) not in RETRYABLE:
148 return response

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\auth\transport\requests.py in request(self, method, url, data, headers, **kwargs)
199
200 response = super(AuthorizedSession, self).request(
--> 201 method, url, data=data, headers=request_headers, **kwargs)
202
203 # If the response indicated that the credentials needed to be

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
506 }
507 send_kwargs.update(settings)
--> 508 resp = self.send(prep, **send_kwargs)
509
510 return resp

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\sessions.py in send(self, request, **kwargs)
616
617 # Send the request
--> 618 r = adapter.send(request, **kwargs)
619
620 # Total elapsed time of the request (approximately)

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
438 decode_content=False,
439 retries=self.max_retries,
--> 440 timeout=timeout
441 )
442

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
599 timeout=timeout_obj,
600 body=body, headers=headers,
--> 601 chunked=chunked)
602
603 # If we're going to release the connection in finally:, then

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
355 conn.request_chunked(method, url, **httplib_request_kw)
356 else:
--> 357 conn.request(method, url, **httplib_request_kw)
358
359 # Reset the timeout for the recv() on the socket

c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in request(self, method, url, body, headers, encode_chunked)
1237 encode_chunked=False):
1238 """Send a complete request to the server."""
-> 1239 self._send_request(method, url, body, headers, encode_chunked)
1240
1241 def _send_request(self, method, url, body, headers, encode_chunked):

c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in _send_request(self, method, url, body, headers, encode_chunked)
1282 # RFC 2616 Section 3.7.1 says that text default has a
1283 # default charset of iso-8859-1.
-> 1284 body = _encode(body, 'body')
1285 self.endheaders(body, encode_chunked=encode_chunked)
1286

c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in _encode(data, name)
159 "%s (%.20r) is not valid Latin-1. Use %s.encode('utf-8') "
160 "if you want to send it encoded in UTF-8." %
--> 161 (name.title(), data[err.start:err.end], name)) from None
162
163

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2122' in position 163736: Body ('™') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

@max-sixty

Thanks for testing. Do you have a reproducible example?

@northlaender

Hi @maxim-lian, happy to help.
So this df produces the error:

df = pd.DataFrame({'string': ['Skywalker™', 'lego', 'hülle'], 'integer': [200, 300, 400], 'Date': ['2017-12-13 17:40:39', '2017-12-13 17:40:39', '2017-12-13 17:40:39']})

to_gbq breaks on the ™.
If you remove the ™ it works, but the ü comes through as � in BigQuery.
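Both symptoms match what latin-1 can and cannot represent: ™ (U+2122) lies outside latin-1 entirely, so encoding raises, while ü (U+00FC) is latin-1-encodable but produces a byte that is invalid UTF-8, which a UTF-8 reader renders as �. A quick sketch of this reading of the symptoms (not the pandas-gbq code path itself):

```python
# '™' (U+2122) is not representable in latin-1 at all, so it raises.
try:
    '™'.encode('latin-1')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# 'ü' (U+00FC) *is* latin-1-encodable, but the resulting single byte
# is not valid UTF-8, so a UTF-8 consumer sees U+FFFD (�) instead.
assert 'ü'.encode('latin-1') == b'\xfc'
assert b'\xfc'.decode('utf-8', errors='replace') == '\ufffd'
```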

@max-sixty

@northlaender I've made a change and added that exact case as a test - if you get a moment it would be super if you could try again. Thank you!

@northlaender

Hi @maxim-lian, the latest fix didn't yield any different results.

@max-sixty

@northlaender Thanks for checking.

Though when I run the test, I can see the unicode in gbq on Python 2, and the test passes fine on Python 3. Would you be able to paste a stack trace?

(To be clear, you need to reinstall with pip3 install git+https://github.com/maxim-lian/pandas-gbq.git#encoding for the new version to take effect.)


@northlaender

Hi @maxim-lian, the issue was clearly with updating the code via pip. I rechecked, and now it works! 👍

@DanielWFrancis

I can't get this to work :(

I'm installing with pip, as I don't have pip3 and can't figure out whether it's really different, given my root install is Python 3.

@max-sixty

max-sixty commented Jan 24, 2018

@DanielWFrancis what do you see when you run which pip? If it's linked to 2.7, then you're installing to your Python 2, not your Python 3. But you must have some way to install to your Python 3?

This will confirm where python is sourcing the library from:

In [1]: import pandas_gbq

In [2]: pandas_gbq.__file__
Out[2]: '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_gbq/__init__.py'

@smred

smred commented Feb 7, 2018

Hi,
similar problem on the Mac.

Python 3.6.
I tried @maxim-lian's branch, but it didn't help.
I work only with Cyrillic symbols.

@max-sixty

@smred Can you post your versions?

@DanielWFrancis

DanielWFrancis commented Feb 8, 2018

@smred I worked around by opening the JSON file that I built my dataframe from with

import codecs

codecs.open('filename.json', "r", "ISO-8859-1").read()

I presume this will work for any data source you might be using.
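If the source file is actually UTF-8 encoded (as most JSON is), a possible alternative is to read it with the encoding it was written in, so multi-byte characters stay intact as str rather than becoming latin-1 mojibake. A sketch, using a hypothetical stand-in file rather than the original filename.json:

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the original JSON source file.
path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with open(path, 'wb') as f:
    f.write('{"product": "Skywalker™"}'.encode('utf-8'))

# Reading with the file's real encoding preserves the characters.
text = codecs.open(path, 'r', 'utf-8').read()
assert '™' in text
```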

@max-sixty

@tswast I think we can close this given #108

If this crops up again, please post here and we can reopen

@tswast tswast closed this as completed Feb 10, 2018