
to_gbq result in UnicodeEncodeError #106

Closed
2legit opened this issue Jan 10, 2018 · 24 comments

@2legit

2legit commented Jan 10, 2018

Hi, I'm using Heroku to run a Python-based ETL process where I push the contents of a pandas DataFrame into Google BigQuery using to_gbq. However, it raises a UnicodeEncodeError with the stack trace below, due to some non-Latin characters.

Strangely, this works fine on my Mac, but fails when I run it on Heroku. For some reason, http.client is receiving an unencoded str rather than bytes, so it tries to encode the body with latin-1, which is the default but obviously chokes on anything non-Latin, like Chinese characters.

Load is 100.0% Complete
Traceback (most recent call last):
  File "AllCostAndRev.py", line 534, in <module>
    main(yaml.dump(data=ads_dict))
  File "AllCostAndRev.py", line 475, in main
    private_key=environ['skynet_bq_pk']
  File "/app/.heroku/python/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 989, in to_gbq
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
  File "/app/.heroku/python/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 590, in load_data
    job_config=job_config).result()
  File "/app/.heroku/python/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 748, in load_table_from_file
    file_obj, job_resource, num_retries)
  File "/app/.heroku/python/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 777, in _do_resumable_upload
    response = upload.transmit_next_chunk(transport)
  File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/requests/upload.py", line 395, in transmit_next_chunk
    retry_strategy=self._retry_strategy)
  File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/requests/_helpers.py", line 101, in http_request
    func, RequestsMixin._get_status_code, retry_strategy)
  File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/_helpers.py", line 146, in wait_and_retry
    response = func()
  File "/app/.heroku/python/lib/python3.6/site-packages/google/auth/transport/requests.py", line 186, in request
    method, url, data=data, headers=request_headers, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/app/.heroku/python/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/app/.heroku/python/lib/python3.6/http/client.py", line 1284, in _send_request
    body = _encode(body, 'body')
  File "/app/.heroku/python/lib/python3.6/http/client.py", line 161, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 553626-553628: Body ('信用卡') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.
heroku[run.2251]: Process exited with status 1
heroku[run.2251]: State changed from up to complete

@max-sixty

Can you slim the dataframe down to a small subset and post it?

@2legit

2legit commented Jan 10, 2018

Sure, the following fails when I try to post to BigQuery using to_gbq() on Heroku, but seems to run fine on my Mac.

df = pd.DataFrame(np.random.randn(6, 4), index=range(6), columns=list('ABCD'))
df.A = '信用卡'
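For reference, a minimal sketch of why this frame trips the error, assuming (as the stack trace suggests) that pandas-gbq serializes the frame to CSV text before upload:

```python
import numpy as np
import pandas as pd

# Same frame as above: column A holds non-Latin (Chinese) characters.
df = pd.DataFrame(np.random.randn(6, 4), index=range(6), columns=list('ABCD'))
df.A = '信用卡'

# to_csv() returns a str; if that str reaches http.client unencoded,
# http.client falls back to latin-1 and raises.
body = df.to_csv(index=False)
assert isinstance(body, str)

try:
    body.encode('latin-1')
except UnicodeEncodeError:
    print("latin-1 cannot represent 信用卡")

# Encoding explicitly as UTF-8 succeeds, which is what the upload path
# would need to do before handing the body to http.client.
payload = body.encode('utf-8')
assert isinstance(payload, bytes)
```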

@max-sixty

It works fine on Linux for me (though oddly it won't print the characters to the screen, whereas my Mac will).
I'm not strong on Linux. You could look at locales, but that's a low-confidence suggestion. You may get more help on SO too.

@2legit

2legit commented Jan 11, 2018

Thanks. I've checked SO and even tried to set the locale. I think the issue is earlier: I'm not sure why the dataframe is still a str by the time it reaches http.client. Shouldn't it have been converted to bytes much earlier? The error is a result of http.client trying to encode the str with latin-1, which fails on Chinese chars.
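The str-vs-bytes rule in question can be sketched in isolation (this is not pandas-gbq code, just an illustration of http.client's documented fallback: a str body is encoded as latin-1 by default, while a bytes body is sent untouched):

```python
body = '信用卡'  # a str body, as http.client apparently receives it

# http.client tries latin-1 on str bodies; these characters are outside
# latin-1's 0x00-0xFF range, so encoding raises.
try:
    body.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
assert not latin1_ok

# If the caller converts to bytes first, http.client sends them as-is.
payload = body.encode('utf-8')
assert payload == b'\xe4\xbf\xa1\xe7\x94\xa8\xe5\x8d\xa1'
assert payload.decode('utf-8') == '信用卡'
```

So the fix is to hand http.client bytes (UTF-8-encoded) instead of a raw str.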

@northlaender

Same issue here with version 0.3.0; it works on version 0.2.1 with exactly the same data. I tried to push the same df loaded from a CSV with encoding='UTF-8' and still got the same error.

@max-sixty

Are you on Py2 or Py3?

@max-sixty

@2legit that makes sense. Do you happen to know where it's attempting to convert to latin-1?

@2legit

2legit commented Jan 11, 2018

Hi, yes, in the stack trace you can see it happens during _encode() on line 161 of http/client.py. I'm using Python 3.6.3.

python3.6/http/client.py", line 161, in _encode
2018-01-08T04:54:20.448405+00:00 app[run.2251]: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 553626-553628: Body ('信用卡') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

@2legit

2legit commented Jan 11, 2018

I think @northlaender is right. The Mac vs. Linux difference is a red herring. Something is broken in pandas-gbq 0.3.0 that was working fine in 0.2.1. I'm using the latter on my Mac, but on Heroku it's pulling the latest pandas-gbq since I didn't pin the version. I just checked, and Heroku is running 0.3.0, which is breaking on non-Latin chars.

@northlaender

@maxim-lian I'm on Python 3.6.2

@max-sixty

If you want to try the fix, do pip3 install git+https://github.com/maxim-lian/pandas-gbq.git#encoding and see if that works

@northlaender

Hi @maxim-lian, I just did a quick test with the same dataset and still no success.
The complete error is:

gbq.__version '0.1.2+66.g61bc28f'

Load is 100.0% Complete

UnicodeEncodeError Traceback (most recent call last)
in ()
----> 1 gbq.to_gbq(dataframes, 'test.Checkout_Web', project_id=projectId, private_key=creds, if_exists='replace')

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\pandas_gbq\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver)
987 table.create(table_id, table_schema)
988
--> 989 connector.load_data(dataframe, dataset_id, table_id, chunksize)
990
991

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\pandas_gbq\gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize)
588 body,
589 destination_table,
--> 590 job_config=job_config).result()
591 except self.http_error as ex:
592 self.process_http_error(ex)

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\cloud\bigquery\client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, job_config)
770 if size is None or size >= _MAX_MULTIPART_SIZE:
771 response = self._do_resumable_upload(
--> 772 file_obj, job_resource, num_retries)
773 else:
774 response = self._do_multipart_upload(

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\cloud\bigquery\client.py in _do_resumable_upload(self, stream, metadata, num_retries)
799
800 while not upload.finished:
--> 801 response = upload.transmit_next_chunk(transport)
802
803 return response

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\requests\upload.py in transmit_next_chunk(self, transport)
393 result = _helpers.http_request(
394 transport, method, url, data=payload, headers=headers,
--> 395 retry_strategy=self._retry_strategy)
396 self._process_response(result, len(payload))
397 return result

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\requests\_helpers.py in http_request(transport, method, url, data, headers, retry_strategy, **transport_kwargs)
99 **transport_kwargs)
100 return _helpers.wait_and_retry(
--> 101 func, RequestsMixin._get_status_code, retry_strategy)

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\_helpers.py in wait_and_retry(func, get_status_code, retry_strategy)
144 object: The return value of func.
145 """
--> 146 response = func()
147 if get_status_code(response) not in RETRYABLE:
148 return response

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\auth\transport\requests.py in request(self, method, url, data, headers, **kwargs)
199
200 response = super(AuthorizedSession, self).request(
--> 201 method, url, data=data, headers=request_headers, **kwargs)
202
203 # If the response indicated that the credentials needed to be

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
506 }
507 send_kwargs.update(settings)
--> 508 resp = self.send(prep, **send_kwargs)
509
510 return resp

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\sessions.py in send(self, request, **kwargs)
616
617 # Send the request
--> 618 r = adapter.send(request, **kwargs)
619
620 # Total elapsed time of the request (approximately)

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
438 decode_content=False,
439 retries=self.max_retries,
--> 440 timeout=timeout
441 )
442

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
599 timeout=timeout_obj,
600 body=body, headers=headers,
--> 601 chunked=chunked)
602
603 # If we're going to release the connection in finally:, then

c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
355 conn.request_chunked(method, url, **httplib_request_kw)
356 else:
--> 357 conn.request(method, url, **httplib_request_kw)
358
359 # Reset the timeout for the recv() on the socket

c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in request(self, method, url, body, headers, encode_chunked)
1237 encode_chunked=False):
1238 """Send a complete request to the server."""
-> 1239 self._send_request(method, url, body, headers, encode_chunked)
1240
1241 def _send_request(self, method, url, body, headers, encode_chunked):

c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in _send_request(self, method, url, body, headers, encode_chunked)
1282 # RFC 2616 Section 3.7.1 says that text default has a
1283 # default charset of iso-8859-1.
-> 1284 body = _encode(body, 'body')
1285 self.endheaders(body, encode_chunked=encode_chunked)
1286

c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in _encode(data, name)
159 "%s (%.20r) is not valid Latin-1. Use %s.encode('utf-8') "
160 "if you want to send it encoded in UTF-8." %
--> 161 (name.title(), data[err.start:err.end], name)) from None
162
163

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2122' in position 163736: Body ('™') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

@max-sixty

Thanks for testing. Do you have a reproducible example?

@northlaender

Hi @maxim-lian, happy to help.
So this df produces the error:

df = pd.DataFrame({'string': ['Skywalker™', 'lego', 'hülle'], 'integer': [200, 300, 400], 'Date': ['2017-12-13 17:40:39', '2017-12-13 17:40:39', '2017-12-13 17:40:39']})

to_gbq breaks on the ™.
If you remove the ™ it works, but the ü comes through as � in BigQuery.
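Both symptoms match what latin-1 can and cannot represent: ™ (U+2122) lies outside latin-1 entirely, so encoding raises, while ü (U+00FC) is latin-1-encodable but produces a byte that is invalid UTF-8, which a UTF-8 reader renders as �. A quick sketch of this reading of the symptoms (not the pandas-gbq code path itself):

```python
# '™' (U+2122) is not representable in latin-1 at all, so it raises.
try:
    '™'.encode('latin-1')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# 'ü' (U+00FC) *is* latin-1-encodable, but the resulting single byte
# is not valid UTF-8, so a UTF-8 consumer sees U+FFFD (�) instead.
assert 'ü'.encode('latin-1') == b'\xfc'
assert b'\xfc'.decode('utf-8', errors='replace') == '\ufffd'
```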

@max-sixty

@northlaender I've made a change and added that exact case as a test - if you get a moment it would be super if you could try again. Thank you!

@northlaender

Hi @maxim-lian, the latest fix didn't yield any different results.

@max-sixty

@northlaender Thanks for checking.

Though when I run the test, I can see the unicode in gbq on Python 2, and the test passes fine on Python 3. Would you be able to paste a stack trace?

(To be clear, you need to reinstall with pip3 install git+https://github.com/maxim-lian/pandas-gbq.git#encoding for the new version to take effect.)


@northlaender

Hi @maxim-lian, the issue was clearly with updating the code via pip. I rechecked, and now it works! 👍

@DanielWFrancis

I can't get this to work :(

I'm installing with pip, as I don't have pip3 and can't figure out whether it's really different, given my root install is Python 3.

@max-sixty

max-sixty commented Jan 24, 2018

@DanielWFrancis what do you see when you run which pip? If it's linked to 2.7, then you're installing to your Python 2, not your Python 3. But you must have some way to install to your Python 3?

This will confirm where python is sourcing the library from:

In [1]: import pandas_gbq

In [2]: pandas_gbq.__file__
Out[2]: '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_gbq/__init__.py'

@smred

smred commented Feb 7, 2018

Hi,
similar problem on the Mac.

Python 3.6.
I tried @maxim-lian's branch, but it didn't help.
I work only with Cyrillic symbols.

@max-sixty

@smred Can you post your versions?

@DanielWFrancis

DanielWFrancis commented Feb 8, 2018

@smred I worked around by opening the JSON file that I built my dataframe from with

import codecs

codecs.open('filename.json', "r", "ISO-8859-1").read()

I presume this will work for any data source you might be using.
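If the source file is actually UTF-8 encoded (as most JSON is), a possible alternative is to read it with the encoding it was written in, so multi-byte characters stay intact as str rather than becoming latin-1 mojibake. A sketch, using a hypothetical stand-in file rather than the original filename.json:

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the original JSON source file.
path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with open(path, 'wb') as f:
    f.write('{"product": "Skywalker™"}'.encode('utf-8'))

# Reading with the file's real encoding preserves the characters.
text = codecs.open(path, 'r', 'utf-8').read()
assert '™' in text
```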

@max-sixty

@tswast I think we can close this given #108

If this crops up again, please post here and we can reopen

@tswast tswast closed this as completed Feb 10, 2018