fix bug 1035946 - urllib.unquote does not like ints, and backslash will #2176

Merged
merged 1 commit into from Jul 9, 2014

Projects

None yet

2 participants

@rhelmer
Member
rhelmer commented Jul 8, 2014

wreak havoc with the CSV->postgres import

@rhelmer
Member
rhelmer commented Jul 8, 2014

Two issues I hit looking at the live data:

  1. some rows contain ints, which urllib.unquote will barf on
  2. saw at least one channel with a backslash, which is an escape character in CSV

Since each column is now getting converted to str I don't think .encoding('utf8') should be needed anymore... the underlying data should be ASCII anyway.

@rhelmer rhelmer changed the title from fix bug 1035946 - urllib.quote does not like ints, and backslash will to fix bug 1035946 - urllib.unquote does not like ints, and backslash will Jul 8, 2014
@adngdb adngdb commented on the diff Jul 9, 2014
socorro/cron/jobs/fetch_adi_from_hive.py
@@ -169,7 +169,7 @@ def run(self, connection, date):
f.write(
"\t"
.join(
- urllib2.unquote(v).decode('utf8')
+ urllib2.unquote(str(v)).replace('\\', '\\\\')
@adngdb
adngdb Jul 9, 2014 Mozilla member

Why replace('\\', '\\\\') and not replace('\', '\\')?

@adngdb
adngdb Jul 9, 2014 Mozilla member

It somehow makes sense in my head but I think it deserves a comment.

@rhelmer
rhelmer Jul 9, 2014 Mozilla member

\ is the escape sequence for Python:
https://docs.python.org/2/reference/lexical_analysis.html#string-literals

For example, it's used to represent things like linefeed (\n), but it can also be used to escape characters like ' and ":

>>> s = '"foo"=\'bar\''

So to encode a single \ you must use \\, otherwise you're escaping the ':

>>> 'foo'.replace('\', '\\')
  File "<stdin>", line 1
    'foo'.replace('\', '\\')
                           ^
SyntaxError: unexpected character after line continuation character
@rhelmer
rhelmer Jul 9, 2014 Mozilla member

The reason the SyntaxError is complaining about the continuation character is because if you follow the first \ as escaping the ', you can see the above is equivalent to writing:

'foo'.replace("', "\\')

Which is of course totally not what is meant at all! In fact since it's not part of a string, now the first \ encountered is being treated as the line continuation character.

@adngdb
adngdb Jul 9, 2014 Mozilla member

Hah, I am stupid! That totally makes sense! :)

@adngdb
Member
adngdb commented Jul 9, 2014

r+ but I'd like to have an explanation around the confusing replace().

@adngdb adngdb merged commit 699824b into mozilla:master Jul 9, 2014

1 check passed

Details default Jenkins build 'socorro-github' #3625 has succeeded
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment