Added Slovenia countrywide addresses #2980

Merged
merged 6 commits into from
May 8, 2017


stefanb
Contributor

@stefanb stefanb commented May 8, 2017

The most current data is available at:
http://raba.openstreetmap.si/openaddresses/si-addresses-2017-05-07.zip (CC-BY Geodetska uprava Republike Slovenije, adapted for openaddresses.io)

fixes #2926

@stefanb
Contributor Author

stefanb commented May 8, 2017

Can someone please help me with the error reported at
https://s3.amazonaws.com/data.openaddresses.io/runs/187276/output.txt

it says

2017-05-08 17:22:11,902    DEBUG: URL says ".zip" for http://raba.openstreetmap.si/openaddresses/si-addresses-2017-05-07.zip
2017-05-08 17:22:11,902    DEBUG: Guessed si--countrywide-5211a063.zip for http://raba.openstreetmap.si/openaddresses/si-addresses-2017-05-07.zip
2017-05-08 17:22:11,902    DEBUG: Requesting http://raba.openstreetmap.si/openaddresses/si-addresses-2017-05-07.zip with args None
2017-05-08 17:22:14,473     INFO: Downloaded 13520268 bytes for file /tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/cache-2517bnx2/http/si--countrywide-5211a063.zip
2017-05-08 17:22:14,611     INFO: Cached data in file:///tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/cached/si--countrywide-5211a063.zip
2017-05-08 17:22:14,643    DEBUG: URL says ".zip" for file:///tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/cached/si--countrywide-5211a063.zip
2017-05-08 17:22:14,643    DEBUG: Guessed si--countrywide-057e432b.zip for file:///tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/cached/si--countrywide-5211a063.zip
2017-05-08 17:22:14,675    DEBUG: File exists /tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/conform-g6uk1tng/http/si--countrywide-057e432b.zip
2017-05-08 17:22:14,675     INFO: Downloaded to ['/tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/conform-g6uk1tng/http/si--countrywide-057e432b.zip']
2017-05-08 17:22:15,557    DEBUG: Expanded file /tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/conform-g6uk1tng/unzipped/si-addresses-2017-05-07.csv
2017-05-08 17:22:15,558     INFO: Decompressed to 1 files
2017-05-08 17:22:15,558     INFO: Sampled 6 records
2017-05-08 17:22:15,559    DEBUG: Converting to /tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/conform-g6uk1tng
2017-05-08 17:22:15,568    DEBUG: extract temp file /tmp/openaddr-extracted-g3cut_ec.csv
2017-05-08 17:22:15,568     INFO: Converting source CSV /tmp/worker-2r2q318w/work-uqfd8yau/out/process_one-aeiv08jb/conform-g6uk1tng/unzipped/si-addresses-2017-05-07.csv
2017-05-08 17:22:15,639  WARNING: Error doing conform; skipping
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/openaddr/__init__.py", line 180, in conform
    csv_path, addr_count = task4.convert(data, decompressed_paths, workdir)
  File "/usr/local/lib/python3.4/dist-packages/openaddr/conform.py", line 552, in convert
    rc = conform_cli(source_definition, source_path, dest_path)
  File "/usr/local/lib/python3.4/dist-packages/openaddr/conform.py", line 1216, in conform_cli
    extract_to_source_csv(source_definition, source_path, extract_path)
  File "/usr/local/lib/python3.4/dist-packages/openaddr/conform.py", line 1165, in extract_to_source_csv
    csv_source_to_csv(source_definition, source_path, extract_path)
  File "/usr/local/lib/python3.4/dist-packages/openaddr/conform.py", line 767, in csv_source_to_csv
    for source_row in reader:
  File "/usr/lib/python3.4/csv.py", line 110, in __next__
    row = next(self.reader)
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 3609: invalid continuation byte
2017-05-08 17:22:15,676  WARNING: Nothing processed

Is position 3609 decimal or hex?

If I interpret position 3609 as decimal, it is 0xe19 in hex:

00000e00  31 33 2e 38 33 31 36 33  32 37 34 31 38 36 33 39  |13.8316327418639|
00000e10  33 33 3b 34 35 2e 38 39  39 32 39 34 36 39 38 32  |33;45.8992946982|
00000e20  32 36 34 37 3b 35 38 3b  4b 61 6d 6e 6a 65 3b 4b  |2647;58;Kamnje;K|
00000e30  61 6d 6e 6a 65 3b 41 6a  64 6f 76 c5 a1 c4 8d 69  |amnje;Ajdov....i|

There is no 0xc8 byte at that offset, only 0x32 (the digit "2").

Nor is there such a byte at hex position 0x3609:

000035f0  31 38 3b 38 31 42 3b 53  65 6c 6f 3b 53 65 6c 6f  |18;81B;Selo;Selo|
00003600  3b 41 6a 64 6f 76 c5 a1  c4 8d 69 6e 61 3b 47 6f  |;Ajdov....ina;Go|
00003610  72 69 c5 a1 6b 61 3b 35  32 36 32 0d 0a 31 33 2e  |ri..ka;5262..13.|

There is 0x8d at that position, the second byte of the letter "č", which UTF-8 encodes as 0xc4 0x8d:
http://www.fileformat.info/info/unicode/char/010d/index.htm
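
A minimal Python sketch for locating the offending byte is below. It decodes a whole byte string in one call, so the offset it reports is absolute and zero-based. The position in the traceback above may instead be relative to the chunk the incremental decoder was processing at the time, which would explain why neither file offset matches. The function name `first_invalid_utf8` and the sample bytes are illustrative and not part of openaddr.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding.

    Returns None when the whole input is valid UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the zero-based index of the first undecodable byte.
        return exc.start, data[exc.start]
    return None


# Illustrative sample: a house number with a single-byte 0xc8, as in the error.
sample = b"77\xc8;Lokavec"
hit = first_invalid_utf8(sample)
if hit is not None:
    offset, value = hit
    print("first invalid byte 0x%02x at offset %d (0x%x)" % (value, offset, offset))
```

To check the real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.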

@sergiyprotsiv
Contributor

@stefanb the scripts look great! The issue seems to be that the CSV in the zip file is not valid UTF-8 (my text editor complains about it). All the cases I was able to spot concern house numbers (e.g. lines 440, 2122, 2129). I'm not sure, but house numbers that include "Č" seem the most likely cause, since all the addresses preceding the offending one have "C" as the letter.

@albarrentine
Contributor

Hey @stefanb, it looks like the problem lines have two different encodings.

13.87509424154621;45.897741779820926;77\xc8;Lokavec;Lokavec;Ajdov\xc5\xa1\xc4\x8dina;Gori\xc5\xa1ka;5270\r\n

The house number field "77\xc8" is not UTF-8 (ISO-8859-2 will decode that to "77Č", if that's what was intended), but the rest of the line can be decoded as UTF-8. OpenAddresses only allows one encoding per file, so you might need to convert the house number to UTF-8 in the preprocessing script with e.g. .decode('ISO-8859-2') followed by .encode('UTF-8').
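
The suggestion above can be sketched in Python as a per-field fallback: try UTF-8 first, and only decode with ISO-8859-2 when that fails. This is an illustration rather than the actual preprocessing script. The helper names `decode_field` and `repair_line` are hypothetical, and the `;` delimiter is taken from the sample line.

```python
def decode_field(raw: bytes, fallback: str = "iso-8859-2") -> str:
    """Decode one field as UTF-8, falling back to a legacy single-byte encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def repair_line(raw_line: bytes, delimiter: bytes = b";") -> bytes:
    """Re-encode every field of a delimited line as UTF-8."""
    fields = raw_line.split(delimiter)
    return delimiter.join(decode_field(f).encode("utf-8") for f in fields)


# The problem line quoted above: only the house number is not UTF-8.
line = (b"13.87509424154621;45.897741779820926;77\xc8;Lokavec;Lokavec;"
        b"Ajdov\xc5\xa1\xc4\x8dina;Gori\xc5\xa1ka;5270\r\n")
repaired = repair_line(line)
print(repaired.decode("utf-8"), end="")
```

Splitting on the raw `;` byte is safe here because 0x3b never occurs inside a multi-byte UTF-8 sequence. One caveat: a legacy-encoded field that happens to be valid UTF-8 would pass through unchanged, so converting the field at the source is the more reliable fix.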

@stefanb
Contributor Author

stefanb commented May 8, 2017

I already use iconv to convert the source's Windows-1250 encoding to UTF-8. But yes, I did not anticipate non-ASCII characters in house numbers, so I didn't apply the conversion there as I did for all the other text fields.
I will add that to scripts/si/makeCSVs.sh and rerun it.

@openaddresses-bot
Contributor

Preview

More: https://results.openaddresses.io/jobs/7ead03ad-71b3-4d4c-a6a8-cab2496cbfb5

@migurski
Member

migurski commented May 8, 2017

I love it when whole new countries roll in. Thank you @stefanb!

@albarrentine
Contributor

Looks great, thanks @stefanb!

@albarrentine albarrentine merged commit ea80fa7 into openaddresses:master May 8, 2017
Successfully merging this pull request may close these issues.

Slovenia Addresses
5 participants