Korean source CSV files have an inconsistent number of columns #851
It looks like it's the third-to-last column that's suspicious. That's labelled "상세건물명" or "DET_BD_NM", which apparently means "detailed building name". A good value for that is 102동, which means "102 East". A bad value for that is "경로당,관리사무실", which Google Translate tells me means "Per path, administration office". I wonder if the source data just isn't quoted properly, and the comma is intended to be part of the address rather than a field separator? I'm just going off of Google Translate; it'd be nice to have someone with Korean language skills look at this.
I did some of the research for this data in a private repo (stupidly), but here are my notes. After reviewing them I feel pretty certain that the source data is untouched by my scripts -- if there are varying numbers of columns, it's probably in the source data. I strongly suspect that this is the result of someone rolling their own CSV creation script and failing to escape commas. @NelsonMinar can you ballpark how many rows are affected? We might be able to fix them manually or just throw them out. Alternately, we do have a Korean speaker that we found and retained to make some phone calls to MLIT. I bet he'd be willing to look at this if we can isolate a good example of good rows versus bad rows (and perhaps test the above hypothesis about unescaped commas).

Korean Open Address Data

The South Korean open data site is just horrible. The following files were collected by hand, with lots of clicking and filling out of fake filter criteria that match all records. It is probably possible to write a scraper, but I do not envy the poor souls who try.

License

There is a use license (in Korean) associated with data.seoul.go.kr, which has to be accepted prior to downloading each dataset. The data itself is licensed under CC-BY, as these UI elements of the dataset pages indicate. Clicking the highlighted blue box reveals the CC-BY notice (screenshots not reproduced here).

Transformations

Original downloaded filenames are preserved in the corresponding CSV. I was unable to transliterate them properly, so they are retained for historical interest. CSVs are in EUC-KR encoding, easily managed with

Old/New Addresses

Columns

cf http://www.isotc211.org/Workshop_Busan/Presentations/Piotrowski.pdf & Source URL list
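Since the CSVs are in EUC-KR, a minimal sketch of reading one from Python (the function name is mine; Python's csv module works over any text stream, so the encoding only matters at open time):

```python
import csv

def read_euckr_rows(path):
    """Read all rows from a CSV file stored in EUC-KR encoding."""
    # newline="" lets the csv module do its own newline handling;
    # the encoding argument handles the EUC-KR decoding.
    with open(path, encoding="euc-kr", newline="") as f:
        return list(csv.reader(f))
```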
Thanks for the details! It's a very small number of rows, about 1 in 1000 for kr-seoul-chongnogu.csv and kr-seoul-songpagu.csv. A bit more looking suggests it's not always the third column from the end. None of the source CSVs contain quoted text, so I bet a lack of quoting is to blame. I hate to volunteer for the extra work, but rather than trying to hand-patch a few sources I think we should make the conform parsing code more tolerant of malformed input. It's simple enough conceptually to just skip unparseable rows, although it's a PITA to implement with Python's csv module.
Yeah, agreed. Tossing broken rows is the way to go. I would probably just do something hacky like
but I haven't done this in Python 3.
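A sketch of that kind of hacky filter in Python 3, assuming the expected width is known up front (the Seoul files normally have 12 columns; the function name and defaults are mine):

```python
import csv

def good_rows(path, expected=12, encoding="euc-kr"):
    """Yield only rows with the expected column count,
    silently tossing malformed ones."""
    with open(path, encoding=encoding, newline="") as f:
        for row in csv.reader(f):
            if len(row) == expected:
                yield row
```

Skipping rather than repairing loses the bad rows' addresses, but at ~1 in 1000 rows that seems like an acceptable trade.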
OK, I moved this to a machine code issue.
The `kr-seoul-*` files name CSV sources stored in our own S3 bucket, at URLs like http://s3.amazonaws.com/data.openaddresses.io/cache/kr-seoul-songpagu.zip. The CSV files stored there are not well-formed: they mostly have 12 columns per row, but occasionally a row has 13 columns. Is this fixable in the data source?
An example source with this problem is kr-seoul-chongnogu. Here are 5 rows from it; the middle row has an extra column. Note that the source encoding is EUC-KR.
Here's a Python script to help identify problem CSV files, and a quick shell snippet to find 13-column rows:
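A minimal sketch of such a checker (names are mine): it reports the line number and width of every row whose column count differs from the first row's.

```python
import csv

def odd_width_rows(path, encoding="euc-kr"):
    """Return (line_number, width) pairs for every row whose
    column count differs from the first row's."""
    bad = []
    with open(path, encoding=encoding, newline="") as f:
        expected = None
        for i, row in enumerate(csv.reader(f), start=1):
            if expected is None:
                # Use the first row as the baseline width.
                expected = len(row)
            elif len(row) != expected:
                bad.append((i, len(row)))
    return bad
```

A 13-column row in a 12-column file shows up as `(line, 13)`; an empty result means the file is consistent.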