
Korean source CSV files have an inconsistent number of columns #851

Closed
NelsonMinar opened this issue Jan 10, 2015 · 5 comments

Comments

@NelsonMinar (Contributor)

The kr-seoul-* files name CSV sources stored in our own S3 bucket, at URLs like http://s3.amazonaws.com/data.openaddresses.io/cache/kr-seoul-songpagu.zip

The CSV files stored there are not well-formed. They mostly have 12 columns per row, but occasionally a row has 13 columns. Is this fixable in the data source?

An example source with this problem is kr-seoul-chongnogu. Here are five rows from it; the middle row has an extra column. Note that the source encoding is EUC-KR.

서울특별시,종로구,명륜2가,4,0,창경궁로,265,0,아남아파트,102동,199967.6616,454036.4788
서울특별시,종로구,명륜2가,4,0,창경궁로,265,0,아남아파트,103동,199970.6174,453969.5493
서울특별시,종로구,명륜2가,4,0,창경궁로,265,0,아남아파트,경로당,관리사무실,199985.4176,453957.7999
서울특별시,종로구,효제동,98,0,대학로,28,0,,D동,200272.9246,452712.8438
서울특별시,종로구,효제동,98,0,대학로,28,0,,,200237.5673,452699.3256

Here's a Python script to help identify problem CSV files. A quick shell snippet to find rows with 13 or more columns (the pattern matches any line containing at least twelve commas):

iconv -f EUCKR -t utf-8 < /tmp/korea/kr-seoul-chongnogu.csv | egrep '(.*,){12}'
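The Python script itself isn't reproduced in this copy of the thread; a minimal sketch of such a checker, assuming the twelve-column layout described later in the discussion, could look like this:

```python
import csv
import io

def find_bad_rows(lines, expected=12):
    """Yield (line_number, field_count) for rows that don't have the
    expected number of fields. Decode the file from EUC-KR first,
    e.g. open(path, encoding="euc-kr", newline="")."""
    for lineno, row in enumerate(csv.reader(lines), start=1):
        if row and len(row) != expected:
            yield lineno, len(row)

# Two rows from kr-seoul-chongnogu; the second has a 13th column.
sample = io.StringIO(
    "서울특별시,종로구,명륜2가,4,0,창경궁로,265,0,아남아파트,102동,199967.6616,454036.4788\n"
    "서울특별시,종로구,명륜2가,4,0,창경궁로,265,0,아남아파트,경로당,관리사무실,199985.4176,453957.7999\n"
)
print(list(find_bad_rows(sample)))  # [(2, 13)]
```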
@NelsonMinar (Contributor, Author)

It looks like it's the third-to-last column that's suspicious. That's labelled "상세건물명" or "DET_BD_NM", which apparently is "Details Building Name". A good value for that is 102동, which means "102 East". A bad value for that is "경로당,관리사무실", which Google Translate tells me means "Per path, administration office". I wonder if the source data just isn't quoted properly, and the comma is intended to be part of the address rather than a field separator?

I'm just going off of Google Translate; it'd be nice to have someone with Korean language skills look at this.

@sbma44 (Contributor)

sbma44 commented Jan 10, 2015

I did some of the research for this data in a private repo (stupidly), but here are my notes. After reviewing them I feel pretty certain that the source data is untouched by my scripts -- if there are varying numbers of columns, it's probably in the source data. I strongly suspect that this is the result of someone rolling their own CSV creation script and failing to escape commas.

@NelsonMinar can you ballpark how many rows are affected? We might be able to fix them manually or just throw them out.

Alternately, we do have a Korean speaker that we found and retained to make some phone calls to MLIT. I bet he'd be willing to look at this if we can isolate a good example of good rows versus bad rows (and perhaps test the above hypothesis about unescaped commas).


Korean Open Address Data

The South Korean open data site is just horrible. The following files were collected by hand, with lots of clicking and filling out of fake filter criteria that match all records. It is probably possible to write a scraper but I do not envy the poor souls who try.

License

There is a use license (in Korean) associated with data.seoul.go.kr, which has to be accepted prior to downloading each dataset. The data itself is licensed under CC-BY, as these UI elements of the dataset pages indicate.

Clicking the highlighted blue box on a dataset page (screenshot omitted) reveals the license details, e.g.:
http://data.seoul.go.kr/openinf/sheetview.jsp?infId=OA-1084 (screenshot omitted)

Transformations

Original downloaded filenames are preserved in the corresponding CSV. I was unable to transliterate them properly, so they are retained for historical interest.

CSVs are in EUC-KR encoding, easily managed with iconv.
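If you'd rather stay in Python than shell out to iconv, the standard library's "euc-kr" codec handles these files too; a small round-trip check using text from the sample rows above:

```python
# Python's built-in "euc-kr" codec round-trips the Korean text in
# these files, so iconv isn't strictly required.
sample = "서울특별시,종로구,명륜2가"
raw = sample.encode("euc-kr")      # bytes as stored in the source CSVs
assert raw.decode("euc-kr") == sample

# Reading a whole file would look like (hypothetical filename):
# with open("kr-seoul-chongnogu.csv", encoding="euc-kr") as f:
#     for line in f:
#         ...
```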

Old/New Addresses

Korea recently switched address systems. This data appears to contain fields for both styles of addresses within each record. NOTE: Korea did switch addresses, but on a national level from Tokyo-style to Western-style addressing. juso.go.kr is devoted to this transition. The old and new columns in this dataset seem to be specific to Seoul, and I'm still unsure of their meaning. IIRC, naver.com seems to geocode against the old-style numbers, making me think they're the ones in most common use.

Columns

1. attempt names (?)
2. sigungu name - same for all rows
3. eup/myeon/dong - town/township/neighborhood
4. old address system building bonbeon (street number)
5. old address system building bubeon (door)
6. road name
7. new address system building bonbeon (street number)
8. new address system building bubeon (door)
9. building name
10. building name contd
11. longitude
12. latitude

cf http://www.isotc211.org/Workshop_Busan/Presentations/Piotrowski.pdf & annotated_fieldnames.png in this repo.
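Under that reading, a row maps cleanly to named fields. A sketch using hypothetical English field names (the names are my own, not from the source):

```python
import csv
import io

# Hypothetical English names for the twelve columns listed above.
FIELDS = [
    "sido", "sigungu", "dong",
    "old_bonbeon", "old_bubeon",
    "road_name", "new_bonbeon", "new_bubeon",
    "building_name", "building_name_detail",
    "x", "y",
]

sample = (
    "서울특별시,종로구,명륜2가,4,0,창경궁로,265,0,"
    "아남아파트,102동,199967.6616,454036.4788\n"
)
row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(FIELDS, row))
print(record["road_name"], record["new_bonbeon"])  # 창경궁로 265
```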

Source URL list

@NelsonMinar (Contributor, Author)

Thanks for the details! It's a very small number of rows, about 1 in 1000 for kr-seoul-chongnogu.csv and kr-seoul-songpagu.csv. A bit more looking suggests it's not always the third-from-last column. None of the source CSVs contain quoted text, so I bet a lack of quoting is to blame.

I hate to volunteer for the extra work, but I think rather than trying to hand-patch a few sources we should make the conform parsing code more tolerant of malformed input. It's simple enough conceptually to just skip unparseable rows, although it's a PITA to implement with Python's CSV module.
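A sketch of that skip-bad-rows approach in Python 3 (the 12-column expectation comes from this source; the csv.Error branch covers rows the reader itself chokes on):

```python
import csv
import io

def tolerant_rows(lines, expected=12):
    """Yield rows with the expected field count; silently drop the rest."""
    reader = csv.reader(lines)
    while True:
        try:
            row = next(reader)
        except StopIteration:
            return
        except csv.Error:
            continue  # row the csv module can't parse at all
        if len(row) == expected:
            yield row

data = io.StringIO(
    ",".join(str(i) for i in range(12)) + "\n"    # well-formed, 12 columns
    + ",".join(str(i) for i in range(13)) + "\n"  # extra column: dropped
)
print(len(list(tolerant_rows(data))))  # 1
```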

@sbma44 (Contributor)

sbma44 commented Jan 10, 2015

Yeah, agreed. Tossing broken rows is the way to go. I would probably just do something hacky like

try:
    row = next(csv_reader)
except csv.Error:  # malformed row -- skip it and keep going
    continue

(a bare except here would also swallow StopIteration and loop forever at end of file) but I haven't done this in Python 3.

@NelsonMinar (Contributor, Author)

OK I moved this to a machine code issue.
