Timezone problem #31

Closed
alexprengere opened this issue May 26, 2014 · 16 comments

Comments

@alexprengere
Member

In ori_por_public.csv, some timezones are not valid timezone names:

  • America/Brazil
  • America/Canada
  • America/Chili
  • America/Greenland
  • America/USA
  • Asia/China
  • Asia/Indonesia
  • Asia/Malaysia
  • Asia/Ventiane
  • Atlantic/Portugal
  • Australia/Unknown
  • Europe/Russia
  • Europe/Spain
  • Europe/Ukraine
  • Pacific/Marshall_Islands
  • Pacific/New_Zealand
  • Pacific/US_Minor_Islands

A good way to detect and exclude invalid timezones is to use the pytz Python package:

>>> import pytz
>>> pytz.all_timezones
['Africa/Abidjan', 'Africa/Accra', ...
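
A minimal sketch of such a check, assuming the timezone names have already been extracted from ori_por_public.csv. The helper names invalid_timezones and pytz_names are hypothetical (they are not part of pytz or GeoBases), and the sample data is hand-written for illustration:

```python
def pytz_names():
    """Return the set of timezone names known to pytz (imported lazily)."""
    import pytz  # third-party package: pip install pytz
    return set(pytz.all_timezones)


def invalid_timezones(names, valid_names):
    """Return the sorted, de-duplicated names that are not in valid_names.

    Empty strings are reported too, since a missing timezone is also a problem.
    """
    return sorted({name for name in names if name not in valid_names})


# Tiny hand-written reference set for the example; in practice one would
# pass pytz_names() as the second argument.
sample_valid = {"Europe/Madrid", "America/Chicago", "Asia/Vientiane"}
sample_names = ["Europe/Madrid", "America/USA", "Asia/Ventiane", "America/USA"]
print(invalid_timezones(sample_names, sample_valid))
```

Passing the reference set as a parameter keeps the check independent of where the valid names come from.
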
@alexprengere
Member Author

Any update on this issue? Many people using GeoBases are reporting this ;)

@wsteitz
Member

wsteitz commented Jan 26, 2015

From looking at the script that generates ori_por_public.csv:
Timezone information is taken from ori_tz_light.csv, and we rely on a mapping country_code => timezone. So (1) we are using timezones that do not exist and (2) we certainly have errors whenever a country has more than one timezone.

In my opinion, we should forget about the ori_tz_light file and get the timezone using the geolocation information. There are several ways to do this: http://stackoverflow.com/questions/16086962/how-to-get-a-time-zone-from-a-location-using-latitude-and-longitude-coordinates

@alexprengere
Member Author

So I partially solved the problem, using ideas from your link.

I realized that GeoBases contains the GeoNames cities (with data='cities'), so there is no need to use the web API. I just wrote a (small!) python script that does:

  • iterate over all points of reference with an invalid timezone
  • use its geocode to find the closest city match in GeoNames using GeoBases
  • use the timezone of the GeoNames match, and dump the match in a csv file
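
The steps above can be sketched as follows. This is not the code of the gist: in the real script GeoBases performs the lookup, whereas here a plain linear scan with the haversine formula stands in for it. The names haversine_km, closest_city and CITIES are hypothetical, and the coordinates are approximate values chosen for illustration:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_city(lat, lng, cities):
    """Return (city, distance_km) for the city nearest to (lat, lng)."""
    return min(
        ((city, haversine_km(lat, lng, city["lat"], city["lng"])) for city in cities),
        key=lambda pair: pair[1],
    )


# Illustrative stand-in for the GeoNames cities (approximate coordinates).
CITIES = [
    {"name": "Madrid", "lat": 40.42, "lng": -3.70, "timezone": "Europe/Madrid"},
    {"name": "Chicago", "lat": 41.88, "lng": -87.63, "timezone": "America/Chicago"},
    {"name": "Paris", "lat": 48.86, "lng": 2.35, "timezone": "Europe/Paris"},
]

# A hypothetical POR with a broken timezone, located close to Madrid.
city, dist = closest_city(40.47, -3.56, CITIES)
print(city["timezone"], round(dist, 2))
```

Returning the distance together with the match is what makes a later clean-up possible: a large distance hints at a wrong geocode rather than a good match.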

The script and its result are available at this gist. To reproduce the result, you need GeoBasesDev installed (pip install GeoBasesDev), and pytz (not in the standard library). Then just run:

git clone https://gist.github.com/7a8a35c691fad5f170d8.git
cd 7a8a35c691fad5f170d8
python fix_tz.py # writes in tz_fixes.csv

The format of the output is:

$ head tz_fixes.csv
NDP,America/USA,America/Chicago,2.57
SPX,America/USA,America/Chicago,3.91
QPF,America/Brazil,Indian/Antananarivo,214.30
SPO,Europe/Spain,Europe/Madrid,7.67
NOH,America/USA,America/Chicago,1.77
HLR,America/USA,America/Chicago,5.83

4 fields:

  • por
  • old broken timezone
  • new pytz-compliant timezone
  • distance of the match (could be used for a clean-up)

My recommendation would be to integrate the new timezones as is in ori_por_public.csv, since this only affects already broken timezones. Some matches may be wrong, but I think the problem may come from an invalid geocode for the por in the first place, and this could be fixed manually later. For example, these should be looked at:

$ cat tz_fixes.csv | awk -F',' '{ if ($4 > 500) {print $0}}'
BAR,America/USA,America/Anchorage,1234.26
QDO,America/Brazil,Africa/Mogadishu,508.17
QDP,America/Brazil,Indian/Antananarivo,1004.01
DCK,America/USA,America/Anchorage,716.94
QLB,America/Brazil,Indian/Antananarivo,695.53
QPU,America/Brazil,Indian/Antananarivo,503.76
QIB,America/Brazil,Indian/Antananarivo,725.24
JON,Pacific/US_Minor_Islands,Pacific/Honolulu,1304.40
QGR,America/Greenland,America/Godthab,598.84
JON,Pacific/US_Minor_Islands,Pacific/Honolulu,1304.40
QGK,America/Brazil,Indian/Mauritius,1529.87
UQE,America/USA,America/Anchorage,543.02
ZQB,America/USA,America/Anchorage,1764.81
QMB,America/Brazil,Indian/Antananarivo,832.53
UNS,America/USA,America/Anchorage,1380.22
KPK,,America/Anchorage,541.99
JTI,America/Brazil,Atlantic/Stanley,827.49
MHL,America/USA,America/Anchorage,649.45
QEL,Australia/Unknown,Pacific/Rarotonga,1654.68
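
For those who prefer Python to awk, here is a sketch of the same filter. It assumes the four-field layout described earlier (por, old timezone, new timezone, distance in km); TzFix, read_fixes and suspicious are hypothetical names, and the sample rows are copied from the outputs above:

```python
import csv
import io
from collections import namedtuple

TzFix = namedtuple("TzFix", "por old_tz new_tz distance_km")


def read_fixes(lines):
    """Parse tz_fixes.csv rows (por, old tz, new tz, distance) into TzFix records."""
    return [
        TzFix(por, old_tz, new_tz, float(dist))
        for por, old_tz, new_tz, dist in csv.reader(lines)
    ]


def suspicious(fixes, threshold_km=500.0):
    """Keep the fixes whose GeoNames match is farther than threshold_km."""
    return [fix for fix in fixes if fix.distance_km > threshold_km]


# With the real file, pass open("tz_fixes.csv") instead of this sample.
SAMPLE = io.StringIO(
    "NDP,America/USA,America/Chicago,2.57\n"
    "BAR,America/USA,America/Anchorage,1234.26\n"
    "KPK,,America/Anchorage,541.99\n"
    "SPO,Europe/Spain,Europe/Madrid,7.67\n"
)

for fix in suspicious(read_fixes(SAMPLE)):
    print(fix.por, fix.new_tz, fix.distance_km)
```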

Another thing that worried me when developing the script: I ran it on pors with valid timezones (a very simple change in the script), and found that 1816 of them had a match in GeoNames with a different timezone. This could mean that these 1816 have wrong coordinates (I checked a few of them manually, like XAI, which seems to have a reversed longitude but the right timezone). Right now I see no way to fix this automatically.

@alexprengere
Member Author

For people wanting a quick fix of broken timezones, until this is properly merged into opentraveldata, here is how to do it with GeoBases. First download this file locally, then:

import os.path as op

from GeoBases import GeoBase

G = GeoBase(data='ori_por', verbose=False)

# tz_fixes.csv is expected next to this script
TZ_FILE = op.join(op.abspath(op.dirname(__file__)), 'tz_fixes.csv')

# Map each IATA code to its fixed timezone (columns: por, old tz, new tz, distance)
TZ_FIXES = {}
with open(TZ_FILE) as f:
    for row in f:
        iata, _, tz, _ = row.rstrip().split(',', 3)
        TZ_FIXES[iata] = tz

for por in G:
    iata = G.get(por, 'iata_code')
    if iata in TZ_FIXES:
        # Replace the broken timezone in G with the fixed one
        G.set(por, timezone=TZ_FIXES[iata])

@da115115
Member

Many thanks, Alex!

Please note that the optd project was deprecated a few months ago. The main project is now opentraveldata.
Practically, the only thing changing is that the prefix has been renamed from 'ori' to 'optd'. So, the POR (points of reference) data file is now optd_por_public.csv.

Nevertheless, I will try to incorporate your fix, one way or the other, into the data processing code.

@da115115
Member

Note that the execution of the fix_tz.py script fails, apparently with the MLC record:

Traceback (most recent call last):
  File "fix_tz.py", line 45, in <module>
    main()
  File "fix_tz.py", line 28, in main
    for p, p_tz, p_iata, p_city, p_geocode in pors_with_unk_tz(db_oripor):
  File "fix_tz.py", line 14, in pors_with_unk_tz
    p_city = db_oripor.get(p, 'city_name_list')[0]
  File "/home/build/.local/lib/python2.7/site-packages/GeoBases-4.23.0-py2.7.egg/GeoBases/GeoBaseModule.py", line 623, in get
    raise KeyError("Field '%s' [for key '%s'] not in %s" % (field, key, self._things[key].keys()))
KeyError: "Field 'city_name_list' [for key 'MLC'] not in ['comment', 'city_name_ascii', 'adm3_code', 'adm2_code', 'icao_code', 'adm2_name_ascii', '__gar__', 'alt_name_section', '__dup__', 'country_code', 'adm2_name_utf', 'timezone', 'lng', '__lno__', 'iata_code', 'gmt_offset', 'wiki_link', 'dst_offset', 'date_from', 'date_until', 'city_name_utf', 'raw_offset', 'cc2', 'fcode', 'is_geonames', 'adm1_name_utf', 'adm1_name_ascii', 'gtopo30', 'country_name', 'city_code', 'elevation', 'tvl_por_list@raw', 'tvl_por_list', '__par__', 'moddate', 'lat', 'state_code', 'location_type', '__key__', 'population', 'fclass', 'name', 'alt_name_section@raw', 'page_rank', 'geoname_id', 'adm4_code', 'faa_code', 'valid_id', 'continent_name', 'asciiname', 'adm1_code']"

@da115115
Member

@alexprengere, could you alter your script so that, from the optd_por_best_known_so_far.csv file, it generates the optd_por_tz.csv file? That file currently has only 426 records, which is enough to fix the current wrong time-zones but is not future-proof.

@da115115
Member

ab45c19 brings in the time-zones, as present in the optd_por_tz.csv file, in which the time-zones of a few POR have been fixed.
As @alexprengere suggests, a lot of coordinates are wrong, especially for POR whose IATA codes begin with Q. However, none of those POR appear in flight schedules, so they are not important; that can be checked through the absence of a PageRank in the ref_airport_pageranked.csv file.
By the way, the compare_por_files.sh script computes and maintains the optd_por_diff_w_geonames.csv file, which gives, for every POR, the distance (when greater than 10 km) between the coordinates stated in the optd_por_best_known_so_far.csv file and those in the Geonames dump.

@alexprengere
Member Author

First, the script is failing because this is not the development version (indeed, in the legacy version the city_name_list field does not exist). To ensure that no conflict appears between GeoBases installations, you should either use a virtualenv or uninstall the other versions.

pip uninstall GeoBases GeoBasesDev # repeat if necessary
pip install GeoBasesDev

Since the use cases are a bit different, I created another gist to generate optd_por_tz.csv. The idea is not to break existing valid timezones (for pors like XAI, which have a valid timezone but incorrect coordinates), so the script will not use GeoNames matches if the timezone is valid (though it might be incorrect nonetheless):

git clone https://gist.github.com/d4ed1527f4c89a697755.git
cd d4ed1527f4c89a697755
wget 'https://raw.githubusercontent.com/opentraveldata/opentraveldata/master/opentraveldata/optd_por_best_known_so_far.csv' 
python generate_optd_por_tz.py optd_por_best_known_so_far.csv > optd_por_tz.csv
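
The rule described above (keep a valid timezone, fall back on the GeoNames match otherwise) could be sketched like this. It is an illustration of the decision only, not the code of the gist; choose_timezone is a hypothetical name, and VALID is a small stand-in for the set of names that pytz knows:

```python
def choose_timezone(current_tz, matched_tz, valid_names):
    """Keep a valid current timezone; otherwise fall back on the GeoNames match.

    Returns None when neither candidate is a known timezone name.
    """
    if current_tz in valid_names:
        # Never override a valid value, even though it may still be incorrect.
        return current_tz
    if matched_tz in valid_names:
        # Broken or empty value: trust the timezone of the closest city.
        return matched_tz
    return None


# Stand-in for the reference set (for example set(pytz.all_timezones)).
VALID = {"America/Chicago", "Asia/Vientiane", "Asia/Shanghai"}

print(choose_timezone("America/USA", "America/Chicago", VALID))
print(choose_timezone("Asia/Shanghai", "Asia/Vientiane", VALID))
```

The second call shows the conservative side of the rule: a valid timezone is kept even when the closest city disagrees, because the disagreement may come from wrong coordinates.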

The output is enclosed in the gist here.

Note that with the latest version of optd_por_public.csv, I still got some invalid timezones (with the first gist):

python fix_tz.py                                                                                                
STF with tz "" matches tz "Pacific/Port_Moresby" (dist 63.6km, "Stephens Island" -> "Daru")
WDB with tz "America/USA" matches tz "America/Vancouver" (dist 228.8km, "Deep Bay" -> "Terrace")
RNU with tz "Asia/Malaysia" matches tz "Asia/Kuching" (dist 0.0km, "Ranau MY" -> "Ranau")
JUC with tz "America/USA" matches tz "America/Los_Angeles" (dist 0.5km, "Los Angeles" -> "Silver Lake")
WLN with tz "America/USA" matches tz "America/Juneau" (dist 280.3km, "Little Naukati AK US" -> "Juneau")
JSN with tz "America/USA" matches tz "America/Los_Angeles" (dist 1.6km, "Los Angeles" -> "Echo Park")
HKP with tz "America/USA" matches tz "Pacific/Honolulu" (dist 20.1km, "Kaanapali Maui" -> "Wailuku")
JON with tz "Pacific/US_Minor_Islands" matches tz "Pacific/Honolulu" (dist 1304.4km, "Johnston Island" -> "Makakilo City")
PII with tz "America/USA" matches tz "America/Anchorage" (dist 8.9km, "Fairbanks" -> "Fairbanks")
XHG with tz "America/Canada" matches tz "America/Toronto" (dist 0.4km, "Ottawa" -> "Ottawa")
JON with tz "Pacific/US_Minor_Islands" matches tz "Pacific/Honolulu" (dist 1304.4km, "Johnston Island" -> "Makakilo City")
NKV with tz "America/USA" matches tz "America/Juneau" (dist 282.4km, "Nichen Cove" -> "Juneau")
KBK with tz "" matches tz "America/Juneau" (dist 172.6km, "Klag Bay" -> "Juneau")
PKS with tz "Asia/Ventiane" matches tz "Asia/Vientiane" (dist 4.6km, "Paksane" -> "Muang Pakxan")
MNP with tz "" matches tz "Pacific/Port_Moresby" (dist 10.8km, "None" -> "Port Moresby")
UNS with tz "America/USA" matches tz "America/Anchorage" (dist 1380.2km, "Umnak Island" -> "Anchorage")
KPK with tz "" matches tz "America/Anchorage" (dist 542.0km, "Parks Spb" -> "Anchorage")
IAT with tz "" matches tz "America/Los_Angeles" (dist 160.0km, "None" -> "Lompoc")
LAC with tz "" matches tz "Asia/Kuching" (dist 279.5km, "Swallow Reef Airstrip" -> "Victoria")
CBA with tz "America/USA" matches tz "America/Juneau" (dist 79.4km, "Corner Bay" -> "Juneau")
EFO with tz "America/USA" matches tz "America/Chicago" (dist 20.5km, "East Fork" -> "Fort Dodge")
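
To post-process messages like the ones above (for instance to sort them by distance), a regular expression is enough. This sketch assumes the message layout stays exactly as printed here; LOG_LINE and parse_log_line are hypothetical names, not part of fix_tz.py:

```python
import re

LOG_LINE = re.compile(
    r'^(?P<por>\S+) with tz "(?P<old_tz>[^"]*)" matches tz "(?P<new_tz>[^"]*)" '
    r'\(dist (?P<dist_km>[\d.]+)km, "(?P<por_name>[^"]*)" -> "(?P<city>[^"]*)"\)$'
)


def parse_log_line(line):
    """Return a dict of the fields of one message, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["dist_km"] = float(fields["dist_km"])
    return fields


line = 'PKS with tz "Asia/Ventiane" matches tz "Asia/Vientiane" (dist 4.6km, "Paksane" -> "Muang Pakxan")'
record = parse_log_line(line)
print(record["por"], record["old_tz"], record["new_tz"], record["dist_km"])
```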

@da115115
Member

Thanks!

Note that I still have the same issue (with the 'MLC' key) with the first gist, and that I checked that GeoBasesDev was the only installed GeoBases version (with the procedure you give, i.e., uninstall any GeoBases instances and re-install GeoBasesDev).

For the second gist, I have another error:

BPN^Asia/Makassar
BPN^Asia/Makassar
BPO^Asia/Chongqing
BPO^Asia/Chongqing
Traceback (most recent call last):
  File "generate_optd_por_tz.py", line 38, in <module>
    main(sys.argv[1])
  File "generate_optd_por_tz.py", line 24, in main
    tz = db_oripor.get(iata, 'timezone')
  File "/home/build/.local/lib/python2.7/site-packages/GeoBases-4.23.0-py2.7.egg/GeoBases/GeoBaseModule.py", line 614, in get
    raise KeyError("Thing not found: %s" % str(key))
KeyError: 'Thing not found: BPR'

Nevertheless, I went through each of the POR you mentioned above (e.g., STF, WDB, ..., LAC, CBA, EFO) and fixed the corresponding time-zones:

  • For no longer valid POR, just added, or fixed, the time-zone.
  • For still valid POR, most of the time, it was because the geographical coordinates located those POR in the middle of nowhere. So, the fix consisted in fixing the coordinates.

Hence, the fixes will appear in OpenTravelData only once the Geonames database dump has been generated and integrated with OpenTravelData, i.e., not before a few days. Hopefully, by next week (beginning of June 2015), it should be fine.

@alexprengere
Member Author

I am sorry Denis, but you are still not using the development version. First because I cannot reproduce the error, and second because the traceback is betraying you ;)

Traceback (most recent call last):
...
    File "/home/build/.local/lib/python2.7/site-packages/GeoBases-4.23.0-py2.7.egg ...

The GeoBases-4.23.0-py2.7.egg is old, and is not even the current stable version (5.*), nor the development version (6.*). I do not understand why pip is not uninstalling that, and I also do not understand why you have stuff installed in /home/build, unless build is actually a user name. My recommendation would be to either manually delete those packages or create a virtualenv (you should probably do both to avoid versions shadowing each other in the future).

Here is another list of points where stuff may go wrong:

  • a data cache is created in ~/.GeoBases.d, so to avoid problems during version upgrades, it is better to clean it
  • depending on how pip works, you might need to explicitly add the --pre flag to allow development versions to be installed (or explicitly pin the required version, like pip install GeoBasesDev==6.0.0a26)
  • python may be aliased to a hardcoded path, which makes virtualenv fail (because virtualenv changes the path to python on-the-fly). The workaround is to call /usr/bin/env python, which resolves to the "right" path to python
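
A quick way to see which installation would actually be imported is to ask the interpreter where the module comes from. This is a small sketch assuming Python 3.4 or later (it relies on importlib.util.find_spec from the standard library); module_origin is a hypothetical helper name:

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# With GeoBases installed, a path containing "GeoBases-4.23.0-py2.7.egg" would
# reveal that an old egg still shadows the development version.
print(module_origin("GeoBases"))
```

On Python 2.7, which lacks find_spec, importing the package and printing GeoBases.__file__ gives the same information.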

I just uploaded a new version of GeoBases on PyPI with the latest data (GeoBasesDev-6.0.0a26), so we should only have up-to-date results with the following set of commands.

Here is the complete set of commands for clean usage of the gists. If anything is not clear tell me.

# Manual deletion of obsolete packages
rm -rf /home/build/.local/lib/python2.7/site-packages/GeoBases*

# Virtualenv usage, cache cleaning
rm -rf ~/.GeoBases.d 
rm -rf 7a8a35c691fad5f170d8
git clone https://gist.github.com/7a8a35c691fad5f170d8.git 
cd 7a8a35c691fad5f170d8
virtualenv --no-site-packages --clear .venv
source .venv/bin/activate
pip install --pre GeoBasesDev
pip install pytz
/usr/bin/env python fix_tz.py 

In the messages you should see Downloading GeoBasesDev-6.0.0a26.tar.gz. The output should be exactly:

STF with tz "" matches tz "Pacific/Port_Moresby" (dist 60.5km, "Stephens Island QL AU" -> "Daru")
PKS with tz "Asia/Ventiane" matches tz "Asia/Vientiane" (dist 4.6km, "Paksane" -> "Muang Pakxan")
MNP with tz "" matches tz "Pacific/Port_Moresby" (dist 272.6km, "Maron Island PG" -> "Wewak")
LAC with tz "" matches tz "Asia/Kuching" (dist 279.5km, "Layang Layang Island MY" -> "Victoria")

The error for the second gist will probably be fixed with the same technique:

# First, forget about the previous virtualenv
deactivate
rm -rf d4ed1527f4c89a697755
rm -rf ~/.GeoBases.d 
git clone https://gist.github.com/d4ed1527f4c89a697755.git
cd d4ed1527f4c89a697755
virtualenv --no-site-packages --clear .venv
source .venv/bin/activate
pip install --pre GeoBasesDev
pip install pytz
wget 'https://raw.githubusercontent.com/opentraveldata/opentraveldata/master/opentraveldata/optd_por_best_known_so_far.csv' 
/usr/bin/env python generate_optd_por_tz.py optd_por_best_known_so_far.csv > optd_por_tz.csv

Anytime you want to run the gists, you should make sure you are using the latest version of GeoBases, which should contain the latest ori-por data (just rerun the set of commands starting from the virtualenv creation).
I removed the attached optd_por_tz.csv output of the second gist, the web page was not even loading because the file was too big.

@da115115
Member

da115115 commented Jun 1, 2015

With the latest commit (4d6370bc1c) on OpenTravelData, there should no longer be any wrong time-zones. Could you check?
We can then close this issue.

Of course, ideally, a script should be run to be sure we do not introduce new wrong time-zones. However, the AWK script raises a warning when a POR has an unknown time-zone, so such issues can then be fixed manually.

@alexprengere
Member Author

With the latest commit, it is almost perfect ;). Just one remaining typo:

$ python fix_tz.py 
PKS with tz "Asia/Ventiane" matches tz "Asia/Vientiane" ...

The correct timezone seems to be Asia/Vientiane.

In the future, I will manually check the timezone validity when integrating the latest data in GeoBases. If problems occur, I will post here (even if the issue is closed I think we can still post), and wait for the fixes, to always get valid timezones in GeoBases.

@da115115
Member

da115115 commented Jun 2, 2015

@da115115 da115115 closed this as completed Jun 2, 2015
@alexprengere
Member Author

Awesome! I confirm that the latest data has 0 timezone problems ;).

@da115115
Member

da115115 commented Jun 4, 2015

Thanks!

@da115115 da115115 reopened this May 20, 2016
opentraveldata-bot added a commit that referenced this issue Aug 8, 2016
…try. Close #31.

This commit was automatically imported from the repository opentraveldata/opentraveldata:

commit f5001ef44a89a9ada156e7e7f5fd77cb631f9990
tree 0821d603c4063c44bff7c5a1f77ba665d4b5df23
parent 93f71726e330ea53c9e430a87cafaa69d4755db7
author Denis Arnaud <denis.arnaud_fedora@m4x.org> 1470656248 +0300
committer Denis Arnaud <denis.arnaud_fedora@m4x.org> 1470656248 +0300

    [Country] Removed the line for HI/Hawai, as that latter is not a country. Close #31.