
finished reconciliation code

1 parent 617b305 commit 99530484fb65df9164da3ce717483970e92846d6 Philip (flip) Kromer committed Aug 20, 2012
12 .dexy
@@ -1,16 +1,8 @@
{
"airport.rb|idio": {},
- "airline_flights.rake": { "airports.rb": {} },
- "airline_flights.rake|rake" : { "rake": { "args": "airports:parse:openflights" } },
- "airline_flights.rake|rake" : { "rake": { "args": "airports:parse:dataexpo" } },
- "dataexpo_airports-parsed-sample.tsv|snippet|wulign": { "airline_flights.rake|rake": {} },
- "openflights_airports-parsed-sample.tsv|snippet|wulign": { "airline_flights.rake|rake": {} },
"dataex*sample.csv|snippet" : {},
- "dataex*.tsv" : {},
- "lines*.csv|snippet" : {},
- "*.asciidoc|jinja" : {
- "allinputs" : true
- },
+ "*.asciidoc|jinja" : { "allinputs": true },
+ "*.tsv|snippet|wulign": { },
"end":{}}
@@ -23,6 +23,10 @@ And here's what it looks like, transformed from raw to target model:
{{ d['tmp/airline_flights/openflights_airports-parsed-sample.tsv|snippet|wulign'] }}
----
+----
+{{ d['tmp/airline_flights/airport_identifiers-sample.tsv|snippet|wulign'] }}
+----
+
.Airplane Models
----
include::code/munging/airline_flights/airplane.rb[depth=1]
@@ -45,19 +49,21 @@ Most of the extraction is straightforward. For reasons explained below, we have
=== Recovering Time Zone
-So far, the airline data is fairly straightforward to import. However, Loki the Trickster rarely stays clear when it comes to adapting datasets across domains. The flight data has *local* actual/scheduled times, and it has airports, and it has the date -- but it has neither the *absolute* time nor the time zone.
+So far, the airline data is fairly straightforward to import. However, Loki the Trickster rarely stays clear when it comes to adapting datasets across domains. The flight data has _local_ actual/scheduled times, and it has airports, and it has the date -- but it has neither the _absolute_ time nor the time zone.
+
+So, you need a map from airports to time zones. Good news: Openflights.org has that data. In fact, it's more comprehensive and adds some other interesting columns. Bad news: its data is somewhat messier and its identifiers don't cleanly reconcile against the `airline_flights` table. So, you need a 'gazette': a unified table listing the IATA, ICAO and FAA id of each airport. Wikipedia has a table indexing airports by IATA (and some, but not all, pairings with ICAO and FAA); and a table indexing airports by ICAO (and some, but not all, pairings with IATA and FAA) -- and they mostly agree, but with a couple hundred (about 2% of nearly 10,000 airports) in conflict.
-So, you need a map from airports to time zones. Good news: Openflights.org has that data. In fact, it's more comprehensive and adds some other interesting columns. Bad news: its data is somewhat messier and its identifiers don't cleanly reconcile against the `airline_flights` table. So, you need a "*gazette*": a unified table listing the IATA, ICAO and FAA id of each airport. Wikipedia has a table indexing airports by IATA (and some, but not all, pairings with ICAO and FAA); and a table indexing airports by ICAO (and some, but not all, pairings with IATA and FAA) -- and they mostly agree, but with a couple hundred (about 2% of nearly 10,000 airports) in conflict.
+If you're keeping track: for want of a time zone, we need an airport-TZ map; for want of common identifiers, we need an ICAO-IATA-FAA identifier gazette; for want of clean resolution anywhere we end up reconciling two datasets against two different gazettes and hand-curating the _identifiable_ errors in the mapping footnote:["Yak Shaving": a recursively unbound descent into the sunk-cost fallacy]. I won't go into the boring details of reconciling the airports: they're boring, and detailed in the code (see `munging/airline_flights/reconcile*.rb`).
-If you're keeping track: for want of a time zone, we need an airport-TZ map; for want of common identifiers, we need an ICAO-IATA-FAA identifier gazette; for want of clean resolution anywhere we end up reconciling two datasets against two different gazettes and hand-curating the _identifiable_ errors in the mapping footnote:[The technical term for this recursively unbound descent into the sunk-cost fallacy is "Yak Shaving"]. I won't go into the boring details of reconciling the airports: they're boring, and detailed in the code (see `munging/airline_flights/reconcile*.rb`).
+But it's important to share that there is no royal road to clean data. It's easy to account for the work required to correct the obvious, surface messiness in the data. The _common case_ is that correcting the 2% of outliers, reconciling conflicting assertions, dealing with ill-formatted records or broken encodings, and the rest of the "chimpanzee" work takes more time than anything else involved in semi-structured data extraction. Most importantly, that work is not a programming problem -- it requires you to understand obscure details from the source domain not otherwise needed to solve your problem at hand
-But it's important to share that there is no royal road to clean data. It's easy to account for the work required to correct the obvious, surface messiness in the data. The **common case** is that correcting the 2% of outliers, reconciling conflicting assertions, dealing with ill-formatted records or broken encodings, and the rest of the "chimpanzee" work takes more time than anything else involved in semi-structured data extraction. Most importantly, that work is not a programming problem -- it requires you to understand obscure details from the source domain not otherwise needed to solve your problem at hand footnote:[I now know far more about the pecadilloes of international airport identifier schemes than I ever wished to know. Airports may have an IATA id, an ICAO id, and (in the US and its territories) an FAA id. In the _continental_ US, the ICAO is always the FAA id preceded by a "K": Austin-Bergstrom airport has FAA id `AUS`, IATA id `AUS` and ICAO id `KAUS`. However:
-<li>Not all airports have ICAO ids, and not all airports have IATA ids.</li>
-<li>The FAA id often, but not always, matches the IATA id; this is the primary source of errors in the airport metadata, as people blithely assign the FAA id to an IATA-id-less airport.</li>
+footnote:[I now know far more about the peccadilloes of international airport identifier schemes than I ever wished to know. Airports may have an IATA id, an ICAO id, and (in the US and its territories) an FAA id. In the _continental_ US, the ICAO is always the FAA id preceded by a "K": Austin-Bergstrom airport has FAA id `AUS`, IATA id `AUS` and ICAO id `KAUS`. However:
+* Not all airports have ICAO ids, and not all airports have IATA ids.
+* The FAA id often, but not always, matches the IATA id; this is the primary source of errors in the airport metadata, as people blithely assign the FAA id to an IATA-id-less airport.
<li>There is yet another identifier, the METAR id, used to identify the weather station at an airport; it was once the same as the ICAO id but they are now maintained independently.</li></ul>
-Yet, somehow, all those planes typically land at the right place.].
+Yet, somehow, all those planes typically land at the right place.]
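The "K" rule from the footnote can be sketched in a few lines. This is an illustration, not the commit's code: `icao_from_faa` and `NON_K_STATES` are hypothetical names, the state list is copied from the `to_airport` change in this commit, and restricting the rule to ids of exactly three letters is an assumption.

```ruby
# Hypothetical helper, not part of the commit: guess an ICAO id from an FAA id.
# States and territories whose airports do not take the "K" prefix
# (list copied from the to_airport change in this commit).
NON_K_STATES = %w[PR AK CQ HI AS GU VI].freeze

def icao_from_faa(faa, state, country)
  return nil unless country == 'us'
  return nil if NON_K_STATES.include?(state)
  # Assumption: only ids made of exactly three capital letters are safe to prefix.
  return nil unless faa.to_s.match?(/\A[A-Z]{3}\z/)
  "K#{faa}"
end
```

Returning `nil` rather than a guess keeps the derived id out of the reconciliation when the rule does not clearly apply.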
-sidebar: Note that I both converted the altitude figure to meters and rounded it to one decimal place.
+note:[Note that I both converted the altitude figure to meters and rounded it to one decimal place.]
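As a rough sketch of the conversion that note describes, assuming the raw altitude arrives in feet and may be missing (`FEET_TO_METERS` and `altitude_in_meters` are illustrative names, not taken from the code):

```ruby
# One international foot is exactly 0.3048 meters.
FEET_TO_METERS = 0.3048

# Convert an altitude in feet to meters, rounded to one decimal place.
# Returns nil when the altitude is unknown.
def altitude_in_meters(altitude_ft)
  return nil if altitude_ft.nil?
  (FEET_TO_METERS * altitude_ft).round(1)
end
```

Rounding to one decimal keeps roughly the precision of a figure that was originally reported in whole feet.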
=== Foundational Data ===
@@ -1,3 +1,15 @@
+class Airline
+ include Gorillib::Model
+ field :iata_id, String, doc: "2-letter IATA code, if available"
+ field :icao_id, String, doc: "3-letter ICAO code, if available"
+ field :airline_ofid, Integer, doc: "Unique OpenFlights identifier for this airline."
+ field :alias, String, doc: "Alias of the airline. For example, 'All Nippon Airways' is commonly known as 'ANA'"
+ field :callsign, String, doc: "Airline callsign"
+ field :country, String, doc: "Country or territory where airline is incorporated"
+ field :active, :boolean, doc: 'true if the airline is or has until recently been operational, false if it is defunct. (This is only a rough indication and should not be taken as 100% accurate)'
+ field :name, String, doc: "Airline name."
+end
+
#
# As of January 2012, the OpenFlights Airlines Database contains 5888
# airlines. If you enjoy this data, please consider [visiting their page and
@@ -28,22 +40,10 @@ class RawOpenflightAirline
field :active, :boolean, doc: 'true if the airline is or has until recently been operational, false if it is defunct. (This is only a rough indication and should not be taken as 100% accurate)'
def receive_active(val)
- super(case val when "Y" then true when "N" then false else val ; end)
+ super(case val.to_s when "Y" then true when "N" then false else val ; end)
end
def to_airline
Airline.receive(self.compact_attributes)
end
end
-
-class Airline
- include Gorillib::Model
- field :iata_id, String, doc: "2-letter IATA code, if available"
- field :icao_id, String, doc: "3-letter ICAO code, if available"
- field :airline_ofid, Integer, doc: "Unique OpenFlights identifier for this airline."
- field :alias, String, doc: "Alias of the airline. For example, 'All Nippon Airways' is commonly known as 'ANA'"
- field :callsign, String, doc: "Airline callsign"
- field :country, String, doc: "Country or territory where airline is incorporated"
- field :active, :boolean, doc: 'true if the airline is or has until recently been operational, false if it is defunct. (This is only a rough indication and should not be taken as 100% accurate)'
- field :name, String, doc: "Airline name."
-end
@@ -1,47 +1,68 @@
require_relative('../../rake_helper')
-require_relative('airport')
-require 'pry'
+require_relative('./models')
Pathname.register_paths(
af_data: [:data, 'airline_flights'],
af_work: [:work, 'airline_flights'],
+ af_code: File.dirname(__FILE__),
#
openflights_raw_airports: [:af_data, "openflights_airports-raw#{Settings[:mini_slug]}.csv" ],
dataexpo_raw_airports: [:af_data, "dataexpo_airports-raw#{Settings[:mini_slug]}.csv" ],
wikipedia_icao: [:af_data, "wikipedia_icao.tsv" ],
wikipedia_iata: [:af_data, "wikipedia_iata.tsv" ],
+ wikipedia_us_abroad: [:af_data, "wikipedia_us_abroad.tsv" ],
#
openflights_parsed: [:af_work, "openflights_airports-parsed#{Settings[:mini_slug]}.tsv"],
dataexpo_parsed: [:af_work, "dataexpo_airports-parsed#{Settings[:mini_slug]}.tsv" ],
airport_identifiers: [:af_work, "airport_identifiers.tsv" ],
+ airport_identifiers_mini: [:af_work, "airport_identifiers-sample.tsv" ],
)
chain :airline_flights do
+ code_files = FileList[Pathname.of(:af_code, '*airport*.rb').to_s]
chain(:parse) do
- # desc 'parse the dataexpo airports'
- # create_file :dataexpo_parsed do |dest|
- # RawDataexpoAirport.load_csv(:dataexpo_raw_airports) do |raw_airport|
- # dest << raw_airport.to_airport.to_tsv << "\n"
- # end
- # end
- #
- # desc 'parse the openflights airports'
- # create_file :openflights_parsed do |dest|
- # RawOpenflightAirport.load_csv(:openflights_raw_airports) do |raw_airport|
- # dest << raw_airport.to_airport.to_tsv << "\n"
- # end
- # end
-
- desc 'run the identifier resolver'
- file_task(:airport_identifiers,
- # after: [:dataexpo_parsed, :openflights_parsed]
- ) do |dest|
- require_relative 'resolve_identifiers'
- Airport::IdReconciler.load
+ desc 'parse the dataexpo airports'
+ create_file :dataexpo_parsed, after: code_files do |dest|
+ RawDataexpoAirport.load_airports(:dataexpo_raw_airports) do |airport|
+ dest << airport.to_tsv << "\n"
+ end
end
+
+ desc 'parse the openflights airports'
+ create_file :openflights_parsed, after: code_files do |dest|
+ RawOpenflightAirport.load_airports(:openflights_raw_airports) do |airport|
+ dest << airport.to_tsv << "\n"
+ end
+ end
+
+ task :reconcile_airports => [:dataexpo_parsed, :openflights_parsed] do
+ require_relative 'reconcile_airports'
+ Airport::IdReconciler.load_all
+ end
+
+ desc 'run the identifier reconciler'
+ create_file(:airport_identifiers, after: code_files, invoke: 'airline_flights:parse:reconcile_airports') do |dest|
+ Airport::IdReconciler.airports.each do |airport|
+ dest << airport.to_tsv << "\n"
+ puts airport if airport.faa_controlled? && (airport.icao !~ /^K/) && (airport.faa.blank?)
+ end
+ end
+
+ desc 'write the sample (exemplar) airport identifiers'
+ create_file(:airport_identifiers_mini, after: code_files, invoke: 'airline_flights:parse:reconcile_airports') do |dest|
+ Airport::IdReconciler.exemplars.each do |airport|
+ dest << airport.to_tsv << "\n"
+ end
+ end
+
end
end
# task :default => 'airline_flights'
-task :default => 'airline_flights:parse:airport_identifiers'
+task :default => [
+ # 'airline_flights:parse:dataexpo_parsed',
+ # 'airline_flights:parse:openflights_parsed',
+ 'airline_flights:parse:airport_identifiers',
+ # 'airline_flights:parse:airport_identifiers_mini'
+]
@@ -56,6 +56,18 @@ def lint
errors.compact_blank
end
+ def to_s
+ str = "#<Airport "
+ str << [icao, iata, faa,
+ (latitude && "%4.1f" % latitude), (longitude && "%5.1f" % longitude), state, country,
+ "%-30s" % name, country, city].join("\t")
+ str << ">"
+ end
+
+ def faa_controlled?
+ icao =~ /^(?:K|P[ABFGHJKMOPW]|T[IJ]|NS(AS|FQ|TU))/
+ end
+
end
### @export "nil"
@@ -81,7 +93,12 @@ def lint
### @export "raw_openflight_airport"
module RawAirport
- COUNTRIES = { 'Puerto Rico' => 'us', 'Canada' => 'ca', 'USA' => 'us', 'United States' => 'us', 'Northern Mariana Islands' => 'us', }
+ COUNTRIES = { 'Puerto Rico' => 'us', 'Canada' => 'ca', 'USA' => 'us', 'United States' => 'us',
+ 'Northern Mariana Islands' => 'us', 'N Mariana Islands' => 'us',
+ 'Federated States of Micronesia' => 'fm',
+ 'Thailand' => 'th', 'Palau' => 'pw',
+ 'American Samoa' => 'as', 'Wake Island' => 'us', 'Virgin Islands' => 'vi', 'Guam' => 'gu'
+ }
BLANKISH_STRINGS = ["", nil, "NULL", '\\N', "NONE", "NA", "Null", "..."]
OK_CHARS_RE = /[^a-zA-Z0-9\ \/\.\,\-\(\)\']/
@@ -122,14 +139,18 @@ class RawOpenflightAirport
field :icao, String, blankish: BLANKISH_STRINGS, doc: "4-letter ICAO code; Blank if not assigned."
field :latitude, Float, doc: "Decimal degrees, usually to six significant digits. Negative is South, positive is North."
field :longitude, Float, doc: "Decimal degrees, usually to six significant digits. Negative is West, positive is East."
- field :altitude_ft, Float, doc: "In feet.", blankish: ['', nil, 0, '0']
+ field :altitude_ft, Float, blankish: ['', nil, 0, '0'], doc: "In feet."
field :utc_offset, Float, doc: "Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5."
field :dst_rule, String, doc: "Daylight savings time rule. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown). See the readme for more."
- UNRELIABLE_OPENFLIGHTS_IATA_VALUES = /^(7AK|AFE|AGA|AUQ|BDJ|BGW|BME|BPM|BXH|BZY|CAT|CEE|CEJ|CFS|CGU|CIO|CLV|CNN|DEE|DIB|DNM|DUH|DUR|FKI|GES|GSM|HKV|HOJ|HYD|IEO|IFN|IKA|IZA|JCU|JGS|KAE|KMW|KNC|LGQ|LUM|MCU|MCY|MDO|MOH|MON|MON|MPH|MVF|NAY|NMA|NOE|NQY|OTU|OUI|PBV|PCA|PCB|PGK|PHO|PIF|PKN|PKY|PMK|PTG|PZO|QAS|QKT|QVY|RCM|RJL|RTG|SBG|SDZ|SFG|SIC|SIQ|SJI|SRI|STG|STP|STU|SWQ|TJQ|TJS|TMC|TYA|UKC|VIY|VQS|VTS|WDH|WKM|WPR|WPU|ZQF)$/
+ UNRELIABLE_OPENFLIGHTS_IATA_VALUES = /^(7AK|AGA|AUQ|BDJ|BGW|BME|BPM|BXH|BZY|CAT|CEE|CEJ|CFS|CGU|CIO|CLV|CNN|DEE|DIB|DNM|DUH|DUR|FKI|GES|GSM|HKV|HOJ|HYD|IEO|IFN|IKA|IZA|JCU|JGS|KMW|KNC|LGQ|LUM|MCU|MCY|MDO|MOH|MON|MPH|MVF|NAY|NMA|NOE|NQY|OTU|OUI|PBV|PCA|PCB|PGK|PHO|PIF|PKN|PKY|PMK|PTG|PZO|QAS|QKT|QVY|RCM|RJL|RTG|SBG|SDZ|SFG|SIC|SIQ|SJI|SRI|STP|STU|SWQ|TJQ|TJS|TMC|TYA|UKC|VIY|VQS|VTS|WDH|WKM|WPR|WPU|ZQF)$/
+
+ def id_is_faa?
+ (icao =~ /^(?:K)/) || (icao.blank? && country == 'us')
+ end
- def iata ; (icao =~ /^K/ ? nil : iata_faa) unless iata_faa =~ UNRELIABLE_OPENFLIGHTS_IATA_VALUES end
- def faa ; (icao =~ /^K/ ? iata_faa : nil ) end
+ def iata ; (id_is_faa? ? nil : iata_faa) unless iata_faa =~ UNRELIABLE_OPENFLIGHTS_IATA_VALUES end
+ def faa ; (id_is_faa? ? iata_faa : nil ) end
def altitude
altitude_ft && (0.3048 * altitude_ft).round(1)
end
@@ -164,7 +185,7 @@ class RawDataexpoAirport
def to_airport
attrs = self.compact_attributes
- attrs[:icao] = "K#{faa}" if faa =~ /[A-Z]{3}/ && (not ['PR', 'AK', 'CQ', 'HI', 'AS'].include?(state))
+ attrs[:icao] = "K#{faa}" if faa =~ /[A-Z]{3}/ && (not ['PR', 'AK', 'CQ', 'HI', 'AS', 'GU', 'VI'].include?(state)) && (country == 'us')
Airport.receive(attrs)
end
@@ -1,9 +1,3 @@
-require 'gorillib/model'
-require 'gorillib/model/factories'
-require 'gorillib/model/serialization'
-require 'gorillib/model/serialization/csv'
-require 'gorillib/type/extended'
-
require_relative './airline'
require_relative './airport'
require_relative './route'
@@ -1,7 +1,5 @@
require_relative './models'
require 'gorillib/model/reconcilable'
-require 'gorillib/array/hashify'
-require 'gorillib/model/serialization/tsv'
class Airport
include Gorillib::Model::Reconcilable
@@ -59,21 +57,32 @@ class IdReconciler
def ids
opinions.flat_map{|op| op.ids.to_a }.uniq.compact
end
- def icao() ids.assoc(:icao) ; end
- def iata() ids.assoc(:iata) ; end
- def faa() ids.assoc(:faa) ; end
- def self.load
+ def self.load_all
+ Log.info "Loading all Airports and reconciling"
+ @airports = Array.new
RawDataexpoAirport .load_airports(:dataexpo_raw_airports ){|airport| register(:dataexpo, airport) }
RawOpenflightAirport.load_airports(:openflights_raw_airports){|airport| register(:openflights, airport) }
RawAirportIdentifier.load_airports(:wikipedia_icao ){|airport| register(:wp_icao, airport) }
RawAirportIdentifier.load_airports(:wikipedia_iata ){|airport| register(:wp_iata, airport) }
+ RawAirportIdentifier.load_airports(:wikipedia_us_abroad ){|airport| register(:wp_us_abroad, airport) }
recs = ID_MAP.map{|attr, hsh| hsh.sort.map(&:last) }.flatten.uniq
- cons = recs.map{|rec| rec.reconcile }
- cons.each do |consensus|
- lint = consensus.lint
- puts "%-79s\t%s" % [lint, consensus.to_s[0..100]] if lint.present?
+ recs.each do |rec|
+ consensus = rec.reconcile
+ # lint = consensus.lint
+ # puts "%-79s\t%s" % [lint, consensus.to_s[0..100]] if lint.present?
+ @airports << consensus
+ end
+ end
+
+ def self.airports
+ @airports
+ end
+
+ def self.exemplars
+ Airport::EXEMPLARS.map do |iata|
+ ID_MAP[:iata][iata].reconcile
end
end
@@ -110,23 +119,21 @@ def adopt_opinions(vals, _)
#
def self.register(origin, obj)
obj._origin = origin
- ids = obj.ids
- reconciler = self.new(opinions: [obj])
- # get the existing objects
- existing = ids.map{|attr, id| ID_MAP[attr][id] }.compact.uniq
- # reconcile them
+ # get the existing reconcilers
+ existing = obj.ids.map{|attr, id| ID_MAP[attr][id] }.compact.uniq
+ # push the new object in, and pull the most senior one out
+ existing.unshift(self.new(opinions: [obj]))
+ reconciler = existing.shift
+ # unite them into the reconciler
existing.each{|that| reconciler.adopt(that) }
-
# save the reconciler under each of the ids.
reconciler.ids.each{|attr, id| ID_MAP[attr][id] = reconciler }
- # dump_info("1 #{origin}", ids, reconciler, existing)
end
def inspect
str = "#<#{self.class.name} #{ids}"
opinions.each do |op|
- str << "\n\t #{op._origin}\t"
- str << [op.icao, op.iata, op.faa, op.name, op.city].join("\t")
+ str << "\n\t #{op._origin}\t#{op}"
end
str << ">"
end
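
The registration step in the last hunk merges records that share any identifier. A minimal, self-contained sketch of that idea follows. `IdGroup`, `ID_INDEX` and `register_record` are hypothetical stand-ins for the commit's reconciler, `ID_MAP` and `register`, and records are plain hashes rather than Gorillib models. The sketch keeps the oldest existing group as the survivor, which is an assumption about the intent.

```ruby
# Hypothetical stand-in for a reconciler: a bag of records that share identifiers.
class IdGroup
  attr_reader :records

  def initialize
    @records = []
  end

  # Every [scheme, id] pair asserted by any record in the group.
  def ids
    records.flat_map(&:to_a).uniq
  end

  # Absorb another group's records.
  def adopt(other)
    records.concat(other.records)
    self
  end
end

# Index from scheme (:icao, :iata, :faa) to id to group.
ID_INDEX = Hash.new { |index, scheme| index[scheme] = {} }

def register_record(record)
  known = record.reject { |_scheme, id| id.nil? }
  # Find every group already filed under one of this record's ids.
  existing = known.map { |scheme, id| ID_INDEX[scheme][id] }.compact.uniq
  # Keep the first existing group as the survivor, or start a new one.
  group = existing.shift || IdGroup.new
  group.records << known
  # Fold the remaining groups into the survivor.
  existing.each { |other| group.adopt(other) }
  # File the survivor under every id it now knows about.
  group.ids.each { |scheme, id| ID_INDEX[scheme][id] = group }
  group
end
```

Re-indexing under every id after each merge is what lets a later record with only one matching id find the whole group.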