actually munging code lives with wukong/examples -- this will remain data-only
commit d58c27c1475b2ed40c7a623143e82b7dd44e22ed 1 parent 8ae2140
@mrflip authored
Showing with 56 additions and 7,446 deletions.
  1. +0 −57 munging/airline_flights/airline.rb
  2. +0 −83 munging/airline_flights/airline_flights.rake
  3. 0  munging/airline_flights/airplane.rb
  4. +0 −211 munging/airline_flights/airport.rb
  5. +0 −129 munging/airline_flights/airport_id_unification.rb
  6. +0 −4 munging/airline_flights/airport_ok_chars.rb
  7. +0 −156 munging/airline_flights/flight.rb
  8. +0 −4 munging/airline_flights/models.rb
  9. +0 −26 munging/airline_flights/parse.rb
  10. +0 −142 munging/airline_flights/reconcile_airports.rb
  11. +0 −35 munging/airline_flights/route.rb
  12. +0 −1  munging/airline_flights/tasks.rake
  13. +0 −62 munging/airline_flights/timezone_fixup.rb
  14. +0 −167 munging/airline_flights/topcities.rb
  15. +0 −40 munging/airports/40_wbans.txt
  16. +0 −37 munging/airports/filter_weather_reports.rb
  17. +0 −31 munging/airports/join.pig
  18. +0 −33 munging/airports/to_tsv.rb
  19. +0 −19 munging/airports/usa_wbans.pig
  20. +0 −2,157 munging/airports/usa_wbans.txt
  21. +0 −19 munging/airports/wbans.pig
  22. +0 −2,310 munging/airports/wbans.txt
  23. +0 −54 munging/geo/geo_json.rb
  24. +0 −69 munging/geo/geo_models.rb
  25. +0 −78 munging/geo/geonames_models.rb
  26. +0 −172 munging/geo/iso_codes.rb
  27. +0 −124 munging/geo/reconcile_countries.rb
  28. +0 −71 munging/geo/tasks.rake
  29. +0 −62 munging/rake_helper.rb
  30. +0 −1  munging/weather/.gitignore
  31. +0 −4 munging/weather/Gemfile
  32. +0 −28 munging/weather/Rakefile
  33. +0 −13 munging/weather/extract_ish.rb
  34. +0 −119 munging/weather/models/weather.rb
  35. +0 −46 munging/weather/utils/noaa_downloader.rb
  36. +0 −8 munging/wikipedia/Gemfile
  37. +0 −34 munging/wikipedia/README.md
  38. +0 −193 munging/wikipedia/Rakefile
  39. +0 −57 munging/wikipedia/articles/extract_articles.rb
  40. +0 −16 munging/wikipedia/n1_subuniverse/n1_nodes.pig
  41. +0 −21 munging/wikipedia/page_metadata/extract_page_metadata.rb
  42. +0 −27 munging/wikipedia/page_metadata/extract_page_metadata.rb.old
  43. +0 −29 munging/wikipedia/pagelinks/augment_pagelinks.pig
  44. +0 −14 munging/wikipedia/pagelinks/extract_pagelinks.rb
  45. +0 −25 munging/wikipedia/pagelinks/extract_pagelinks.rb.old
  46. +0 −29 munging/wikipedia/pagelinks/undirect_pagelinks.pig
  47. +0 −32 munging/wikipedia/pageviews/augment_pageviews.pig
  48. +0 −85 munging/wikipedia/pageviews/extract_pageviews.rb
  49. +0 −25 munging/wikipedia/pig_style_guide.md
  50. +0 −19 munging/wikipedia/redirects/redirects_page_metadata.pig
  51. +0 −23 munging/wikipedia/subuniverse/sub_articles.pig
  52. +0 −24 munging/wikipedia/subuniverse/sub_page_metadata.pig
  53. +0 −22 munging/wikipedia/subuniverse/sub_pagelinks_from.pig
  54. +0 −22 munging/wikipedia/subuniverse/sub_pagelinks_into.pig
  55. +0 −26 munging/wikipedia/subuniverse/sub_pagelinks_within.pig
  56. +0 −29 munging/wikipedia/subuniverse/sub_pageviews.pig
  57. +0 −24 munging/wikipedia/subuniverse/sub_undirected_pagelinks_within.pig
  58. +0 −86 munging/wikipedia/utils/get_namespaces.rb
  59. +0 −11 munging/wikipedia/utils/munging_utils.rb
  60. +0 −1  munging/wikipedia/utils/namespaces.json
  61. +56 −0 wikipedia/README.md
57 munging/airline_flights/airline.rb
@@ -1,57 +0,0 @@
-class Airline
- include Gorillib::Model
- field :icao_id, String, doc: "3-letter ICAO code, if available", identifier: true, length: 2
- field :iata_id, String, doc: "2-letter IATA code, if available", identifier: true, length: 2
- field :airline_ofid, Integer, doc: "Unique OpenFlights identifier for this airline.", identifier: true
- field :active, :boolean, doc: 'true if the airline is or has until recently been operational, false if it is defunct. (This is only a rough indication and should not be taken as 100% accurate)'
- field :country, String, doc: "Country or territory where airline is incorporated"
- field :name, String, doc: "Airline name."
- field :callsign, String, doc: "Airline callsign", identifier: true
- field :alias, String, doc: "Alias of the airline. For example, 'All Nippon Airways' is commonly known as 'ANA'"
-end
-
-#
-# As of January 2012, the OpenFlights Airlines Database contains 5888
-# airlines. If you enjoy this data, please consider [visiting their page and
-# donating](http://openflights.org/data.html)
-#
-# > Notes: Airlines with null codes/callsigns/countries generally represent
-# > user-added airlines. Since the data is intended primarily for current
-# > flights, defunct IATA codes are generally not included. For example,
-# > "Sabena" is not listed with a SN IATA code, since "SN" is presently used by
-# > its successor Brussels Airlines.
-#
-# Sample entries
-#
-# 324,"All Nippon Airways","ANA All Nippon Airways","NH","ANA","ALL NIPPON","Japan","Y"
-# 412,"Aerolineas Argentinas",\N,"AR","ARG","ARGENTINA","Argentina","Y"
-# 413,"Arrowhead Airways",\N,"","ARH","ARROWHEAD","United States","N"
-#
-class RawOpenflightAirline
- include Gorillib::Model
- include Gorillib::Model::LoadFromCsv
- BLANKISH_STRINGS = ["", nil, "NULL", '\\N', "NONE", "NA", "Null", "..."]
-
- field :airline_ofid, Integer, blankish: BLANKISH_STRINGS, doc: "Unique OpenFlights identifier for this airline.", identifier: true
- field :name, String, blankish: BLANKISH_STRINGS, doc: "Airline name."
- field :alias, String, blankish: BLANKISH_STRINGS, doc: "Alias of the airline. For example, 'All Nippon Airways' is commonly known as 'ANA'"
- field :iata_id, String, blankish: BLANKISH_STRINGS, doc: "2-letter IATA code, if available", identifier: true, length: 2
- field :icao_id, String, blankish: BLANKISH_STRINGS, doc: "3-letter ICAO code, if available", identifier: true, length: 2
- field :callsign, String, blankish: BLANKISH_STRINGS, doc: "Airline callsign"
- field :country, String, blankish: BLANKISH_STRINGS, doc: "Country or territory where airline is incorporated"
- field :active, :boolean, blankish: BLANKISH_STRINGS, doc: 'true if the airline is or has until recently been operational, false if it is defunct. (This is only a rough indication and should not be taken as 100% accurate)'
-
- def receive_iata_id(val) super if val =~ /\A\w+\z/ ; end
- def receive_icao_id(val) super if val =~ /\A\w+\z/ ; end
- def receive_active(val)
- super(case val.to_s when "Y" then true when "N" then false else val ; end)
- end
-
- def to_airline
- Airline.receive(self.compact_attributes)
- end
-
- def self.load_airlines(filename)
- load_csv(filename){|raw_airline| yield(raw_airline.to_airline) }
- end
-end
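For reference, the coercion rules this deleted model applied can be sketched as standalone Ruby with no Gorillib dependency. The module name and method names below are illustrative, not part of the original code; the constants and regexes mirror the diff above.

```ruby
# Minimal sketch of RawOpenflightAirline's coercions: "Y"/"N" activity
# flags become booleans, and ID codes are accepted only when they are
# plain word characters (rejecting blank-ish values like '\N').
module OpenflightsCoercion
  BLANKISH_STRINGS = ["", nil, "NULL", '\N', "NONE", "NA", "Null", "..."]

  # "Y" => true, "N" => false; anything else passes through untouched,
  # matching the `case` in receive_active above.
  def self.coerce_active(val)
    case val.to_s
    when "Y" then true
    when "N" then false
    else val
    end
  end

  # Accept an IATA/ICAO code only if it is purely word characters,
  # matching the guard in receive_iata_id / receive_icao_id above.
  def self.clean_code(val)
    val if val =~ /\A\w+\z/
  end
end
```

In the original, these guards run inside Gorillib's `receive_*` hooks, so bad values are dropped at load time rather than after the fact.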
83 munging/airline_flights/airline_flights.rake
@@ -1,83 +0,0 @@
-require_relative('../../rake_helper')
-require_relative('./models')
-
-Pathname.register_paths(
- af_data: [:data, 'airline_flights'],
- af_work: [:work, 'airline_flights'],
- af_code: File.dirname(__FILE__),
- #
- openflights_raw_airports: [:af_data, "openflights_airports-raw#{Settings[:mini_slug]}.csv" ],
- openflights_raw_airlines: [:af_data, "openflights_airlines-raw.csv" ],
- dataexpo_raw_airports: [:af_data, "dataexpo_airports-raw#{Settings[:mini_slug]}.csv" ],
- wikipedia_icao: [:af_data, "wikipedia_icao.tsv" ],
- wikipedia_iata: [:af_data, "wikipedia_iata.tsv" ],
- wikipedia_us_abroad: [:af_data, "wikipedia_us_abroad.tsv" ],
- #
- openflights_airports: [:af_work, "openflights_airports-parsed#{Settings[:mini_slug]}.tsv"],
- openflights_airlines: [:af_work, "openflights_airlines-parsed#{Settings[:mini_slug]}.tsv"],
- dataexpo_airports: [:af_work, "dataexpo_airports-parsed#{Settings[:mini_slug]}.tsv" ],
- airport_identifiers: [:af_work, "airport_identifiers.tsv" ],
- airport_identifiers_mini: [:af_work, "airport_identifiers-sample.tsv" ],
- # helpers
- country_name_lookup: [:work, 'geo', "country_name_lookup.tsv"],
- )
-
-chain :airline_flights do
- code_files = FileList[Pathname.of(:af_code, '*.rb').to_s]
- chain(:parse) do
-
- # desc 'parse the dataexpo airports'
- # create_file(:dataexpo_airports, after: code_files) do |dest|
- # RawDataexpoAirport.load_airports(:dataexpo_raw_airports) do |airport|
- # dest << airport.to_tsv << "\n"
- # end
- # end
-
- desc 'parse the openflights airports'
- create_file(:openflights_airports, after: [code_files, :force]) do |dest|
- require_relative('../geo/geo_models')
- Geo::CountryNameLookup.load
- RawOpenflightAirport.load_airports(:openflights_raw_airports) do |airport|
- dest << airport.to_tsv << "\n"
- # puts airport.country
- end
- end
-
- # task :reconcile_airports => [:dataexpo_airports, :openflights_airports] do
- # require_relative 'reconcile_airports'
- # Airport::IdReconciler.load_all
- # end
- #
- # desc 'run the identifier reconciler'
- # create_file(:airport_identifiers, after: code_files, invoke: 'airline_flights:parse:reconcile_airports') do |dest|
- # Airport::IdReconciler.airports.each do |airport|
- # dest << airport.to_tsv << "\n"
- # end
- # end
- #
- # desc 'run the identifier reconciler'
- # create_file(:airport_identifiers_mini, after: code_files, invoke: 'airline_flights:parse:reconcile_airports') do |dest|
- # Airport::IdReconciler.exemplars.each do |airport|
- # dest << airport.to_tsv << "\n"
- # end
- # end
- #
- # desc 'parse the openflights airlines'
- # create_file(:openflights_airlines, after: code_files) do |dest|
- # RawOpenflightAirline.load_airlines(:openflights_raw_airlines) do |airline|
- # dest << airline.to_tsv << "\n"
- # puts airline.to_tsv
- # end
- # end
-
- end
-end
-
-task :default => [
- 'airline_flights',
- # 'airline_flights:parse:dataexpo_airports',
- # 'airline_flights:parse:openflights_airports',
- # 'airline_flights:parse:airport_identifiers',
- # 'airline_flights:parse:airport_identifiers_mini',
- # 'airline_flights:parse:openflights_airlines',
-]
0  munging/airline_flights/airplane.rb
No changes.
211 munging/airline_flights/airport.rb
@@ -1,211 +0,0 @@
-# -*- coding: utf-8 -*-
-
-### @export "airport_model"
-class Airport
- include Gorillib::Model
-
- field :icao, String, doc: "4-letter ICAO code, or blank if not assigned.", length: 4, identifier: true, :blankish => ["", nil]
- field :iata, String, doc: "3-letter IATA code, or blank if not assigned.", length: 3, identifier: true, :blankish => ["", nil]
- field :faa, String, doc: "3-letter FAA code, or blank if not assigned.", length: 3, identifier: true, :blankish => ["", nil]
- field :utc_offset, Float, doc: "Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.", validates: { inclusion: (-12...12) }
- field :dst_rule, String, doc: "Daylight savings time rule. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown). See the readme for more.", validates: { inclusion: %w[E A S O Z N U] }
- field :longitude, Float, doc: "Decimal degrees, usually to six significant digits. Negative is West, positive is East.", validates: { inclusion: (-180...180) }
- field :latitude, Float, doc: "Decimal degrees, usually to six significant digits. Negative is South, positive is North.", validates: { inclusion: (-90.0...90.0) }
- field :altitude, Float, doc: "Elevation in meters."
- field :name, String, doc: "Name of airport."
- field :country, String, doc: "Country or territory where airport is located.", length: 2
- field :state, String, doc: "State in which the airport is located", length: 2
- field :city, String, doc: "Main city served by airport. This is the logical city it serves; so, for example SFO gets 'San Francisco', not 'San Bruno'"
- field :airport_ofid, String, doc: "OpenFlights identifier for this airport.", identifier: true
-end
-### @export "nil"
-class Airport
- EXEMPLARS = %w[
- ANC ATL AUS BDL BNA BOI BOS BWI CLE CLT
- CMH DCA DEN DFW DTW EWR FLL HNL IAD IAH
- IND JAX JFK LAS LAX LGA MCI MCO MDW MIA
- MSP MSY OAK ORD PDX PHL PHX PIT PVD RDU
- SAN SEA SFO SJC SJU SLC SMF STL TPA YYZ
- ]
-
- def utc_time_for(tm)
- utc_time = tm.get_utc + utc_offset
- utc_time += (60*60) if TimezoneFixup.dst?(tm)
- utc_time
- end
-
- BLANKISH_STRINGS = ["", nil, "NULL", '\\N', "NONE", "NA", "Null", "..."]
- OK_CHARS_RE = /[^a-zA-Z0-9\:\ \/\.\,\-\(\)\'ÁÂÄÅÇÉÍÎÑÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôõöøúüýĀāăĆćČčēėęěğīİıŁłńņňŌōőřŞşŠšţťūůųźŽžơț]/
-
- def lint
- errors = {}
- errors["ICAO is wrong length"] = icao if icao.present? && icao.length != 4
- if (icao && faa && (icao =~ /^K.../))
- errors["ICAO != K+FAA yet ICAO is a K..."] = [icao, faa] if (icao != "K#{faa}")
- end
- # errors["ICAO present for piddlyshit airport"] = icao if icao.present? && ((faa.to_s.length == 4) || (faa.to_s =~ /\d/))
- errors[:spaces] ||= []
- errors[:funny] ||= []
- attributes.each do |attr, val|
- next if val.blank?
- errors["#{attr} looks blankish"] = val if BLANKISH_STRINGS.include?(val)
- if (val.is_a?(String))
- errors[:spaces] << [attr, val] if (val.strip != val)
- errors[:funny] << [attr, val] if val =~ OK_CHARS_RE
- end
- end
- errors.compact_blank
- end
-
- def to_s
- str = "#<Airport "
- str << [icao, iata, faa,
- (latitude && "%4.1f" % latitude), (longitude && "%5.1f" % longitude), state, country,
- "%-30s" % name, country, city].join("\t")
- str << ">"
- end
-
- def faa_controlled?
- icao =~ /^(?:K|P[ABFGHJKMOPW]|T[IJ]|NS(AS|FQ|TU))/
- end
-end
-### @export "airport_load"
-class Airport
- include Gorillib::Model::LoadFromTsv
- self.tsv_options.merge!(num_fields: 10..20)
- def self.load_airports(filename)
- load_tsv(filename){|airport| yield(airport) }
- end
-
-end
-### @export "nil"
-
-#
-# As of January 2012, the OpenFlights Airports Database contains 6977 airports
-# [spanning the globe](http://openflights.org/demo/openflights-apdb-2048.png).
-# If you enjoy this data, please consider [visiting their page and
-# donating](http://openflights.org/data.html)
-#
-# > Note: Rules for daylight savings time change from year to year and from
-# > country to country. The current data is an approximation for 2009, built on
-# > a country level. Most airports in DST-less regions in countries that
-# > generally observe DST (eg. AL, HI in the USA, NT, QL in Australia, parts of
-# > Canada) are marked incorrectly.
-#
-# Sample entries
-#
-# 507,"Heathrow","London","United Kingdom","LHR","EGLL",51.4775,-0.461389,83,0,"E"
-# 26,"Kugaaruk","Pelly Bay","Canada","YBB","CYBB",68.534444,-89.808056,56,-6,"A"
-# 3127,"Pokhara","Pokhara","Nepal","PKR","VNPK",28.200881,83.982056,2712,5.75,"N"
-#
-
-### @export "raw_openflight_airport"
-
-module RawAirport
- COUNTRIES = { 'Puerto Rico' => 'us', 'Canada' => 'ca', 'USA' => 'us', 'United States' => 'us',
- 'Northern Mariana Islands' => 'us', 'N Mariana Islands' => 'us',
- 'Federated States of Micronesia' => 'fm',
- 'Thailand' => 'th', 'Palau' => 'pw',
- 'American Samoa' => 'as', 'Wake Island' => 'us', 'Virgin Islands' => 'vi', 'Guam' => 'gu'
- }
- BLANKISH_STRINGS = ["", nil, "NULL", '\\N', "NONE", "NA", "Null", "..."]
- OK_CHARS_RE = /[^a-zA-Z0-9\:\ \/\.\,\-\(\)\'ÁÂÄÅÇÉÍÎÑÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôõöøúüýĀāăĆćČčēėęěğīİıŁłńņňŌōőřŞşŠšţťūůųźŽžơț]/
-
- def receive_city(val)
- super.tap{|val| if val then val.strip! ; val.gsub!(/\\+/, '') ; end }
- end
-
- def receive_country(val)
- super(COUNTRIES[val] || val)
- end
-
- def receive_name(val)
- super.tap do |val|
- if val
- val.strip!
- val.gsub!(/\\+/, '')
- val.gsub!(/\s*\[(military|private)\]/, '')
- val.gsub!(/\b(Int\'l|International)\b/, 'Intl')
- val.gsub!(/\b(Intercontinental)\b/, 'Intcntl')
- val.gsub!(/\b(Airpt)\b/, 'Airport')
- val.gsub!(/ Airport$/, '')
- end
- end
- end
-end
-
-#
-class RawOpenflightAirport
- include Gorillib::Model
- include Gorillib::Model::LoadFromCsv
- include RawAirport
- #
- field :airport_ofid, String, doc: "Unique OpenFlights identifier for this airport."
- field :name, String, doc: "Name of airport. May or may not contain the City name."
- field :city, String, blankish: BLANKISH_STRINGS, doc: "Main city served by airport. May be spelled differently from Name."
- field :country, String, doc: "Country or territory where airport is located."
- field :iata_faa, String, blankish: BLANKISH_STRINGS, doc: "3-letter FAA code, for airports located in the USA. For all other airports, 3-letter IATA code, or blank if not assigned."
- field :icao, String, blankish: BLANKISH_STRINGS, doc: "4-letter ICAO code; Blank if not assigned."
- field :latitude, Float, doc: "Decimal degrees, usually to six significant digits. Negative is South, positive is North."
- field :longitude, Float, doc: "Decimal degrees, usually to six significant digits. Negative is West, positive is East."
- field :altitude_ft, Float, blankish: ['', nil, 0, '0'], doc: "In feet."
- field :utc_offset, Float, doc: "Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5."
- field :dst_rule, String, doc: "Daylight savings time rule. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown). See the readme for more."
-
- UNRELIABLE_OPENFLIGHTS_IATA_VALUES = /^(7AK|AGA|AUQ|BDJ|BGW|BME|BPM|BXH|BZY|CAT|CEE|CEJ|CFS|CGU|CIO|CLV|CNN|DEE|DIB|DNM|DUH|DUR|FKI|GES|GSM|HKV|HOJ|HYD|IEO|IFN|IKA|IZA|JCU|JGS|KMW|KNC|LGQ|LUM|MCU|MCY|MDO|MOH|MON|MPH|MVF|NAY|NMA|NOE|NQY|OTU|OUI|PBV|PCA|PCB|PGK|PHO|PIF|PKN|PKY|PMK|PTG|PZO|QAS|QKT|QVY|RCM|RJL|RTG|SBG|SDZ|SFG|SIC|SIQ|SJI|SRI|STP|STU|SWQ|TJQ|TJS|TMC|TYA|UKC|VIY|VQS|VTS|WDH|WKM|WPR|WPU|ZQF)$/
-
- def id_is_faa?
- (icao =~ /^(?:K)/) || (icao.blank? && country == 'us')
- end
-
- def iata ; (id_is_faa? ? nil : iata_faa) unless iata_faa =~ UNRELIABLE_OPENFLIGHTS_IATA_VALUES end
- def faa ; (id_is_faa? ? iata_faa : nil ) end
- def altitude
- altitude_ft && (0.3048 * altitude_ft).round(1)
- end
-
- def receive_country(val)
- country = Geo::CountryNameLookup.for_alt_name(val, nil)
- p val unless country
- super(country ? country.country_id : val)
- end
-
- def to_airport
- attrs = self.compact_attributes.except(:altitude_ft)
- attrs[:altitude] = altitude
- attrs[:iata] = iata unless iata.to_s =~ UNRELIABLE_OPENFLIGHTS_IATA_VALUES
- attrs[:faa] = faa
- Airport.receive(attrs)
- end
-
- def self.load_airports(filename)
- load_csv(filename){|raw_airport| yield(raw_airport.to_airport) }
- end
-end
-
-### @export "raw_dataexpo_airport"
-class RawDataexpoAirport
- include Gorillib::Model
- include Gorillib::Model::LoadFromCsv
- include RawAirport
- self.csv_options = self.csv_options.merge(pop_headers: true)
-
- field :faa, String, doc: "the international airport abbreviation code"
- field :name, String, doc: "Airport name"
- field :city, String, blankish: ["NA"], doc: "city in which the airport is located"
- field :state, String, blankish: ["NA"], doc: "state in which the airport is located"
- field :country, String, doc: "country in which airport is located"
- field :latitude, Float, doc: "latitude of the airport"
- field :longitude, Float, doc: "longitude of the airport"
-
- def to_airport
- attrs = self.compact_attributes
- attrs[:icao] = "K#{faa}" if faa =~ /[A-Z]{3}/ && (not ['PR', 'AK', 'CQ', 'HI', 'AS', 'GU', 'VI'].include?(state)) && (country == 'us')
- Airport.receive(attrs)
- end
-
- def self.load_airports(filename)
- load_csv(filename){|raw_airport| yield(raw_airport.to_airport) }
- end
-end
-### @export "nil"
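Two conversions buried in the airport parsers above are easy to miss: the feet-to-meters altitude conversion in `RawOpenflightAirport#altitude`, and the "ICAO is 'K' plus the FAA code" rule that `RawDataexpoAirport#to_airport` applies to continental-US airports. A standalone sketch (method names here are illustrative):

```ruby
# States/territories whose airports do not take the "K" ICAO prefix,
# copied from the exclusion list in RawDataexpoAirport#to_airport.
NON_K_STATES = ['PR', 'AK', 'CQ', 'HI', 'AS', 'GU', 'VI']

# Feet to meters, rounded to one decimal, as in RawOpenflightAirport#altitude.
def altitude_m(altitude_ft)
  altitude_ft && (0.3048 * altitude_ft).round(1)
end

# Synthesize an ICAO code for a continental-US airport from its FAA code;
# returns nil when the rule does not apply.
def icao_for(faa, state, country)
  "K#{faa}" if faa =~ /[A-Z]{3}/ && country == 'us' && !NON_K_STATES.include?(state)
end
```

So the Heathrow sample entry's 83 ft becomes 25.3 m, and an FAA code of SFO in California yields KSFO, while HNL (Hawaii) correctly gets no K-prefix.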
129 munging/airline_flights/airport_id_unification.rb
@@ -1,129 +0,0 @@
-class Airport
-
- # [Hash] all options passed to the field not recognized by one of its own current fields
- attr_reader :_extra_attributes
-
- # # Airports whose IATA and FAA codes differ; all are in the US, so their ICAO is "K"+the FAA id
- # FAA_ICAO_FIXUP = {
- # "GRM" => "CKC", "CLD" => "CRQ", "SDX" => "SEZ", "AZA" => "IWA", "SCE" => "UNV", "BLD" => "BVU",
- # "LKE" => "W55", "HSH" => "HND", "BKG" => "BBG", "UST" => "SGJ", "LYU" => "ELO", "WFK" => "FVE",
- # "FRD" => "FHR", "ESD" => "ORS", "RKH" => "UZA", "NZC" => "VQQ", "SCF" => "SDL", "JCI" => "IXD",
- # "AVW" => "AVQ", "UTM" => "UTA", "ONP" => "NOP", }
- #
- # [:iata, :icao, :latitude, :longitude, :country, :city, :name].each do |attr|
- # define_method("of_#{attr}"){ @_extra_attributes[:"of_#{attr}"] }
- # define_method("de_#{attr}"){ @_extra_attributes[:"de_#{attr}"] }
- # end
- #
- # def lint_differences
- # errors = {}
- # return errors unless de_name.present? && of_name.present?
- # [
- # [:iata, of_iata, de_iata], [:icao, of_icao, de_icao], [:country, of_country, de_country],
- # [:city, of_city, de_city],
- # [:name, of_name, de_name],
- # ].each{|attr, of, de| next unless of && de ; errors[attr] = [of, de] if of != de }
- #
- # if (of_latitude && of_longitude && de_latitude && de_longitude)
- # lat_diff = (of_latitude - de_latitude ).abs
- # lng_diff = (of_longitude - de_longitude).abs
- # unless (lat_diff < 0.015) && (lng_diff < 0.015)
- # msg = [of_latitude, de_latitude, of_longitude, de_longitude, lat_diff, lng_diff].map{|val| "%9.4f" % val }.join(" ")
- # errors["distance"] = ([msg, of_city, de_city, of_name, de_name])
- # end
- # end
- #
- # errors
- # end
- #
- # AIRPORTS = Hash.new # unless defined?(AIRPORTS)
- # def self.load(of_filename, de_filename)
- # RawOpenflightAirport.load_csv(of_filename) do |raw_airport|
- # airport = raw_airport.to_airport
- # AIRPORTS[airport.iata_icao] = airport
- # end
- # RawDataexpoAirport.load_csv(de_filename) do |raw_airport|
- # airport = (AIRPORTS[raw_airport.iata_icao] ||= self.new)
- # if airport.de_name
- # warn "duplicate data for #{[iata, de_iata, icao, de_icao]}: #{raw_airport.to_tsv} #{airport.to_tsv}"
- # end
- # airport.receive!(raw_airport.airport_attrs)
- # end
- # AIRPORTS
- # end
-
- def self.load(dirname)
- load_csv(File.join(dirname, 'wikipedia_icao.tsv')) do |id_mapping|
- [:icao, :iata, :faa ].each do |attr|
- val = id_mapping.read_attribute(attr) or next
- next if (val == '.') || (val == '_')
- if that = ID_MAPPINGS[attr][val]
- lint = that.disagreements(id_mapping)
- puts [attr, val, "%-25s" % lint.inspect, id_mapping, that, "%-60s" % id_mapping.name, "%-25s" % that.name].join("\t") if lint.present?
- else
- ID_MAPPINGS[attr][val] = id_mapping
- end
- end
- # [:icao, :iata, :faa ].each do |attr|
- # val = id_mapping.read_attribute(attr)
- # ID_MAPPINGS[attr][val] = id_mapping
- # end
- end
- load_csv(File.join(dirname, 'wikipedia_iata.tsv')) do |id_mapping|
- # if not ID_MAPPINGS[:icao].has_key?(id_mapping.icao)
- # puts [:badicao, "%-25s" % "", id_mapping, " "*24, "%-60s" % id_mapping.name].join("\t")
- # end
- [:icao, :iata, :faa ].each do |attr|
- val = id_mapping.read_attribute(attr) or next
- next if (val == '.') || (val == '_')
- if that = ID_MAPPINGS[attr][val]
- lint = that.disagreements(id_mapping)
- puts [attr, val, "%-25s" % lint.inspect, id_mapping, that, "%-60s" % id_mapping.name, "%-25s" % that.name].join("\t") if lint.present?
- else
- ID_MAPPINGS[attr][val] = id_mapping
- end
- end
- end
-
- # def adopt_field(that, attr)
- # this_val = self.read_attribute(attr)
- # that_val = that.read_attribute(attr)
- # if name =~ /Bogus|Austin/i
- # puts [attr, this_val, that_val, attribute_set?(attr), that.attribute_set?(attr), to_tsv, that.to_tsv].join("\t")
- # end
- # if this_val && that_val
- # if (this_val != that_val) then warn [attr, this_val, that_val, name].join("\t") ; end
- # elsif that_val
- # write_attribute(that_val)
- # end
- # end
-
- def to_s
- attributes.values[0..2].join("\t")
- end
-
- def disagreements(that)
- errors = {}
- [:icao, :iata, :faa ].each do |attr|
- this_val = self.read_attribute(attr) or next
- that_val = that.read_attribute(attr) or next
- next if that_val == '.' || that_val == '_'
- errors[attr] = [this_val, that_val] if this_val != that_val
- end
- errors
- end
-
- def self.dump_ids(ids)
- "%s\t%s\t%s" % [icao, iata, faa]
- end
- def self.dump_mapping
- [:icao, :iata, :faa].map do |attr|
- "%-50s" % ID_MAP[attr].to_a.sort.map{|id, val| "#{id}:#{val.icao||' '}|#{val.iata||' '}|#{val.faa||' '}"}.join(";")
- end
- end
-
- def self.dump_info(kind, ids, reconciler, existing, *args)
- ex_str = [existing.map{|el| dump_ids(el.ids) }, "\t\t","\t\t","\t\t"].flatten[0..2]
- puts [kind, dump_ids(ids), dump_ids(reconciler.ids), ex_str, *args, dump_mapping.join("//") ].flatten.join("\t| ")
- end
-end
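The core of the reconciler above is the `disagreements` check: compare two ID-mapping records field by field, skip anything missing or marked with the "." / "_" placeholders, and collect only real conflicts. A self-contained sketch using a plain `Struct` in place of the Gorillib model (names illustrative):

```ruby
# Stand-in for the Gorillib id-mapping record.
Mapping = Struct.new(:icao, :iata, :faa)

# Collect {attr => [this_val, that_val]} for every field where the two
# records both have a value and those values differ, skipping the
# "." and "_" placeholder values, as in Airport#disagreements above.
def disagreements(this, that)
  errors = {}
  [:icao, :iata, :faa].each do |attr|
    this_val = this[attr] or next
    that_val = that[attr] or next
    next if that_val == '.' || that_val == '_'
    errors[attr] = [this_val, that_val] if this_val != that_val
  end
  errors
end
```

An empty hash means the two sources agree; the loader above prints any non-empty result as a lint line rather than silently overwriting.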
4 munging/airline_flights/airport_ok_chars.rb
@@ -1,4 +0,0 @@
-# -*- coding: utf-8 -*-
-
-
-OK_CHARS_RE = /[^a-zA-Z0-9\ \/\.\,\-\(\)\'ÁÂÄÅÇÉÍÎÑÖØÜÞàáâãäåæçèéêëìíîïðñóôõöøúüýāăčėęěğİıŁłńōőřŞşŠšţťūźŽžơț]/
156 munging/airline_flights/flight.rb
@@ -1,156 +0,0 @@
-# Raw data:
-# Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Can
-# 2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
-
-class RawAirlineFlight
- include Gorillib::Model
-
- field :date_year, Integer, position: 1, doc: "Year (1987-2008)"
- field :date_month, Integer, position: 2, doc: "Month (1-12)"
- field :date_day, Integer, position: 3, doc: "Day of month (1-31)"
- field :day_of_week, Integer, position: 4, doc: "Day of week -- 1 (Monday) - 7 (Sunday)"
- #
- field :act_dep_tod, String, position: 5, doc: "time of day for actual departure (local, hhmm)", blankish: [nil, '', 'NA']
- field :crs_dep_tod, String, position: 6, doc: "time of day for scheduled departure (local, hhmm)"
- field :act_arr_tod, String, position: 7, doc: "time of day for actual arrival (local, hhmm). Not adjusted for wrap-around.", blankish: [nil, '', 'NA']
- field :crs_arr_tod, String, position: 8, doc: "time of day for scheduled arrival (local, hhmm). Not adjusted for wrap-around."
- #
- field :unique_carrier, String, position: 9, doc: "unique carrier code", validates: { length: { in: 0..5 } }
- field :flight_num, Integer, position: 10, doc: "flight number"
- field :tail_num, String, position: 11, doc: "plane tail number", validates: { length: { in: 0..8 } }
- #
- field :act_duration, Integer, position: 12, doc: "actual flight time, in minutes", blankish: [nil, '', 'NA']
- field :crs_duration, Integer, position: 13, doc: "CRS flight time, in minutes"
- field :air_duration, Integer, position: 14, doc: "Air time, in minutes", blankish: [nil, '', 'NA']
- field :arr_delay, Integer, position: 15, doc: "arrival delay, in minutes", blankish: [nil, '', 'NA']
- field :dep_delay, Integer, position: 16, doc: "departure delay, in minutes", blankish: [nil, '', 'NA']
- field :from_airport, String, position: 17, doc: "Origin IATA airport code", validates: { length: { in: 0..3 } }
- field :into_airport, String, position: 18, doc: "Destination IATA airport code", validates: { length: { in: 0..3 } }
- field :distance_mi, Integer, position: 19, doc: "Flight distance, in miles"
- field :taxi_in_duration, Integer, position: 20, doc: "taxi in time, in minutes", blankish: [nil, '', 'NA']
- field :taxi_out_duration, Integer, position: 21, doc: "taxi out time in minutes", blankish: [nil, '', 'NA']
- #
- field :is_cancelled, :boolean_10, position: 22, doc: "was the flight cancelled?"
- field :cancellation_code, String, position: 23, doc: "Reason for cancellation (A = carrier, B = weather, C = NAS, D = security, Z = no cancellation)"
- field :is_diverted, :boolean_10, position: 24, doc: "Was the plane diverted?"
- field :carrier_delay, Integer, position: 25, doc: "in minutes"
- field :weather_delay, Integer, position: 26, doc: "in minutes"
- field :nas_delay, Integer, position: 27, doc: "in minutes"
- field :security_delay, Integer, position: 28, doc: "in minutes"
- field :late_aircraft_delay, Integer, position: 29, doc: "in minutes"
-
- def flight_date
- Time.new(date_year, date_month, date_day)
- end
-
- # uses the year / month / day, along with an "hhmm" string, to
- def inttime_from_hhmm(val, fencepost=nil)
- hour, minutes = [val.to_i / 100, val.to_i % 100]
- res = Time.utc(date_year, date_month, date_day, hour, minutes)
- # if before fencepost, we wrapped around in time
- res += (24 * 60 * 60) if fencepost && (res.to_i < fencepost)
- res.to_i
- end
-
- def act_dep_itime ; @act_dep_itime = inttime_from_hhmm(act_dep_tod) if act_dep_tod ; end
- def crs_dep_itime ; @crs_dep_itime = inttime_from_hhmm(crs_dep_tod) ; end
- def act_arr_itime ; @act_arr_itime = inttime_from_hhmm(act_arr_tod, act_dep_itime) if act_arr_tod ; end
- def crs_arr_itime ; @crs_arr_itime = inttime_from_hhmm(crs_arr_tod, crs_dep_itime) ; end
-
- def receive_tail_num(val) ; val = nil if val.to_s == "0" ; super(val) ; end
- def arr_delay(val) val = nil if val.to_s == 0 ; super(val) ; end
-
- def receive_cancellation_code(val) ; if val == "" then super("Z") else super(val) ; end ; end
-
- def to_airline_flight
- attrs = self.attributes.reject{|attr,val| [:year, :month, :day, :distance_mi].include?(attr) }
- attrs[:flight_datestr] = flight_date.strftime("%Y%m%d")
- attrs[:distance_km] = (distance_mi * 1.609_344).to_i
-
- attrs[:act_dep_tod] = "%04d" % act_dep_tod.to_i if act_dep_tod
- attrs[:crs_dep_tod] = "%04d" % crs_dep_tod.to_i if crs_dep_tod
- attrs[:act_arr_tod] = "%04d" % act_arr_tod.to_i if act_arr_tod
- attrs[:crs_arr_tod] = "%04d" % crs_arr_tod.to_i if crs_arr_tod
-
- attrs[:act_dep_itime] = act_dep_itime
- attrs[:crs_dep_itime] = crs_dep_itime
- attrs[:act_arr_itime] = act_arr_itime
- attrs[:crs_arr_itime] = crs_arr_itime
-
- AirlineFlight.receive(attrs)
- end
-end
-
-class AirlineFlight
- include Gorillib::Model
-
- # Identifier
- field :flight_datestr, String, position: 0, doc: "Date, YYYYMMDD. Use flight_date method if you want a date"
- field :unique_carrier, String, position: 1, doc: "Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2).", validates: { length: { in: 0..5 } }
- field :flight_num, Integer, position: 2, doc: "flight number"
- # Flight
- field :from_airport, String, position: 3, doc: "Origin IATA airport code", validates: { length: { in: 0..3 } }
- field :into_airport, String, position: 4, doc: "Destination IATA airport code", validates: { length: { in: 0..3 } }
- field :tail_num, String, position: 5, doc: "Plane tail number", validates: { length: { in: 0..8 } }
- field :distance_km, Integer, position: 6, doc: "Flight distance, in kilometers"
- field :day_of_week, Integer, position: 7, doc: "Day of week -- 1 (Monday) - 7 (Sunday)"
- # Departure and Arrival Absolute Time
- field :crs_dep_itime, IntTime, position: 8, doc: "scheduled departure time (utc epoch seconds)"
- field :crs_arr_itime, IntTime, position: 9, doc: "scheduled arrival time (utc epoch seconds)"
- field :act_dep_itime, IntTime, position: 10, doc: "actual departure time (utc epoch seconds)"
- field :act_arr_itime, IntTime, position: 11, doc: "actual arrival time (utc epoch seconds)"
- # Departure and Arrival Local Time of Day
- field :crs_dep_tod, String, position: 12, doc: "time of day for scheduled departure (local, hhmm)"
- field :crs_arr_tod, String, position: 13, doc: "time of day for scheduled arrival (local, hhmm). Not adjusted for wrap-around."
- field :act_dep_tod, String, position: 14, doc: "time of day for actual departure (local, hhmm)"
- field :act_arr_tod, String, position: 15, doc: "time of day for actual arrival (local, hhmm). Not adjusted for wrap-around."
- # Duration
- field :crs_duration, Integer, position: 16, doc: "CRS flight time, in minutes"
- field :act_duration, Integer, position: 17, doc: "Actual flight time, in minutes"
- field :air_duration, Integer, position: 18, doc: "Air time, in minutes"
- field :taxi_in_duration, Integer, position: 19, doc: "taxi in time, in minutes"
- field :taxi_out_duration, Integer, position: 20, doc: "taxi out time in minutes"
- # Delay
- field :is_diverted, :boolean_10, position: 21, doc: "Was the plane diverted? The actual_duration column remains NULL for all diverted flights."
- field :is_cancelled, :boolean_10, position: 22, doc: "was the flight cancelled?"
- field :cancellation_code, String, position: 23, doc: "Reason for cancellation (A = carrier, B = weather, C = NAS, D = security, Z = no cancellation)"
- field :dep_delay, Integer, position: 24, doc: "Difference in minutes between scheduled and actual departure time. Early departures show negative numbers. "
- field :arr_delay, Integer, position: 25, doc: "Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers."
- field :carrier_delay, Integer, position: 26, doc: "Carrier delay, in minutes"
- field :weather_delay, Integer, position: 27, doc: "Weather delay, in minutes"
- field :nas_delay, Integer, position: 28, doc: "National Air System delay, in minutes"
- field :security_delay, Integer, position: 29, doc: "Security delay, in minutes"
- field :late_aircraft_delay, Integer, position: 30, doc: "Late Aircraft delay, in minutes"
-
- def to_tsv
- attrs = attributes
- attrs[:is_cancelled] = is_cancelled ? 1 : 0
- attrs[:is_diverted] = is_diverted ? 1 : 0
- attrs[:act_dep_itime] ||= ' '
- attrs[:act_arr_itime] ||= ' '
-
- # FIXME
- attrs[:act_duration] = ((crs_arr_itime - crs_dep_itime) / 60.0).to_i
- attrs[:air_duration] = attrs[:act_duration] - attrs[:crs_duration]
- attrs.each{|key, val| attrs[key] = val.to_s[-7..-1] if val.to_s.length > 7 } # FIXME: for testing
-
- attrs.values.join("\t")
- end
-
- def flight_date
- @flight_date ||= Gorillib::Factory::DateFactory.receive(flight_datestr)
- end
-
- # checks that the record is sane
- def lint
- {
- act_duration: (!act_arr_itime) || (act_arr_itime - act_dep_itime == act_duration * 60),
- crs_duration: (!crs_arr_itime) || (crs_arr_itime - crs_dep_itime == crs_duration * 60),
- cancelled_has_code: (is_cancelled == (cancellation_code != "Z")),
- cancellation_code: (%w[A B C D Z].include?(cancellation_code)),
- duration_parts: (!act_duration) || (act_duration == (air_duration + taxi_in_duration + taxi_out_duration)),
- dep_delay: (!act_dep_itime) || (dep_delay == (act_dep_itime - crs_dep_itime)/60.0),
- arr_delay: (!act_arr_itime) || (arr_delay == (act_arr_itime - crs_arr_itime)/60.0),
- }
- end
-end
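The lint hash above encodes arithmetic invariants between the schedule fields: the scheduled duration must match the scheduled timestamps, and each delay must equal the corresponding timestamp difference in minutes. A minimal standalone sketch of the same arithmetic, with a helper name and sample values of my own:

```ruby
# Sketch of Flight#lint's duration/delay arithmetic. Field names mirror the
# model; the helper name and the sample values below are hypothetical.
def lint_flight(crs_dep:, crs_arr:, act_dep:, act_arr:, crs_duration:, dep_delay:, arr_delay:)
  {
    crs_duration: crs_arr - crs_dep == crs_duration * 60,  # schedule self-consistent?
    dep_delay:    (act_dep - crs_dep) / 60 == dep_delay,   # delay fields are in minutes
    arr_delay:    (act_arr - crs_arr) / 60 == arr_delay,
  }.reject { |_, ok| ok }.keys                             # names of failing checks
end

# A 90-minute scheduled flight that left 5 minutes late and arrived on time:
failures = lint_flight(
  crs_dep: 1_000_000, crs_arr: 1_005_400,  # epoch seconds, 5400 s = 90 min apart
  act_dep: 1_000_300, act_arr: 1_005_400,
  crs_duration: 90, dep_delay: 5, arr_delay: 0)
failures  # => [] -- every invariant holds
```

As in the model's lint, an empty result means the record is internally consistent; any key returned names the suspect field.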
4 munging/airline_flights/models.rb
@@ -1,4 +0,0 @@
-require_relative './airline'
-require_relative './airport'
-require_relative './route'
-require_relative './flight'
26 munging/airline_flights/parse.rb
@@ -1,26 +0,0 @@
-
-# see also spec/examples/munging/airline_flights_spec.rb
-
- puts described_class.field_names.map{|fn| fn[0..6] }.join("\t")
- raw_airports = RawDataexpoAirport.load_csv(de_airports_filename)
- raw_airports.each do |airport|
- puts airport.to_tsv
- end
-
- puts described_class.field_names.join("\t") # .map{|fn| fn[0..6] }.join("\t")
- raw_airports = described_class.load_csv(raw_airports_filename)
- raw_airports.each do |airport|
- # puts airport.to_tsv
- linted = airport.lint
- puts [airport.iata, airport.icao, linted.inspect, airport.to_tsv, ].join("\t") if linted.present?
- end
-
- Airport.load(raw_airports_filename, de_airports_filename)
- Airport::AIRPORTS.each{|id,airport|
- #puts airport.to_tsv
- linted = airport.lint
- warn [airport.iata, airport.icao, airport.de_iata, "%-25s" % airport.name, linted.inspect].join("\t") if linted.present?
- }
-
-
-# Model.from_tuple(...)
142 munging/airline_flights/reconcile_airports.rb
@@ -1,142 +0,0 @@
-require_relative './models'
-require 'gorillib/model/reconcilable'
-
-class Airport
- include Gorillib::Model::Reconcilable
- attr_accessor :_origin # source of the record
-
- def conflicting_attribute!(attr, this_val, that_val)
- case attr
- when :name, :city, :airport_ofid then return :pass
- when :latitude, :longitude then return true if (this_val - that_val).abs < 3
- when :altitude then return true if (this_val - that_val).abs < 5
- end
- super
- end
-
- def ids
- [:icao, :iata, :faa].hashify{|attr| public_send(attr) }.compact
- end
-end
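conflicting_attribute! above tolerates small disagreements between sources before declaring a real conflict. The same tolerance rules as a standalone predicate (the helper name is mine; the thresholds are the ones in the class):

```ruby
# Tolerance rules from Airport#conflicting_attribute!: coordinates may differ
# by up to 3 degrees and altitude by up to 5 units across sources before the
# values count as conflicting. The helper name is hypothetical.
def values_agree?(attr, a, b)
  case attr
  when :latitude, :longitude then (a - b).abs < 3
  when :altitude             then (a - b).abs < 5
  else                            a == b
  end
end

values_agree?(:latitude, 35.552, 35.549)  # => true  (tiny disagreement tolerated)
values_agree?(:altitude, 21, 30)          # => false (beyond the 5-unit threshold)
```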
-
-#
-# Loads the Airport identifier tables scraped from Wikipedia
-#
-class RawAirportIdentifier < Airport
- include RawAirport
- include Gorillib::Model::LoadFromTsv
-
- def self.from_tuple(icao, iata, faa, name, city=nil, *_)
- self.new({icao: icao, iata: iata, faa: faa, name: name, city: city}.compact_blank)
- end
-
- def self.load_airports(filename, &block)
- load_tsv(filename, num_fields: 4..6, &block)
- end
-end
-
-class Airport
- #
- # Reconciler for Airports
- #
- # For each airport in turn across openflights, dataexpo and the two scraped
- # identifier sets,
- # identifier sets, reconcile all records sharing an identifier into one consensus airport.
- #
- class IdReconciler
- include Gorillib::Model
- include Gorillib::Model::LoadFromCsv
- include Gorillib::Model::Reconcilable
- self.csv_options = { col_sep: "\t", num_fields: 3..6 }
-
- # Map the reconcilers to each ID they have anything to say about
- ID_MAP = { icao: {}, iata: {}, faa: {} }
-
- field :opinions, Array, default: Array.new, doc: "every record having an id in common with the other records in this field"
-
- def ids
- opinions.flat_map{|op| op.ids.to_a }.uniq.compact
- end
-
- def self.load_all
- Log.info "Loading all Airports and reconciling"
- @airports = Array.new
- RawDataexpoAirport .load_airports(:dataexpo_raw_airports ){|airport| register(:dataexpo, airport) }
- RawOpenflightAirport.load_airports(:openflights_raw_airports){|airport| register(:openflights, airport) }
- RawAirportIdentifier.load_airports(:wikipedia_icao ){|airport| register(:wp_icao, airport) }
- RawAirportIdentifier.load_airports(:wikipedia_iata ){|airport| register(:wp_iata, airport) }
- RawAirportIdentifier.load_airports(:wikipedia_us_abroad ){|airport| register(:wp_us_abroad, airport) }
-
- recs = ID_MAP.map{|attr, hsh| hsh.sort.map(&:last) }.flatten.uniq
- recs.each do |rec|
- consensus = rec.reconcile
- # lint = consensus.lint
- # puts "%-79s\t%s" % [lint, consensus.to_s[0..100]] if lint.present?
- @airports << consensus
- end
- end
-
- def self.airports
- @airports
- end
-
- def self.exemplars
- Airport::EXEMPLARS.map do |iata|
- ID_MAP[:iata][iata].reconcile
- end
- end
-
- def reconcile
- consensus = Airport.new
- clean = opinions.all?{|op| consensus.adopt(op) }
- # puts "\t#{consensus.inspect}"
- puts "confl\t#{self.inspect}" if not clean
- consensus
- end
-
- def adopt_opinions(vals, _)
- self.opinions = vals + self.opinions
- self.opinions.uniq!
- end
-
- # * find all existing reconcilers that share an ID with that record
- # * unify them into one reconciler
- # * store it back under all the IDs
- #
- # Suppose our dataset has 3 identifiers, which look like
- #
- # a S
- # S 88
- # a Z
- # b
- # Q
- # b Q 77
- #
- # We will wind up with these two reconcilers:
- #
- # <a S 88 opinions: [a,S, ],[S, ,88],[a,Z, ]>
- # <b Q 77 opinions: [b, , ],[ ,Q, ],[b,Q,77]>
- #
- def self.register(origin, obj)
- obj._origin = origin
- # get the existing reconcilers
- existing = obj.ids.map{|attr, id| ID_MAP[attr][id] }.compact.uniq
- # push the new object in, and pull the most senior one out
- existing.unshift(self.new(opinions: [obj]))
- reconciler = existing.shift
- # unite them into the reconciler
- existing.each{|that| reconciler.adopt(that) }
- # save the reconciler under each of the ids.
- reconciler.ids.each{|attr, id| ID_MAP[attr][id] = reconciler }
- end
-
- def inspect
- str = "#<#{self.class.name} #{ids}"
- opinions.each do |op|
- str << "\n\t #{op._origin}\t#{op}"
- end
- str << ">"
- end
- end
-
-end
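The register step documented in the comments above is effectively incremental union-find over the identifier columns: each new record joins every existing group it shares an id with, and those groups collapse into one. A simplified standalone sketch using plain hashes instead of Gorillib models (all names here are mine):

```ruby
# Incremental grouping of records that share any identifier, in the spirit of
# IdReconciler.register. Records are plain hashes like {iata: 'a', icao: 'S'}.
ID_INDEX = { icao: {}, iata: {}, faa: {} }
GROUPS   = []

def register(rec)
  ids = rec.select { |attr, id| ID_INDEX.key?(attr) && id }
  # every group already holding one of this record's ids
  existing = ids.map { |attr, id| ID_INDEX[attr][id] }.compact.uniq
  group = existing.shift || (GROUPS << []).last  # reuse the senior group, else open one
  group << rec
  existing.each { |other| group.concat(other); other.clear }  # unify overlapping groups
  group.each do |r|
    r.each { |attr, id| ID_INDEX[attr][id] = group if ID_INDEX.key?(attr) && id }
  end
end

# The worked example from the comment above, with faa standing in for "88"/"77":
register(iata: 'a', icao: 'S')
register(icao: 'S', faa: '88')   # shares 'S', joins the first group
register(iata: 'b', faa: '77')   # shares nothing, opens a second group
```

After these three calls the index resolves 'a', 'S' and '88' to one group and 'b'/'77' to another, matching the two reconcilers in the comment's example.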
35 munging/airline_flights/route.rb
@@ -1,35 +0,0 @@
-
-
-# As of January 2012, the OpenFlights/Airline Route Mapper Route Database
-# contains 59036 routes between 3209 airports on 531 airlines [spanning the
-# globe](http://openflights.org/demo/openflights-routedb-2048.png). If you
-# enjoy this data, please consider [visiting their page and
-# donating](http://openflights.org/data.html)
-#
-# > Notes: Routes are directional: if an airline operates services from A to B
-# > and from B to A, both A-B and B-A are listed separately. Routes where one
-# > carrier operates both its own and codeshare flights are listed only once.
-#
-# Sample entries
-#
-# BA,1355,SIN,3316,LHR,507,,0,744 777
-# BA,1355,SIN,3316,MEL,3339,Y,0,744
-# TOM,5013,ACE,1055,BFS,465,,0,320
-#
-class RawOpenflightRoute
- include Gorillib::Model
-
- field :iataicao, String, doc: "2-letter (IATA) or 3-letter (ICAO) code of the airline."
- field :airline_ofid, Integer, doc: "Unique OpenFlights identifier for airline (see Airline)."
- field :from_airport_iataicao, String, doc: "3-letter (IATA) or 4-letter (ICAO) code of the source airport."
- field :from_airport_ofid, Integer, doc: "Unique OpenFlights identifier for source airport (see Airport)"
- field :into_airport_iataicao, String, doc: "3-letter (IATA) or 4-letter (ICAO) code of the destination airport."
- field :into_airport_ofid, Integer, doc: "Unique OpenFlights identifier for destination airport (see Airport)"
- field :codeshare, :boolean, doc: "true if this flight is a codeshare (that is, not operated by Airline, but another carrier); empty otherwise."
- field :stops, Integer, doc: "Number of stops on this flight, or '0' for direct"
- field :equipment_list, String, doc: "3-letter codes for plane type(s) generally used on this flight, separated by spaces"
-
- def receive_codeshare(val)
- super(case val when "Y" then true when "N" then false else val ; end)
- end
-end
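receive_codeshare above coerces the route file's "Y"/blank codeshare column before the field's :boolean coercion sees it. The same coercion as a standalone sketch; mapping a blank cell to nil (rather than passing it through) is my assumption:

```ruby
# OpenFlights marks codeshares with "Y" and leaves the column blank otherwise.
# The helper name is hypothetical, and the blank-to-nil mapping is an assumption.
def coerce_codeshare(val)
  case val
  when 'Y'     then true
  when 'N'     then false
  when nil, '' then nil
  else              val   # let unexpected values surface downstream
  end
end

coerce_codeshare('Y')  # => true
coerce_codeshare('')   # => nil
```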
1  munging/airline_flights/tasks.rake
62 munging/airline_flights/timezone_fixup.rb
@@ -1,62 +0,0 @@
-require 'date'
-require 'gorillib/hash/zip'
-
-class Airport
-
- class TimezoneFixup
-
- YEARS = (2010 .. 2012).to_a
-
- DST_RULES = {
- 'E' => { name: 'European', beg_doy: 'last Sunday in March', end_doy: 'last Sunday in October', beg_dates: {}, end_dates: {}, used_in: 'all European countries (except Iceland), as well as Greenland, Lebanon, Russia and Tunisia. Jordan and Syria are almost the same, starting and ending on Friday instead of Sunday. European DST is also used to (crudely) approximate Iranian DST, although they actually use an entirely different calendar.', },
- 'A' => { name: 'US/Canada', beg_doy: '2nd Sunday in March', end_doy: '1st Sunday in November', beg_dates: {}, end_dates: {}, used_in: 'the United States (except Arizona, Hawaii and island territories) and Canada (with convoluted exceptions).', },
- 'S' => { name: 'South American', beg_doy: '3rd Sunday in March', end_doy: '3rd Sunday in October', southern: true, beg_dates: {}, end_dates: {}, used_in: 'With some variance in the exact dates, in Argentina, Chile, Mexico, Paraguay, Uruguay as well as the African states of Namibia and Mauritius.', },
- 'O' => { name: 'Australia', beg_doy: '1st Sunday in April', end_doy: '1st Sunday in October', southern: true, beg_dates: {}, end_dates: {}, used_in: 'Australia, except for Queensland and the Northern Territory.' },
- 'Z' => { name: 'New Zealand', beg_doy: '1st Sunday in April', end_doy: 'last Sunday in September', southern: true, beg_dates: {}, end_dates: {}, used_in: 'New Zealand', },
- 'N' => { name: 'None', beg_doy: nil, end_doy: nil, beg_dates: {}, end_dates: {}, used_in: 'DST not observed.', },
- 'U' => { name: 'Unknown', beg_doy: nil, end_doy: nil, beg_dates: {}, end_dates: {}, used_in: 'DST status not known. The same as "None".', },
- }
-
- DST_RULES['E'][:beg_dates] = { 1987 => "1987-03-29", 1988 => "1988-03-27", 1989 => "1989-03-26", 1990 => "1990-03-25", 1991 => "1991-03-31", 1992 => "1992-03-29", 1993 => "1993-03-28", 1994 => "1994-03-27", 1995 => "1995-03-26", 1996 => "1996-03-31", 1997 => "1997-03-30", 1998 => "1998-03-29", 1999 => "1999-03-28", 2000 => "2000-03-26", 2001 => "2001-03-25", 2002 => "2002-03-31", 2003 => "2003-03-30", 2004 => "2004-03-28", 2005 => "2005-03-27", 2006 => "2006-03-26", 2007 => "2007-03-25", 2008 => "2008-03-30", 2009 => "2009-03-29", 2010 => "2010-03-28", 2011 => "2011-03-27", 2012 => "2012-03-25", 2013 => "2013-03-31", 2014 => "2014-03-30", 2015 => "2015-03-29", 2016 => "2016-03-27", 2017 => "2017-03-26", 2018 => "2018-03-25", 2019 => "2019-03-31", 2020 => "2020-03-29", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['E'][:end_dates] = { 1987 => "1987-10-25", 1988 => "1988-10-30", 1989 => "1989-10-29", 1990 => "1990-10-28", 1991 => "1991-10-27", 1992 => "1992-10-25", 1993 => "1993-10-31", 1994 => "1994-10-30", 1995 => "1995-10-29", 1996 => "1996-10-27", 1997 => "1997-10-26", 1998 => "1998-10-25", 1999 => "1999-10-31", 2000 => "2000-10-29", 2001 => "2001-10-28", 2002 => "2002-10-27", 2003 => "2003-10-26", 2004 => "2004-10-31", 2005 => "2005-10-30", 2006 => "2006-10-29", 2007 => "2007-10-28", 2008 => "2008-10-26", 2009 => "2009-10-25", 2010 => "2010-10-31", 2011 => "2011-10-30", 2012 => "2012-10-28", 2013 => "2013-10-27", 2014 => "2014-10-26", 2015 => "2015-10-25", 2016 => "2016-10-30", 2017 => "2017-10-29", 2018 => "2018-10-28", 2019 => "2019-10-27", 2020 => "2020-10-25", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['A'][:beg_dates] = { 1987 => "1987-03-08", 1988 => "1988-03-13", 1989 => "1989-03-12", 1990 => "1990-03-11", 1991 => "1991-03-10", 1992 => "1992-03-08", 1993 => "1993-03-14", 1994 => "1994-03-13", 1995 => "1995-03-12", 1996 => "1996-03-10", 1997 => "1997-03-09", 1998 => "1998-03-08", 1999 => "1999-03-14", 2000 => "2000-03-12", 2001 => "2001-03-11", 2002 => "2002-03-10", 2003 => "2003-03-09", 2004 => "2004-03-14", 2005 => "2005-03-13", 2006 => "2006-03-12", 2007 => "2007-03-11", 2008 => "2008-03-09", 2009 => "2009-03-08", 2010 => "2010-03-14", 2011 => "2011-03-13", 2012 => "2012-03-11", 2013 => "2013-03-10", 2014 => "2014-03-09", 2015 => "2015-03-08", 2016 => "2016-03-13", 2017 => "2017-03-12", 2018 => "2018-03-11", 2019 => "2019-03-10", 2020 => "2020-03-08", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['A'][:end_dates] = { 1987 => "1987-11-01", 1988 => "1988-11-06", 1989 => "1989-11-05", 1990 => "1990-11-04", 1991 => "1991-11-03", 1992 => "1992-11-01", 1993 => "1993-11-07", 1994 => "1994-11-06", 1995 => "1995-11-05", 1996 => "1996-11-03", 1997 => "1997-11-02", 1998 => "1998-11-01", 1999 => "1999-11-07", 2000 => "2000-11-05", 2001 => "2001-11-04", 2002 => "2002-11-03", 2003 => "2003-11-02", 2004 => "2004-11-07", 2005 => "2005-11-06", 2006 => "2006-11-05", 2007 => "2007-11-04", 2008 => "2008-11-02", 2009 => "2009-11-01", 2010 => "2010-11-07", 2011 => "2011-11-06", 2012 => "2012-11-04", 2013 => "2013-11-03", 2014 => "2014-11-02", 2015 => "2015-11-01", 2016 => "2016-11-06", 2017 => "2017-11-05", 2018 => "2018-11-04", 2019 => "2019-11-03", 2020 => "2020-11-01", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['S'][:beg_dates] = { 1987 => "1987-10-18", 1988 => "1988-10-16", 1989 => "1989-10-15", 1990 => "1990-10-21", 1991 => "1991-10-20", 1992 => "1992-10-18", 1993 => "1993-10-17", 1994 => "1994-10-16", 1995 => "1995-10-15", 1996 => "1996-10-20", 1997 => "1997-10-19", 1998 => "1998-10-18", 1999 => "1999-10-17", 2000 => "2000-10-15", 2001 => "2001-10-21", 2002 => "2002-10-20", 2003 => "2003-10-19", 2004 => "2004-10-17", 2005 => "2005-10-16", 2006 => "2006-10-15", 2007 => "2007-10-21", 2008 => "2008-10-19", 2009 => "2009-10-18", 2010 => "2010-10-17", 2011 => "2011-10-16", 2012 => "2012-10-21", 2013 => "2013-10-20", 2014 => "2014-10-19", 2015 => "2015-10-18", 2016 => "2016-10-16", 2017 => "2017-10-15", 2018 => "2018-10-21", 2019 => "2019-10-20", 2020 => "2020-10-18", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['S'][:end_dates] = { 1987 => "1987-03-15", 1988 => "1988-03-20", 1989 => "1989-03-19", 1990 => "1990-03-18", 1991 => "1991-03-17", 1992 => "1992-03-15", 1993 => "1993-03-21", 1994 => "1994-03-20", 1995 => "1995-03-19", 1996 => "1996-03-17", 1997 => "1997-03-16", 1998 => "1998-03-15", 1999 => "1999-03-21", 2000 => "2000-03-19", 2001 => "2001-03-18", 2002 => "2002-03-17", 2003 => "2003-03-16", 2004 => "2004-03-21", 2005 => "2005-03-20", 2006 => "2006-03-19", 2007 => "2007-03-18", 2008 => "2008-03-16", 2009 => "2009-03-15", 2010 => "2010-03-21", 2011 => "2011-03-20", 2012 => "2012-03-18", 2013 => "2013-03-17", 2014 => "2014-03-16", 2015 => "2015-03-15", 2016 => "2016-03-20", 2017 => "2017-03-19", 2018 => "2018-03-18", 2019 => "2019-03-17", 2020 => "2020-03-15", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['O'][:beg_dates] = { 1987 => "1987-10-04", 1988 => "1988-10-02", 1989 => "1989-10-01", 1990 => "1990-10-07", 1991 => "1991-10-06", 1992 => "1992-10-04", 1993 => "1993-10-03", 1994 => "1994-10-02", 1995 => "1995-10-01", 1996 => "1996-10-06", 1997 => "1997-10-05", 1998 => "1998-10-04", 1999 => "1999-10-03", 2000 => "2000-10-01", 2001 => "2001-10-07", 2002 => "2002-10-06", 2003 => "2003-10-05", 2004 => "2004-10-03", 2005 => "2005-10-02", 2006 => "2006-10-01", 2007 => "2007-10-07", 2008 => "2008-10-05", 2009 => "2009-10-04", 2010 => "2010-10-03", 2011 => "2011-10-02", 2012 => "2012-10-07", 2013 => "2013-10-06", 2014 => "2014-10-05", 2015 => "2015-10-04", 2016 => "2016-10-02", 2017 => "2017-10-01", 2018 => "2018-10-07", 2019 => "2019-10-06", 2020 => "2020-10-04", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['O'][:end_dates] = { 1987 => "1987-04-05", 1988 => "1988-04-03", 1989 => "1989-04-02", 1990 => "1990-04-01", 1991 => "1991-04-07", 1992 => "1992-04-05", 1993 => "1993-04-04", 1994 => "1994-04-03", 1995 => "1995-04-02", 1996 => "1996-04-07", 1997 => "1997-04-06", 1998 => "1998-04-05", 1999 => "1999-04-04", 2000 => "2000-04-02", 2001 => "2001-04-01", 2002 => "2002-04-07", 2003 => "2003-04-06", 2004 => "2004-04-04", 2005 => "2005-04-03", 2006 => "2006-04-02", 2007 => "2007-04-01", 2008 => "2008-04-06", 2009 => "2009-04-05", 2010 => "2010-04-04", 2011 => "2011-04-03", 2012 => "2012-04-01", 2013 => "2013-04-07", 2014 => "2014-04-06", 2015 => "2015-04-05", 2016 => "2016-04-03", 2017 => "2017-04-02", 2018 => "2018-04-01", 2019 => "2019-04-07", 2020 => "2020-04-05", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['Z'][:beg_dates] = { 1987 => "1987-09-27", 1988 => "1988-09-25", 1989 => "1989-09-24", 1990 => "1990-09-30", 1991 => "1991-09-29", 1992 => "1992-09-27", 1993 => "1993-09-26", 1994 => "1994-09-25", 1995 => "1995-09-24", 1996 => "1996-09-29", 1997 => "1997-09-28", 1998 => "1998-09-27", 1999 => "1999-09-26", 2000 => "2000-09-24", 2001 => "2001-09-30", 2002 => "2002-09-29", 2003 => "2003-09-28", 2004 => "2004-09-26", 2005 => "2005-09-25", 2006 => "2006-09-24", 2007 => "2007-09-30", 2008 => "2008-09-28", 2009 => "2009-09-27", 2010 => "2010-09-26", 2011 => "2011-09-25", 2012 => "2012-09-30", 2013 => "2013-09-29", 2014 => "2014-09-28", 2015 => "2015-09-27", 2016 => "2016-09-25", 2017 => "2017-09-24", 2018 => "2018-09-30", 2019 => "2019-09-29", 2020 => "2020-09-27", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
- DST_RULES['Z'][:end_dates] = { 1987 => "1987-04-05", 1988 => "1988-04-03", 1989 => "1989-04-02", 1990 => "1990-04-01", 1991 => "1991-04-07", 1992 => "1992-04-05", 1993 => "1993-04-04", 1994 => "1994-04-03", 1995 => "1995-04-02", 1996 => "1996-04-07", 1997 => "1997-04-06", 1998 => "1998-04-05", 1999 => "1999-04-04", 2000 => "2000-04-02", 2001 => "2001-04-01", 2002 => "2002-04-07", 2003 => "2003-04-06", 2004 => "2004-04-04", 2005 => "2005-04-03", 2006 => "2006-04-02", 2007 => "2007-04-01", 2008 => "2008-04-06", 2009 => "2009-04-05", 2010 => "2010-04-04", 2011 => "2011-04-03", 2012 => "2012-04-01", 2013 => "2013-04-07", 2014 => "2014-04-06", 2015 => "2015-04-05", 2016 => "2016-04-03", 2017 => "2017-04-02", 2018 => "2018-04-01", 2019 => "2019-04-07", 2020 => "2020-04-05", }.tap{|hsh| hsh.each{|year,date_str| hsh[year] = Date.parse(date_str) } }
-
- def self.parse_boundary(str, *args)
- require 'chronic'
- rank, weekday, art, month = str.split(/\s+/)
- if rank == 'last'
- val = ['5th', '4th'].map{|wk| Chronic.parse([wk, weekday, art, month].join(' '), *args) }.compact.first
- else
- val = Chronic.parse(str, *args)
- end
- Date.new(val.year, val.month, val.day)
- end
-
- def self.beg_date(rule, year)
- DST_RULES[rule][:beg_dates][year] ||= parse_boundary(DST_RULES[rule][:beg_doy], now: Time.utc(year, 1, 1))
- end
- def self.end_date(rule, year)
- DST_RULES[rule][:end_dates][year] ||= parse_boundary(DST_RULES[rule][:end_doy], now: Time.utc(year, 1, 1))
- end
-
- def self.table
- %w[E A S O Z].each{|rule| YEARS.each{|year| beg_date(rule, year) ; end_date(rule, year) } }
- DST_RULES
- end
-
- def self.dst?(rule, val)
- early = beg_date(rule, val.year)
- late = end_date(rule, val.year)
- in_range = (val >= early) && (val < late)
- DST_RULES[rule][:southern] ? (not in_range) : in_range
- end
-
- end
-end
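parse_boundary above leans on the chronic gem (with a "5th-else-4th" trick for "last") to resolve phrases like "2nd Sunday in March". The same boundaries can be computed with the standard library alone; a sketch with a helper name of my own:

```ruby
require 'date'

# Resolve "nth <weekday> in <month>" without chronic: walk forward from the
# 1st of the month, or backward from its end for :last. Helper name is mine.
def nth_weekday(year, month, weekday, nth)
  if nth == :last
    d = Date.new(year, month, -1)      # last day of the month
    d -= 1 until d.wday == weekday
  else
    d = Date.new(year, month, 1)
    d += 1 until d.wday == weekday     # first such weekday in the month
    d += 7 * (nth - 1)
  end
  d
end

nth_weekday(2012, 3, 0, 2)      # 2nd Sunday in March 2012 => 2012-03-11 (US DST begins)
nth_weekday(2012, 3, 0, :last)  # last Sunday in March 2012 => 2012-03-25 (EU DST begins)
```

Both results agree with the memoized beg_dates tables above for 2012.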
167 munging/airline_flights/topcities.rb
@@ -1,167 +0,0 @@
-#!/usr/bin/env ruby
-require('rake')
-require_relative('../../rake_helper')
-require_relative './models'
-
-Pathname.register_paths(
- af_data: [:data, 'airline_flights'],
- af_work: [:work, 'airline_flights'],
- af_code: File.dirname(__FILE__),
- airport_identifiers: [:af_work, "airport_identifiers.tsv" ],
- )
-
-AIRPORTS_TO_MATCH = [
- [ 'Tokyo', 1, "HND", ],
- [ 'Guangzhou', 2, "CAN", ],
- [ 'Seoul', 3, "ICN", ],
- [ 'Shanghai', 4, "PVG", ],
- [ 'Mexico.*City', 5, "MEX", ],
- [ 'Delhi', 6, "DEL", ],
- [ 'New.*York', 7, "JFK", ],
- [ 'S.*o.*Paulo', 8, "GRU", ],
- [ 'Mumbai|Bombay', 9, "BOM", ],
- [ 'Manila', 10, "MNL", ],
- [ 'Jakarta', 11, "CGK", ],
- [ 'Los.*Angeles', 12, "LAX", ],
- [ 'Karachi', 13, "KHI", ],
- [ 'Osaka', 14, "KIX", ],
- [ 'Beijing', 15, "PEK", ],
- [ 'Moscow', 16, "SVO", ],
- [ 'Cairo', 17, "CAI", ],
- [ 'Kolkata|Calcutta', 18, "CCU", ],
- [ 'Buenos.*Aires', 19, "EZE", ],
- [ 'Dhaka', 20, "DAC", ],
- [ 'Bangkok', 21, "BKK", ],
- [ 'Tehran|Abyek', 22, "IKA", ],
- [ 'Istanbul', 23, "IST", ],
- [ 'Janeiro', 24, "GIG", ],
- [ 'London', 25, "LHR", ],
- [ 'Lagos', 26, "LOS", ],
- [ 'Paris', 27, "CDG", ],
- [ 'Chicago', 28, "ORD", ],
- [ 'Kinshasa', 29, "FIH", ],
- [ 'Lima', 30, "LIM", ],
- [ 'Wuhan', 31, "WUH", ],
- [ 'Bangalore', 32, "BLR", ],
- [ 'Bogot.*', 33, "BOG", ],
- [ 'Taipei', 34, "TSA", ],
- [ 'Washington|Arling', 35, "DCA", ],
- [ 'Johannesburg', 36, "JNB", ],
- [ 'Saigon|Ho.Chi.M', 37, "SGN", ],
- [ 'San.*Francisco', 38, "SFO", ],
- [ 'Boston', 39, "BOS", ],
- [ 'Hong.*Kong', 40, "HKG", ],
- [ 'Baghdad', 41, "SDA", ],
- [ 'Madrid', 42, "MAD", ],
- [ 'Singapore', 43, "SIN", ],
- [ 'Kuala.*Lumpur', 44, "KUL", ],
- [ 'Chongqing|Chung.*', 45, "CKG", ],
- [ 'Santiago', 46, "SCL", ],
- [ 'Toronto', 47, "YYZ", ],
- [ 'Riyadh', 48, "RUH", ],
- [ 'Atlanta', 49, "ATL", ],
- [ 'Miami', 50, "MIA", ],
- [ 'Detroit', 51, "DTW", ],
- [ 'St..*Petersburg', 52, "LED", ],
- [ 'Khartoum', 53, "KRT", ],
- [ 'Sydney', 54, "SYD", ],
- [ 'Milan', 55, "MXP", ],
- [ 'Abidjan', 56, "ABJ", ],
- [ 'Barcelona', 57, "BCN", ],
- [ 'Nairobi', 58, "NBO", ],
- [ 'Caracas', 59, "CCS", ],
- [ 'Monterrey', 60, "MTY", ],
- [ 'Phoenix', 61, "PHX", ],
- [ 'Berlin', 62, "TXL", ],
- [ 'Melbourne', 63, "MEL", ],
- [ 'Casablanca', 64, "CMN", ],
- [ 'Montreal', 65, "YUL", ],
- [ 'Salvador', 66, "SSA", ],
- [ 'Rome', 67, "FCO", ],
- [ 'Kiev', 68, "KBP", ],
- [ 'Ad+is.*Ab.ba', 69, "ADD", ],
- [ 'Denver', 70, "DEN", ],
- [ 'St.*Louis', 71, "STL", ],
- [ 'Dakar', 72, "DKR", ],
- [ 'San.*Juan', 73, "SJU", ],
- [ 'Vancouver', 74, "YVR", ],
- [ 'Tel.*Aviv', 75, "TLV", ],
- [ 'Tunis', 76, "TUN", ],
- [ 'Portland', 77, "PDX", ],
- [ 'Manaus', 78, "MAO", ],
- [ 'Calgary', 79, "YYC", ],
- [ 'Halifax', 80, "YHZ", ],
- [ 'Prague', 81, "PRG", ],
- [ 'Copenhagen', 82, "CPH", ],
- [ 'Djibouti', 83, "JIB", ],
- [ 'Quito', 84, "UIO", ],
- [ 'Helsinki', 85, "HEL", ],
- [ 'Papeete|Tahiti', 86, "PPT", ],
- [ 'Frankfurt', 87, "FRA", ],
- [ 'Reykjavik', 88, "RKV", ],
- [ 'Riga', 89, "RIX", ],
- [ 'Antananarivo', 90, "TNR", ],
- [ 'Amsterdam', 91, "AMS", ],
- [ 'Bucharest', 92, "OTP", ],
- [ 'Novosibirsk', 93, "OVB", ],
- [ 'Kigali', 94, "KGL", ],
- [ 'Dushanbe', 95, "DYU", ],
- [ 'Dubai', 96, "DXB", ],
- [ 'Bermuda', 97, "BDA", ],
- [ 'Anchorage', 98, "ANC", ],
- [ 'Austin', 99, "AUS", ],
- [ 'Honolulu', 100, "HNL", ],
- [ 'Apia', 101, "FGI", ],
- [ 'Vienna', 102, "VIE", ],
- [ 'Brussels', 103, "BRU", ],
- [ 'Munich', 104, "MUC", ],
- [ 'Dublin', 105, "DUB", ],
- [ 'Doha', 106, "DOH", ],
- [ 'Taipei', 107, "TPE", ],
- [ 'Yakutsk', 108, "YKS", ],
- [ 'Z.rich', 109, "ZRH", ],
- [ 'Manchester', 110, "MAN", ],
- [ 'Houston', 111, "IAH", ],
- [ 'Charlotte', 112, "CLT", ],
- [ 'Dallas', 113, "DFW", ],
- [ 'Las.*Vegas', 114, "LAS", ],
- [ 'Antalya', 115, "AYT", ],
- [ 'Auckland', 116, "AKL", ],
-]
-
-MATCHED_AIRPORTS = {}
-MATCH_ON_IATA = {}
-MATCH_ON_CITY = {}
-match_on_city_names = []
-
-AIRPORTS_TO_MATCH.each do |name, idx, iata|
- hsh = {iata: iata, re: Regexp.new(name, Regexp::IGNORECASE), name: name, idx: idx}
- if iata.present?
- MATCH_ON_IATA[iata] = hsh
- else
- match_on_city_names << name
- MATCH_ON_CITY[hsh[:re]] = hsh
- end
-end
-match_on_city_re = Regexp.new(match_on_city_names.join('|'))
-
-Airport.load_tsv(:airport_identifiers) do |airport|
- airport.name = airport.name[0..30]
- if MATCH_ON_IATA.include?(airport.iata)
- hsh = MATCH_ON_IATA[airport.iata]
- warn [hsh.values, airport.to_tsv].flatten.join("\t") unless hsh[:re] =~ airport.city
- MATCHED_AIRPORTS[hsh[:idx]] = airport
- # elsif (airport.city =~ match_on_city_re)
- # MATCH_ON_CITY.each do |re, hsh|
- # if (airport.city =~ re)
- # puts [airport.to_tsv, hsh[:name], hsh[:idx]].join("\t")
- # end
- # end
- end
-end
-
-AIRPORTS_TO_MATCH.each do |name, idx, iata|
- # next if MATCHED_AIRPORTS[idx]
- airport_str = MATCHED_AIRPORTS[idx] ? MATCHED_AIRPORTS[idx].to_tsv : "\t\t\t\t\t\t\t\t\t\t\t\t"
- puts [airport_str, name, "", idx].join("\t")
-end
40 munging/airports/40_wbans.txt
@@ -1,40 +0,0 @@
-13874
-13874
-14739
-13881
-13881
-03017
-03017
-03927
-03927
-94847
-94847
-14734
-14734
-53127
-99999
-12960
-12960
-94789
-94789
-23169
-23169
-23174
-23174
-12815
-12815
-12839
-12839
-14922
-14922
-94846
-94846
-13739
-13739
-23183
-23183
-99999
-24233
-24233
-23234
-23234
37 munging/airports/filter_weather_reports.rb
@@ -1,37 +0,0 @@
-#!/usr/bin/env ruby
-# encoding:UTF-8
-
-require 'wukong'
-require 'pathname'
-load '/home/dlaw/dev/wukong/examples/wikipedia/munging_utils.rb'
-
-module WeatherFilter
- class Mapper < Wukong::Streamer::LineStreamer
-
- WBAN_FILENAME = '/home/dlaw/dev/wukong/examples/airports/wbans.txt'
- USA_WBAN_FILENAME = '/home/dlaw/dev/wukong/examples/airports/usa_wbans.txt'
- FORTY_WBANS_FILENAME = '/home/dlaw/dev/wukong/examples/airports/40_wbans.txt'
-
- def initialize
- @wbans = []
- File.foreach(FORTY_WBANS_FILENAME) do |line|
- @wbans << line.chomp
- end
- end
-
- def process line
- MungingUtils.guard_encoding(line) do |clean_line|
- wban = Pathname(ENV['map_input_file']).basename.to_s.split('-')[1]
- if @wbans.include? wban
- yield clean_line
- end
- end
- end
- end
-end
-
-Wukong::Script.new(
- WeatherFilter::Mapper,
- nil
-).run
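The mapper above identifies each record's station from the input filename Hadoop exposes in map_input_file, not from the record body. A sketch of that extraction; the USAF-WBAN-year layout in the sample path is the NOAA ISD naming convention the script appears to assume:

```ruby
require 'pathname'

# Pull the WBAN id (the middle dash-separated token) out of an ISD-style
# filename such as 722420-12960-2007.gz. The sample path is hypothetical.
def wban_from_input_file(path)
  Pathname(path).basename.to_s.split('-')[1]
end

wban_from_input_file('/data/weather/722420-12960-2007.gz')  # => "12960"
```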
31 munging/airports/join.pig
@@ -1,31 +0,0 @@
-/* This was a misguided attempt at generating a list of WBAN IDs assigned to airports by filtering the mshr_enhanced
- * and joining it with isd_stations. This is misguided because mshr_enhanced contains much more data than isd_stations,
- * and also contains multiple entries for each weather station, making it non-obvious how best to join the data.
- * A simpler and better approach, taken in usa_wbans.pig and wbans.pig, is to filter and unique mshr_enhanced.
- */
-
-mshr = LOAD '/Users/dlaw/Desktop/stations/mshr_enhanced.tsv' AS
- (source_id:chararray, source:chararray, begin_date:chararray, end_date:chararray, station_status:chararray,
- ncdcstn_id:chararray, icao_id:chararray, wban_id:chararray, faa_id:chararray, nwsli_id:chararray, wmo_id:chararray,
- coop_id:chararray, transmittal_id:chararray, ghcnd_id:chararray, name_principal:chararray, name_principal_short:chararray,
- name_coop:chararray, name_coop_short:chararray, name_publication:chararray, name_alias:chararray, nws_clim_div:chararray,
- nws_clim_div_name:chararray, state_prov:chararray, county:chararray, nws_st_code:chararray, fips_country_code:chararray,
- fips_country_name:chararray, nws_region:chararray, nws_wfo:chararray, elev_ground:chararray, elev_ground_unit:chararray,
- elev_barom:chararray, elev_barom_unit:chararray, elev_air:chararray, elev_air_unit:chararray, elev_zerodat:chararray,
- elev_zerodat_unit:chararray, elev_unk:chararray, elev_unk_unit:chararray, lat_dec:chararray, lon_dec:chararray,
- lat_lon_precision:chararray, relocation:chararray, utc_offset:chararray, obs_env:chararray, platform:chararray);
-
-mshr_grouped = GROUP mshr BY (icao_id, wban_id, faa_id, begin_date, end_date);
-mshr_final = FOREACH mshr_grouped GENERATE FLATTEN(group) AS (icao_id, wban_id, faa_id, begin_date, end_date);
-
-stations = LOAD '/Users/dlaw/Desktop/stations/stations.tsv' AS
- (usaf_id:chararray, wban_id:chararray, station_name:chararray, wmo_country_id:chararray, fips_country_id:chararray,
- state:chararray, icao_call_sign:chararray, latitude:chararray, longitude:chararray, elevation:chararray, begin:chararray, end:chararray);
-
-first_pass_j = JOIN mshr_final BY (wban_id) RIGHT OUTER, stations BY (wban_id);
-first_pass_f = FILTER first_pass_j BY (mshr_final::icao_id is not null);
-first_pass = FOREACH first_pass_f GENERATE
- stations::wban_id, mshr_final::icao_id, stations::icao_call_sign, stations::usaf_id, mshr_final::faa_id,
- stations::station_name, stations::wmo_country_id, stations::fips_country_id, stations::state, stations::latitude, stations::longitude, stations::elevation, stations::begin, stations::end;
-
-STORE first_pass INTO '/Users/dlaw/Desktop/stations/airport_stations';
33 munging/airports/to_tsv.rb
@@ -1,33 +0,0 @@
-#!/usr/bin/env ruby
-load 'flat/lib/flat.rb'
-
-# This is a script that uses the flat file parser
-# to transform the mshr enhanced data file and the
-# ISD stations list from fixed-width to .tsv.
-# The script must be in the same directory with
-# mshr_enhanced.txt, isd_stations.txt, and the
-# flat file parsing library to work.
-
-# mshr-enhanced format description can be found at
-# ftp://ftp.ncdc.noaa.gov/pub/data/homr/docs/MSHR_Enhanced_Table.txt
-
-# The actual mshr-enhanced table can be found at
-# http://www.ncdc.noaa.gov/homr/file/mshr_enhanced.txt.zip
-
-# isd_stations can be found at
-# http://www1.ncdc.noaa.gov/pub/data/noaa/ish-history.txt
-
-# Format strings
-MSHR_FORMAT_STRING = %{s20 s10 s8 s8 s20 s20 s20 s20 s20 s20 s20 s20 s20 s20
- s100 s30 s100 s30 s100 s100 s10 s40 s10 s50 s2 s2 s100
- s30 s10 s40 s20 s40 s20 s40 s20 s40 s20 s40 s20 s20 s20
- s10 s62 s16 s40 s100}
-ISD_FORMAT_STRING = %{s6 s5 s29 s2 s2 s2 s5 D6e3 D7e3 D6e1 _2 s8 s8}
-
-# Parse mshr_enhanced
-mshr_parser = Flat.create_parser(MSHR_FORMAT_STRING,1)
-mshr_parser.file_to_tsv('mshr_enhanced.txt','mshr_enhanced.tsv')
-
-# Parse isd_stations
-isd_parser = Flat.create_parser(ISD_FORMAT_STRING,1,false)
-isd_parser.file_to_tsv('isd_stations.txt','isd_stations.tsv')
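The Flat parser driven by those format strings slices fixed-width rows into typed columns. Plain Ruby can do the string-slicing part with String#unpack; a sketch with illustrative widths (not the real MSHR or ISD layouts):

```ruby
# Cut a fixed-width line into fields with String#unpack. The 'A<n>' directive
# grabs n bytes and strips trailing spaces. Widths and sample line are
# illustrative only, not the actual MSHR/ISD column layouts.
def fixed_width_to_fields(line, widths)
  pattern = widths.map { |w| "A#{w}" }.join  # e.g. [6, 6, 30] -> "A6A6A30"
  line.unpack(pattern)
end

line = '722420WBAN1 LAKE CHARLES REGIONAL ARPT  '
fixed_width_to_fields(line, [6, 6, 30])
# => ["722420", "WBAN1", "LAKE CHARLES REGIONAL ARPT"]
```

Joining the resulting fields with tabs yields the same shape of output that file_to_tsv writes.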
19 munging/airports/usa_wbans.pig
@@ -1,19 +0,0 @@
-/* Outputs a list of WBAN ids that are assigned to airports in the USA */
-
-mshr = LOAD '/Users/dlaw/Desktop/stations/mshr_enhanced.tsv' AS
- (source_id:chararray, source:chararray, begin_date:chararray, end_date:chararray, station_status:chararray,
- ncdcstn_id:chararray, icao_id:chararray, wban_id:chararray, faa_id:chararray, nwsli_id:chararray, wmo_id:chararray,
- coop_id:chararray, transmittal_id:chararray, ghcnd_id:chararray, name_principal:chararray, name_principal_short:chararray,
- name_coop:chararray, name_coop_short:chararray, name_publication:chararray, name_alias:chararray, nws_clim_div:chararray,
- nws_clim_div_name:chararray, state_prov:chararray, county:chararray, nws_st_code:chararray, fips_country_code:chararray,
- fips_country_name:chararray, nws_region:chararray, nws_wfo:chararray, elev_ground:chararray, elev_ground_unit:chararray,
- elev_barom:chararray, elev_barom_unit:chararray, elev_air:chararray, elev_air_unit:chararray, elev_zerodat:chararray,
- elev_zerodat_unit:chararray, elev_unk:chararray, elev_unk_unit:chararray, lat_dec:chararray, lon_dec:chararray,
- lat_lon_precision:chararray, relocation:chararray, utc_offset:chararray, obs_env:chararray, platform:chararray);
-
-mshr_grouped = GROUP mshr BY (icao_id, wban_id, faa_id, fips_country_code);
-mshr_flattened = FOREACH mshr_grouped GENERATE FLATTEN(group) AS (icao_id, wban_id, faa_id, fips_country_code);
-mshr_filtered = FILTER mshr_flattened BY (icao_id is not null and wban_id is not null and fips_country_code == 'US');
-
-mshr_final = FOREACH mshr_filtered GENERATE wban_id;
-STORE mshr_final INTO '/Users/dlaw/Desktop/stations/usa_wbans';
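The GROUP/FLATTEN step above is a dedup: it collapses repeated station records to distinct (icao, wban, faa, country) tuples before filtering. A hedged Ruby equivalent of that pipeline, on made-up sample tuples (the ids below are illustrative, not verified station data):

```ruby
# rows ~ the mshr relation projected to (icao_id, wban_id, faa_id, country).
rows = [
  ["KAUS", "13904", "AUS", "US"],
  ["KAUS", "13904", "AUS", "US"],  # duplicate record, dropped by uniq
  [nil,    "94119", nil,   "US"],  # no ICAO id, filtered out
  ["EGLL", "99999", nil,   "UK"],  # not a US station, filtered out
]

# uniq ~ GROUP BY + FLATTEN(group); select ~ the FILTER; map ~ the final GENERATE.
usa_wbans = rows.uniq
                .select { |icao, wban, _faa, cc| icao && wban && cc == "US" }
                .map    { |_icao, wban, _faa, _cc| wban }
puts usa_wbans  # => 13904
```

The Pig version scales the same logic across HDFS; the filter-then-project order is identical.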
2,157 munging/airports/usa_wbans.txt
@@ -1,2157 +0,0 @@
-03069
-94119
-04204
-94991
-53928
-54779
-03196
-63844
-63875
-53929
-64773
-04201
-94076
-04849
-63834
-94037
-63839
-63876
-63877
-03051
-63878
-12981
-12971
-53866
-53904
-12978
-53862
-94061
-63848
-63879
-63847
-53990
-53966
-53972
-03053
-94298
-03032
-63827
-92807
-92807
-03064
-93226
-63880
-94993
-53964
-94623
-94070
-94943
-04862
-12832
-03974
-94299
-14737
-13962
-23050
-14929
-14929
-03019
-13869
-04863
-14756
-53991
-54916
-13959
-24283
-93730
-94997
-04847
-53930
-93940
-03970
-03043
-94998
-13705
-03034
-03064
-94968
-93915
-93065
-04827
-04993
-54770
-53909
-14762
-12804
-03820
-13873
-24044
-04828
-04864
-14930
-94999
-53931
-63853
-53175
-53870
-24015
-93773
-93773
-14813
-14735
-12932
-53146
-93097
-03958
-94910
-23061
-24160
-63833
-23047
-23047
-13870
-53989
-54816
-94989
-12899
-13871
-93846
-94974
-14847
-94975
-04850
-14736
-93067
-93227
-12897
-13701
-94849
-04837
-04837
-53983
-54917
-53932
-03949
-54768
-53915
-94889
-93991
-12953
-04808
-94790
-04865
-94987
-53865
-93073
-03973
-54754
-93796
-94224
-54809
-94929
-13874
-03035
-13958
-04825
-94287
-14946
-14605
-04901
-04902
-23224
-03892
-13904
-13958
-14897
-93797
-53933
-03812
-14777
-23191
-04903
-94190
-53959
-04205
-04904
-14910
-63901
-04779
-13861
-53129
-94815
-54772
-93216
-13944
-14775
-13803
-24119
-24119
-54817
-12971
-54904
-93942
-94946
-93240
-53881
-23159
-12803
-94961
-14816
-03036
-14740
-94702
-14702
-94871
-04751
-24028
-24234
-23155
-13838
-04905
-03024
-53882
-04725
-04725
-14606
-23005
-14616
-94055
-13876
-94289
-94793
-94947
-23157
-63835
-24033
-24011
-24011
-04839
-13820
-03065
-93068
-14958
-04842
-53145
-53988
-24130
-23036
-04853
-12982
-13726
-12818
-03872
-94902
-03859
-23158
-24217
-54760
-23225
-13802
-94046
-24180
-03893
-54831
-94700
-03999
-53823
-13897
-25312
-94185
-04906
-24131
-24267
-14739
-12809
-54765
-03044
-24164
-53918
-12917
-53883
-53992
-94938
-14931
-12919
-12919
-13904
-14815
-24135
-64705
-13970
-14742
-14964
-14733
-14733
-23156
-23152
-04866
-93783
-04813
-54921
-03959
-94282
-53934
-54733
-93943
-93808
-93721
-54922
-94054
-13814
-24133
-03182
-92804
-12981
-24132
-04867
-04839
-04868
-14817
-13883
-13883
-24046
-14895
-23051
-14607
-04907
-03732
-04908
-04909
-03037
-13825
-94625
-04101
-53884
-23254
-03038
-54764
-14966
-93129
-94977
-53916
-24017
-23007
-54743
-24286
-14703
-53850
-13884
-93069
-54901
-54828
-93967
-04805
-03935
-94866
-03914
-13882
-53128
-53935
-93736
-13880
-93203
-14990
-04910
-93809
-04869
-93798
-03802
-04911
-03881
-04912
-54920
-03894
-14820
-14820
-54832
-13999
-03904
-94266
-13881
-53845
-23136
-14821
-94870
-14858
-94940
-54923
-04913
-13984
-23086
-93033
-03179
-13981
-93075
-24045
-24136
-12867
-14745
-04914
-93037
-12947
-03945
-13983
-93799
-03701
-24089
-03960
-53981
-03039
-53936
-03027
-54905
-93134
-94199
-94624
-93718
-53860
-12924
-12924
-03177
-53912
-13866
-12879
-93842
-03932
-04915
-03847
-24137
-12833
-03727
-53867
-53993
-63840
-94032
-93814
-23077
-24202
-23008
-04870
-94890
-13941
-94979
-53902
-54774
-14751
-24018
-24018
-94204
-03164
-94056
-93728
-12834
-23161
-13960
-13728
-54791
-93235
-93815
-63881
-94908
-13743
-53852
-13985
-13985
-54781
-03887
-04916
-03017
-23062
-53925
-14822
-94119
-25515
-04851
-03927
-22015
-94057
-13839
-93843
-93042
-04135
-24012
-93771
-04871
-14747
-03073
-22001
-14913
-14913
-54833
-24138
-24219
-23109
-93784
-23078
-03994
-03724
-94984
-63903
-13837
-53885
-04917
-94891
-13707
-93770
-94878
-94892
-24103
-03702
-63842
-03160
-03976
-54844
-93005
-22010
-14933
-94704
-23162
-94962
-53905
-03991
-53853
-94847
-04830
-53937
-53938
-93026
-04872
-04787
-03070
-94928
-94982
-94982
-54924
-03184
-53910
-54734
-04918
-54786
-03809
-13910
-04919
-03987
-53965
-03049
-14905
-94239
-14991
-03983
-03983
-12983
-04920
-13786
-13787
-73805
-23098
-03703
-03197
-23114
-23179
-94721
-94050
-53864
-12906
-54758
-04873
-23063
-03844
-54838
-93076
-63843
-53886
-24213
-04806
-13729
-24121
-63882
-93992
-03977
-14748
-24220
-94964
-23044
-23154
-54757
-24165
-13989
-03165
-03704
-53112
-13909
-53887
-24193
-04845
-54766
-53876
-04921
-53851
-24141
-14608
-03020
-53872
-14860
-25318
-12961
-04874
-14824
-94853
-13935
-03756
-94971
-53110
-04875
-04922
-63872
-24221
-94195
-92808
-04923
-93817
-04111
-53825
-94726
-53939
-93719
-14734
-12971
-53888
-53842
-03705
-12836
-53114
-03706
-53969
-53907
-93735
-93996
-14914
-93193
-93740
-93737
-94969
-24146
-23167
-24047
-94963
-94015
-53831
-13730
-03981
-03981
-03022
-14825
-04876
-04924
-94006
-03736
-53819
-53819
-04925
-94966
-13840
-53841
-03185
-04926
-94276
-03124
-54792
-04780
-04927
-94868
-13763
-54818
-04840
-03103
-12849
-13744
-03918
-13921
-93733
-14704
-23090
-12835
-94957
-94035
-94062
-14826
-53889
-23256
-94933
-13920
-14719
-04928
-04877
-12895
-53890
-54787
-13947
-94948
-14944
-14944
-04929
-13945
-13964
-23091
-04930
-53113
-03018
-13807
-13961
-03888
-03166
-04836
-03707
-14827
-53891
-54793
-03985
-03985
-12885
-03734
-93993
-54773
-03896
-13975
-93764
-12993
-13940
-94959
-03148
-23168
-94023
-23064
-53940
-03195
-04999
-53977
-23055
-24087
-13764
-24157
-53126
-53892
-24048
-53866
-24112
-14916
-14750
-03901
-14976
-94008
-94008
-64774
-04931
-12876
-23066
-03025
-04843
-53907
-23065
-23065
-93929
-13939
-04854
-12923
-53941
-13886
-94992
-03056
-24142
-12990
-94626
-93057
-12816
-53913
-14707
-04878
-24146
-53984
-93874
-94919
-14898
-14898
-53874
-24201
-14935
-03902
-03992
-94860
-94860
-13713
-14829
-13723
-03870
-03870
-14715
-24143
-53893
-53942
-93007
-23081
-94833
-03030
-53838
-13926
-03929
-13978
-03708
-53837
-24051
-53979
-63889
-53967
-04932
-03186
-54762
-04807
-53820
-54850
-53855
-13833
-03709
-93986
-12994
-04933
-04934
-03908
-04935
-94025
-12962
-03023
-03710
-53894
-94038
-63873
-03961
-14752
-93747
-93706
-93218
-03980
-93046
-03167
-94931
-54728
-24101
-94261
-04998
-53119
-53869
-13927
-03810
-93990
-14894
-24144
-03933
-53895
-23002
-94187
-53896
-53127
-04936
-53111
-03711
-93034
-14936
-13806
-03962
-12918
-94745
-94745
-53839
-94225
-63836
-53970
-04113
-03712
-12904
-13971
-03852
-63852
-53897
-93729
-94949
-93757
-12826
-03856
-94814
-64761
-03860
-53857
-63883
-03868
-93823
-14609
-12927
-26522
-13986
-23170
-14758
-94012
-93228
-92809
-54790
-63837
-94720
-12979
-94973
-03968
-04829
-04997
-04857
-53842
-63884
-03923
-93738
-04724
-12960
-94073
-04937
-94990
-03928
-03928
-53118
-24145
-64706
-94039
-53943
-53135
-93167
-04879
-93785
-04989
-53898
-54767
-53944
-04880
-25626
-04938
-03972
-13781
-13748
-13841
-13841
-04833
-04833
-24091
-94893
-93819
-93819
-53972
-23040
-14918
-23141
-93807
-23194
-03984
-14937
-53899
-03144
-14778
-14938
-54827
-92813
-94014
-93726
-04781
-04881
-04826
-94761
-03026
-23104
-23104
-94926
-94623
-12969
-04844
-93909
-93194
-93244
-54772
-24166
-03940
-53945
-13889
-13889
-03953
-13973
-53978
-94051
-03963
-04110
-94789
-03730
-53824
-04720
-63885
-04939
-03889
-03889
-13987
-04940
-14919
-03713
-14834
-63801
-53971
-04726
-53946
-94854
-53947
-53966
-53982
-14833
-54906
-63846
-04882
-03714
-14989
-25314
-24223
-53995
-03013
-14835
-12883
-93091
-14836
-24022
-23169
-03950
-23174
-23042
-23042
-54735
-24023
-23020
-53963
-13776
-12976
-54925
-03937
-03937
-54736
-13812
-04883
-94765
-12819
-94709
-93820
-13702
-93987
-13976
-14732
-23129
-03821
-24148
-94128
-53973
-53844
-03875
-23067
-03731
-93010
-13963
-53813
-53813
-14623
-04941
-03715
-94285
-04114
-53919
-94236
-53975
-24021
-14939
-14886
-04809
-63802
-14921
-54737
-94049
-24172