Import authorities #11

Closed
garethrees opened this issue Apr 19, 2016 · 9 comments

garethrees commented Apr 19, 2016

Import the authorities from a CSV file.

Cleanup script:

#!/usr/bin/env ruby
# -*- encoding : utf-8 -*-
require 'csv'
# gem install unicode_utils
require 'unicode_utils/downcase'

def main(csv, ignore)
  ignore_list = ignore ? File.read(ignore).split("\n") : []
  puts print_new_csv(clean_csv(csv, ignore_list))
end

def clean_csv(csv, ignore_list = [])
  rows = []
  CSV.foreach(csv, headers: true, header_converters: :symbol) do |row|
    next if ignore_list.include?(row[:name])
    cleaned_data = {}
    cleaned_data[:name] = row[:name]
    cleaned_data[:request_email] = row[:request_email]
    cleaned_data[:home_page] = clean_homepage(row[:web_sitesi].to_s)
    cleaned_data[:notes] = make_notes(row[:telefon1].to_s, row[:faks_1].to_s, row[:adres].to_s)
    cleaned_data[:tag_string] = make_tag_string(row[:kurum_tr].to_s, row[:l].to_s)
    rows << cleaned_data
  end
  rows
end

def clean_homepage(str)
  return str if str.empty?

  if str.start_with?('http')
    str
  else
    "http://#{ str }"
  end
end

def make_notes(phone, fax, address)
  str = ''

  unless address.empty?
    str << <<-EOF
<strong>Adres:</strong><br />
#{address}
EOF
  end

  unless phone.empty?
    str << <<-EOF
<strong>Telefon:</strong><br />
#{phone}
EOF
  end

  unless fax.empty?
    str << <<-EOF
<strong>Faks:</strong><br />
#{fax}
EOF
  end

  str.gsub("\n\n", "\n").gsub("\n", "<br />").gsub("<br /><br />", "<br />")
end

def make_tag_string(type, location)
  loc = clean_location(location.to_s)
  type = clean_type(type.to_s)
  both = loc << ' ' << type
  UnicodeUtils.downcase(both.split(' ').sort.join(' '), :tr)
end

def clean_type(type)
  type.gsub(' ', '_').gsub('_/_', ' ').gsub('_(_', ' ').gsub('_)_', '').gsub('_)', '')
end

def clean_location(location)
  location.gsub(" - ", "_")
end

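def print_new_csv(data)
  # Nothing to print if every row was ignored or the input had no rows
  return '' if data.empty?

  headers = data.first.keys
  # Prefix the first header with '#'
  headers[0] = "##{ headers[0] }"
  CSV.generate(headers: headers) do |csv|
    csv << headers
    data.each do |d|
      csv << d.values
    end
  end
end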

csv = ARGV[0]
ignore = ARGV[1]

if csv.nil? || !File.exist?(csv)
  puts "File does not exist: #{ csv }"
  exit 1
end

if ignore && !File.exist?(ignore)
  puts "File does not exist: #{ ignore }"
  exit 1
end

main(csv, ignore)
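
To make the tag clean-up easier to follow, here is a small sketch of what the two `gsub` helpers do to their input. The helper bodies are copied from the script above so the snippet runs on its own; the sample values are made up for illustration and are not taken from the real CSV.

```ruby
# Copies of the two helpers from the script above.
def clean_type(type)
  type.gsub(' ', '_').gsub('_/_', ' ').gsub('_(_', ' ').gsub('_)_', '').gsub('_)', '')
end

def clean_location(location)
  location.gsub(' - ', '_')
end

# Hypothetical sample values (assumptions, not rows from the real file).
puts clean_type('Il Ozel Idaresi')        # spaces inside one type become underscores
puts clean_type('Belediye / Kaymakamlik') # ' / ' separates two types, so it becomes a space
puts clean_location('Istanbul - Avrupa')  # ' - ' inside a location becomes an underscore
```

A space in the result separates two tags, which is why `make_tag_string` can later split on `' '`, sort, and rejoin.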
@garethrees

Some duplication to clean up:

foiturkey@stork:/var/www/turkey.alaveteli.org/alaveteli$ cat /home/gareth/turkey-bodies-cleaned.csv | bundle exec rake import:import_csv
Only a dry run; public bodies will not be created
Preliminary check for ambiguous names or slugs...
The name Aksu Kaymakamlığı was found 2 times.
The name Bayat Belediyesi was found 2 times.
The name Bayat Kaymakamlığı was found 2 times.
The name Edremit Belediyesi was found 2 times.
The name Edremit Kaymakamlığı was found 2 times.
The name Erdek Belediyesi was found 2 times.
The name Ereğli Belediyesi was found 2 times.
The name Ereğli Kaymakamlığı was found 2 times.
The name Ilgın Belediyesi was found 2 times.
The name Kale Belediyesi was found 2 times.
The name Kale Kaymakamlığı was found 2 times.
The name Köprübaşı Belediyesi was found 2 times.
The name Köprübaşı Kaymakamlığı was found 2 times.
The name Orman Genel Müdürlüğü was found 2 times.
The name Ovacık Belediyesi was found 2 times.
The name Ovacık Kaymakamlığı was found 2 times.
The name Pazar Belediyesi was found 2 times.
The name Pazar Kaymakamlığı was found 2 times.
The name Saray Belediyesi was found 2 times.
The name Saray Kaymakamlığı was found 2 times.
The name Yenişehir Kaymakamlığı was found 2 times.
The url_part aile_ve_sosyal_politikalar_bakanligi was found 2 times.
The url_part aksu_kaymakamligi was found 2 times.
The url_part avrupa_birligi_bakanligi was found 2 times.
The url_part basbakanlik was found 2 times.
The url_part basin_ilan_kurumu was found 2 times.
The url_part bayat_belediyesi was found 2 times.
The url_part bayat_kaymakamligi was found 2 times.
The url_part beypazari_belediyesi was found 2 times.
The url_part bilim_sanayi_ve_teknoloji_bakanligi was found 2 times.
The url_part calisma_ve_sosyal_guvenlik_bakanligi was found 2 times.
The url_part cevre_ve_sehircilik_bakanligi was found 2 times.
The url_part cumhurbaskanligi was found 2 times.
The url_part devlet_personel_baskanligi was found 2 times.
The url_part disisleri_bakanligi was found 2 times.
The url_part edremit_belediyesi was found 2 times.
The url_part edremit_kaymakamligi was found 2 times.
The url_part ekonomi_bakanligi was found 2 times.
The url_part enerji_ve_tabii_kaynaklar_bakanligi was found 2 times.
The url_part erdek_belediyesi was found 2 times.
The url_part eregli_belediyesi was found 2 times.
The url_part eregli_kaymakamligi was found 2 times.
The url_part genclik_ve_spor_bakanligi was found 2 times.
The url_part gida_tarim_ve_hayvancilik_bakanligi was found 2 times.
The url_part gumruk_ve_ticaret_bakanligi was found 2 times.
The url_part hava_kuvvetleri_komutanligi was found 2 times.
The url_part hazine_mustesarligi was found 2 times.
The url_part icisleri_bakanligi was found 2 times.
The url_part iett was found 2 times.
The url_part ilgin_belediyesi was found 2 times.
The url_part istanbul_vergi_dairesi_baskanligi was found 2 times.
The url_part jandarma_genel_komutanligi was found 2 times.
The url_part kacakcilik_istihbarat_ve_bilgi_toplama_dairesi_baskanligi was found 2 times.
The url_part kale_belediyesi was found 2 times.
The url_part kale_kaymakamligi was found 2 times.
The url_part kalkinma_bakanligi was found 2 times.
The url_part kara_kuvvetleri_komutanligi was found 2 times.
The url_part koprubasi_belediyesi was found 2 times.
The url_part koprubasi_kaymakamligi was found 2 times.
The url_part kultur_ve_turizm_bakanligi was found 2 times.
The url_part maliye_bakanligi was found 2 times.
The url_part milli_egitim_bakanligi was found 2 times.
The url_part milli_savunma_bakanligi was found 2 times.
The url_part orman_genel_mudurlugu was found 2 times.
The url_part orman_ve_su_isleri_bakanligi was found 2 times.
The url_part ovacik_belediyesi was found 2 times.
The url_part ovacik_kaymakamligi was found 2 times.
The url_part ozel_ogretim_kurumlari_genel_mudurlugu was found 2 times.
The url_part pazar_belediyesi was found 2 times.
The url_part pazar_kaymakamligi was found 2 times.
The url_part saglik_bakanligi was found 2 times.
The url_part sahil_guvenlik_komutanligi was found 2 times.
The url_part saray_belediyesi was found 2 times.
The url_part saray_kaymakamligi was found 2 times.
The url_part ulastirma_denizcilik_ve_haberlesme_bakanligi was found 2 times.
The url_part yenisehir_kaymakamligi was found 2 times.

@garethrees garethrees self-assigned this Apr 19, 2016
@garethrees

Lots of duplicate url_part warnings because of mysociety/alaveteli#2684

@garethrees

Duplicate url_part warnings matched with the names that generate them:

The url_part aile_ve_sosyal_politikalar_bakanligi was found 2 times.
  Aile Ve Sosyal Politikalar Bakanliği
  Aile Ve Sosyal Politikalar Bakanlığı
The url_part aksu_kaymakamligi was found 2 times.
  Aksu Kaymakamlığı
  Aksu Kaymakamlığı
The url_part avrupa_birligi_bakanligi was found 2 times.
  Avrupa Birliği Bakanliği
  Avrupa Birliği Bakanlığı
The url_part basbakanlik was found 2 times.
  Başbakanlik
  Başbakanlık
The url_part basin_ilan_kurumu was found 2 times.
  Basin Ilan Kurumu
  Basın Ilan Kurumu
The url_part bayat_belediyesi was found 2 times.
  Bayat Belediyesi
  Bayat Belediyesi
The url_part bayat_kaymakamligi was found 2 times.
  Bayat Kaymakamlığı
  Bayat Kaymakamlığı
The url_part beypazari_belediyesi was found 2 times.
  Beypazarı Belediyesi
  Beypazarı Belediyesi, # L453 ACTUAL DUPLICATE, USE L452
The url_part bilim_sanayi_ve_teknoloji_bakanligi was found 2 times.
  Bilim, Sanayi Ve Teknoloji Bakanliği
  Bilim, Sanayi Ve Teknoloji Bakanlığı
The url_part calisma_ve_sosyal_guvenlik_bakanligi was found 2 times.
  Çalişma Ve Sosyal Güvenlik Bakanliği
  Çalışma Ve Sosyal Güvenlik Bakanlığı
The url_part cevre_ve_sehircilik_bakanligi was found 2 times.
  Çevre Ve Şehircilik Bakanliği
  Çevre Ve Şehircilik Bakanlığı
The url_part cumhurbaskanligi was found 2 times.
  Cumhurbaşkanliği
  Cumhurbaşkanlığı
The url_part devlet_personel_baskanligi was found 2 times.
  Devlet Personel Başkanliği
  Devlet Personel Başkanlığı
The url_part disisleri_bakanligi was found 2 times.
  Dişişleri Bakanliği
  Dışişleri Bakanlığı
The url_part edremit_belediyesi was found 2 times.
  Edremit Belediyesi
  Edremit Belediyesi
The url_part edremit_kaymakamligi was found 2 times.
  Efeler Kaymakamlığı
  Eflani Kaymakamlığı
The url_part ekonomi_bakanligi was found 2 times.
  Ekonomi Bakanliği
  Ekonomi Bakanlığı
The url_part enerji_ve_tabii_kaynaklar_bakanligi was found 2 times.
  Enerji Ve Tabii Kaynaklar Bakanliği
  Enerji Ve Tabii Kaynaklar Bakanlığı
The url_part erdek_belediyesi was found 2 times.
  Erdek Belediyesi
  Erdek Belediyesi
The url_part eregli_belediyesi was found 2 times.
  Ereğli Belediyesi
  Ereğli Belediyesi
The url_part eregli_kaymakamligi was found 2 times.
  Ereğli Kaymakamlığı
  Ereğli Kaymakamlığı
The url_part genclik_ve_spor_bakanligi was found 2 times.
  Gençlik Ve Spor Bakanliği
  Gençlik Ve Spor Bakanlığı
The url_part gida_tarim_ve_hayvancilik_bakanligi was found 2 times.
  Gida, Tarim Ve Hayvancilik Bakanliği
  Gıda, Tarım Ve Hayvancılık Bakanlığı
The url_part gumruk_ve_ticaret_bakanligi was found 2 times.
  Gümrük Ve Ticaret Bakanliği
  Gümrük Ve Ticaret Bakanlığı
The url_part hava_kuvvetleri_komutanligi was found 2 times.
  Hava Kuvvetleri Komutanliği
  Hava Kuvvetleri Komutanlığı
The url_part hazine_mustesarligi was found 2 times.
  Hazine Müsteşarliği
  Hazine Müsteşarlığı
The url_part icisleri_bakanligi was found 2 times.
  Içişleri Bakanliği
  Içişleri Bakanlığı
The url_part iett was found 2 times.
  I.E.T.T.
  Iett
The url_part ilgin_belediyesi was found 2 times.
  Ilgın Belediyesi
  Ilgın Belediyesi
The url_part istanbul_vergi_dairesi_baskanligi was found 2 times.
  Istanbul Vergi Dairesi Başkanliği
  Istanbul Vergi Dairesi Başkanlığı
The url_part jandarma_genel_komutanligi was found 2 times.
  Jandarma Genel Komutanliği
  Jandarma Genel Komutanlığı
The url_part kacakcilik_istihbarat_ve_bilgi_toplama_dairesi_baskanligi was found 2 times.
  Kaçakçilik Istihbarat Ve Bilgi Toplama Dairesi Başkanliği
  Kaçakçılık Istihbarat Ve Bilgi Toplama Dairesi Başkanlığı
The url_part kale_belediyesi was found 2 times.
  Kale Belediyesi
  Kale Belediyesi
The url_part kale_kaymakamligi was found 2 times.
  Kale Kaymakamlığı
  Kale Kaymakamlığı
The url_part kalkinma_bakanligi was found 2 times.
  Kalkinma Bakanliği
  Kalkınma Bakanlığı
The url_part kara_kuvvetleri_komutanligi was found 2 times.
  Kara Kuvvetleri Komutanliği
  Kara Kuvvetleri Komutanlığı
The url_part koprubasi_belediyesi was found 2 times.
  Köprübaşı Belediyesi
  Köprübaşı Belediyesi
The url_part koprubasi_kaymakamligi was found 2 times.
  Köprübaşı Kaymakamlığı
  Köprübaşı Kaymakamlığı
The url_part kultur_ve_turizm_bakanligi was found 2 times.
  Kültür Ve Turizm Bakanliği
  Kültür Ve Turizm Bakanlığı
The url_part maliye_bakanligi was found 2 times.
  Maliye Bakanliği
  Maliye Bakanlığı
The url_part milli_egitim_bakanligi was found 2 times.
  Milli Eğitim Bakanliği
  Milli Eğitim Bakanlığı
The url_part milli_savunma_bakanligi was found 2 times.
  Milli Savunma Bakanliği
  Milli Savunma Bakanlığı
The url_part orman_genel_mudurlugu was found 2 times.
  Orman Genel Müdürlüğü
  Orman Genel Müdürlüğü # These look like actual duplicates
The url_part orman_ve_su_isleri_bakanligi was found 2 times.
  Orman Ve Su Işleri Bakanliği
  Orman Ve Su Işleri Bakanlığı
The url_part ovacik_belediyesi was found 2 times.
  Ovacık Belediyesi
  Ovacık Belediyesi
The url_part ovacik_kaymakamligi was found 2 times.
  Ovacık Kaymakamlığı
  Ovacık Kaymakamlığı
The url_part ozel_ogretim_kurumlari_genel_mudurlugu was found 2 times.
  Özel Öğretim Kurumlari Genel Müdürlüğü
  Özel Öğretim Kurumları Genel Müdürlüğü
The url_part pazar_belediyesi was found 2 times.
  Pazar Belediyesi
  Pazar Belediyesi
The url_part pazar_kaymakamligi was found 2 times.
  Pazar Kaymakamlığı
  Pazar Kaymakamlığı
The url_part saglik_bakanligi was found 2 times.
  Sağlik Bakanliği
  Sağlık Bakanlığı
The url_part sahil_guvenlik_komutanligi was found 2 times.
  Sahil Güvenlik Komutanliği
  Sahil Güvenlik Komutanlığı
The url_part saray_belediyesi was found 2 times.
  Saray Belediyesi
  Saray Belediyesi
The url_part saray_kaymakamligi was found 2 times.
  Saray Kaymakamlığı
  Saray Kaymakamlığı
The url_part ulastirma_denizcilik_ve_haberlesme_bakanligi was found 2 times.
  Ulaştirma, Denizcilik Ve Haberleşme Bakanliği
  Ulaştırma, Denizcilik Ve Haberleşme Bakanlığı
The url_part yenisehir_kaymakamligi was found 2 times.
  Yenişehir Kaymakamlığı
  Yenişehir Kaymakamlığı
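
Judging by the pairs above, names that differ only in dotted and dotless i end up with the same url_part. A rough sketch of how a list like this could be produced is to group the names by an ASCII-folded slug. `approximate_url_part` and its character table are assumptions for illustration only; this is not Alaveteli's actual slug code.

```ruby
# Hypothetical folding table: an approximation, not Alaveteli's implementation.
TURKISH_TO_ASCII = {
  'ı' => 'i', 'İ' => 'i', 'ş' => 's', 'Ş' => 's', 'ğ' => 'g', 'Ğ' => 'g',
  'ü' => 'u', 'Ü' => 'u', 'ö' => 'o', 'Ö' => 'o', 'ç' => 'c', 'Ç' => 'c'
}.freeze

# Fold Turkish letters to ASCII, drop punctuation, join words with underscores.
def approximate_url_part(name)
  folded = name.gsub(Regexp.union(TURKISH_TO_ASCII.keys), TURKISH_TO_ASCII)
  folded.downcase.gsub(/[^a-z0-9 ]/, '').strip.gsub(/ +/, '_')
end

# Keep only the slugs that more than one name maps to.
def names_by_url_part(names)
  names.group_by { |name| approximate_url_part(name) }
       .select { |_url_part, group| group.size > 1 }
end

names = ['Başbakanlik', 'Başbakanlık', 'I.E.T.T.', 'Iett', 'Maliye Bakanlığı']
names_by_url_part(names).each do |url_part, group|
  puts "The url_part #{url_part} was found #{group.size} times."
  group.each { |name| puts "  #{name}" }
end
```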

@garethrees

problem-names.txt

The name Aksu Kaymakamlığı was found 2 times.
The name Bayat Belediyesi was found 2 times.
The name Bayat Kaymakamlığı was found 2 times.
The name Edremit Belediyesi was found 2 times.
The name Edremit Kaymakamlığı was found 2 times.
The name Erdek Belediyesi was found 2 times.
The name Ereğli Belediyesi was found 2 times.
The name Ereğli Kaymakamlığı was found 2 times.
The name Ilgın Belediyesi was found 2 times.
The name Kale Belediyesi was found 2 times.
The name Kale Kaymakamlığı was found 2 times.
The name Köprübaşı Belediyesi was found 2 times.
The name Köprübaşı Kaymakamlığı was found 2 times.
The name Orman Genel Müdürlüğü was found 2 times.
The name Ovacık Belediyesi was found 2 times.
The name Ovacık Kaymakamlığı was found 2 times.
The name Pazar Belediyesi was found 2 times.
The name Pazar Kaymakamlığı was found 2 times.
The name Saray Belediyesi was found 2 times.
The name Saray Kaymakamlığı was found 2 times.
The name Yenişehir Kaymakamlığı was found 2 times.

problem-url-parts.txt

The url_part aile_ve_sosyal_politikalar_bakanligi was found 2 times.
  Aile Ve Sosyal Politikalar Bakanliği
  Aile Ve Sosyal Politikalar Bakanlığı
The url_part aksu_kaymakamligi was found 2 times.
  Aksu Kaymakamlığı
  Aksu Kaymakamlığı
The url_part avrupa_birligi_bakanligi was found 2 times.
  Avrupa Birliği Bakanliği
  Avrupa Birliği Bakanlığı
The url_part basbakanlik was found 2 times.
  Başbakanlik
  Başbakanlık
The url_part basin_ilan_kurumu was found 2 times.
  Basin Ilan Kurumu
  Basın Ilan Kurumu
The url_part bayat_belediyesi was found 2 times.
  Bayat Belediyesi
  Bayat Belediyesi
The url_part bayat_kaymakamligi was found 2 times.
  Bayat Kaymakamlığı
  Bayat Kaymakamlığı
The url_part beypazari_belediyesi was found 2 times.
  Beypazarı Belediyesi
  Beypazarı Belediyesi,
The url_part bilim_sanayi_ve_teknoloji_bakanligi was found 2 times.
  Bilim, Sanayi Ve Teknoloji Bakanliği
  Bilim, Sanayi Ve Teknoloji Bakanlığı
The url_part calisma_ve_sosyal_guvenlik_bakanligi was found 2 times.
  Çalişma Ve Sosyal Güvenlik Bakanliği
  Çalışma Ve Sosyal Güvenlik Bakanlığı
The url_part cevre_ve_sehircilik_bakanligi was found 2 times.
  Çevre Ve Şehircilik Bakanliği
  Çevre Ve Şehircilik Bakanlığı
The url_part cumhurbaskanligi was found 2 times.
  Cumhurbaşkanliği
  Cumhurbaşkanlığı
The url_part devlet_personel_baskanligi was found 2 times.
  Devlet Personel Başkanliği
  Devlet Personel Başkanlığı
The url_part disisleri_bakanligi was found 2 times.
  Dişişleri Bakanliği
  Dışişleri Bakanlığı
The url_part edremit_belediyesi was found 2 times.
  Edremit Belediyesi
  Edremit Belediyesi
The url_part edremit_kaymakamligi was found 2 times.
  Efeler Kaymakamlığı
  Eflani Kaymakamlığı
The url_part ekonomi_bakanligi was found 2 times.
  Ekonomi Bakanliği
  Ekonomi Bakanlığı
The url_part enerji_ve_tabii_kaynaklar_bakanligi was found 2 times.
  Enerji Ve Tabii Kaynaklar Bakanliği
  Enerji Ve Tabii Kaynaklar Bakanlığı
The url_part erdek_belediyesi was found 2 times.
  Erdek Belediyesi
  Erdek Belediyesi
The url_part eregli_belediyesi was found 2 times.
  Ereğli Belediyesi
  Ereğli Belediyesi
The url_part eregli_kaymakamligi was found 2 times.
  Ereğli Kaymakamlığı
  Ereğli Kaymakamlığı
The url_part genclik_ve_spor_bakanligi was found 2 times.
  Gençlik Ve Spor Bakanliği
  Gençlik Ve Spor Bakanlığı
The url_part gida_tarim_ve_hayvancilik_bakanligi was found 2 times.
  Gida, Tarim Ve Hayvancilik Bakanliği
  Gıda, Tarım Ve Hayvancılık Bakanlığı
The url_part gumruk_ve_ticaret_bakanligi was found 2 times.
  Gümrük Ve Ticaret Bakanliği
  Gümrük Ve Ticaret Bakanlığı
The url_part hava_kuvvetleri_komutanligi was found 2 times.
  Hava Kuvvetleri Komutanliği
  Hava Kuvvetleri Komutanlığı
The url_part hazine_mustesarligi was found 2 times.
  Hazine Müsteşarliği
  Hazine Müsteşarlığı
The url_part icisleri_bakanligi was found 2 times.
  Içişleri Bakanliği
  Içişleri Bakanlığı
The url_part iett was found 2 times.
  I.E.T.T.
  Iett
The url_part ilgin_belediyesi was found 2 times.
  Ilgın Belediyesi
  Ilgın Belediyesi
The url_part istanbul_vergi_dairesi_baskanligi was found 2 times.
  Istanbul Vergi Dairesi Başkanliği
  Istanbul Vergi Dairesi Başkanlığı
The url_part jandarma_genel_komutanligi was found 2 times.
  Jandarma Genel Komutanliği
  Jandarma Genel Komutanlığı
The url_part kacakcilik_istihbarat_ve_bilgi_toplama_dairesi_baskanligi was found 2 times.
  Kaçakçilik Istihbarat Ve Bilgi Toplama Dairesi Başkanliği
  Kaçakçılık Istihbarat Ve Bilgi Toplama Dairesi Başkanlığı
The url_part kale_belediyesi was found 2 times.
  Kale Belediyesi
  Kale Belediyesi
The url_part kale_kaymakamligi was found 2 times.
  Kale Kaymakamlığı
  Kale Kaymakamlığı
The url_part kalkinma_bakanligi was found 2 times.
  Kalkinma Bakanliği
  Kalkınma Bakanlığı
The url_part kara_kuvvetleri_komutanligi was found 2 times.
  Kara Kuvvetleri Komutanliği
  Kara Kuvvetleri Komutanlığı
The url_part koprubasi_belediyesi was found 2 times.
  Köprübaşı Belediyesi
  Köprübaşı Belediyesi
The url_part koprubasi_kaymakamligi was found 2 times.
  Köprübaşı Kaymakamlığı
  Köprübaşı Kaymakamlığı
The url_part kultur_ve_turizm_bakanligi was found 2 times.
  Kültür Ve Turizm Bakanliği
  Kültür Ve Turizm Bakanlığı
The url_part maliye_bakanligi was found 2 times.
  Maliye Bakanliği
  Maliye Bakanlığı
The url_part milli_egitim_bakanligi was found 2 times.
  Milli Eğitim Bakanliği
  Milli Eğitim Bakanlığı
The url_part milli_savunma_bakanligi was found 2 times.
  Milli Savunma Bakanliği
  Milli Savunma Bakanlığı
The url_part orman_genel_mudurlugu was found 2 times.
  Orman Genel Müdürlüğü
  Orman Genel Müdürlüğü
The url_part orman_ve_su_isleri_bakanligi was found 2 times.
  Orman Ve Su Işleri Bakanliği
  Orman Ve Su Işleri Bakanlığı
The url_part ovacik_belediyesi was found 2 times.
  Ovacık Belediyesi
  Ovacık Belediyesi
The url_part ovacik_kaymakamligi was found 2 times.
  Ovacık Kaymakamlığı
  Ovacık Kaymakamlığı
The url_part ozel_ogretim_kurumlari_genel_mudurlugu was found 2 times.
  Özel Öğretim Kurumlari Genel Müdürlüğü
  Özel Öğretim Kurumları Genel Müdürlüğü
The url_part pazar_belediyesi was found 2 times.
  Pazar Belediyesi
  Pazar Belediyesi
The url_part pazar_kaymakamligi was found 2 times.
  Pazar Kaymakamlığı
  Pazar Kaymakamlığı
The url_part saglik_bakanligi was found 2 times.
  Sağlik Bakanliği
  Sağlık Bakanlığı
The url_part sahil_guvenlik_komutanligi was found 2 times.
  Sahil Güvenlik Komutanliği
  Sahil Güvenlik Komutanlığı
The url_part saray_belediyesi was found 2 times.
  Saray Belediyesi
  Saray Belediyesi
The url_part saray_kaymakamligi was found 2 times.
  Saray Kaymakamlığı
  Saray Kaymakamlığı
The url_part ulastirma_denizcilik_ve_haberlesme_bakanligi was found 2 times.
  Ulaştirma, Denizcilik Ve Haberleşme Bakanliği
  Ulaştırma, Denizcilik Ve Haberleşme Bakanlığı
The url_part yenisehir_kaymakamligi was found 2 times.
  Yenişehir Kaymakamlığı
  Yenişehir Kaymakamlığı

Munge them together and remove duplicates:

$ { ggrep -Po '(?<=The name ).*(?= was .*)' problem-names.txt ; grep -E '^\s\s' problem-url-parts.txt | sed -e 's/^[[:space:]]*//' ; } | tr - - | sort | uniq
Aile Ve Sosyal Politikalar Bakanliği
Aile Ve Sosyal Politikalar Bakanlığı
Aksu Kaymakamlığı
Avrupa Birliği Bakanliği
Avrupa Birliği Bakanlığı
Basin Ilan Kurumu
Basın Ilan Kurumu
Bayat Belediyesi
Bayat Kaymakamlığı
Başbakanlik
Başbakanlık
Beypazarı Belediyesi
Beypazarı Belediyesi,
Bilim, Sanayi Ve Teknoloji Bakanliği
Bilim, Sanayi Ve Teknoloji Bakanlığı
Cumhurbaşkanliği
Cumhurbaşkanlığı
Devlet Personel Başkanliği
Devlet Personel Başkanlığı
Dişişleri Bakanliği
Dışişleri Bakanlığı
Edremit Belediyesi
Edremit Kaymakamlığı
Efeler Kaymakamlığı
Eflani Kaymakamlığı
Ekonomi Bakanliği
Ekonomi Bakanlığı
Enerji Ve Tabii Kaynaklar Bakanliği
Enerji Ve Tabii Kaynaklar Bakanlığı
Erdek Belediyesi
Ereğli Belediyesi
Ereğli Kaymakamlığı
Gençlik Ve Spor Bakanliği
Gençlik Ve Spor Bakanlığı
Gida, Tarim Ve Hayvancilik Bakanliği
Gümrük Ve Ticaret Bakanliği
Gümrük Ve Ticaret Bakanlığı
Gıda, Tarım Ve Hayvancılık Bakanlığı
Hava Kuvvetleri Komutanliği
Hava Kuvvetleri Komutanlığı
Hazine Müsteşarliği
Hazine Müsteşarlığı
I.E.T.T.
Iett
Ilgın Belediyesi
Istanbul Vergi Dairesi Başkanliği
Istanbul Vergi Dairesi Başkanlığı
Içişleri Bakanliği
Içişleri Bakanlığı
Jandarma Genel Komutanliği
Jandarma Genel Komutanlığı
Kale Belediyesi
Kale Kaymakamlığı
Kalkinma Bakanliği
Kalkınma Bakanlığı
Kara Kuvvetleri Komutanliği
Kara Kuvvetleri Komutanlığı
Kaçakçilik Istihbarat Ve Bilgi Toplama Dairesi Başkanliği
Kaçakçılık Istihbarat Ve Bilgi Toplama Dairesi Başkanlığı
Köprübaşı Belediyesi
Köprübaşı Kaymakamlığı
Kültür Ve Turizm Bakanliği
Kültür Ve Turizm Bakanlığı
Maliye Bakanliği
Maliye Bakanlığı
Milli Eğitim Bakanliği
Milli Eğitim Bakanlığı
Milli Savunma Bakanliği
Milli Savunma Bakanlığı
Orman Genel Müdürlüğü
Orman Ve Su Işleri Bakanliği
Orman Ve Su Işleri Bakanlığı
Ovacık Belediyesi
Ovacık Kaymakamlığı
Pazar Belediyesi
Pazar Kaymakamlığı
Sahil Güvenlik Komutanliği
Sahil Güvenlik Komutanlığı
Saray Belediyesi
Saray Kaymakamlığı
Sağlik Bakanliği
Sağlık Bakanlığı
Ulaştirma, Denizcilik Ve Haberleşme Bakanliği
Ulaştırma, Denizcilik Ve Haberleşme Bakanlığı
Yenişehir Kaymakamlığı
Çalişma Ve Sosyal Güvenlik Bakanliği
Çalışma Ve Sosyal Güvenlik Bakanlığı
Çevre Ve Şehircilik Bakanliği
Çevre Ve Şehircilik Bakanlığı
Özel Öğretim Kurumlari Genel Müdürlüğü
Özel Öğretim Kurumları Genel Müdürlüğü
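
For anyone without GNU grep to hand, the same munging can be sketched in Ruby. `build_ignore_list` is a hypothetical helper name; it takes the contents of the two files as strings. The `tr` step is left out because, as shown above, it maps `-` to itself. Ruby sorts by code point, which may not match the ordering of `sort` in every locale.

```ruby
# Hypothetical helper: combines both problem files into one de-duplicated list.
def build_ignore_list(names_text, url_parts_text)
  # "The name X was found 2 times." -> "X"
  from_names = names_text.scan(/^The name (.*) was .*$/).flatten
  # Indented lines in the url_part file are the authority names
  from_url_parts = url_parts_text.lines.grep(/^\s\s/).map(&:strip).reject(&:empty?)
  (from_names + from_url_parts).uniq.sort
end

# Small sample in the same shape as the two files above.
names_text = "The name Kale Belediyesi was found 2 times.\n"
url_parts_text = <<~TXT
  The url_part iett was found 2 times.
    I.E.T.T.
    Iett
  The url_part kale_belediyesi was found 2 times.
    Kale Belediyesi
    Kale Belediyesi
TXT

puts build_ignore_list(names_text, url_parts_text)
```

With the real files, the strings would come from `File.read('problem-names.txt')` and `File.read('problem-url-parts.txt')`.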

@garethrees

garethrees commented Apr 20, 2016

Remove the duplicate authorities and clean up:

$ { ggrep -Po '(?<=The name ).*(?= was .*)' problem-names.txt ; grep -E '^\s\s' problem-url-parts.txt | sed -e 's/^[[:space:]]*//' ; } | tr - - | sort | uniq > turkey-ignore-list.txt
$ ./turkey_authorities.rb ~/Downloads/turkey-authorities.csv ~/Downloads/turkey-ignore-list.txt > ~/Downloads/turkey-authorities-cleaned-no-dupes.csv

@garethrees

Invalid emails reported by the dry run:

  error: line 30: invalid email 'https://webportal.adiyaman.bel.tr/' for authority 'Adıyaman Belediyesi'
  error: line 287: invalid email 'https://webportal.aydin.bel.tr/' for authority 'Aydın Büyükşehir Belediyesi'
  error: line 307: invalid email 'azdavaykaymakamlõ_õ@azdavaygov.tr' for authority 'Azdavay Kaymakamlığı'
  error: line 481: invalid email 'https://webportal.bolu.bel.tr/' for authority 'Bolu Belediyesi'
  error: line 533: invalid email 'https://webportal.burdur-bld.gov.tr/' for authority 'Burdur Belediyesi'
  error: line 819: invalid email 'dodurga.gov.tr' for authority 'Dodurga Kaymakamlığı'
  error: line 1115: invalid email 'gulnarbelediyesi@hotmail. Com' for authority 'Gülnar Belediyesi'
  error: line 1120: invalid email 'bilgi@gulyalõ.gov.tr' for authority 'Gülyalı Kaymakamlığı'
  error: line 1138: invalid email 'guneysinir.isay.gov.tr' for authority 'Güneysınır Kaymakamlığı'
  error: line 1337: invalid email 'https://online.clkbogazici.com.tr/' for authority 'İstanbul - Avrupa Elektrik Dağıtım ( İl Müdürlüğü )'
  error: line 1512: invalid email 'https://webserver.kastamonu.bel.tr/' for authority 'Kastamonu Belediyesi'
  error: line 1530: invalid email 'ozelkalemkayseri.bel.tr' for authority 'Kayseri Büyükşehir Belediyesi'
  error: line 1645: invalid email 'info@konyaalti@bel.tr' for authority 'Konyaaltı Belediyesi'
  error: line 1921: invalid email 'bilgi@odemis@bel.tr' for authority 'Ödemiş Belediyesi'
  error: line 2067: invalid email 'rtuk@rtuk.gov.tr; bilgidinme@rtuk.gov.tr' for authority 'Rtük Radyo Ve Televizyon Üst Kurulu'
  error: line 2195: invalid email 'kaymakamlõk@senpazar.gov.tr' for authority 'Şenpazar Kaymakamlığı'
  error: line 2305: invalid email 'Êsuloglu@icisleri.gov.tr' for authority 'Süloğlu Kaymakamlığı'
  error: line 2473: invalid email 'tŸrkeliky@gmail.com' for authority 'Türkeli Kaymakamlığı'
  error: line 2475: invalid email 'tabb@afad.gov.tr.' for authority 'Türkiye Afet Bilgi Bankası'
  error: line 2503: invalid email 'http://www.tarimkredi.org.tr/' for authority 'Türkiye Tarim Kredi Kooperatifleri Birliği'
  error: line 2517: invalid email '04tutak@i�i_leri.gov.tr' for authority 'Tutak Kaymakamlığı'
  error: line 2565: invalid email 'bilgi@ŸzŸmlŸ.bel.tr' for authority 'Üzümlü Belediyesi'

@garethrees

garethrees commented Apr 20, 2016

Add the authorities with invalid emails to the ignore list:

$ { ggrep -Po '(?<=The name ).*(?= was .*)' problem-names.txt ; grep -E '^\s\s' problem-url-parts.txt | sed -e 's/^[[:space:]]*//' ; ggrep -oP "(?<=authority ').*(?=')" turkey-invalid-emails.txt ; } | tr - - | sort | uniq > turkey-ignore-list.txt 
$ ./turkey_authorities.rb ~/Downloads/turkey-authorities.csv ~/Downloads/turkey-ignore-list.txt > ~/Downloads/turkey-authorities-cleaned-no-dupes-or-invalid.csv
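
The extra `ggrep` stage pulls the authority name out of each dry-run error line. A minimal Ruby sketch of the same extraction, which also keeps the line number and the rejected email for follow-up, could look like this. `InvalidEmail` and `parse_invalid_emails` are hypothetical names, and the regex assumes the error format shown in the dry-run output.

```ruby
# Hypothetical record for one dry-run error line.
InvalidEmail = Struct.new(:line, :email, :authority)

# Parse lines of the form:
#   error: line N: invalid email 'EMAIL' for authority 'NAME'
def parse_invalid_emails(dry_run_output)
  pattern = /line (\d+): invalid email '(.*)' for authority '(.*)'/
  dry_run_output.scan(pattern).map do |line, email, authority|
    InvalidEmail.new(line.to_i, email, authority)
  end
end

# Two lines taken from the dry-run output.
sample = <<~TXT
  error: line 819: invalid email 'dodurga.gov.tr' for authority 'Dodurga Kaymakamlığı'
  error: line 2475: invalid email 'tabb@afad.gov.tr.' for authority 'Türkiye Afet Bilgi Bankası'
TXT

parse_invalid_emails(sample).each do |err|
  puts "#{err.authority} (line #{err.line}): #{err.email}"
end
```

Mapping the result with `&:authority` gives the names to append to the ignore list.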

@garethrees

Most authorities are now uploaded; waiting on clarifications for the remaining handful.

@garethrees

Done, other than http://bilmehakki.org/body/list/unicode_email, which we'll revisit in the future.
