## Parsing the News Headlines ## 

## Objective ## 

Find any city and/or country names mentioned in each of the news headlines.

## Workflow #1 ##
Load in the headline data and examine it for any data quality issues.
- Use any library/data structure to read in the headlines.
- Read through some of the headlines and identify potential problems.
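
As one way to make that read-through more systematic, the sketch below flags a few common problems: empty lines, non-ASCII characters, and exact duplicates. The helper name `audit_headlines` and the sample lines are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def audit_headlines(lines):
    """Return a small report of potential data quality issues."""
    stripped = [line.strip() for line in lines]
    counts = Counter(stripped)
    return {
        # Positions of lines that are empty after stripping whitespace
        'empty': [i for i, line in enumerate(stripped) if not line],
        # Headlines with accents, typographic dashes, or encoding debris
        'non_ascii': [line for line in stripped if not line.isascii()],
        # Non-empty headlines that appear more than once
        'duplicates': sorted(line for line, n in counts.items() if n > 1 and line),
    }


# Hypothetical sample standing in for the contents of the headline file
sample = [
    'Zika Outbreak Hits Miami\n',
    'Zika alert – Manila now threatened\n',
    'Zika Outbreak Hits Miami\n',
    '\n',
]
report = audit_headlines(sample)
print(report)
```

A report like this only narrows down where to look; reading a sample of headlines by eye is still needed to spot issues such as misspelled or truncated text.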

In [2]:
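# Open the file in a context manager so it is closed after reading.
# UTF-8 is assumed here so that non-ASCII punctuation decodes correctly.
with open('data/headlines.txt', 'r', encoding='utf-8') as headline_file:
    headlines = [line.strip() for line in headline_file]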

num_headlines = len(headlines)

print(f"{num_headlines} headlines have been loaded")

650 headlines have been loaded


In [3]:
headlines

['Zika Outbreak Hits Miami',
 'Could Zika Reach New York City?',
 'First Case of Zika in Miami Beach',
 'Mystery Virus Spreads in Recife, Brazil',
 'Dallas man comes down with case of Zika',
 'Trinidad confirms first Zika case',
 'Zika Concerns are Spreading in Houston',
 'Geneve Scientists Battle to Find Cure',
 'The CDC in Atlanta is Growing Worried',
 'Zika Infested Monkeys in Sao Paulo',
 'Brownsville teen contracts Zika virus',
 'Mosquito control efforts in St. Louis take new tactics with Zika threat',
 'San Juan reports 1st U.S. Zika-related death amid outbreak',
 'Flu outbreak in Galveston, Texas',
 'Zika alert – Manila now threatened',
 'Zika afflicts 7 in Iloilo City',
 'New Los Angeles Hairstyle goes Viral',
 'Louisiana Zika cases up to 26',
 'Orlando volunteers aid Zika research',
 'Zika infects pregnant woman in Cebu',
 "Chicago's First Zika Case Confirmed",
 'Tampa Bay Area Zika Case Count Climbs',
 'Bad Water Leads to Sickness in Flint, Michigan',
 'Baltimore plans for 

In [4]:
headlines.sort()

headlines

['18 new Zika Cases in Bogota',
 '19 new Zika Cases in Sengkang',
 'Alameda Residents Recieve Rabies vaccine',
 'Albany Residents Recieve Respiratory Syncytial Virus vaccine',
 'Antipolo under threat from Zika Virus',
 'Arhus is infested with Bronchitis',
 'Arvada is infested with Syphilis',
 'Authorities a Miami',
 'Authorities are Worried about the Spread of Bronchitis in Silver Spring',
 'Authorities are Worried about the Spread of Chickenpox in Hemet',
 'Authorities are Worried about the Spread of Chickenpox in Richmond',
 'Authorities are Worried about the Spread of Dengue in Kingston',
 'Authorities are Worried about the Spread of Gonorrhea in Taoyuan City',
 'Authorities are Worried about the Spread of Hepatitis B in Yiwu',
 'Authorities are Worried about the Spread of Hepatitis D in Akron',
 'Authorities are Worried about the Spread of Hepatitis D in Ganja',
 'Authorities are Worried about the Spread of Hepatitis D in North Bay',
 'Authorities are Worried about the Spread of In

Comments on the cities appearing in each headline:

- Identifying city names requires external reference data, with accent marks removed from both the reference names and the headlines.
- Some city names consist of multiple words.
- City names require case-insensitive matching.
- Not every headline includes a country name.

## Workflow #2 ##
Using regular expressions and the cities and countries within the geonamescache library, match any cities/countries within each headline. 

- Make sure to normalize headlines and city/country names by removing accent marks. This can be done with the unidecode library.

In [6]:
from unidecode import unidecode
import re

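def name_to_regex(name):
    """Compile a case-insensitive, whole-word regex for a place name."""
    decoded_name = unidecode(name)
    if name != decoded_name:
        # Match either the accented spelling or its ASCII transliteration
        regex = fr'\b({name}|{decoded_name})\b'
    else:
        regex = fr'\b{name}\b'
    return re.compile(regex, flags=re.IGNORECASE)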
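
Place names can contain characters that are special in regular expressions, such as the period in "St. Louis", which matches any character when interpolated unescaped. A hedged variant that escapes each spelling with `re.escape` might look like the following. The function name `names_to_regex` is hypothetical, and the caller is assumed to pass in both the accented and the unaccented spelling.

```python
import re


def names_to_regex(*alternatives):
    """Compile a whole-word, case-insensitive regex for the given spellings."""
    # dict.fromkeys removes duplicate spellings while keeping their order
    escaped = [re.escape(alt) for alt in dict.fromkeys(alternatives)]
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b', flags=re.IGNORECASE)


st_louis = names_to_regex('St. Louis')
bogota = names_to_regex('Bogotá', 'Bogota')

print(bool(st_louis.search('Mosquito control efforts in St. Louis')))
print(bool(st_louis.search('Mosquito control efforts in Stx Louis')))
print(bogota.search('18 new Zika Cases in Bogota').group(0))
```

Escaping changes the pattern text that appears when the compiled regex is printed (spaces, for instance, are shown with a backslash), but not which plain names it matches.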

In [7]:
unidecode('Shibirghān')

'Shibirghan'
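
For comparison, accent marks can also be removed with the standard library alone. This is a different technique from `unidecode`: it decomposes each character with `unicodedata.normalize` and drops the combining marks, so letters that have no decomposition (such as "ø") are left unchanged. The helper name `strip_accents` is an assumption made for this sketch.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, leaving the base letters."""
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents('Shibirghān'))
print(strip_accents('São Paulo'))
print(strip_accents('Tromsø'))
```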

External data: GeonamesCache supplies the city names and country names.

In [9]:
from geonamescache import GeonamesCache
gc = GeonamesCache()

countries = [country['name'] for country in gc.get_countries().values()]
country_to_name = {name_to_regex(name): name for name in countries}

cities = [city['name'] for city in gc.get_cities().values()]
city_to_name = {name_to_regex(name): name for name in cities}

In [10]:
countries

['Andorra',
 'United Arab Emirates',
 'Afghanistan',
 'Antigua and Barbuda',
 'Anguilla',
 'Albania',
 'Armenia',
 'Angola',
 'Antarctica',
 'Argentina',
 'American Samoa',
 'Austria',
 'Australia',
 'Aruba',
 'Aland Islands',
 'Azerbaijan',
 'Bosnia and Herzegovina',
 'Barbados',
 'Bangladesh',
 'Belgium',
 'Burkina Faso',
 'Bulgaria',
 'Bahrain',
 'Burundi',
 'Benin',
 'Saint Barthelemy',
 'Bermuda',
 'Brunei',
 'Bolivia',
 'Bonaire, Saint Eustatius and Saba ',
 'Brazil',
 'Bahamas',
 'Bhutan',
 'Bouvet Island',
 'Botswana',
 'Belarus',
 'Belize',
 'Canada',
 'Cocos Islands',
 'Democratic Republic of the Congo',
 'Central African Republic',
 'Republic of the Congo',
 'Switzerland',
 'Ivory Coast',
 'Cook Islands',
 'Chile',
 'Cameroon',
 'China',
 'Colombia',
 'Costa Rica',
 'Cuba',
 'Cabo Verde',
 'Curacao',
 'Christmas Island',
 'Cyprus',
 'Czechia',
 'Germany',
 'Djibouti',
 'Denmark',
 'Dominica',
 'Dominican Republic',
 'Algeria',
 'Ecuador',
 'Estonia',
 'Egypt',
 'Western Saha

In [11]:
country_to_name

{re.compile(r'\bAndorra\b', re.IGNORECASE|re.UNICODE): 'Andorra',
 re.compile(r'\bUnited Arab Emirates\b',
 re.IGNORECASE|re.UNICODE): 'United Arab Emirates',
 re.compile(r'\bAfghanistan\b', re.IGNORECASE|re.UNICODE): 'Afghanistan',
 re.compile(r'\bAntigua and Barbuda\b',
 re.IGNORECASE|re.UNICODE): 'Antigua and Barbuda',
 re.compile(r'\bAnguilla\b', re.IGNORECASE|re.UNICODE): 'Anguilla',
 re.compile(r'\bAlbania\b', re.IGNORECASE|re.UNICODE): 'Albania',
 re.compile(r'\bArmenia\b', re.IGNORECASE|re.UNICODE): 'Armenia',
 re.compile(r'\bAngola\b', re.IGNORECASE|re.UNICODE): 'Angola',
 re.compile(r'\bAntarctica\b', re.IGNORECASE|re.UNICODE): 'Antarctica',
 re.compile(r'\bArgentina\b', re.IGNORECASE|re.UNICODE): 'Argentina',
 re.compile(r'\bAmerican Samoa\b', re.IGNORECASE|re.UNICODE): 'American Samoa',
 re.compile(r'\bAustria\b', re.IGNORECASE|re.UNICODE): 'Austria',
 re.compile(r'\bAustralia\b', re.IGNORECASE|re.UNICODE): 'Australia',
 re.compile(r'\bAruba\b', re.IGNORECASE|re.UNICODE): '

In [12]:
cities

['Andorra la Vella',
 'Umm Al Quwain City',
 'Ras Al Khaimah City',
 'Zayed City',
 'Khawr Fakkān',
 'Dubai',
 'Dibba Al-Fujairah',
 'Dibba Al-Hisn',
 'Sharjah',
 'Ar Ruways',
 'Al Fujairah City',
 'Al Ain City',
 'Ajman City',
 'Adh Dhayd',
 'Abu Dhabi',
 'Khalifah A City',
 'Bani Yas City',
 'Musaffah',
 'Al Shamkhah City',
 'Reef Al Fujairah City',
 'Zaranj',
 'Taloqan',
 'Shīnḏanḏ',
 'Shibirghān',
 'Shahrak',
 'Sar-e Pul',
 'Sang-e Chārak',
 'Aībak',
 'Rustāq',
 'Qarqīn',
 'Qarāwul',
 'Pul-e Khumrī',
 'Paghmān',
 'Nahrīn',
 'Maymana',
 'Mehtar Lām',
 'Mazār-e Sharīf',
 'Lashkar Gāh',
 'Kushk',
 'Kunduz',
 'Khōst',
 'Khulm',
 'Khāsh',
 'Khanabad',
 'Karukh',
 'Kandahār',
 'Kabul',
 'Jalālābād',
 'Jabal os Saraj',
 'Herāt',
 'Ghormach',
 'Ghazni',
 'Gereshk',
 'Gardez',
 'Fayzabad',
 'Farah',
 'Kafir Qala',
 'Charikar',
 'Baraki Barak',
 'Bāmyān',
 'Balkh',
 'Baghlān',
 'Ārt Khwājah',
 'Āsmār',
 'Asadābād',
 'Andkhōy',
 'Bāzārak',
 'Markaz-e Woluswalī-ye Āchīn',
 'Saint John’s',
 'Th

In [13]:
city_to_name

{re.compile(r'\bAndorra la Vella\b',
 re.IGNORECASE|re.UNICODE): 'Andorra la Vella',
 re.compile(r'\bUmm Al Quwain City\b',
 re.IGNORECASE|re.UNICODE): 'Umm Al Quwain City',
 re.compile(r'\bRas Al Khaimah City\b',
 re.IGNORECASE|re.UNICODE): 'Ras Al Khaimah City',
 re.compile(r'\bZayed City\b', re.IGNORECASE|re.UNICODE): 'Zayed City',
 re.compile(r'\b(Khawr Fakkān|Khawr Fakkan)\b',
 re.IGNORECASE|re.UNICODE): 'Khawr Fakkān',
 re.compile(r'\bDubai\b', re.IGNORECASE|re.UNICODE): 'Dubai',
 re.compile(r'\bDibba Al-Fujairah\b',
 re.IGNORECASE|re.UNICODE): 'Dibba Al-Fujairah',
 re.compile(r'\bDibba Al-Hisn\b', re.IGNORECASE|re.UNICODE): 'Dibba Al-Hisn',
 re.compile(r'\bSharjah\b', re.IGNORECASE|re.UNICODE): 'Sharjah',
 re.compile(r'\bAr Ruways\b', re.IGNORECASE|re.UNICODE): 'Ar Ruways',
 re.compile(r'\bAl Fujairah City\b',
 re.IGNORECASE|re.UNICODE): 'Al Fujairah City',
 re.compile(r'\bAl Ain City\b', re.IGNORECASE|re.UNICODE): 'Al Ain City',
 re.compile(r'\bAjman City\b', re.IGNORECASE|re.U

## Workflow #3, 4 ##

Put the extracted data into a pandas DataFrame with three columns: headline, city, country.

Watch out for headlines that match multiple cities and for matches on short words. The match should cover the entire city name (for example, San Marino) rather than a fragment of it (San).

Make sure there were no issues with the extraction by sampling some of the headlines and examining the city and country names. One method for finding problems is to look for the most common names and see if there are any issues. 

In [14]:
def get_name_in_text(text, dictionary):
    """Return the alphabetically first name whose regex matches text, or None."""
    for regex, name in sorted(dictionary.items(), key=lambda x: x[1]):
        if regex.search(text):
            return name
    return None

In [15]:
import pandas as pd

matched_countries = [get_name_in_text(headline, country_to_name) for headline in headlines] 
matched_cities = [get_name_in_text(headline, city_to_name) for headline in headlines]  # single city per headline

headline_city_country_data = {'Headline': headlines, 'City': matched_cities, 'Country': matched_countries}
df_headline_city_country = pd.DataFrame(headline_city_country_data)
df_headline_city_country

| | Headline | City | Country |
|---|---|---|---|
| 0 | 18 new Zika Cases in Bogota | Bogotá | |
| 1 | 19 new Zika Cases in Sengkang | Sengkang | |
| 2 | Alameda Residents Recieve Rabies vaccine | Alameda | |
| 3 | Albany Residents Recieve Respiratory Syncytial... | Albany | |
| 4 | Antipolo under threat from Zika Virus | Antipolo | |
| 5 | Arhus is infested with Bronchitis | Århus | |
| 6 | Arvada is infested with Syphilis | Arvada | |
| 7 | Authorities a Miami | Miami | |
| 8 | Authorities are Worried about the Spread of Br... | Of | |
| 9 | Authorities are Worried about the Spread of Ch... | Hemet | |


In [16]:
summary = df_headline_city_country[['City', 'Country']].describe()
print(summary)

       City Country
count   619      15
unique  510      10
top      Of  Brazil
freq     45       3

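"Of" is the most common city name, but it is a false positive: the case-insensitive regex for the city named Of matches the preposition "of" in many headlines.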

In [18]:
of_cities = df_headline_city_country[df_headline_city_country.City == 'Of'][['City', 'Headline']]
print(of_cities.to_string(index=False))

City                                           Headline
  Of  Authorities are Worried about the Spread of Br...
  Of  Authorities are Worried about the Spread of Ch...
  Of  Authorities are Worried about the Spread of Go...
  Of  Authorities are Worried about the Spread of He...
  Of  Authorities are Worried about the Spread of In...
  Of  Authorities are Worried about the Spread of Ma...
  Of  Authorities are Worried about the Spread of Ro...
  Of  Authorities are Worried about the Spread of Sy...
  Of             Case of Measles Reported in Springdale
  Of              Case of Measles Reported in Vancouver
  Of            Case of Norovirus Reported in Stratford
  Of              Case of Swine Flu Reported in Tbilisi
  Of      Case of West Nile Virus Reported in Riverside
  Of                       Outbreak of Zika in Alvorada
  Of                         Outbreak of Zika in Destin
  Of                       Outbreak of Zika in Hutchins
  Of                   Outbreak of Zika in Palm 
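
The listing above suggests the cause: each of these headlines contains the lowercase word "of", and the city regexes are compiled with `re.IGNORECASE`. The minimal sketch below reproduces the false positive for one headline and shows that checking the capitalization of the matched text is enough to tell it apart from a real place name. The variable names are illustrative.

```python
import re

# Same pattern shape as the notebook builds for the city entry named "Of"
of_regex = re.compile(r'\bOf\b', flags=re.IGNORECASE)

headline = 'Outbreak of Zika in Alvorada'
match = of_regex.search(headline)

# The text that actually matched is the lowercase preposition
matched_text = match.group(0)
is_capitalized = matched_text[0].isupper()

print(matched_text, is_capitalized)
```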

In [19]:
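def get_cities_in_headline(headline):
    """Return every city whose first match in the headline is capitalized."""
    cities_in_headline = set()
    for regex, name in city_to_name.items():
        match = regex.search(headline)
        # Keep the match only if it starts with an uppercase letter,
        # which filters out common lowercase words such as "of"
        if match and headline[match.start()].isupper():
            cities_in_headline.add(name)

    return list(cities_in_headline)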
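
One limitation of using `regex.search` here is that it only inspects the first occurrence. If a word appears in lowercase before the capitalized place name, the city is skipped. A possible variant, sketched under the assumption of a small hand-built dictionary (`demo_regex_to_name` is hypothetical), checks every occurrence with `finditer`:

```python
import re


def get_capitalized_matches(headline, regex_to_name):
    """Return the names with at least one capitalized match in the headline."""
    found = set()
    for regex, name in regex_to_name.items():
        # Look at every occurrence, not just the first one
        if any(m.group(0)[0].isupper() for m in regex.finditer(headline)):
            found.add(name)
    return sorted(found)


# Hypothetical stand-in for the notebook's city_to_name dictionary
demo_regex_to_name = {
    re.compile(r'\bOf\b', re.IGNORECASE): 'Of',
    re.compile(r'\bReading\b', re.IGNORECASE): 'Reading',
}

print(get_capitalized_matches('Locals are reading about Zika in Reading',
                              demo_regex_to_name))
print(get_capitalized_matches('Outbreak of Zika in Alvorada', demo_regex_to_name))
```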

In [21]:
df_headline_city_country['Cities'] = df_headline_city_country['Headline'].apply(get_cities_in_headline)  # return multiple cities (if any) per headline
df_headline_city_country['Num_cities'] = df_headline_city_country['Cities'].apply(len)

In [25]:
df_headline_city_country[df_headline_city_country.Num_cities > 1].sort_values('Num_cities', ascending=False)

| | Headline | City | Country | Cities | Num_cities |
|---|---|---|---|---|---|
| 627 | Zika spreads to San Luis Potosi | Potosí | | [San Luis, San Luis Potosí, San, Potosí] | 4 |
| 551 | Zika Reported in North Miami Beach | Miami | | [Miami, North Miami, North Miami Beach, Miami ... | 4 |
| 647 | Zika worries in San Salvador | Salvador | | [San, San Salvador, Salvador] | 3 |
| 348 | Pneumonia Exposure in San Jose | San | | [San Jose, San, San José] | 3 |
| 379 | Rumors about Hepatitis D Spreading in San Juan... | San | | [San Juan Capistrano, San, San Juan] | 3 |
| 301 | New Zika Case in Kota Kinabalu, Malaysia | Kota | Malaysia | [Kota, Kotā, Kota Kinabalu] | 3 |
| 363 | Rhinovirus Comes to San Jose | San | | [San Jose, San, San José] | 3 |
| 610 | Zika only the latest mosquito-borne threat to ... | Borne | | [Orleans, New Orleans, Orléans] | 3 |
| 586 | Zika arrives in West Palm Beach | Palm Beach | | [West Palm Beach, Palm Beach] | 2 |
| 546 | Zika Outbreak in Wichita Falls | Wichita | | [Wichita, Wichita Falls] | 2 |


In [27]:
# Choose a single city with the longest name when multiple cities are matched in one headline 

def get_longest_city(cities):
    if cities:
        return max(cities, key=len)
    return None

df_headline_city_country['City'] = df_headline_city_country['Cities'].apply(get_longest_city)
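
Picking the longest name works when every match is a fragment of the same place, but it would discard a second, unrelated city mentioned in the same headline. An alternative sketch, which is not what the notebook does, drops only those names that are contained in another matched name. The helper name `drop_contained_names` is hypothetical.

```python
def drop_contained_names(cities):
    """Keep only the names that are not substrings of another matched name."""
    return [city for city in cities
            if not any(city != other and city in other for other in cities)]


print(drop_contained_names(['San', 'San Salvador', 'Salvador']))
print(drop_contained_names(['Wichita', 'Wichita Falls']))
print(drop_contained_names(['Miami', 'Recife']))
```

Accented and unaccented variants of the same name (for example 'San Jose' and 'San José') are not substrings of each other, so both survive this filter and would still need to be reconciled.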

In [29]:
df_headline_city_country[df_headline_city_country.Num_cities > 1].sort_values('Num_cities', ascending=False)

| | Headline | City | Country | Cities | Num_cities |
|---|---|---|---|---|---|
| 627 | Zika spreads to San Luis Potosi | San Luis Potosí | | [San Luis, San Luis Potosí, San, Potosí] | 4 |
| 551 | Zika Reported in North Miami Beach | North Miami Beach | | [Miami, North Miami, North Miami Beach, Miami ... | 4 |
| 647 | Zika worries in San Salvador | San Salvador | | [San, San Salvador, Salvador] | 3 |
| 348 | Pneumonia Exposure in San Jose | San Jose | | [San Jose, San, San José] | 3 |
| 379 | Rumors about Hepatitis D Spreading in San Juan... | San Juan Capistrano | | [San Juan Capistrano, San, San Juan] | 3 |
| 301 | New Zika Case in Kota Kinabalu, Malaysia | Kota Kinabalu | Malaysia | [Kota, Kotā, Kota Kinabalu] | 3 |
| 363 | Rhinovirus Comes to San Jose | San Jose | | [San Jose, San, San José] | 3 |
| 610 | Zika only the latest mosquito-borne threat to ... | New Orleans | | [Orleans, New Orleans, Orléans] | 3 |
| 586 | Zika arrives in West Palm Beach | West Palm Beach | | [West Palm Beach, Palm Beach] | 2 |
| 546 | Zika Outbreak in Wichita Falls | Wichita Falls | | [Wichita, Wichita Falls] | 2 |


Confirm that no erroneous short city name (four characters or fewer) is assigned to any headline.

In [30]:
short_cities = df_headline_city_country[df_headline_city_country.City.str.len() <= 4][['City', 'Headline']]
print(short_cities.to_string(index=False))

 City                                           Headline
 Yiwu  Authorities are Worried about the Spread of He...
 Rome  Authorities are Worried about the Spread of Ma...
 Kobe                     Chikungunya re-emerges in Kobe
 Bonn  Contaminated Meat Brings Trouble for Bonn Farmers
 Erie                        Erie County sets Zika traps
 Kent                       Kent is infested with Rabies
 Lima                Lima tries to address Zika Concerns
 Lyon                   Mad Cow Disease Detected in Lyon
 Molo                Molo Cholera Spread Causing Concern
 Waco                More Zika patients reported in Waco
 Nadi  More people in Nadi are infected with HIV ever...
 Pune                     Pune woman diagnosed with Zika
 Baud  Rumors about Tuberculosis Spreading in Baud ha...
 Suva  Suva authorities confirmed the spread of Rotav...
 Reno  The Spread of Gonorrhea in Reno has been Confi...
 Baku    The Spread of Herpes in Baku has been Confirmed
 Jaén                         Z

Headlines with a matched country:

In [37]:
df_countries = df_headline_city_country[df_headline_city_country.Country.notnull()][['City', 'Country', 'Headline']]
print(df_countries.to_string(index=False))
print(f"{len(df_countries)} headlines with country matches.")

             City    Country                                           Headline
      Belize City     Belize                 Belize City under threat from Zika
           Recife     Brazil            Mystery Virus Spreads in Recife, Brazil
    Kota Kinabalu   Malaysia           New Zika Case in Kota Kinabalu, Malaysia
        Hong Kong  Hong Kong                    Norovirus Exposure in Hong Kong
      Panama City     Panama                    Outbreak of Zika in Panama City
           Panamá     Panama             Panama City’s first Zika related death
   Guatemala City  Guatemala  Rumors about Meningitis spreading in Guatemala...
         Campinas     Brazil                   Student sick in Campinas, Brazil
          Bangkok   Thailand                     Thailand-Zika Virus in Bangkok
        Singapore  Singapore                  Zika cases in Singapore reach 393
 Ho Chi Minh City    Vietnam     Zika cases in Vietnam's Ho Chi Minh City surge
       Piracicaba     Brazil            

Look for the most common city and country names and check them for issues:

In [35]:
df_headline_city_country[['City', 'Country']].describe()

| | City | Country |
|---|---|---|
| count | 611 | 15 |
| unique | 577 | 10 |
| top | Madrid | Brazil |
| freq | 4 | 3 |


Headlines with no matched city:

In [36]:
df_unmatched = df_headline_city_country[df_headline_city_country.City.isnull()]
num_unmatched = len(df_unmatched)
print(f"{num_unmatched} headlines contain no city matches.")
df_unmatched

39 headlines contain no city matches.


| | Headline | City | Country | Cities | Num_cities |
|---|---|---|---|---|---|
| 45 | Bronchitis Outbreak in Manhasset | | | [] | 0 |
| 72 | Chikungunya has not Left Pismo Beach | | | [] | 0 |
| 99 | Fort Belvoir tests new cure for Hepatitis C | | | [] | 0 |
| 111 | Greenwich Establishes Zika Task Force | | | [] | 0 |
| 113 | Gympie Patient in Critical Condition after Con... | | | [] | 0 |
| 145 | Hillsborough uses innovative trap against Zika... | | | [] | 0 |
| 182 | Longwood volunteers spreading Zika awareness | | | [] | 0 |
| 183 | Louisiana Zika cases up to 26 | | | [] | 0 |
| 207 | Maka City Experiences Influenza Outbreak | | | [] | 0 |
| 208 | Malaria Exposure in Sussex | | | [] | 0 |


Map each matched city to its country using external data from GeonamesCache:

In [38]:
from geonamescache import GeonamesCache
gc = GeonamesCache()

# Fetch each lookup table once instead of once per city
all_cities = list(gc.get_cities().values())
all_countries = gc.get_countries()

city_list = [city['name'] for city in all_cities]
country_code_list = [city['countrycode'] for city in all_cities]
country_list = [all_countries[country_code].get('name') for country_code in country_code_list]

city_country_data = {'City': city_list, 'Country': country_list}
df_city_country = pd.DataFrame(city_country_data)

df_city_country

| | City | Country |
|---|---|---|
| 0 | Andorra la Vella | Andorra |
| 1 | Umm Al Quwain City | United Arab Emirates |
| 2 | Ras Al Khaimah City | United Arab Emirates |
| 3 | Zayed City | United Arab Emirates |
| 4 | Khawr Fakkān | United Arab Emirates |
| 5 | Dubai | United Arab Emirates |
| 6 | Dibba Al-Fujairah | United Arab Emirates |
| 7 | Dibba Al-Hisn | United Arab Emirates |
| 8 | Sharjah | United Arab Emirates |
| 9 | Ar Ruways | United Arab Emirates |


In [39]:
# City names are not unique; when a name repeats, the last entry wins
dict_city_country = dict(zip(df_city_country['City'], df_city_country['Country']))


In [41]:
headline_country_list = [''.join(dict_city_country.get(city_name, "")) for city_name in df_headline_city_country['City'].to_list()]
df_headline_city_country['Mapped Country'] = headline_country_list

df_headline_city_country

| | Headline | City | Country | Cities | Num_cities | Mapped Country |
|---|---|---|---|---|---|---|
| 0 | 18 new Zika Cases in Bogota | Bogotá | | [Bogotá] | 1 | Colombia |
| 1 | 19 new Zika Cases in Sengkang | Sengkang | | [Sengkang] | 1 | Indonesia |
| 2 | Alameda Residents Recieve Rabies vaccine | Alameda | | [Alameda] | 1 | United States |
| 3 | Albany Residents Recieve Respiratory Syncytial... | Albany | | [Albany] | 1 | United States |
| 4 | Antipolo under threat from Zika Virus | Antipolo | | [Antipolo] | 1 | Philippines |
| 5 | Arhus is infested with Bronchitis | Århus | | [Århus] | 1 | Denmark |
| 6 | Arvada is infested with Syphilis | Arvada | | [Arvada] | 1 | United States |
| 7 | Authorities a Miami | Miami | | [Miami] | 1 | United States |
| 8 | Authorities are Worried about the Spread of Br... | Silver Spring | | [Spring, Silver Spring] | 2 | United States |
| 9 | Authorities are Worried about the Spread of Ch... | Hemet | | [Hemet] | 1 | United States |


In [43]:
df_headline_city_country[df_headline_city_country.Country.notnull()][['City', 'Country', 'Mapped Country', 'Headline']]

| | City | Country | Mapped Country | Headline |
|---|---|---|---|---|
| 36 | Belize City | Belize | Belize | Belize City under threat from Zika |
| 293 | Recife | Brazil | Brazil | Mystery Virus Spreads in Recife, Brazil |
| 301 | Kota Kinabalu | Malaysia | Malaysia | New Zika Case in Kota Kinabalu, Malaysia |
| 316 | Hong Kong | Hong Kong | Hong Kong | Norovirus Exposure in Hong Kong |
| 333 | Panama City | Panama | United States | Outbreak of Zika in Panama City |
| 339 | Panamá | Panama | Panama | Panama City’s first Zika related death |
| 384 | Guatemala City | Guatemala | Guatemala | Rumors about Meningitis spreading in Guatemala... |
| 450 | Campinas | Brazil | Brazil | Student sick in Campinas, Brazil |
| 459 | Bangkok | Thailand | Thailand | Thailand-Zika Virus in Bangkok |
| 603 | Singapore | Singapore | Singapore | Zika cases in Singapore reach 393 |
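
Row 333 shows the weakness of a one-to-one city lookup: the same city name can exist in more than one country, so the mapped country can disagree with the country matched in the headline. A hedged sketch of one way to handle this keeps every candidate country per city and prefers the one already matched in the headline. The hand-written `city_country_pairs` list stands in for the GeonamesCache data, and `resolve_country` is a hypothetical helper.

```python
from collections import defaultdict

# Stand-in for (city, country) pairs derived from the external data
city_country_pairs = [
    ('Vancouver', 'Canada'),
    ('Vancouver', 'United States'),
    ('Recife', 'Brazil'),
]

# Collect every country in which a given city name occurs
city_to_countries = defaultdict(set)
for city, country in city_country_pairs:
    city_to_countries[city].add(country)


def resolve_country(city, matched_country=None):
    """Return a country for the city, or None when the name is ambiguous."""
    candidates = city_to_countries.get(city, set())
    # Prefer the country that was already matched in the headline text
    if matched_country in candidates:
        return matched_country
    # Otherwise accept the mapping only when it is unambiguous
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


print(resolve_country('Recife'))
print(resolve_country('Vancouver'))
print(resolve_country('Vancouver', matched_country='Canada'))
```

Returning `None` for ambiguous names leaves those rows visible for manual review instead of silently assigning whichever entry happened to come last.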


## Workflow #5 ##
Once you are confident you’ve found all the cities/countries in each headline, save the DataFrame for the next part.

In [None]:
df_cleaned_headline_city_country = df_headline_city_country[['City', 'Country', 'Headline']]
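
The cell above selects the columns to keep but does not write them anywhere. A minimal sketch of the save step, assuming CSV is an acceptable format, is shown below. The file name `headline_city_country.csv`, the temporary directory, and the two-row `df_demo` frame are placeholders chosen for this example.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for df_cleaned_headline_city_country
df_demo = pd.DataFrame({
    'City': ['Recife', 'Miami'],
    'Country': ['Brazil', None],
    'Headline': ['Mystery Virus Spreads in Recife, Brazil',
                 'Zika Outbreak Hits Miami'],
})

# Placeholder location; a real notebook would use its own data directory
output_path = os.path.join(tempfile.mkdtemp(), 'headline_city_country.csv')

# index=False avoids writing the row index as an extra unnamed column
df_demo.to_csv(output_path, index=False)

# Reload to confirm the file round-trips as expected
df_reloaded = pd.read_csv(output_path)
print(df_reloaded)
```

Writing with `index=False` matters for the next part: without it, reloading the file produces an additional `Unnamed: 0` column holding the old index.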