# Part 1 of 5
## Let's parse the news headlines
Things to do:
Load in the headline data and examine it for any data quality issues.

Use any library/data structure to read in the headlines.

Read through some of the headlines and identify potential problems.

In [26]:
# Start by importing the libraries we need
import pandas as pd

'c:\\Users\\narus\\JC_SDrive\\VSC\\Github\\Disease_Outbreak'

## Create headlines df and load in the headline data
### Look for some data quality issues

In [28]:
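# Create a DataFrame using pandas
# The separator is the literal string 'None', which is assumed never to occur in the headlines, so each line is read as a single column
# With the default comma separator, headlines such as "Mystery Virus Spreads in Recife, Brazil" would be split
# A multi-character separator is treated as a regular expression, so we request the Python parsing engine explicitly
# Encoding must be set to utf-8 so that dashes and accents show up normally
headlines = pd.read_csv('data/headlines.txt', header=None, sep='None', engine='python', encoding='utf-8')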

# Select the single column and cast it to str so that the regex searches behind matched_cities and matched_countries receive strings
headlines = headlines[0].astype(str)

# let's check number of headlines by checking len()
num_headlines = len(headlines)
print(f"{num_headlines} headlines have been loaded")


650 headlines have been loaded
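
As an alternative to the separator trick above, here is a minimal sketch that reads one headline per line with plain Python. The helper name `read_headlines` and the throwaway demo file are assumptions for illustration; they are not part of the original notebook.

```python
import tempfile
from pathlib import Path

def read_headlines(path):
    """Return the non-empty lines of a UTF-8 text file, stripped of whitespace."""
    with open(path, encoding='utf-8') as handle:
        return [line.strip() for line in handle if line.strip()]

# Demo on a throwaway file so the sketch runs without data/headlines.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'demo_headlines.txt'
    demo_path.write_text(
        'Zika Outbreak Hits Miami\nMystery Virus Spreads in Recife, Brazil\n\n',
        encoding='utf-8',
    )
    demo_headlines = read_headlines(demo_path)

print(f"{len(demo_headlines)} headlines have been loaded")
```

Reading line by line keeps commas inside a headline intact and skips blank lines, at the cost of returning a list rather than a pandas Series.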


In [29]:
# Display the headlines (pandas truncates long output to the first and last five rows)
headlines

0                               Zika Outbreak Hits Miami
1                        Could Zika Reach New York City?
2                      First Case of Zika in Miami Beach
3                Mystery Virus Spreads in Recife, Brazil
4                Dallas man comes down with case of Zika
                             ...                        
645    Rumors about Rabies spreading in Jerusalem hav...
646                More Zika patients reported in Indang
647    Suva authorities confirmed the spread of Rotav...
648           More Zika patients reported in Bella Vista
649                       Zika Outbreak in Wichita Falls
Name: 0, Length: 650, dtype: object
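
Reading through headlines by eye only goes so far, so a few automated checks can help with the data quality step. The sketch below counts blank lines, exact duplicates, and headlines containing non-ASCII characters. The function name `audit_headlines` and the sample list are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

def audit_headlines(lines):
    """Return simple data-quality counts for a list of headline strings."""
    stripped = [line.strip() for line in lines]
    counts = Counter(stripped)
    return {
        'total': len(lines),
        'blank': sum(1 for line in stripped if not line),
        # every repeat beyond the first copy of a non-blank headline
        'duplicates': sum(n - 1 for text, n in counts.items() if text and n > 1),
        'non_ascii': sum(1 for line in stripped if not line.isascii()),
    }

# Hypothetical sample with one duplicate, one blank line, and one accented name
sample = [
    'Zika Outbreak Hits Miami',
    'Zika Outbreak Hits Miami',
    '',
    'Zika Infested Monkeys in São Paulo ',
]
report = audit_headlines(sample)
print(report)
```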

## Extracting locations from headline Data

In [30]:
# Create a function that transforms a geonamescache (gc) location name into a case-insensitive, accent-insensitive regular expression
# This function will be used later to map each compiled regex back to the original name from geonamescache

import re
from unidecode import unidecode # import unidecode function from unidecode module

def name_to_regex(name):
    decoded_name = unidecode(name) 
    if name != decoded_name:
        regex = fr'\b({name}|{decoded_name})\b'
    else:
        regex = fr'\b{name}\b'
    return re.compile(regex, flags=re.IGNORECASE)

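# unidecode strips accents from characters (e.g. 'São Paulo' becomes 'Sao Paulo')
# if the name differs from its decoded form, the regex matches either the accented or the unaccented spelling
# the fr prefix makes a raw f-string: {name} is interpolated, while backslashes such as \b (word boundary) are kept literally
# re.IGNORECASE performs case-insensitive matching
# name_to_regex returns the compiled pattern (via re.compile) for a location name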
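
To make the matching behavior concrete, here is a small sketch that needs only the standard library. It assumes `unicodedata` as a stand-in for `unidecode` (the two are not equivalent for every script) and adds `re.escape`, which the notebook's function does not use. The names `strip_accents` and `name_to_regex_sketch` are hypothetical.

```python
import re
import unicodedata

def strip_accents(text):
    """Drop combining marks after decomposition (rough stand-in for unidecode)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_to_regex_sketch(name):
    """Compile a whole-word, case-insensitive pattern for a name and its unaccented form."""
    variants = {name, strip_accents(name)}
    pattern = '|'.join(re.escape(v) for v in sorted(variants))
    return re.compile(fr'\b({pattern})\b', flags=re.IGNORECASE)

regex = name_to_regex_sketch('São Paulo')
matches_plain = bool(regex.search('Zika Infested Monkeys in Sao Paulo'))
matches_accent = bool(regex.search('zika in são paulo'))
# \b keeps 'Lima' from matching inside the longer word 'Limassol'
partial = bool(name_to_regex_sketch('Lima').search('Outbreak in Limassol'))
print(matches_plain, matches_accent, partial)
```

Escaping the name guards against location names that contain regex metacharacters such as `.` or parentheses.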

In [31]:
# This is where we link the regex to country-names and city-names

import geonamescache
gc = geonamescache.GeonamesCache()

# Create list of countries based on existing list from gc using list comprehension
countries = [country['name']
             for country in gc.get_countries().values()]
# Create a dictionary to map regex Country to gc country name
country_to_name = {name_to_regex(name): name
                   for name in countries}

# Create a list of cities based on existing list from gc using list comprehension
cities = [city['name'] for city in gc.get_cities().values()]

# Create a dictionary to map regex City to gc city name
city_to_name = {name_to_regex(name): name 
                for name in cities}

In [32]:
# Let's create a function that returns the first gc location name found in a text (headline)
# It iterates over the (regex, name) pairs of country_to_name or city_to_name, sorted by name,
# and returns the name of the first regex that matches the text, or None if nothing matches


def get_name_in_text(text, dictionary):
    for regex, name in sorted(dictionary.items(),
                              key=lambda x: x[1]):
        if regex.search(text):
            return name
    return None

# for <var> in <iterable> (list, tuple, or other collection of objects):
#   <statement(s)> (run once per item)
# dictionary.items() returns an iterable view of the dictionary's (key, value) pairs
# key=lambda x: x[1] sorts those pairs by value, i.e. by location name
# A dictionary's iteration order depends on how it was built (and was not guaranteed at all before Python 3.7). A change in order could alter which location gets matched to the input text, especially when multiple locations are present in it. Sorting by location name ensures that the function output does not change from run to run.
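
The effect of sorting can be seen on a tiny lookup table. The sketch below builds the same three names in two different insertion orders and shows that the first match is identical either way. The helpers `compile_name` and `first_match` are hypothetical simplifications of the notebook's functions.

```python
import re

def compile_name(name):
    """Whole-word, case-insensitive pattern for a single name."""
    return re.compile(fr'\b{re.escape(name)}\b', flags=re.IGNORECASE)

def first_match(text, regex_to_name):
    """Return the alphabetically first name whose regex matches text, else None."""
    for regex, name in sorted(regex_to_name.items(), key=lambda item: item[1]):
        if regex.search(text):
            return name
    return None

names_a = ['Miami Beach', 'Miami', 'Dallas']
names_b = list(reversed(names_a))  # same names, different insertion order
lookup_a = {compile_name(n): n for n in names_a}
lookup_b = {compile_name(n): n for n in names_b}

headline = 'First Case of Zika in Miami Beach'
result_a = first_match(headline, lookup_a)
result_b = first_match(headline, lookup_b)
print(result_a, result_b)
```

Both lookups return `Miami`, because it sorts before `Miami Beach`. The result is reproducible, but it is also the shorter and less specific name.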

In [33]:
# Let's use the function get_name_in_text to find the cities and countries mentioned in the headlines list
# We will store the results in the df pandas table
matched_countries = [get_name_in_text(headline, country_to_name)
                     for headline in headlines]
matched_cities = [get_name_in_text(headline, city_to_name)
                  for headline in headlines]
data = {'Headline': headlines, 'City': matched_cities,
        'Country': matched_countries}
df = pd.DataFrame(data)

In [34]:
# Show the first 50 rows of df; this may take a few minutes to display
df.head(50)

   Headline                                 City           Country
0  Zika Outbreak Hits Miami                 Miami
1  Could Zika Reach New York City?          New York City
2  First Case of Zika in Miami Beach        Miami
3  Mystery Virus Spreads in Recife, Brazil  Recife         Brazil
4  Dallas man comes down with case of Zika  Dallas
5  Trinidad confirms first Zika case        Trinidad
6  Zika Concerns are Spreading in Houston   Houston
7  Geneve Scientists Battle to Find Cure    Genève
8  The CDC in Atlanta is Growing Worried    Atlanta
9  Zika Infested Monkeys in Sao Paulo       São Paulo


In [35]:
# Summarize the contents of the df using describe method
summary = df[['City', 'Country']].describe()
print(summary)

       City   Country
count   619        15
unique  510        10
top      Of  Malaysia
freq     45         3


In [36]:
# Show top 10 headlines with the city called "Of"
of_cities = df[df.City == 'Of'][['City', 'Headline']]
ten_of_cities = of_cities.head(10)
print(ten_of_cities.to_string(index=False))

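# there are 45 headlines with the city called "Of"
# these are false matches: the case-insensitive regex for the city "Of" also matches the word "of"
# we could require matched city names to start with a capital letter, but that alone would not handle multiple city matches in one headline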

City                                           Headline
  Of              Case of Measles Reported in Vancouver
  Of  Authorities are Worried about the Spread of Br...
  Of  Authorities are Worried about the Spread of Ma...
  Of  Rochester authorities confirmed the spread of ...
  Of     Tokyo Encounters Severe Symptoms of Meningitis
  Of  Authorities are Worried about the Spread of In...
  Of            Spike of Pneumonia Cases in Springfield
  Of  The Spread of Measles in Spokane has been Conf...
  Of                    Outbreak of Zika in Panama City
  Of    Urbana Encounters Severe Symptoms of Meningitis
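
To see why "Of" shows up as a city, the sketch below runs a case-insensitive pattern for `Of` against one of the headlines above, then applies a capitalization filter. The helper `capitalized_match` is hypothetical, and it uses `finditer` to inspect every occurrence in the text, which is an assumption that goes slightly beyond checking only the first match.

```python
import re

# Case-insensitive pattern for the city name "Of"
of_regex = re.compile(r'\bOf\b', flags=re.IGNORECASE)

def capitalized_match(regex, text):
    """Return True only if some match of regex in text starts with an upper-case letter."""
    return any(text[m.start()].isupper() for m in regex.finditer(text))

headline = 'Case of Measles Reported in Vancouver'
naive_hit = bool(of_regex.search(headline))             # matches the word "of"
filtered_hit = capitalized_match(of_regex, headline)    # rejects lower-case "of"
print(naive_hit, filtered_hit)
```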


In [37]:
# Let's find out how many headlines contain 2 or more city matches

def get_cities_in_headline(headline): # Will return a list of all unique cities in a headline
    cities_in_headline = set() # Create a new set
    for regex, name in city_to_name.items(): 
        match = regex.search(headline)
        if match:
            if headline[match.start()].isupper(): # Ensures first letter of city name is capitalized
                cities_in_headline.add(name)

    return list(cities_in_headline)

# for each (regex, gc name) pair in city_to_name, search the headline; if the regex matches and the first matched character is upper case, add the name to the cities_in_headline set

df['Cities'] = df['Headline'].apply(get_cities_in_headline)
df['Num_cities'] = df['Cities'].apply(len)
df_multiple_cities = df[df.Num_cities > 1]
num_rows, _ = df_multiple_cities.shape
print(f"{num_rows} headlines match multiple cities")

69 headlines match multiple cities


In [38]:
# Let's check what is happening in headlines with multiple city matches
# Print the first 10 such headlines and see which city names are being picked (long or short)

ten_cities = df_multiple_cities[['Cities', 'Headline']].head(10)
print(ten_cities.to_string(index=False))

# it appears that short, invalid city names are being matched alongside the correct ones in headlines with multiple city matches


Cities                                           Headline
         [York, New York City]                    Could Zika Reach New York City?
          [Miami Beach, Miami]                  First Case of Zika in Miami Beach
               [San Juan, San]  San Juan reports 1st U.S. Zika-related death a...
    [Los Angeles, Los Ángeles]               New Los Angeles Hairstyle goes Viral
                  [Tampa, Bay]              Tampa Bay Area Zika Case Count Climbs
        [Ho Chi Minh City, Ho]     Zika cases in Vietnam's Ho Chi Minh City surge
              [San Diego, San]           Key Zika Findings in San Diego Institute
           [Hīt, Kuala Lumpur]                 Kuala Lumpur is Hit By Zika Threat
          [San Francisco, San]                   Zika Virus Reaches San Francisco
 [San Salvador, Salvador, San]                       Zika worries in San Salvador


In [39]:
# We are going to assign the longest city-name as the representative city location if more than one matched city is found

def get_longest_city(cities):
    if cities:
        return max(cities, key=len)
    return None

df['City'] = df['Cities'].apply(get_longest_city)
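
A quick sketch of the longest-name rule on plain lists, using match lists like those printed earlier. The function name `longest_name` is hypothetical. Note that `max` returns the first element among ties, so when the candidate list is built from a set, the winner of a tie may differ between runs.

```python
def longest_name(names):
    """Return the longest string in names, or None for an empty list."""
    if names:
        return max(names, key=len)
    return None

picked = longest_name(['San Salvador', 'Salvador', 'San'])
tie = longest_name(['Bay', 'San'])   # equal lengths: the first element wins
empty = longest_name([])
print(picked, tie, empty)
```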

In [40]:
# Let's check if the headlines are being assigned a wrong short city name

short_cities = df[df.City.str.len() <= 4][['City', 'Headline']]
print(short_cities.to_string(index=False))

City                                           Headline
 Lima                Lima tries to address Zika Concerns
 Pune                     Pune woman diagnosed with Zika
 Rome  Authorities are Worried about the Spread of Ma...
 Molo                Molo Cholera Spread Causing Concern
 Miri                               Zika arrives in Miri
 Nadi  More people in Nadi are infected with HIV ever...
 Baud  Rumors about Tuberculosis Spreading in Baud ha...
 Kobe                     Chikungunya re-emerges in Kobe
 Waco                More Zika patients reported in Waco
 Erie                        Erie County sets Zika traps
 Kent                       Kent is infested with Rabies
 Reno  The Spread of Gonorrhea in Reno has been Confi...
 Sibu                      Zika symptoms spotted in Sibu
 Baku    The Spread of Herpes in Baku has been Confirmed
 Bonn  Contaminated Meat Brings Trouble for Bonn Farmers
 Jaen                         Zika Troubles come to Jaen
 Yuma                       Zika

In [41]:
# Moving on to checking countries
# Only 15 of the 650 headlines contain country info (see the describe() summary above), so let's check them manually

df_countries = df[df.Country.notnull()][['City',
                                         'Country',
                                         'Headline']]
print(df_countries.to_string(index=False))

City    Country                                           Headline
           Recife     Brazil            Mystery Virus Spreads in Recife, Brazil
 Ho Chi Minh City    Vietnam     Zika cases in Vietnam's Ho Chi Minh City surge
          Bangkok   Thailand                     Thailand-Zika Virus in Bangkok
       Piracicaba     Brazil                Zika outbreak in Piracicaba, Brazil
            Klang   Malaysia                   Zika surfaces in Klang, Malaysia
   Guatemala City  Guatemala  Rumors about Meningitis spreading in Guatemala...
      Belize City     Belize                 Belize City under threat from Zika
         Campinas     Brazil                   Student sick in Campinas, Brazil
      Mexico City     Mexico               Zika outbreak spreads to Mexico City
    Kota Kinabalu   Malaysia           New Zika Case in Kota Kinabalu, Malaysia
      Johor Bahru   Malaysia                 Zika reaches Johor Bahru, Malaysia
        Hong Kong  Hong Kong                    Norov

In [42]:
# Since all headlines with country information also have city information, we can assign latitude and longitude without relying on the country's coordinates
# Let's get rid of country names from our analysis

df.drop('Country', axis=1, inplace=True)
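
Rather than relying on a manual read of the table, the claim that every country match comes with a city match can be checked directly. The sketch below does this on plain dictionaries. The rows are a small hypothetical sample, and `countries_without_city` is an illustrative helper, not part of the notebook.

```python
def countries_without_city(rows):
    """Return rows that have a country match but no city match."""
    return [row for row in rows
            if row['Country'] is not None and row['City'] is None]

# Hypothetical sample: one row with both, one city-only row, one unmatched row
rows = [
    {'City': 'Recife', 'Country': 'Brazil'},
    {'City': 'Miami', 'Country': None},
    {'City': None, 'Country': None},
]
orphans = countries_without_city(rows)
safe_to_drop = not orphans   # True when no country would be lost without a city
print(safe_to_drop)
```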

In [43]:
# Check for rows where no locations were detected
# Count the number of unmatched headlines and then print a subset of that data

df_unmatched = df[df.City.isnull()]
num_unmatched = len(df_unmatched)
print(f"{num_unmatched} headlines contain no city matches.")
print(df_unmatched.head(10)[['Headline']].values)

39 headlines contain no city matches.
[['Louisiana Zika cases up to 26']
 ['Zika infects pregnant woman in Cebu']
 ['Spanish Flu Sighted in Antigua']
 ['Zika case reported in Oton']
 ['Hillsborough uses innovative trap against Zika 20 minutes ago']
 ['Maka City Experiences Influenza Outbreak']
 ['West Nile Virus Outbreak in Saint Johns']
 ['Malaria Exposure in Sussex']
 ['Greenwich Establishes Zika Task Force']
 ['Will West Nile Virus vaccine help Parsons?']]


In [44]:
# Since 39 headlines are only 6% of the 650 headlines, we are going to delete the headlines with no matched city
# geonamescache failed to identify a city in these headlines
# The price of deletion is a smaller dataset, but it should not materially affect our results because 94% of the headlines have a matched city

df = df[~df.City.isnull()][['City', 'Headline']]
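
The coverage figure quoted above follows from simple arithmetic on the counts printed earlier (650 headlines loaded, 39 without a city match). A short sketch:

```python
total_headlines = 650   # from the load step
unmatched = 39          # headlines with no city match

matched = total_headlines - unmatched
unmatched_share = unmatched / total_headlines

print(f"{matched} of {total_headlines} headlines kept ({unmatched_share:.0%} dropped)")
```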

In [45]:
# Let's see our final df!
df

     City           Headline
0    Miami          Zika Outbreak Hits Miami
1    New York City  Could Zika Reach New York City?
2    Miami Beach    First Case of Zika in Miami Beach
3    Recife         Mystery Virus Spreads in Recife, Brazil
4    Dallas         Dallas man comes down with case of Zika
...  ...            ...
645  Jerusalem      Rumors about Rabies spreading in Jerusalem hav...
646  Indang         More Zika patients reported in Indang
647  Suva           Suva authorities confirmed the spread of Rotav...
648  Bella Vista    More Zika patients reported in Bella Vista
