# Discovering Disease Outbreaks from News Headlines

Identifying and mapping epidemics is crucial to prevent or respond to deadly disease outbreaks. Your first assignment for the WHO is as follows:

- Extract the locations (city and/or country name) from each news headline.
- Find the geographic coordinates of each headline using the city/country.
- Cluster (group) the headlines based on the geographic location.
- Visualize the clusters on a map and analyze them for patterns indicating an epidemic.
- Investigate the largest clusters for signs of disease outbreaks.
- Review headlines in the largest clusters within the United States and around the world. If any disease outbreak is particularly dominant, visualize all worldwide mentions of that disease.
- Provide a summary of your findings to your superiors at the WHO so they can direct resources.

## 1. Parsing the News Headlines

**Objective**

Find any city and/or country names mentioned in each of the news headlines.

**Workflow**

1. Load in the headline data and examine it for any data quality issues.
   - Use any library/data structure to read in the headlines.
   - Read through some of the headlines and identify potential problems.
1. Using regular expressions and the cities and countries within the geonamescache library, match any cities/countries within each headline.
   - Make sure to normalize headlines and city/country names by removing accent marks. This can be done with the unidecode library.
   - Watch out for multiple cities in a headline and matches on short words! We want the match to be on the entire city (for example, San Marino) and not a partial match (San).
1. Put the extracted data into a pandas DataFrame with three columns: headline, city, country.
1. Make sure there were no issues with the extraction by sampling some of the headlines and examining the city and country names.
   - One method for finding problems is to look for the most common names and see if there are any issues.
1. Once you are confident you’ve found all the cities/countries in each headline, save the DataFrame for the next part.
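
The matching step in the workflow above can be sketched in isolation. This is a minimal illustration, not the notebook's implementation: the name list `sample_cities` is hypothetical, and the standard-library `unicodedata` module stands in for `unidecode` so the snippet has no third-party dependencies. It sorts names longest-first and wraps them in word boundaries so that "San Marino" wins over "San".

```python
import re
import unicodedata


def strip_accents(text):
    """Stdlib stand-in for unidecode: drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_name_regex(names):
    """Compile one alternation that prefers longer names and whole words."""
    cleaned = sorted({strip_accents(n) for n in names}, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in cleaned) + r")\b"
    return re.compile(pattern)


# Hypothetical sample; the notebook takes its names from geonamescache.
sample_cities = ["San", "San Marino", "São Paulo", "York", "New York City"]
city_regex = build_name_regex(sample_cities)


def find_city(headline):
    """Return the first whole-name match in the headline, or None."""
    match = city_regex.search(strip_accents(headline))
    return match.group(0) if match else None
```

Because the alternation is tried left to right, putting longer names first means a full name is preferred whenever a shorter name starts at the same position. The word boundaries prevent a name such as "San" from matching inside "Sandy".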

**Importance to project**

* We can’t do much with just the headlines; although they contain the city/country names, they do not contain the geographic information—latitude and longitude—we need to find clusters of disease outbreaks. The first step in getting the geographic information is to isolate the names.

* Later, we will use the names to find the location of each headline, which requires bringing in external data (through geonamescache).

* This workflow is common in data science. First, we separate the useful information from the noise—data mining—and then we augment it with external data—data engineering.

### Import all relevant libraries

In [1]:
# Regular expression
import re

# Data analysis and wrangling
import numpy as np
import pandas as pd
# Set display options for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
# Set the float precision (the number of places after the decimal point)
pd.set_option('display.precision', 3)

# Normalized unicode data (to remove accents)
import unidecode
 
## Visualization
# matplotlib
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns

# Ignore warning
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read the headlines, transliterate them to ASCII, and replace hyphens with spaces
file_contents = ""
with open("headlines.txt", "r") as file_handle:
    for line in file_handle:
        # Keep the unidecode result; calling it without assignment has no effect
        file_contents += unidecode.unidecode(line).replace("-", " ")
# View the first 500 characters
file_contents[:500]

'Zika Outbreak Hits Miami\nCould Zika Reach New York City?\nFirst Case of Zika in Miami Beach\nMystery Virus Spreads in Recife, Brazil\nDallas man comes down with case of Zika\nTrinidad confirms first Zika case\nZika Concerns are Spreading in Houston\nGeneve Scientists Battle to Find Cure\nThe CDC in Atlanta is Growing Worried\nZika Infested Monkeys in Sao Paulo\nBrownsville teen contracts Zika virus\nMosquito control efforts in St. Louis take new tactics with Zika threat\nSan Juan reports 1st U.S. Zika rela'
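
Before extracting any names, it can help to count a few common data-quality problems in the raw headlines. The sketch below is an assumption about what such a check might look like, not part of the original notebook. The helper `audit_headlines` and the `sample` list are hypothetical, and only the standard library is used.

```python
from collections import Counter


def audit_headlines(headlines):
    """Return simple data-quality counts for a list of headline strings."""
    counts = Counter(h.strip() for h in headlines)
    return {
        "total": len(headlines),
        # Blank lines, e.g. from a trailing newline in the file
        "empty": sum(1 for h in headlines if not h.strip()),
        # Accents or encoding debris that still need normalizing
        "non_ascii": sum(1 for h in headlines if not h.isascii()),
        # Leading or trailing whitespace
        "padded": sum(1 for h in headlines if h != h.strip()),
        # Distinct non-empty headlines that occur more than once
        "duplicated": sum(1 for h, n in counts.items() if h and n > 1),
    }


# Hypothetical sample data for illustration only
sample = [
    "Zika Outbreak Hits Miami",
    "Zika Outbreak Hits Miami",
    "Zika Infested Monkeys in São Paulo",
    "Trinidad confirms first Zika case ",
    "",
]
report = audit_headlines(sample)
```

A non-zero `non_ascii` count suggests the transliteration step has not been applied to every headline yet.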

In [3]:
# Import geonamescache and retrieve the names of cities and countries through its API
import geonamescache
'''
get_continents()
get_countries()
get_us_states()
get_cities()
get_countries_by_names()
get_us_states_by_names()
get_cities_by_name(name)
get_us_counties()
'''
gc = geonamescache.GeonamesCache()

In [4]:
# Retrieve US state data and create a dataframe.
states = pd.DataFrame(gc.get_us_states_by_names()).T.reset_index(drop=True)
states = states.sort_values(by='name',ascending=False).reset_index(drop=True)
# states.info()
print(states.shape)
states.head()

(51, 4)


Unnamed: 0,code,name,fips,geonameid
0,WY,Wyoming,56,5843591
1,WI,Wisconsin,55,5279468
2,WV,West Virginia,54,4826850
3,WA,Washington,53,5815135
4,VA,Virginia,51,6254928


In [5]:
# Retrieve US county data and create a dataframe.
counties = pd.DataFrame(gc.get_us_counties()).T.reset_index(drop=True).T
counties.columns = ["code","name","state"]
counties = counties.sort_values(by="state",ascending=False).reset_index(drop=True)
# print(counties.shape)

# Suffixes to remove from the county names
s = ["County","Municipio","Island","Census Area", "City and Borough", "Borough","Parish"]
# Compile the pattern once instead of on every loop iteration
compiled_uscounty = re.compile('|'.join(s))

# Strip the suffix from every county name and store the result in a new column
counties["county"] = [compiled_uscounty.sub("",county) for county in counties.name]
counties = counties.sort_values(by="state")
print(counties.shape)
counties.head()

(3234, 4)


Unnamed: 0,code,name,state,county
3233,2070,Dillingham Census Area,AK,Dillingham
3205,2180,Nome Census Area,AK,Nome
3206,2290,Yukon-Koyukuk Census Area,AK,Yukon-Koyukuk
3207,2282,Yakutat City and Borough,AK,Yakutat
3208,2275,Wrangell City and Borough,AK,Wrangell
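
Removing a suffix anywhere in the name can leave stray whitespace or cut into names that contain a suffix word elsewhere. The following sketch shows one possible alternative: anchor the suffix at the end of the name and trim the result. It is an assumption rather than the notebook's method; `SUFFIXES` and `strip_county_suffix` are hypothetical names, and "Island" is deliberately left out of the list so that names such as "Kodiak Island" stay intact.

```python
import re

# Hypothetical suffix list; "Island" is omitted on purpose (see lead-in).
SUFFIXES = ["City and Borough", "Census Area", "Municipio", "Borough", "County", "Parish"]

# Match a suffix only at the very end of the name, with its leading whitespace.
suffix_regex = re.compile(
    r"\s+(?:" + "|".join(re.escape(s) for s in SUFFIXES) + r")$"
)


def strip_county_suffix(name):
    """Remove a trailing administrative suffix and surrounding whitespace."""
    return suffix_regex.sub("", name).strip()
```

With the end anchor, "Island County" becomes "Island" rather than an empty or whitespace-only string.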


In [6]:
# Retrieve city names data and create a dataframe.
cities = pd.DataFrame(gc.get_cities()).T.reset_index(drop=True)
cities = cities.sort_values(by='name').reset_index(drop=True)
cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24336 entries, 0 to 24335
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   geonameid    24336 non-null  object
 1   name         24336 non-null  object
 2   latitude     24336 non-null  object
 3   longitude    24336 non-null  object
 4   countrycode  24336 non-null  object
 5   population   24336 non-null  object
 6   timezone     24336 non-null  object
 7   admin1code   24336 non-null  object
dtypes: object(8)
memory usage: 1.5+ MB


In [7]:
# Transliterate accented city names to plain ASCII
cities['name'] = cities['name'].apply(unidecode.unidecode)
cities.sample(15)

Unnamed: 0,geonameid,name,latitude,longitude,countrycode,population,timezone,admin1code
19474,1852607,Shibata,38.0,139.0,JP,80793,Asia/Tokyo,29
763,2759794,Amsterdam,52.4,4.89,NL,741636,Europe/Amsterdam,07
20917,1683340,Tanauan,14.1,121.0,PH,68456,Asia/Manila,40
18173,2981206,Saint-Chamond,45.5,4.51,FR,38014,Europe/Paris,84
3117,2654579,Bromsgrove,52.3,-2.06,GB,49117,Europe/London,ENG
162,2522430,Adra,36.7,-3.02,ES,24373,Europe/Madrid,51
16792,5207069,Pottstown,40.2,-75.6,US,22664,America/New_York,PA
3050,2654789,Brent,51.6,-0.302,GB,329100,Europe/London,ENG
1738,1603235,Ban Huai Thalaeng,15.0,103.0,TH,15352,Asia/Bangkok,27
24014,6295548,Zurich (Kreis 7),47.4,8.58,CH,33820,Europe/Zurich,ZH


In [8]:
# Create dataframe of headlines
df = pd.DataFrame(file_contents.split('\n'),columns=['headline'])
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650 entries, 0 to 649
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   headline  650 non-null    object
dtypes: object(1)
memory usage: 5.2+ KB
None


Unnamed: 0,headline
0,Zika Outbreak Hits Miami
1,Could Zika Reach New York City?
2,First Case of Zika in Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil"
4,Dallas man comes down with case of Zika


In [9]:
# Retrieve country names data and create a dataframe.
countries = pd.DataFrame(gc.get_countries_by_names()).T.reset_index(drop=True)
countries = countries.sort_values(by='name',ascending=False).reset_index(drop=True)
# countries.info()
print(countries.shape)
countries.head()

(252, 17)


Unnamed: 0,geonameid,name,iso,iso3,isonumeric,fips,continentcode,capital,areakm2,population,tld,currencycode,currencyname,phone,postalcoderegex,languages,neighbours
0,878675,Zimbabwe,ZW,ZWE,716,ZI,AF,Harare,390580,13061000,.zw,ZWL,Dollar,263,,"en-ZW,sn,nr,nd","ZA,MZ,BW,ZM"
1,895949,Zambia,ZM,ZMB,894,ZA,AF,Lusaka,752614,13460305,.zm,ZMW,Kwacha,260,^(\d{5})$,"en-ZM,bem,loz,lun,lue,ny,toi","ZW,TZ,MZ,CD,NA,MW,AO"
2,69543,Yemen,YE,YEM,887,YM,AS,Sanaa,527970,23495361,.ye,YER,Rial,967,,ar-YE,"SA,OM"
3,2461445,Western Sahara,EH,ESH,732,WI,AF,El-Aaiun,266000,273008,.eh,MAD,Dirham,212,,"ar,mey","DZ,MR,MA"
4,4034749,Wallis and Futuna,WF,WLF,876,WF,OC,Mata Utu,274,16025,.wf,XPF,Franc,681,^(986\d{2})$,"wls,fud,fr-WF",


In [10]:
# Use regex to retrieve country names from the headline column and add them to the country column.
df['country'] = pd.Series(dtype=object)

country_regexs = []  

for country in countries.name:
    # Escape the name so characters such as '.' are matched literally
    r = '\\b'+re.escape(country)+'\\b'
    country_regexs.append(r)  

for regex in country_regexs:
    compiled_country = re.compile(regex)     
    for index,headline in enumerate(df.headline):
        match = compiled_country.search(headline) #flags=re.IGNORECASE can be used
        if match is not None:
            start, end = match.start(), match.end()
            matched_string = headline[start: end]
            print(index,matched_string, '<<<', headline)
            df.loc[index,'country'] = matched_string

25 Vietnam <<< Zika cases in Vietnam's Ho Chi Minh City surge
30 Thailand <<< Thailand Zika Virus in Bangkok
169 Singapore <<< Zika cases in Singapore reach 393
155 Panama <<< Outbreak of Zika in Panama City
576 Panama <<< Panama City’s first Zika related death
83 Mexico <<< Zika outbreak spreads to Mexico City
58 Malaysia <<< Zika surfaces in Klang, Malaysia
124 Malaysia <<< New Zika Case in Kota Kinabalu, Malaysia
127 Malaysia <<< Zika reaches Johor Bahru, Malaysia
131 Hong Kong <<< Norovirus Exposure in Hong Kong
59 Guatemala <<< Rumors about Meningitis spreading in Guatemala City have been refuted
3 Brazil <<< Mystery Virus Spreads in Recife, Brazil
44 Brazil <<< Zika outbreak in Piracicaba, Brazil
78 Brazil <<< Student sick in Campinas, Brazil
77 Belize <<< Belize City under threat from Zika


In [11]:
df[df.country.notnull()]

Unnamed: 0,headline,country
3,"Mystery Virus Spreads in Recife, Brazil",Brazil
25,Zika cases in Vietnam's Ho Chi Minh City surge,Vietnam
30,Thailand Zika Virus in Bangkok,Thailand
44,"Zika outbreak in Piracicaba, Brazil",Brazil
58,"Zika surfaces in Klang, Malaysia",Malaysia
59,Rumors about Meningitis spreading in Guatemala City have been refuted,Guatemala
77,Belize City under threat from Zika,Belize
78,"Student sick in Campinas, Brazil",Brazil
83,Zika outbreak spreads to Mexico City,Mexico
124,"New Zika Case in Kota Kinabalu, Malaysia",Malaysia
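
Several of the country matches above come from headlines where the country name is part of a city name, such as "Panama City" or "Mexico City". One possible heuristic, shown here as a sketch and not as the notebook's method, is a negative lookahead that rejects a country name followed by the word "City". The list `sample_countries` is a hypothetical subset, and the lookahead is an assumption that would need checking against the full data.

```python
import re

# Hypothetical subset; the notebook takes country names from geonamescache.
sample_countries = ["Brazil", "Malaysia", "Panama", "Mexico"]

# One combined pattern: whole-word country name, not followed by "City".
country_regex = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in sample_countries) + r")\b(?!\s+City\b)"
)


def find_country(headline):
    """Return the first country name in the headline, or None."""
    match = country_regex.search(headline)
    return match.group(0) if match else None
```

Compiling a single alternation also means each headline is scanned once, instead of once per country.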


In [12]:
city_regexs = []

for city in cities.name:
#     r1 = '\\b'+city+'\\b'
#     city_regexs.append(r1)
    r2 = city
    city_regexs.append(r2)
for state in states.name:    
    r3 = '\\b'+state+'\\b'
    city_regexs.append(r3)
for county in counties.county:
    r4 = '\\b'+county+'\\b'
    city_regexs.append(r4)
    
ind = []
city = []
hline = []

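# Drop names that match inside unrelated words (e.g. 'Viru' in 'Virus').
# A list comprehension avoids removing items from the list while iterating over it.
excluded = {'Mala','Bron','Viru','Pati','Rota','Will','Green','\\b'+'Will'+'\\b'}
city_regexs = [regex for regex in city_regexs if regex not in excluded]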

for regex in city_regexs:
    compiled_city = re.compile(regex)  
    
    for index,headline in enumerate(df.headline):
        match = compiled_city.search(headline)
        if match is not None:
            start, end = match.start(), match.end()
            matched_string = headline[start: end]
            if len(matched_string) > 3:
                ind.append(index)
                city.append(matched_string)
                hline.append(headline)
#                 print(index,matched_string, '<<<', headline)

# Create dataframe of matched results and sort values by headline_no
matched = {'headline_no': ind, 'headline': hline, 'city': city}
matched_cities = pd.DataFrame(matched)
matched_cities = matched_cities.sort_values(by='headline_no').reset_index(drop=True)

matched_cities.head(15)

Unnamed: 0,headline_no,headline,city
0,0,Zika Outbreak Hits Miami,Miami
1,1,Could Zika Reach New York City?,York
2,1,Could Zika Reach New York City?,York
3,1,Could Zika Reach New York City?,York
4,1,Could Zika Reach New York City?,York
5,1,Could Zika Reach New York City?,York
6,1,Could Zika Reach New York City?,New York
7,1,Could Zika Reach New York City?,York
8,1,Could Zika Reach New York City?,New York
9,1,Could Zika Reach New York City?,York


In [13]:
cities[cities.name == 'York']

Unnamed: 0,geonameid,name,latitude,longitude,countrycode,population,timezone,admin1code
23653,4562407,York,40,-76.7,US,43992,America/New_York,PA
23654,2633352,York,54,-1.08,GB,153717,Europe/London,ENG


**Note: The matched results show some interesting patterns.**

**For example**
* The headline "Could Zika Reach New York City?" produced multiple matches.
* "New York City" is the correct match because it covers the full place name.
* Shorter strings such as "New York" and "York" also matched, and "York" appears more than once.
* One reason for the repeated "York" is visible in the `cities` dataframe: it contains two different places named "York", one in the US and one in GB, and each entry produces its own match.
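
The repeated matches can be reproduced with a tiny example. This sketch uses a hypothetical list `names` that contains the same name several times, mimicking the two "York" entries in `cities`, and shows that removing duplicate names before matching removes the repeated results.

```python
import re

# Hypothetical name list with repeats, for illustration only
names = ["York", "York", "New York", "York", "New York City", "Miami"]

# dict.fromkeys drops repeats while preserving first-seen order
unique_names = list(dict.fromkeys(names))


def all_matches(headline, candidates):
    """Return every candidate name found as a whole word in the headline."""
    found = []
    for name in candidates:
        if re.search(r"\b" + re.escape(name) + r"\b", headline):
            found.append(name)
    return found


headline = "Could Zika Reach New York City?"
with_repeats = all_matches(headline, names)
without_repeats = all_matches(headline, unique_names)
```

Deduplicating the names does not remove the shorter partial matches ("York", "New York"); those still need a separate rule, such as keeping the longest match.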

In [14]:
# How many headlines had more than one matched result?
print('*********'*10)
print('**** Checked duplicates ****')
print('*********'*10)
print('\n')
print(str(len(matched_cities.headline.value_counts()[matched_cities.headline.value_counts()>1]))+'/650 headlines had matched result more than 1.')

# Drop duplicates of matched city names
matched_cities_uniq = matched_cities.drop_duplicates()

print('\n')
print('Note: There were still number of headlines which had matched result more than 1. However, among those unique matched city names, only the longest string of city is the correct matched')
print('\n')

print('*********'*10)
print('**** Dropped duplicates ****')
print('*********'*10)
print('\n')
print(str(len(matched_cities_uniq.headline.value_counts()[matched_cities_uniq.headline.value_counts()>1]))+'/650 headlines had matched result more than 1 after dropped duplicates.')
print('\n')
print(matched_cities_uniq.headline.value_counts()[matched_cities_uniq.headline.value_counts()>1].head(10))
print('\n')


# Make a list of headlines that still have more than one matched result.
redun_headlines = matched_cities_uniq.headline.value_counts()[matched_cities_uniq.headline.value_counts()>1].index.tolist()
# Keep only the longest matched city for each of those headlines
for hl in redun_headlines:
    cities_to_compare = matched_cities_uniq.city[matched_cities_uniq.headline == hl]
    length_str = [(len(city),index) for index,city in list(zip(cities_to_compare.index,cities_to_compare))]
    for length,index in length_str:
        if (length,index) != max(length_str):
            matched_cities_uniq.drop(index,axis=0,inplace=True)

print('\n')
print('*********'*10)
print('**** Filtered only the longest matched cities ****')
print('*********'*10)
print('\n')  
print(str(len(matched_cities_uniq.headline.value_counts()[matched_cities_uniq.headline.value_counts()>1]))+'/650 headlines had matched result more than 1 after filtered the longest.')
matched_cities_uniq = matched_cities_uniq.reset_index(drop=False).set_index('headline_no',drop=True)         
display(matched_cities_uniq.head())


******************************************************************************************
**** Checked duplicates ****
******************************************************************************************


308/650 headlines had matched result more than 1.


Note: There were still number of headlines which had matched result more than 1. However, among those unique matched city names, only the longest string of city is the correct matched


******************************************************************************************
**** Dropped duplicates ****
******************************************************************************************


196/650 headlines had matched result more than 1 after dropped duplicates.


Could Zika Reach New York City?                                                    5
More people in Saint Petersburg are infected with Varicella every year             5
Zika Reported in North Miami Beach                                                 5
Salt

Unnamed: 0_level_0,index,headline,city
headline_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,Zika Outbreak Hits Miami,Miami
1,10,Could Zika Reach New York City?,New York City
2,13,First Case of Zika in Miami Beach,Miami Beach
3,16,"Mystery Virus Spreads in Recife, Brazil",Recife
4,17,Dallas man comes down with case of Zika,Dallas
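
The "keep only the longest match" rule can also be expressed without a dataframe. The sketch below is a simplified stand-alone version under assumed inputs: `sample_matches` is a hypothetical list of `(headline_no, matched_string)` pairs, and `longest_match_per_headline` is an illustrative helper.

```python
def longest_match_per_headline(matches):
    """Map each headline number to its longest matched string.

    matches: iterable of (headline_no, matched_string) pairs.
    On a tie in length, the first match seen is kept.
    """
    best = {}
    for headline_no, name in matches:
        current = best.get(headline_no)
        if current is None or len(name) > len(current):
            best[headline_no] = name
    return best


# Hypothetical matches for three headlines
sample_matches = [
    (0, "Miami"),
    (1, "York"),
    (1, "New York"),
    (1, "New York City"),
    (2, "Miami"),
    (2, "Miami Beach"),
]
result = longest_match_per_headline(sample_matches)
```

One caveat of this rule is that a headline naming two genuinely different cities keeps only the longer name, so such headlines would need a separate check.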


In [15]:
# Add city column to df dataframe
df = df.join(matched_cities_uniq[['city']]) 

# Manually fill the nulls in the city column, keyed by the exact headline text.
# Assigning every name inside one loop would leave only the last value in each row.
manual_cities = {
    'Zika infects pregnant woman in Cebu': 'Cebu',
    'Spanish Flu Sighted in Antigua': 'Antigua',
    'Carnival under threat in Rio De Janeiro due to Zika outbreak': 'Rio De Janeiro',
    'Zika case reported in Oton': 'Oton',
    'Maka City Experiences Influenza Outbreak': 'Maka',
    'More Zika patients reported in Mcallen': 'Mcallen',
    'More people in Mclean are infected with Hepatitis A every year': 'Mclean',
    'Malaria Exposure in Sussex': 'Sussex',
    'Greenwich Establishes Zika Task Force': 'Greenwich',
    'Yulee takes a hit from Spreading Sickness': 'Yulee',
    'More people in Boucau are infected with HIV every year': 'Boucau',
    'Bronchitis Outbreak in Manhasset': 'Manhasset',
    'Zika Troubles come to Padre Las Casas': 'Padre Las Casas',
    'Outbreak of Zika in Destin': 'Destin',
    'Gympie Patient in Critical Condition after Contracting Chlamydia': 'Gympie',
    'Spike of Meningitis Cases in Druid Hills': 'Druid Hills',
    'More Patients in Magnolia are Getting Diagnosed with Malaria': 'Magnolia',
    'Rumors about Syphilis spreading in Penal have been refuted': 'Penal',
    'Spanish Flu Outbreak in Lisbon': 'Lisbon',
    'Spanish Flu Spreading through Madrid': 'Madrid',
    'Fort Belvoir tests new cure for Hepatitis C': 'Belvoir',
    'More people in Oak Brook are infected with Respiratory Syncytial Virus every year': 'Oak Brook',
    'Outbreak of Zika in Hutchins': 'Hutchins',
    'Longwood volunteers spreading Zika awareness': 'Longwood',
    'Zika symptoms spotted in Quixere': 'Quixere',
    'Measles Hits Davos': 'Davos',
    'Spike of Hepatitis E Cases in Morehead City': 'Morehead City',
    'Outbreak of Zika in Alvorada': 'Alvorada',
    'Zika arrives in Dangriga': 'Dangriga',
    'More Patients in Maynard are Getting Diagnosed with Syphilis': 'Maynard',
    'Zika case reported in Antioquia': 'Antioquia',
    'Chikungunya has not Left Pismo Beach': 'Pismo Beach',
    'Zika spreads to La Joya': 'La Joya',
}
missing = df.city.isnull()
df.loc[missing, 'city'] = df.loc[missing, 'headline'].str.strip().map(manual_cities)
# To list headlines that still lack a city: print(df[df.city.isnull()].headline)
df.info()   

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650 entries, 0 to 649
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   headline  650 non-null    object
 1   country   15 non-null     object
 2   city      650 non-null    object
dtypes: object(3)
memory usage: 15.4+ KB


In [18]:
df.head(50)

Unnamed: 0,headline,country,city
0,Zika Outbreak Hits Miami,,Miami
1,Could Zika Reach New York City?,,New York City
2,First Case of Zika in Miami Beach,,Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil",Brazil,Recife
4,Dallas man comes down with case of Zika,,Dallas
5,Trinidad confirms first Zika case,,Trinidad
6,Zika Concerns are Spreading in Houston,,Houston
7,Geneve Scientists Battle to Find Cure,,Geneve
8,The CDC in Atlanta is Growing Worried,,Atlanta
9,Zika Infested Monkeys in Sao Paulo,,Sao Paulo


In [17]:
# Save to a CSV file without the index column
df.to_csv('cities_in_headline.csv',index=False) 