# Discovering Disease Outbreaks from News Headlines

Identifying and mapping epidemics is crucial to prevent or respond to deadly disease outbreaks. Your first assignment for the WHO is as follows:

- Extract the locations (city and/or country name) from each news headline.
- Find the geographic coordinates of each headline using the city/country.
- Cluster (group) the headlines based on the geographic location.
- Visualize the clusters on a map and analyze them for patterns indicating an epidemic.
- Investigate the largest clusters for signs of disease outbreaks.
- Review headlines in the largest clusters within the United States and around the world. If any disease outbreak is particularly dominant, visualize all worldwide mentions of that disease.
- Provide a summary of your findings to your superiors at the WHO so they can direct resources.

## 1. Parsing the News Headlines

**Objective**

Find any city and/or country names mentioned in each of the news headlines.

**Workflow**

1. Load in the headline data and examine it for any data quality issues.
   * Use any library/data structure to read in the headlines.
   * Read through some of the headlines and identify potential problems.
1. Using regular expressions and the cities and countries within the geonamescache library, match any cities/countries within each headline.
   * Make sure to normalize headlines and city/country names by removing accent marks. This can be done with the unidecode library.
   * Watch out for multiple cities in a headline and matches on short words! We want the match to be on the entire city name (for example, San Marino) and not on a partial match (San).
1. Put the extracted data into a pandas DataFrame with three columns: headline, city, country.
1. Make sure there were no issues with the extraction by sampling some of the headlines and examining the city and country names.
   * One method for finding problems is to look for the most common names and see if there are any issues.
1. Once you are confident you’ve found all the cities/countries in each headline, save the DataFrame for the next part.
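
The normalization and whole-word matching steps in this workflow can be sketched as follows. This is a minimal illustration rather than the notebook's implementation: it uses the standard library's `unicodedata` as a stand-in for `unidecode`, the helper names `strip_accents` and `find_place` are hypothetical, and the "Yorkshire" and "San Marino" headlines are made up for the example.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining accent marks (a stand-in for unidecode.unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_place(name, headline):
    """Return the matched text if `name` occurs as a whole word, else None."""
    pattern = re.compile(r"\b" + re.escape(strip_accents(name)) + r"\b")
    match = pattern.search(strip_accents(headline))
    return match.group() if match else None


# Accents are removed from both sides before matching
print(find_place("São Paulo", "Zika Infested Monkeys in Sao Paulo"))
# \b prevents a match inside a longer word
print(find_place("York", "Flu season hits Yorkshire"))
# \b alone does not prevent a partial match on a multi-word name
print(find_place("San", "Zika arrives in San Marino"))
```

The last call still returns `San`, which is why a word boundary on its own is not enough: when several names match the same headline, the longest one has to be preferred.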

**Importance to project**

* We can’t do much with just the headlines; although they contain the city/country names, they do not contain the geographic information—latitude and longitude—we need to find clusters of disease outbreaks. The first step in getting the geographic information is to isolate the names.

* Later, we will use the names to find the location of each headline, which requires bringing in external data (through geonamescache).

* This workflow is common in data science. First, we separate the useful information from the noise—data mining—and then we augment it with external data—data engineering.

### Import all relevant libraries

In [81]:
# Regular expression
import re

# Data analysis and wrangling
import numpy as np
import pandas as pd
# Set display options for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth', None)
# Set the float display precision (the number of places after the decimal point)
pd.set_option('display.precision', 3)

# Normalized unicode data (to remove accents)
import unidecode
 
## Visualization
# matplotlib
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns

# Ignore warning
import warnings
warnings.filterwarnings('ignore')

In [82]:
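# Read the headlines, replace hyphens with spaces, and transliterate to ASCII with unidecode
file_contents = ""
with open("headlines.txt", "r") as file_handle:
    for line in file_handle:
        # Keep the transliterated text (the return value of unidecode must be used)
        file_contents += unidecode.unidecode(line.replace("-", " "))
print(file_contents)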

Zika Outbreak Hits Miami
Could Zika Reach New York City?
First Case of Zika in Miami Beach
Mystery Virus Spreads in Recife, Brazil
Dallas man comes down with case of Zika
Trinidad confirms first Zika case
Zika Concerns are Spreading in Houston
Geneve Scientists Battle to Find Cure
The CDC in Atlanta is Growing Worried
Zika Infested Monkeys in Sao Paulo
Brownsville teen contracts Zika virus
Mosquito control efforts in St. Louis take new tactics with Zika threat
San Juan reports 1st U.S. Zika related death amid outbreak
Flu outbreak in Galveston, Texas
Zika alert Ã¢â‚¬â€œ Manila now threatened
Zika afflicts 7 in Iloilo City
New Los Angeles Hairstyle goes Viral
Louisiana Zika cases up to 26
Orlando volunteers aid Zika research
Zika infects pregnant woman in Cebu
Chicago's First Zika Case Confirmed
Tampa Bay Area Zika Case Count Climbs
Bad Water Leads to Sickness in Flint, Michigan
Baltimore plans for Zika virus
London Health Unit Tracks Mad Cow Disease
Zika cases in Vietnam's Ho Chi Minh 
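
One data quality issue in the printed headlines is the run of odd characters in the Manila headline, which appears to be a dash that was encoded as UTF-8 and then mis-decoded as Windows-1252 more than once. The sketch below shows one possible repair using only the standard library; `repair_mojibake` is a hypothetical helper, and the assumption that the damage came from cp1252 mis-decoding is a guess based on the character pattern. If used, it would need to run before any transliteration step.

```python
def repair_mojibake(text, max_rounds=3):
    """Undo UTF-8 text that was mis-decoded as Windows-1252, possibly repeatedly."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("cp1252").decode("utf-8")
        except UnicodeError:
            break  # the text is not (or is no longer) mojibake
        if repaired == text:
            break  # plain ASCII survives the round trip unchanged
        text = repaired
    return text


# Build a doubly mis-decoded en dash to imitate the damage (assumed cause)
damaged = "\u2013".encode("utf-8").decode("cp1252").encode("utf-8").decode("cp1252")
headline = "Zika alert " + damaged + " Manila now threatened"

print(repair_mojibake(headline))
print(repair_mojibake("Zika Outbreak Hits Miami"))  # clean text is left alone
```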

In [83]:
# Create dataframe of headlines
df = pd.DataFrame(file_contents.split('\n'),columns=['headline'])
print(df.info())
print(df.headline.value_counts())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650 entries, 0 to 649
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   headline  650 non-null    object
dtypes: object(1)
memory usage: 5.2+ KB
None
Spanish Flu Outbreak in Lisbon                                                             2
Spanish Flu Spreading through Madrid                                                       2
Flu outbreak in Galveston, Texas                                                           1
Phnom Penh hit by Zika Threat                                                              1
Student sick in Campinas, Brazil                                                           1
Zika outbreak spreads to Mexico City                                                       1
Outbreak of Zika in Kozhikode                                                              1
Zika seminars in Yuma County                                                          

Unnamed: 0,headline
0,Zika Outbreak Hits Miami
1,Could Zika Reach New York City?
2,First Case of Zika in Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil"
4,Dallas man comes down with case of Zika


In [84]:
# Import geonamescache and retrieve the names of all cities and countries through its API
import geonamescache
'''
get_continents()
get_countries()
get_us_states()
get_cities()
get_countries_by_names()
get_us_states_by_names()
get_cities_by_name(name)
get_us_counties()
'''
gc = geonamescache.GeonamesCache()

In [85]:
# Retrieve the country data and create a DataFrame.
countries = pd.DataFrame(gc.get_countries_by_names()).T.reset_index(drop=True)
countries = countries.sort_values(by='name',ascending=False).reset_index(drop=True)
countries.info()
countries.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   geonameid        252 non-null    object
 1   name             252 non-null    object
 2   iso              252 non-null    object
 3   iso3             252 non-null    object
 4   isonumeric       252 non-null    object
 5   fips             252 non-null    object
 6   continentcode    252 non-null    object
 7   capital          252 non-null    object
 8   areakm2          252 non-null    object
 9   population       252 non-null    object
 10  tld              252 non-null    object
 11  currencycode     252 non-null    object
 12  currencyname     252 non-null    object
 13  phone            252 non-null    object
 14  postalcoderegex  252 non-null    object
 15  languages        252 non-null    object
 16  neighbours       252 non-null    object
dtypes: object(17)
memory usage: 33.6+ K

Unnamed: 0,geonameid,name,iso,iso3,isonumeric,fips,continentcode,capital,areakm2,population,tld,currencycode,currencyname,phone,postalcoderegex,languages,neighbours
0,878675,Zimbabwe,ZW,ZWE,716,ZI,AF,Harare,390580,13061000,.zw,ZWL,Dollar,263,,"en-ZW,sn,nr,nd","ZA,MZ,BW,ZM"
1,895949,Zambia,ZM,ZMB,894,ZA,AF,Lusaka,752614,13460305,.zm,ZMW,Kwacha,260,^(\d{5})$,"en-ZM,bem,loz,lun,lue,ny,toi","ZW,TZ,MZ,CD,NA,MW,AO"
2,69543,Yemen,YE,YEM,887,YM,AS,Sanaa,527970,23495361,.ye,YER,Rial,967,,ar-YE,"SA,OM"
3,2461445,Western Sahara,EH,ESH,732,WI,AF,El-Aaiun,266000,273008,.eh,MAD,Dirham,212,,"ar,mey","DZ,MR,MA"
4,4034749,Wallis and Futuna,WF,WLF,876,WF,OC,Mata Utu,274,16025,.wf,XPF,Franc,681,^(986\d{2})$,"wls,fud,fr-WF",


In [86]:
# Retrieve the city data and create a DataFrame.
cities = pd.DataFrame(gc.get_cities()).T.reset_index(drop=True)
cities = cities.sort_values(by='name').reset_index(drop=True)
cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24336 entries, 0 to 24335
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   geonameid    24336 non-null  object
 1   name         24336 non-null  object
 2   latitude     24336 non-null  object
 3   longitude    24336 non-null  object
 4   countrycode  24336 non-null  object
 5   population   24336 non-null  object
 6   timezone     24336 non-null  object
 7   admin1code   24336 non-null  object
dtypes: object(8)
memory usage: 1.5+ MB


In [87]:
cities.sample(30)

Unnamed: 0,geonameid,name,latitude,longitude,countrycode,population,timezone,admin1code
3244,2346615,Buguma,4.74,6.86,NG,135404,Africa/Lagos,50
13362,186827,Meru,0.0463,37.7,KE,47226,Africa/Nairobi,35
20279,10294260,Stella,40.9,14.3,IT,30483,Europe/Rome,04
2187,2473826,Bekalta,35.6,11.0,TN,15937,Africa/Tunis,16
12875,978895,Margate,-30.9,30.4,ZA,34407,Africa/Johannesburg,02
23965,2509305,Zubia,37.1,-3.58,ES,17803,Europe/Madrid,51
8172,4272782,Hays,38.9,-99.3,US,21092,America/Chicago,KS
10494,2700839,Kinna,57.5,12.7,SE,15019,Europe/Stockholm,28
5405,217389,Demba,-5.5,22.3,CD,22263,Africa/Lubumbashi,23
15619,3353811,Otjiwarongo,-20.5,16.6,,21224,Africa/Windhoek,39


In [88]:
# Transliterate accented city names to plain ASCII characters
cities['name'] = cities['name'].map(unidecode.unidecode)
cities['name']

0                                                   'Ali Sabieh
1                                                's-Gravenzande
2                                              's-Hertogenbosch
3                                                      A Coruna
4                                                     A Estrada
5                                                      Aabenraa
6                                                        Aachen
7                                                       Aalborg
8                                                         Aalen
9                                                      Aalsmeer
10                                                        Aalst
11                                                       Aalten
12                                                       Aalter
13                                                        Aarau
14                                                     Aarschot
15                                      

In [89]:
# Use regex to retrieve country names from the headline column and add them to a country column.
df['country'] = pd.Series(dtype='object')

# One whole-word pattern per country; re.escape protects any regex metacharacters in a name
country_regexs = [r'\b' + re.escape(country) + r'\b' for country in countries.name]

for regex in country_regexs:
    compiled_country = re.compile(regex)  # pass flags=re.IGNORECASE here for case-insensitive matching
    for index, headline in enumerate(df.headline):
        match = compiled_country.search(headline)
        if match is not None:
            matched_string = match.group()
            print(index, matched_string, '<<<', headline)
            df.loc[index, 'country'] = matched_string

25 Vietnam <<< Zika cases in Vietnam's Ho Chi Minh City surge
30 Thailand <<< Thailand Zika Virus in Bangkok
169 Singapore <<< Zika cases in Singapore reach 393
155 Panama <<< Outbreak of Zika in Panama City
576 Panama <<< Panama CityÃ¢â‚¬â„¢s first Zika related death
83 Mexico <<< Zika outbreak spreads to Mexico City
58 Malaysia <<< Zika surfaces in Klang, Malaysia
124 Malaysia <<< New Zika Case in Kota Kinabalu, Malaysia
127 Malaysia <<< Zika reaches Johor Bahru, Malaysia
131 Hong Kong <<< Norovirus Exposure in Hong Kong
59 Guatemala <<< Rumors about Meningitis spreading in Guatemala City have been refuted
3 Brazil <<< Mystery Virus Spreads in Recife, Brazil
44 Brazil <<< Zika outbreak in Piracicaba, Brazil
78 Brazil <<< Student sick in Campinas, Brazil
77 Belize <<< Belize City under threat from Zika
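
Several of the country matches above (Panama, Mexico, Guatemala, Belize) come from headlines that name a city such as "Panama City" rather than the country itself. One possible sanity check, once a city has also been extracted for the same headline, is to flag country matches that are just part of a longer city name. This is only a sketch: `country_inside_city` is a hypothetical helper and is not part of the notebook.

```python
import re


def country_inside_city(country, city):
    """True when the country name is only a whole-word part of a longer city name."""
    if city is None or city == country:
        return False
    return re.search(r"\b" + re.escape(country) + r"\b", city) is not None


print(country_inside_city("Panama", "Panama City"))    # True: likely a false positive
print(country_inside_city("Brazil", "Recife"))         # False: country named separately
print(country_inside_city("Singapore", "Singapore"))   # False: city and country coincide
```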


In [90]:
df.head()

Unnamed: 0,headline,country
0,Zika Outbreak Hits Miami,
1,Could Zika Reach New York City?,
2,First Case of Zika in Miami Beach,
3,"Mystery Virus Spreads in Recife, Brazil",Brazil
4,Dallas man comes down with case of Zika,


In [91]:
# Build one whole-word regex per city name; a name shared by several cities appears more than once
city_regexs = [r'\b' + re.escape(name) + r'\b' for name in cities.name]

ind = []    # headline row numbers
city = []   # matched city names
hline = []  # headline text

for regex in city_regexs:
    compiled_city = re.compile(regex)

    for index, headline in enumerate(df.headline):
        match = compiled_city.search(headline)
        if match is not None:
            matched_string = match.group()
            # Skip very short names (3 characters or fewer), which tend to match ordinary words
            if len(matched_string) > 3:
                ind.append(index)
                city.append(matched_string)
                hline.append(headline)
                print(index, matched_string, '<<<', headline)

294 Abilene <<< Zika case reported in Abilene
142 Abuja <<< Authorities are Worried about the Spread of Tuberculosis in Abuja
527 Addis Ababa <<< Will Rotavirus vaccine help Addis Ababa?
441 Akron <<< Authorities are Worried about the Spread of Hepatitis D in Akron
616 Alameda <<< Alameda Residents Recieve Rabies vaccine
223 Albany <<< Rumors about Hepatitis D spreading in Albany have been refuted
269 Albany <<< Albany Residents Recieve Respiratory Syncytial Virus vaccine
223 Albany <<< Rumors about Hepatitis D spreading in Albany have been refuted
269 Albany <<< Albany Residents Recieve Respiratory Syncytial Virus vaccine
223 Albany <<< Rumors about Hepatitis D spreading in Albany have been refuted
269 Albany <<< Albany Residents Recieve Respiratory Syncytial Virus vaccine
223 Albany <<< Rumors about Hepatitis D spreading in Albany have been refuted
269 Albany <<< Albany Residents Recieve Respiratory Syncytial Virus vaccine
223 Albany <<< Rumors about Hepatitis D spreading in Albany h

375 Bristol <<< Will Mad Cow Vaccine Help Bristol?
375 Bristol <<< Will Mad Cow Vaccine Help Bristol?
375 Bristol <<< Will Mad Cow Vaccine Help Bristol?
375 Bristol <<< Will Mad Cow Vaccine Help Bristol?
375 Bristol <<< Will Mad Cow Vaccine Help Bristol?
157 Brooklyn <<< Measles has not Left Brooklyn
10 Brownsville <<< Brownsville teen contracts Zika virus
10 Brownsville <<< Brownsville teen contracts Zika virus
10 Brownsville <<< Brownsville teen contracts Zika virus
413 Brussels <<< Mad Cow Disease Disastrous to Brussels
106 Buenos Aires <<< Authorities are Worried about the Spread of Norovirus in Buenos Aires
300 Bullhead City <<< How to Avoid Pneumonia in Bullhead City
372 Bundaberg <<< Rumors about Bronchitis spreading in Bundaberg have been refuted
454 Caguas <<< Tuberculosis Symptoms Spread all over Caguas
102 Calamba <<< Zika afflicts patient in Calamba
244 Calgary <<< Case of Hepatitis A Reported in Calgary
368 Calumpang <<< More Zika patients reported in Calumpang
287 Camacar

571 Fayetteville <<< Fayetteville authorities confirmed the spread of HIV
571 Fayetteville <<< Fayetteville authorities confirmed the spread of HIV
571 Fayetteville <<< Fayetteville authorities confirmed the spread of HIV
22 Flint <<< Bad Water Leads to Sickness in Flint, Michigan
22 Flint <<< Bad Water Leads to Sickness in Flint, Michigan
149 Florence <<< Hepatitis B has not Left Florence
149 Florence <<< Hepatitis B has not Left Florence
149 Florence <<< Hepatitis B has not Left Florence
149 Florence <<< Hepatitis B has not Left Florence
149 Florence <<< Hepatitis B has not Left Florence
98 Florida <<< Zika Patient in Seminole, Florida
239 Florida <<< Zika Mosquitoes May Have Bred in Bromeliads, Florida Officials Say
98 Florida <<< Zika Patient in Seminole, Florida
239 Florida <<< Zika Mosquitoes May Have Bred in Bromeliads, Florida Officials Say
98 Florida <<< Zika Patient in Seminole, Florida
239 Florida <<< Zika Mosquitoes May Have Bred in Bromeliads, Florida Officials Say
240 Fon

91 Kampala <<< Ebola outbreak in Kampala
592 Kamphaeng Phet <<< Zika spreads to Kamphaeng Phet
289 Kampong Cham <<< Zika Troubles come to Kampong Cham
615 Kampong Speu <<< More Zika patients reported in Kampong Speu
185 Kansas City <<< Hepatitis B Comes to Kansas City
185 Kansas City <<< Hepatitis B Comes to Kansas City
421 Kathmandu <<< Dengue Exposure in Kathmandu
603 Kensington <<< More Patients in Kensington are Getting Diagnosed with Varicella
603 Kensington <<< More Patients in Kensington are Getting Diagnosed with Varicella
396 Kent <<< Kent is infested with Rabies
396 Kent <<< Kent is infested with Rabies
52 Key West <<< Zika symtomps spotted in Key West
435 Khartoum <<< Rumors about Rabies Spreading in Khartoum have been Refuted
163 Kingston <<< Authorities are Worried about the Spread of Dengue in Kingston
500 Kingston <<< Herpes Symptoms Spread all over New Kingston
163 Kingston <<< Authorities are Worried about the Spread of Dengue in Kingston
500 Kingston <<< Herpes Sympto

217 Mobile <<< Mobile authorities confirmed the spread of Bronchitis
612 Moline <<< Will Gonorrhea vaccine help East Moline?
130 Molo <<< Molo Cholera Spread Causing Concern
355 Monroe <<< Lower Hospitalization in Monroe after Hepatitis D Vaccine becomes Mandatory
458 Monroe <<< Spike of Syphilis Cases in West Monroe
542 Monroe <<< West Nile Virus Hits Monroe
607 Monroe <<< The Spread of Respiratory Syncytial Virus in Monroe has been Confirmed
355 Monroe <<< Lower Hospitalization in Monroe after Hepatitis D Vaccine becomes Mandatory
458 Monroe <<< Spike of Syphilis Cases in West Monroe
542 Monroe <<< West Nile Virus Hits Monroe
607 Monroe <<< The Spread of Respiratory Syncytial Virus in Monroe has been Confirmed
355 Monroe <<< Lower Hospitalization in Monroe after Hepatitis D Vaccine becomes Mandatory
458 Monroe <<< Spike of Syphilis Cases in West Monroe
542 Monroe <<< West Nile Virus Hits Monroe
607 Monroe <<< The Spread of Respiratory Syncytial Virus in Monroe has been Confirmed
355 

68 Quezon <<< More Quezon City Zika Transmissions
438 Quezon <<< Zika arrives in Quezon
68 Quezon <<< More Quezon City Zika Transmissions
438 Quezon <<< Zika arrives in Quezon
68 Quezon City <<< More Quezon City Zika Transmissions
369 Quisqueya <<< Zika symptoms spotted in Quisqueya
477 Quito <<< Zika symptoms spotted in Quito
82 Quebec <<< Hepatitis B Vaccine is now Required in Quebec
389 Racine <<< West Nile Virus Exposure in Racine
173 Raleigh <<< Will Norovirus vaccine help Raleigh?
3 Recife <<< Mystery Virus Spreads in Recife, Brazil
613 Redlands <<< New medicine wipes out Influenza in Redlands
492 Redmond <<< Rumors about Chlamydia spreading in Redmond have been refuted
492 Redmond <<< Rumors about Chlamydia spreading in Redmond have been refuted
409 Reno <<< The Spread of Gonorrhea in Reno has been Confirmed
534 Reynosa <<< Zika case reported in Reynosa
459 Ribeirao <<< More Zika patients reported in Ribeirao Preto
459 Ribeirao Preto <<< More Zika patients reported in Ribeirao P

61 Sarasota <<< New Zika Case Confirmed in Sarasota County
111 Savannah <<< Authorities are Worried about the Spread of Influenza in Savannah
321 Schenectady <<< Malaria Vaccine is now Required in Schenectady
412 Scranton <<< Scranton authorities confirmed the spread of Gonorrhea
54 Seattle <<< Seattle scientists get $500,000 grant to pursue Zika vaccine New 7:50 pm
98 Seminole <<< Zika Patient in Seminole, Florida
132 Sengkang <<< 19 new Zika Cases in Sengkang
81 Seoul <<< Seoul confirms 14th Zika infection
357 Sevierville <<< Spike of Rhinovirus Cases in Sevierville
639 Sevilla <<< New medicine wipes out Meningitis in Sevilla
639 Sevilla <<< New medicine wipes out Meningitis in Sevilla
66 Shenzhen <<< Schools in Shenzhen Closed Due to Malaria Outbreak
175 Shreveport <<< How to Avoid Gonorrhea in Shreveport
437 Sibu <<< Zika symptoms spotted in Sibu
74 Silver Spring <<< Authorities are Worried about the Spread of Bronchitis in Silver Spring
334 Simpsonville <<< Chickenpox Hits Simpson

468 Winter Park <<< Zika spreads to Winter Park
447 Wisconsin Rapids <<< Wisconsin Rapids Patient in Critical Condition after Contracting Chickenpox
514 Wuhan <<< Wuhan is infested with Varicella
489 Yakima <<< The Spread of Dengue in Yakima has been Confirmed
34 Yangon <<< First Zika case confirmed in Yangon
272 Yaounde <<< Schools in Yaounde Closed Due to Mumps Outbreak
323 Yerevan <<< West Nile Virus Symptoms Spread all over Yerevan
635 Yiwu <<< Authorities are Worried about the Spread of Hepatitis B in Yiwu
281 Yogyakarta <<< West Nile Virus Hits Yogyakarta
501 Yokohama <<< More people in Yokohama are infected with Norovirus every year
1 York <<< Could Zika Reach New York City?
1 York <<< Could Zika Reach New York City?
579 Yuma <<< Zika seminars in Yuma County
636 Yurimaguas <<< Zika Outbreak in Yurimaguas
479 Zamboanga <<< Sulu, Zamboanga brace for Zika
497 Zanzibar <<< The Spread of Malaria in Zanzibar has been Confirmed
498 Zanzibar <<< Malaria Outbreak Hits Zanzibar's Tourist 

In [92]:
# Create dataframe of matched results and sort values by headline_no
matched = {'headline_no': ind, 'headline': hline, 'city': city}
matched_cities = pd.DataFrame(matched)
matched_cities = matched_cities.sort_values(by='headline_no').reset_index(drop=True)

matched_cities

Unnamed: 0,headline_no,headline,city
0,0,Zika Outbreak Hits Miami,Miami
1,1,Could Zika Reach New York City?,York
2,1,Could Zika Reach New York City?,York
3,1,Could Zika Reach New York City?,New York City
4,2,First Case of Zika in Miami Beach,Miami Beach
5,2,First Case of Zika in Miami Beach,Miami
6,3,"Mystery Virus Spreads in Recife, Brazil",Recife
7,4,Dallas man comes down with case of Zika,Dallas
8,4,Dallas man comes down with case of Zika,Dallas
9,5,Trinidad confirms first Zika case,Trinidad


In [93]:
cities[cities.name == 'York']

Unnamed: 0,geonameid,name,latitude,longitude,countrycode,population,timezone,admin1code
23653,4562407,York,40,-76.7,US,43992,America/New_York,PA
23654,2633352,York,54,-1.08,GB,153717,Europe/London,ENG


**Note: The matched results show some interesting patterns.**

**For example**
* The headline "Could Zika Reach New York City?" produced **three** matches.
* One match was "New York City", which is correct.
* The other two were both "York". Each is a valid whole-word match, but why does "York" appear twice?
* The `cities` DataFrame contains two different places named "York", one in the US and one in Great Britain (country code GB), so the name is matched once per entry.
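
Both effects noted here (repeated names and shorter names matching inside longer ones) can be avoided at match time. The sketch below, which is an alternative and not the approach taken in this notebook, deduplicates the names and joins them into a single alternation sorted longest-first, so that a longer name is tried before any name it contains. The toy `city_names` list and the helper `first_city` are hypothetical.

```python
import re

# Toy subset of city names, including a duplicate and overlapping names
city_names = ["York", "York", "New York City", "Miami", "Miami Beach"]

# Deduplicate, then sort longest-first so longer names are tried before their substrings
unique_names = sorted(set(city_names), key=len, reverse=True)
city_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, unique_names)) + r")\b")


def first_city(headline):
    """Return the first whole-word city match in the headline, or None."""
    match = city_pattern.search(headline)
    return match.group() if match else None


print(first_city("Could Zika Reach New York City?"))    # New York City
print(first_city("First Case of Zika in Miami Beach"))  # Miami Beach
print(first_city("Zika Outbreak Hits Miami"))           # Miami
```

A single compiled pattern also means each headline is scanned once instead of once per city name.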

In [94]:
# How many headlines had more than one matched result?
print('**** Before dropped duplicates ****')
print('\n')
print(str(len(matched_cities.headline.value_counts()[matched_cities.headline.value_counts()>1]))+'/650 headlines had matched result more than 1.')
print('\n')
print(matched_cities.headline.value_counts()[matched_cities.headline.value_counts()>1].head(10))
print('\n')

# Drop duplicates of matched city names
matched_cities_uniq = matched_cities.drop_duplicates()
print('\n')
print(matched_cities_uniq.head())
print('\n')
print('Note: There were still number of headlines which had matched result more than 1. However, among those unique matched city names, only the longest string of city is the correct matched')
print('\n')
print('*********'*10)
print('\n')
print('**** After dropped duplicates ****')
print('\n')
print(str(len(matched_cities_uniq.headline.value_counts()[matched_cities_uniq.headline.value_counts()>1]))+'/650 headlines had matched result more than 1 after dropped duplicates.')
print('\n')
print(matched_cities_uniq.headline.value_counts()[matched_cities_uniq.headline.value_counts()>1].head(10))
print('\n')
print(matched_cities_uniq.head())
print('\n')

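# Work on an explicit copy so that dropping rows does not act on a view of matched_cities
matched_cities_uniq = matched_cities_uniq.copy()

# Make a list of headlines that still have more than one matched city name
counts = matched_cities_uniq.headline.value_counts()
redun_headlines = counts[counts > 1].index.tolist()

# For each such headline, keep only the longest match (e.g. "New York City" over "York")
for hl in redun_headlines:
    cities_to_compare = matched_cities_uniq.city[matched_cities_uniq.headline == hl]
    # (length, row index) pairs; max() picks the longest name, ties go to the larger row index
    length_str = [(len(name), index) for index, name in cities_to_compare.items()]
    longest = max(length_str)
    to_drop = [index for length, index in length_str if (length, index) != longest]
    matched_cities_uniq = matched_cities_uniq.drop(to_drop)

matched_cities_uniq = matched_cities_uniq.reset_index(drop=False).set_index('headline_no', drop=True)
# Sanity check: no headline should have more than one match left
counts = matched_cities_uniq.headline.value_counts()
counts[counts > 1]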

**** Before dropped duplicates ****


219/650 headlines had matched result more than 1.


Spike of Pneumonia Cases in Springfield                                            8
Authorities are Worried about the Spread of Chickenpox in Richmond                 7
Lower Hospitalization in Richmond after Mumps Vaccine becomes Mandatory            7
Will Hepatitis B vaccine help La Paz?                                              6
Zika Virus Reaches San Francisco                                                   6
Zika spreads to San Luis Potosi                                                    6
Rumors about Hepatitis D Spreading in San Juan Capistrano have been Refuted        6
San Juan reports 1st U.S. Zika related death amid outbreak                         5
Herpes Symptoms Spread all over New Kingston                                       5
Madison lab developing vaccine against Zika virus [The Wisconsin State Journal]    5
Name: headline, dtype: int64




   headline_no             

Series([], Name: headline, dtype: int64)
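
The "keep the longest match" rule can also be written without pandas. The sketch below groups by `headline_no` instead of by headline text, which is an assumption that differs from the cell above: it keeps two rows with identical headline text (such as the repeated "Spanish Flu Outbreak in Lisbon") separate. The `matches` list is toy data mirroring the first rows of `matched_cities`.

```python
# Toy (headline_no, city) matches, mirroring the first rows of matched_cities
matches = [(0, "Miami"), (1, "York"), (1, "York"), (1, "New York City"),
           (2, "Miami Beach"), (2, "Miami")]

# Keep, for every headline number, the longest city name seen so far
longest = {}
for headline_no, name in matches:
    if headline_no not in longest or len(name) > len(longest[headline_no]):
        longest[headline_no] = name

print(longest)  # {0: 'Miami', 1: 'New York City', 2: 'Miami Beach'}
```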

In [95]:
matched_cities_uniq    

headline_no,index,headline,city
0,0,Zika Outbreak Hits Miami,Miami
1,3,Could Zika Reach New York City?,New York City
2,4,First Case of Zika in Miami Beach,Miami Beach
3,6,"Mystery Virus Spreads in Recife, Brazil",Recife
4,7,Dallas man comes down with case of Zika,Dallas
5,9,Trinidad confirms first Zika case,Trinidad
6,12,Zika Concerns are Spreading in Houston,Houston
7,13,Geneve Scientists Battle to Find Cure,Geneve
8,14,The CDC in Atlanta is Growing Worried,Atlanta
9,15,Zika Infested Monkeys in Sao Paulo,Sao Paulo


In [96]:
df = df.join(matched_cities_uniq[['city']]) 
df.info()   

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650 entries, 0 to 649
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   headline  650 non-null    object
 1   country   15 non-null     object
 2   city      606 non-null    object
dtypes: object(3)
memory usage: 15.4+ KB


In [97]:
df

Unnamed: 0,headline,country,city
0,Zika Outbreak Hits Miami,,Miami
1,Could Zika Reach New York City?,,New York City
2,First Case of Zika in Miami Beach,,Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil",Brazil,Recife
4,Dallas man comes down with case of Zika,,Dallas
5,Trinidad confirms first Zika case,,Trinidad
6,Zika Concerns are Spreading in Houston,,Houston
7,Geneve Scientists Battle to Find Cure,,Geneve
8,The CDC in Atlanta is Growing Worried,,Atlanta
9,Zika Infested Monkeys in Sao Paulo,,Sao Paulo
