# Parsing the News Headlines

Objective

 - Find any city and/or country names mentioned in each of the news headlines.

Project outline
The project is broken into five parts.

1. Extracting City and Country Information from News Headlines

2. Finding Geographic Locations of Headlines

3. Clustering Headlines Based on Location

4. Identifying Disease Outbreaks

5. Presenting the Disease Outbreak Data

The skills covered, in order, are text extraction, data manipulation, clustering, interpreting algorithm outputs, and producing an actionable report. The deliverable from each part is a Jupyter notebook (preferably on GitHub) documenting your workflow and results. The final milestone is a notebook converted to HTML or PDF to share the conclusions with your superiors at the WHO. Each section builds on the previous one and tests your skills in a different area of data science. As you go through the project, keep in mind the overall objective: to identify disease outbreaks around the world. The project is representative of the problems data scientists solve in industry and academia, and it uses the most popular Python tools for data science.

In [1]:
import pandas as pd
import numpy as np

In [2]:
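# Read one headline per line; recent pandas versions reject '\n' as a read_csv delimiter.
with open('discovering-disease-outbreaks-base/data/headlines.txt') as f:
    headlines = pd.DataFrame({'Headlines': [line.strip() for line in f if line.strip()]})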

In [11]:
headlines.sample(20)

| | Headlines |
|---|---|
| 562 | Hepatitis D Symptoms Spread all over Evansville |
| 644 | Influenza Exposure in Muscat |
| 403 | Huzhou Residents Receive Hepatitis B vaccine |
| 40 | Hepatitis E re-emerges in Santa Rosa |
| 86 | Visitor to Cucuta contracts Zika |
| 83 | Zika outbreak spreads to Mexico City |
| 434 | Case of Mad Cow Disease Reported in Hilden |
| 169 | Zika cases in Singapore reach 393 |
| 94 | New Zika Case Confirmed in Belo Horizonte |
| 603 | More Patients in Kensington are Getting Diagno... |


In [4]:
print("Number of headlines: {}".format(headlines.shape[0]))

Number of headlines: 650


In [5]:
# Check some basic statistics about headline lengths
lengths = headlines['Headlines'].str.len()
print("Max length of headlines: {}".format(lengths.max()))
print("Min length of headlines: {}".format(lengths.min()))
print("Average length of headlines: {:.2f}".format(lengths.mean()))

Max length of headlines: 87
Min length of headlines: 16
Average length of headlines: 40.77


In [6]:
import re
import geonamescache
import unicodedata

In [153]:
# Initialize the GeonamesCache instance and collect city names with their GeoNames IDs
gc = geonamescache.GeonamesCache()

city_names = {'name': [], 'id':[]}
for each in gc.get_cities().values():
    city_names['name'].append(each['name'])
    city_names['id'].append(str(each['geonameid']))
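
The notebook imports `unicodedata` but does not use it in the cells shown. One plausible use, sketched here as an assumption rather than the notebook's actual method, is to strip accents before comparing names: the headlines write "Sao Paulo" and "Cucuta" without diacritics, while gazetteer names may carry them. The helper name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("São Paulo"))
print(strip_accents("Cúcuta"))
```

Applying the same helper to both the city names and the headlines keeps the comparison symmetric, so an accented spelling in either source still matches.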

In [163]:
def get_country_name_lat_long(country_name, idx):
    # country_name is unused; the city is identified by its position in city_names
    city = gc.get_cities()[city_names['id'][idx]]
    country = gc.get_countries()[city['countrycode']]['name']
    return city['name'], country, city['latitude'], city['longitude']

In [164]:
df_dict = {"Headlines":[],"Cities":[], "Countries":[], "Latitude":[], "Longitude":[]}

In [169]:
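# Note: this is an unanchored substring search, so short city names can match
# inside longer words (for example "Ami" inside "Miami").
# df_dict is appended to in place, so re-running this cell duplicates rows.
for idx, name in enumerate(city_names['name']):
    pattern = re.escape(name)  # city names are literal text, not regex patterns
    for each in headlines.values[:10]:
        # re.search scans the whole headline; re.match would only test its start
        if re.search(pattern, each[0], flags=re.IGNORECASE):
            city, country, lat, long = get_country_name_lat_long(name, idx)
            df_dict['Headlines'].append(each[0])
            df_dict['Cities'].append(city)
            df_dict['Countries'].append(country)
            df_dict['Latitude'].append(lat)
            df_dict['Longitude'].append(long)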

In [170]:
pd.DataFrame(df_dict).sample(10)

| | Headlines | Cities | Countries | Latitude | Longitude |
|---|---|---|---|---|---|
| 33 | Trinidad confirms first Zika case | Trinidad | Uruguay | -33.5165 | -56.89957 |
| 20 | Could Zika Reach New York City? | York | United States | 39.9626 | -76.72774 |
| 2 | Dallas man comes down with case of Zika | Man | Ivory Coast | 7.41251 | -7.55383 |
| 24 | Could Zika Reach New York City? | New York City | United States | 40.71427 | -74.00597 |
| 31 | Dallas man comes down with case of Zika | Dallas | United States | 32.78306 | -96.80667 |
| 8 | Zika Infested Monkeys in Sao Paulo | Mon | India | 26.73583 | 95.05841 |
| 12 | Zika Outbreak Hits Miami | Ami | Japan | 36.03333 | 140.2 |
| 26 | Zika Outbreak Hits Miami | Brea | United States | 33.91668 | -117.90006 |
| 21 | Dallas man comes down with case of Zika | Dallas | United States | 32.78306 | -96.80667 |
| 11 | Mystery Virus Spreads in Recife, Brazil | Bra | Italy | 44.69776 | 7.85128 |
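
Several rows in the table above are false positives: "Ami" is found inside "Miami", "Brea" inside "Outbreak", "Bra" inside "Brazil", and "York" inside "New York City". A minimal sketch of one way to reduce them follows. It assumes whole-word, case-sensitive matching with longer names tried first, and it uses a small hypothetical list `sample_cities` in place of `city_names['name']`. The names `city_pattern` and `find_city` are illustrative and not part of the notebook.

```python
import re

# Hypothetical sample of city names; the notebook would use city_names['name'].
sample_cities = ["New York City", "York", "Man", "Dallas", "Miami", "Ami", "Brea"]

# Longest names first, so the alternation prefers a longer name over its prefix.
ordered = sorted(sample_cities, key=len, reverse=True)

# \b anchors each name to word boundaries; re.escape treats names as literal text.
# Matching is case-sensitive here, so the word "man" does not match the city "Man".
city_pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")

def find_city(headline):
    """Return the first whole-word city name found in the headline, or None."""
    match = city_pattern.search(headline)
    return match.group(0) if match else None

for text in ["Could Zika Reach New York City?",
             "Zika Outbreak Hits Miami",
             "Dallas man comes down with case of Zika"]:
    print(text, "->", find_city(text))
```

Compiling a single pattern also means each headline is scanned once, rather than once per city name. Case-sensitive matching is a trade-off: it would miss a city written in all lowercase or all capitals.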


In [158]:
gc.get_cities().get(city_names['id'][idx]).get('name')

'Trinidad'

In [159]:
gc.get_cities().get(city_names['id'][idx]).get('countrycode')

'BO'

In [161]:
gc.get_countries().get('BO').get('name')

'Bolivia'

In [162]:
gc.get_cities().get(city_names['id'][idx]).get('latitude')

-14.83333
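
The lookups above resolve "Trinidad" to Bolivia, while the earlier results table lists a Trinidad in Uruguay, so a single city name can map to several records. One common heuristic, offered here as an assumption and not as the notebook's method, is to keep the most populous city for each name. The sketch below uses hypothetical records shaped like the values of `gc.get_cities()`; the population figures are placeholders, not real data, and `largest_city_by_name` is an illustrative helper.

```python
# Hypothetical records shaped like the values of gc.get_cities().
# The population figures are placeholders for illustration, not real data.
sample_records = {
    "1": {"name": "Trinidad", "countrycode": "BO", "population": 2000},
    "2": {"name": "Trinidad", "countrycode": "UY", "population": 1000},
    "3": {"name": "Dallas", "countrycode": "US", "population": 3000},
}

def largest_city_by_name(cities):
    """Map each city name to its record with the highest population."""
    best = {}
    for record in cities.values():
        current = best.get(record["name"])
        if current is None or record["population"] > current["population"]:
            best[record["name"]] = record
    return best

lookup = largest_city_by_name(sample_records)
print(lookup["Trinidad"]["countrycode"])
```

A dictionary keyed by name also replaces the parallel `name` and `id` lists, so a matched name leads directly to its record without tracking an index.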