# Discovering Disease Outbreaks from News Headlines

Identifying and mapping epidemics is crucial to prevent or respond to deadly disease outbreaks. Your first assignment for the WHO is as follows:

- Extract the locations (city and/or country name) from each news headline.
- Find the geographic coordinates of each headline using the city/country.
- Cluster (group) the headlines based on the geographic location.
- Visualize the clusters on a map and analyze them for patterns indicating an epidemic.
- Investigate the largest clusters for signs of disease outbreaks.
- Review headlines in the largest clusters within the United States and around the world. If any disease outbreak is particularly dominant, visualize all worldwide mentions of that disease.
- Provide a summary of your findings to your superiors at the WHO so they can direct resources.

# 1. Parsing the News Headlines

**Objective**

Find any city and/or country names mentioned in each of the news headlines.

**Workflow**

1. Load in the headline data and examine it for any data quality issues.
1. Use any library/data structure to read in the headlines
1. Read through some of the headlines and identify potential problems
1. Using regular expressions and the cities and countries within the geonamescache library, match any cities/countries within each headline.
1. Make sure to normalize headlines and city/country names by removing accent marks. This can be done with the unidecode library.
1. Watch out for multiple cities in a headline and matches on short words! We want the match to be on the entire city—for example San Marino—and not a partial match—San.
1. Put the extracted data into a pandas DataFrame with three columns: headline, city, country.
1. Make sure there were no issues with the extraction by sampling some of the headlines and examining the city and country names.
1. One method for finding problems is to look for the most common names and see if there are any issues.
1. Once you are confident you’ve found all the cities/countries in each headline, save the DataFrame for the next part.
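
The normalization and whole-name matching steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `strip_accents` is a hypothetical stand-in for `unidecode.unidecode` built on `unicodedata`, the three-name list stands in for the geonamescache city names, and the accented headlines are made up for the example.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical stand-in for unidecode: drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy name list (assumption); the notebook uses geonamescache instead
names = ["Bron", "Bogota", "Sao Paulo"]

# \b word boundaries stop "Bron" from matching inside "Bronchitis"
pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

# Illustrative headlines (the accented spellings are invented for this sketch)
headlines = [
    "Zika arrives in São Paulo",
    "18 new Zika Cases in Bogotá",
    "Bronchitis Outbreak in Manhasset",
]

for headline in headlines:
    match = pattern.search(strip_accents(headline))
    print(headline, "->", match.group(1) if match else None)
```

The third headline yields no match because "Bron" is only part of a longer word, which is the kind of partial match the workflow warns about.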

**Importance to project**

* We can’t do much with just the headlines; although they contain the city/country names, they do not contain the geographic information—latitude and longitude—we need to find clusters of disease outbreaks. The first step in getting the geographic information is to isolate the names.

* Later, we will use the names to find the location of each headline, which requires bringing in external data (through geonamescache).

* This workflow is common in data science. First, we separate the useful information from the noise—data mining—and then we augment it with external data—data engineering.

## Import all relevant libraries

In [None]:
# Regular expression
import re
from typing import Optional

# Data analysis and wrangling
import numpy as np
import pandas as pd

# Set display options for pandas
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("max_colwidth", None)

import warnings

## Visualization
# matplotlib
import matplotlib.pyplot as plt

# Normalized unicode data (to remove accents)
import unidecode

# Geonames cache for city and country names
import geonamescache

# Ignore warning
warnings.filterwarnings("ignore")

## Headline data


In [27]:
HEADLINES_FILE_PATH = r"..\data\headlines.txt"

with open(HEADLINES_FILE_PATH, "r") as file:
    headlines = file.readlines()

df = pd.DataFrame([h.strip() for h in headlines], columns=["headline"])

# Sort headlines alphabetically so that duplicates end up next to each other
df = df.sort_values(by="headline", ascending=True).reset_index(drop=True)

# View first few headlines
df.head()

Unnamed: 0,headline
0,18 new Zika Cases in Bogota
1,19 new Zika Cases in Sengkang
2,Alameda Residents Recieve Rabies vaccine
3,Albany Residents Recieve Respiratory Syncytial Virus vaccine
4,Antipolo under threat from Zika Virus


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650 entries, 0 to 649
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   headline  650 non-null    object
dtypes: object(1)
memory usage: 5.2+ KB


#### Check the data quality of the `headlines`

> We will check for:
> - **duplicates**
> - **missing values**
> - **data types**




##### Duplicates check

In [29]:
df[df.duplicated(keep=False)]

Unnamed: 0,headline
31,Barcelona Struck by Spanish Flu
32,Barcelona Struck by Spanish Flu
421,Spanish Flu Outbreak in Lisbon
422,Spanish Flu Outbreak in Lisbon
424,Spanish Flu Spreading through Madrid
425,Spanish Flu Spreading through Madrid


> 📝**Note**: Three headlines appear twice (6 rows in total).
>
> 💡 **Solution**: We will keep the first occurrence of each duplicated headline and drop the rest.

###### Remove duplicates

In [30]:
# remove duplicates
df.drop_duplicates(subset="headline", keep="first", ignore_index=True, inplace=True)
print(df.shape)

(647, 1)


> 📝**Note**: We have 647 headlines remaining after removing the 3 duplicate rows.
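
For intuition, `drop_duplicates(keep="first")` behaves like the plain-Python idiom below, which keeps the first occurrence of each value and preserves order. The list is limited to the duplicated headlines shown in the check above; this is an illustration, not a replacement for the pandas call.

```python
headlines = [
    "Barcelona Struck by Spanish Flu",
    "Barcelona Struck by Spanish Flu",
    "Spanish Flu Outbreak in Lisbon",
    "Spanish Flu Outbreak in Lisbon",
    "Spanish Flu Spreading through Madrid",
    "Spanish Flu Spreading through Madrid",
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this keeps the first copy of each headline
unique_headlines = list(dict.fromkeys(headlines))

removed = len(headlines) - len(unique_headlines)
print(f"kept {len(unique_headlines)}, removed {removed}")
```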

##### Missing values check

In [31]:
df.headline.isnull().sum()

np.int64(0)

> 📝**Note**: There are no missing values in `headline`.

##### Data types check

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 647 entries, 0 to 646
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   headline  647 non-null    object
dtypes: object(1)
memory usage: 5.2+ KB



###### Change the data type of `headline` to string



In [33]:
# This is important for the next step where we will use regex to extract city and country names
df.headline = df.headline.astype(dtype="string")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 647 entries, 0 to 646
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   headline  647 non-null    string
dtypes: string(1)
memory usage: 5.2 KB


#### Add headline IDs `headlineid` and Normalize the `headline` column

> 📝**Note**: `headline` has 647 unique values. All headlines are unique.



In [34]:
# Add headline IDs and normalize
df.reset_index(drop=False, names="headlineid", inplace=True)
df["headlineid"] = df["headlineid"].astype(str).str.zfill(width=3)
df["headline"] = df["headline"].apply(lambda x: unidecode.unidecode(x))

df.head()  # Add a head() to show the result of these operations

Unnamed: 0,headlineid,headline
0,0,18 new Zika Cases in Bogota
1,1,19 new Zika Cases in Sengkang
2,2,Alameda Residents Recieve Rabies vaccine
3,3,Albany Residents Recieve Respiratory Syncytial Virus vaccine
4,4,Antipolo under threat from Zika Virus


## GeonamesCache data


In [47]:
from geonamescache import GeonamesCache

"""
get_continents()
get_countries()
get_us_states()
get_cities()
get_countries_by_names()
get_us_states_by_names()
get_cities_by_name(name)
get_us_counties()
"""
gc = GeonamesCache()

### US states and counties 

#### US states

In [48]:
states = pd.DataFrame(gc.get_us_states_by_names()).T.reset_index(drop=True)
states = states.sort_values(by="name", ascending=False).reset_index(drop=True)
states = states.rename(
    columns={"name": "statename", "code": "statecode", "geonameid": "stateid"}
)
# states.info()
print(states.shape)
states.head()

(51, 4)


Unnamed: 0,statecode,statename,fips,stateid
0,WY,Wyoming,56,5843591
1,WI,Wisconsin,55,5279468
2,WV,West Virginia,54,4826850
3,WA,Washington,53,5815135
4,VA,Virginia,51,6254928


> 📝 **Note**: The `states` DataFrame consists of the 50 states and the District of Columbia.

#### US Counties

In [49]:
# Retrieve county names and create a dataframe.
counties = pd.DataFrame(gc.get_us_counties()).T.reset_index(drop=True).T

counties.columns = ["countyid", "countyname", "statecode"]
counties = counties.sort_values(
    by=["statecode", "countyname"], ascending=False
).reset_index(drop=True)
print(counties.shape)
counties.head(10)

(3235, 3)


Unnamed: 0,countyid,countyname,statecode
0,56045,Weston County,WY
1,56043,Washakie County,WY
2,56041,Uinta County,WY
3,56039,Teton County,WY
4,56037,Sweetwater County,WY
5,56035,Sublette County,WY
6,56033,Sheridan County,WY
7,56031,Platte County,WY
8,56029,Park County,WY
9,56027,Niobrara County,WY


In [50]:
3235 - 3143

92

> 📝 **Note**: The `counties` DataFrame has 3,235 rows covering counties in the US and its territories. Compared with `states`, it contains 6 more state codes (PR, AS, MP, VI, GU, UM), which account for 92 additional county-level rows.
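
The size of the gap between the two tables can be checked from the per-code row counts printed by `counties.statecode.value_counts()` in this notebook. The sketch below hard-codes the six territory counts from that output; in the notebook itself the same numbers could be derived from `counties` and `states` directly.

```python
# Row counts for state codes that appear in `counties` but not in `states`
# (values copied from the notebook's value_counts() output)
territory_rows = {"PR": 78, "AS": 5, "MP": 4, "VI": 3, "GU": 1, "UM": 1}

total_rows = 3235  # number of rows in `counties`
extra_codes = len(territory_rows)
extra_rows = sum(territory_rows.values())

print(extra_codes, extra_rows, total_rows - extra_rows)
```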

In [51]:
counties.statecode.value_counts()

statecode
TX    254
GA    159
VA    134
KY    120
MO    115
KS    105
IL    102
NC    100
IA     99
TN     95
NE     93
IN     92
OH     88
MN     87
MI     83
MS     82
PR     78
OK     77
AR     75
WI     72
AL     67
FL     67
PA     67
SD     66
LA     64
CO     64
NY     62
CA     58
MT     56
WV     55
ND     53
SC     46
ID     44
WA     39
OR     36
NM     33
AK     29
UT     29
MD     24
WY     23
NJ     21
NV     17
ME     16
AZ     15
VT     14
MA     14
NH     10
CT      8
RI      5
HI      5
AS      5
MP      4
VI      3
DE      3
GU      1
DC      1
UM      1
Name: count, dtype: int64

In [52]:
states.statecode.nunique()

51

In [53]:
counties.statecode.nunique()

57

#### Merge US states and counties

In [54]:
# merge state and county dataframes

us_counties = (
    pd.merge(
        counties, states, on="statecode", how="left", suffixes=("_county", "_state")
    )
    .sort_values(by=["statename", "countyname"], ascending=False)
    .reset_index(drop=True)
)
us_counties = us_counties.rename(columns={"countyname": "county", "statename": "state"})

# Remove accents from county names
us_counties["county"] = us_counties["county"].apply(lambda x: unidecode.unidecode(x))

print(us_counties.shape)
us_counties.head()

(3235, 6)


Unnamed: 0,countyid,county,statecode,state,fips,stateid
0,56045,Weston County,WY,Wyoming,56,5843591
1,56043,Washakie County,WY,Wyoming,56,5843591
2,56041,Uinta County,WY,Wyoming,56,5843591
3,56039,Teton County,WY,Wyoming,56,5843591
4,56037,Sweetwater County,WY,Wyoming,56,5843591


In [55]:
# Which counties have a state code that is missing from the `states` dataframe?
us_counties[us_counties.state.isna()]

Unnamed: 0,countyid,county,statecode,state,fips,stateid
3143,72153,Yauco Municipio,PR,,,
3144,72151,Yabucoa Municipio,PR,,,
3145,60050,Western District,AS,,,
3146,72149,Villalba Municipio,PR,,,
3147,72147,Vieques Municipio,PR,,,
3148,72145,Vega Baja Municipio,PR,,,
3149,72143,Vega Alta Municipio,PR,,,
3150,72141,Utuado Municipio,PR,,,
3151,72139,Trujillo Alto Municipio,PR,,,
3152,72137,Toa Baja Municipio,PR,,,


In [56]:
def remove_county_suffix(county: str) -> Optional[str]:
    """Remove common suffixes from county names."""
    s = [
        "County",
        "Municipality",
        "Municipio",
        "Census Area",
        "City and Borough",
        "Borough",
        "Parish",
    ]
    #  Create a regex pattern to match any of the suffixes

    regexs = "|".join(s)

    # Use regex to remove the suffixes
    county_name_only = re.sub(rf"\s*({regexs})\s*$", "", county, flags=re.IGNORECASE)
    return county_name_only.strip() if county_name_only else None


# Insert a new column for cleaned county names
us_counties.insert(loc=2, column="countyname", value=None)

# Apply the function to the 'county' column
us_counties["countyname"] = us_counties["county"].apply(remove_county_suffix)


print(counties.shape)


us_counties.head(10)

(3235, 3)


Unnamed: 0,countyid,county,countyname,statecode,state,fips,stateid
0,56045,Weston County,Weston,WY,Wyoming,56,5843591
1,56043,Washakie County,Washakie,WY,Wyoming,56,5843591
2,56041,Uinta County,Uinta,WY,Wyoming,56,5843591
3,56039,Teton County,Teton,WY,Wyoming,56,5843591
4,56037,Sweetwater County,Sweetwater,WY,Wyoming,56,5843591
5,56035,Sublette County,Sublette,WY,Wyoming,56,5843591
6,56033,Sheridan County,Sheridan,WY,Wyoming,56,5843591
7,56031,Platte County,Platte,WY,Wyoming,56,5843591
8,56029,Park County,Park,WY,Wyoming,56,5843591
9,56027,Niobrara County,Niobrara,WY,Wyoming,56,5843591
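
A quick way to sanity-check the suffix removal is to run it on a few hand-picked names. The sketch below is a compact restatement of the same idea rather than the notebook's function: `strip_suffix` and the sample inputs are illustrative assumptions.

```python
import re
from typing import Optional

SUFFIXES = [
    "County",
    "Municipality",
    "Municipio",
    "Census Area",
    "City and Borough",
    "Borough",
    "Parish",
]
# One trailing suffix, with optional surrounding whitespace, anchored at the end
SUFFIX_RE = re.compile(r"\s*(" + "|".join(SUFFIXES) + r")\s*$", flags=re.IGNORECASE)


def strip_suffix(name: str) -> Optional[str]:
    """Remove one trailing suffix; return None if nothing is left."""
    cleaned = SUFFIX_RE.sub("", name).strip()
    return cleaned or None


for sample in ["Weston County", "Yauco Municipio", "Juneau City and Borough",
               "Western District", "County"]:
    print(repr(sample), "->", repr(strip_suffix(sample)))
```

Note that "Western District" comes back unchanged because "District" is not in the suffix list, so names of that form keep their suffix unless the list is extended.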


### Countries and cities


#### Countries

In [57]:
countries = pd.DataFrame(gc.get_countries_by_names()).T
# sort countries by name and reset index
countries = countries.sort_values(by="name", ascending=True).reset_index(drop=True)
# Rename columns for clarity
countries.rename(
    columns={"iso": "countrycode", "name": "countryname", "geonameid": "countryid"},
    inplace=True,
)

# countries.info()
print(countries.shape)
countries.head(10)

(252, 17)


Unnamed: 0,countryid,countryname,countrycode,iso3,isonumeric,fips,continentcode,capital,areakm2,population,tld,currencycode,currencyname,phone,postalcoderegex,languages,neighbours
0,1149361,Afghanistan,AF,AFG,4,AF,AS,Kabul,647500,37172386,.af,AFN,Afghani,93,,"fa-AF,ps,uz-AF,tk","TM,CN,IR,TJ,PK,UZ"
1,661882,Aland Islands,AX,ALA,248,,EU,Mariehamn,1580,26711,.ax,EUR,Euro,+358-18,^(?:FI)*(\d{5})$,sv-AX,
2,783754,Albania,AL,ALB,8,AL,EU,Tirana,28748,2866376,.al,ALL,Lek,355,^(\d{4})$,"sq,el","MK,GR,ME,RS,XK"
3,2589581,Algeria,DZ,DZA,12,AG,AF,Algiers,2381740,42228429,.dz,DZD,Dinar,213,^(\d{5})$,ar-DZ,"NE,EH,LY,MR,TN,MA,ML"
4,5880801,American Samoa,AS,ASM,16,AQ,OC,Pago Pago,199,55465,.as,USD,Dollar,+1-684,96799,"en-AS,sm,to",
5,3041565,Andorra,AD,AND,20,AN,EU,Andorra la Vella,468,77006,.ad,EUR,Euro,376,^(?:AD)*(\d{3})$,ca,"ES,FR"
6,3351879,Angola,AO,AGO,24,AO,AF,Luanda,1246700,30809762,.ao,AOA,Kwanza,244,,pt-AO,"CD,NA,ZM,CG"
7,3573511,Anguilla,AI,AIA,660,AV,,The Valley,102,13254,.ai,XCD,Dollar,+1-264,,en-AI,
8,6697173,Antarctica,AQ,ATA,10,AY,AN,,14000000,0,.aq,,,,,,
9,3576396,Antigua and Barbuda,AG,ATG,28,AC,,St. John's,443,96286,.ag,XCD,Dollar,+1-268,,en-AG,


In [58]:
# Keep only necessary columns
countries = countries[["countryname", "capital", "countrycode", "countryid"]]

countries.head(10)

Unnamed: 0,countryname,capital,countrycode,countryid
0,Afghanistan,Kabul,AF,1149361
1,Aland Islands,Mariehamn,AX,661882
2,Albania,Tirana,AL,783754
3,Algeria,Algiers,DZ,2589581
4,American Samoa,Pago Pago,AS,5880801
5,Andorra,Andorra la Vella,AD,3041565
6,Angola,Luanda,AO,3351879
7,Anguilla,The Valley,AI,3573511
8,Antarctica,,AQ,6697173
9,Antigua and Barbuda,St. John's,AG,3576396


In [None]:
# normalize country names to remove accents
countries.countryname = countries.countryname.apply(lambda x: unidecode.unidecode(x))

In [60]:
countries

Unnamed: 0,countryname,capital,countrycode,countryid
0,Afghanistan,Kabul,AF,1149361
1,Aland Islands,Mariehamn,AX,661882
2,Albania,Tirana,AL,783754
3,Algeria,Algiers,DZ,2589581
4,American Samoa,Pago Pago,AS,5880801
5,Andorra,Andorra la Vella,AD,3041565
6,Angola,Luanda,AO,3351879
7,Anguilla,The Valley,AI,3573511
8,Antarctica,,AQ,6697173
9,Antigua and Barbuda,St. John's,AG,3576396


#### Cities


In [61]:
# create a dataframe for cities
cities = pd.DataFrame(gc.get_cities()).T.reset_index(drop=True)
print(cities.shape)
cities.head()

(26463, 9)


Unnamed: 0,geonameid,name,latitude,longitude,countrycode,population,timezone,admin1code,alternatenames
0,3040051,les Escaldes,42.50729,1.53414,AD,15853,Europe/Andorra,8,"[Ehskal'des-Ehndzhordani, Escaldes, Escaldes-Engordany, Les Escaldes, esukarudesu=engorudani jiao qu, lai sai si ka er de-en ge er da, Эскальдес-Энджордани, エスカルデス＝エンゴルダニ教区, 萊塞斯卡爾德-恩戈爾達, 萊塞斯卡爾德－恩戈爾達]"
1,3041563,Andorra la Vella,42.50779,1.52109,AD,20430,Europe/Andorra,7,"[ALV, Ando-la-Vyey, Andora, Andora la Vela, Andora la Velja, Andora lja Vehl'ja, Andoro Malnova, Andorra, Andorra Tuan, Andorra a Vella, Andorra la Biella, Andorra la Vella, Andorra la Vielha, Andorra-a-Velha, Andorra-la-Vel'ja, Andorra-la-Vielye, Andorre-la-Vieille, Andò-la-Vyèy, Andòrra la Vièlha, an dao er cheng, andolalabeya, andwra la fyla, Ανδόρρα, Андора ла Веля, Андора ла Веља, Андора ля Вэлья, Андорра-ла-Велья, אנדורה לה וולה, أندورا لا فيلا, አንዶራ ላ ቬላ, アンドラ・ラ・ヴェリャ, 安道爾城, 안도라라베야]"
2,290594,Umm Al Quwain City,25.56473,55.55517,AE,62747,Asia/Dubai,7,"[Oumm al Qaiwain, Oumm al Qaïwaïn, Um al Kawain, Um al Quweim, Umm Al Quwain City, Umm al Qaiwain, Umm al Qawain, Umm al Qaywayn, Umm al-Quwain, Umm-ehl'-Kajvajn, Yumul al Quwain, am alqywyn, mdynt am alqywyn, Умм-эль-Кайвайн, أم القيوين, مدينة ام القيوين]"
3,291074,Ras Al Khaimah City,25.78953,55.9432,AE,351943,Asia/Dubai,5,"[Julfa, Khaimah, RAK City, RKT, Ra's al Khaymah, Ra's al-Chaima, Ras Al Khaimah City, Ras al Khaimah, Ras al-Khaimah, Ras el Khaimah, Ras el Khaïmah, Ras el-Kheima, Ras-ehl'-Khajma, Ra’s al Khaymah, Ra’s al-Chaima, mdynt ras alkhymt, ras alkhymt, Рас-эль-Хайма, رأس الخيمة, مدينة رأس الخيمة]"
4,291580,Zayed City,23.65416,53.70522,AE,63482,Asia/Dubai,1,"[Bid' Zayed, Bid’ Zayed, Madinat Za'id, Madinat Zayid, Madīnat Zāyid, Madīnat Zā’id, Zayed City, mdynt zayd, مدينة زايد]"


In [62]:
# sort cities by country code and name in ascending order
cities = cities.sort_values(by=["countrycode", "name"]).reset_index(drop=True)
# drop unnecessary columns
# cities = cities.drop(
#     columns=[
#         "population"
#           , "alternatenames"
#     ]
# )
# rename columns for clarity
cities = cities.rename({"geonameid": "cityid", "name": "cityname_raw"}, axis=1)
# transliterate accented city names to plain ASCII
cities.insert(loc=2, column="city", value=np.nan)
cities.city = cities["cityname_raw"].apply(func=lambda x: unidecode.unidecode(string=x))

cities.head(10)

Unnamed: 0,cityid,cityname_raw,city,latitude,longitude,countrycode,population,timezone,admin1code,alternatenames
0,3041563,Andorra la Vella,Andorra la Vella,42.50779,1.52109,AD,20430,Europe/Andorra,7,"[ALV, Ando-la-Vyey, Andora, Andora la Vela, Andora la Velja, Andora lja Vehl'ja, Andoro Malnova, Andorra, Andorra Tuan, Andorra a Vella, Andorra la Biella, Andorra la Vella, Andorra la Vielha, Andorra-a-Velha, Andorra-la-Vel'ja, Andorra-la-Vielye, Andorre-la-Vieille, Andò-la-Vyèy, Andòrra la Vièlha, an dao er cheng, andolalabeya, andwra la fyla, Ανδόρρα, Андора ла Веля, Андора ла Веља, Андора ля Вэлья, Андорра-ла-Велья, אנדורה לה וולה, أندورا لا فيلا, አንዶራ ላ ቬላ, アンドラ・ラ・ヴェリャ, 安道爾城, 안도라라베야]"
1,3040051,les Escaldes,les Escaldes,42.50729,1.53414,AD,15853,Europe/Andorra,8,"[Ehskal'des-Ehndzhordani, Escaldes, Escaldes-Engordany, Les Escaldes, esukarudesu=engorudani jiao qu, lai sai si ka er de-en ge er da, Эскальдес-Энджордани, エスカルデス＝エンゴルダニ教区, 萊塞斯卡爾德-恩戈爾達, 萊塞斯卡爾德－恩戈爾達]"
2,292968,Abu Dhabi,Abu Dhabi,24.45118,54.39696,AE,1807000,Asia/Dubai,1,"[A-pu-that-pi, AEbu Saby, AUH, Aboe Dhabi, Abou Dabi, Abu Dabi, Abu Dabis, Abu Daby, Abu Daibi, Abu Dhabi, Abu Dhabi Island and Internal Islands City, Abu Dhabi emiraat, Abu Zabi, Abu Zaby, Abu Zabye, Abu Zabyo, Abu Ḍabi, Abu Ḑabi, Abu-Dabi, Abu-Dabi khot, Abu-Dabio, Abu-Dzabi, Abú Dabí, Abú Daibí, Abú Zabí, Abû Daby, Abū Dabī, Abū Z̧aby, Abū Z̧abye, Abū Z̧abyo, Abū Z̧abī, Ampou Ntampi, Ebu Dabi, Ebu Dhabi, a bu zha bi, abu dhabi, abu-dabi, abudabi, abudhabi, abw zby, abwzby, aputapi, jzyrt abwzby wjzr dakhlyt akhry, xa bud abi, Â-pu-tha̍t-pí, Äbu Saby, Əbu-Dabi, Άμπου Ντάμπι, Αμπου Νταμπι, Αμπού Ντάμπι, Абу Даби, Абу-Даби, Абу-Даби хот, Абу-Дабі, Әбу-Даби, Աբու Դաբի, אבו דאבי, أبوظبي, ئەبووزەبی, ابو ظبى, ابوظبی, ابوظہبی, جزيرة أبوظبي وجزر داخلية اخرى, अबु धाबी, अबू धाबी, আবুধাবি, ਅਬੂ ਧਾਬੀ, ଆବୁଧାବି, அபுதாபி, ಅಬು ಧಾಬಿ, അബുദാബി, අබුඩාබි, อาบูดาบี, ཨ་པོའུ་དྷ་པེ།, အဘူဒါဘီမြို့, აბუ-დაბი, አቡ ዳቢ, アブダビ, 阿布扎比, 아부다비]"
3,292953,Adh Dhayd,Adh Dhayd,25.28812,55.88157,AE,20165,Asia/Dubai,6,"[Adh Dhaid, Adh Dhayd, Al Daid, Al-Dhayd, Dayd, Dhaid, Dhayd, Duhayd, Ihaid, aldhyd, الذيد, Ḑayd]"
4,292932,Ajman City,Ajman City,25.40177,55.47878,AE,490035,Asia/Dubai,2,"[Ajman, Ajman City, Al Ajman, QAJ, Ujman, mdynt ʿjman, ʿjman, عجمان, مدينة عجمان]"
5,292913,Al Ain City,Al Ain City,24.19167,55.76056,AE,846747,Asia/Dubai,1,"[AAN, Ainas, Al Ain, Al Ain City, Al Ajn, Al Ayn, Al `Ayn, Al Ɛayn, Al ‘Ayn, Al-Ain, Al-Ajn, Al-Ayin, Al-Ayn, Al-Aïn, Ehl'-Ajn, El Ain, El-Ajn, ai yin, al ain, al-ain, al-aini, alʿyn, ela ena, mdynt alʿyn, Ел Аин, Эль-Айн, Ալ-Ային, אל-עין, العين, العین, مدينة العين, एल एन, அல் ஐன், അൽ ഐൻ, ალ-აინი, アル・アイン, 艾因, 알아인]"
6,292878,Al Fujairah City,Al Fujairah City,25.11641,56.34141,AE,86512,Asia/Dubai,4,"[Al Fujairah City, Al Fujayrah, Al-Fudjayra, Al-Fujayrah' emiraat, FJR, Fudschaira, Fudzhejra, Fujaira, Fujairah, Fujajro, Fujayrah, Fuĵajro, alfjyrt, fjyrt, fu ji la, fujaira, mdynt alfjyrt, Фуджейра, الفجيرة, فجيرة, مدينة الفجيرة, フジャイラ, 富吉拉]"
7,12047416,Al Shamkhah City,Al Shamkhah City,24.39268,54.70779,AE,61710,Asia/Dubai,1,"[Al Shamkhah City, mdynt alshamkht, مدينة الشامخة]"
8,292688,Ar Ruways,Ar Ruways,24.11028,52.73056,AE,25000,Asia/Dubai,1,"[Ar Ru'ays, Ar Ruways, Ar Ru’ays, Ar-Ruvais, Ruwais, Ар-Руваис]"
9,12042052,Bani Yas City,Bani Yas City,24.30978,54.62944,AE,80498,Asia/Dubai,1,"[Bani Yas, Bani Yas City, mdynt bny yas, مدينة بني ياس]"


#### Merge the `countries` dataframe into the `cities` dataframe

In [63]:
# How many different country codes are there in the cities dataframe?
len(cities.countrycode.unique())

244

> ⚠️ **Caution**: The `countries` and `cities` dataframes do not cover the same set of countries. `countries` has 252 entries, but only 244 distinct country codes appear in `cities`. The merge below is a left join from `cities`, so every city is kept and any country without a listed city is absent from the result. Use `how="outer"` instead if those countries must be retained.
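
The effect of the join type can be seen on a toy example. The sketch below uses plain Python sets with a hand-picked set of country codes (an assumption for illustration only) to show which codes survive a left join from the cities table versus an outer join.

```python
# Toy data (assumption): country codes present in each table
country_codes = {"AD", "AE", "AQ"}  # "AQ" has no city rows in this toy example
city_rows = [("Andorra la Vella", "AD"), ("Abu Dhabi", "AE")]

city_codes = {code for _, code in city_rows}

# A left join from cities keeps exactly the codes that have a city row
left_join_codes = city_codes

# An outer join keeps codes from both sides
outer_join_codes = city_codes | country_codes

print(sorted(outer_join_codes - left_join_codes))
```

The printed codes are the countries that only an outer join would carry into the merged table, each with empty city columns.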


In [None]:
city = (
    cities[["city", "population", "latitude", "longitude", "countrycode", "cityid"]]
    .merge(countries, how="left", on="countrycode", suffixes=("_city", "_country"))
    .sort_values(by=["countrycode", "city"], ascending=True)
    .reset_index(drop=True)
)
# rearrange columns for better readability
city = city[
    [
        "city",
        "countryname",
        "population",
        "capital",
        "latitude",
        "longitude",
        "countrycode",
        "cityid",
        "countryid",
    ]
]

print(city.shape)
city.head(10)

In [None]:
# city.columns.tolist()

## Extract cities and countries from headlines

### Use regex to retrieve all country and city names from the headline column

In [None]:
from re import Pattern
from typing import Any, Optional, Iterable


# def get_compiled_regex(name: str) -> Pattern[Any]:
#     """Create a regex pattern for the given name. Compiled one name at a time."""
#     regex = f"\\b{name}\\b"
#     return re.compile(regex)  # re.IGNORECASE)


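def get_big_compiled_regex(namelist: Iterable) -> Pattern[str]:
    """
    Create one regex pattern that matches any name in the given name list.
    Drops missing values and duplicates, sorts by length (descending) so that
    longer names are tried before shorter ones, then compiles.
    """
    # Keep only non-empty strings; merged columns may contain NaN
    names = {name for name in namelist if isinstance(name, str) and name}
    # Longest first, alphabetical within the same length for a stable pattern
    ordered = sorted(names, key=lambda name: (-len(name), name))
    all_names = "|".join(re.escape(name) for name in ordered)
    regex = f"\\b({all_names})\\b"
    # return re.compile(regex, re.IGNORECASE)
    return re.compile(regex)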
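
To see why the names are sorted by length before compiling, consider a toy list that contains both "San" and "San Marino", the example from the workflow. The helper below is a simplified sketch of the same idea, not the function defined in this notebook; `build_pattern` and the headline are made up for illustration.

```python
import re


def build_pattern(names, longest_first=True):
    """Compile one alternation; optionally put longer names first."""
    ordered = sorted(names, key=len, reverse=True) if longest_first else list(names)
    return re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")


names = ["San", "San Marino"]
headline = "Flu outbreak reported in San Marino"  # made-up headline

naive_pattern = build_pattern(names, longest_first=False)
sorted_pattern = build_pattern(names, longest_first=True)

print(naive_pattern.findall(headline))   # alternation tries "San" first
print(sorted_pattern.findall(headline))  # alternation tries "San Marino" first
```

Word boundaries alone do not help here, because "San" is a complete word inside "San Marino"; only the ordering of the alternatives makes the full name win.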

In [None]:
# def get_matched_country(headline: str) -> Optional[str]:
#     """Extract the matched country string from a headline string by compiled one regex pattern of country at a time."""
#     # Create a compiled regex for each country name
#     for country_name in city.countryname:
#         compiled_country = get_compiled_regex(country_name)
#         # match country in headline
#         match_country = compiled_country.search(headline)
#         if match_country is not None:
#             start, end = match_country.start(), match_country.end()
#             return headline[start:end]


def get_matched_country(headline: str) -> Optional[str]:
    """Extract the matched country name string from a headline string by compiled regex pattern of all countries."""
    compiled_countries = get_big_compiled_regex(city.countryname)
    matched_countries = re.findall(compiled_countries, headline)
    return matched_countries[0] if matched_countries else None


def get_matched_city(headline: str, compiled_regex) -> Optional[str]:
    """Extract the matched city name string from a headline string by compiled regex pattern of all cities."""
    # compiled_cities = get_big_compiled_regex(city.cityname)
    matched_cities = re.findall(compiled_regex, headline)
    return matched_cities[0] if matched_cities else None


def get_matched_usstate(headline: str) -> Optional[str]:
    """Extract the matched US state name string from a headline string by compiled regex pattern of all US states."""
    compiled_states = get_big_compiled_regex(states.statename)
    matched_states = re.findall(compiled_states, headline)
    return matched_states[0] if matched_states else None


def get_matched_uscounty(headline: str) -> Optional[str]:
    """Extract the matched US county name string from a headline string by compiled regex pattern of all US counties."""
    compiled_counties = get_big_compiled_regex(us_counties.countyname)
    matched_counties = re.findall(compiled_counties, headline)
    return matched_counties[0] if matched_counties else None

    # """Extract the matched city string from a regex match object."""
    # compiled_city = get_city_regex(city_name)
    # match_city = compiled_city.search(headline)
    # if match_city is not None:
    #     start, end = match_city.start(), match_city.end()
    #     return headline[start:end]

### Cities extraction


In [None]:
compiled_cities = get_big_compiled_regex(city.city)
df["city"] = df.headline.apply(
    func=lambda x: get_matched_city(headline=x, compiled_regex=compiled_cities)
)


print(df.shape)


df.head(10)

> 📝 **Note**:
> The `cities` extraction uses a regex to match city names in the headlines. The pattern tries the longest names first, so full city names are captured instead of partial matches.
>
> - Of the total 648 headlines,
>   - 606 headlines (93.5%) have city names extracted.
>   - 42 headlines (6.5%) do not have any city names extracted.    





In [None]:
df[df.city.notnull()]

In [None]:
city[city.city.str.contains("Vero Beach")]

In [None]:
print(len(df[df.city.isnull()]))
df[df.city.isnull()]

In [None]:
city[city.cityid.duplicated(keep=False)]

### Add country names column to the DataFrame by merging with `city` dataframe 

In [None]:
# The city name column in `city` is called "city", so merge on that shared key
merged = df.merge(
    right=city[
        [
            "city",
            "countryname",
            "population",
            "cityid",
            # "latitude", "longitude"
        ]
    ],
    how="left",
    on="city",
    suffixes=("_headline", "_city"),
)

merged.population = merged.population.astype(
    dtype="Int64"
)  # Use Int64 to allow for NaN values in the population column
merged

In [None]:
merged[merged.cityid.duplicated(keep=False)].sort_values(
    by=["headline", "population", "cityid"]
)

In [None]:
merged[merged.duplicated(subset=["headlineid"], keep=False)].sort_values(
    by=["headline", "population"]
)

> 📝 **Note**: We found 176 headlines that matched multiple cities sharing the same name. For example, there are three different cities named **"Rome"**, one in Italy and two in the US, so the headline **"Authorities are Worried about the Spread of Mad Cow Disease in Rome"** appears three times after the merge.
> These duplicates need to be resolved so that each headline maps to a single city.
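
One possible way to resolve such ambiguity, offered here as an assumption rather than the notebook's chosen method, is to keep the most populous candidate for each headline. The sketch uses plain Python; the `headlineid` value and the population figures are made up for illustration and are not real data.

```python
# Candidate matches for one headline; populations are illustrative, not real data
candidates = [
    {"headlineid": "042", "city": "Rome", "countrycode": "IT", "population": 2_000_000},
    {"headlineid": "042", "city": "Rome", "countrycode": "US", "population": 36_000},
    {"headlineid": "042", "city": "Rome", "countrycode": "US", "population": 32_000},
]

# Track the highest-population candidate seen so far for each headline
best_by_headline = {}
for row in candidates:
    current = best_by_headline.get(row["headlineid"])
    if current is None or row["population"] > current["population"]:
        best_by_headline[row["headlineid"]] = row

for headline_id, row in best_by_headline.items():
    print(headline_id, row["city"], row["countrycode"])
```

In pandas, the same idea could be expressed by sorting `merged` on `population` in descending order and then dropping duplicates on `headlineid`. The heuristic can still pick the wrong place when a headline refers to a small town that shares its name with a large city.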

### US counties extraction

In [None]:
df["us_county"] = df.headline.apply(func=lambda x: get_matched_uscounty(x))

print(df.shape)
df.head(10)

### US States extraction

In [None]:
# df["us_state"] = df.headline.apply(func=lambda x: get_matched_usstate(x))
# print(df.shape)
# df.head(10)


### Countries extraction

In [None]:
df["country"] = df["headline"].apply(func=lambda x: get_matched_country(x))
print(df.shape)
df.head(10)

In [None]:
df[df.country.notnull()]

In [None]:
df.merge(
    right=city[["countryname", "city", "cityid"]],
    how="left",
    on="city",
    suffixes=("_headline", "_city"),
)

In [None]:
df

In [None]:
for ct_idx, ct in enumerate(countries.countryname):
    for st_idx, st in enumerate(df.headline):
        if ct in st:
            print(f"Headline: {st}\t\t\tCountry: {ct}")
            break

In [None]:
suffix = [
    "County",
    "Municipality",
    "Municipio",
    "Census Area",
    "City and Borough",
    "Borough",
    "Parish",
    "City",
]

for s in suffix:
    print(f"\nSuffix: {s}")
    for headline in df.headline:
        if s in headline:
            print(f"\t{headline}")

In [None]:
# Legacy approach: compile one regex per name and search every headline with it.
city_regexs = []

# City names are matched without word boundaries in this version
for city_name in cities.city:
    city_regexs.append(re.escape(city_name))
for state_name in states.statename:
    city_regexs.append("\\b" + re.escape(state_name) + "\\b")
for county_name in us_counties.county:
    city_regexs.append("\\b" + re.escape(county_name) + "\\b")

# Drop short names that cause false positives inside longer words
excluded = {"Mala", "Bron", "Viru", "Pati", "Rota", "Will", "Green", "\\bWill\\b"}
city_regexs = [regex for regex in city_regexs if regex not in excluded]

# Use separate names for the result lists so the `city` dataframe is not overwritten
ind = []
matched_names = []
hline = []

for regex in city_regexs:
    compiled_city = re.compile(regex)

    for index, headline in enumerate(df.headline):
        match = compiled_city.search(headline)
        if match is not None:
            matched_string = match.group()
            # Ignore matches of three characters or fewer
            if len(matched_string) > 3:
                ind.append(index)
                matched_names.append(matched_string)
                hline.append(headline)

# Create dataframe of matched results and sort values by headline_no
matched = {"headline_no": ind, "headline": hline, "city": matched_names}
matched_cities = pd.DataFrame(matched)
matched_cities = matched_cities.sort_values(by="headline_no").reset_index(drop=True)

matched_cities.head(n=15)

In [None]:
cities[cities.city == "York"]

**Note: The matched results show some interesting patterns.**

**For example**
* The headline "Could Zika Reach New York City?" produced **three** matches.
* The first match, "New York City", is correct.
* The other two are both "York", a partial match that appears twice.
* The `cities` dataframe explains the repetition: it contains two different places named "York", one in the US and one in GB.
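
The clean-up this observation calls for can be sketched in plain Python: collect every match, drop exact duplicates, then keep only the longest string for each headline. The match list is hand-written to mirror the "New York City" example; it is not output from the notebook.

```python
from collections import defaultdict

# (headline, matched name) pairs, hand-written to mirror the example above
matches = [
    ("Could Zika Reach New York City?", "New York City"),
    ("Could Zika Reach New York City?", "York"),
    ("Could Zika Reach New York City?", "York"),
]

by_headline = defaultdict(set)
for headline, name in matches:
    by_headline[headline].add(name)  # the set removes exact duplicates

# Keep the longest remaining name per headline
longest = {headline: max(names, key=len) for headline, names in by_headline.items()}
print(longest)
```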

In [None]:
# How many headlines had more than one matched result?
print("*********" * 10)
print("**** Checked duplicates ****")
print("*********" * 10)
print("\n")
multi_counts = matched_cities.headline.value_counts()
print(
    f"{(multi_counts > 1).sum()}/{len(df)} headlines had more than one matched result."
)

# Drop duplicates of matched city names
matched_cities_uniq = matched_cities.drop_duplicates()

print("\n")
print(
    "Note: Some headlines still had more than one matched result. However, among those unique matched city names, only the longest string is the correct match."
)
print("\n")

print("*********" * 10)
print("**** Dropped duplicates ****")
print("*********" * 10)
print("\n")
uniq_counts = matched_cities_uniq.headline.value_counts()
print(
    f"{(uniq_counts > 1).sum()}/{len(df)} headlines had more than one matched result "
    "after dropping duplicates."
)
print("\n")
print(
    matched_cities_uniq.headline.value_counts()[
        matched_cities_uniq.headline.value_counts() > 1
    ].head(10)
)
print("\n")


# Make a list of headlines that had matched result more than 1.
redun_headlines = matched_cities_uniq.headline.value_counts()[
    matched_cities_uniq.headline.value_counts() > 1
].index.tolist()
# Keep only the longest matched cities
for hl in redun_headlines:
    cities_to_compare = matched_cities_uniq.city[matched_cities_uniq.headline == hl]
    length_str = [
        (len(city), index)
        for index, city in list(zip(cities_to_compare.index, cities_to_compare))
    ]
    for length, index in length_str:
        if (length, index) != max(length_str):
            matched_cities_uniq.drop(index, axis=0, inplace=True)

print("\n")
print("*********" * 10)
print("**** Filtered only the longest matched cities ****")
print("*********" * 10)
print("\n")
final_counts = matched_cities_uniq.headline.value_counts()
print(
    f"{(final_counts > 1).sum()}/{len(df)} headlines had more than one matched result "
    "after keeping only the longest."
)
matched_cities_uniq = matched_cities_uniq.reset_index(drop=False).set_index(
    "headline_no", drop=True
)
display(matched_cities_uniq.head())

In [None]:
# Add city column to df dataframe (skipped when the column already exists)
if "city" not in df.columns:
    df = df.join(matched_cities_uniq[["city"]])

# Manually fill the nulls in city column, keyed by the exact headline text.
# Assigning every value inside one loop would leave only the last one in place.
manual_cities = {
    "Zika infects pregnant woman in Cebu": "Cebu",
    "Spanish Flu Sighted in Antigua": "Antigua",
    "Carnival under threat in Rio De Janeiro due to Zika outbreak": "Rio De Janeiro",
    "Zika case reported in Oton": "Oton",
    "Maka City Experiences Influenza Outbreak": "Maka",
    "More Zika patients reported in Mcallen": "Mcallen",
    "More people in Mclean are infected with Hepatitis A every year": "Mclean",
    "Malaria Exposure in Sussex": "Sussex",
    "Greenwich Establishes Zika Task Force": "Greenwich",
    "Yulee takes a hit from Spreading Sickness": "Yulee",
    "More people in Boucau are infected with HIV every year": "Boucau",
    "Bronchitis Outbreak in Manhasset": "Manhasset",
    "Zika Troubles come to Padre Las Casas": "Padre Las Casas",
    "Outbreak of Zika in Destin": "Destin",
    "Gympie Patient in Critical Condition after Contracting Chlamydia": "Gympie",
    "Spike of Meningitis Cases in Druid Hills": "Druid Hills",
    "More Patients in Magnolia are Getting Diagnosed with Malaria": "Magnolia",
    "Rumors about Syphilis spreading in Penal have been refuted": "Penal",
    "Spanish Flu Outbreak in Lisbon": "Lisbon",
    "Spanish Flu Spreading through Madrid": "Madrid",
    "Fort Belvoir tests new cure for Hepatitis C": "Belvoir",
    "More people in Oak Brook are infected with Respiratory Syncytial Virus every year": "Oak Brook",
    "Outbreak of Zika in Hutchins": "Hutchins",
    "Longwood volunteers spreading Zika awareness": "Longwood",
    "Zika symptoms spotted in Quixere": "Quixere",
    "Measles Hits Davos": "Davos",
    "Spike of Hepatitis E Cases in Morehead City": "Morehead City",
    "Outbreak of Zika in Alvorada": "Alvorada",
    "Zika arrives in Dangriga": "Dangriga",
    "More Patients in Maynard are Getting Diagnosed with Syphilis": "Maynard",
    "Zika case reported in Antioquia": "Antioquia",
    "Chikungunya has not Left Pismo Beach": "Pismo Beach",
    "Zika spreads to La Joya": "La Joya",
}

# Fill only the rows where no city was extracted
missing = df.city.isnull()
df.loc[missing, "city"] = df.loc[missing, "headline"].map(manual_cities)
df.info()

In [None]:
df.head(50)

In [None]:
# Save to csv file
# output_file_path = r"..\data\cities_in_headline"
# df.to_csv(output_file_path, index=None)