# Cleaning Covid data for Mapbox: provinces

**Background**: We combine Covid-19 case counts for the Philippines from the health department with a provinces shapefile, processed through geopandas, to create an interactive map. 

**Tools**: pandas, geopandas, Mapbox

# Do your imports

In [1]:
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, LineString
pd.set_option('display.max_columns', None)

# Read your CSV

In [2]:
df= pd.read_csv('provinces-cases.csv')
df

Unnamed: 0.1,Unnamed: 0,ProvRes,CaseCode
0,0,Abra,5933
1,1,Agusan Del Norte,22310
2,2,Agusan Del Sur,16476
3,3,Aklan,16217
4,4,Albay,13858
...,...,...,...
82,82,Tawi-Tawi,747
83,83,Zambales,20864
84,84,Zamboanga Del Norte,12615
85,85,Zamboanga Del Sur,46107


# Cleaning the data

## Drop unnecessary columns

In [3]:
df= df.drop('Unnamed: 0', axis=1)
df

Unnamed: 0,ProvRes,CaseCode
0,Abra,5933
1,Agusan Del Norte,22310
2,Agusan Del Sur,16476
3,Aklan,16217
4,Albay,13858
...,...,...
82,Tawi-Tawi,747
83,Zambales,20864
84,Zamboanga Del Norte,12615
85,Zamboanga Del Sur,46107
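
An `Unnamed: 0` column usually appears when a CSV was saved with its index and then read back without telling pandas about it. As a minimal sketch, assuming the first column of the file is just such a saved index, `index_col=0` avoids the extra column at read time. The `csv_text` below is a small stand-in for `provinces-cases.csv`, not the real file.

```python
import io

import pandas as pd

# Stand-in for provinces-cases.csv (assumption: the first, unnamed column is a saved index)
csv_text = ",ProvRes,CaseCode\n0,Abra,5933\n1,Aklan,16217\n"

# Default read: the saved index shows up as a column named "Unnamed: 0"
with_stray = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index so there is nothing to drop afterwards
without_stray = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_stray.columns))
print(list(without_stray.columns))
```

Dropping the column after reading, as done above, gives the same columns; this only saves a step.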


## Lowercase column headers

In [4]:
df.columns = df.columns.str.lower()
df.head(35)

Unnamed: 0,provres,casecode
0,Abra,5933
1,Agusan Del Norte,22310
2,Agusan Del Sur,16476
3,Aklan,16217
4,Albay,13858
5,Antique,9442
6,Apayao,9335
7,Aurora,4291
8,Basilan,1765
9,Bataan,41695


## Clean provincial names

This makes the names match those in the shapefile so the two can be merged later.

In [5]:
# Replace labels that differ from the shapefile's names
df.provres = df.provres.str.replace("City Of Isabela (Not A Province)", "City of Isabela", regex=False)
df.provres = df.provres.str.replace("Cotabato (North Cotabato)", "Cotabato", regex=False)
df.provres = df.provres.str.replace("Ncr", "NCR", regex=False)
df.provres = df.provres.str.replace("Samar (Western Samar)", "Samar", regex=False)
df.provres = df.provres.str.replace("Cotabato City (Not A Province)", "Cotabato City", regex=False)
# Lowercase "Del" and "De" (e.g. "Agusan Del Norte" -> "Agusan del Norte").
# These are substring replacements, so they affect any name containing "Del" or "De".
df.provres = df.provres.str.replace("Del", "del", regex=False)
df.provres = df.provres.str.replace("De", "de", regex=False)
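
Because the replacements above work on substrings, a name that merely contains "De" would also be changed. The following is one possible alternative, not the approach used in this notebook: a sketch that renames whole values through a dictionary and lowercases "Del" and "De" only when they stand alone as words. The `names` Series and the `renames` dictionary are illustrative; whole-value matching is an assumption about how the labels appear in the data.

```python
import pandas as pd

# A few labels in the style of the cases data
names = pd.Series([
    "Agusan Del Norte",
    "Davao De Oro",
    "Cotabato (North Cotabato)",
    "Ncr",
])

# Whole-value renames: only cells that equal a key exactly are replaced
renames = {
    "Cotabato (North Cotabato)": "Cotabato",
    "Ncr": "NCR",
}
cleaned = names.replace(renames)

# Lowercase "Del" and "De" only when they are separate words (\b marks a word boundary)
cleaned = cleaned.str.replace(
    r"\b(Del|De)\b", lambda m: m.group(1).lower(), regex=True
)

print(cleaned.tolist())
```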

## Rename columns

Again, this is to match the shapefile, whose province-name column is `adm2_en`. We also rename the column containing the Covid-19 case tally.

In [6]:
df = df.rename(columns={"provres": "adm2_en", "casecode": "covid_cases"})
df

Unnamed: 0,adm2_en,covid_cases
0,Abra,5933
1,Agusan del Norte,22310
2,Agusan del Sur,16476
3,Aklan,16217
4,Albay,13858
...,...,...
82,Tawi-Tawi,747
83,Zambales,20864
84,Zamboanga del Norte,12615
85,Zamboanga del Sur,46107


# Geopandas

## Read the shapefile

In [7]:
provinces = gpd.read_file('ph-provinces.zip')
provinces

Unnamed: 0,Shape_Leng,Shape_Area,ADM2_EN,ADM2_PCODE,ADM2_REF,ADM2ALT1EN,ADM2ALT2EN,ADM1_EN,ADM1_PCODE,ADM0_EN,ADM0_PCODE,date,validOn,validTo,geometry
0,2.640967,0.334223,Abra,PH140100000,,,,Cordillera Administrative Region,PH140000000,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((120.96109 17.95348, 120.97201 17.946..."
1,3.674955,0.220065,Agusan del Norte,PH160200000,,,,Region XIII,PH160000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((125.58886 9.45793, 125.59687 9..."
2,5.222636,0.693968,Agusan del Sur,PH160300000,,,,Region XIII,PH160000000,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((125.88961 8.98195, 125.88896 8.96446..."
3,4.626091,0.139664,Aklan,PH060400000,,,,Region VI,PH060000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.43980 11.59717, 122.43979 ..."
4,6.507665,0.205939,Albay,PH050500000,,,,Region V,PH050000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((124.20992 13.16871, 124.20993 ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,15.971439,0.094186,Tawi-Tawi,PH157000000,,,,Autonomous Region in Muslim Mindanao,PH150000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4..."
83,5.329770,0.313705,Zambales,PH037100000,,,,Region III,PH030000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ..."
84,8.170921,0.515482,Zamboanga del Norte,PH097200000,,,,Region IX,PH090000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.09474 7.53104, 122.09482 7..."
85,11.811347,0.439807,Zamboanga del Sur,PH097300000,,,,Region IX,PH090000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.05710 6.87274, 122.05724 6..."


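## Merge data

First, lowercase the shapefile's column headers so that `adm2_en` matches the column name in the cases data.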

In [8]:
provinces.columns = provinces.columns.str.lower()
provinces

Unnamed: 0,shape_leng,shape_area,adm2_en,adm2_pcode,adm2_ref,adm2alt1en,adm2alt2en,adm1_en,adm1_pcode,adm0_en,adm0_pcode,date,validon,validto,geometry
0,2.640967,0.334223,Abra,PH140100000,,,,Cordillera Administrative Region,PH140000000,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((120.96109 17.95348, 120.97201 17.946..."
1,3.674955,0.220065,Agusan del Norte,PH160200000,,,,Region XIII,PH160000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((125.58886 9.45793, 125.59687 9..."
2,5.222636,0.693968,Agusan del Sur,PH160300000,,,,Region XIII,PH160000000,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((125.88961 8.98195, 125.88896 8.96446..."
3,4.626091,0.139664,Aklan,PH060400000,,,,Region VI,PH060000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.43980 11.59717, 122.43979 ..."
4,6.507665,0.205939,Albay,PH050500000,,,,Region V,PH050000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((124.20992 13.16871, 124.20993 ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,15.971439,0.094186,Tawi-Tawi,PH157000000,,,,Autonomous Region in Muslim Mindanao,PH150000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4..."
83,5.329770,0.313705,Zambales,PH037100000,,,,Region III,PH030000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ..."
84,8.170921,0.515482,Zamboanga del Norte,PH097200000,,,,Region IX,PH090000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.09474 7.53104, 122.09482 7..."
85,11.811347,0.439807,Zamboanga del Sur,PH097300000,,,,Region IX,PH090000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.05710 6.87274, 122.05724 6..."


## Rename 'Compostela Valley' to 'Davao de Oro'

The province was renamed by law, so we replace the older name in the shapefile with the current one.

In [9]:
provinces.adm2_en = provinces.adm2_en.str.replace("Compostela Valley", "Davao de Oro", regex=False)

In [10]:
provinces_cases = provinces.merge(df, on='adm2_en')
provinces_cases

Unnamed: 0,shape_leng,shape_area,adm2_en,adm2_pcode,adm2_ref,adm2alt1en,adm2alt2en,adm1_en,adm1_pcode,adm0_en,adm0_pcode,date,validon,validto,geometry,covid_cases
0,2.640967,0.334223,Abra,PH140100000,,,,Cordillera Administrative Region,PH140000000,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((120.96109 17.95348, 120.97201 17.946...",5933
1,3.674955,0.220065,Agusan del Norte,PH160200000,,,,Region XIII,PH160000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((125.58886 9.45793, 125.59687 9...",22310
2,5.222636,0.693968,Agusan del Sur,PH160300000,,,,Region XIII,PH160000000,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((125.88961 8.98195, 125.88896 8.96446...",16476
3,4.626091,0.139664,Aklan,PH060400000,,,,Region VI,PH060000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.43980 11.59717, 122.43979 ...",16217
4,6.507665,0.205939,Albay,PH050500000,,,,Region V,PH050000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((124.20992 13.16871, 124.20993 ...",13858
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,15.971439,0.094186,Tawi-Tawi,PH157000000,,,,Autonomous Region in Muslim Mindanao,PH150000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4...",747
83,5.329770,0.313705,Zambales,PH037100000,,,,Region III,PH030000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ...",20864
84,8.170921,0.515482,Zamboanga del Norte,PH097200000,,,,Region IX,PH090000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.09474 7.53104, 122.09482 7...",12615
85,11.811347,0.439807,Zamboanga del Sur,PH097300000,,,,Region IX,PH090000000,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.05710 6.87274, 122.05724 6...",46107
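
`merge` performs an inner join by default, so any province whose name differs between the two tables is dropped from the result. A minimal sketch of one way to spot such mismatches is an outer merge with `indicator=True`, shown here on two small stand-in tables (`shapes` and `cases` are hypothetical names, and the mismatch is planted on purpose).

```python
import pandas as pd

# Stand-ins for the shapefile attributes and the cases table
shapes = pd.DataFrame({"adm2_en": ["Abra", "Agusan del Norte", "Aklan"]})
cases = pd.DataFrame({
    "adm2_en": ["Abra", "Agusan Del Norte", "Aklan"],  # "Del" left uncleaned on purpose
    "covid_cases": [5933, 22310, 16217],
})

# An outer merge keeps every row; the _merge column says which side each row came from
check = shapes.merge(cases, on="adm2_en", how="outer", indicator=True)

# Rows not found in both tables are the names that still need cleaning
unmatched = check.loc[check["_merge"] != "both", ["adm2_en", "_merge"]]
print(unmatched)
```

An empty `unmatched` table would mean every name lines up and the inner merge loses nothing.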


## Drop unnecessary columns

In [11]:
provinces_cases = provinces_cases.drop(['adm2_ref', 'adm2_pcode','adm2alt1en', 'adm2alt2en', 'adm0_en', 'shape_leng', 'shape_area', 'adm0_pcode', 'adm1_pcode', 'date', 'validon', 'validto'], axis=1)
provinces_cases

Unnamed: 0,adm2_en,adm1_en,geometry,covid_cases
0,Abra,Cordillera Administrative Region,"POLYGON ((120.96109 17.95348, 120.97201 17.946...",5933
1,Agusan del Norte,Region XIII,"MULTIPOLYGON (((125.58886 9.45793, 125.59687 9...",22310
2,Agusan del Sur,Region XIII,"POLYGON ((125.88961 8.98195, 125.88896 8.96446...",16476
3,Aklan,Region VI,"MULTIPOLYGON (((122.43980 11.59717, 122.43979 ...",16217
4,Albay,Region V,"MULTIPOLYGON (((124.20992 13.16871, 124.20993 ...",13858
...,...,...,...,...
82,Tawi-Tawi,Autonomous Region in Muslim Mindanao,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4...",747
83,Zambales,Region III,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ...",20864
84,Zamboanga del Norte,Region IX,"MULTIPOLYGON (((122.09474 7.53104, 122.09482 7...",12615
85,Zamboanga del Sur,Region IX,"MULTIPOLYGON (((122.05710 6.87274, 122.05724 6...",46107


In [12]:
provinces_cases= provinces_cases.rename(columns={"adm2_en": "province"})

## Read and merge with population data

In [13]:
df2 = pd.read_excel('population.xlsx', sheet_name="province")
df2

Unnamed: 0,province,population
0,"NCR, City of Manila, First District",1846513
1,"NCR, Second District",4771371
2,"NCR, Third District",3004627
3,"NCR, Fourth District",3861951
4,Abra,250985
...,...,...
82,Agusan del Sur,739367
83,Dinagat Islands,128117
84,Surigao del Norte,534636
85,Surigao del Sur,642255


In [14]:
provinces_final = provinces_cases.merge(df2, on='province')
provinces_final

Unnamed: 0,province,adm1_en,geometry,covid_cases,population
0,Abra,Cordillera Administrative Region,"POLYGON ((120.96109 17.95348, 120.97201 17.946...",5933,250985
1,Agusan del Norte,Region XIII,"MULTIPOLYGON (((125.58886 9.45793, 125.59687 9...",22310,760413
2,Agusan del Sur,Region XIII,"POLYGON ((125.88961 8.98195, 125.88896 8.96446...",16476,739367
3,Aklan,Region VI,"MULTIPOLYGON (((122.43980 11.59717, 122.43979 ...",16217,615475
4,Albay,Region V,"MULTIPOLYGON (((124.20992 13.16871, 124.20993 ...",13858,1374768
...,...,...,...,...,...
82,Tawi-Tawi,Autonomous Region in Muslim Mindanao,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4...",747,440276
83,Zambales,Region III,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ...",20864,909832
84,Zamboanga del Norte,Region IX,"MULTIPOLYGON (((122.09474 7.53104, 122.09482 7...",12615,1047455
85,Zamboanga del Sur,Region IX,"MULTIPOLYGON (((122.05710 6.87274, 122.05724 6...",46107,2027902
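
The population sheet lists NCR as four districts, while the cleaned cases data uses a single `NCR` label, and an inner merge keeps only names that appear in both tables. As a sketch under the assumption that a single NCR row is wanted, the district rows can be collapsed before merging. The `population` frame below copies the first rows shown in the output above; treating every label that starts with "NCR" as one unit is an assumption.

```python
import pandas as pd

# First rows of the population sheet, as shown in the output above
population = pd.DataFrame({
    "province": [
        "NCR, City of Manila, First District",
        "NCR, Second District",
        "NCR, Third District",
        "NCR, Fourth District",
        "Abra",
    ],
    "population": [1846513, 4771371, 3004627, 3861951, 250985],
})

# Relabel every NCR district as plain "NCR" (assumption), leaving other names untouched
is_ncr = population["province"].str.startswith("NCR")
population["province"] = population["province"].where(~is_ncr, "NCR")

# Sum the districts into one row per label
population = population.groupby("province", as_index=False)["population"].sum()
print(population)
```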


## Compute cases per 100,000 people

We divide the number of Covid-19 cases by the total population of each province and then multiply by 100,000. That gives us cases per 100,000 people in each province.

In [28]:
provinces_final['case_per_pop'] = provinces_final.covid_cases / provinces_final.population * 100000
provinces_final = provinces_final.round(1)
provinces_final

Unnamed: 0,province,adm1_en,geometry,covid_cases,population,case_per_pop
0,Abra,Cordillera Administrative Region,"POLYGON ((120.96109 17.95348, 120.97201 17.946...",5933,250985,2363.9
1,Agusan del Norte,Region XIII,"MULTIPOLYGON (((125.58886 9.45793, 125.59687 9...",22310,760413,2933.9
2,Agusan del Sur,Region XIII,"POLYGON ((125.88961 8.98195, 125.88896 8.96446...",16476,739367,2228.4
3,Aklan,Region VI,"MULTIPOLYGON (((122.43980 11.59717, 122.43979 ...",16217,615475,2634.9
4,Albay,Region V,"MULTIPOLYGON (((124.20992 13.16871, 124.20993 ...",13858,1374768,1008.0
...,...,...,...,...,...,...
82,Tawi-Tawi,Autonomous Region in Muslim Mindanao,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4...",747,440276,169.7
83,Zambales,Region III,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ...",20864,909832,2293.2
84,Zamboanga del Norte,Region IX,"MULTIPOLYGON (((122.09474 7.53104, 122.09482 7...",12615,1047455,1204.3
85,Zamboanga del Sur,Region IX,"MULTIPOLYGON (((122.05710 6.87274, 122.05724 6...",46107,2027902,2273.6
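
As a quick check of the formula, here is the same computation on two provinces taken from the table above. This sketch rounds only the new column rather than the whole frame; the `rates` frame is a stand-in for `provinces_final` without the geometry.

```python
import pandas as pd

# Case counts and populations copied from the output above
rates = pd.DataFrame({
    "province": ["Abra", "Tawi-Tawi"],
    "covid_cases": [5933, 747],
    "population": [250985, 440276],
})

# cases / population * 100,000, rounded to one decimal place
rates["case_per_pop"] = (
    rates["covid_cases"] / rates["population"] * 100_000
).round(1)

print(rates)
```

The results, 2363.9 for Abra and 169.7 for Tawi-Tawi, agree with the values in the table.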


## Create bins for cases

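The bins let us categorize the case rates, which is necessary for mapping later. Note that passing an integer (here, 10) to `pd.cut` splits the observed range of `case_per_pop` into that many equal-width bins, so the labels are only approximate: in the output, Albay's 1008.0 falls under `0-1000`. The column is named `percentiles`, but it holds equal-width ranges, not percentiles.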

In [31]:
provinces_final['percentiles'] = pd.cut(np.array(provinces_final['case_per_pop']),
       10, labels=["0-1000", "1001-2000", "2001-3000", "3001-4000", "4001-5000", "5001-6000", "6001-7000", "7001-8000", "8001-9000", "9001-10000"])
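
If the labels should match the bin boundaries exactly, `pd.cut` also accepts explicit edges. The following is a sketch of that alternative, not what this notebook runs. It assumes all rates fall between 0 and 10,000; values outside the edges would become missing. The `sample` values are three rates from the table above.

```python
import pandas as pd

# Three case rates from the table above (Tawi-Tawi, Albay, Abra)
sample = pd.Series([169.7, 1008.0, 2363.9])

# Explicit edges every 1,000 cases per 100,000 people (assumption: nothing exceeds 10,000)
edges = list(range(0, 11000, 1000))
labels = ["0-1000", "1001-2000", "2001-3000", "3001-4000", "4001-5000",
          "5001-6000", "6001-7000", "7001-8000", "8001-9000", "9001-10000"]

# Each bin includes its right edge; include_lowest=True also admits an exact 0
binned = pd.cut(sample, bins=edges, labels=labels, include_lowest=True)
print(binned.astype(str).tolist())
```

With explicit edges, a rate of 1008.0 is labeled `1001-2000`.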


In [32]:
provinces_final

Unnamed: 0,province,adm1_en,geometry,covid_cases,population,case_per_pop,percentiles
0,Abra,Cordillera Administrative Region,"POLYGON ((120.96109 17.95348, 120.97201 17.946...",5933,250985,2363.9,2001-3000
1,Agusan del Norte,Region XIII,"MULTIPOLYGON (((125.58886 9.45793, 125.59687 9...",22310,760413,2933.9,2001-3000
2,Agusan del Sur,Region XIII,"POLYGON ((125.88961 8.98195, 125.88896 8.96446...",16476,739367,2228.4,2001-3000
3,Aklan,Region VI,"MULTIPOLYGON (((122.43980 11.59717, 122.43979 ...",16217,615475,2634.9,2001-3000
4,Albay,Region V,"MULTIPOLYGON (((124.20992 13.16871, 124.20993 ...",13858,1374768,1008.0,0-1000
...,...,...,...,...,...,...,...
82,Tawi-Tawi,Autonomous Region in Muslim Mindanao,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4...",747,440276,169.7,0-1000
83,Zambales,Region III,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ...",20864,909832,2293.2,2001-3000
84,Zamboanga del Norte,Region IX,"MULTIPOLYGON (((122.09474 7.53104, 122.09482 7...",12615,1047455,1204.3,1001-2000
85,Zamboanga del Sur,Region IX,"MULTIPOLYGON (((122.05710 6.87274, 122.05724 6...",46107,2027902,2273.6,2001-3000


In [33]:
provinces_final.dtypes

province          object
adm1_en           object
geometry        geometry
covid_cases        int64
population         int64
case_per_pop     float64
percentiles     category
dtype: object

**Additional step**: Convert the contents of the `percentiles` column to strings. Otherwise, the categorical values will not be written to the GeoJSON file.

In [34]:
provinces_final.percentiles = provinces_final.percentiles.astype(str)

In [35]:
provinces_final

Unnamed: 0,province,adm1_en,geometry,covid_cases,population,case_per_pop,percentiles
0,Abra,Cordillera Administrative Region,"POLYGON ((120.96109 17.95348, 120.97201 17.946...",5933,250985,2363.9,2001-3000
1,Agusan del Norte,Region XIII,"MULTIPOLYGON (((125.58886 9.45793, 125.59687 9...",22310,760413,2933.9,2001-3000
2,Agusan del Sur,Region XIII,"POLYGON ((125.88961 8.98195, 125.88896 8.96446...",16476,739367,2228.4,2001-3000
3,Aklan,Region VI,"MULTIPOLYGON (((122.43980 11.59717, 122.43979 ...",16217,615475,2634.9,2001-3000
4,Albay,Region V,"MULTIPOLYGON (((124.20992 13.16871, 124.20993 ...",13858,1374768,1008.0,0-1000
...,...,...,...,...,...,...,...
82,Tawi-Tawi,Autonomous Region in Muslim Mindanao,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4...",747,440276,169.7,0-1000
83,Zambales,Region III,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ...",20864,909832,2293.2,2001-3000
84,Zamboanga del Norte,Region IX,"MULTIPOLYGON (((122.09474 7.53104, 122.09482 7...",12615,1047455,1204.3,1001-2000
85,Zamboanga del Sur,Region IX,"MULTIPOLYGON (((122.05710 6.87274, 122.05724 6...",46107,2027902,2273.6,2001-3000


# Save as GEOJSON file

In [36]:
provinces_final.to_file('provinces_cases.geojson', driver='GeoJSON')


# Simplified file

We successfully combined the geometry with our dataset, but the resulting file is too big. We therefore use [mapshaper](https://mapshaper.org/) to simplify the geometry so that the map file is smaller.

Below is the simplified JSON file. 

In [21]:
simplified = gpd.read_file('simplified_provinces.json')
simplified

## Convert to GEOJSON

In [22]:
# simplified.to_file('simplified_provinces.geojson', driver='GeoJSON')

In [23]:
# simplified