# Cleaning Covid data for Mapbox: regions

**Background**: We use Covid-19 cases data in the Philippines from data from the health department and shapefile processed through geopandas to create an interactive map. 

**Tools**: pandas, geopandas, Mapbox

# Do your imports

In [1]:
import pandas as pd
import numpy as np
import geopandas as gpd
import re
from shapely.geometry import Point, LineString
pd.set_option('display.max_columns', None)

# Read your CSV

In [2]:
df= pd.read_csv('regions.csv')
df

Unnamed: 0.1,Unnamed: 0,RegionRes
0,NCR,1174255
1,Region IV-A: CALABARZON,656289
2,Region III: Central Luzon,362209
3,Region VI: Western Visayas,194989
4,Region VII: Central Visayas,193299
5,Region II: Cagayan Valley,162378
6,Region XI: Davao Region,140681
7,Region I: Ilocos Region,133758
8,CAR,119192
9,Region X: Northern Mindanao,106189


## Lowercase column headers

In [3]:
df.columns = df.columns.str.lower()
df

Unnamed: 0,unnamed: 0,regionres
0,NCR,1174255
1,Region IV-A: CALABARZON,656289
2,Region III: Central Luzon,362209
3,Region VI: Western Visayas,194989
4,Region VII: Central Visayas,193299
5,Region II: Cagayan Valley,162378
6,Region XI: Davao Region,140681
7,Region I: Ilocos Region,133758
8,CAR,119192
9,Region X: Northern Mindanao,106189


In [4]:
df= df.rename(columns={"unnamed: 0": "regions"})
df= df.rename(columns={"regionres": "covid_cases"})
df.head()

Unnamed: 0,regions,covid_cases
0,NCR,1174255
1,Region IV-A: CALABARZON,656289
2,Region III: Central Luzon,362209
3,Region VI: Western Visayas,194989
4,Region VII: Central Visayas,193299
5,Region II: Cagayan Valley,162378
6,Region XI: Davao Region,140681
7,Region I: Ilocos Region,133758
8,CAR,119192
9,Region X: Northern Mindanao,106189


# Geopandas

## Read through file

In [5]:
region_shape = gpd.read_file('regions-ph.zip')
region_shape.head()

Unnamed: 0,Shape_Leng,Shape_Area,ADM1_EN,ADM1_PCODE,ADM1_REF,ADM1ALT1EN,ADM1ALT2EN,ADM0_EN,ADM0_PCODE,date,validOn,validTo,geometry
0,53.623497,1.050272,Autonomous Region in Muslim Mindanao,PH150000000,,ARMM,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4..."
1,8.027454,1.546712,Cordillera Administrative Region,PH140000000,,CAR,,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((121.22208 18.50058, 121.22086 18.483..."
2,2.320234,0.050216,National Capital Region,PH130000000,,NCR,,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((121.03842 14.78525, 121.03876 14.785..."
3,14.995101,1.043983,Region I,PH010000000,,Ilocos Region,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((119.86596 15.81539, 119.86597 ..."
4,19.139048,2.241812,Region II,PH020000000,,Cagayan Valley,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.46667 16.92135, 122.46674 ..."
5,15.949563,1.793513,Region III,PH030000000,,Central Luzon,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ..."
6,27.624115,1.32671,Region IV-A,PH040000000,,Calabarzon,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.72165 13.36485, 122.72181 ..."
7,78.804542,2.220374,Region IV-B,PH170000000,,Mimaropa,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((117.31260 7.50671, 117.31249 7..."
8,23.181441,1.196677,Region IX,PH090000000,,Zamboanga Peninsula,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((121.88379 6.69138, 121.88380 6..."
9,44.923243,1.446324,Region V,PH050000000,,Bicol Region,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.98823 11.73079, 122.98824 ..."


## Clean regional names in the dataset

This is so they would match the names in the shapefile for merging later.

In [6]:
df.regions = df.regions.str.replace("NCR", "National Capital Region", regex=False)
df.regions = df.regions.str.replace("BARMM", "Autonomous Region in Muslim Mindanao", regex=False)
df.regions = df.regions.str.replace("CAR", "Cordillera Administrative Region", regex=False)
df.regions = df.regions.str.replace("CARAGA", "Region XIII", regex=False)
df.regions = df.regions.str.replace(r'[:].*$', "", regex=True)
df.regions

0                  National Capital Region
1                              Region IV-A
2                               Region III
3                                Region VI
4                               Region VII
5                                Region II
6                                Region XI
7                                 Region I
8         Cordillera Administrative Region
9                                 Region X
10                              Region XII
11                               Region IX
12                                Region V
13                             Region VIII
14     Cordillera Administrative RegionAGA
15                             Region IV-B
16                                     ROF
17    Autonomous Region in Muslim Mindanao
18                                     NaN
Name: regions, dtype: object

## Clean the shapefile data

### Lowercase headers

In [7]:
region_shape.columns = region_shape.columns.str.lower()

Unnamed: 0,shape_leng,shape_area,adm1_en,adm1_pcode,adm1_ref,adm1alt1en,adm1alt2en,adm0_en,adm0_pcode,date,validon,validto,geometry
0,53.623497,1.050272,Autonomous Region in Muslim Mindanao,PH150000000,,ARMM,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4..."
1,8.027454,1.546712,Cordillera Administrative Region,PH140000000,,CAR,,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((121.22208 18.50058, 121.22086 18.483..."
2,2.320234,0.050216,National Capital Region,PH130000000,,NCR,,Philippines (the),PH,2016-06-30,2020-05-29,,"POLYGON ((121.03842 14.78525, 121.03876 14.785..."
3,14.995101,1.043983,Region I,PH010000000,,Ilocos Region,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((119.86596 15.81539, 119.86597 ..."
4,19.139048,2.241812,Region II,PH020000000,,Cagayan Valley,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.46667 16.92135, 122.46674 ..."
5,15.949563,1.793513,Region III,PH030000000,,Central Luzon,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ..."
6,27.624115,1.32671,Region IV-A,PH040000000,,Calabarzon,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.72165 13.36485, 122.72181 ..."
7,78.804542,2.220374,Region IV-B,PH170000000,,Mimaropa,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((117.31260 7.50671, 117.31249 7..."
8,23.181441,1.196677,Region IX,PH090000000,,Zamboanga Peninsula,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((121.88379 6.69138, 121.88380 6..."
9,44.923243,1.446324,Region V,PH050000000,,Bicol Region,,Philippines (the),PH,2016-06-30,2020-05-29,,"MULTIPOLYGON (((122.98823 11.73079, 122.98824 ..."


### Drop unnecessary columns from shapefile

In [8]:
region_shape = region_shape.drop(['adm1_pcode', 'adm1_ref','adm1alt1en', 'adm1alt2en', 'adm0_en', 'shape_leng', 'shape_area', 'adm0_pcode', 'date', 'validon', 'validto'], axis=1)
region_shape

Unnamed: 0,adm1_en,geometry
0,Autonomous Region in Muslim Mindanao,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4..."
1,Cordillera Administrative Region,"POLYGON ((121.22208 18.50058, 121.22086 18.483..."
2,National Capital Region,"POLYGON ((121.03842 14.78525, 121.03876 14.785..."
3,Region I,"MULTIPOLYGON (((119.86596 15.81539, 119.86597 ..."
4,Region II,"MULTIPOLYGON (((122.46667 16.92135, 122.46674 ..."
5,Region III,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ..."
6,Region IV-A,"MULTIPOLYGON (((122.72165 13.36485, 122.72181 ..."
7,Region IV-B,"MULTIPOLYGON (((117.31260 7.50671, 117.31249 7..."
8,Region IX,"MULTIPOLYGON (((121.88379 6.69138, 121.88380 6..."
9,Region V,"MULTIPOLYGON (((122.98823 11.73079, 122.98824 ..."


In [9]:
region_shape= region_shape.rename(columns={"adm1_en": "regions"})

## Merge data 

In [10]:
regions_cases = region_shape.merge(df, on='regions')
regions_cases

Unnamed: 0,regions,geometry,covid_cases
0,Autonomous Region in Muslim Mindanao,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4...",25876
1,Cordillera Administrative Region,"POLYGON ((121.22208 18.50058, 121.22086 18.483...",119192
2,National Capital Region,"POLYGON ((121.03842 14.78525, 121.03876 14.785...",1174255
3,Region I,"MULTIPOLYGON (((119.86596 15.81539, 119.86597 ...",133758
4,Region II,"MULTIPOLYGON (((122.46667 16.92135, 122.46674 ...",162378
5,Region III,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ...",362209
6,Region IV-A,"MULTIPOLYGON (((122.72165 13.36485, 122.72181 ...",656289
7,Region IV-B,"MULTIPOLYGON (((117.31260 7.50671, 117.31249 7...",44238
8,Region IX,"MULTIPOLYGON (((121.88379 6.69138, 121.88380 6...",66097
9,Region V,"MULTIPOLYGON (((122.98823 11.73079, 122.98824 ...",65786


In [13]:
regions_cases.covid_cases.sort_values(ascending=False)

2     1174255
6      656289
5      362209
10     194989
11     193299
4      162378
14     140681
3      133758
1      119192
13     106189
15      75975
8       66097
9       65786
12      64444
7       44238
0       25876
Name: covid_cases, dtype: int64

## Create bins for cases

The bins will allow us to categorize the number of cases, necessary for mapping later.

In [15]:
regions_cases['percentiles'] = pd.cut(np.array(regions_cases['covid_cases']),
   10, labels=["0-100000", "100000-200000", "200000-300000", "300000-400000", "400000-500000", "500000-600000", "600000-700000", "700000-800000", "800000-900000", "900000-1200000"])

regions_cases

Unnamed: 0,regions,geometry,covid_cases,percentiles
0,Autonomous Region in Muslim Mindanao,"MULTIPOLYGON (((119.46876 4.59360, 119.46881 4...",25876,0-100000
1,Cordillera Administrative Region,"POLYGON ((121.22208 18.50058, 121.22086 18.483...",119192,0-100000
2,National Capital Region,"POLYGON ((121.03842 14.78525, 121.03876 14.785...",1174255,900000-1200000
3,Region I,"MULTIPOLYGON (((119.86596 15.81539, 119.86597 ...",133758,0-100000
4,Region II,"MULTIPOLYGON (((122.46667 16.92135, 122.46674 ...",162378,100000-200000
5,Region III,"MULTIPOLYGON (((120.11687 14.76309, 120.11689 ...",362209,200000-300000
6,Region IV-A,"MULTIPOLYGON (((122.72165 13.36485, 122.72181 ...",656289,500000-600000
7,Region IV-B,"MULTIPOLYGON (((117.31260 7.50671, 117.31249 7...",44238,0-100000
8,Region IX,"MULTIPOLYGON (((121.88379 6.69138, 121.88380 6...",66097,0-100000
9,Region V,"MULTIPOLYGON (((122.98823 11.73079, 122.98824 ...",65786,0-100000


In [16]:
regions_cases.dtypes

regions          object
geometry       geometry
covid_cases       int64
percentiles    category
dtype: object

**Additional step**: Convert the contents of the percentiles into string. Not doing so will not be read by the GEOJSON file.

In [17]:
regions_cases.percentiles = regions_cases.percentiles.astype(str)
regions_cases.dtypes

regions          object
geometry       geometry
covid_cases       int64
percentiles      object
dtype: object

# Save as GEOJSON file

In [18]:
regions_cases.to_file('regions_cases.geojson', driver='GeoJSON')

  pd.Int64Index,


# Simplified file

So we were successful in combining geometry files with our dataset, but the file is too big. We, therefore, use [mapshaper](https://mapshaper.org/) to simplify the precision of the map so that we have a smaller map size.

Below is the simplified json file. 

In [None]:
simplified_regions = gpd.read_file('simplified_regions.json')
simplified_regions

## Convert to GEOJSON

In [None]:
simplified_regions.to_file('simplified_regions.geojson', driver='GeoJSON')

In [None]:
simplified_regions