# Peru: Two Years of COVID-19
A very basic analysis of the impact of COVID-19 in Peru with Pandas, Geopandas and Matplotlib in Python

### How hard has COVID-19 struck a city / country / region?

One way to address this question is to take a look at the deaths, or more precisely, the excess mortality caused by the virus.

On this occasion, we want to run a very simple analysis of the effects of COVID-19 in each "departamento" (roughly equivalent to a U.S. state) of Peru.

To do that, we are going to create **choropleth maps** of the monthly deaths in each "departamento" since 2020-01, adjusted by population, and display these maps as a .gif.

Now, for the sake of simplicity, we are going to work with total rather than excess deaths, and assume that "DEPARTAMENTO DOMICILIO" is the place ("departamento") where the death took place. In reality, "DEPARTAMENTO DOMICILIO" records the last reported "departamento" of residence.

We are going to work with the national deaths database of Peru, SINADEF, and the GeoJSON data of this country, available on GitHub.

In [None]:
# Import dependencies
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from datetime import datetime
import urllib.parse
import subprocess
import glob

Read the dataset and store it in "df" `dataframe`.

In [28]:
df=pd.read_csv("fallecidos_sinadef.csv",engine="python",encoding='utf-8-sig',sep="|")

In [30]:
df.head()

Unnamed: 0,Nº,TIPO SEGURO,SEXO,EDAD,TIEMPO EDAD,ESTADO CIVIL,NIVEL DE INSTRUCCIÓN,ETNIA,COD# UBIGEO DOMICILIO,PAIS DOMICILIO,...,DEBIDO A (CAUSA B),CAUSA B (CIE-X),DEBIDO A (CAUSA C),CAUSA C (CIE-X),DEBIDO A (CAUSA D),CAUSA D (CIE-X),DEBIDO A (CAUSA E),CAUSA E (CIE-X),DEBIDO A (CAUSA F),CAUSA F (CIE-X)
0,1,IGNORADO,FEMENINO,64,AÑOS,SOLTERO,IGNORADO,MESTIZO,92-33-24-01-01-000,PERU,...,INFARTO RECIENTE Y ANTIGUO DE MIOCARDIO,,,,,,,,,
1,2,SIS,FEMENINO,15,MINUTOS,SOLTERO,SUPERIOR NO UNIV. COMP.,MESTIZO,92-33-12-08-06-000,PERU,...,DIFICULTAD RESPIRATORIA DEL RECIEN NACIDO,P229,INMATURIDAD EXTREMA,P072,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO
2,3,ESSALUD,MASCULINO,97,AÑOS,CASADO,PRIMARIA INCOMPLETA,MESTIZO,92-33-04-01-23-000,PERU,...,ENFERMEDAD RENAL,N189,ENFERMEDAD PULMONAR INTERSTICIAL DIFUSA,J849,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO
3,4,IGNORADO,MASCULINO,31,AÑOS,SOLTERO,IGNORADO,MESTIZO,92-33-07-06-01-000,PERU,...,EDEMA PULMONAR,J81X,EN INVESTIGACION,R99X,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO
4,5,IGNORADO,MASCULINO,59,AÑOS,SOLTERO,IGNORADO,MESTIZO,92-33-24-01-01-000,PERU,...,SHOCK HIPOVOLEMICO,SIN REGISTRO,SUCESO DE TRANSITO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO,SIN REGISTRO


In [29]:
print(df.shape)
print(df.columns)

(846732, 32)
Index(['Nº', 'TIPO SEGURO', 'SEXO', 'EDAD', 'TIEMPO EDAD', 'ESTADO CIVIL',
       'NIVEL DE INSTRUCCIÓN', 'ETNIA', 'COD# UBIGEO DOMICILIO',
       'PAIS DOMICILIO', 'DEPARTAMENTO DOMICILIO', 'PROVINCIA DOMICILIO',
       'DISTRITO DOMICILIO', 'FECHA', 'AÑO', 'MES', 'TIPO LUGAR',
       'INSTITUCION', 'MUERTE VIOLENTA', 'NECROPSIA', 'DEBIDO A (CAUSA A)',
       'CAUSA A (CIE-X)', 'DEBIDO A (CAUSA B)', 'CAUSA B (CIE-X)',
       'DEBIDO A (CAUSA C)', 'CAUSA C (CIE-X)', 'DEBIDO A (CAUSA D)',
       'CAUSA D (CIE-X)', 'DEBIDO A (CAUSA E)', 'CAUSA E (CIE-X)',
       'DEBIDO A (CAUSA F)', 'CAUSA F (CIE-X)'],
      dtype='object')


- We are interested in "FECHA" `string` and "DEPARTAMENTO DOMICILIO" `string`. In other words, the date of death and where it occurred.
- We also want to order this dataset by date, so we create a new column "DATE" `datetime` based on "FECHA" and apply the `sort_values` method.
- Inspecting "FECHA" `string` we realise that it goes back to 2017. Since the COVID-19 pandemic struck Peru in March 2020, we keep only the rows where "AÑO" `int` is 2020, 2021 or 2022.
- Since "DEPARTAMENTO DOMICILIO" `string` includes places outside Peru, we filter the dataset on "PAIS DOMICILIO" `string` == "PERU".
- Some rows have an empty "DEPARTAMENTO DOMICILIO" `string` (once surrounding whitespace is stripped), which stands for an unknown place, so we drop them with yet another filter.
- Now that our dataset is ordered by date, we create a new column "MES-AÑO" `string` from the year and month of "FECHA" `string`, as in 2020-01, 2020-02, and so on.
- We finally reduce our dataset to the columns "DEPARTAMENTO DOMICILIO" `string`, "MES-AÑO" `string` and "SEXO" `string`. We could have picked any other column instead of "SEXO" as long as it holds one entry per row, which most columns do. This is because we are going to group our dataset by "MES-AÑO" and count occurrences of the third column (deaths).

In [None]:
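# Parse "FECHA" into datetimes and sort the rows chronologically
df['DATE'] = pd.to_datetime(df['FECHA'], format='%Y-%m-%d')
df = df.sort_values(by=['DATE'], ascending=True)
# Keep rows with residence in Peru from 2020 onwards (copy avoids chained-assignment warnings)
df = df[(df["PAIS DOMICILIO"]=="PERU") & (df["AÑO"].isin([2020,2021,2022]))].copy()
# Strip stray whitespace so that blank "departamento" values can be detected
df["DEPARTAMENTO DOMICILIO"] = df["DEPARTAMENTO DOMICILIO"].str.strip()
# Year-month key: the first seven characters of "FECHA", e.g. "2020-01"
df["MES-AÑO"] = df["FECHA"].str[:7]
# Drop rows whose "departamento" is unknown and keep only the columns we need
df = df[df["DEPARTAMENTO DOMICILIO"]!=""]
df = df[['DEPARTAMENTO DOMICILIO',"MES-AÑO","SEXO"]]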
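
As a side note, the cleaning steps in the cell above can be tried out on a tiny frame before running them on the full dataset. The sketch below uses made-up rows (not SINADEF records) and assumes "FECHA" is an ISO `YYYY-MM-DD` string; the name `toy` is purely illustrative.

```python
import pandas as pd

# Toy rows standing in for the SINADEF data (values are made up).
toy = pd.DataFrame({
    "FECHA": ["2020-03-15", "2020-01-02", "2020-03-01"],
    "DEPARTAMENTO DOMICILIO": ["LIMA ", "CUSCO", ""],
})

# Parse the whole column at once and sort chronologically.
toy["DATE"] = pd.to_datetime(toy["FECHA"], format="%Y-%m-%d")
toy = toy.sort_values("DATE")

# Strip stray whitespace and drop rows with an unknown "departamento".
toy["DEPARTAMENTO DOMICILIO"] = toy["DEPARTAMENTO DOMICILIO"].str.strip()
toy = toy[toy["DEPARTAMENTO DOMICILIO"] != ""].copy()

# The first seven characters of an ISO date form the year-month key.
toy["MES-AÑO"] = toy["FECHA"].str[:7]
print(toy[["DEPARTAMENTO DOMICILIO", "MES-AÑO"]])
```

Only the two rows with a known "departamento" survive, each tagged with its year-month key.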

Now that we have a more concise dataset, we can use two nested `for` loops to obtain a list of dicts, where each dict represents a "DEPARTAMENTO DOMICILIO" and consists of "date": "deaths" key-value pairs.

In [None]:
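result=[]
max_deaths=0  # highest monthly count seen so far (renamed to avoid shadowing the built-in max)
for dep in df["DEPARTAMENTO DOMICILIO"].unique():
  data={}
  # Monthly row counts for this "departamento"
  sdf=df[df["DEPARTAMENTO DOMICILIO"]==dep]
  sdf=sdf.groupby('MES-AÑO').count().reset_index()
  for date in df['MES-AÑO'].unique():
    try:
      data[date]=sdf[sdf["MES-AÑO"]==date]["SEXO"].values[0]
    except IndexError:
      # No deaths recorded for this "departamento" in this month
      data[date]=0
    if data[date]>max_deaths:
      max_deaths=data[date]
  result.append(data)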
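
For comparison, the same "departamento" by month table can probably be built without explicit loops. The following is a minimal sketch on made-up data, not the notebook's method: it counts rows per pair with `groupby(...).size()` and pivots the months into columns with `unstack`. The names `toy` and `counts` are hypothetical.

```python
import pandas as pd

# Toy long-format data: one row per death (values are made up).
toy = pd.DataFrame({
    "DEPARTAMENTO DOMICILIO": ["LIMA", "LIMA", "CUSCO", "LIMA"],
    "MES-AÑO": ["2020-01", "2020-01", "2020-02", "2020-02"],
})

# Count rows per (departamento, month), then pivot months into columns.
# fill_value=0 plays the role of the IndexError branch in the loop above.
counts = (
    toy.groupby(["DEPARTAMENTO DOMICILIO", "MES-AÑO"])
       .size()
       .unstack(fill_value=0)
)
print(counts)
```

Because `groupby` sorts its keys by default, the resulting index is already in alphabetical order.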

Now that we have a list of dicts, we can easily convert it into a pandas dataframe object and store it in "final_df" `dataframe`, setting each "DEPARTAMENTO DOMICILIO" as an index. Then we sort this brand new dataframe by its index in alphabetical order so we can match it with the new dataset (GeoJSON) that we are going to import.

In [None]:
final_df=pd.DataFrame(result,index=df["DEPARTAMENTO DOMICILIO"].unique()) 
final_df.sort_index(inplace=True)

If we check this newly created dataset, we can see that it contains 25 rows, one per "DEPARTAMENTO DOMICILIO" (the place we attribute each death to), and 27 columns, one per "MES-AÑO" (the month of death).

In [27]:
final_df.shape

(25, 27)

- A proper comparison of deaths across "departamentos" should take into account how many people live in each one. So, in the end, we are going to measure deaths relative to population.
- To do that we need the population of each "departamento", which can be found [here](https://es.wikipedia.org/wiki/Anexo:Departamentos_del_Perú_por_población).
- This Wikipedia table can be read directly with the pandas `read_html` function, but first we need to percent-encode part of the URL because it contains accented characters.
- We finally extract the 2020 population estimates from that table ("pop_df" `dataframe`) and store them in "pop_array", a NumPy `array`.


In [None]:
base_url="https://es.wikipedia.org/wiki/"
query='Anexo:Departamentos_del_Perú_por_población'
query=urllib.parse.quote(query)
url=base_url+query
pop_df=pd.read_html(url)[0]
pop_array=pop_df[("Población","Estimado 2020")].apply(lambda x: int(x.replace("\xa0",""))).values
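
The two small string transformations used in the cell above can be checked in isolation with the standard library. This is only an illustrative sketch: the population figure is made up, and `encoded`, `raw_value` and `population` are hypothetical names.

```python
from urllib.parse import quote

# Percent-encode the non-ASCII page title so it is safe to put in a URL.
title = "Anexo:Departamentos_del_Perú_por_población"
encoded = quote(title)
print(encoded)

# The table uses a non-breaking space (\xa0) as thousands separator;
# the figure below is a made-up example, not a real population.
raw_value = "1\xa0234\xa0567"
population = int(raw_value.replace("\xa0", ""))
print(population)
```

Note that `quote` also encodes the colon by default, since only `/` is treated as safe.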

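Then, we divide each column in "final_df" `dataframe` by "pop_array" (a NumPy `array`) and overwrite "final_df"'s values. In other words, we go from "deaths" to "deaths adjusted by population". The division is positional, so it relies on the population table listing the "departamentos" in the same order as the sorted index of "final_df".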

In [None]:
for date in final_df.columns:
  final_df[date]=final_df[date]/pop_array
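
If the order of the population table cannot be guaranteed to match the index of "final_df", one option is to divide by a Series indexed by "departamento" name, so that pandas aligns the rows by label. The sketch below uses made-up numbers; `deaths`, `population`, `rate` and `per_100k` are hypothetical names, and building such a Series from the Wikipedia table would require its names to match the SINADEF ones, which is an assumption.

```python
import pandas as pd

# Toy monthly death counts, indexed by departamento (values are made up).
deaths = pd.DataFrame(
    {"2020-01": [10, 40], "2020-02": [20, 80]},
    index=["CUSCO", "LIMA"],
)

# Toy population figures, deliberately listed in a different order.
population = pd.Series({"LIMA": 1000, "CUSCO": 500})

# div(..., axis=0) matches rows by index label, not by position.
rate = deaths.div(population, axis=0)

# Optional: express the result per 100,000 inhabitants.
per_100k = rate * 100_000
print(per_100k)
```

Scaling to deaths per 100,000 inhabitants does not change the map, but it makes the legend easier to read.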

Now we proceed to import the GeoJSON data for every "departamento" or state of Peru and store it in "df_peru" `dataframe`.

In [None]:
df_peru = gpd.read_file('https://raw.githubusercontent.com/juaneladio/peru-geojson/master/peru_departamental_simple.geojson')

Then, we create a new column "coords" which is going to be used to label each "departamento" with its name on the map. 

In [None]:
df_peru['coords'] = df_peru['geometry'].apply(lambda x: x.representative_point().coords[:])
df_peru['coords'] = [coords[0] for coords in df_peru['coords']]
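
A brief illustration of why `representative_point()` is a reasonable choice for label placement: unlike the centroid, it is guaranteed to fall inside the shape. The sketch below uses Shapely (the geometry library underneath GeoPandas) and a made-up U-shaped polygon; `u_shape` and `label_point` are hypothetical names.

```python
from shapely.geometry import Polygon

# A U-shaped polygon (made up for illustration) whose centroid falls in the notch.
u_shape = Polygon([(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)])

centroid = u_shape.centroid
label_point = u_shape.representative_point()

# The centroid can land outside a concave shape; the representative point cannot.
print(u_shape.contains(centroid))
print(u_shape.contains(label_point))

# Same unpacking as in the cell above: a single (x, y) tuple.
x, y = label_point.coords[0]
```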

Having finished the data manipulation part, we can move on to plotting our choropleth maps.

- First, we set the vmin and vmax variables to the global minimum and maximum of the population-adjusted deaths.
> If you don’t set this beforehand, Matplotlib will change the range of the choropleth each time the for loop iterates, so it will be harder to see how values have increased or decreased over time.
- Then, we create a for loop that, for each date, adds the matching column of final_df `dataframe` to df_peru `dataframe` (GeoJSON data), plots the choropleth map and then drops that column again so that df_peru `dataframe` does not grow with every iteration.
- Also, in each iteration, we save the plotted map inside the newly created "img" directory, using zero-padded file names to keep the frames in order.

This whole process will create a total of 27 choropleth maps (one per date) inside the "img" directory.
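
The reason for the zero padding is that tools which order files by name compare them as text, not as numbers. A small standard-library sketch, with hypothetical frame numbers, makes the difference visible:

```python
# Hypothetical frame numbers; the notebook produces 27 of them.
frame_numbers = [1, 2, 9, 10, 27]

# Without padding, lexicographic order differs from numeric order.
unpadded = sorted(f"{n}.jpg" for n in frame_numbers)

# With three-digit padding, both orders agree.
padded = sorted(f"{n:03d}.jpg" for n in frame_numbers)

print(unpadded)
print(padded)
```

The `{n:03d}` format specifier produces the same names as the `if`/`else` branches in the plotting cell for any frame number below 100.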


In [55]:
os.makedirs("img", exist_ok=True)  # do not fail if "img" already exists
title="Deaths per 'departamento', adjusted by population"
# Fixed colour range shared by every map
vmin, vmax=0,final_df.max().max()
for i, date in enumerate(final_df.columns.values, start=1):
  # Temporarily attach this month's values to the GeoJSON dataframe
  df_peru[date]=final_df[date].values
  fig, ax = plt.subplots(1, figsize=(13, 15))

  df_peru.plot(column=date,cmap='cool',
  linewidth=1, ax=ax,edgecolor='1', vmin=vmin, vmax=vmax,legend=True,
  norm=plt.Normalize(vmin=vmin, vmax=vmax))
  ax.axis("off")
  ax.set_title(title,fontsize=20)
  # Label each "departamento" at its representative point
  for idx, row in df_peru.iterrows():
    ax.text(row.coords[0], row.coords[1], row["NOMBDEP"],
    horizontalalignment='center',
    bbox={'facecolor': 'white', 'alpha':0.8, 'pad': 2, 'edgecolor':'none'})
  ax.annotate(f"{date}", xy=(0.2, .3), xycoords='figure fraction',
            horizontalalignment='left', verticalalignment='bottom',
            fontsize=30)
  # Zero-padded file names (001.jpg, 002.jpg, ...) keep the frames in order
  filepath = f"img/{i:03d}.jpg"

  fig.savefig(filepath, dpi=200)
  plt.close(fig)
  # Remove the temporary column so df_peru does not grow on each iteration
  df_peru.drop(columns=date, inplace=True)
  print(f"{i} of {len(final_df.columns)} processed")

1 of 27 processed
2 of 27 processed
3 of 27 processed
4 of 27 processed
5 of 27 processed
6 of 27 processed
7 of 27 processed
8 of 27 processed
9 of 27 processed
10 of 27 processed
11 of 27 processed
12 of 27 processed
13 of 27 processed
14 of 27 processed
15 of 27 processed
16 of 27 processed
17 of 27 processed
18 of 27 processed
19 of 27 processed
20 of 27 processed
21 of 27 processed
22 of 27 processed
23 of 27 processed
24 of 27 processed
25 of 27 processed
26 of 27 processed
27 of 27 processed


Last but not least, we make a .gif out of those 27 images and, if preferred, erase them to save disk space.

Please note that this step requires ImageMagick, which can be installed with, for example, `brew install imagemagick` (macOS).

In [56]:
back=os.getcwd()
os.chdir("img")
subprocess.call([
  "convert", "-delay", "90", "-loop", "0", "*.jpg","output.gif"
])
for file_name in glob.glob("*.jpg"):
    os.remove(file_name)
os.chdir(back)
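
A possible variation on the cell above, shown only as a sketch: check that the ImageMagick executable is on the PATH before calling it, and pass `cwd=` to `subprocess.run` rather than changing the notebook's working directory. The helper names `build_gif_command` and `make_gif` are hypothetical, and the executable name may differ between installations (some ship it as `magick`), so treat `"convert"` as an assumption to verify locally.

```python
import shutil
import subprocess

def build_gif_command(delay=90, output="output.gif"):
    """Return the ImageMagick argument list used in the cell above."""
    return ["convert", "-delay", str(delay), "-loop", "0", "*.jpg", output]

def make_gif(folder="img"):
    """Run ImageMagick inside `folder`; return False if it is not installed."""
    command = build_gif_command()
    if shutil.which(command[0]) is None:
        return False
    # cwd= runs the command in `folder` without changing the notebook's own cwd.
    subprocess.run(command, cwd=folder, check=True)
    return True

print(build_gif_command())
```

Calling `make_gif()` after the plotting cell would then either write `img/output.gif` or return `False` if the tool is missing.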

Optionally, we can also create a .mp4 video instead of a .gif.

Please note that this step requires FFmpeg, which can be installed with, for example, `brew install ffmpeg` (macOS).

In [None]:
back=os.getcwd()
os.chdir("img")
subprocess.call([
    'ffmpeg', '-framerate', '24', '-i','%03d.jpg', '-r', '30',"-crf", "0", "-vcodec", "mpeg4", "-vf", "setpts=10*PTS",
    'video_name.mp4'
])
for file_name in glob.glob("*.jpg"):
    os.remove(file_name)
os.chdir(back)

<img src=img/output.gif>

### References:

- https://towardsdatascience.com/how-to-make-a-gif-map-using-python-geopandas-and-matplotlib-cd8827cefbc8
- https://stackoverflow.com/questions/38899190/geopandas-label-polygons