<a href="https://colab.research.google.com/github/marianeneiva/zikaVirus/blob/main/downloadZIKA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing ZIKA disease (SINAN Datasus)

In this notebook, we'll delve into the Information System for Notifiable Diseases (SINAN) of DATASUS.

SINAN is fed primarily by the notification and investigation of cases of diseases and health conditions on the national list of compulsorily notifiable diseases (Consolidation Ordinance No. 4, of September 28, 2017, Annex). States and municipalities may also include other health issues that matter in their region, such as diphyllobothriasis in the municipality of São Paulo.

Used effectively, the system supports a dynamic diagnosis of how an event occurs in the population. It provides evidence for causal explanations of notifiable health conditions and indicates the risks to which people are exposed, which helps describe the epidemiological reality of a specific geographical area. Its systematic, decentralized use also contributes to the democratization of information, since all health professionals can access the data and make it available to the community. It is therefore a relevant tool for health planning, for defining intervention priorities, and for evaluating the impact of interventions.

The data dictionary (in English) can be found here: https://docs.google.com/spreadsheets/d/e/2PACX-1vR8bNIx4IXRycpKcxqaPE1BGU9yPlwW_TuCn4lfZYE1fkM34gE8yRKosrcxZHV7jcZHS1kP11VHvg15/pub?gid=1507587706&single=true&output=pdf

## Setting Up

Before we begin, we need to install the pysus library, which makes it easy to download SINAN data:

In [22]:
#installing necessary libraries
!pip install pysus



## Fetching the Data

With pysus installed, we can fetch the SINAN data for our chosen disease (Zika) and year (2016):

In [23]:
#importing pysus
from pysus.online_data.SINAN import download
from pysus.online_data import parquets_to_dataframe as to_df
import pandas as pd

zika = to_df(download('Zika',2016))
print(type(zika))


<class 'pandas.core.frame.DataFrame'>


## Exploring the Data

Let's take a quick look at our data:

In [24]:
len(zika) #number of instances

281464

In [25]:
zika.columns #columns

Index(['TP_NOT', 'ID_AGRAVO', 'CS_SUSPEIT', 'DT_NOTIFIC', 'SEM_NOT', 'NU_ANO',
       'SG_UF_NOT', 'ID_MUNICIP', 'ID_REGIONA', 'DT_SIN_PRI', 'SEM_PRI',
       'NU_IDADE_N', 'CS_SEXO', 'CS_GESTANT', 'CS_RACA', 'CS_ESCOL_N', 'SG_UF',
       'ID_MN_RESI', 'ID_RG_RESI', 'ID_PAIS', 'NDUPLIC_N', 'IN_VINCULA',
       'DT_INVEST', 'ID_OCUPA_N', 'CLASSI_FIN', 'CRITERIO', 'TPAUTOCTO',
       'COUFINF', 'COPAISINF', 'COMUNINF', 'DOENCA_TRA', 'EVOLUCAO',
       'DT_OBITO', 'DT_ENCERRA', 'CS_FLXRET', 'FLXRECEBI', 'TP_SISTEMA',
       'TPUNINOT'],
      dtype='object')

In [26]:
zika.dtypes

TP_NOT        string
ID_AGRAVO     string
CS_SUSPEIT    string
DT_NOTIFIC    object
SEM_NOT       string
NU_ANO        string
SG_UF_NOT     string
ID_MUNICIP    string
ID_REGIONA    string
DT_SIN_PRI    object
SEM_PRI       string
NU_IDADE_N    string
CS_SEXO       string
CS_GESTANT    string
CS_RACA       string
CS_ESCOL_N    string
SG_UF         string
ID_MN_RESI    string
ID_RG_RESI    string
ID_PAIS       string
NDUPLIC_N     string
IN_VINCULA    string
DT_INVEST     string
ID_OCUPA_N    string
CLASSI_FIN    string
CRITERIO      string
TPAUTOCTO     string
COUFINF       string
COPAISINF     string
COMUNINF      string
DOENCA_TRA    string
EVOLUCAO      string
DT_OBITO      string
DT_ENCERRA    string
CS_FLXRET     string
FLXRECEBI     string
TP_SISTEMA    string
TPUNINOT      string
dtype: object

In [27]:
zika.head() #first 5 rows

Unnamed: 0,TP_NOT,ID_AGRAVO,CS_SUSPEIT,DT_NOTIFIC,SEM_NOT,NU_ANO,SG_UF_NOT,ID_MUNICIP,ID_REGIONA,DT_SIN_PRI,...,COPAISINF,COMUNINF,DOENCA_TRA,EVOLUCAO,DT_OBITO,DT_ENCERRA,CS_FLXRET,FLXRECEBI,TP_SISTEMA,TPUNINOT
0,2,A928,,2016-02-14,201607,2016,33,330490,,2016-02-12,...,0,,9.0,1,,20160414,0,2,1,
1,2,A928,,2016-02-14,201607,2016,33,330187,,2016-02-10,...,0,,,1,,20160418,0,2,1,
2,2,A928,,2016-02-14,201607,2016,33,330187,,2016-02-10,...,0,,,1,,20160418,0,2,1,
3,2,A928,,2016-02-14,201607,2016,33,330187,,2016-02-11,...,0,,,1,,20160418,0,2,1,
4,2,A928,,2016-02-14,201607,2016,33,330455,,2016-02-14,...,0,,2.0,1,,20160214,0,2,1,


In [28]:
#get information on columns and data types
print(zika.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 281464 entries, 0 to 281463
Data columns (total 38 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   TP_NOT      281464 non-null  string
 1   ID_AGRAVO   281464 non-null  string
 2   CS_SUSPEIT  281464 non-null  string
 3   DT_NOTIFIC  281464 non-null  object
 4   SEM_NOT     281464 non-null  string
 5   NU_ANO      281464 non-null  string
 6   SG_UF_NOT   281464 non-null  string
 7   ID_MUNICIP  281464 non-null  string
 8   ID_REGIONA  281464 non-null  string
 9   DT_SIN_PRI  281464 non-null  object
 10  SEM_PRI     281464 non-null  string
 11  NU_IDADE_N  281464 non-null  string
 12  CS_SEXO     281464 non-null  string
 13  CS_GESTANT  281464 non-null  string
 14  CS_RACA     281464 non-null  string
 15  CS_ESCOL_N  281464 non-null  string
 16  SG_UF       281464 non-null  string
 17  ID_MN_RESI  281464 non-null  string
 18  ID_RG_RESI  281464 non-null  string
 19  ID_PAIS     281464 non-

## Data Cleaning

Real-world data is often messy. In this dataset, missing values are stored as empty strings (''), which is why `zika.info()` reports the columns as fully non-null.

Let's replace the empty strings ('') with NaN so that pandas treats them as missing:

In [29]:
import numpy as np
zika = zika.replace('',np.nan)
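Once the empty strings are NaN, pandas can report how incomplete each column is. The following is a minimal sketch on a small made-up frame (the rows and the name `missing_share` are illustrative, not taken from SINAN); the same two lines could be applied to `zika`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `zika`; the values are invented for illustration.
toy = pd.DataFrame({
    "CS_SEXO": ["F", "M", "", "F"],
    "DT_OBITO": ["", "", "", "20160301"],
})

# Same cleaning step as in the notebook: empty strings become NaN.
toy = toy.replace("", np.nan)

# Share of missing values per column, from most to least incomplete.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting by share of missing values makes it easy to spot columns that are too sparse to analyze.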

## Diving into the Zika Cases

The DT_NOTIFIC column is of particular interest as it indicates the date of notification. Let's analyze it:

In [30]:
zika['DT_NOTIFIC'] = pd.to_datetime(zika['DT_NOTIFIC'])
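Other date columns may need more care. In the `head()` output, `DT_ENCERRA` appears as compact strings such as `20160414`, so the sketch below assumes a `YYYYMMDD` layout and uses `errors="coerce"` to turn unparseable entries into `NaT` instead of raising. The toy rows and the name `n_failed` are illustrative only:

```python
import pandas as pd

# Toy frame standing in for `zika`; the values are invented for illustration.
toy = pd.DataFrame({
    "DT_NOTIFIC": ["2016-02-14", "2016-02-15", None],
    "DT_ENCERRA": ["20160414", "20160418", "not a date"],
})

# ISO-style dates parse without an explicit format; missing values become NaT.
toy["DT_NOTIFIC"] = pd.to_datetime(toy["DT_NOTIFIC"], errors="coerce")

# Assumed layout for DT_ENCERRA: YYYYMMDD. Invalid strings become NaT.
toy["DT_ENCERRA"] = pd.to_datetime(
    toy["DT_ENCERRA"], format="%Y%m%d", errors="coerce"
)

# Number of closing dates that could not be parsed (or were missing).
n_failed = int(toy["DT_ENCERRA"].isna().sum())
print(toy.dtypes)
print("unparsed DT_ENCERRA values:", n_failed)
```

Checking the number of `NaT` values after parsing is a quick way to confirm that the assumed format matches the data.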

In [31]:
print(zika['DT_NOTIFIC'].unique()) #unique values

['2016-02-14T00:00:00.000000000' '2016-02-15T00:00:00.000000000'
 '2016-02-16T00:00:00.000000000' '2016-02-17T00:00:00.000000000'
 '2016-02-18T00:00:00.000000000' '2016-02-19T00:00:00.000000000'
 '2016-02-20T00:00:00.000000000' '2016-02-21T00:00:00.000000000'
 '2016-02-22T00:00:00.000000000' '2016-02-23T00:00:00.000000000'
 '2016-05-09T00:00:00.000000000' '2016-05-10T00:00:00.000000000'
 '2016-05-11T00:00:00.000000000' '2016-05-12T00:00:00.000000000'
 '2016-05-13T00:00:00.000000000' '2016-05-14T00:00:00.000000000'
 '2016-05-15T00:00:00.000000000' '2016-05-16T00:00:00.000000000'
 '2016-05-17T00:00:00.000000000' '2016-05-18T00:00:00.000000000'
 '2016-05-19T00:00:00.000000000' '2016-05-20T00:00:00.000000000'
 '2016-05-21T00:00:00.000000000' '2016-05-22T00:00:00.000000000'
 '2016-05-23T00:00:00.000000000' '2016-05-24T00:00:00.000000000'
 '2016-05-25T00:00:00.000000000' '2016-05-26T00:00:00.000000000'
 '2016-05-27T00:00:00.000000000' '2016-05-28T00:00:00.000000000'
 '2016-05-29T00:00:00.000

In [32]:
print(zika['DT_NOTIFIC'].nunique()) #number of unique values

366


In [33]:
print(zika['DT_NOTIFIC'].value_counts()) #frequency of each unique value

2016-02-22    4880
2016-02-29    4555
2016-02-23    4299
2016-03-07    4180
2016-02-24    4094
              ... 
2016-10-30      13
2016-10-02      12
2016-12-04      12
2016-09-25      11
2016-10-09       8
Name: DT_NOTIFIC, Length: 366, dtype: int64
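Daily counts are noisy and, by default, `value_counts()` orders them by frequency, not by date. One possible next step, sketched here on a handful of made-up dates (not the real counts), is to sort the counts chronologically and sum them into weeks ending on Sunday:

```python
import pandas as pd

# Toy notification dates standing in for zika['DT_NOTIFIC'].
dates = pd.to_datetime(
    ["2016-02-22", "2016-02-22", "2016-02-23", "2016-02-29", "2016-03-07"]
)
toy = pd.DataFrame({"DT_NOTIFIC": dates})

# Notifications per day, in chronological order.
daily = toy["DT_NOTIFIC"].value_counts().sort_index()

# Notifications per week; "W" bins end on Sunday and are labeled by that day.
weekly = daily.resample("W").sum()
print(weekly)
```

Weekly totals smooth out weekday effects, such as fewer notifications being recorded on weekends.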


## Visualizing the Data

A histogram of notification dates, colored by the notifying state (SG_UF_NOT), gives a clear picture of how Zika notifications were distributed over the year:

In [34]:
import plotly.express as px
fig = px.histogram(zika, x="DT_NOTIFIC",color='SG_UF_NOT',text_auto=True)
fig.show()
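With several hundred thousand rows, it can be lighter to aggregate before plotting. The sketch below is one way to do that under stated assumptions: it uses a made-up frame with state codes like the `'33'` seen in the `head()` output, and the column names `MONTH` and `N_CASES` are hypothetical. The resulting table could then be passed to a bar chart instead of the raw rows:

```python
import pandas as pd

# Toy frame standing in for `zika`; the values are invented for illustration.
toy = pd.DataFrame({
    "DT_NOTIFIC": pd.to_datetime(
        ["2016-02-14", "2016-02-20", "2016-03-01", "2016-03-02"]
    ),
    "SG_UF_NOT": ["33", "33", "35", "33"],
})

# Month of notification as a "YYYY-MM" label (hypothetical helper column).
toy["MONTH"] = toy["DT_NOTIFIC"].dt.to_period("M").astype(str)

# One row per (month, state) with the number of notifications.
counts = (
    toy.groupby(["MONTH", "SG_UF_NOT"])
       .size()
       .reset_index(name="N_CASES")
)
print(counts)
```

The aggregated table has one row per month and state, so its size no longer depends on the number of notifications.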