# SRAG (SARS) data exploration from Brazil's OpenDataSUS

This notebook explores the SRAG data published by Brazil's OpenDataSUS. The data is available at [the OpenDataSUS website](https://opendatasus.saude.gov.br/dataset/bd-srag-2020).

The data is split into the 2020 data and the data from earlier years. The first step is to download the data and load it into `pandas` dataframes.

To download the data, we use the bash script `GET_DATA.sh`, which downloads the project's data files and verifies their SHA256 sums. Its source is printed with `pygmentize`.
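The checksum verification that the script performs can also be sketched in Python, which may be convenient when re-checking a file from inside the notebook. This is a minimal sketch, not the script's own logic: the function names `sha256_of_file` and `verify_file` are hypothetical, and the expected digest is assumed to come from wherever the reference checksums are kept.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large CSV files are never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True if the file exists and its digest matches expected_hex."""
    path = Path(path)
    if not path.is_file():
        return False
    # Normalise the reference digest so case and stray whitespace do not matter.
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `verify_file('data/2020.csv', expected)` returns `False` both when the file is missing and when the digest differs, so a single check covers the two cases the script reports on.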

In [5]:
!pygmentize GET_DATA.sh

#!/bin/bash

BLUE="\e[1;34m"
WHITE="\e[1;37m"
RED="\e[1;31m"
GREEN="\e[1;32m"

MESSAGE="${BLUE} -> "
ERROR="${RED}ERROR: ${WHITE}"
SUCCESS="${GREEN}"

# This script gets all the data from the SRAG databases from Brazil's OpenDataSUS. The script verifies if the files exist and verifies the SHA256 sums.
echo
echo -e "${MESSAGE}Creating data directory..."

In [4]:
!bash GET_DATA.sh

 -> Creating data directory...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23031  100 23031    0     0  20975      0  0:00:01  0:00:01 --:--:-- 20994
 -> Downloading the 2014 data...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5749k    0 5749k    0     0  1465k      0 --:--:--  0:00:03 --:--:-- 1465k

 -> Checking sha256 sum for 2014 file...

Successfully downloaded files for year 2014!

 -> Downloading the 2015 data...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4414k    0 4414k    0     0  1465k      0 --:--:--  0:00:03 --:--:-- 1465k

After downloading the data, the next step is to load it into `pandas` dataframes.

In [46]:
import pandas as pd
%matplotlib inline

In [54]:
df_2020 = pd.read_csv('data/2020.csv', encoding='latin2', sep=';', low_memory=False)

In [53]:
df_2019 = pd.read_csv('data/2019.csv', encoding='latin2', sep=';', low_memory=False)
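Since both yearly files are read with the same options, the loading step could be wrapped in a small helper. The sketch below is one way to do that, assuming the files follow the `data/<year>.csv` layout seen above; the names `load_srag_year` and `load_srag_years` are hypothetical and not part of the project.

```python
from pathlib import Path

import pandas as pd


def load_srag_year(year, data_dir="data", encoding="latin2"):
    """Load one yearly SRAG CSV with the same read_csv options used in this notebook."""
    path = Path(data_dir) / f"{year}.csv"
    return pd.read_csv(path, encoding=encoding, sep=";", low_memory=False)


def load_srag_years(years, data_dir="data", encoding="latin2"):
    """Return a dict mapping each year to its dataframe, skipping years with no file."""
    frames = {}
    for year in years:
        if (Path(data_dir) / f"{year}.csv").is_file():
            frames[year] = load_srag_year(year, data_dir, encoding)
    return frames
```

With this in place, `frames = load_srag_years([2019, 2020])` would give `frames[2019]` and `frames[2020]` in one call. The `encoding` default mirrors the cells in this notebook; if accented Portuguese characters look wrong after loading, passing `encoding='latin1'` is a reasonable experiment, though that is an assumption about the files rather than something documented here.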

In [68]:
df_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326193 entries, 0 to 326192
Columns: 134 entries, DT_NOTIFIC to PAC_DSCBO
dtypes: float64(76), int64(10), object(48)
memory usage: 333.5+ MB


['DT_NOTIFIC',
 'SEM_NOT',
 'DT_SIN_PRI',
 'SEM_PRI',
 'SG_UF_NOT',
 'ID_REGIONA',
 'CO_REGIONA',
 'ID_MUNICIP',
 'CO_MUN_NOT',
 'ID_UNIDADE',
 'CO_UNI_NOT',
 'CS_SEXO',
 'DT_NASC',
 'NU_IDADE_N',
 'TP_IDADE',
 'COD_IDADE',
 'CS_GESTANT',
 'CS_RACA',
 'CS_ETINIA',
 'CS_ESCOL_N',
 'ID_PAIS',
 'CO_PAIS',
 'SG_UF',
 'ID_RG_RESI',
 'CO_RG_RESI',
 'ID_MN_RESI',
 'CO_MUN_RES',
 'CS_ZONA',
 'SURTO_SG',
 'NOSOCOMIAL',
 'AVE_SUINO',
 'FEBRE',
 'TOSSE',
 'GARGANTA',
 'DISPNEIA',
 'DESC_RESP',
 'SATURACAO',
 'DIARREIA',
 'VOMITO',
 'OUTRO_SIN',
 'OUTRO_DES',
 'PUERPERA',
 'CARDIOPATI',
 'HEMATOLOGI',
 'SIND_DOWN',
 'HEPATICA',
 'ASMA',
 'DIABETES',
 'NEUROLOGIC',
 'PNEUMOPATI',
 'IMUNODEPRE',
 'RENAL',
 'OBESIDADE',
 'OBES_IMC',
 'OUT_MORBI',
 'MORB_DESC',
 'VACINA',
 'DT_UT_DOSE',
 'MAE_VAC',
 'DT_VAC_MAE',
 'M_AMAMENTA',
 'DT_DOSEUNI',
 'DT_1_DOSE',
 'DT_2_DOSE',
 'ANTIVIRAL',
 'TP_ANTIVIR',
 'OUT_ANTIV',
 'DT_ANTIVIR',
 'HOSPITAL',
 'DT_INTERNA',
 'SG_UF_INTE',
 'ID_RG_INTE',
 'CO_RG_INTE',
 '
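Several of the columns listed above start with `DT_` (for example `DT_NOTIFIC`, `DT_SIN_PRI` and `DT_NASC`) and look like date fields, which `read_csv` would leave as plain strings among the 48 `object` columns. A minimal sketch for converting them is shown here. It assumes, without confirmation from the data dictionary, that these columns hold dates in `dd/mm/yyyy` format; `parse_date_columns` is a hypothetical helper name.

```python
import pandas as pd


def parse_date_columns(df, prefix="DT_", date_format="%d/%m/%Y"):
    """Return a copy of df with every column whose name starts with prefix parsed as dates.

    Values that do not match date_format become NaT instead of raising an error.
    The dd/mm/yyyy default is an assumption about the SRAG files.
    """
    out = df.copy()
    for col in out.columns:
        if str(col).startswith(prefix):
            out[col] = pd.to_datetime(out[col], format=date_format, errors="coerce")
    return out
```

For example, `df_2020_dates = parse_date_columns(df_2020)` leaves `df_2020` untouched and returns a new dataframe. Comparing `df_2020_dates['DT_NOTIFIC'].isna().sum()` with the same count on the original column is a quick way to see how many values failed to parse under the assumed format.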

In [None]:
import boto3
from sagemaker import get_execution_role