**Warning: Because of the size of the data, this notebook must be run in an environment with at least 25 GB of RAM to work correctly.**

#**Data Science and Visualization**
##**Final Project - Deliverable 02**
###Students:
###Gleyson Roberto do Nascimento. RA: 043801. Electrical Engineering.
###Negli René Gallardo Alvarado. RA: 234066. Health.
###Rafael Vinícius da Silveira. RA: 137382. Physics.
###Sérgio Sevileanu. RA: 941095. Electrical Engineering.



##This Google Colaboratory notebook preprocesses the [SIHSUS](https://bigdata-metadados.icict.fiocruz.br/dataset/sistema-de-informacoes-hospitalares-do-sus-sihsus/resource/ae85ac54-6734-43b8-a820-6129a854e1ff) big data for the states of São Paulo, Bahia, Paraná, Pará, and Goiás for the years 2008 to 2018.

##Some initial definitions and a disclaimer are needed for this project:

##A **correct diagnosis** is defined as one with a single CID10 (ICD-10) diagnosis that did not change during the period up to discharge;  
##A **mistaken diagnosis** is defined as one with more than one CID10 diagnosis where the codes belong to the same group, so the mistake is plausible given the similarity of symptoms among those codes;
##A **diagnostic failure** is defined as one with more than one CID10 diagnosis where the codes belong to different groups; although the codes may share similar symptoms, the professional would be expected to carry out a more thorough analysis before diagnosing.
##**Disclaimer**: Given the nature of the SIHSUS database, a big data set into which many employees of the Sistema Único de Saúde (SUS) enter data manually under very different circumstances and conditions, there is a serious possibility of systematic error, so the accuracy of this work should be treated with caution.
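
The three definitions above can be sketched as a small classification function. This is only an illustration under stated assumptions: "group" is interpreted here as an ICD-10 block, the table `ICD10_BLOCKS` lists just two blocks for demonstration, and the names `block_of` and `classify_diagnosis` are hypothetical and do not come from the notebook.

```python
# Illustrative subset of ICD-10 blocks as (first category, last category).
# Assumption: a real run would need the complete block table.
ICD10_BLOCKS = [("O30", "O48"), ("O60", "O75")]


def block_of(code):
    """Return the block containing a code, or the code's own category as a fallback."""
    category = code[:3]
    for start, end in ICD10_BLOCKS:
        # Categories are one letter plus two digits, so string comparison orders them.
        if start <= category <= end:
            return (start, end)
    return (category, category)


def classify_diagnosis(codes):
    """Classify the diagnosis codes recorded for one hospital stay."""
    # Ignore missing entries (None, NaN) and blank strings.
    distinct = {c.strip().upper() for c in codes if isinstance(c, str) and c.strip()}
    if not distinct:
        return "undetermined"
    if len(distinct) == 1:
        return "correct"
    groups = {block_of(c) for c in distinct}
    return "mistaken" if len(groups) == 1 else "failure"


print(classify_diagnosis(["O324"]))
print(classify_diagnosis(["O410", "O48"]))
print(classify_diagnosis(["O324", "O630"]))
```

Passing the codes as a plain list keeps the sketch independent of which SIHSUS columns hold the diagnoses.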

##Installing and Importing the Required Libraries.

In [None]:
!pip install dask[complete]
!pip install -U -q PyDrive

In [None]:
%matplotlib inline
%load_ext google.colab.data_table
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import dask
from dask import dataframe as dd
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from google.colab import files
from oauth2client.client import GoogleCredentials
pd.set_option('display.max_columns', None)
pd.options.display.precision = 2
pd.options.display.max_rows = 50
import seaborn as sns
import missingno as msno
import matplotlib as mpl
from matplotlib import rcParams
mpl.rc('figure', max_open_warning = 0)
from sklearn import preprocessing

##Authenticating with Google Drive, Downloading the Files, and Creating Dask DataFrames.

##The variable dictionary for the files is available on [GitHub](https://github.com/grnbatera/Data4health/blob/main/assets/Dicion%C3%A1rio%20de%20Vari%C3%A1veis.csv).



In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
ids=['19qRgwzO-i0Ejc17R0Q-jVIW5DBMRQe7a','1-0BXgVicBrqWbP8tm0ZPjQNHRwWUUdHK','1-43afQU0_aKKiA7XHmy_dMmZETKZO5h4','1-7dboJihisKLJvb5vwkQKO9AOpI_TSl3',
     '1-FGuzGbYdF5lfQR34MoRn7blSQa1LV0m','1-HEQ0GtwIMM2pZ8Wpst46aoiONPTs51_','1-OrgEde62ejoO-5i4uYw8123X4Uz1fyq','1-PCRB_I-Rhfn7XeDMmY7MYs6ac86dpRx',
     '1-S0JTZIecr9yIDKHwi3A5Wp5Kw8bXUFH','1-Stj6Meniaeu7vu76mwLuDoGio-GiddT','1-WWjpojvWH0Uw1AcUF3i92_OvA-FZbXB']
files = ['2008.csv','2009.csv','2010.csv','2011.csv','2012.csv','2013.csv','2014.csv','2015.csv','2016.csv','2017.csv','2018.csv']
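
Since the file names follow the year range 2008 to 2018, the list could also be generated instead of typed by hand. A minimal sketch; the name `csv_names` is hypothetical and is used here only to avoid reusing `files`:

```python
# Build one CSV name per year; range() excludes its upper bound, hence 2019.
csv_names = ['{0}.csv'.format(year) for year in range(2008, 2019)]
print(csv_names)
```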

In [None]:
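dflist=[]
for i in range(len(ids)):
  # Download each yearly CSV from Google Drive, then open it lazily with Dask.
  fileDownloaded = drive.CreateFile({'id':ids[i]})
  fileDownloaded.GetContentFile(files[i])
  # Explicit dtypes avoid mismatches between Dask's sampled type inference and later partitions.
  globals()['df{0}'.format(i)] = dd.read_csv(files[i],low_memory=False,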
        dtype={
                  'v0':'int64',
                  'v1':'int64',
                  'v2':'int64',
                  'v3':'int64',
                  'v4':'float64',
                  'v5':'int64',
                  'v6':'object',
                  'v7':'int64',
                  'v8':'int64',
                  'v9':'object',
                  'v10':'int64',
                  'v11':'int64',
                  'v12':'int64',
                  'v13':'int64',
                  'v14':'int64',
                  'v15':'object',
                  'v19':'int64',
                  'v20':'int64',
                  'v21':'int64',
                  'v22':'int64',
                  'v23':'object',
                  'v27':'int64',
                  'v28':'int64',
                  'v29':'int64',
                  'v30':'int64',
                  'v31':'int64',
                  'v32':'float64',
                  'v33':'float64',
                  'v43':'float64',
                  'v44':'float64',
                  'v45':'float64',
                  'v46':'int64',
                  'v47':'int64',
                  'v48':'object',
                  'v49':'object',
                  'v50':'int64',
                  'v51':'object',
                  'v52':'int64',
                  'v53':'object',
                  'v54':'float64',
                  'v55':'object',
                  'v56':'int64',
                  'v57':'object',
                  'v59':'int64',
                  'v60':'object',
                  'v61':'int64',
                  'v62':'int64',
                  'v63':'int64',
                  'v64':'object',
                  'v65':'int64',
                  'v66':'int64',
                  'v67':'int64',
                  'v68':'int64',
                  'v69':'object',
                  'v70':'int64',
                  'v72':'int64',
                  'v73':'int64',
                  'v76':'int64',
                  'v77':'object',
                  'v78':'int64',
                  'v79':'object',
                  'v80':'int64',
                  'v81':'object',
                  'v82':'object',
                  'v83':'int64',
                  'v84':'object',
                  'v85':'int64',
                  'v86':'object',
                  'v87':'int64',
                  'v88':'object',
                  'v89':'int64',
                  'v90':'int64',
                  'v91':'object',
                  'v92':'int64',
                  'v93':'int64',
                  'v94':'int64',
                  'v95':'object',
                  'v96':'float64',
                  'v97':'int64',
                  'v98':'int64',
                  'v99':'float64',
                  'v100':'int64',
                  'v101':'float64',
                  'v102':'float64',
                  'v103':'float64',
                  'v104':'object',
                  'v105':'object',
                  'v106':'int64',
                  'v107':'object',
                  'v108':'int64',
                  'v109':'object',
                  'v110':'float64',
                  'v111':'object',
                  'v112':'int64',
                  'v113':'object',
                  'v114':'int64',
                  'v115':'object',
                  'v116':'float64',
                  'v117':'int64',
                  'v118':'object',
                  'v119':'object',
                  'v120':'object',
                  'v121':'float64',
                  'v122':'float64',
                  'v123':'float64',
                  'v124':'float64',
                  'v125':'float64',
                  'v126':'float64',
                  'v127':'object',
                  'v128':'object',
                  'v129':'object',
                  'v130':'object',
                  'v131':'object',
                  'v132':'object',
                  'v133':'object',
                  'v134':'object',
                  'v135':'float64',
                  'v136':'float64',
                  'v137':'float64',
                  'v138':'object',
                  'v139':'float64',
                  'v140':'object',
                  'v141':'float64',
                  'v142':'object',
                  'v143':'float64',
                  'v144':'object',
                  'v145':'float64',
                  'v146':'object',
                  'v147':'float64',
                  'v148':'object',
                  'v149':'float64',
                  'v150':'object',
                  'v151':'float64',
                  'v152':'float64',
                  'v153':'float64',
                  'v154':'float64',
                  'v155':'int64',
                  'v156':'object',
                  'v157':'object',
                  'v158':'object',
                  'v159':'object',
                  'v160':'object',
                  'v161':'int64',
                  'v162':'int64',
                  'v163':'int64',
                  'v164':'int64',
                  'v165':'float64',
                  'v166':'float64',
                  'v167':'float64',
                  'v168':'float64',
                  'v169':'int64',
                  'v170':'int64',
                  'v171':'object',
                  'v172':'object',
                  'v173':'object',
                  'v174':'object',
                  'v175':'object',
                  'v176':'int64',
                  'v177':'int64',
                  'v178':'int64',
                  'v179':'int64',
                  'v180':'float64',
                  'v181':'float64',
                  'v182':'int64',
                  'v183':'float64',
                  'v184':'int64',
                  'v185':'object',
                  'v186':'int64',
                  'v187':'object',
                  'v188':'object',
                  'v189':'object',
                  'v192':'object',
                  'v193':'object',
                  'v194':'object',
                  'v195':'object',
                  'v196':'object',
                  'v197':'object',
                  'v198':'object',
                  'v199':'object',
                  'v200':'object',
                  'v201':'object',
                  'v202':'object',
                  'v203':'object',
                  'v204':'object',
                  'v205':'object',
                  'v206':'object',
                  'v207':'object',
                  'v208':'object',
                  'v209':'object',
                  'v210':'object',
                  'v211':'object',
                  'v212':'object',
                  'v213':'object',
                  'v214':'object',
                  'v215':'object',
                  'v216':'object',
                  'v217':'object',
                  'v218':'object',
                  'v219':'object',
                  'v220':'object',
                  'v221':'int64',
                  'v222':'object',
                  'v223':'object',
                  'v224':'object',
                  'v225':'object',
                  'v226':'object',
                  'v227':'object',
                  'v228':'object',
                  'v229':'object',
                  'v230':'object',
                  'v231':'object',
                  'v232':'object',
                  'v234':'int64',
                  'v235':'object',
                  'v237':'int64',
                  'v238':'object',
                  'v239':'int64'})
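  # Drop the leftover index column from the CSV export; keep the frame as df<i> and in dflist.
  globals()['df{0}'.format(i)] = globals()['df{0}'.format(i)].drop('Unnamed: 0', axis=1)
  dflist.append(globals()['df{0}'.format(i)])
# Stack the eleven yearly frames into a single lazy Dask dataframe.
raw = dd.concat(dflist)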
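
The explicit `dtype` mapping matters because type inference can silently change values: an integer-looking column with a missing entry is read as float, and codes with leading zeros lose them unless read as `object`. A minimal pandas sketch of the effect, using an invented two-column CSV (the column names `code` and `amount` are hypothetical, not SIHSUS fields):

```python
import io

import pandas as pd

# Invented data: codes with leading zeros and one missing amount.
csv_text = "code,amount\n0012,10\n0345,\n"

# Let pandas infer the types.
inferred = pd.read_csv(io.StringIO(csv_text))

# Declare the types up front, as the notebook does for every column.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"code": "object", "amount": "float64"},
)

print(inferred["code"].tolist(), inferred["amount"].dtype)
print(explicit["code"].tolist(), explicit["amount"].dtype)
```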

###Checking the structure of the concatenated Dask dataframe for the years 2008 to 2018, which will be the raw file.

In [None]:
raw

Unnamed: 0_level_0,v0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v19,v20,v21,v22,v23,v27,v28,v29,v30,v31,v32,v33,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v72,v73,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v234,v235,v237,v238,v239
npartitions=245
,int64,int64,int64,int64,float64,int64,object,int64,int64,object,int64,int64,int64,int64,int64,object,int64,int64,int64,int64,object,int64,int64,int64,int64,int64,float64,float64,float64,float64,float64,int64,int64,object,object,int64,object,int64,object,float64,object,int64,object,int64,object,int64,int64,int64,object,int64,int64,int64,int64,object,int64,int64,int64,int64,object,int64,object,int64,object,object,int64,object,int64,object,int64,object,int64,int64,object,int64,int64,int64,object,float64,int64,int64,float64,int64,float64,float64,float64,object,object,int64,object,int64,object,float64,object,int64,object,int64,object,float64,int64,object,object,object,float64,float64,float64,float64,float64,float64,object,object,object,object,object,object,object,object,float64,float64,float64,object,float64,object,float64,object,float64,object,float64,object,float64,object,float64,object,float64,float64,float64,float64,int64,object,object,object,object,object,int64,int64,int64,int64,float64,float64,float64,float64,int64,int64,object,object,object,object,object,int64,int64,int64,int64,float64,float64,int64,float64,int64,object,int64,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object,int64,object,object,object,object,object,object,object,object,object,object,object,int64,object,int64,object,int64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [None]:
raw.head()

Unnamed: 0,v0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v19,v20,v21,v22,v23,v27,v28,v29,v30,v31,v32,v33,v43,v44,v45,v46,v47,v48,v49,v50,v51,v52,v53,v54,v55,v56,v57,v59,v60,v61,v62,v63,v64,v65,v66,v67,v68,v69,v70,v72,v73,v76,v77,v78,v79,v80,v81,v82,v83,v84,v85,v86,v87,v88,v89,v90,v91,v92,v93,v94,v95,v96,v97,v98,v99,v100,v101,v102,v103,v104,v105,v106,v107,v108,v109,v110,v111,v112,v113,v114,v115,v116,v117,v118,v119,v120,v121,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131,v132,v133,v134,v135,v136,v137,v138,v139,v140,v141,v142,v143,v144,v145,v146,v147,v148,v149,v150,v151,v152,v153,v154,v155,v156,v157,v158,v159,v160,v161,v162,v163,v164,v165,v166,v167,v168,v169,v170,v171,v172,v173,v174,v175,v176,v177,v178,v179,v180,v181,v182,v183,v184,v185,v186,v187,v188,v189,v192,v193,v194,v195,v196,v197,v198,v199,v200,v201,v202,v203,v204,v205,v206,v207,v208,v209,v210,v211,v212,v213,v214,v215,v216,v217,v218,v219,v220,v221,v222,v223,v224,v225,v226,v227,v228,v229,v230,v231,v232,v234,v235,v237,v238,v239
0,120000,2008,12,2,529000000000.0,1208100502730,AIH normal,1208100502730,1,Normal,69900100,120040,12,19791012,3,Feminino,0,0,0,0,Não utilizou UTI,0,0,5,411010026,411010026,700.79,447.15,1147.94,0.0,496.94,20081206,20081211,O324,O630,17,Alta da mãe/puérpera e do recém-nascido,61,Privado,,,2,Estadual plena,1,Sim,120040,12,4,Anos,29,5,5,0,Sem óbito,10,2,2,2,2,0,Sem filhos/Não inform,0,,,0,,0,,1,Sim,1208103541,0,Seqüencial zerado,0,0,0,,,1,21706700172,,2002078,529000000000.0,,,,,2,Média complexidade,6,Média e Alta Complexidade (MAC),,,0,Sem regra contratual,3,Parda,,3046,HE12000001N200812.DTS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,120040,Rio Branco,RIO BRANCO,S,S,S,12,1290,1201,12900,-9.97,-67.81,153.0,9222.58,120040,120040,Rio Branco,RIO BRANCO,S,S,S,12,1290,1201,12900,-9.97,-67.81,153,9222.58,120040,AC,12,ACRE,AC,ACRE,Região Norte,Região Norte,Acre,AC,Acre,AC,Rio Branco,Rio Branco,,,Baixo Acre e Purus,Baixo Acre e Purus,Rio Branco,Rio Branco,Vale do Acre,Vale do Acre,Rio Branco,Acre,Acre,Região não definida - AC,Região não definida - AC,,,Brasil,Não Informado,,Obstétricos,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,29,15-49a,25-34a,25-29a,Capítulo XV - Gravidez; parto e puerpério,Assistência prestada à mãe por motivos ligados...,Assistência prestada à mãe por motivo de apres...,Assistência prestada à mãe por polo cefálico a...,,,,,2008,sab,2008,qui,29
1,120000,2008,12,2,529000000000.0,1208100502741,AIH normal,1208100502741,1,Normal,69945000,120001,12,19910727,3,Feminino,0,0,0,0,Não utilizou UTI,0,0,2,411010026,411010026,701.79,447.15,1148.94,0.0,497.37,20081210,20081212,O321,,17,Alta da mãe/puérpera e do recém-nascido,61,Privado,,,2,Estadual plena,1,Sim,120040,12,4,Anos,17,2,2,0,Sem óbito,10,2,2,0,Não,0,Sem filhos/Não inform,0,,,0,,0,,1,Sim,1208010001,0,Seqüencial zerado,0,0,0,,,0,0,,2002078,529000000000.0,,,,,2,Média complexidade,6,Média e Alta Complexidade (MAC),,,0,Sem regra contratual,3,Parda,,3047,HE12000001N200812.DTS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,120001,Acrelândia,ACRELANDIA,S,S,N,12,1290,1201,12900,-9.83,-66.88,25.0,1574.55,120001,120040,Rio Branco,RIO BRANCO,S,S,S,12,1290,1201,12900,-9.97,-67.81,153,9222.58,120040,AC,12,ACRE,AC,ACRE,Região Norte,Região Norte,Acre,AC,Acre,AC,Rio Branco,Acrelândia,,,Baixo Acre e Purus,Baixo Acre e Purus,Rio Branco,,Vale do Acre,Vale do Acre,Rio Branco,Acre,Acre,Região não definida - AC,Região não definida - AC,,,Brasil,Não Informado,,Obstétricos,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,17,15-49a,15-24a,15-19a,Capítulo XV - Gravidez; parto e puerpério,Assistência prestada à mãe por motivos ligados...,Assistência prestada à mãe por motivo de apres...,Assistência prestada à mãe por apresentação pé...,,,,,2008,qua,2008,sex,17
2,120000,2008,12,2,529000000000.0,1208100502752,AIH normal,1208100502752,1,Normal,69928000,120038,12,19911118,3,Feminino,0,0,0,0,Não utilizou UTI,0,0,2,411010026,411010026,700.79,447.15,1147.94,0.0,496.94,20081210,20081212,O324,,17,Alta da mãe/puérpera e do recém-nascido,61,Privado,,,2,Estadual plena,1,Sim,120040,12,4,Anos,17,2,2,0,Sem óbito,10,2,2,0,Não,0,Sem filhos/Não inform,0,,,0,,0,,1,Sim,1208103530,0,Seqüencial zerado,0,0,0,,,0,0,,2002078,529000000000.0,,,,,2,Média complexidade,6,Média e Alta Complexidade (MAC),,,0,Sem regra contratual,3,Parda,,3048,HE12000001N200812.DTS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,120038,Plácido de Castro,PLACIDO DE CASTRO,S,S,N,12,1290,1201,12900,-10.28,-67.15,136.0,2047.45,120038,120040,Rio Branco,RIO BRANCO,S,S,S,12,1290,1201,12900,-9.97,-67.81,153,9222.58,120040,AC,12,ACRE,AC,ACRE,Região Norte,Região Norte,Acre,AC,Acre,AC,Rio Branco,Plácido de Castro,,,Baixo Acre e Purus,Baixo Acre e Purus,Rio Branco,,Vale do Acre,Vale do Acre,Rio Branco,Acre,Acre,Região não definida - AC,Região não definida - AC,,,Brasil,Não Informado,,Obstétricos,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,17,15-49a,15-24a,15-19a,Capítulo XV - Gravidez; parto e puerpério,Assistência prestada à mãe por motivos ligados...,Assistência prestada à mãe por motivo de apres...,Assistência prestada à mãe por polo cefálico a...,,,,,2008,qua,2008,sex,17
3,120000,2008,12,2,529000000000.0,1208100502763,AIH normal,1208100502763,1,Normal,69900970,120040,12,19920129,3,Feminino,0,0,0,0,Não utilizou UTI,0,0,3,411010026,411010026,700.79,447.15,1147.94,0.0,496.94,20081209,20081212,O324,,17,Alta da mãe/puérpera e do recém-nascido,61,Privado,,,2,Estadual plena,1,Sim,120040,12,4,Anos,16,3,3,0,Sem óbito,10,2,2,0,Não,0,Sem filhos/Não inform,0,,,0,,0,,1,Sim,1208103115,0,Seqüencial zerado,0,0,0,,,0,0,,2002078,529000000000.0,,,,,2,Média complexidade,6,Média e Alta Complexidade (MAC),,,0,Sem regra contratual,3,Parda,,3049,HE12000001N200812.DTS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,120040,Rio Branco,RIO BRANCO,S,S,S,12,1290,1201,12900,-9.97,-67.81,153.0,9222.58,120040,120040,Rio Branco,RIO BRANCO,S,S,S,12,1290,1201,12900,-9.97,-67.81,153,9222.58,120040,AC,12,ACRE,AC,ACRE,Região Norte,Região Norte,Acre,AC,Acre,AC,Rio Branco,Rio Branco,,,Baixo Acre e Purus,Baixo Acre e Purus,Rio Branco,Rio Branco,Vale do Acre,Vale do Acre,Rio Branco,Acre,Acre,Região não definida - AC,Região não definida - AC,,,Brasil,Não Informado,,Obstétricos,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,16,15-49a,15-24a,15-19a,Capítulo XV - Gravidez; parto e puerpério,Assistência prestada à mãe por motivos ligados...,Assistência prestada à mãe por motivo de apres...,Assistência prestada à mãe por polo cefálico a...,,,,,2008,ter,2008,sex,16
4,120000,2008,12,2,529000000000.0,1208100502774,AIH normal,1208100502774,1,Normal,69900970,120040,12,19760827,3,Feminino,0,0,0,0,Não utilizou UTI,0,0,3,411010026,411010026,700.79,447.15,1147.94,0.0,496.94,20081209,20081212,O410,O48,17,Alta da mãe/puérpera e do recém-nascido,61,Privado,,,2,Estadual plena,1,Sim,120040,12,4,Anos,32,3,3,0,Sem óbito,10,2,2,0,Não,0,Sem filhos/Não inform,0,,,0,,0,,1,Sim,1208103511,0,Seqüencial zerado,0,0,0,,,0,0,,2002078,529000000000.0,,,,,2,Média complexidade,6,Média e Alta Complexidade (MAC),,,0,Sem regra contratual,3,Parda,,3050,HE12000001N200812.DTS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,120040,Rio Branco,RIO BRANCO,S,S,S,12,1290,1201,12900,-9.97,-67.81,153.0,9222.58,120040,120040,Rio Branco,RIO BRANCO,S,S,S,12,1290,1201,12900,-9.97,-67.81,153,9222.58,120040,AC,12,ACRE,AC,ACRE,Região Norte,Região Norte,Acre,AC,Acre,AC,Rio Branco,Rio Branco,,,Baixo Acre e Purus,Baixo Acre e Purus,Rio Branco,Rio Branco,Vale do Acre,Vale do Acre,Rio Branco,Acre,Acre,Região não definida - AC,Região não definida - AC,,,Brasil,Não Informado,,Obstétricos,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,PARTO CESARIANO EM GESTACAO DE ALTO RISCO,32,15-49a,25-34a,30-34a,Capítulo XV - Gravidez; parto e puerpério,Assistência prestada à mãe por motivos ligados...,Outros transtornos das membranas e do líquido ...,Oligohidrâmnio,,,,,2008,ter,2008,sex,32


###Computing descriptive statistics for the raw dataset, by year.

In [None]:
for i, year_df in enumerate(dflist):
  # dflist[0] holds 2008, so the year is the list position plus 2008.
  print('\033[1m' + 'Descriptive Statistics for {0}'.format(i + 2008) + '\033[0m')
  print('\n')
  print(year_df.compute().describe())
  print('\n')

Descriptive Statistics for 2008


              v0        v1        v2         v3        v4        v5        v7  \
count  893695.00  893695.0  893695.0  893695.00  7.15e+05  8.94e+05  8.94e+05   
mean   326509.59    2008.0      12.0       2.86  3.53e+13  3.27e+12  3.27e+12   
std     94977.88       0.0       0.0       1.95  2.92e+13  9.52e+11  9.52e+11   
min    110000.00    2008.0      12.0       1.00  6.18e+08  1.11e+12  1.11e+12   
25%    260790.00    2008.0      12.0       1.00  8.78e+12  2.61e+12  2.61e+12   
50%    330190.00    2008.0      12.0       3.00  2.71e+13  3.31e+12  3.31e+12   
75%    355670.00    2008.0      12.0       3.00  5.56e+13  3.51e+12  3.51e+12   
max    530000.00    2008.0      12.0      14.00  9.87e+13  9.91e+12  9.91e+12   

              v8       v10        v11        v12       v13        v14  \
count  893695.00  8.94e+05  893695.00  893695.00  8.94e+05  893695.00   
mean        1.15  5.07e+07  327007.23      32.53  1.97e+07       2.18   
std         0.

###Computing the total number of rows in the raw dataset.

In [None]:
# Count the non-null entries of v0 in each yearly frame and add them up.
observacoes = sum(d['v0'].count().compute() for d in dflist)
observacoes

10152242

###Building a dataframe with the municipalities within a range of IBGE locality codes.

In [None]:
proc1 = raw.copy()
# between() includes both endpoints by default; the boolean form inclusive=True was removed in pandas 2.0.
proc = proc1.loc[proc1['v0'].between(350000, 356000)]
proc.compute().shape

(2176208, 217)
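
The range filter can be checked on a small pandas frame before running it on the full Dask dataframe. The rows below are invented; only the column name `v0` and the bounds 350000 and 356000 come from the notebook.

```python
import pandas as pd

# Invented locality codes: two outside the range, three inside (both bounds included).
toy = pd.DataFrame({"v0": [120000, 350000, 355030, 356000, 410690]})

# Series.between is inclusive on both ends by default.
mask = toy["v0"].between(350000, 356000)
selected = toy.loc[mask]

print(selected["v0"].tolist())
```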

In [None]:
a = proc.compute()

###Removing columns with more than 75% NaN values and columns with no description in the variable dictionary.

In [None]:
cid_columns = ['v48','v49','v82','v104','v105']
# isnull().mean() gives the fraction of missing values per column; select the columns above 75%.
df = a.loc[:, a.isnull().mean() > .75]
df.columns

Index(['v81', 'v82', 'v84', 'v86', 'v95', 'v99', 'v102', 'v103', 'v110',
       'v111', 'v119', 'v120', 'v128', 'v129', 'v130', 'v131', 'v132', 'v133',
       'v134', 'v135', 'v136', 'v138', 'v140', 'v142', 'v144', 'v146', 'v148',
       'v150', 'v152', 'v154', 'v214', 'v217', 'v229', 'v230', 'v231', 'v232'],
      dtype='object')

In [None]:
b=a.drop(['v53', 'v81','v84', 'v86', 'v95', 'v99', 'v102', 'v103', 'v110','v111', 'v119', 'v120', 'v129', 'v130', 'v131', 'v132', 'v133', 'v134','v135', 'v136', 'v140', 'v142', 'v144', 'v146', 'v148', 'v150', 'v152','v154', 'v214', 'v217'], axis=1)

In [None]:
b.shape

(2176208, 187)
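
The selection `a.isnull().mean() > .75` works because the mean of a boolean mask is the fraction of `True` values in each column. A minimal sketch on an invented frame (the column names `x`, `y`, and `z` are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan, np.nan],  # 80% missing
    "y": [1.0, 2.0, np.nan, 4.0, 5.0],           # 20% missing
    "z": ["a", "b", "c", "d", "e"],              # nothing missing
})

# Fraction of missing values per column.
nan_fraction = toy.isnull().mean()

# Columns above the 75% threshold, then the frame without them.
sparse_cols = nan_fraction[nan_fraction > 0.75].index.tolist()
trimmed = toy.drop(columns=sparse_cols)

print(sparse_cols)
print(trimmed.shape)
```

In the notebook itself the drop list is chosen by hand rather than taken directly from the threshold: it keeps some sparse columns such as `v82` and also drops `v53`, which is not in the sparse list.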

###Saving the raw file.

In [None]:
with open('b.pkl', 'wb') as f:
    # Pickle the dataframe b using the highest protocol available.
    pickle.dump(b, f, pickle.HIGHEST_PROTOCOL)
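
One way to gain confidence in the saved file is to load it back and compare it with the in-memory frame. A sketch on an invented frame, written to a temporary directory rather than to `b.pkl`; the names `toy`, `restored`, and `round_trip_ok` are hypothetical:

```python
import os
import pickle
import tempfile

import pandas as pd

# Invented two-row frame standing in for the real dataframe.
toy = pd.DataFrame({"v0": [350000, 355030], "v48": ["O324", "O410"]})

path = os.path.join(tempfile.mkdtemp(), "toy.pkl")

# Write with the same call pattern used in the notebook.
with open(path, "wb") as f:
    pickle.dump(toy, f, pickle.HIGHEST_PROTOCOL)

# Read it back and check that values and dtypes survived.
with open(path, "rb") as f:
    restored = pickle.load(f)

round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.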

In [None]:
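from google.colab import files as colab_files  # `files` was rebound to the list of CSV names above
colab_files.download('b.pkl')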
