# Projeto de Linguagem R para Tratamento de Dados e Machine Learning com Árvore de Decisão

## Objetivo:

Desenvolver uma solução em linguagem R para realizar o tratamento de dados e implementar modelos de machine learning utilizando a técnica de árvore de decisão.

## Subobjetivos:

Preparação de Dados:

- Importar conjuntos de dados relevantes.
- Realizar limpeza e pré-processamento dos dados.
- Tratar valores ausentes e outliers.
- Implementação da Árvore de Decisão:

Estudar o funcionamento da técnica de árvore de decisão.

- Utilizar pacotes R para construir e treinar modelos de árvore de decisão.

Avaliação do Modelo:

- Dividir os dados em conjunto de treinamento e teste.
- Avaliar o desempenho da árvore de decisão usando métricas como acurácia.

Este projeto visa criar um sistema em linguagem R para realizar o tratamento de dados e implementar técnicas de machine learning usando árvores de decisão. Os dados serão preparados e processados antes de construir e treinar o modelo.

#### Dificuldades:

- Confundindo comandos do python com a linguagem R
- Em criar função para troca de valores no dataframe.

Lembrando que esses tópicos estão apresentados de forma simplificada e resumida. Um projeto completo envolveria detalhes mais aprofundados e etapas adicionais.

# Projeto:

In [1]:
# Carregando o Dataframe
df <- read.csv('/content/heart_2020_cleaned.csv')
head(df)

Unnamed: 0_level_0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
2,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
3,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
4,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
5,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No
6,Yes,28.87,Yes,No,No,6.0,0.0,Yes,Female,75-79,Black,No,No,Fair,12.0,No,No,No


In [2]:
# Renomeando as colunas
novos_nomes <- c("doenca_cardiaca",	"IMC",	"fumante",	"bebe_alcool",	"teve_AVC",	"PhysicalHealth",	"MentalHealth",	"dificul_andar_subir_escadas",	"sexo",	"IdadeCategoria",	"raca",	"diabetico",	"atividade_fisica_regular",	"GenHealth",	"horas_de_sono",	"asma",	"doenca_renal",	"cancer_de_pele")
colnames(df) <- novos_nomes
head(df)

Unnamed: 0_level_0,doenca_cardiaca,IMC,fumante,bebe_alcool,teve_AVC,PhysicalHealth,MentalHealth,dificul_andar_subir_escadas,sexo,IdadeCategoria,raca,diabetico,atividade_fisica_regular,GenHealth,horas_de_sono,asma,doenca_renal,cancer_de_pele
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
2,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
3,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
4,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
5,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No
6,Yes,28.87,Yes,No,No,6.0,0.0,Yes,Female,75-79,Black,No,No,Fair,12.0,No,No,No


In [3]:
# Informações relevantes
str(df)

'data.frame':	613231 obs. of  18 variables:
 $ doenca_cardiaca            : chr  "No" "No" "No" "No" ...
 $ IMC                        : chr  "16.6" "20.34" "26.58" "24.21" ...
 $ fumante                    : chr  "Yes" "No" "Yes" "No" ...
 $ bebe_alcool                : chr  "No" "No" "No" "No" ...
 $ teve_AVC                   : chr  "No" "Yes" "No" "No" ...
 $ PhysicalHealth             : chr  "3.0" "0.0" "20.0" "0.0" ...
 $ MentalHealth               : chr  "30.0" "0.0" "30.0" "0.0" ...
 $ dificul_andar_subir_escadas: chr  "No" "No" "No" "No" ...
 $ sexo                       : chr  "Female" "Female" "Male" "Female" ...
 $ IdadeCategoria             : chr  "55-59" "80 or older" "65-69" "75-79" ...
 $ raca                       : chr  "White" "White" "White" "White" ...
 $ diabetico                  : chr  "Yes" "No" "Yes" "No" ...
 $ atividade_fisica_regular   : chr  "Yes" "Yes" "Yes" "No" ...
 $ GenHealth                  : chr  "Very good" "Very good" "Fair" "Good" ...
 $ horas_d

In [4]:
# Substituir 'Yes' por 1 e 'No' por 0 em todas as colunas
df <- data.frame(lapply(df, function(col) ifelse(col == 'Yes', 1, ifelse(col == 'No', 0, col))))

In [5]:
head(df)

Unnamed: 0_level_0,doenca_cardiaca,IMC,fumante,bebe_alcool,teve_AVC,PhysicalHealth,MentalHealth,dificul_andar_subir_escadas,sexo,IdadeCategoria,raca,diabetico,atividade_fisica_regular,GenHealth,horas_de_sono,asma,doenca_renal,cancer_de_pele
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,0,16.6,1,0,0,3.0,30.0,0,Female,55-59,White,1,1,Very good,5.0,1,0,1
2,0,20.34,0,0,1,0.0,0.0,0,Female,80 or older,White,0,1,Very good,7.0,0,0,0
3,0,26.58,1,0,0,20.0,30.0,0,Male,65-69,White,1,1,Fair,8.0,1,0,0
4,0,24.21,0,0,0,0.0,0.0,0,Female,75-79,White,0,0,Good,6.0,0,0,1
5,0,23.71,0,0,0,28.0,0.0,1,Female,40-44,White,0,1,Very good,8.0,0,0,0
6,1,28.87,1,0,0,6.0,0.0,1,Female,75-79,Black,0,0,Fair,12.0,0,0,0


In [6]:
# Listando valores únicos de cada coluna
library(dplyr)
valores_unicos <- df%>%
  summarise_all(~ list(unique(.)))
print(valores_unicos)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




                                                         doenca_cardiaca
1 0, 1, Fair, White, , Excellent, 0.0, Noes, 65-69, Male, Very good, ood
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

In [7]:
# Substituir valores
df <- data.frame(lapply(df, function(col) ifelse(col == 'Yes (during pregnancy', 0, ifelse(col == 'No, borderline diabetes', 1, col))))
df <- data.frame(lapply(df, function(col) ifelse(col == 'Female', 0, ifelse(col == 'Male', 1, col))))

valores_unicos <- df %>%
  summarise_all(~ list(unique(.)))
print(valores_unicos)

                                                   doenca_cardiaca
1 0, 1, Fair, White, , Excellent, 0.0, Noes, 65-69, Very good, ood
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  

In [8]:
# Função para classificar IMC
imc <- function(imc) {
  if (imc < 18.5) {
    return("baixopeso")
  } else if (imc <= 24.9) {
    return("eutrofia(pesoadequado)")
  } else if (imc <= 29.9) {
    return("sobrepeso")
  } else if (imc <= 34.9) {
    return("obesidadegrau1")
  } else if (imc <= 39.9) {
    return("obesidadegrau2")
  } else {
    return("obesidadeextrema")
  }
}

# Aplicar a função à coluna IMC e criar nova coluna IMC_grau
df$IMC_grau <- sapply(df$IMC, imc)

head(df)

Unnamed: 0_level_0,doenca_cardiaca,IMC,fumante,bebe_alcool,teve_AVC,PhysicalHealth,MentalHealth,dificul_andar_subir_escadas,sexo,IdadeCategoria,raca,diabetico,atividade_fisica_regular,GenHealth,horas_de_sono,asma,doenca_renal,cancer_de_pele,IMC_grau
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,0,16.6,1,0,0,3.0,30.0,0,0,55-59,White,1,1,Very good,5.0,1,0,1,baixopeso
2,0,20.34,0,0,1,0.0,0.0,0,0,80 or older,White,0,1,Very good,7.0,0,0,0,eutrofia(pesoadequado)
3,0,26.58,1,0,0,20.0,30.0,0,1,65-69,White,1,1,Fair,8.0,1,0,0,sobrepeso
4,0,24.21,0,0,0,0.0,0.0,0,0,75-79,White,0,0,Good,6.0,0,0,1,eutrofia(pesoadequado)
5,0,23.71,0,0,0,28.0,0.0,1,0,40-44,White,0,1,Very good,8.0,0,0,0,eutrofia(pesoadequado)
6,1,28.87,1,0,0,6.0,0.0,1,0,75-79,Black,0,0,Fair,12.0,0,0,0,sobrepeso


In [9]:
# Deletar a coluna IMC
df <- df[, -which(names(df) == "IMC")]
head(df)

Unnamed: 0_level_0,doenca_cardiaca,fumante,bebe_alcool,teve_AVC,PhysicalHealth,MentalHealth,dificul_andar_subir_escadas,sexo,IdadeCategoria,raca,diabetico,atividade_fisica_regular,GenHealth,horas_de_sono,asma,doenca_renal,cancer_de_pele,IMC_grau
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,0,1,0,0,3.0,30.0,0,0,55-59,White,1,1,Very good,5.0,1,0,1,baixopeso
2,0,0,0,1,0.0,0.0,0,0,80 or older,White,0,1,Very good,7.0,0,0,0,eutrofia(pesoadequado)
3,0,1,0,0,20.0,30.0,0,1,65-69,White,1,1,Fair,8.0,1,0,0,sobrepeso
4,0,0,0,0,0.0,0.0,0,0,75-79,White,0,0,Good,6.0,0,0,1,eutrofia(pesoadequado)
5,0,0,0,0,28.0,0.0,1,0,40-44,White,0,1,Very good,8.0,0,0,0,eutrofia(pesoadequado)
6,1,1,0,0,6.0,0.0,1,0,75-79,Black,0,0,Fair,12.0,0,0,0,sobrepeso


In [10]:
# Encontrar colunas vazias
colunas_vazias <- colnames(df)[colSums(is.na(df)) == nrow(df)]

print(colunas_vazias)

character(0)


In [11]:
str(df)

'data.frame':	613231 obs. of  18 variables:
 $ doenca_cardiaca            : chr  "0" "0" "0" "0" ...
 $ fumante                    : chr  "1" "0" "1" "0" ...
 $ bebe_alcool                : chr  "0" "0" "0" "0" ...
 $ teve_AVC                   : chr  "0" "1" "0" "0" ...
 $ PhysicalHealth             : chr  "3.0" "0.0" "20.0" "0.0" ...
 $ MentalHealth               : chr  "30.0" "0.0" "30.0" "0.0" ...
 $ dificul_andar_subir_escadas: chr  "0" "0" "0" "0" ...
 $ sexo                       : chr  "0" "0" "1" "0" ...
 $ IdadeCategoria             : chr  "55-59" "80 or older" "65-69" "75-79" ...
 $ raca                       : chr  "White" "White" "White" "White" ...
 $ diabetico                  : chr  "1" "0" "1" "0" ...
 $ atividade_fisica_regular   : chr  "1" "1" "1" "0" ...
 $ GenHealth                  : chr  "Very good" "Very good" "Fair" "Good" ...
 $ horas_de_sono              : chr  "5.0" "7.0" "8.0" "6.0" ...
 $ asma                       : chr  "1" "0" "1" "0" ...
 $ doenca_rena

In [12]:
# Alterar o tipo das colunas para numérico
df <- df%>%
  mutate(
    doenca_cardiaca = as.numeric(doenca_cardiaca),
    fumante = as.numeric(fumante),
    bebe_alcool = as.numeric(bebe_alcool),
    teve_AVC = as.numeric(teve_AVC),
    dificul_andar_subir_escadas = as.numeric(dificul_andar_subir_escadas),
    sexo = as.numeric(sexo),
    diabetico = as.numeric(diabetico),
    atividade_fisica_regular = as.numeric(atividade_fisica_regular),
    asma = as.numeric(asma),
    doenca_renal = as.numeric(doenca_renal),
    cancer_de_pele  = as.numeric(cancer_de_pele)
  )

str(df)

[1m[22m[36mℹ[39m In argument: `doenca_cardiaca = as.numeric(doenca_cardiaca)`.
[33m![39m NAs introduced by coercion


'data.frame':	613231 obs. of  18 variables:
 $ doenca_cardiaca            : num  0 0 0 0 0 1 0 0 0 0 ...
 $ fumante                    : num  1 0 1 0 0 1 0 1 0 0 ...
 $ bebe_alcool                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                   : num  0 1 0 0 0 0 0 0 0 0 ...
 $ PhysicalHealth             : chr  "3.0" "0.0" "20.0" "0.0" ...
 $ MentalHealth               : chr  "30.0" "0.0" "30.0" "0.0" ...
 $ dificul_andar_subir_escadas: num  0 0 0 0 1 1 0 1 0 1 ...
 $ sexo                       : num  0 0 1 0 0 0 0 0 0 1 ...
 $ IdadeCategoria             : chr  "55-59" "80 or older" "65-69" "75-79" ...
 $ raca                       : chr  "White" "White" "White" "White" ...
 $ diabetico                  : num  1 0 1 0 0 0 0 1 1 0 ...
 $ atividade_fisica_regular   : num  1 1 1 0 1 0 1 0 0 1 ...
 $ GenHealth                  : chr  "Very good" "Very good" "Fair" "Good" ...
 $ horas_de_sono              : chr  "5.0" "7.0" "8.0" "6.0" ...
 $ asma                       : num  1 

In [13]:
# Pesquisando valores únicos na coluna IdadeCategoria
valores_unicos <- unique(df$IdadeCategoria)

# Imprimir os valores únicos
print(valores_unicos)

 [1] "55-59"       "80 or older" "65-69"       "75-79"       "40-44"      
 [6] "70-74"       "60-64"       "50-54"       "45-49"       "18-24"      
[11] "35-39"       "30-34"       "25-29"       "0"           ""           
[16] "70"          "70o"         "Good"        "Excellent"   "35-"        
[21] "35-o"        "Very good"  


In [14]:
# Vetor de faixas etárias que correspondem a "Adulto"
faixas_adulto <- c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59")

# Substituir valores na coluna IdadeCategoria
df$IdadeCategoria <- ifelse(df$IdadeCategoria %in% faixas_adulto, "Adulto", "Idoso")

# Imprimir os valores únicos atualizados
valores_unicos_atualizados <- unique(df$IdadeCategoria)

In [15]:
valores_unicos <- unique(df$IdadeCategoria)

# Imprimir os valores únicos
print(valores_unicos)

[1] "Adulto" "Idoso" 


In [16]:
# Deletar colunas
df <- df[, -which(names(df) == "PhysicalHealth")]
df <- df[, -which(names(df) == "MentalHealth")]
df <- df[, -which(names(df) == "horas_de_sono")]

In [17]:
str(df)

'data.frame':	613231 obs. of  15 variables:
 $ doenca_cardiaca            : num  0 0 0 0 0 1 0 0 0 0 ...
 $ fumante                    : num  1 0 1 0 0 1 0 1 0 0 ...
 $ bebe_alcool                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                   : num  0 1 0 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas: num  0 0 0 0 1 1 0 1 0 1 ...
 $ sexo                       : num  0 0 1 0 0 0 0 0 0 1 ...
 $ IdadeCategoria             : chr  "Adulto" "Idoso" "Idoso" "Idoso" ...
 $ raca                       : chr  "White" "White" "White" "White" ...
 $ diabetico                  : num  1 0 1 0 0 0 0 1 1 0 ...
 $ atividade_fisica_regular   : num  1 1 1 0 1 0 1 0 0 1 ...
 $ GenHealth                  : chr  "Very good" "Very good" "Fair" "Good" ...
 $ asma                       : num  1 0 1 0 0 0 1 1 0 0 ...
 $ doenca_renal               : num  0 0 0 0 0 0 0 0 1 0 ...
 $ cancer_de_pele             : num  1 0 0 1 0 0 1 0 0 0 ...
 $ IMC_grau                   : chr  "baixopeso" "eutrofia(

In [18]:
# Analisando valores únicos da coluna GenHealth
valores_unicos <- unique(df$GenHealth)

# Imprimir os valores únicos
print(valores_unicos)

 [1] "Very good" "Fair"      "Good"      "Poor"      "Excellent" ""         
 [7] "Gooe"      "Goo"       "0"         "1"         "5.0"       "21.7"     
[13] "Gite"      "Go"       


In [19]:
# Substituir valor na coluna GenHealth
df$GenHealth[df$GenHealth == "Very good"] <- "Verygood"

# Imprimir os valores únicos atualizados
valores_unicos_atualizados <- unique(df$GenHealth)
print(valores_unicos_atualizados)

 [1] "Verygood"  "Fair"      "Good"      "Poor"      "Excellent" ""         
 [7] "Gooe"      "Goo"       "0"         "1"         "5.0"       "21.7"     
[13] "Gite"      "Go"       


In [20]:
# Criar variáveis dummy usando model.matrix
df_dummies <- as.data.frame(model.matrix(~ . - 1, data = df))

head(df_dummies)

Unnamed: 0_level_0,doenca_cardiaca,fumante,bebe_alcool,teve_AVC,dificul_andar_subir_escadas,sexo,IdadeCategoriaAdulto,IdadeCategoriaIdoso,racaAmerican Indian/Alaskan Native,racaAsian,⋯,GenHealthPoor,GenHealthVerygood,asma,doenca_renal,cancer_de_pele,IMC_graueutrofia(pesoadequado),IMC_grauobesidadeextrema,IMC_grauobesidadegrau1,IMC_grauobesidadegrau2,IMC_grausobrepeso
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,1,0,0,0,0,1,0,0,0,⋯,0,1,1,0,1,0,0,0,0,0
2,0,0,0,1,0,0,0,1,0,0,⋯,0,1,0,0,0,1,0,0,0,0
3,0,1,0,0,0,1,0,1,0,0,⋯,0,0,1,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,1,0,0,⋯,0,0,0,0,1,1,0,0,0,0
5,0,0,0,0,1,0,1,0,0,0,⋯,0,1,0,0,0,1,0,0,0,0
6,1,1,0,0,1,0,0,1,0,0,⋯,0,0,0,0,0,0,0,0,0,1


In [21]:
# Todos os dados estão em numérico ou inteiros, em 0 ou 1
str(df_dummies)

'data.frame':	608239 obs. of  32 variables:
 $ doenca_cardiaca                   : num  0 0 0 0 0 1 0 0 0 0 ...
 $ fumante                           : num  1 0 1 0 0 1 0 1 0 0 ...
 $ bebe_alcool                       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                          : num  0 1 0 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas       : num  0 0 0 0 1 1 0 1 0 1 ...
 $ sexo                              : num  0 0 1 0 0 0 0 0 0 1 ...
 $ IdadeCategoriaAdulto              : num  1 0 0 0 1 0 0 0 0 0 ...
 $ IdadeCategoriaIdoso               : num  0 1 1 1 0 1 1 1 1 1 ...
 $ racaAmerican Indian/Alaskan Native: num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaAsian                         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaBlack                         : num  0 0 0 0 0 1 0 0 0 0 ...
 $ racaHispanic                      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaOther                         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaWhio                          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ r

In [22]:
# Apresentando a correlação dos dados, como possui várias colunas a visualização é ruim
correlacao <- cor(df_dummies)
head(correlacao, 100)

Unnamed: 0,doenca_cardiaca,fumante,bebe_alcool,teve_AVC,dificul_andar_subir_escadas,sexo,IdadeCategoriaAdulto,IdadeCategoriaIdoso,racaAmerican Indian/Alaskan Native,racaAsian,⋯,GenHealthPoor,GenHealthVerygood,asma,doenca_renal,cancer_de_pele,IMC_graueutrofia(pesoadequado),IMC_grauobesidadeextrema,IMC_grauobesidadegrau1,IMC_grauobesidadegrau2,IMC_grausobrepeso
doenca_cardiaca,1.0,0.107713549,-0.0328848357,0.1966108735,0.2008371381,0.069501951,-0.210887061,0.210887061,0.0095695462,-0.0297688838,⋯,0.173411238,-0.1013805662,0.0415051827,0.1453259239,0.0926712347,-0.0496427591,0.020838834,0.0255708625,0.0241047264,0.0036507942
fumante,0.1077135491,1.0,0.1120167469,0.0611478248,0.1203923534,0.084867277,-0.097183037,0.097183037,0.0358994317,-0.0594106872,⋯,0.086503325,-0.052964115,0.0238120297,0.0349907362,0.033391678,-0.0329747891,0.007868393,0.0135319264,0.0087609108,0.0090826325
bebe_alcool,-0.0328848357,0.112016747,1.0,-0.0202111332,-0.0355281027,0.004000694,0.056365494,-0.056365494,-0.0040218956,-0.0220900946,⋯,-0.017111273,0.0125344055,-0.0020936427,-0.0285011852,-0.0061248476,0.0291035947,-0.023084366,-0.0140220581,-0.0194858558,0.0053504993
teve_AVC,0.1966108735,0.061147825,-0.0202111332,1.0,0.1736984183,-0.003140572,-0.119917985,0.119917985,0.0140654296,-0.0157914102,⋯,0.133344945,-0.0690612164,0.0383147838,0.0917862955,0.0477957757,-0.0212960421,0.009908723,0.0127203023,0.0085210624,-0.0018444627
dificul_andar_subir_escadas,0.2008371381,0.120392353,-0.0355281027,0.1736984183,1.0,-0.069604696,-0.201434667,0.201434667,0.0256233834,-0.0378777644,⋯,0.308394278,-0.1845176581,0.1033663231,0.1538816152,0.0647141992,-0.0946029247,0.143920952,0.0403122034,0.0822751646,-0.0595575508
sexo,0.0695019509,0.084867277,0.0040006939,-0.0031405715,-0.0696046958,1.0,0.050639943,-0.050639943,-0.0035032164,0.0143723709,⋯,-0.010915786,-0.0032852061,-0.0691077742,-0.0091435057,0.0124672621,-0.1027304626,-0.04785393,0.0339690207,-0.0077166081,0.1070442661
IdadeCategoriaAdulto,-0.210887061,-0.097183037,0.0563654938,-0.1199179854,-0.2014346673,0.050639943,1.0,-1.0,0.0258779658,0.0630709584,⋯,-0.06728615,0.0252407839,0.0451372407,-0.10949805,-0.2386741607,0.0099660343,0.051576142,-0.004336555,0.0241040375,-0.0430661071
IdadeCategoriaIdoso,0.210887061,0.097183037,-0.0563654938,0.1199179854,0.2014346673,-0.050639943,-1.0,1.0,-0.0258779658,-0.0630709584,⋯,0.06728615,-0.0252407839,-0.0451372407,0.10949805,0.2386741607,-0.0099660343,-0.051576142,0.004336555,-0.0241040375,0.0430661071
racaAmerican Indian/Alaskan Native,0.0095695462,0.035899432,-0.0040218956,0.0140654296,0.0256233834,-0.003503216,0.025877966,-0.025877966,1.0,-0.0200129254,⋯,0.022171873,-0.0250806435,0.0145968346,0.007562106,-0.0252798031,-0.0185926898,0.014928722,0.0122945629,0.0128670598,-0.0065064486
racaAsian,-0.0297688838,-0.059410687,-0.0220900946,-0.0157914102,-0.0378777644,0.014372371,0.063070958,-0.063070958,-0.0200129254,1.0,⋯,-0.01760646,-0.0031099029,-0.0170636686,-0.0165445293,-0.0476216987,0.0658861315,-0.029682094,-0.0379510951,-0.032456838,-0.0067048233


# Dividindo os dados em Treino e Teste(Validação)

In [23]:
# Dividir os dados em treinamento e teste
prop_treino <- 0.7
n_treino <- round(prop_treino * nrow(df_dummies))

dados_treino <- df_dummies[1:n_treino, ]
dados_teste <- df_dummies[(n_treino + 1):nrow(df_dummies), ]

# Amostragem aleatória para conjuntos de treinamento e validação
set.seed(123)  # Define uma semente para a reprodutibilidade
indices_treino <- sample(nrow(df_dummies), n_treino)
dados_treino <- df_dummies[indices_treino, ]
dados_validacao <- df_dummies[-indices_treino, ]

In [24]:
str(dados_treino)

'data.frame':	425767 obs. of  32 variables:
 $ doenca_cardiaca                   : num  0 0 0 0 0 0 0 0 1 0 ...
 $ fumante                           : num  1 0 1 1 0 1 0 1 0 0 ...
 $ bebe_alcool                       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas       : num  0 0 0 0 0 1 0 0 0 0 ...
 $ sexo                              : num  0 1 1 0 1 0 1 1 0 1 ...
 $ IdadeCategoriaAdulto              : num  1 1 0 1 1 0 1 1 0 0 ...
 $ IdadeCategoriaIdoso               : num  0 0 1 0 0 1 0 0 1 1 ...
 $ racaAmerican Indian/Alaskan Native: num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaAsian                         : num  0 0 0 1 0 0 0 0 0 0 ...
 $ racaBlack                         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaHispanic                      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaOther                         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaWhio                          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ r

In [25]:
str(dados_validacao)

'data.frame':	182472 obs. of  32 variables:
 $ doenca_cardiaca                   : num  0 0 0 0 1 0 0 0 0 0 ...
 $ fumante                           : num  1 0 0 0 1 1 0 1 0 1 ...
 $ bebe_alcool                       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                          : num  0 1 0 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas       : num  0 0 1 0 1 0 1 1 0 0 ...
 $ sexo                              : num  0 0 0 0 1 0 0 1 0 1 ...
 $ IdadeCategoriaAdulto              : num  1 0 1 0 0 1 0 0 0 0 ...
 $ IdadeCategoriaIdoso               : num  0 1 0 1 1 0 1 1 1 1 ...
 $ racaAmerican Indian/Alaskan Native: num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaAsian                         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaBlack                         : num  0 0 0 0 0 0 0 0 1 0 ...
 $ racaHispanic                      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaOther                         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaWhio                          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ r

In [26]:
# Separar os dados em features (X) e variável alvo (y)
X_treino <- dados_treino[, -which(names(dados_treino) == "doenca_cardiaca")]
y_treino <- dados_treino$doenca_cardiaca
X_validacao <- dados_validacao[, -which(names(dados_validacao) == "doenca_cardiaca")]
y_validacao <- dados_validacao$doenca_cardiaca

# Modelo Árvore de decisão

In [27]:
install.packages("rpart")
library(rpart)

# Criar o modelo de árvore de decisão
modelo_arvore <- rpart(y_treino ~ ., data = X_treino, method = "class")

# Fazer predições no conjunto de validação
predicoes <- predict(modelo_arvore, newdata = X_validacao, type = "class")

  # Acurácia
  accuracy <- mean(predicoes == y_validacao)*100
  accuracy <- round(accuracy,2)

  # Imprimir resultados
  cat("Acurácia:", accuracy,"% de acerto", "\n")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Acurácia: 91.45 % de acerto 
