# R Project for Data Preparation and Machine Learning with Decision Trees

## Objective

Develop a solution in R that prepares a dataset and trains a machine learning model using a decision tree.

## Sub-objectives

Data preparation:

- Import the relevant datasets.
- Clean and preprocess the data.
- Handle missing values and outliers.

Decision tree implementation:

- Study how the decision tree technique works.
- Use R packages to build and train decision tree models.

Model evaluation:

- Split the data into training and test sets.
- Evaluate the decision tree's performance with metrics such as accuracy.

This project builds a workflow in R that prepares the data and then applies a decision tree to it. The data is cleaned and transformed before the model is built and trained.

#### Difficulties

- Confusing Python commands with their R equivalents.
- Writing a function to replace values in the data frame.

Note that these topics are presented in a simplified, summarized form. A complete project would involve more detail and additional steps.

# Project

In [1]:
# Load the data frame
df <- read.csv('/content/heart_2020_cleaned.csv')
head(df)

Unnamed: 0_level_0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
Unnamed: 0_level_1,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>
1,No,16.6,Yes,No,No,3,30,No,Female,55-59,White,Yes,Yes,Very good,5,Yes,No,Yes
2,No,20.34,No,No,Yes,0,0,No,Female,80 or older,White,No,Yes,Very good,7,No,No,No
3,No,26.58,Yes,No,No,20,30,No,Male,65-69,White,Yes,Yes,Fair,8,Yes,No,No
4,No,24.21,No,No,No,0,0,No,Female,75-79,White,No,No,Good,6,No,No,Yes
5,No,23.71,No,No,No,28,0,Yes,Female,40-44,White,No,Yes,Very good,8,No,No,No
6,Yes,28.87,Yes,No,No,6,0,Yes,Female,75-79,Black,No,No,Fair,12,No,No,No


In [2]:
# Rename the columns
novos_nomes <- c("doenca_cardiaca",	"IMC",	"fumante",	"bebe_alcool",	"teve_AVC",	"PhysicalHealth",	"MentalHealth",	"dificul_andar_subir_escadas",	"sexo",	"IdadeCategoria",	"raca",	"diabetico",	"atividade_fisica_regular",	"GenHealth",	"horas_de_sono",	"asma",	"doenca_renal",	"cancer_de_pele")
colnames(df) <- novos_nomes
head(df)

Unnamed: 0_level_0,doenca_cardiaca,IMC,fumante,bebe_alcool,teve_AVC,PhysicalHealth,MentalHealth,dificul_andar_subir_escadas,sexo,IdadeCategoria,raca,diabetico,atividade_fisica_regular,GenHealth,horas_de_sono,asma,doenca_renal,cancer_de_pele
Unnamed: 0_level_1,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>
1,No,16.6,Yes,No,No,3,30,No,Female,55-59,White,Yes,Yes,Very good,5,Yes,No,Yes
2,No,20.34,No,No,Yes,0,0,No,Female,80 or older,White,No,Yes,Very good,7,No,No,No
3,No,26.58,Yes,No,No,20,30,No,Male,65-69,White,Yes,Yes,Fair,8,Yes,No,No
4,No,24.21,No,No,No,0,0,No,Female,75-79,White,No,No,Good,6,No,No,Yes
5,No,23.71,No,No,No,28,0,Yes,Female,40-44,White,No,Yes,Very good,8,No,No,No
6,Yes,28.87,Yes,No,No,6,0,Yes,Female,75-79,Black,No,No,Fair,12,No,No,No


In [3]:
# Inspect the structure and column types
str(df)

'data.frame':	319795 obs. of  18 variables:
 $ doenca_cardiaca            : chr  "No" "No" "No" "No" ...
 $ IMC                        : num  16.6 20.3 26.6 24.2 23.7 ...
 $ fumante                    : chr  "Yes" "No" "Yes" "No" ...
 $ bebe_alcool                : chr  "No" "No" "No" "No" ...
 $ teve_AVC                   : chr  "No" "Yes" "No" "No" ...
 $ PhysicalHealth             : num  3 0 20 0 28 6 15 5 0 0 ...
 $ MentalHealth               : num  30 0 30 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas: chr  "No" "No" "No" "No" ...
 $ sexo                       : chr  "Female" "Female" "Male" "Female" ...
 $ IdadeCategoria             : chr  "55-59" "80 or older" "65-69" "75-79" ...
 $ raca                       : chr  "White" "White" "White" "White" ...
 $ diabetico                  : chr  "Yes" "No" "Yes" "No" ...
 $ atividade_fisica_regular   : chr  "Yes" "Yes" "Yes" "No" ...
 $ GenHealth                  : chr  "Very good" "Very good" "Fair" "Good" ...
 $ horas_de_sono      

In [4]:
# Replace 'Yes' with 1 and 'No' with 0 in every column.
# Text columns remain character after this step ("1"/"0"), so they are converted to numeric later.
df <- data.frame(lapply(df, function(col) ifelse(col == 'Yes', 1, ifelse(col == 'No', 0, col))))

In [5]:
head(df)

Unnamed: 0_level_0,doenca_cardiaca,IMC,fumante,bebe_alcool,teve_AVC,PhysicalHealth,MentalHealth,dificul_andar_subir_escadas,sexo,IdadeCategoria,raca,diabetico,atividade_fisica_regular,GenHealth,horas_de_sono,asma,doenca_renal,cancer_de_pele
Unnamed: 0_level_1,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>
1,0,16.6,1,0,0,3,30,0,Female,55-59,White,1,1,Very good,5,1,0,1
2,0,20.34,0,0,1,0,0,0,Female,80 or older,White,0,1,Very good,7,0,0,0
3,0,26.58,1,0,0,20,30,0,Male,65-69,White,1,1,Fair,8,1,0,0
4,0,24.21,0,0,0,0,0,0,Female,75-79,White,0,0,Good,6,0,0,1
5,0,23.71,0,0,0,28,0,1,Female,40-44,White,0,1,Very good,8,0,0,0
6,1,28.87,1,0,0,6,0,1,Female,75-79,Black,0,0,Fair,12,0,0,0
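
One of the difficulties listed earlier was writing a function to replace values in the data frame. The following is a minimal sketch of such a helper, not the notebook's own method. The function name `recodificar_sim_nao` and the toy data frame are hypothetical. It recodes only columns whose values are all "Yes" or "No" and returns them as numeric, so those columns would not need a later `as.numeric()` step.

```r
# Hypothetical helper (not part of the original notebook): recode columns that
# contain only "Yes"/"No" into numeric 1/0 and leave every other column alone.
recodificar_sim_nao <- function(dados) {
  mapa <- c(Yes = 1, No = 0)  # named lookup vector
  for (nome in names(dados)) {
    coluna <- dados[[nome]]
    # Only touch character columns made up exclusively of the mapped labels
    if (is.character(coluna) && all(coluna %in% names(mapa))) {
      dados[[nome]] <- unname(mapa[coluna])
    }
  }
  dados
}

# Toy data standing in for df (illustrative values only)
exemplo <- data.frame(
  fumante   = c("Yes", "No", "Yes"),
  diabetico = c("Yes", "No, borderline diabetes", "No"),
  IMC       = c(16.6, 20.34, 26.58),
  stringsAsFactors = FALSE
)
resultado <- recodificar_sim_nao(exemplo)
str(resultado)
```

In this sketch `diabetico` is left untouched because it holds a third label, which makes unhandled categories visible instead of silently mixing numbers and text in one column.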


In [6]:
# List the unique values of each column
library(dplyr)
valores_unicos <- df %>%
  summarise(across(everything(), ~ list(unique(.x))))
print(valores_unicos)


  doenca_cardiaca
1            0, 1

In [7]:
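# Recode the remaining text categories.
# NOTE: 'Yes (during pregnancy' has no closing parenthesis, so it never matches the
# dataset value 'Yes (during pregnancy)'. Those rows stay as text and become NA when
# the column is converted to numeric in cell 12. Adding the missing ')' fixes this,
# but it changes the results recorded in the later cells.
df <- data.frame(lapply(df, function(col) ifelse(col == 'Yes (during pregnancy', 0, ifelse(col == 'No, borderline diabetes', 1, col))))
df <- data.frame(lapply(df, function(col) ifelse(col == 'Female', 0, ifelse(col == 'Male', 1, col))))

valores_unicos <- df %>%
  summarise(across(everything(), ~ list(unique(.x))))
print(valores_unicos)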

  doenca_cardiaca
1            0, 1

In [8]:
# Classify a single BMI (IMC) value into a weight category
imc <- function(valor) {
  if (valor < 18.5) {
    return("baixopeso")
  } else if (valor <= 24.9) {
    return("eutrofia(pesoadequado)")
  } else if (valor <= 29.9) {
    return("sobrepeso")
  } else if (valor <= 34.9) {
    return("obesidadegrau1")
  } else if (valor <= 39.9) {
    return("obesidadegrau2")
  } else {
    return("obesidadeextrema")
  }
}

# Apply the function to the IMC column and create the new IMC_grau column
df$IMC_grau <- sapply(df$IMC, imc)

print(df)

     doenca_cardiaca   IMC fumante bebe_alcool teve_AVC PhysicalHealth
1                  0 16.60       1           0        0              3
2                  0 20.34       0           0        1              0
3                  0 26.58       1           0        0             20
4                  0 24.21       0           0        0              0
5                  0 23.71       0           0        0             28
6                  1 28.87       1           0        0              6
7                  0 21.63       0           0        0             15
8                  0 31.64       1           0        0              5
9                  0 26.45       0           0        0              0
10                 0 40.69       0           0        0              0
11                 1 34.30       1           0        0             30
12                 0 28.71       1           0        0              0
13                 0 28.37       1           0        0              0
14    
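
The `imc()` function above handles one value at a time, so `sapply()` has to call it once per row. As a hedged alternative, the same thresholds can be applied to the whole column at once. The function name `classificar_imc` is hypothetical and the input vector is toy data; the labels and cut-offs are the ones used in the cell above.

```r
# Hypothetical vectorized variant of imc(): same thresholds and labels,
# applied to a whole numeric vector. NA inputs stay NA.
classificar_imc <- function(valores) {
  grau <- rep(NA_character_, length(valores))
  # Assign from the highest band down, so lower bands overwrite higher ones
  grau[which(valores > 39.9)]  <- "obesidadeextrema"
  grau[which(valores <= 39.9)] <- "obesidadegrau2"
  grau[which(valores <= 34.9)] <- "obesidadegrau1"
  grau[which(valores <= 29.9)] <- "sobrepeso"
  grau[which(valores <= 24.9)] <- "eutrofia(pesoadequado)"
  grau[which(valores < 18.5)]  <- "baixopeso"
  grau
}

# Toy input (illustrative values only)
valores_teste <- c(16.6, 20.34, 26.58, 31.64, 36, 40.69, NA)
graus <- classificar_imc(valores_teste)
print(graus)
```

Under these assumptions, `df$IMC_grau <- classificar_imc(df$IMC)` would replace the `sapply()` call.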

In [9]:
# Drop the IMC column
df$IMC <- NULL
head(df)

Unnamed: 0_level_0,doenca_cardiaca,fumante,bebe_alcool,teve_AVC,PhysicalHealth,MentalHealth,dificul_andar_subir_escadas,sexo,IdadeCategoria,raca,diabetico,atividade_fisica_regular,GenHealth,horas_de_sono,asma,doenca_renal,cancer_de_pele,IMC_grau
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>
1,0,1,0,0,3,30,0,0,55-59,White,1,1,Very good,5,1,0,1,baixopeso
2,0,0,0,1,0,0,0,0,80 or older,White,0,1,Very good,7,0,0,0,eutrofia(pesoadequado)
3,0,1,0,0,20,30,0,1,65-69,White,1,1,Fair,8,1,0,0,sobrepeso
4,0,0,0,0,0,0,0,0,75-79,White,0,0,Good,6,0,0,1,eutrofia(pesoadequado)
5,0,0,0,0,28,0,1,0,40-44,White,0,1,Very good,8,0,0,0,eutrofia(pesoadequado)
6,1,1,0,0,6,0,1,0,75-79,Black,0,0,Fair,12,0,0,0,sobrepeso


In [10]:
# Find columns that are entirely NA
colunas_vazias <- colnames(df)[colSums(is.na(df)) == nrow(df)]

print(colunas_vazias)

character(0)
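
The check above only detects columns that are completely empty. Since handling missing values is one of the sub-objectives, a per-column count of `NA` values may be more informative. The sketch below uses a toy data frame, and the variable names `na_por_coluna` and `colunas_com_na` are illustrative, not from the notebook.

```r
# Toy data frame standing in for df (illustrative values only)
exemplo <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA),
  c = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_por_coluna <- colSums(is.na(exemplo))

# Columns with at least one NA, and columns that are entirely NA
colunas_com_na <- names(na_por_coluna)[na_por_coluna > 0]
colunas_vazias <- names(na_por_coluna)[na_por_coluna == nrow(exemplo)]

print(na_por_coluna)
print(colunas_com_na)
print(colunas_vazias)
```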


In [11]:
str(df)

'data.frame':	319795 obs. of  18 variables:
 $ doenca_cardiaca            : chr  "0" "0" "0" "0" ...
 $ fumante                    : chr  "1" "0" "1" "0" ...
 $ bebe_alcool                : chr  "0" "0" "0" "0" ...
 $ teve_AVC                   : chr  "0" "1" "0" "0" ...
 $ PhysicalHealth             : num  3 0 20 0 28 6 15 5 0 0 ...
 $ MentalHealth               : num  30 0 30 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas: chr  "0" "0" "0" "0" ...
 $ sexo                       : chr  "0" "0" "1" "0" ...
 $ IdadeCategoria             : chr  "55-59" "80 or older" "65-69" "75-79" ...
 $ raca                       : chr  "White" "White" "White" "White" ...
 $ diabetico                  : chr  "1" "0" "1" "0" ...
 $ atividade_fisica_regular   : chr  "1" "1" "1" "0" ...
 $ GenHealth                  : chr  "Very good" "Very good" "Fair" "Good" ...
 $ horas_de_sono              : num  5 7 8 6 8 12 4 9 5 10 ...
 $ asma                       : chr  "1" "0" "1" "0" ...
 $ doenca_renal       

In [12]:
# Convert the recoded columns to numeric
df <- df%>%
  mutate(
    doenca_cardiaca = as.numeric(doenca_cardiaca),
    fumante = as.numeric(fumante),
    bebe_alcool = as.numeric(bebe_alcool),
    teve_AVC = as.numeric(teve_AVC),
    dificul_andar_subir_escadas = as.numeric(dificul_andar_subir_escadas),
    sexo = as.numeric(sexo),
    diabetico = as.numeric(diabetico),
    atividade_fisica_regular = as.numeric(atividade_fisica_regular),
    asma = as.numeric(asma),
    doenca_renal = as.numeric(doenca_renal),
    cancer_de_pele  = as.numeric(cancer_de_pele)
  )

str(df)

ℹ In argument: `diabetico = as.numeric(diabetico)`.
! NAs introduced by coercion


'data.frame':	319795 obs. of  18 variables:
 $ doenca_cardiaca            : num  0 0 0 0 0 1 0 0 0 0 ...
 $ fumante                    : num  1 0 1 0 0 1 0 1 0 0 ...
 $ bebe_alcool                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                   : num  0 1 0 0 0 0 0 0 0 0 ...
 $ PhysicalHealth             : num  3 0 20 0 28 6 15 5 0 0 ...
 $ MentalHealth               : num  30 0 30 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas: num  0 0 0 0 1 1 0 1 0 1 ...
 $ sexo                       : num  0 0 1 0 0 0 0 0 0 1 ...
 $ IdadeCategoria             : chr  "55-59" "80 or older" "65-69" "75-79" ...
 $ raca                       : chr  "White" "White" "White" "White" ...
 $ diabetico                  : num  1 0 1 0 0 0 0 1 1 0 ...
 $ atividade_fisica_regular   : num  1 1 1 0 1 0 1 0 0 1 ...
 $ GenHealth                  : chr  "Very good" "Very good" "Fair" "Good" ...
 $ horas_de_sono              : num  5 7 8 6 8 12 4 9 5 10 ...
 $ asma                       : num  1 0 1 0 0 
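
The warning above says that some `diabetico` values could not be converted to numbers. A likely cause is the category string in cell 7, which is written without its closing parenthesis and therefore never matches. A small diagnostic like the following can list the offending values before converting. The function name `valores_nao_numericos` and the sample vector are hypothetical.

```r
# Hypothetical diagnostic: return the distinct values of a character vector
# that as.numeric() cannot convert (values that were already NA are ignored).
valores_nao_numericos <- function(x) {
  convertido <- suppressWarnings(as.numeric(x))
  unique(x[is.na(convertido) & !is.na(x)])
}

# Toy vector standing in for df$diabetico before conversion
exemplo <- c("1", "0", "Yes (during pregnancy)", "1", NA)
problemas <- valores_nao_numericos(exemplo)
print(problemas)
```

Running this on each column before the `mutate()` call would show which labels still need a recoding rule.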

In [13]:
# Look up the unique values in the IdadeCategoria column
valores_unicos <- unique(df$IdadeCategoria)

# Print the unique values
print(valores_unicos)

 [1] "55-59"       "80 or older" "65-69"       "75-79"       "40-44"      
 [6] "70-74"       "60-64"       "50-54"       "45-49"       "18-24"      
[11] "35-39"       "30-34"       "25-29"      


In [14]:
# Age ranges that count as "Adulto" (adult); every other range becomes "Idoso" (elderly)
faixas_adulto <- c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59")

# Replace the values in the IdadeCategoria column
df$IdadeCategoria <- ifelse(df$IdadeCategoria %in% faixas_adulto, "Adulto", "Idoso")

# Store the updated unique values (they are printed in the next cell)
valores_unicos_atualizados <- unique(df$IdadeCategoria)

In [15]:
valores_unicos <- unique(df$IdadeCategoria)

# Print the unique values
print(valores_unicos)

[1] "Adulto" "Idoso" 


In [16]:
# Drop the columns that will not be used
df <- df[, !(names(df) %in% c("PhysicalHealth", "MentalHealth", "horas_de_sono"))]

In [17]:
str(df)

'data.frame':	319795 obs. of  15 variables:
 $ doenca_cardiaca            : num  0 0 0 0 0 1 0 0 0 0 ...
 $ fumante                    : num  1 0 1 0 0 1 0 1 0 0 ...
 $ bebe_alcool                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                   : num  0 1 0 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas: num  0 0 0 0 1 1 0 1 0 1 ...
 $ sexo                       : num  0 0 1 0 0 0 0 0 0 1 ...
 $ IdadeCategoria             : chr  "Adulto" "Idoso" "Idoso" "Idoso" ...
 $ raca                       : chr  "White" "White" "White" "White" ...
 $ diabetico                  : num  1 0 1 0 0 0 0 1 1 0 ...
 $ atividade_fisica_regular   : num  1 1 1 0 1 0 1 0 0 1 ...
 $ GenHealth                  : chr  "Very good" "Very good" "Fair" "Good" ...
 $ asma                       : num  1 0 1 0 0 0 1 1 0 0 ...
 $ doenca_renal               : num  0 0 0 0 0 0 0 0 1 0 ...
 $ cancer_de_pele             : num  1 0 0 1 0 0 1 0 0 0 ...
 $ IMC_grau                   : chr  "baixopeso" "eutrofia(

In [18]:
# Inspect the unique values of the GenHealth column
valores_unicos <- unique(df$GenHealth)

# Print the unique values
print(valores_unicos)

[1] "Very good" "Fair"      "Good"      "Poor"      "Excellent"


In [19]:
# Remove the space from the "Very good" label in GenHealth
df$GenHealth[df$GenHealth == "Very good"] <- "Verygood"

# Print the updated unique values
valores_unicos_atualizados <- unique(df$GenHealth)
print(valores_unicos_atualizados)

[1] "Verygood"  "Fair"      "Good"      "Poor"      "Excellent"


In [20]:
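# Create dummy variables with model.matrix.
# Rows containing NA are dropped here, because model.matrix builds a model frame
# with the default na.action (na.omit).
df_dummies <- as.data.frame(model.matrix(~ . - 1, data = df))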

head(df_dummies)

Unnamed: 0_level_0,doenca_cardiaca,fumante,bebe_alcool,teve_AVC,dificul_andar_subir_escadas,sexo,IdadeCategoriaAdulto,IdadeCategoriaIdoso,racaAsian,racaBlack,⋯,GenHealthPoor,GenHealthVerygood,asma,doenca_renal,cancer_de_pele,IMC_graueutrofia(pesoadequado),IMC_grauobesidadeextrema,IMC_grauobesidadegrau1,IMC_grauobesidadegrau2,IMC_grausobrepeso
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,1,0,0,0,0,1,0,0,0,⋯,0,1,1,0,1,0,0,0,0,0
2,0,0,0,1,0,0,0,1,0,0,⋯,0,1,0,0,0,1,0,0,0,0
3,0,1,0,0,0,1,0,1,0,0,⋯,0,0,1,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,1,0,0,⋯,0,0,0,0,1,1,0,0,0,0
5,0,0,0,0,1,0,1,0,0,0,⋯,0,1,0,0,0,1,0,0,0,0
6,1,1,0,0,1,0,0,1,0,1,⋯,0,0,0,0,0,0,0,0,0,1
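
Two behaviors of `model.matrix(~ . - 1, ...)` are easy to miss: with the intercept removed, the first categorical column keeps an indicator for every level while later ones drop a reference level, and rows containing `NA` are removed under the default `na.action`. The toy example below illustrates both. Its column names and values are made up for illustration.

```r
# Toy data frame (illustrative only): one NA in the numeric column
exemplo <- data.frame(
  alvo  = c(0, 1, 0, NA),
  grupo = c("A", "B", "A", "B"),
  nivel = c("baixo", "alto", "medio", "alto"),
  stringsAsFactors = FALSE
)

# Same call pattern as the notebook cell above
dummies <- as.data.frame(model.matrix(~ . - 1, data = exemplo))

print(dummies)
print(nrow(exemplo))  # rows before
print(nrow(dummies))  # rows after: the row with NA is gone
```

Comparing `nrow()` before and after the call is a quick way to notice rows lost to missing values.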


In [21]:
# Every column is now numeric, holding 0 or 1
str(df_dummies)

'data.frame':	317236 obs. of  27 variables:
 $ doenca_cardiaca               : num  0 0 0 0 0 1 0 0 0 0 ...
 $ fumante                       : num  1 0 1 0 0 1 0 1 0 0 ...
 $ bebe_alcool                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                      : num  0 1 0 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas   : num  0 0 0 0 1 1 0 1 0 1 ...
 $ sexo                          : num  0 0 1 0 0 0 0 0 0 1 ...
 $ IdadeCategoriaAdulto          : num  1 0 0 0 1 0 0 0 0 0 ...
 $ IdadeCategoriaIdoso           : num  0 1 1 1 0 1 1 1 1 1 ...
 $ racaAsian                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaBlack                     : num  0 0 0 0 0 1 0 0 0 0 ...
 $ racaHispanic                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaOther                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaWhite                     : num  1 1 1 1 1 0 1 1 1 1 ...
 $ diabetico                     : num  1 0 1 0 0 0 0 1 1 0 ...
 $ atividade_fisica_regular      : num  1 1 1 0 1 0 1 0 0 1 

In [22]:
# Correlation matrix; with this many columns the printed output is hard to read
correlacao <- cor(df_dummies)
head(correlacao, 100)

Unnamed: 0,doenca_cardiaca,fumante,bebe_alcool,teve_AVC,dificul_andar_subir_escadas,sexo,IdadeCategoriaAdulto,IdadeCategoriaIdoso,racaAsian,racaBlack,⋯,GenHealthPoor,GenHealthVerygood,asma,doenca_renal,cancer_de_pele,IMC_graueutrofia(pesoadequado),IMC_grauobesidadeextrema,IMC_grauobesidadegrau1,IMC_grauobesidadegrau2,IMC_grausobrepeso
doenca_cardiaca,1.0,0.10780795,-0.032667238,0.1969750133,0.201366817,0.069252663,-0.210973157,0.210973157,-0.030185875,-0.01047156,⋯,0.174596181,-0.102115463,0.0415420054,0.14504013,0.0933090377,-0.049020108,0.021171004,0.0252990362,0.023853128,0.003353678
fumante,0.10780795,1.0,0.11171649,0.0610108136,0.120128472,0.085237372,-0.097263603,0.097263603,-0.060004623,-0.03808824,⋯,0.086550413,-0.05233091,0.0238568325,0.034743214,0.0339604597,-0.032538818,0.007197875,0.0131467221,0.008474163,0.009523214
bebe_alcool,-0.032667238,0.11171649,1.0,-0.0199336361,-0.03555208,0.003878966,0.056102634,-0.056102634,-0.02225543,-0.02588137,⋯,-0.017341385,0.013084029,-0.0022128319,-0.028419744,-0.0058741289,0.028695761,-0.022755787,-0.0143518275,-0.019243723,0.005607271
teve_AVC,0.196975013,0.061010814,-0.019933636,1.0,0.174226766,-0.003501323,-0.119677218,0.119677218,-0.016095611,0.02485745,⋯,0.13371471,-0.069454189,0.0389742019,0.091251699,0.0480291681,-0.021085686,0.010050961,0.0125458599,0.008063519,-0.001611544
dificul_andar_subir_escadas,0.201366817,0.120128472,-0.03555208,0.1742267657,1.0,-0.069950965,-0.2016199,0.2016199,-0.03857048,0.03957076,⋯,0.309027799,-0.185149966,0.1034577395,0.153216385,0.0647695648,-0.094653371,0.144293268,0.040094133,0.082401896,-0.059439816
sexo,0.069252663,0.085237372,0.003878966,-0.0035013232,-0.069950965,1.0,0.051067666,-0.051067666,0.01447531,-0.03763934,⋯,-0.011104166,-0.003289453,-0.0686688354,-0.009549371,0.0126520892,-0.101809408,-0.047849168,0.0339202246,-0.008504832,0.106690209
IdadeCategoriaAdulto,-0.210973157,-0.097263603,0.056102634,-0.1196772175,-0.2016199,0.051067666,1.0,-1.0,0.064814047,0.0356513,⋯,-0.067922919,0.025268346,0.04452755,-0.109431899,-0.2400250824,0.009670278,0.051409769,-0.0039730799,0.024380805,-0.043083421
IdadeCategoriaIdoso,0.210973157,0.097263603,-0.056102634,0.1196772175,0.2016199,-0.051067666,-1.0,1.0,-0.064814047,-0.0356513,⋯,0.067922919,-0.025268346,-0.04452755,0.109431899,0.2400250824,-0.009670278,-0.051409769,0.0039730799,-0.024380805,0.043083421
racaAsian,-0.030185875,-0.060004623,-0.02225543,-0.0160956111,-0.03857048,0.01447531,0.064814047,-0.064814047,1.0,-0.04463471,⋯,-0.017984917,-0.003198975,-0.0169493324,-0.016795545,-0.0479650415,0.066317292,-0.029530459,-0.038080244,-0.032500885,-0.007058883
racaBlack,-0.010471556,-0.038088241,-0.025881366,0.0248574525,0.039570759,-0.037639338,0.035651304,-0.035651304,-0.044634713,1.0,⋯,0.011035099,-0.043067756,0.0214565143,0.010351817,-0.0835466125,-0.048500135,0.051113016,0.0274806358,0.038688986,-0.019650265
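
Because the full matrix is hard to read, one option is to keep only the column for the target variable and sort it by absolute value. The sketch below shows the idea on a toy data frame. In the notebook the target column would be `doenca_cardiaca`; the names `alvo`, `x1`, `x2` and `x3` are hypothetical.

```r
# Toy data (illustrative only): x1 copies the target, x2 mirrors it
exemplo <- data.frame(
  alvo = c(0, 0, 1, 1, 0, 1),
  x1   = c(0, 0, 1, 1, 0, 1),
  x2   = c(1, 1, 0, 0, 1, 0),
  x3   = c(0, 1, 0, 1, 0, 1)
)

correlacao <- cor(exemplo)

# Keep the target's column, drop its correlation with itself
com_alvo <- correlacao[, "alvo"]
com_alvo <- com_alvo[names(com_alvo) != "alvo"]

# Sort by strength of association, regardless of sign
ordenado <- com_alvo[order(abs(com_alvo), decreasing = TRUE)]
print(round(ordenado, 3))
```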


# Splitting the Data into Training and Test (Validation) Sets

In [23]:
# Split the data into training and validation sets (70% / 30%)
prop_treino <- 0.7
n_treino <- round(prop_treino * nrow(df_dummies))

# Random sampling for the training and validation sets
set.seed(123)  # Fix the seed for reproducibility
indices_treino <- sample(nrow(df_dummies), n_treino)
dados_treino <- df_dummies[indices_treino, ]
dados_validacao <- df_dummies[-indices_treino, ]
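
The split above is a simple random sample. When the target classes are unbalanced, a stratified split keeps the same class proportions in both sets. This is a swapped-in technique, not what the notebook does. The function name `estratificar` and the toy target vector are hypothetical.

```r
# Hypothetical helper: sample prop_treino of the row indices within each class
estratificar <- function(alvo, prop_treino = 0.7) {
  por_classe <- split(seq_along(alvo), alvo)
  indices <- unlist(lapply(por_classe, function(idx) {
    n <- round(prop_treino * length(idx))
    idx[sample.int(length(idx), n)]
  }))
  sort(unname(indices))
}

set.seed(123)  # Fix the seed for reproducibility
alvo <- c(rep(0, 90), rep(1, 10))  # toy target: 10% positives

indices_treino <- estratificar(alvo)
print(table(alvo[indices_treino]))   # class counts in the training set
print(table(alvo[-indices_treino]))  # class counts in the validation set
```

With the notebook's objects, the call would be `estratificar(df_dummies$doenca_cardiaca)`, and the resulting indices would be used exactly like `indices_treino` above.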

In [24]:
str(dados_treino)

'data.frame':	222065 obs. of  27 variables:
 $ doenca_cardiaca               : num  0 0 0 0 0 0 0 1 0 0 ...
 $ fumante                       : num  0 1 1 0 0 1 0 0 1 0 ...
 $ bebe_alcool                   : num  0 0 0 1 0 0 0 0 0 0 ...
 $ teve_AVC                      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas   : num  0 0 0 0 0 0 1 1 0 0 ...
 $ sexo                          : num  1 1 0 1 0 1 0 0 0 0 ...
 $ IdadeCategoriaAdulto          : num  1 1 0 0 1 0 0 0 0 0 ...
 $ IdadeCategoriaIdoso           : num  0 0 1 1 0 1 1 1 1 1 ...
 $ racaAsian                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaBlack                     : num  0 0 0 0 1 0 0 0 0 0 ...
 $ racaHispanic                  : num  0 0 0 0 0 1 1 0 0 0 ...
 $ racaOther                     : num  0 0 0 0 0 0 0 0 1 0 ...
 $ racaWhite                     : num  1 1 1 1 0 0 0 1 0 1 ...
 $ diabetico                     : num  1 0 0 0 0 0 0 0 1 0 ...
 $ atividade_fisica_regular      : num  1 1 0 1 1 0 1 1 1 1 

In [25]:
str(dados_validacao)

'data.frame':	95171 obs. of  27 variables:
 $ doenca_cardiaca               : num  0 1 0 0 0 0 0 0 0 0 ...
 $ fumante                       : num  0 1 0 1 1 0 1 0 0 0 ...
 $ bebe_alcool                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ teve_AVC                      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ dificul_andar_subir_escadas   : num  1 1 1 1 1 0 1 0 0 0 ...
 $ sexo                          : num  0 0 0 0 1 0 0 0 0 1 ...
 $ IdadeCategoriaAdulto          : num  1 0 0 0 0 0 0 0 0 0 ...
 $ IdadeCategoriaIdoso           : num  0 1 1 1 1 1 1 1 1 1 ...
 $ racaAsian                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaBlack                     : num  0 1 0 0 0 0 0 0 1 0 ...
 $ racaHispanic                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaOther                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ racaWhite                     : num  1 0 1 1 1 1 1 1 0 1 ...
 $ diabetico                     : num  0 0 0 0 1 0 0 0 0 0 ...
 $ atividade_fisica_regular      : num  1 0 0 0 1 1 1 1 1 1 .

In [26]:
# Separate the features (X) from the target variable (y)
X_treino <- dados_treino[, names(dados_treino) != "doenca_cardiaca"]
y_treino <- dados_treino$doenca_cardiaca
X_validacao <- dados_validacao[, names(dados_validacao) != "doenca_cardiaca"]
y_validacao <- dados_validacao$doenca_cardiaca

# Decision Tree Model

In [27]:
# Install rpart only if it is missing
if (!requireNamespace("rpart", quietly = TRUE)) install.packages("rpart")
library(rpart)

# Build the decision tree model
modelo_arvore <- rpart(y_treino ~ ., data = X_treino, method = "class")

# Predict on the validation set
predicoes <- predict(modelo_arvore, newdata = X_validacao, type = "class")

# Accuracy, as a percentage
accuracy <- mean(predicoes == y_validacao) * 100
accuracy <- round(accuracy, 2)

# Print the result
cat("Accuracy:", accuracy, "% correct", "\n")

Accuracy: 91.34 % correct 
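
Accuracy alone can be misleading when one class is much more common than the other, because a model that always predicts the majority class already scores high. The notebook does not compute this baseline, so the sketch below only shows how the comparison could be made. It uses toy vectors in place of `y_validacao` and `predicoes`, and the names `linha_de_base` and `sensibilidade` are illustrative.

```r
# Toy vectors (illustrative only); in the notebook these would be
# y_validacao and predicoes
y_real <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
y_pred <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)  # a model that always predicts 0

# Accuracy of the model, as a percentage
acuracia <- mean(y_pred == y_real) * 100

# Baseline: accuracy of always predicting the most frequent class
linha_de_base <- max(table(y_real)) / length(y_real) * 100

# Confusion matrix with both classes forced to appear
matriz <- table(
  Real     = factor(y_real, levels = c(0, 1)),
  Previsto = factor(y_pred, levels = c(0, 1))
)

# Sensitivity (recall): share of real positives that were predicted positive
sensibilidade <- matriz["1", "1"] / sum(matriz["1", ])

print(matriz)
cat("Accuracy:", acuracia, "% | Baseline:", linha_de_base,
    "% | Sensitivity:", sensibilidade, "\n")
```

If the model's accuracy is close to the baseline and the sensitivity is near zero, the tree is adding little beyond predicting the majority class.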
