## Heart Disease
This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer-valued, from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on distinguishing presence (values 1, 2, 3, 4) from absence (value 0).
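The presence-versus-absence framing described above can be sketched as follows. This is only an illustration: the `num` values are made up, and `has_disease` is a hypothetical name that does not appear in the notebook.

```python
import pandas as pd

# Made-up diagnosis values on the 0-4 scale (not real patient records).
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# Collapse values 1-4 into a single "presence" label; 0 stays "absence".
has_disease = (num > 0).astype(int)

print(has_disease.tolist())
```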

| Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
|---|---|---|---|---|---|---|
| age | Feature | Integer | Age | | years | no |
| sex | Feature | Categorical | Sex | | | no |
| cp | Feature | Categorical | | | | no |
| trestbps | Feature | Integer | | resting blood pressure (on admission to the hospital) | mm Hg | no |
| chol | Feature | Integer | | serum cholesterol | mg/dl | no |
| fbs | Feature | Categorical | | fasting blood sugar > 120 mg/dl | | no |
| restecg | Feature | Categorical | | | | no |
| thalach | Feature | Integer | | maximum heart rate achieved | | no |
| exang | Feature | Categorical | | exercise-induced angina | | no |
| oldpeak | Feature | Integer | | ST depression induced by exercise relative to rest | | no |
| slope | Feature | Categorical | | | | no |
| ca | Feature | Integer | | number of major vessels (0-3) colored by fluoroscopy | | yes |
| thal | Feature | Categorical | | | | yes |
| num | Target | Integer | | diagnosis of heart disease | | no |

In [1]:
import pandas as pd

columnas = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class"]
data = pd.read_csv('processed.cleveland.data', names=columnas)
data.head()

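| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |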
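The file marks missing entries with `?`, which makes `read_csv` parse the affected columns as strings. One possible alternative, sketched here, is to pass `na_values="?"` at load time so those entries become `NaN` directly. The first sample row is copied from the output above; the second is invented purely to show a `?`, and `sample` / `sample_df` are hypothetical names.

```python
import io
import pandas as pd

columnas = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class"]

# In-memory stand-in for the data file. The second row is invented
# for illustration and is not taken from the dataset.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "50.0,0.0,3.0,120.0,200.0,0.0,0.0,160.0,0.0,0.0,1.0,?,3.0,0\n"
)

# na_values tells pandas to treat '?' as a missing value while parsing.
sample_df = pd.read_csv(sample, names=columnas, na_values="?")

print(sample_df["ca"].dtype)         # numeric dtype instead of strings
print(sample_df["ca"].isna().sum())  # number of missing entries in ca
```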


In [2]:
# Count the non-null values in each column
data.count()

age         303
sex         303
cp          303
trestbps    303
chol        303
fbs         303
restecg     303
thalach     303
exang       303
oldpeak     303
slope       303
ca          303
thal        303
class       303
dtype: int64

In [3]:
print("Values in the AGE column\n")
data['age'] = data['age'].astype(int)  # Convert to integer
print(data['age'].unique())

print("Values in the SEX column\n")
data['sex'] = data['sex'].astype(int)  # Convert to integer
print(data['sex'].unique())

print("Values in the CP column\n")
print(data['cp'].unique())

print("Values in the trestbps column\n")
print(data['trestbps'].unique())

# Repeat for the remaining variables to check them

print("Values in the CA column\n")
print(data['ca'].unique())

# Replace the '?' placeholder with NumPy's NaN
import numpy as np

data = data.replace('?', np.nan)
print("Values in the CA column\n")
print(data['ca'].unique())

Values in the AGE column

[63 67 37 41 56 62 57 53 44 52 48 54 49 64 58 60 50 66 43 40 69 59 42 55
 61 65 71 51 46 45 39 68 47 34 35 29 70 77 38 74 76]
Values in the SEX column

[1 0]
Values in the CP column

[1. 4. 3. 2.]
Values in the trestbps column

[145. 160. 120. 130. 140. 172. 150. 110. 132. 117. 135. 112. 105. 124.
 125. 142. 128. 170. 155. 104. 180. 138. 108. 134. 122. 115. 118. 100.
 200.  94. 165. 102. 152. 101. 126. 174. 148. 178. 158. 192. 129. 144.
 123. 136. 146. 106. 156. 154. 114. 164.]
Values in the CA column

['0.0' '3.0' '2.0' '1.0' '?']
Values in the CA column

['0.0' '3.0' '2.0' '1.0' nan]
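Rather than repeating the same two lines for every variable, the check can be written as a loop. The sketch below uses a small made-up frame (`demo`, a hypothetical stand-in for `data`) and also counts the missing entries per column once `'?'` has been replaced.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for `data`; '?' marks a missing entry.
demo = pd.DataFrame({
    "sex": [1, 0, 1, 1],
    "cp": [1.0, 4.0, 3.0, 4.0],
    "ca": ["0.0", "3.0", "?", "0.0"],
})

# Turn the '?' placeholder into NaN so pandas recognises it as missing.
demo = demo.replace("?", np.nan)

# Print the distinct values of every column in one pass.
for col in demo.columns:
    print(f"{col}: {demo[col].unique()}")

# Count the missing entries in each column.
missing = demo.isna().sum()
print(missing)
```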


In [None]:
data.count()

### What to do about missing values?
1. Drop the missing values
    - When we have plenty of data
    - When the affected features contain little data
2. Impute the data
   - Numeric (mean)
   - Categorical (most frequent value)
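The two options above can be compared side by side on a toy frame. This is a minimal sketch with made-up values; `toy`, `dropped` and `imputed` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up values: one numeric feature and one categorical feature coded as numbers.
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 229.0, 250.0],
    "thal": [3.0, 7.0, np.nan, 3.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute. Mean for the numeric column,
# most frequent value (mode) for the categorical one.
imputed = toy.copy()
imputed["chol"] = imputed["chol"].fillna(imputed["chol"].mean())
imputed["thal"] = imputed["thal"].fillna(imputed["thal"].mode()[0])

print(len(dropped))
print(imputed)
```

Dropping is simplest but discards whole rows, while imputing keeps every row at the cost of introducing estimated values.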

In [5]:
# CA is treated as a categorical variable, so impute with the most frequent value
freq_ca = data['ca'].value_counts().idxmax()
print(freq_ca)
data['ca'] = data['ca'].replace(np.nan, freq_ca)


0.0



In [None]:

# Apply the same imputation to thal
freq_thal = data['thal'].value_counts().idxmax()
print(freq_thal)
data['thal'] = data['thal'].replace(np.nan, freq_thal)
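After this replacement the column still holds strings such as `'0.0'`, as the earlier `unique()` output shows. One possible follow-up, sketched here with made-up values, is to convert the column with `pd.to_numeric` so that the imputed value and the rest of the column are real numbers. The names `ca_num` and `ca_filled` are hypothetical.

```python
import pandas as pd

# Made-up values mimicking a string-typed column with '?' for missing entries.
ca = pd.Series(["0.0", "3.0", "?", "0.0", "1.0"], name="ca")

# errors="coerce" turns anything that cannot be parsed (here '?') into NaN.
ca_num = pd.to_numeric(ca, errors="coerce")

# Impute with the most frequent value, which is now a float rather than a string.
most_frequent = ca_num.mode()[0]
ca_filled = ca_num.fillna(most_frequent)

print(most_frequent)
print(ca_filled.tolist())
```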

In [None]:
media_edad = data['age'].mean()
print(media_edad)

In [None]:
data.count()

In [25]:
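from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Separate the features from the target column
X = data.drop('class', axis=1)
y = data['class']

# Hold out 30% of the rows for testing
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a random forest with default hyperparameters
rf = RandomForestClassifier()
rf.fit(xtrain, ytrain)

# Predict the class of the held-out rows
ypred = rf.predict(xtest)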
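Two optional refinements to the split-and-train step are sketched below on synthetic data: `stratify` keeps the class proportions similar in the train and test sets, and a fixed `random_state` on the classifier makes runs repeatable. The data is random noise generated only for illustration, and `X_demo`, `y_demo` and `rf_demo` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and a binary target (random, not real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = rng.integers(0, 2, size=100)

# stratify=y_demo keeps the class proportions similar in both splits.
xtr, xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fixing random_state makes the forest reproducible across runs.
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(xtr, ytr)

pred = rf_demo.predict(xte)
print(pred.shape)
```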

In [None]:
from sklearn.metrics import accuracy_score

print(f"Accuracy achieved: {accuracy_score(ytest, ypred):.2f}")
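Accuracy alone does not show which classes are being confused. A confusion matrix is one common complement. The labels below are hand-written for illustration and are not results from this notebook; `y_true` and `y_hat` are hypothetical names.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_hat)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_hat)

print(acc)
print(cm)
```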