# Project 2. Introduction to Artificial Intelligence

## Objective:

Apply machine learning concepts to solve a classification problem using the methods covered in the course.

---

## Activities:

1. [X] Visit the Kaggle website and download the heart disease dataset: https://www.kaggle.com/ronitf/heart-disease-uci (in this dataset the target column is `target`; this is the one you should use).
2. [X] Read the data description carefully.
3. [X] Create a Python notebook (it can be a Kaggle kernel or a local Jupyter notebook).
4. [X] Load the dataset as a pandas DataFrame. Plot every variable using matplotlib. Numerical variables must be plotted as histograms and categorical variables as pie charts, so it is important to first classify the variables into these two groups. Your visualizations should look like the images below for your numerical and categorical variables:


5. [ ] Fill in or remove any missing values in the dataset, if there are any.
6. [ ] Split the dataset in two: 80% for training and 20% for testing.
7. [ ] Train a decision tree model. Tune the necessary parameters to obtain a good result. Report the model's accuracy on the training set and on the test set. Also report the confusion matrix for the test set.
8. [ ] Interpret the resulting model. To do so, you can print the model produced by Python. How easy is it to interpret?
9. [ ] Repeat steps 7 and 8 for a Naive Bayes model and a neural network.
10. [ ] Compare the results of the 3 models in terms of accuracy, stability, and interpretability.
11. [ ] In your opinion, which of the 3 methods would you use to predict heart disease, and why?
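
Steps 6 and 7 can be sketched as follows. This is only a minimal sketch, not a reference solution: it uses a synthetic dataset from `make_classification` as a stand-in for `heart.csv`, and the values `max_depth=3` and `random_state=0` are arbitrary assumptions rather than tuned parameters.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for heart.csv (assumption): 303 rows, 13 features, binary target.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

# Step 6: 80% for training, 20% for testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 7: fit a shallow decision tree (max_depth=3 is an arbitrary starting point).
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Accuracy on both sets, plus the confusion matrix for the test set.
train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
cm = confusion_matrix(y_test, tree.predict(X_test))

print('Train accuracy:', round(train_acc, 3))
print('Test accuracy:', round(test_acc, 3))
print(cm)
```

A large gap between the training and test accuracy usually suggests the tree is overfitting, in which case a smaller `max_depth` is worth trying.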

---

## Deliverables

* [ ] A compressed folder with the Jupyter notebook and the data. The first line of your notebook must install, using conda, every library your program needs in order to run. The numpy, pandas, matplotlib and sklearn libraries already come installed with the full version of Anaconda.
* [ ] The notebook must contain enough comments and notes to serve as the final report.
* [ ] A 15-minute presentation, to be given in the classroom on September 6. The presentation should focus on the results and the conclusions you reached.

---


## Attributes
1. age (in years)
2. sex (1 = male; 0 = female)
3. cp => chest pain type (4 values)
4. trestbps => resting blood pressure (in mm Hg on admission to the hospital) 
5. chol => serum cholesterol in mg/dl
6. fbs => fasting blood sugar > 120 mg/dl (1 = true; 0 = false) 
7. restecg => resting electrocardiographic results (values 0,1,2)
8. thalach => maximum heart rate achieved
9. exang => exercise induced angina (1 = yes; 0 = no) 
10. oldpeak => ST depression induced by exercise relative to rest
11. slope => the slope of the peak exercise ST segment
12. ca => number of major vessels (0-3) colored by fluoroscopy
13. thal => 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target =>
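
Before plotting, the attributes have to be grouped into numerical and categorical variables. One possible way to do this automatically is to count the distinct values in each column. The sketch below assumes a threshold of 4 distinct values (`MAX_CATEGORIES`, a hypothetical name and value) and runs on a tiny frame of made-up illustrative values, not on the real `heart.csv`.

```python
import pandas as pd

# Tiny frame of made-up values, only to illustrate the grouping (assumption).
sample = pd.DataFrame({
    'age': [63, 37, 41, 56, 57, 45],
    'sex': [1, 1, 0, 1, 0, 1],
    'chol': [233, 250, 204, 236, 354, 192],
    'cp': [3, 2, 1, 1, 0, 2],
})

# Hypothetical rule: columns with few distinct values are treated as categorical.
MAX_CATEGORIES = 4

categorical = [c for c in sample.columns if sample[c].nunique() <= MAX_CATEGORIES]
numeric = [c for c in sample.columns if c not in categorical]

print('categorical:', categorical)
print('numeric:', numeric)
```

A count-based rule is only a heuristic; the attribute descriptions above remain the authority on which columns are categories.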

In [53]:
#%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

allData = pd.read_csv('heart.csv')
# Feature columns only (everything except 'target'); the plotting cells below read from `data`.
data = allData[['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']]


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

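# 1--- AGE OF PATIENT PLOT
plt.figure()
counts, bins = np.histogram(data['age'])
plt.hist(bins[:-1], bins, weights=counts)
plt.xlabel('Age')
plt.ylabel('Patients')
# Annotate with the statistics of the actual column instead of placeholder values.
age_mean, age_std = data['age'].mean(), data['age'].std()
plt.text(30, 45, rf'$\mu={age_mean:.1f},\ \sigma={age_std:.1f}$')
#plt.xlim(40, 160)
#plt.ylim(0, 0.03)
plt.grid(True)
plt.title('Age Histogram')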

# 2--- SEX OF PATIENT PLOT
plt.figure()
list_sex = data['sex']
length = len(list_sex)
male = np.count_nonzero(list_sex == 1)
pie_sex = [male, length-male]
labels = ['male', 'female']
colors = ['tab:cyan','r']
explode = (0.1, 0)
plt.pie(pie_sex,labels=labels,colors=colors,explode=explode,autopct='%1.1f%%',shadow=True, startangle=90);
plt.title('Sex of patient');

# 3--- CHEST PAIN TYPE PLOT
plt.figure()
list_cp = data['cp']
length = len(list_cp)
cp0 = np.count_nonzero(list_cp == 0)
cp1 = np.count_nonzero(list_cp == 1)
cp2 = np.count_nonzero(list_cp == 2)
cp3 = np.count_nonzero(list_cp == 3)
pie_cp = [cp0,cp1,cp2,cp3]
labels = ['cp=0', 'cp=1', 'cp=2', 'cp=3']
colors = ['palegreen','moccasin','coral','r']
explode = (0.1, 0,0,0)
plt.pie(pie_cp,labels=labels,colors=colors,explode=explode,autopct='%1.1f%%',shadow=True, startangle=90);
plt.title('chest pain type');

# 4--- RESTING BLOOD PRESSURE PLOT
plt.figure()
counts, bins = np.histogram(data['trestbps'])
plt.hist(bins[:-1], bins, weights=counts,color='k', alpha=0.5)
plt.xlabel('mm Hg')
plt.ylabel('Patients')
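# Annotate with the statistics of the actual column instead of placeholder values.
bp_mean, bp_std = data['trestbps'].mean(), data['trestbps'].std()
plt.text(160, 45, rf'$\mu={bp_mean:.1f},\ \sigma={bp_std:.1f}$')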
#plt.xlim(0, 160)
#plt.ylim(0, 0.03)
plt.grid(True)
plt.title('Resting Blood Pressure Histogram');

# 5--- SERUM CHOLESTEROL PLOT
plt.figure()
counts, bins = np.histogram(data['chol'])
plt.hist(bins[:-1], bins, weights=counts)
plt.xlabel('mg/dl')
plt.ylabel('Patients')
#plt.xlim(40, 160)
#plt.ylim(0, 0.03)
plt.grid(True)
plt.title('Serum Cholesterol Histogram')

# 6--- FASTING BLOOD SUGAR PLOT
plt.figure()
list_fbs = data['fbs']
length = len(list_fbs)
fbs = np.count_nonzero(list_fbs == 1)
pie_fbs = [fbs, length-fbs]
labels = ['fbs=1', 'fbs=0']
colors = ['coral','palegreen']
explode = (0.1, 0)
plt.pie(pie_fbs,labels=labels,colors=colors,explode=explode,autopct='%1.1f%%',shadow=True, startangle=90);
plt.title('Fasting blood sugar');

# 7--- RESTING ELECTROCARDIOGRAPHIC PLOT
plt.figure()
list_restecg = data['restecg']
re0 = np.count_nonzero(list_restecg == 0)
re1 = np.count_nonzero(list_restecg == 1)
re2 = np.count_nonzero(list_restecg == 2)
pie_rest = [re0,re1,re2]
labels = ['0','1','2']
colors = ['tab:cyan','pink','lightgrey']
explode = (0.1, 0.1,0)
plt.pie(pie_rest,labels=labels,colors=colors,explode=explode,autopct='%1.1f%%',shadow=True, startangle=90);
plt.title('Resting Electrocardiographic Results');

# 8--- MAXIMUM HEART RATE ACHIEVED PLOT
plt.figure()
counts, bins = np.histogram(data['thalach'])
plt.hist(bins[:-1], bins, weights=counts)
plt.xlabel('bpm')
plt.ylabel('Patients')
#plt.xlim(40, 160)
#plt.ylim(0, 0.03)
plt.grid(True)
plt.title('Maximum Heart Rate Achieved Histogram')

# 9--- EXERCISE INDUCED ANGINA PLOT  exang => exercise induced angina
plt.figure()
list_ex = data['exang']
length = len(list_ex)
ex = np.count_nonzero(list_ex == 1)
pie_ex = [length-ex,ex]
labels = ['0', '1']
colors = ['tab:cyan','r']
explode = (0.1, 0)
plt.pie(pie_ex,labels=labels,colors=colors,explode=explode,autopct='%1.1f%%',shadow=True, startangle=90);
plt.title('Exercise Induced Angina');

# 10-- ST DEPRESSION INDUCED BY EXERCISE RELATIVE TO REST ST PLOT
plt.figure()
counts, bins = np.histogram(data['oldpeak'])
plt.hist(bins[:-1], bins, weights=counts)
plt.xlabel('oldpeak')
plt.ylabel('Patients')
#plt.xlim(40, 160)
#plt.ylim(0, 0.03)
plt.grid(True)
plt.title('ST depression induced by exercise relative to rest Histogram')

# 11-- THE SLOPE OF THE PEAK EXERCISE ST SEGMENT PLOT   slope => the slope of the peak exercise ST segment
plt.figure()
list_slope = data['slope']
sl0 = np.count_nonzero(list_slope == 0)
sl1 = np.count_nonzero(list_slope == 1)
sl2 = np.count_nonzero(list_slope == 2)
pie_slope = [sl0,sl1,sl2]
labels = ['0','1','2']
colors = ['tab:cyan','pink','lightgrey']
explode = (0.1, 0,0)
plt.pie(pie_slope,labels=labels,colors=colors,explode=explode,autopct='%1.1f%%',shadow=True, startangle=90);
plt.title('The Slope of the Peak Exercise ST Segment');

# 12-- NUMBER OF MAJOR VESSELS PLOT  ca => number of major vessels (0-3) colored by fluoroscopy
plt.figure()
counts, bins = np.histogram(data['ca'])
plt.hist(bins[:-1], bins, weights=counts)
plt.xlabel('Number of Major Vessels')
plt.ylabel('Patients')
#plt.xlim(40, 160)
#plt.ylim(0, 0.03)
plt.grid(True)
plt.title('Number of Major Vessels Histogram');

# 13-- THAL PLOT
plt.figure()
list_thal = data['thal']
thal0 = np.count_nonzero(list_thal == 0)
thal1 = np.count_nonzero(list_thal == 1)
thal2 = np.count_nonzero(list_thal == 2)
thal3 = np.count_nonzero(list_thal == 3)

pie_thal = [thal0,thal1,thal2,thal3]
labels = ['0','1','2','3']
colors = ['tab:cyan','pink','lightgrey','r']
explode = (0.1, 0,0,0)
plt.pie(pie_thal,labels=labels,colors=colors,explode=explode,autopct='%1.1f%%',shadow=True, startangle=90);
plt.title('Thal PLOT');

Split the data: 80% for training, 20% for testing the predictions.

In [None]:
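# Note: the pandas documentation recommends .to_numpy() over .values (both give the same result).

# Positional 80/20 split. The original slices (:241 and 242:) silently dropped row 241.
splitIndex = int(len(allData) * 0.8)
trainSet = allData.iloc[:splitIndex, :]
testSet = allData.iloc[splitIndex:, :]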
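
A positional split keeps the row order of the file, so if the rows happen to be ordered by `target`, the test set can end up dominated by one class. A shuffled split avoids that. The sketch below is one way to do it with pandas, shown on a small hypothetical frame (`frame`, with made-up values) rather than on `allData`; `random_state=0` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for allData, deliberately ordered by target.
frame = pd.DataFrame({'age': range(20, 30), 'target': [1] * 5 + [0] * 5})

# Draw a random 80% of the rows for training; the rest becomes the test set.
train = frame.sample(frac=0.8, random_state=0)
test = frame.drop(train.index)

print(len(train), len(test))
```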

In [12]:
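import csv
import math
import random

def loadCsv(filename):
    # Read the file passed in (not a hard-coded path), skip the header row
    # with the column names, and convert every remaining value to float.
    with open(filename, newline='') as f:
        lines = csv.reader(f)
        next(lines)
        dataset = [[float(x) for x in row] for row in lines]
    return dataset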

def splitDataset(data,split):
    trainSize = int(len(data)*split)
    trainSet=[]
    copy = list(data)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet,copy]

In [None]:
def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

In [None]:
def mean(numbers):
    return sum(numbers)/float(len(numbers))

In [None]:
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

In [None]:
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

In [None]:
def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

In [None]:
def calculateProbability(x,mean,stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1/(math.sqrt(2*math.pi)*stdev))*exponent
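
For reference, `calculateProbability` evaluates the Gaussian (normal) density, where `mean` is $\mu$ and `stdev` is $\sigma$:

```latex
p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

The value is a density, not a probability, so it can be greater than 1 when $\sigma$ is small.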

In [None]:
def calculateClassProbabilities(summaries,inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range (len(classSummaries)):
            mean,stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    # Return only after every class has been scored (the return used to sit inside the loop).
    return probabilities
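
Multiplying many small densities can underflow to 0.0 when there are many attributes. A common workaround is to add log-densities instead of multiplying densities. The sketch below assumes the same `summaries` layout as the cell above (class value mapped to a list of `(mean, stdev)` pairs); the function names and the numbers are hypothetical, and like the cell above it does not include class priors.

```python
import math

def log_gaussian(x, mean, stdev):
    """Natural log of the Gaussian density at x."""
    return -math.log(math.sqrt(2 * math.pi) * stdev) - (x - mean) ** 2 / (2 * stdev ** 2)

def class_log_scores(summaries, input_vector):
    """Sum the per-attribute log-densities for each class."""
    scores = {}
    for class_value, class_summaries in summaries.items():
        scores[class_value] = sum(
            log_gaussian(input_vector[i], mean, stdev)
            for i, (mean, stdev) in enumerate(class_summaries)
        )
    return scores

# Made-up summaries for two classes and two attributes (illustration only).
summaries = {
    0.0: [(50.0, 5.0), (200.0, 20.0)],
    1.0: [(60.0, 5.0), (250.0, 20.0)],
}
scores = class_log_scores(summaries, [61.0, 245.0])
best = max(scores, key=scores.get)
print(best)
```

Because the logarithm is monotonic, the class with the highest log-score is also the class with the highest product of densities.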

In [None]:
def predict(summaries,inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    # Return after comparing all classes (the return used to sit inside the loop).
    return bestLabel

In [None]:
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

In [None]:
def getAccuracy(testSet, predictions):
    correct=0
    for x in range(len(testSet)):
        if testSet[x][-1]==predictions[x]:
            correct += 1
    # Compute the percentage after checking every row (the return used to sit inside the loop).
    return (correct/float(len(testSet)))*100.0
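
The assignment also asks for a confusion matrix, which the hand-written functions above do not produce. A minimal sketch in plain Python follows; `confusion_counts` is a hypothetical helper and the label lists are made up for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Made-up labels, only to show the layout.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)

# Rows are the actual class, columns the predicted class, both in the order [0, 1].
labels = [0, 1]
matrix = [[counts[(a, p)] for p in labels] for a in labels]
print(matrix)
```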

In [None]:
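# trainDat was never defined; splitDataset (defined above) does the same job on a list.
trainD = splitDataset(allData['age'].tolist(), 0.8)
len(trainD[0])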
def main():
    filename = 'heart.csv'
    splitRatio = 0.8  # 80% training, 20% testing, as the assignment requires
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset,splitRatio)
    print('Split {0} rows into train = {1} and test = {2} rows'.format(len(dataset),len(trainingSet),len(testSet)))
    #prepare model
    summaries = summarizeByClass(trainingSet)
    #test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))
    
main()

In [None]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
[df['col2'],df['col1']]

In [3]:
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import BernoulliNB

In [35]:
datairis = datasets.load_iris()
#datairis



In [20]:
datos = pd.read_csv('heart.csv')
datos[['age','sex']]



     age  sex
0     63    1
1     37    1
2     41    0
3     56    1
4     57    0
..   ...  ...
298   57    0
299   45    1
300   68    1
301   57    1


In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target

datos = pd.read_csv('heart.csv')
dataset = datos[['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']]


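split = 0.8
# splitDataset expects a list of rows. list() on a DataFrame yields only its
# column names, so passing the DataFrame directly fails; convert it to rows first.
[trainSet, test] = splitDataset(dataset.values.tolist(), split)
#print(dataset)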

Train the model

In [5]:
model = BernoulliNB()
model.fit(dataset,datos.target)

BernoulliNB()
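
`BernoulliNB` is meant for binary features, and with its default `binarize=0.0` it turns every input into "greater than 0 or not" before fitting, which discards most of the information in columns such as `age` or `chol`. `GaussianNB` models each feature with a normal distribution, like the hand-written functions earlier in this notebook. The sketch below shows it with a held-out test set; it runs on synthetic data from `make_classification` as an assumed stand-in for `heart.csv`, so its scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 13 feature columns and the target (assumption).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows only, then score both parts separately.
gnb = GaussianNB().fit(X_train, y_train)
train_score = gnb.score(X_train, y_train)
test_score = gnb.score(X_test, y_test)

print('Train accuracy:', round(train_score, 3))
print('Test accuracy:', round(test_score, 3))
```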

Make predictions

In [10]:
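# Note: the model is evaluated on the same rows it was fitted on,
# so the report below shows training-set metrics, not test-set metrics.
expected = datos.target
predicted = model.predict(dataset)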

In [9]:
print(model)

BernoulliNB()


In [11]:
print(metrics.classification_report(expected,predicted))
print(metrics.confusion_matrix(expected, predicted))

              precision    recall  f1-score   support

           0       0.84      0.78      0.81       138
           1       0.83      0.87      0.85       165

    accuracy                           0.83       303
   macro avg       0.83      0.83      0.83       303
weighted avg       0.83      0.83      0.83       303

[[108  30]
 [ 21 144]]
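
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when interpreting them. In scikit-learn's layout the rows are the actual classes and the columns the predicted classes. The variable names below are illustrative.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class.
cm = [[108, 30],
      [21, 144]]

tn, fp = cm[0]  # actual 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all rows classified correctly
precision_1 = tp / (tp + fp)          # of the rows predicted 1, how many are really 1
recall_1 = tp / (tp + fn)             # of the rows that are really 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# prints: 0.83 0.83 0.87 0.85
```

These match the `accuracy` row and the row for class 1 in the report.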


## Decision Tree
### Example:
https://www.youtube.com/watch?v=LDRbO9a6XPU  
GH code: https://github.com/random-forests/tutorials/blob/master/decision_tree.ipynb
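
For the interpretation step of the assignment, scikit-learn can print a fitted tree as plain-text rules with `export_text`. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f3`); with the real dataset the column names of `heart.csv` would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with 4 features (assumption, stands in for the real columns).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented line is a threshold test on one feature, and each leaf ends in a predicted class, so the depth of the tree directly controls how readable the model is.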

## Naïve Bayes Classifier
### What is it (basic):
https://www.youtube.com/watch?v=CPqOCI0ahss (I didn't like the video much, but it is easy to understand)

### How to plot in matplotlib
#### colors:
https://matplotlib.org/3.1.0/gallery/color/named_colors.html
#### histograms:
https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib.pyplot.hist
#### pie chart:
https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.pie.html#matplotlib.pyplot.pie

### NumPy notes
#### numpy count
https://note.nkmk.me/en/python-numpy-count/

* To run bash commands, use `!` at the beginning of the line

* To plot with matplotlib ->

In [None]:
#%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # this style is named 'seaborn-v0_8-whitegrid' in matplotlib 3.6 and later
import numpy as np

In [None]:
rng = np.random.RandomState(69)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);

In [None]:
%lsmagic

In [None]:
%%javascript
console.log("hello World");