<a href="https://colab.research.google.com/github/nfrn/Tutorial_Intro_ML_2Health/blob/master/Tutorial_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip install keras_tqdm
!pip install matplotlib==3.1.0

### **Breast Cancer Wisconsin (Original) Data Set**
Processing data from:

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

We read in the data and do some basic cleanup for missing values. For a description of the fields, see:

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

Each sample comes from a fine needle aspirate (FNA) of a breast mass. The nine features grade cytological characteristics of the cells on an integer scale from 1 to 10. In summary:
```
 Sample code number          : Id number (not used and thus dropped)
 Clump Thickness             : 1–10
 Uniformity of Cell Size     : 1–10
 Uniformity of Cell Shape    : 1–10
 Marginal Adhesion           : 1–10
 Single Epithelial Cell Size : 1–10
 Bare Nuclei                 : 1–10
 Bland Chromatin             : 1–10
 Normal Nucleoli             : 1–10
 Mitoses                     : 1–10
 Class                       : 2 for benign, 4 for malignant
```
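
The field list above can be attached to the raw file as column names at load time. The following is a minimal sketch, assuming the columns appear in the order listed. `COLUMN_NAMES` and `load_named` are illustrative names, and the three rows are made-up stand-ins in the file's layout, not real records. Passing `na_values="?"` tells pandas to treat the `?` marker as missing.

```python
import io
import pandas as pd

# Column names in the order given by the field list above.
COLUMN_NAMES = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses", "Class",
]

def load_named(source):
    """Read a headerless CSV in this layout and drop the ID column (hypothetical helper)."""
    frame = pd.read_csv(source, header=None, names=COLUMN_NAMES, na_values="?")
    return frame.drop(columns="Sample code number")

# Made-up rows for illustration only (not taken from the real file).
sample = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,10,10,8,7,?,9,7,1,4\n"
    "3,1,1,1,1,2,2,3,1,1,2\n"
)
named = load_named(sample)
print(named.shape)
print(named["Bare Nuclei"].isna().sum())
```

The same helper should accept the data URL in place of `sample`, since `pd.read_csv` reads from URLs as well as file-like objects.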



In [0]:
import pandas as pd
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", header = None)
data = data.drop(data.columns[0], axis=1)
data.head(10)

### **Data Cleaning and Preprocessing**

In [0]:
df = pd.DataFrame(data)
# Column 6 (Bare Nuclei) uses "?" to mark missing values.
print(df.loc[df[6] == "?"].head(5))

# Compute the mean of the observed values of that feature.
df_6_without_missing_values = df[6].loc[df[6] != "?"]
mean = df_6_without_missing_values.astype(int).mean()
print("Mean value: " + str(mean))

# Replace missing values with the mean.
# Note: the cast to int truncates the mean to a whole number.
df[6] = df[6].replace("?", mean)
df[6] = df[6].astype(int)

# Relabel the classes: 2 (benign) -> 0 and 4 (malignant) -> 1.
df[10] = df[10].replace(2,0).replace(4,1)
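
The cleanup above can also be written with `pd.to_numeric`, which turns unparseable entries into `NaN` so that `fillna` can handle them. This is a sketch on a toy column, not the notebook's method. The values are invented, and rounding the mean (rather than truncating it) is an assumption made here for illustration.

```python
import pandas as pd

# Toy column in the same form as column 6: digit strings with "?" for missing.
raw = pd.Series(["1", "10", "?", "4", "?", "5"])

# Convert to numbers; anything unparseable (here "?") becomes NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the gaps with the mean of the observed values, rounded to an integer.
fill_value = int(round(numeric.mean()))
imputed = numeric.fillna(fill_value).astype(int)

print(fill_value)
print(imputed.tolist())
```

Because `mean()` skips `NaN` by default, no separate filtering step is needed before computing the fill value.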

### **Exploratory Data Analysis (EDA)**

In [0]:
import matplotlib.pyplot as plt

names = [ "ID", "Clump thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class" ]
# The ID column was dropped on load, so skip its name.
df.columns = names[1:]
# One histogram per column.
hists = df.hist(bins=20, figsize=(15,20))
# Pairwise correlation matrix of all columns, drawn as an image.
plt.matshow(df.corr())
plt.show()
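
The correlation matrix is easier to read as a ranking of features by how strongly each one correlates with the label. This is a minimal sketch on a made-up frame. The two feature columns borrow names from the data set, but the values are invented for illustration.

```python
import pandas as pd

# Small made-up frame with two features and a binary label.
toy = pd.DataFrame({
    "Clump thickness": [1, 2, 3, 8, 9, 10],
    "Mitoses":         [1, 1, 2, 1, 2, 1],
    "Class":           [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the label, strongest first.
ranking = toy.corr()["Class"].drop("Class").abs().sort_values(ascending=False)
print(ranking)
```

Applied to the real `df`, the same expression would give one correlation value per feature.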

In [0]:
# Seaborn visualization library
import seaborn as sns
# Create a pairplot colored by class, with kernel density estimates on the diagonal
sns.pairplot(df, hue = 'Class', diag_kind = 'kde',
             plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},
             height = 4)


# The diagonal plots show the estimated density of a single variable, split by class.
# The off-diagonal plots show the relationship (or lack thereof) between two variables.

In [0]:
# Use the nine feature columns (positions 0-8) to predict the target (position 9, Class)
X = df.iloc[:, :9]
Y = df.iloc[:, 9]

# Splitting between training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

# Checking the shapes to get an understanding of the problem
print( X_train.shape, X_test.shape )
print( Y_train.shape, Y_test.shape )

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
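
The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into preprocessing. The sketch below uses toy arrays with invented values to check that `StandardScaler` computes `(x - mean) / std` using the training mean and the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature splits (values invented for illustration).
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[3.0], [7.0]])

# Fit on the training data only, then apply to the test data.
scaler = StandardScaler().fit(train)
scaled_test = scaler.transform(test)

# The same transformation by hand, using training statistics (np.std defaults to ddof=0).
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(scaled_test, manual))
```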

In [0]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import tensorflow.compat.v1 as tf


from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras_tqdm import TQDMNotebookCallback

model = Sequential()
# Nine input features feed three hidden ReLU layers, each followed by 20% dropout.
model.add(Dense(16, input_dim=9, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
# A single sigmoid unit outputs the predicted probability of the malignant class.
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.summary()

# Train for 100 epochs, holding out 20% of the training data for validation.
model.fit(X_train, Y_train, batch_size=16, validation_split=0.2, epochs=100, verbose=0,
          callbacks=[TQDMNotebookCallback(leave_inner=True)])
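
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit, and dropout layers add no parameters. This is a small worked calculation for the layer widths used above. `dense_params` is an illustrative helper, not a Keras function.

```python
# Layer widths of the network above: 9 inputs, then Dense(16), Dense(32), Dense(16), Dense(1).
layer_sizes = [9, 16, 32, 16, 1]

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Pair each layer's input width with its output width.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [160, 544, 528, 17]
print(total)      # 1249
```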


In [0]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

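Y_pred = model.predict(X_test,verbose=0)
# Threshold the predicted probabilities at 0.5 to get class labels.
Y_pred = [1 if y >= 0.5 else 0 for y in Y_pred.ravel()]
cm = confusion_matrix(Y_test, Y_pred)

# Rows are true classes, columns are predicted classes.
df_cm = pd.DataFrame(cm, index = ["Benign", "Malignant"],
                  columns = ["Benign", "Malignant"])
plt.figure(figsize = (10,7))
sns.set(font_scale=1.4)
# fmt="d" prints the counts as plain integers.
sns.heatmap(df_cm, annot=True, fmt="d", annot_kws={"size": 16})
plt.show()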
#print(cm)
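
A confusion matrix also yields the summary metrics usually reported for a diagnostic classifier. The sketch below uses hypothetical counts, not results from this model, laid out the way `sklearn.metrics.confusion_matrix` returns them (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Hypothetical counts for illustration (0 = benign, 1 = malignant).
cm_example = np.array([[80, 5],
                       [3, 52]])

# For a 2x2 matrix, ravel() yields true negatives, false positives,
# false negatives and true positives, in that order.
tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / cm_example.sum()   # share of all cases classified correctly
sensitivity = tp / (tp + fn)              # share of malignant cases detected (recall)
specificity = tn / (tn + fp)              # share of benign cases correctly cleared
precision = tp / (tp + fp)                # share of malignant predictions that are right

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f}")
```

In a screening setting, sensitivity often matters most, since a false negative means a missed malignant case.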