#### CS 4820
# Assignment 4: Data Pre-processing and Model Analysis

The [Pima Indians dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) is a well-known dataset distributed by UCI and originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains clinical exam data for women aged 21 and above of Pima Indian heritage. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

It has the following features:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)

The last column is the outcome, and it is a binary variable.

In this first exercise we will explore it through the following steps:

1. Load the `..data/diabetes.csv` dataset and use `pandas` to explore the range of each feature.
2. For each feature, draw a histogram. Bonus points if you draw all the histograms in the same figure.
3. Explore the correlations of the features with the outcome column. You can do this in several ways, for example using the `sns.pairplot` we used above or drawing a heatmap of the correlations.
4. Do the features need standardization? If so, which technique will you use: MinMax or Standard?
5. Prepare your final `X` and `y` variables to be used by an ML model. Make sure you define your target variable well. Will you need dummy columns?
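To make the standardization question concrete, here is a minimal NumPy sketch of the two options named above. The array `X_toy` holds made-up illustrative values, not rows from the dataset, and the helper names `min_max_scale` and `standard_scale` are hypothetical. Both helpers assume that no column is constant, since a constant column would cause a division by zero.

```python
import numpy as np

# Hypothetical toy values for two features (NOT taken from diabetes.csv)
X_toy = np.array([
    [1.0, 85.0],
    [8.0, 183.0],
    [0.0, 137.0],
])

def min_max_scale(X):
    """Rescale each column to the [0, 1] range (MinMax scaling)."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def standard_scale(X):
    """Shift each column to zero mean and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X_minmax = min_max_scale(X_toy)
X_standard = standard_scale(X_toy)

print("MinMax scaled:\n", X_minmax)
print("Standard scaled:\n", X_standard)
```

MinMax scaling keeps every value inside a fixed interval but is sensitive to outliers, while Standard scaling leaves the range unbounded and centers each feature. Which one fits better depends on what the histograms show.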

### 0. Manually calculate the outputs from all neurons

In [1]:
import numpy as np
from tensorflow.keras.activations import relu
from tensorflow.keras.activations import tanh
from tensorflow.keras.layers import Activation

X_test = np.array([[0.9,1.2,0.7,0.8]])
print('Inputs:\n', X_test, '\n')

Inputs:
 [[0.9 1.2 0.7 0.8]] 



In [2]:
a=np.array([[1,0,-1,0]]).T
weights_0 = np.concatenate(
    (
        a,
        np.roll(a,1,axis=0),
        np.roll(a,3,axis=0)
    ), 
    axis=1
)
thetas_0 = np.full((3,),0.1)
print('Weights between input and hidden layer 1:\n', weights_0)
print('Thetas in hidden layer 1:\n', thetas_0)

y_1=tanh(X_test.dot(weights_0)+thetas_0)
print('Outputs from hidden layer 1:\n', y_1.numpy(), '\n')

Weights between input and hidden layer 1:
 [[ 1  0  0]
 [ 0  1 -1]
 [-1  0  0]
 [ 0 -1  1]]
Thetas in hidden layer 1:
 [0.1 0.1 0.1]
Outputs from hidden layer 1:
 [[ 0.29131261  0.46211716 -0.29131261]] 



In [3]:
b=np.array([[-1], [1.3], [-0.8]])
weights_1 = np.concatenate(
    (
        b,
        np.roll(b,1,axis=0),
        np.roll(b,2,axis=0)
    ), 
    axis=1
)

thetas_1 = np.full((3,),0.1)
print('Weights between hidden layer 1 and hidden layer 2:\n', weights_1)
print('Thetas in hidden layer 2:\n', thetas_1)

y_2=tanh(y_1.numpy().dot(weights_1)+thetas_1)
print('Outputs from hidden layer 2:\n', y_2.numpy(), '\n')

Weights between hidden layer 1 and hidden layer 2:
 [[-1.  -0.8  1.3]
 [ 1.3 -1.  -0.8]
 [-0.8  1.3 -1. ]]
Thetas in hidden layer 2:
 [0.1 0.1 0.1]
Outputs from hidden layer 2:
 [[ 0.56659243 -0.7504016   0.38022725]] 



In [12]:
c=np.array([[-1], [0], [1.1]])
weights_2 = np.concatenate(
    (
        c,
        np.roll(c,1,axis=0),
        np.roll(c,2,axis=0),
        np.roll(c,3,axis=0)
    ), 
    axis=1
)

thetas_2 = np.full((4,),0.1)
print('Weights between hidden layer 2 and output layer:\n', weights_2)
print('Thetas in output layer:\n', thetas_2)

output=Activation('softmax')(y_2.numpy().dot(weights_2)+thetas_2).numpy()

#output = np.around(output, decimals=3)
print('Outputs from output layer:\n', output, '\n')

Weights between hidden layer 2 and output layer:
 [[-1.   1.1  0.  -1. ]
 [ 0.  -1.   1.1  0. ]
 [ 1.1  0.  -1.   1.1]]
Thetas in output layer:
 [0.1 0.1 0.1 0.1]
Outputs from output layer:
 [[0.14432633 0.66121078 0.05013656 0.14432633]] 
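The three cells above can also be cross-checked without TensorFlow. The following is a sketch in plain NumPy that assumes the same input, the same weight matrices as printed above, and a bias of 0.1 on every neuron. The `softmax` helper and the short names `W0`, `W1`, `W2`, `h1`, `h2`, and `probs` are introduced here for illustration only.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[0.9, 1.2, 0.7, 0.8]])

# Weight matrices copied from the printed outputs above
W0 = np.array([[ 1.0,  0.0,  0.0],
               [ 0.0,  1.0, -1.0],
               [-1.0,  0.0,  0.0],
               [ 0.0, -1.0,  1.0]])
W1 = np.array([[-1.0, -0.8,  1.3],
               [ 1.3, -1.0, -0.8],
               [-0.8,  1.3, -1.0]])
W2 = np.array([[-1.0,  1.1,  0.0, -1.0],
               [ 0.0, -1.0,  1.1,  0.0],
               [ 1.1,  0.0, -1.0,  1.1]])
bias = 0.1  # every theta in the network is 0.1

h1 = np.tanh(x @ W0 + bias)       # hidden layer 1
h2 = np.tanh(h1 @ W1 + bias)      # hidden layer 2
probs = softmax(h2 @ W2 + bias)   # output layer

print("Hidden layer 1:", h1)
print("Hidden layer 2:", h2)
print("Output layer:  ", probs)
```

For example, the first hidden neuron receives 0.9 - 0.7 + 0.1 = 0.3, so its output is tanh(0.3), which agrees with the first value printed for hidden layer 1.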



### 1. Build a model in Keras to verify the calculations above

In [6]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(3, input_shape=(4,), activation='tanh'))
model.add(Dense(3, activation='tanh'))
model.add(Dense(4, activation='softmax'))
model.compile(Adam(learning_rate=0.1), 
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [7]:
# Overwrite the randomly initialized parameters with the hand-built weights and thetas
model.layers[0].set_weights([weights_0, thetas_0])
print(model.layers[0].get_weights())

model.layers[1].set_weights([weights_1, thetas_1])
print(model.layers[1].get_weights())

model.layers[2].set_weights([weights_2, thetas_2])
print(model.layers[2].get_weights())

[array([[ 1.,  0.,  0.],
       [ 0.,  1., -1.],
       [-1.,  0.,  0.],
       [ 0., -1.,  1.]], dtype=float32), array([0.1, 0.1, 0.1], dtype=float32)]
[array([[-1. , -0.8,  1.3],
       [ 1.3, -1. , -0.8],
       [-0.8,  1.3, -1. ]], dtype=float32), array([0.1, 0.1, 0.1], dtype=float32)]
[array([[-1. ,  1.1,  0. , -1. ],
       [ 0. , -1. ,  1.1,  0. ],
       [ 1.1,  0. , -1. ,  1.1]], dtype=float32), array([0.1, 0.1, 0.1, 0.1], dtype=float32)]


In [8]:
y_pred = model.predict(X_test)

#y_pred = np.around(y_pred, decimals=3)
print(y_pred)

[[0.14432633 0.66121083 0.05013655 0.14432633]]


In [14]:
print(y_pred)
print(output)
if np.allclose(y_pred, output):
    print("They match!")


[[0.14432633 0.66121083 0.05013655 0.14432633]]
[[0.14432633 0.66121078 0.05013656 0.14432633]]
They match!
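The two printouts differ in the last digit or two, most likely because the Keras layers store their parameters as `float32` (see the `dtype=float32` in the `get_weights()` output). The sketch below re-enters the printed values by hand to show how large the gap is and why a tolerance-based comparison still passes. The `rtol` and `atol` arguments are written out explicitly and equal NumPy's defaults.

```python
import numpy as np

# Values copied from the printed outputs above
y_pred = np.array([[0.14432633, 0.66121083, 0.05013655, 0.14432633]],
                  dtype=np.float32)   # Keras prediction
output = np.array([[0.14432633, 0.66121078, 0.05013656, 0.14432633]])  # manual result

# Largest element-wise gap between the two results
max_abs_diff = np.max(np.abs(y_pred.astype(np.float64) - output))

# Elements match when |a - b| <= atol + rtol * |b|
matches = np.allclose(y_pred, output, rtol=1e-5, atol=1e-8)

print("Largest absolute difference:", max_abs_diff)
print("Within tolerance:", matches)
```

An exact comparison with `==` would be too strict here, because results computed at different precisions rarely agree bit for bit.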
