# Homework 03

This Jupyter notebook calculates the information gain of the Price attribute in the restaurant example, examines the XOR network, and uses Keras to create a model for the Boston Housing dataset.

Author: Luke Steffen (lhs3)

Version: 04/06/2020

### 1. Information Gain of price
![title](restaurant_data.png)

**Information Gain = Gain(A) = B(p/(p+n)) - Remainder(A)**

p/(p+n) = 6 / 12 = 0.5, so B(p/(p+n)) = B(0.5) = 1 bit

B(q) = -(q * log2(q) + (1 - q) * log2(1 - q))

Remainder(A) = Sum[(pk + nk)/(p+n) * B(pk/(pk + nk))]

Price -> 3 choices for attribute

\$ -> 7 occurrences

\$$ -> 2 occurrences

\$$$ -> 3 occurrences

---------------------------------------------------------------------

\$ -> 3 positive occurrences & 4 negative occurrences

\$$ -> 2 positive occurrences & 0 negative occurrences

\$$$ -> 1 positive occurrence & 2 negative occurrences

--------------------------------------------------------------------

**Sum[(pk + nk)/(p+n) * B(pk/(pk + nk))]**

\$ -> (3 + 4) / (6 + 6) * B(3 / (3 + 4)) = 0.5747164127

    B(3/7) = 0.985228136
    
\$$ -> (2 + 0) / (6 + 6) * B(2 / (2 + 0)) = 0

    B(1) = 0
    
\$$$ -> (1 + 2) / (6 + 6) * B(1 / (1 + 2)) = 0.2295739585

    B(1/3) = 0.9182958341
    
Sum[(pk + nk)/(p+n) * B(pk/(pk + nk))] = 0.5747164127 + 0 + 0.2295739585 = 0.804

--------------------------------------------------------------------

Gain(Price) = 1 - 0.804 = 0.196 bits

Making Price the root of the tree yields an information gain of 0.196 bits. This is higher than the information gain from Type, which is 0 bits. However, it is lower than the gain from Patrons, which is 0.541 bits. Because Patrons has the higher information gain, it is the better choice for the root of the tree.
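
The arithmetic above can be double-checked with a few lines of Python. This is a minimal sketch: the function names `B` and `gain` are chosen here for illustration, and the Price counts are the ones listed above. The Patrons and Type counts are not listed in this notebook; they are assumed from the standard restaurant table and are included only to compare against the 0.541 and 0 bit figures quoted above.

```python
from math import log2

def B(q):
    """Entropy in bits of a Boolean variable that is true with probability q."""
    if q in (0, 1):
        return 0.0  # log2(0) is undefined; the entropy at the extremes is 0
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def gain(splits):
    """Information gain, given (positive, negative) counts for each attribute value."""
    p = sum(pk for pk, nk in splits)
    n = sum(nk for pk, nk in splits)
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
    return B(p / (p + n)) - remainder

# (positive, negative) counts per value of each attribute
price = [(3, 4), (2, 0), (1, 2)]           # $, $$, $$$ (from the counts above)
patrons = [(0, 2), (4, 0), (2, 4)]         # None, Some, Full (assumed)
rest_type = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger (assumed)

print("Gain(Price)   =", round(gain(price), 3))
print("Gain(Patrons) =", round(gain(patrons), 3))
print("Gain(Type)    =", round(gain(rest_type), 3))
```

Passing counts rather than raw rows keeps the sketch close to the hand calculation, so each term of the remainder can be compared line by line.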

### 2. XOR Network Simplification

It is not possible to simplify the network beyond the network created in class. The network created in class is below.


                        input1  input2                               input1   input2
               (weight1)   \      /  (weight2)               (weight1)   \     /   (weight2)
                            \    /                                        \   /
                           AND Node   <- bias1                           OR Node   <- bias1
                               |                                            |
                               |                                            |
                                \                                          /
                                 \                                        /
                                  \                                      /
                                   \                                    /
                                    \                                  /
                                     \                                /
                                      \                              /
                                       \                            /
                                        \ weight3          weight3 /
                                         -------            -------
                                                \          /
                                         bias2 -> Neuron 3
                                                     |
                                                     |
                                                     |
                                                   Output

This network cannot be simplified further because of the nature of the XOR problem. With only one layer, we would be using a single perceptron rather than a multi-layer neural network. Perceptrons can only solve problems that are linearly separable. A linearly separable problem is one where, when the examples are graphed, a single straight line can be drawn that separates the positive answers from the negative answers. XOR is not linearly separable, which can be seen by graphing the problem, as in the graph below.

                   1 o            *
                     |
                     |
                     |
                     |
                     |
                     *------------o
                     0            1
                     
          Where o is positive and * is negative
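
The graph above can also be probed numerically. The sketch below tries every single perceptron whose two weights and bias lie on a small grid and counts how many reproduce a given truth table. The grid range, the step activation, and the helper names are assumptions made for illustration. A search like this is only supporting evidence, not a proof, since it covers finitely many weight values.

```python
import itertools

# Truth tables as ((x1, x2), target) pairs
XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
OR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def perceptron(x1, x2, w1, w2, bias):
    """Single perceptron with a step activation."""
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0

def count_solutions(cases, grid):
    """Count the (w1, w2, bias) combinations on the grid that match every case."""
    count = 0
    for w1, w2, bias in itertools.product(grid, repeat=3):
        if all(perceptron(x1, x2, w1, w2, bias) == target
               for (x1, x2), target in cases):
            count += 1
    return count

# Candidate values from -2.0 to 2.0 in steps of 0.25
grid = [i / 4 for i in range(-8, 9)]

print("perceptrons that compute OR: ", count_solutions(OR_CASES, grid))
print("perceptrons that compute XOR:", count_solutions(XOR_CASES, grid))
```

The OR count is included as a sanity check that the search can find solutions for a linearly separable function. For XOR no combination works on any grid, because the four constraints on `w1`, `w2`, and `bias` contradict each other.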


If we attempt to draw a line that separates the positive answers from the negative answers, we find that no single straight line does so: any line that puts both positive points on one side also puts at least one negative point on that side. Therefore XOR is not linearly separable, and a network with only one layer cannot solve it. We need at least one more layer. The network built in class has 2 layers, which is the fewest layers that can solve a problem that is not linearly separable. Therefore, we cannot simplify the network created in class any further.
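
To make the two-layer structure concrete, here is a minimal sketch of a network with an AND node and an OR node feeding one output neuron, using a step activation. The weight and bias values are illustrative assumptions and are not necessarily the values used in class. In this sketch the two connections into the output neuron carry different weights: -1 from the AND node and +1 from the OR node.

```python
def step(z):
    """Step activation: fire (1) only when the weighted sum is positive."""
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    """Two-layer network: AND and OR hidden nodes feeding one output neuron."""
    # Hidden layer (assumed weights of 1.0 on both inputs)
    and_out = step(1.0 * x1 + 1.0 * x2 - 1.5)  # fires only for (1, 1)
    or_out = step(1.0 * x1 + 1.0 * x2 - 0.5)   # fires for any input containing a 1
    # Output layer: "OR but not AND" is exactly XOR
    return step(-1.0 * and_out + 1.0 * or_out - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_network(x1, x2))
```

The output neuron fires only when the OR node is on and the AND node is off, which is the combination a single perceptron cannot express.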

### 3. Boston Housing Dataset

In [17]:
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import pandas as pd

# Import boston housing into training and testing datasets
from keras.datasets import boston_housing
(train_images, train_labels), (test_images, test_labels) = boston_housing.load_data()

In [18]:
# Code pulled from numpy.ipynb to print dataset metrics
print(
        'training images \
            \n\tcount: {} \
            \n\tdimensions: {} \
            \n\tshape: {} \
            \n\tdata type: {}\n\n'.format(
                len(train_images),
                train_images.ndim,
                train_images.shape,
                train_images.dtype
        ),
        'testing labels \
            \n\tcount: {} \
            \n\tdimensions: {} \
            \n\tshape: {} \
            \n\tdata type: {} \
            \n\tvalues: {}\n'.format(
                len(test_labels),
                test_labels.ndim,
                test_labels.shape,
                test_labels.dtype,
                test_labels
        )
    )

training images
	count: 404
	dimensions: 2
	shape: (404, 13)
	data type: float64

 testing labels
	count: 102             
	dimensions: 1             
	shape: (102,)             
	data type: float64             
	values: [ 7.2 18.8 19.  27.  22.2 24.5 31.2 22.9 20.5 23.2 18.6 14.5 17.8 50.
 20.8 24.3 24.2 19.8 19.1 22.7 12.  10.2 20.  18.5 20.9 23.  27.5 30.1
  9.5 22.  21.2 14.1 33.1 23.4 20.1  7.4 15.4 23.8 20.1 24.5 33.  28.4
 14.1 46.7 32.5 29.6 28.4 19.8 20.2 25.  35.4 20.3  9.7 14.5 34.9 26.6
  7.2 50.  32.4 21.6 29.8 13.1 27.5 21.2 23.1 21.9 13.  23.2  8.1  5.6
 21.7 29.6 19.6  7.  26.4 18.9 20.9 28.1 35.4 10.2 24.3 43.1 17.6 15.4
 16.2 27.1 21.4 21.5 22.4 25.  16.6 18.6 22.  42.8 35.1 21.5 36.  21.9
 24.1 50.  26.7 25. ]



In [19]:
# Training data length
t_length = len(train_images)

# Take last 50 points of training data and make a validation set
val_images = train_images[-50:]
val_labels = train_labels[-50:]

# Remove last 50 values from the training data
train_images = train_images[:(t_length-50)]
train_labels = train_labels[:(t_length-50)]

print("Validation data length: " + str(len(val_images)))
print("Validation label length: " + str(len(val_labels)))
print("\nNew training data length: " + str(len(train_images)))
print("New training label length: " + str(len(train_labels)))

# Note that the testing data has already been loaded from the import

Validation data length: 50
Validation label length: 50

New training data length: 354
New training label length: 354


In [26]:
# Create one new synthetic feature

# Column names for the 13 Boston Housing features
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Convert the numpy datasets to pandas DataFrames
training_images = pd.DataFrame(data=train_images, columns=feature_names)
validation_images = pd.DataFrame(data=val_images, columns=feature_names)
testing_images = pd.DataFrame(data=test_images, columns=feature_names)

# Synthetic feature: average number of rooms per dwelling (RM) multiplied by
# the proportion of owner-occupied units built prior to 1940 (AGE).
# Add it to every split so that all three keep the same set of columns.
for frame in (training_images, validation_images, testing_images):
    frame["rooms_per_old_dwellings"] = frame["RM"] * frame["AGE"]

This synthetic feature combines the average number of rooms per dwelling (RM)
with the proportion of owner-occupied units built before 1940 (AGE). It could
be useful for machine learning because a model trained on it could potentially
predict whether a house was built before 1940 based on the number of rooms in
the house. It seems plausible that older houses contain fewer rooms on average
than more modern houses. If this is the case, it is possible to train a
machine learning model to predict the age of a house based on the total
number of rooms in the house.
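
One way to check whether older housing really goes with fewer rooms is to look at the correlation between RM and AGE. The sketch below is only an illustration: the rows in `toy` are hypothetical values invented for this example, not Boston Housing data, and `add_rooms_age_feature` is a helper name introduced here. On the real data the same two calls would be applied to `training_images`.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the Boston Housing dataset
toy = pd.DataFrame({
    "RM": [5.0, 5.5, 6.0, 6.5, 7.0],
    "AGE": [95.0, 80.0, 60.0, 45.0, 20.0],
})

def add_rooms_age_feature(frame):
    """Return a copy of frame with the RM * AGE product added as a new column."""
    result = frame.copy()
    result["rooms_per_old_dwellings"] = result["RM"] * result["AGE"]
    return result

toy = add_rooms_age_feature(toy)

# Pearson correlation; a negative value means more rooms goes with newer housing
rm_age_corr = toy["RM"].corr(toy["AGE"])

print(toy)
print("RM/AGE correlation:", round(rm_age_corr, 3))
```

A correlation near zero on the real training split would suggest that the number of rooms says little about age, in which case the synthetic feature is less likely to help.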