# k-Nearest Neighbor


This notebook illustrates how to use k-nearest neighbors in TensorFlow.

We will use the 1970s Boston housing dataset, which is available through the UCI ML data repository.

### Data:
----------x-values-----------
* CRIM   : per capita crime rate by town
* ZN     : proportion of residential land zoned for lots over 25,000 sq. ft.
* INDUS  : proportion of non-retail business acres per town
* CHAS   : Charles River dummy variable (1 if tract bounds river, 0 otherwise)
* NOX    : nitric oxides concentration (parts per 10 million)
* RM     : average number of rooms per dwelling
* AGE    : proportion of owner-occupied units built prior to 1940
* DIS    : weighted distances to five Boston employment centers
* RAD    : index of accessibility to radial highways
* TAX    : full-value property-tax rate per $10,000
* PTRATIO: pupil-teacher ratio by town
* B      : 1000*(Bk-0.63)^2, where Bk is the proportion of Black residents by town
* LSTAT  : % lower status of the population

------------y-value-----------
* MEDV   : median value of owner-occupied homes in $1,000s

In [11]:
# import required libraries
#importing pyplot from matplotlib for visualization
import matplotlib.pyplot as plt
#importing numpy
import numpy as np
#importing tensorflow
import tensorflow as tf
#importing requests which will be used for fetching data
import requests
#import ops, then clear the default graph stack and reset the global default graph
from tensorflow.python.framework import ops
ops.reset_default_graph()
#importing debug library
from tensorflow.python import debug as tf_debug

#import the loader for the Boston housing dataset bundled with scikit-learn (no internet needed)
#note: load_boston was removed in scikit-learn 1.2, so an older release is required
from sklearn.datasets import load_boston
print ("Housing data downloaded")

Housing data downloaded


### Create graph

In [12]:
#create a session object, the environment in which operations are executed and tensors are evaluated
sess = tf.Session()

## Debugger

### Uncomment the line below and execute the code to run the debugger.

### Go to this link once you start execution: http://localhost:6006/

In [13]:
#Uncomment the line below to run the debugger (the address must match the debugger port TensorBoard is listening on)
# sess = tf_debug.TensorBoardDebugWrapperSession(sess, "localhost:6064")

### Load the data

In [14]:
#load the Boston housing data bundled with scikit-learn (a copy of the UCI housing data, so no download is needed)
housing = load_boston()
print ("Data accessed")
#the 13 features in the dataset followed by the target (MEDV)
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
#the features being used in our model
cols_used = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
#The number of features being used in the model
num_features = len(cols_used)
#append the target as a 14th column so that each row follows the order of housing_header
housing_data = np.column_stack((housing.data, housing.target))

#retrieve the value at index 13 (MEDV) in each row, convert it to a numpy array, take transpose to obtain n x 1 array
y_vals = np.transpose([np.array([y[13] for y in housing_data])])
#retrieve the data belonging to the features to be used in the model, convert it to a 2-d numpy array
x_vals = np.array([[x for i,x in enumerate(y) if housing_header[i] in cols_used] for y in housing_data])

## Min-Max Scaling
#rescale each feature column to the range [0, 1]
x_vals = (x_vals - x_vals.min(0)) / np.ptp(x_vals, axis=0)

Data accessed
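
As a minimal sketch of the min-max scaling step, assuming a small made-up array in place of the housing features, the following NumPy snippet maps each column onto [0, 1]. The names `x`, `col_min`, `col_range` and `x_scaled` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Toy feature matrix (made-up values): 3 samples, 2 features.
x = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [4.0, 20.0]])

# Column-wise minimum and range (max - min).
col_min = x.min(axis=0)
col_range = np.ptp(x, axis=0)

# Subtract the minimum and divide by the range, column by column.
# A constant column would have a zero range and would need special handling.
x_scaled = (x - col_min) / col_range
print(x_scaled)
```

After scaling, every column has minimum 0 and maximum 1, so no single feature dominates the distance computation because of its units.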



### Split the data into train and test sets

In [15]:
#seed NumPy's pseudo-random number generator so that the split is reproducible
np.random.seed(13)  #make results reproducible
#randomly pick 80% of the row indices, without replacement, to form the train data
train_indices = np.random.choice(len(x_vals), int(round(len(x_vals)*0.8)), replace=False)
#the remaining 20% of the indices form the test data
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))
#using the list of train indices generated, obtain the training data from x_vals
x_vals_train = x_vals[train_indices]
#using the list of test indices generated, obtain the testing data from x_vals
x_vals_test = x_vals[test_indices]
#use the same set of train indices and obtain the corresponding labels for the train data
y_vals_train = y_vals[train_indices]
#use the same set of test indices and obtain the corresponding labels for the test data
y_vals_test = y_vals[test_indices]
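
The split logic can be sketched on its own, assuming a hypothetical dataset of 10 rows. This version uses a local `RandomState` and `np.setdiff1d` rather than the global seed and Python sets, so it is an illustration of the same idea and not the notebook's exact code.

```python
import numpy as np

rng = np.random.RandomState(13)   # local generator with a fixed seed
n_samples = 10                    # hypothetical dataset size

# Draw 80% of the row indices without replacement for training.
n_train = int(round(n_samples * 0.8))
train_idx = rng.choice(n_samples, n_train, replace=False)

# Every index that was not drawn for training goes to the test set.
test_idx = np.setdiff1d(np.arange(n_samples), train_idx)

print(sorted(train_idx), test_idx)
```

Because sampling is done without replacement and the test set is the complement, the two index sets never overlap and together cover every row.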


### Parameters to control run

In [16]:
# Declare k-value and batch size
#set the k value, the number of nearest neighbors used for each prediction
k = 4
#set the batch size to the length of the test data, so all test points are evaluated in one iteration
batch_size=len(x_vals_test)

# Placeholders
#placeholder for the training features, shape [n_train, num_features]
x_data_train = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
#placeholder for the test features, shape [n_test, num_features]
x_data_test = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
#placeholder for the training labels, shape [n_train, 1]
y_target_train = tf.placeholder(shape=[None, 1], dtype=tf.float32)
#placeholder for the test labels, shape [n_test, 1]
y_target_test = tf.placeholder(shape=[None, 1], dtype=tf.float32)
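
To see how the batch size relates to the number of iterations over the test data, here is a small sketch in plain Python. The helper `batch_bounds` is hypothetical and only illustrates the start and end indices that result when the batch size is smaller than the number of test points.

```python
import math

def batch_bounds(n_items, batch_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    num_loops = math.ceil(n_items / batch_size)
    for i in range(num_loops):
        # The last batch is capped at n_items, so it may be shorter.
        yield i * batch_size, min((i + 1) * batch_size, n_items)

# 10 items in batches of 4 give two full batches and one short batch.
bounds = list(batch_bounds(10, 4))
print(bounds)
```

With `batch_size` equal to the number of test points, as in this notebook, the generator yields a single pair that spans the whole test set.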


## Declare distance metric

### L1 Distance Metric

The L1 metric is enabled by default. To use L2 instead, comment out the following line and uncomment the L2 line.

In [17]:
#the following line calculates the distance matrix using the Manhattan (L1) distance metric
#the distance is calculated between each test point and all train data points, giving shape [n_test, n_train]
distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2)
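
The broadcasting behind the L1 distance matrix can be checked with NumPy. This is a sketch with made-up points; `train`, `test`, `diff` and `l1` are illustrative names, and `np.newaxis` plays the role of `tf.expand_dims`.

```python
import numpy as np

train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])   # 3 training points
test = np.array([[0.0, 1.0], [2.0, 2.0]])                # 2 test points

# test[:, np.newaxis, :] has shape (2, 1, 2). Subtracting it from train,
# shape (3, 2), broadcasts to (2, 3, 2): one difference vector per
# (test point, training point) pair.
diff = train - test[:, np.newaxis, :]

# Summing absolute differences over the feature axis leaves shape (2, 3).
l1 = np.abs(diff).sum(axis=2)
print(l1)
```

Row `i` of `l1` holds the distances from test point `i` to every training point, which is the layout a row-wise top-k search expects.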


### L2 Distance Metric

To use L2, uncomment the following line and comment out the L1 line above.

In [18]:
#the following line calculates the distance matrix using the Euclidean (L2) distance metric
#the distance is calculated between each test point and all train data points, giving shape [n_test, n_train]
#distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))), axis=2))
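
For the L2 metric, the choice of reduction axis decides the shape of the result. The NumPy sketch below, with made-up points, reduces over the feature axis (axis 2) to get one distance per test and training pair, and shows that reducing over axis 1 collapses the training points instead.

```python
import numpy as np

train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])   # 3 training points
test = np.array([[0.0, 0.0], [3.0, 4.0]])                # 2 test points

# Squared differences for every (test, train) pair: shape (2, 3, 2).
sq_diff = np.square(train - test[:, np.newaxis, :])

# Reducing over the feature axis gives the (n_test, n_train) distance matrix.
l2 = np.sqrt(sq_diff.sum(axis=2))

# Reducing over axis 1 sums across training points and gives
# shape (n_test, n_features), which is not a distance matrix.
wrong = np.sqrt(sq_diff.sum(axis=1))

print(l2.shape, wrong.shape)
```

Only the axis 2 reduction produces a matrix whose rows can be searched for the k nearest training points.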

## Predict: Get min distance index (Nearest neighbor)

In [19]:
#prediction = tf.arg_min(distance, 0)
#find the values and indices of the k largest entries of the negated distance matrix, i.e. the k smallest distances
top_k_xvals, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
#sum the k negated distances for each test point and restore the second dimension, giving shape [n_test, 1]
x_sums = tf.expand_dims(tf.reduce_sum(top_k_xvals, 1),1)
#multiply the sums by a 1 x k matrix of 1's to repeat each sum k times, giving shape [n_test, k]
x_sums_repeated = tf.matmul(x_sums,tf.ones([1, k], tf.float32))
#divide each value in top_k_xvals by the corresponding value in x_sums_repeated and expand to shape [n_test, 1, k]
x_val_weights = tf.expand_dims(tf.div(top_k_xvals,x_sums_repeated), 1)


#retrieve the training labels belonging to the top_k_indices, shape [n_test, k, 1]
top_k_yvals = tf.gather(y_target_train, top_k_indices)

#multiply the calculated weights by the respective labels and squeeze out the middle dimension, giving shape [n_test, 1]
prediction = tf.squeeze(tf.matmul(x_val_weights,top_k_yvals), axis=[1])
#prediction = tf.reduce_mean(top_k_yvals, 1)


# Calculate MSE
#sum the squared errors between the predicted and actual labels and divide by the batch size
mse = tf.div(tf.reduce_sum(tf.square(tf.subtract(prediction, y_target_test))), batch_size)


# Calculate how many loops over the test data
num_loops = int(np.ceil(len(x_vals_test)/float(batch_size)))

#iterate over the test data in num_loops batches
for i in range(num_loops):
    #starting index of the current batch
    min_index = i*batch_size
    #ending index of the current batch, capped at the size of the test data
    max_index = min((i+1)*batch_size,len(x_vals_test))
    #test data for the current batch
    x_batch = x_vals_test[min_index:max_index]
    #labels for the test data of the current batch
    y_batch = y_vals_test[min_index:max_index]
    #run the graph fragment to execute the operation (prediction) and evaluate each tensor using data from feed_dict
    predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                         y_target_train: y_vals_train, y_target_test: y_batch})
    #run the graph fragment to execute the operation (calculate mse) and evaluate each tensor using data from feed_dict
    batch_mse = sess.run(mse, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch,
                                         y_target_train: y_vals_train, y_target_test: y_batch})

    #print the mse for the current batch
    print('Batch #' + str(i+1) + ' MSE: ' + str(np.round(batch_mse,3)))
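
The weighted prediction and the MSE can be traced step by step in NumPy. The sketch below is a stand-in for the graph operations and not the notebook's TensorFlow code; the distances and labels are made up, and `np.argsort` replaces `tf.nn.top_k` on the negated distances.

```python
import numpy as np

k = 2
# Made-up distances from 2 test points (rows) to 4 training points (columns).
distance = np.array([[1.0, 3.0, 2.0, 5.0],
                     [4.0, 1.0, 1.0, 6.0]])
# Made-up training and test labels, both as column vectors.
y_train = np.array([[10.0], [20.0], [30.0], [40.0]])
y_test = np.array([[25.0], [25.0]])

# Indices and values of the k smallest distances in each row.
top_k_idx = np.argsort(distance, axis=1, kind="stable")[:, :k]
top_k_dist = np.take_along_axis(distance, top_k_idx, axis=1)

# Weight of each neighbor = its distance / sum of the k distances.
# The graph divides negated values by a negated sum, so the signs cancel.
weights = top_k_dist / top_k_dist.sum(axis=1, keepdims=True)

# Weighted sum of the neighbor labels, kept as a column vector of shape (2, 1).
neighbor_labels = y_train[top_k_idx, 0]            # shape (2, k)
prediction = (weights * neighbor_labels).sum(axis=1, keepdims=True)

# Sum of squared errors divided by the batch size; both operands are (2, 1).
mse = np.sum(np.square(prediction - y_test)) / len(y_test)
print(prediction.ravel(), mse)
```

Two details are worth noting. The weights in each row sum to 1, and under this formula a neighbor's weight grows with its distance. Keeping the prediction and the target in the same `(n, 1)` shape also matters, because subtracting an `(n,)` array from an `(n, 1)` array would broadcast to an `(n, n)` matrix and distort the error.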


In [20]:
#render plots inline in the notebook output
%matplotlib inline
# Plot prediction and actual distribution
#store 45 evenly spaced bin edges between 5 and 50
bins = np.linspace(5, 50, 45)

#plot the histogram for predicted values
plt.hist(predictions, bins, alpha=0.5, label='Prediction')
#plot the histogram for actual values
plt.hist(y_batch, bins, alpha=0.5, label='Actual')
#set title for the histogram
plt.title('Histogram of Predicted and Actual Values')
#labeling the x-axis of the plot
plt.xlabel('Med Home Value in $1,000s')
#labeling the y-axis of the plot
plt.ylabel('Frequency')
#set the location of the legend on the plot
plt.legend(loc='upper right')
#display the plot
plt.show()
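
To inspect the binning without drawing a plot, `np.histogram` can be called with the same bin edges. The sketch below uses a few made-up home values; `values`, `counts` and `edges` are illustrative names.

```python
import numpy as np

# 45 evenly spaced edges between 5 and 50 define 44 bins.
bins = np.linspace(5, 50, 45)

# Made-up median home values in $1,000s.
values = np.array([5.0, 12.3, 24.0, 50.0])

# counts[i] is the number of values that fall between edges[i] and edges[i + 1];
# the last bin also includes its right edge, so 50.0 is counted.
counts, edges = np.histogram(values, bins)
print(len(counts), counts.sum())
```

Passing 45 edges yields 44 bins, each a little over 1 unit wide, which is the binning the two overlaid histograms share.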
