# K Nearest Neighbor Algorithm for Room Occupancy

## Introduction

In this lab, a K Nearest Neighbor algorithm is used to determine whether a room is occupied, using data from various sensors placed in the room.  The goal of this lab is both to create a working algorithm that addresses this problem and to make that algorithm as efficient as possible.  This is done by examining the attributes of the dataset and eliminating as many as possible while maintaining a high level of accuracy.

In [1]:
import csv
import random
import math
import operator
import statistics

def euclidianDistance(item1, item2, attributes):
    distance = 0
    for x in range(attributes):
        distance+=(float(item1[x]) - float(item2[x]))**2
    return math.sqrt(distance)

    
def loadDataset(filename, split, trainingSet, testSet):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        newdata = []
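        # Columns kept from each row; the class label (Occupancy) must come last,
        # because getResponse and getAccuracy read it from index -1.
        # These indices are specific to this dataset and need to change for a different one.
        indicies = [4,5,7]
        for x in range(1,len(dataset)-1):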
            newdata.append([])
            for y in indicies:
                newdata[-1].append(float(dataset[x][y]))
            if random.random() < split:
                trainingSet.append(newdata[x - 1])
            else:
                testSet.append(newdata[x - 1])
                
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response]+=1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key = operator.itemgetter(1), reverse = True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct+=1
    return (correct / float(len(testSet))) * 100.0
    
#def optimize(testSet, trainingSet):
    
def getNeighbors(trainingSet, test, k):
    distances = []
    length = len(test) - 1
    for x in range(len(trainingSet)):
        dist = euclidianDistance(test, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    #sort on the distance, not the data point
    distances.sort(key = operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
    

def kNN():
    trainingSet = []
    testSet = []
    split = 0.98
    loadDataset('datatraining.txt', split, trainingSet, testSet)
    bests = []
    for k in range (1, 15):
        avgresult = 0
        result_set = []
        print("!!!!!!!")
        print("Hey there! Now starting optimization for k = ",k)
        print("See you soon!")
        print("!!!!!!!")
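        # NOTE: the train/test split is fixed, so all 10 passes give the same accuracy.
        # Re-split inside this loop to average over different partitions.
        for times in range(10):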
            predictions = []
            for x in range(len(testSet)):
        
                neighbors = getNeighbors(trainingSet, testSet[x], k)
                result = getResponse(neighbors)
                predictions.append(result)
                #print('predicted:' + str(result) + ', actual:' + str(testSet[x][-1]))
            accuracy = getAccuracy(testSet, predictions)
            result_set.append(accuracy)
        avgresult = statistics.mean(result_set)
        print("DONE! Average accuracy ", avgresult)
        print()
        print()
        bests.append([avgresult, k])
    print("Wow! That was a lot of work! Let's see which k value was the most efficient!")
    # Sort descending so the highest average accuracy comes first
    the_best_ones = sorted(bests, key = operator.itemgetter(0), reverse = True)
    print("Okay I got it! The most accurate k value is a value of ", the_best_ones[0][1], " with an accuracy level of ", the_best_ones[0][0])
    
    
kNN()

!!!!!!!
Hey there! Now starting optimization for k =  1
See you soon!
!!!!!!!


KeyboardInterrupt: 

### Questions

1. Does the value of k change the accuracy of the algorithm?  What value of k seems to be best?

I ran the classifier in a for loop and collected the average accuracy for every value of k from 1 to 14.  From this optimization, I found that the accuracy of the algorithm changed very little across that range.  It hovered around 98-99% for every value of k.  For this reason I have concluded that there is no real benefit to any particular k value.  Any value from 1 to 14 is a reasonable "best" k-value in my opinion; however, according to the calculated means (listed below), k = 2 was the most accurate at about 99.3%.
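
As a minimal sketch of how the best k can be read off the calculated means, the snippet below copies the averages from the Data section into a dictionary and picks the maximum. The names `avg_accuracy`, `best_k`, and `accuracy_spread` are hypothetical helpers introduced for illustration; they are not part of the lab code.

```python
# Average accuracies copied from the output in the Data section (k -> percent).
avg_accuracy = {
    1: 98.6013986013986, 2: 99.3006993006993, 3: 97.9020979020979,
    4: 98.6013986013986, 5: 98.6013986013986, 6: 98.6013986013986,
    7: 97.9020979020979, 8: 97.9020979020979, 9: 97.9020979020979,
    10: 97.9020979020979, 11: 97.9020979020979, 12: 97.9020979020979,
    13: 97.9020979020979, 14: 97.9020979020979,
}


def best_k(results):
    """Return (k, accuracy) for the highest accuracy; ties go to the smallest k."""
    # max() keeps the first maximal element, so iterating keys in sorted
    # order resolves ties in favour of the smallest k.
    k = max(sorted(results), key=results.get)
    return k, results[k]


def accuracy_spread(results):
    """Difference between the best and worst accuracy, in percentage points."""
    return max(results.values()) - min(results.values())


k, acc = best_k(avg_accuracy)
print("best k:", k, "accuracy:", round(acc, 2))
print("spread across k:", round(accuracy_spread(avg_accuracy), 2), "percentage points")
```

The spread between the best and worst k is small. The printed fractions are consistent with a test set of 143 rows (an inference from the output, not a figure stated in the lab), in which case the gap would amount to only two test rows.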

2. Can you remove some of the attributes and still get a high (>90%) level of accuracy?  Which attributes seem to be most important?

Removing many of the attributes had little to no effect on the accuracy of the Nearest Neighbor algorithm.  By process of elimination, I found that the only attributes necessary to accurately predict the occupancy of a room were Light and CO2.  These two variables seemed to be the only defining factors for occupancy in this dataset.  When I removed the Light variable, the accuracy dropped from ~98-99% to ~92%.  While still accurate, this was a significant loss.  Accuracy was even worse when the CO2 variable was removed from the dataset: it fell to ~85%, a loss of approximately 14 percentage points.  These losses lead to the conclusion that these two variables were the most important to the analysis.  Comparing the accuracy of the analysis using only these two variables with the analysis using all variables, I found almost identical results.  Both analyses led to accuracies of approximately 98%, confirming that Light and CO2 are adequate to predict the occupancy of a room.
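
The process of elimination described above was done by hand, by editing the column indices and re-running. A leave-one-attribute-out loop is one way it could be automated; the sketch below is an assumption about how that might look, not the procedure actually used in this lab. The helpers `knn_predict`, `accuracy`, and `ablation` are hypothetical, and the rows are made-up toy values rather than the occupancy data.

```python
import math
from collections import Counter


def knn_predict(train, query, feature_idx, k):
    """Majority vote among the k training rows closest to query on feature_idx."""
    nearest = sorted(
        train,
        key=lambda row: math.sqrt(sum((row[i] - query[i]) ** 2 for i in feature_idx)),
    )[:k]
    return Counter(row[-1] for row in nearest).most_common(1)[0][0]


def accuracy(train, test, feature_idx, k=3):
    """Percentage of test rows whose predicted label matches the last column."""
    hits = sum(knn_predict(train, row, feature_idx, k) == row[-1] for row in test)
    return 100.0 * hits / len(test)


def ablation(train, test, names, k=3):
    """Accuracy with all features, then with each single feature left out."""
    all_idx = list(range(len(names)))
    report = {"all": accuracy(train, test, all_idx, k)}
    for drop in all_idx:
        kept = [i for i in all_idx if i != drop]
        report["without " + names[drop]] = accuracy(train, test, kept, k)
    return report


# Made-up toy rows: [light, co2, noise, occupied]. NOT the lab's data.
train = [
    [400, 900, 5, 1], [420, 950, 1, 1], [450, 1000, 9, 1],
    [0, 450, 2, 0], [10, 430, 8, 0], [5, 470, 4, 0],
]
test = [[430, 980, 3, 1], [3, 440, 7, 0]]

for label, acc in ablation(train, test, ["light", "co2", "noise"]).items():
    print(label, acc)
```

An attribute whose removal barely changes the accuracy is a candidate for elimination, while a large drop marks it as important.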

### Data

This is an excerpt of the output from running the code, as the algorithm takes 15+ minutes to complete.

!!!!!!!
Hey there! Now starting optimization for k =  1
See you soon!
!!!!!!!
DONE! Average accuracy  98.6013986013986


!!!!!!!
Hey there! Now starting optimization for k =  2
See you soon!
!!!!!!!
DONE! Average accuracy  99.3006993006993


!!!!!!!
Hey there! Now starting optimization for k =  3
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979


!!!!!!!
Hey there! Now starting optimization for k =  4
See you soon!
!!!!!!!
DONE! Average accuracy  98.6013986013986


!!!!!!!
Hey there! Now starting optimization for k =  5
See you soon!
!!!!!!!
DONE! Average accuracy  98.6013986013986


!!!!!!!
Hey there! Now starting optimization for k =  6
See you soon!
!!!!!!!
DONE! Average accuracy  98.6013986013986


!!!!!!!
Hey there! Now starting optimization for k =  7
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979


!!!!!!!
Hey there! Now starting optimization for k =  8
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979


!!!!!!!
Hey there! Now starting optimization for k =  9
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979


!!!!!!!
Hey there! Now starting optimization for k =  10
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979


!!!!!!!
Hey there! Now starting optimization for k =  11
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979


!!!!!!!
Hey there! Now starting optimization for k =  12
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979


!!!!!!!
Hey there! Now starting optimization for k =  13
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979


!!!!!!!
Hey there! Now starting optimization for k =  14
See you soon!
!!!!!!!
DONE! Average accuracy  97.9020979020979
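
The 15+ minute runtime noted at the start of this section comes partly from repeated work. Each k is evaluated 10 times on the same fixed split, and every test row's distances are recomputed and fully sorted for each k. The sketch below shows one possible way to cut that down, under the assumption that the split stays fixed. It finds each test row's nearest neighbors once with `heapq.nsmallest` and reuses the ordering for every k. The functions `k_nearest` and `accuracy_by_k` are hypothetical and the rows are toy values, not the lab's data.

```python
import heapq
import math
from collections import Counter


def k_nearest(training_set, test_row, k):
    """Return the k closest training rows, nearest first (label is the last column)."""
    n_features = len(test_row) - 1

    def dist(row):
        return math.sqrt(sum((row[i] - test_row[i]) ** 2 for i in range(n_features)))

    # nsmallest avoids sorting the whole training set when k is small
    return heapq.nsmallest(k, training_set, key=dist)


def accuracy_by_k(training_set, test_set, max_k):
    """Accuracy (percent) for every k in 1..max_k using one neighbor search per test row."""
    correct = {k: 0 for k in range(1, max_k + 1)}
    for row in test_set:
        nearest = k_nearest(training_set, row, max_k)
        for k in correct:
            # The first k entries of the ordered list are the k nearest neighbors
            votes = Counter(n[-1] for n in nearest[:k])
            if votes.most_common(1)[0][0] == row[-1]:
                correct[k] += 1
    return {k: 100.0 * c / len(test_set) for k, c in correct.items()}


# Made-up toy rows: [light, co2, occupied]. NOT the lab's data.
train = [
    [400, 900, 1], [420, 950, 1], [450, 1000, 1],
    [0, 450, 0], [10, 430, 0], [5, 470, 0],
]
test = [[430, 980, 1], [3, 440, 0]]

print(accuracy_by_k(train, test, 3))
```

Because nothing in the evaluation is random once the split is made, a single pass per k gives the same average as ten passes.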