# Outlier Detection

For this homework, you will be implementing the outlier detection algorithm from class.

There are two subtasks, each worth 25 points: 15 points for the code and 10 points for getting the correct answer. Partial credit may be given and deductions may be taken.


## Task 1


For this task, you will implement (in Python) the nested loops algorithm on slide 27 of the "outliers" lecture. Start with the code below and fill in the missing code. To implement the priority queue, use Python's `heapq`. Measure the distance between data points with the Euclidean distance (the l2-norm).
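Since `heapq` only provides a min-heap, one common way to keep the k smallest distances is to store negated values, so that the heap root is the largest value currently kept. The sketch below illustrates that idea on plain numbers; `kth_smallest` is a hypothetical helper name, not part of the assignment skeleton, and it assumes the input has at least k values.

```python
import heapq as hq

def kth_smallest(values, k):
    """Return the k-th smallest value using a size-k max-heap (negated min-heap)."""
    heap = []  # holds negated values, so heap[0] is the largest kept value, negated
    for v in values:
        hq.heappush(heap, -v)
        if len(heap) > k:
            hq.heappop(heap)  # discards the largest of the values kept so far
    return -heap[0]

print(kth_smallest([5.0, 1.0, 4.0, 2.0, 3.0], k=2))  # prints 2.0
```

Applied to the distances from one data point to all the others, the same pattern yields that point's distance to its k-th nearest neighbor.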


My output is:
```
--- 55.808152198791504 seconds ---
(20.401885959821104, 3002)
(21.573458366185555, 3001)
(23.81891978393018, 3005)
(25.3452201490205, 3004)
(23.860280263207713, 3003)
```

Note that since the data are randomly generated, you may get a slightly different result, although data points 3001 through 3004 should always be the outliers. Since 55 seconds is a non-trivial amount of time to wait, consider reducing the data set size from 3000 to a smaller value, such as 100, while you are debugging. Here is the skeleton to start with:



```python
import heapq as hq
import numpy as np
import time
# from scipy.spatial import distance

# create the covariance matrix (100 x 100 identity)
covar = np.zeros((100, 100))
np.fill_diagonal(covar, 1)

# and the mean vector
mean = np.zeros(100)

# create 3000 data points
all_data = np.random.multivariate_normal(mean, covar, 3000)

# now create the outliers: for i = 1..19, draw i points centered at mean i
for i in range(1, 20):
    mean.fill(i)
    outlier_data = np.random.multivariate_normal(mean, covar, i)
    all_data = np.concatenate((all_data, outlier_data))

# k for kNN detection
k = 10

# the number of outliers to return
m = 5

# start the timer
start_time = time.time()

# the priority queue of outliers
outliers = []

#YOUR CODE HERE!

for i in range(len(all_data)):
    # max-heap (via negation) holding the k smallest distances seen for point i
    max_heap = []
    for j in range(len(all_data)):
        if i == j:
            continue

        # equivalent alternatives:
        # dist = distance.euclidean(all_data[i], all_data[j])
        # dist = np.sqrt(np.sum(np.power(all_data[i] - all_data[j], 2)))
        dist = np.linalg.norm(all_data[i] - all_data[j])
        hq.heappush(max_heap, -dist)

        # drop the largest distance once more than k are stored
        if len(max_heap) > k:
            hq.heappop(max_heap)

    # the heap root is the (negated) distance to the k-th nearest neighbor
    tmp = hq.heappop(max_heap)
    hq.heappush(outliers, (-tmp, i))

    # keep only the m points with the largest k-th-neighbor distance
    if len(outliers) > m:
        hq.heappop(outliers)

print("--- %s seconds ---" % (time.time() - start_time))

# print the outliers...
for outlier in outliers:
    print(outlier)
```

```
--- 63.30155920982361 seconds ---
(20.31684850212159, 3001)
(21.499931242948378, 3002)
(25.404498372012462, 3005)
(24.262796393911643, 3003)
(25.364716015500573, 3004)
```


Note the spot above where you should add your code. 
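When debugging on a reduced data set, it can help to cross-check the heap-based result against a direct computation. The following is a minimal sketch of such a check, not part of the assignment: `knn_outliers_bruteforce` is a hypothetical helper that builds the full pairwise distance matrix with NumPy broadcasting, so it is only practical for small inputs (a few hundred points).

```python
import numpy as np

def knn_outliers_bruteforce(data, k, m):
    """Return the m points with the largest k-th nearest-neighbor distance."""
    # Pairwise Euclidean distances via broadcasting: shape (n, n).
    diffs = data[:, None, :] - data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)          # exclude each point's distance to itself
    kth = np.sort(dists, axis=1)[:, k - 1]   # k-th smallest distance in each row
    top = np.argsort(kth)[-m:]               # indices of the m largest scores
    return sorted((float(kth[i]), int(i)) for i in top)

# Toy data: 50 standard-normal points in 5 dimensions plus one far-away point.
rng = np.random.default_rng(0)
small = np.vstack([rng.normal(size=(50, 5)), np.full((1, 5), 10.0)])
print(knn_outliers_bruteforce(small, k=3, m=1))  # the far point (index 50) is reported
```

The returned `(distance, index)` pairs can be compared with the contents of `outliers` after sorting both lists.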

## Task 2

In this task, you should implement the faster algorithm on slide 30. 
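Slide 30 is not reproduced here. One common speed-up for this kind of nested loop, and the one the solution cell in this notebook appears to use, is to abandon a point as soon as its running k-th-nearest distance falls below the weakest score among the current top m. The sketch below shows that pruning rule in isolation; the function name and the `pruned` flag are illustrative assumptions, and it assumes the data set has more than k points.

```python
import heapq as hq
import numpy as np

def knn_outliers_pruned(data, k, m):
    """Top-m outliers by k-th nearest-neighbor distance, with early exit."""
    outliers = []  # min-heap of (score, index); outliers[0] is the weakest kept
    for i in range(len(data)):
        nearest = []  # negated distances; -nearest[0] is the largest of those kept
        pruned = False
        for j in range(len(data)):
            if i == j:
                continue
            hq.heappush(nearest, -np.linalg.norm(data[i] - data[j]))
            if len(nearest) > k:
                hq.heappop(nearest)
            # The running k-th distance can only shrink as more points are seen,
            # so once it is below the cutoff, point i cannot enter the top m.
            if (len(nearest) == k and len(outliers) == m
                    and -nearest[0] < outliers[0][0]):
                pruned = True
                break
        if not pruned:
            hq.heappush(outliers, (-nearest[0], i))
            if len(outliers) > m:
                hq.heappop(outliers)
    return sorted(outliers)
```

Shuffling the data first matters for this rule: if the true outliers are met early, the cutoff rises quickly and most inner loops stop after a handful of distance computations.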

Note that to randomly shuffle the data, you should use:

```python
np.random.shuffle(all_data)
```


```python
# your code here
# be sure to include the timing

# start the timer
start_time = time.time()

# the priority queue of outliers
outliers = []

#YOUR CODE HERE!

for i in range(len(all_data)):
    # max-heap (via negation) holding the k smallest distances seen for point i
    max_heap = []
    for j in range(len(all_data)):
        if i == j:
            continue

        # equivalent alternatives:
        # dist = distance.euclidean(all_data[i], all_data[j])
        # dist = np.sqrt(np.sum(np.power(all_data[i] - all_data[j], 2)))
        dist = np.linalg.norm(all_data[i] - all_data[j])
        hq.heappush(max_heap, -dist)

        if len(max_heap) > k:
            hq.heappop(max_heap)

        # early exit: the running k-th distance is already below the weakest
        # of the m current outliers, so point i cannot be an outlier
        # (max_heap[0] and outliers[0] are the heap minima)
        if len(max_heap) == k and len(outliers) == m and -max_heap[0] < outliers[0][0]:
            break

    tmp = hq.heappop(max_heap)
    hq.heappush(outliers, (-tmp, i))

    # a pruned point has the smallest score, so it is popped again here
    if len(outliers) > m:
        hq.heappop(outliers)

print("--- %s seconds ---" % (time.time() - start_time))

# print the outliers...
for outlier in outliers:
    print(outlier)
```

```
--- 2.8013627529144287 seconds ---
(20.31684850212159, 545)
(24.262796393911643, 1377)
(21.499931242948378, 2600)
(25.404498372012462, 1130)
(25.364716015500573, 2915)
```


Here is my output:
```
--- 2.0767040252685547 seconds ---
(20.401885959821104, 343)
(23.818919783930184, 1455)
(21.573458366185555, 1902)
(23.860280263207713, 2668)
(25.345220149020495, 393)
```

Note that since you've shuffled the data, the indices of the outliers will change, but the distances should be the same.
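`np.random.shuffle` reorders the rows in place and does not record the permutation it applied. If you want to map a reported index back to the original ordering, one option (an assumption on my part, not something the assignment requires) is to draw an explicit permutation and index with it instead. A minimal sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(12, dtype=float).reshape(6, 2)  # 6 toy points, one per row

perm = rng.permutation(len(data))  # perm[s] = original row now at shuffled position s
shuffled = data[perm]              # shuffled copy; `data` itself is left untouched

shuffled_idx = 3                   # e.g. an index reported after shuffling
original_idx = perm[shuffled_idx]  # the same point's index before shuffling
assert np.array_equal(shuffled[shuffled_idx], data[original_idx])
```

Because the rows themselves are unchanged, every pairwise distance is preserved, which is why only the indices differ between the two runs.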


Upload a Jupyter notebook with your solution to Canvas.
