# Nan_euclidean distance
When calculating the distance between a pair of samples, this formulation ignores feature coordinates with a missing value in either sample and scales up the weight of the remaining coordinates.

How do we find the distance between two points when some values are NaN?

|Rows|col0|col1|col2|col3|
|-|-|-|-|-|
|r0|$X_{00}$|$X_{01}$|$X_{02}$|$X_{03}$|
|r1|$X_{10}$|$X_{11}$|$X_{12}$|$X_{13}$|
|r2|$X_{20}$|$X_{21}$|$X_{22}$|$X_{23}$|
|r3|$X_{30}$|$X_{31}$|$X_{32}$|$X_{33}$|


|Rows|col0|col1|col2|col3|
|-|-|-|-|-|
|r0|3|np.nan| 2| 0|
|r1|5| 4| 1| 5|
|r2|6| 7| 7| 5|
|r3|6| 5| 4| np.nan|

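For r0 and r3, col1 is missing in r0 and col3 is missing in r3, so only col0 and col2 contribute:

$\verb|weight| = \dfrac{\verb|total number of columns|}{\verb|number of columns with values present in both r0 and r3|}$

$\verb|dist(r0, r3)| = \sqrt{\verb|weight| \times [(X_{00} - X_{30})^2 + (X_{02} - X_{32})^2]}$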
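
The same rule can be written for any pair of rows. As a sketch in notation that is not part of the original ($n$ for the total number of columns and $P$ for the set of columns where both rows have a value are symbols introduced here for illustration):

$\verb|dist(x, y)| = \sqrt{\dfrac{n}{|P|} \sum_{j \in P} (x_j - y_j)^2}$

When no value is missing, $|P| = n$, the weight is 1, and this reduces to the ordinary Euclidean distance.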


Example:
## Find the nan_euclidean distance between point1 $\verb|[3, np.nan, 2, 0]|$ and point2 $\verb|[6, 5, 4, np.nan]|$

If we simply drop the coordinates where either point is NaN (col1 and col3), the plain Euclidean distance over the remaining coordinates is $\sqrt{(3-6)^2 + (2-4)^2} = \sqrt{9 + 4} = \sqrt{13} \approx 3.606$.

The nan_euclidean distance additionally scales the squared sum by the weight:

$\verb|weight| = \dfrac{4}{2} = 2$

$\verb|dist(r0, r3)| = \sqrt{2 \times [(3-6)^2 + (2-4)^2]} = \sqrt{2 \times [9+4]} = \sqrt{26} \approx 5.099$


## Find the nan_euclidean distance between point1 $\verb|[3, np.nan, 2, 0]|$ and point2 $\verb|[5, 4, 1, 5]|$

Only col1 is missing (in point1), so 3 of the 4 columns are used:

$\verb|weight| = \dfrac{4}{3}$

$\verb|dist(r0, r1)| = \sqrt{\dfrac{4}{3} \times [(3-5)^2 + (2-1)^2 + (0-5)^2]} = \sqrt{\dfrac{4}{3} \times [4+1+25]} = \sqrt{40} \approx 6.325$
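
The two hand calculations above can be checked with a few lines of NumPy. This is a minimal sketch of the formula, not scikit-learn's implementation; the helper name `nan_euclidean` is made up for illustration.

```python
import numpy as np

def nan_euclidean(a, b):
    """Hypothetical helper: Euclidean distance that skips NaN coordinates."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    present = ~(np.isnan(a) | np.isnan(b))   # coordinates present in both points
    if not present.any():
        return float("nan")                  # no shared coordinates: undefined
    weight = a.size / present.sum()          # total columns / usable columns
    squared = (a[present] - b[present]) ** 2
    return float(np.sqrt(weight * squared.sum()))

r0 = [3, np.nan, 2, 0]
r1 = [5, 4, 1, 5]
r3 = [6, 5, 4, np.nan]

d03 = nan_euclidean(r0, r3)   # sqrt(2 * 13) = sqrt(26)
d01 = nan_euclidean(r0, r1)   # sqrt(4/3 * 30) = sqrt(40)
print(d03, d01)
```

The boolean mask `present` plays the role of "columns with values present in both rows" from the weight formula.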



In [2]:
from sklearn.metrics.pairwise import nan_euclidean_distances
import numpy as np

X = np.array([[3, np.nan, 2, 0],
              [5, 4, 1, 5],
              [6, 7, 7, 5],
              [6, 5, 4, np.nan]])
nan_euclidean_distances(X[[0]], X[[3]])  # distance between row 0 and row 3 of X

array([[5.09901951]])

In [3]:
nan_euclidean_distances(X[[0]], X[[1]])

array([[6.32455532]])

In [None]:
nan_euclidean_distances(X, X)

array([[0.        , 6.32455532, 8.86942313, 5.09901951],
       [6.32455532, 0.        , 6.78232998, 3.82970843],
       [8.86942313, 6.78232998, 0.        , 4.163332  ],
       [5.09901951, 3.82970843, 4.163332  , 0.        ]])

How do we impute a missing value from its neighbors' values?
- KNNImputer(n_neighbors=2, weights="uniform")
  - Choose the 2 nearest neighbors (by nan_euclidean distance) that have a value for the missing feature, and take the mean of those values.
  - Taking the plain mean gives every neighbor the same (uniform) weight.

For the missing value $X_{01}$ in r0, the two nearest rows are r1 (distance $\approx 6.325$) and r3 (distance $\approx 5.099$), as computed above:

$X_{01} = \dfrac{X_{11} + X_{31}}{2} = \dfrac{4 + 5}{2} = 4.5$

This is the same as a weighted mean with $W_1 = 1$ and $W_2 = 1$:

$X_{01} = \dfrac{W_1 X_{11} + W_2 X_{31}}{W_1 + W_2}$

What happens if the weights ($W_1, W_2$) are not uniform but based on distance?

- KNNImputer(n_neighbors=2, weights="distance")
  - Weight each neighbor by the inverse of its distance, so closer neighbors of a query point have a greater influence than neighbors that are farther away.

$W_1 = \dfrac{1}{6.325}$ (for r1), $W_2 = \dfrac{1}{5.099}$ (for r3)

$X_{01} = \dfrac{W_1 X_{11} + W_2 X_{31}}{W_1 + W_2} = \dfrac{4 W_1 + 5 W_2}{W_1 + W_2} \approx 4.554$
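
The uniform and distance-weighted estimates can be reproduced by hand with NumPy. This is a minimal sketch of the arithmetic only, assuming the two neighbors and their distances from the worked examples; it does not search for neighbors itself, and it assumes no neighbor sits at distance 0.

```python
import numpy as np

# col1 values of the two nearest neighbors of r0, and their nan_euclidean
# distances to r0 (sqrt(40) and sqrt(26) from the worked examples).
neighbor_values = np.array([4.0, 5.0])
neighbor_dists = np.array([np.sqrt(40.0), np.sqrt(26.0)])

# weights="uniform": every neighbor gets weight 1, i.e. a plain mean.
uniform_weights = np.ones_like(neighbor_values)
uniform_estimate = np.average(neighbor_values, weights=uniform_weights)

# weights="distance": each neighbor is weighted by 1 / distance
# (assumes no distance is exactly 0).
inverse_weights = 1.0 / neighbor_dists
distance_estimate = np.average(neighbor_values, weights=inverse_weights)

print(uniform_estimate)
print(round(float(distance_estimate), 4))
```

The closer neighbor (value 5) pulls the distance-weighted estimate slightly above the plain mean of 4.5.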



In [None]:
#@title KNNImputer(n_neighbors=2, weights="uniform")

import numpy as np
from sklearn.impute import KNNImputer
# Same rows as in the nan_euclidean example, with the last two rows swapped;
# row order does not change the imputed values.
X = np.array([[3, np.nan, 2, 0],
              [5, 4, 1, 5],
              [6, 5, 4, np.nan],
              [6, 7, 7, 5]])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
# Note: this `weights` parameter is unrelated to the weight used inside the nan_euclidean distance.

imputer.fit_transform(X)

array([[3. , 4.5, 2. , 0. ],
       [5. , 4. , 1. , 5. ],
       [6. , 5. , 4. , 5. ],
       [6. , 7. , 7. , 5. ]])

In [None]:
#@title KNNImputer(n_neighbors=2, weights="distance")
import numpy as np
from sklearn.impute import KNNImputer
X = np.array([[3, np.nan, 2, 0],
              [5, 4, 1, 5],
              [6, 5, 4, np.nan],
              [6, 7, 7, 5]])

imputer = KNNImputer(n_neighbors=2, weights="distance")
# "distance": weight points by the inverse of their distance, so closer
# neighbors of a query point have a greater influence than neighbors
# that are farther away.
imputer.fit_transform(X)

array([[3.        , 4.55364064, 2.        , 0.        ],
       [5.        , 4.        , 1.        , 5.        ],
       [6.        , 5.        , 4.        , 5.        ],
       [6.        , 7.        , 7.        , 5.        ]])

# NearestNeighbors
`KNeighborsClassifier` relies on the same neighbor search that the `NearestNeighbors` class exposes to compute distances.


In [None]:
#@title Question: find the distance from datapoint = [[0, 0, 1.3]] to every point in X.
import numpy as np
from sklearn.neighbors import NearestNeighbors
datapoint = [[0, 0, 1.3]]

X = [[0, 0, 2],
    [1, 0, 0],
    [0, 0, 1]]

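# radius is only used by radius_neighbors(); kneighbors() ignores it.
neigh = NearestNeighbors(n_neighbors=2, radius=0.4, metric="euclidean")
neigh.fit(X)
# n_neighbors=3 overrides the constructor value of 2, so all three points are returned.
distances, indices = neigh.kneighbors(datapoint, n_neighbors=3)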
print(distances)
print(indices)

[[0.3        0.7        1.64012195]]
[[2 0 1]]
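
The distances and indices printed above can be recomputed directly. This is a minimal NumPy sketch of a brute-force Euclidean search over the same three points; it only illustrates the arithmetic and is not how `NearestNeighbors` is implemented internally.

```python
import numpy as np

X = np.array([[0, 0, 2],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)
query = np.array([0, 0, 1.3])

# Euclidean distance from the query to every row of X.
dists = np.linalg.norm(X - query, axis=1)

# Sort rows from nearest to farthest, the order kneighbors reports them in.
order = np.argsort(dists)
print(dists[order])
print(order)
```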


# KNN

- It is a non-parametric algorithm.
- It is a supervised ML algorithm.
- K = number of neighbors considered for voting (`n_neighbors`).
- How is the label of an unknown datapoint decided?
  - Regression: the mean of the neighbors' target values.
  - Classification: the majority class among the neighbors.

- As K increases, the decision boundary becomes smoother and the model tends to underfit.

- As K decreases, the decision boundary becomes more complex and the model tends to overfit.

- KNN cannot be used when labels are not known, because the neighbors have nothing to vote with.

- Scale the features before fitting, because the model computes distances between points internally and a feature with a large range dominates them.


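## Issues with KNN
- With a brute-force search, every prediction computes the distance from the query to every training point.
- Prediction is therefore computationally expensive.
- No model is learned during fitting.
- The algorithm has to store all training points in order to predict new datapoints.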
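
The points above can be seen in a small brute-force sketch. The class `BruteForceKNN` and its toy data are hypothetical and written only for illustration; scikit-learn's estimators are implemented differently (for example, they can use tree-based searches and have their own tie-breaking rules).

```python
from collections import Counter
import numpy as np

class BruteForceKNN:
    """Illustrative sketch only; not how scikit-learn is implemented."""

    def __init__(self, k, task="classification"):
        self.k = k
        self.task = task

    def fit(self, X, y):
        # "Fitting" only memorizes the training data; nothing is learned.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict_one(self, query):
        # One distance per stored training point for every single query.
        diffs = self.X_ - np.asarray(query, dtype=float)
        dists = np.linalg.norm(diffs, axis=1)
        nearest = np.argsort(dists, kind="stable")[: self.k]
        targets = self.y_[nearest]
        if self.task == "regression":
            return float(targets.mean())                       # mean of neighbor targets
        return Counter(targets.tolist()).most_common(1)[0][0]  # majority vote

# Toy one-feature data, made up for this example.
X_toy = [[1.0], [2.0], [3.0], [10.0], [11.0]]
clf = BruteForceKNN(k=3).fit(X_toy, [0, 0, 0, 1, 1])
reg = BruteForceKNN(k=2, task="regression").fit(X_toy, [1.0, 2.0, 3.0, 10.0, 11.0])

label = clf.predict_one([2.5])    # three nearest points all have class 0
value = reg.predict_one([10.4])   # nearest targets are 10 and 11
print(label, value)
```

`fit` does no computation at all, which is why all of the cost and all of the memory use show up at prediction time.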


In [None]:
from sklearn.neighbors import KNeighborsClassifier
X_train = [[1,100],[4,400],[5,500],[6,600],[8,800],[9,900],[11,1100],[12,1200], [15,1500], [18,1800],[19,1900]]
y_train = [0,0,1,1,1,2,2,2,2,2,2]

X_test = [[2,200]]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean", weights="uniform")
knn.fit(X_train,y_train)

print(knn.predict(X_test))

[0]


In [None]:
knn.kneighbors([[1,100]]) # finding the neighbors of the first training point

(array([[  0.        , 300.01499963, 400.0199995 ]]), array([[0, 1, 2]]))
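
The distances returned above are almost exactly the gaps in the second feature (300 and 400), because that feature is 100 times larger than the first. A minimal sketch of what scaling changes, using plain NumPy standardization (one possible choice of scaler, assumed here for illustration):

```python
import numpy as np

X_train = np.array([[1, 100], [4, 400], [5, 500], [6, 600], [8, 800], [9, 900],
                    [11, 1100], [12, 1200], [15, 1500], [18, 1800], [19, 1900]],
                   dtype=float)

# Raw distance between the first two training points:
# sqrt(3**2 + 300**2), dominated by the second feature.
raw = np.linalg.norm(X_train[0] - X_train[1])

# Standardize each column to zero mean and unit variance.
scaled = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# After scaling, each feature contributes its own share to the gap.
per_feature_gap = np.abs(scaled[0] - scaled[1])
print(raw)
print(per_feature_gap)
```

In this particular dataset the second feature is exactly 100 times the first, so after standardization both features contribute equally.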

In [None]:
from sklearn.neighbors import KNeighborsClassifier
X_train = [[1,100],[4,400],[5,500],[6,600],[8,800],[9,900],[11,1100],[12,1200], [15,1500], [18,1800],[19,1900]]
y_train = [0,0,1,1,1,2,2,2,2,2,2]

X_test = [[2,200]]

# K equals the training-set size, so every training point votes and the most frequent class wins.
knn = KNeighborsClassifier(n_neighbors=len(y_train), metric="euclidean", weights="uniform")
knn.fit(X_train,y_train)

print(knn.predict(X_test))

[2]
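
The prediction of class 2 above does not depend on the test point at all. A short sketch of why, counting the training labels with the standard library (with uniform weights and K equal to the training-set size, the vote is just the label frequencies):

```python
from collections import Counter

y_train = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]

# Every training point is a "neighbor" of every query, so the vote
# is simply the class frequency in the training set.
votes = Counter(y_train)
majority_class, count = votes.most_common(1)[0]
print(votes)
print(majority_class, count)
```

This is the extreme case of underfitting with a large K: the model returns the majority class for any input.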
