# Data Normalization

## Learning objectives
*After completing this activity, you should be able to:*
- discuss the importance of normalization and scaling
- implement *min-max scaling* in Python
- utilize scikit-learn to perform normalization

## Groups
You may optionally work in groups for this assignment.  The maximum size of a group is 2.


Please list the full names of all group members below:
- Tommy Lane
- person B


## Background
The features in your data can differ greatly in magnitude, units, and range. The similarity and dissimilarity measures that ML algorithms employ are not aware of these differences and, depending on the technique applied, can produce misleading results if the data is not preprocessed with care.
The matrix *X* (shown below) has 2 columns: column 0 represents the feature *age* and column 1 represents the feature *salary*.  Each row represents one data point/person/observation.

$
\begin{bmatrix} 
23   &  56000 \\
35   &  75000 \\ 
55   &  76000 \\
\end{bmatrix}
$

Consider a single data example, *p*, where p = [39, 75750].


### <span style="color:red">Question #1</span>
Write code to create a numpy matrix *X* and vector *p*
(represent *p* still as a matrix with one row if you need to) 
with this data.


In [11]:
## Your Code here.

import numpy as np

## declare X with values as shown above 
# declare p with values as shown above
X = np.matrix([[23, 56000],[35, 75000],[55, 76000]])
p = np.matrix([[39, 75750]])

### <span style="color:red">Question #2</span>
Without using the computer and just using your thoughts,
which row/observation in matrix *X* is the most similar 
to the observation represented by vector *p*? 
Think about what the columns are encoding and explain your answer 
(do not do any distance calculations, imagine you were not a machine 
 learning student or a computer scientist).


The row that is the most similar would be the second row because the age and salary are closest to the age and salary of p.

### <span style="color:red">Question #3</span>

Compute the distance between *p* and each example in *X* using 
the Euclidean distance. You can use the [sklearn.metrics.pairwise_distances_argmin_min function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances_argmin_min.html), 
which returns a tuple of two arrays: the index of the nearest point and the distance to that point. 

Store the result of the function in a variable named **p_closest**. Which vector (row number, starting from 0) in X did this method identify as the closest to p?

In [12]:
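import sklearn.metrics

# np.asarray converts the np.matrix objects to plain arrays,
# which recent scikit-learn versions require
p_closest = sklearn.metrics.pairwise_distances_argmin_min(np.asarray(p), np.asarray(X))
print(p_closest)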

(array([2], dtype=int64), array([250.51147678]))




### <span style="color:red">Question #4</span>
Which point was the closest by distance?  Is this the same point you identified in Question 2?


The closest point by distance was row 2 (the third row: age 55, salary 76000). This is not the same point that I identified in Question 2.
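
One way to see why row 2 wins is to split the squared Euclidean distance into its per-feature parts. The sketch below assumes plain NumPy arrays (`X_arr` and `p_arr` are illustrative names, not part of the assignment) and reports what share of each squared distance comes from salary.

```python
import numpy as np

# Same data as above, as plain float arrays (names are illustrative)
X_arr = np.array([[23, 56000], [35, 75000], [55, 76000]], dtype=float)
p_arr = np.array([39, 75750], dtype=float)

# Squared difference per feature: column 0 is age, column 1 is salary
sq_diff = (X_arr - p_arr) ** 2

# Euclidean distance from p to each row of X
dist = np.sqrt(sq_diff.sum(axis=1))

# Fraction of each squared distance contributed by the salary column
salary_share = sq_diff[:, 1] / sq_diff.sum(axis=1)

print(dist)
print(salary_share)
```

On this data, salary accounts for more than 99% of every squared distance, so age has almost no influence on which row is reported as nearest.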

## Different Scales in the Feature Space

To compensate for the differences in magnitude, scale, units, etc.,
it is important to **normalize** the data. This prevents features with larger scales from dominating the distance calculation. 
The formula below transforms the data using 
[min-max normalization](https://en.wikipedia.org/wiki/Feature_scaling).

$
x'_i = \frac{x_i - \min(X_i)}{\max(X_i) - \min(X_i)}
$

Here $\max(X_i)$ is the maximum value of column $i$ (and likewise for $\min(X_i)$), $x_i$ is the original value in a specific row of column $i$,
and $x'_i$ is the scaled value.

### <span style="color:red">Question #5</span>
Write Python code in the cell below that creates a new matrix, X_norm, that represents a scaled version of 
*X* using the formula shown above.  Please note:
- *do* use vectorized numpy operations
- **do NOT** use scikit learn or other normalization packages for these operations


In [86]:

# Min-max scale column 0 (age)
top = np.subtract(X[:,0], np.min(X[:,0]))
bottom = np.subtract(np.max(X[:,0]), np.min(X[:,0]))
col1 = np.divide(top, bottom)

# Min-max scale column 1 (salary)
top1 = np.subtract(X[:,1], np.min(X[:,1]))
bottom1 = np.subtract(np.max(X[:,1]), np.min(X[:,1]))
col2 = np.divide(top1, bottom1)

X_norm = np.concatenate((col1, col2),axis=1)
print(X_norm)
print("\n")
print("Question 7 answer:")
# Rows 0 and 1 before scaling (x1, x2) and after scaling (y1, y2)
x1 = np.array((X[0,0],X[0,1]))
x2 = np.array((X[1,0],X[1,1]))
y1 = np.array((X_norm[0,0],X_norm[0,1]))
y2 = np.array((X_norm[1,0],X_norm[1,1]))
print(f"Distance before: {np.linalg.norm(x1 - x2):.4f}")
print(f"Distance after:  {np.linalg.norm(y1 - y2):.4f}")
print("Relative distance Column 1 and 2 before:" + " " + str((X[1,0] - X[0,0])/(X[2,0] - X[0, 0])), end = " , ")
print(str((X[1,1] - X[0,1])/(X[2,1] - X[0, 1])))
print("Relative distance Column 1 and 2 after:" + "  " + str((X_norm[1,0] - X_norm[0,0])/(X_norm[2,0] - X_norm[0, 0])), end = " , ")
print(str((X_norm[1,1] - X_norm[0,1])/(X_norm[2,1] - X_norm[0, 1])))

[[0.    0.   ]
 [0.375 0.95 ]
 [1.    1.   ]]


Question 7 answer:
Distance before: 19000.0038
Distance after:  1.0213
Relative distance Column 1 and 2 before: 0.375 , 0.95
Relative distance Column 1 and 2 after:  0.375 , 0.95
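
The column-by-column computation above can also be written once for all columns using NumPy broadcasting. This is a minimal sketch, assuming the data is held in a plain float array (`X_arr`, `col_min`, `col_max`, and `X_norm_alt` are illustrative names); it is one possible formulation, not the required answer.

```python
import numpy as np

# Same data as X, as a plain float array (name is illustrative)
X_arr = np.array([[23, 56000], [35, 75000], [55, 76000]], dtype=float)

# Per-column minimum and maximum, each of shape (2,)
col_min = X_arr.min(axis=0)
col_max = X_arr.max(axis=0)

# Broadcasting applies the min-max formula to every row at once
X_norm_alt = (X_arr - col_min) / (col_max - col_min)
print(X_norm_alt)
```

Note that a column whose values are all identical has `col_max - col_min` equal to zero, so this sketch would divide by zero for such a column.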


### <span style="color:red">Question #6</span>
Review the data in **X_norm**. What is the range of values for each column? 
How does it compare to the original **X** matrix?

The values in both columns range from 0 to 1. They are much smaller numbers than in the X matrix, but they convey the distance between the numbers more clearly.

### <span style="color:red">Question #7</span>
Does the transformation from *X* to **X_norm**
represent a *linear* transform?  Recall that a linear transformation
of distances means that for each dimension, the relative distance between points is maintained after the transformation.  

Show the distance between two points (two rows in matrix) and show the relative distance in each dimension both before and after the transformation.


See the output labeled "Question 7 answer" in the code cell for Question 5 above.
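
As a complementary check, min-max scaling can be written per column as $x' = a x + b$ with $a = 1 / (\max - \min)$ and $b = -\min / (\max - \min)$, which is why relative distances within a column are preserved. The sketch below verifies this on the same data; all variable names are illustrative assumptions.

```python
import numpy as np

X_arr = np.array([[23, 56000], [35, 75000], [55, 76000]], dtype=float)

col_min = X_arr.min(axis=0)
col_range = X_arr.max(axis=0) - col_min
X_scaled = (X_arr - col_min) / col_range

# Per-column slope a and intercept b of x' = a * x + b
a = 1.0 / col_range
b = -col_min / col_range
same_map = np.allclose(X_scaled, a * X_arr + b)

# Gap between rows 1 and 0 relative to the gap between rows 2 and 0
rel_before = (X_arr[1] - X_arr[0]) / (X_arr[2] - X_arr[0])
rel_after = (X_scaled[1] - X_scaled[0]) / (X_scaled[2] - X_scaled[0])

print(same_map)
print(rel_before, rel_after)
```

The intercept cancels whenever two values in the same column are subtracted, and the slope cancels in the ratio, so the relative gaps match before and after scaling.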

## Scikit-learn MinMaxScaler object

For models that are sensitive to scales between features, it is
common practice to scale all the training/testing data. Scikit-learn
provides an object called [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to perform these operations. Here is an example of using this
object on *X*.

The code illustrates that the fit and transform operations can be performed in a single function call.  Fit determines how to scale 
each column (think of the equation from earlier: we need to record
each column's min and max values) and transform computes $x'$ for each
entry in the matrix.  Once the scaling object is *fitted*, we can
then transform additional data as required (illustrated below on the 
vector **p**).  

In [45]:
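# Example code of using scikit-learn's scaler
import numpy as np
import sklearn.preprocessing as skp

skScaler = skp.MinMaxScaler()
# np.asarray converts np.matrix input to a plain array,
# which recent scikit-learn versions require
X_norm_scikit = skScaler.fit_transform(np.asarray(X))
p_min_max = skScaler.transform(np.asarray(p))
print(X_norm_scikit)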

[[0.    0.   ]
 [0.375 0.95 ]
 [1.    1.   ]]
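
To make the split between fitting and transforming more concrete, the sketch below inspects the per-column minimum and maximum that a fitted `MinMaxScaler` stores in its `data_min_` and `data_max_` attributes, then transforms a new observation. The observation `q` is a hypothetical value chosen to lie outside the fitted range; the other names are also illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_arr = np.array([[23, 56000], [35, 75000], [55, 76000]], dtype=float)

# fit() only records the per-column minimum and maximum
scaler = MinMaxScaler().fit(X_arr)
print(scaler.data_min_)
print(scaler.data_max_)

# Hypothetical new observation that is older and better paid than any row in X
q = np.array([[60, 80000]], dtype=float)

# transform() reuses the stored min and max; it does not refit
q_scaled = scaler.transform(q)
print(q_scaled)
```

Because the scaler reuses the minimum and maximum learned from *X*, new data outside that range maps to values outside 0 to 1 with the default settings.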




### <span style="color:red">Question #8</span>
In the code block below, find the nearest point in *X_norm_scikit* to *p_min_max*.  Is the nearest point the same as the one identified in Question 2 and/or 3?  Comment on why the same points were or were not identified and which result might be "better".

In [39]:
p_closest = sklearn.metrics.pairwise_distances_argmin_min(p_min_max,X_norm_scikit)
print(p_closest)

(array([1], dtype=int64), array([0.13050383]))


The nearest point is the same one identified in Question 2. It is better because both the age and the salary are close to those of p, whereas in Question 3 only the salary is closer. 

### <span style="color:red">Question #9</span>
List at least one complication that can arise when applying min-max scaling.

In [44]:
# If a large outlier is added to a column, the remaining values are squeezed
# into a narrow band between 0 and 1, making them much closer together.
Y = np.array([[23, 56000],[35, 75000],[55, 76000],[108, 80000000]])
Y_norm_scikit = skScaler.fit_transform(Y)
print(Y_norm_scikit)

[[0.00000000e+00 0.00000000e+00]
 [1.41176471e-01 2.37666366e-04]
 [3.76470588e-01 2.50175123e-04]
 [1.00000000e+00 1.00000000e+00]]
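
The compression caused by the outlier can be quantified by measuring how much of the 0 to 1 interval the three original rows still occupy in each column. This is a minimal sketch using plain NumPy; `Y_arr`, `Y_scaled`, and `spread` are illustrative names.

```python
import numpy as np

# The three original rows plus the outlier row from the cell above
Y_arr = np.array(
    [[23, 56000], [35, 75000], [55, 76000], [108, 80000000]], dtype=float
)

y_min = Y_arr.min(axis=0)
Y_scaled = (Y_arr - y_min) / (Y_arr.max(axis=0) - y_min)

# Width of the interval covered by the first three rows in each scaled column
spread = Y_scaled[:3].max(axis=0) - Y_scaled[:3].min(axis=0)
print(spread)
```

Without the outlier, the three original rows span the full interval in both columns. With it, they cover less than 40% of the age axis and well under 0.1% of the salary axis.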


