# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
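As a small illustration of the rescaling described above, the sketch below maps a handful of values between 0 and 5,000 onto the range 0 to 1. It uses min-max scaling, which is only one possible way of doing this and is not the method used in the rest of this lab. The array `values` is a hypothetical sample made up for the example.

```python
import numpy as np

# Hypothetical sample values in the range 0 to 5,000 (not part of the lab data).
values = np.array([0, 1250, 2500, 5000], dtype=float)

# Min-max scaling: shift by the minimum, then divide by the range.
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls in between in proportion.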

In this lab you will be performing a particular form of feature scaling known as *mean normalization*. Mean normalization will not only scale the data but will also ensure your data has zero mean. 

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below.

In [18]:
    # import NumPy into Python
    import numpy as np

    # Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
    X = np.random.randint(0,5001,(1000,20))

    # print the shape of X
    print(X.shape)

(1000, 20)


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. (Scaling by the standard deviation in this way is also commonly called standardization or z-score normalization.) In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [44]:
# Average of the values in each column of X
ave_cols = X.mean(axis=0)

# Standard Deviation of the values in each column of X
std_cols = X.std(axis=0)

print(ave_cols)
print(std_cols)

[ 2469.267  2504.163  2516.322  2489.592  2544.142  2490.609  2436.443
  2532.075  2508.505  2505.744  2520.357  2505.058  2566.631  2566.421
  2606.811  2533.552  2540.215  2539.577  2550.021  2553.551]
[ 1432.03130333  1445.46658226  1417.63464134  1471.1847068   1443.98424847
  1406.727029    1462.0728931   1425.80429561  1473.91941231  1454.08420818
  1423.83551352  1436.67337855  1440.10943433  1444.68565984  1423.41804516
  1463.2847335   1471.5283126   1476.2768013   1438.52924494  1445.23068449]
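To see what `axis=0` does, and which standard deviation NumPy computes, it can help to repeat the calculation on an array small enough to check by hand. The sketch below uses a hypothetical 3 x 2 array `A` that is not part of the lab. Note that `std()` uses `ddof=0` by default, i.e. it divides by the number of values rather than by one less than that number.

```python
import numpy as np

# Hypothetical 3 x 2 array, small enough to verify by hand.
A = np.array([[1, 10],
              [2, 20],
              [3, 30]])

# axis=0 collapses the rows, leaving one statistic per column.
col_means = A.mean(axis=0)

# Population standard deviation (ddof=0 is the NumPy default).
col_stds = A.std(axis=0)

print(col_means)
print(col_stds)
```

For the first column the mean is 2 and the squared deviations are 1, 0, and 1, so the standard deviation is $\sqrt{2/3} \approx 0.816$. The second column is ten times the first, so its standard deviation is ten times larger.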


If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [51]:
# Print the shape of ave_cols
print(ave_cols.shape)

# Print the shape of std_cols
print(std_cols.shape)

(20,)
(20,)


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below.

In [52]:
# Mean normalize X
X_norm = (X - ave_cols)/(std_cols)
print(X_norm)

[[-0.75017005  0.7498181  -1.75667406 ..., -0.17041316 -1.14562912
   1.38140507]
 [-1.01133753  1.22094625  0.78841047 ...,  0.76911254  0.5540235
  -1.59320655]
 [ 1.50676385  0.29944449  1.34497137 ...,  1.66460856  0.42750539
   1.48934632]
 ..., 
 [-1.57207946 -0.89048271  1.22999111 ..., -1.38834194  1.70241864
  -1.18358337]
 [ 0.06545457  1.07566444  1.13476206 ...,  0.1953719   1.42922294
   1.57998929]
 [ 0.78820414  0.82453445  0.21633077 ..., -1.47165965 -1.63501785
   0.31513931]]
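Broadcasting is what allows a `(1000, 20)` array to be combined with a `(20,)` vector: the vector is applied to every row, so each column is paired with its own mean and standard deviation. The sketch below checks this on a hypothetical 4 x 3 array by comparing the one-line broadcast version against an explicit loop over columns. The names `A`, `mu`, `sigma`, and `A_loop` are made up for the example.

```python
import numpy as np

# Hypothetical small dataset standing in for X (4 rows, 3 columns).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])

mu = A.mean(axis=0)     # shape (3,)
sigma = A.std(axis=0)   # shape (3,)

# Broadcasting: the (3,) vectors are applied to each of the 4 rows.
A_norm = (A - mu) / sigma

# Equivalent explicit loop over columns, written out for comparison only.
A_loop = np.empty_like(A)
for j in range(A.shape[1]):
    A_loop[:, j] = (A[:, j] - mu[j]) / sigma[j]

# Both approaches should agree.
print(np.allclose(A_norm, A_loop))
```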


If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero. You can verify this by filling in the code below:

In [46]:
# Print the average of all the values of X_norm
print(X_norm.mean())
# Print the minimum value over all the elements of X_norm
print(X_norm.min())
# Print the maximum value over all the elements of X_norm
print(X_norm.max())

-1.7763568394e-17
-1.82434879819
1.77816374352


You should note that since $X$ was created using random integers, the above values will vary from run to run.
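A stricter check than the overall average is to look at each column separately: after mean normalization every column should have a mean close to 0 and a standard deviation close to 1. The sketch below is one way to test this with `np.allclose`. It builds its own array `X_demo` with a seeded generator so that it runs on its own; the seed and the `_demo` names are arbitrary choices made for the example.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

# Stand-in for X: 1000 x 20 random integers in the half-open interval [0, 5001).
X_demo = rng.integers(0, 5001, size=(1000, 20))

# Mean normalize each column, as in the lab.
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Each column should have mean ~0 and standard deviation ~1,
# up to floating-point rounding.
print(np.allclose(X_demo_norm.mean(axis=0), 0.0))
print(np.allclose(X_demo_norm.std(axis=0), 1.0))
```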

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [31]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([4, 2, 0, 1, 3])
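The property that matters for splitting a dataset is that a permutation contains every index exactly once, in shuffled order. The sketch below checks this by sorting the permutation and comparing it with `np.arange`. The seed is an arbitrary choice that only makes the run repeatable, and `perm` is a name made up for the example.

```python
import numpy as np

# Fix the seed so the same permutation is produced on every run (arbitrary seed).
np.random.seed(0)

# Random permutation of the integers 0 to 9.
perm = np.random.permutation(10)
print(perm)

# Sorting the permutation recovers 0, 1, ..., 9: no index is missing or repeated.
print(np.array_equal(np.sort(perm), np.arange(10)))
```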

# To Do:

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that for a rank 2 ndarray the `shape` attribute returns a tuple with two numbers in the form `(rows, columns)`.

In [47]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

[478 866 297 302 259 865 665 325 390 854 764 350  32  55 609 774 796 659
 956  16 747 668 971 521 715  10 294 180  81 880  67 481 182 516 303 908
 889 975 944 684 148 293  62 982 838 890 398 289 231 938 682 800  84 919
 475 119 286 375 438 596 681 476 308 191  20 696 674 978 730 169 327 737
 531 256 204 384 504 721 503 726 431 693 522 614 959 123 490 955 572  54
 125  74 357 917 802 650 797 427 290 200  75 485 722 645 586 318 117 587
 935 206 320 604 921 356 246 226 328 165 567  14 285 430 906 468 718 501
 406 862 397 887 317 520 512 219 902 874 757 957 703 823 915  37 595 815
 498  36 425 989 994 273 822 292 275  52 810 699 966 963 792 776 885 227
 288 383 793  80 486 576 998 859  29 829 344 798 635 637 871 181 939 210
 103 948 616  87 336 949 891 404 638 643 385 864 333 653 552 579 831 716
 897 134 163 634   1 568 598 469 585 381 652 773 373 451 669 314 844 464
 918 139 845 909 361  40 980 709  45 525 313  99 713 711 420 126 870  94
 137 371 496 452 435 177 622 954 362 754 769 284 82

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below.

In [63]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.

# Create a Training Set
X_train = X_norm[row_indices[:600],:]

# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[600:800],:]

# Create a Test Set
X_test = X_norm[row_indices[800:],:]

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [64]:
# Print the shape of X_train
print(X_train.shape)
# Print the shape of X_crossVal
print(X_crossVal.shape)
# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
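The slices above hard-code the boundaries 600 and 800, which only works for a dataset with exactly 1000 rows. A possible generalization, sketched below, derives the boundaries from the number of rows and uses `np.split` to cut the shuffled indices in one step. The array `data` is a hypothetical stand-in for `X_norm`, and all other names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed for reproducibility
data = rng.standard_normal((1000, 20))    # hypothetical stand-in for X_norm

n_rows = data.shape[0]
indices = rng.permutation(n_rows)

# 60% / 20% / 20% boundaries, computed with integer arithmetic.
train_end = n_rows * 60 // 100
cv_end = n_rows * 80 // 100

# Cut the shuffled indices at the two boundaries.
train_idx, cv_idx, test_idx = np.split(indices, [train_end, cv_end])

train, cv, test = data[train_idx], data[cv_idx], data[test_idx]
print(train.shape, cv.shape, test.shape)

# Every row index lands in exactly one of the three sets.
all_idx = np.concatenate([train_idx, cv_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(n_rows)))
```

With 1000 rows this reproduces the 600 / 200 / 200 split used in the lab. For row counts that are not multiples of 5, the integer division rounds the boundaries down, so the Test Set absorbs any remainder.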
