# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
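As a point of comparison with what follows, here is a minimal sketch of scaling values into the range 0 to 1 (often called min-max scaling). The array `data` and its values are made up for illustration and are not part of the lab.

```python
import numpy as np

# Hypothetical toy data with values between 0 and 5,000
data = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: the smallest value maps to 0 and the largest maps to 1
scaled = (data - data.min()) / (data.max() - data.min())

print(scaled)
```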

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization also rescales the data, but instead of mapping the values to the range 0 to 1, it centers them in a small interval around zero. For example, if we have a dataset with values between 0 and 5,000, after mean normalization the values will lie in a small range around 0, for example between -3 and 3. Because the mean is subtracted from the values, the average (mean) of all elements will be zero. Therefore, when you perform *mean normalization* your data will not only be scaled, it will also have an average of zero.

# To Do

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1,000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below.

In [1]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0,5001,size=(1000,20))

# Print X
print(X)

[[4526 2056 1094 ...,  115 1190 1516]
 [1412  540 2770 ..., 4075  749 1324]
 [3181 1468 4267 ..., 1441 1201  780]
 ..., 
 [2117 2048 3613 ..., 2886 1797 4060]
 [ 604 3623  115 ..., 1841 3300 3851]
 [2066 1832 3219 ..., 3718  873 3134]]


Now that you created the array we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.
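
To see why this produces an average of zero, a short derivation may help. Here $m$ (the number of rows of $X$) and $x_{ji}$ (the entry in row $j$ of column $i$) are notation introduced only for this sketch:

$\frac{1}{m}\sum_{j=1}^{m}\frac{x_{ji} - \mu_i}{\sigma_i} = \frac{1}{\sigma_i}\left(\frac{1}{m}\sum_{j=1}^{m} x_{ji} - \mu_i\right) = \frac{1}{\sigma_i}\left(\mu_i - \mu_i\right) = 0$

Since every normalized column averages to zero, so does the whole normalized array. Dividing by $\sigma_i$ also gives each normalized column a standard deviation of 1.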

In [2]:
# Average of the values in each column of X
ave_cols = X.mean(axis=0)
# Standard Deviation of the values in each column of X
std_cols = X.std(axis=0)

If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [3]:
# Print ave_cols (one average per column of X)
print(ave_cols)

# Print std_cols (one standard deviation per column of X)
print(std_cols)

[ 2558.884  2456.949  2491.768  2539.097  2594.411  2540.38   2463.434
  2516.378  2524.338  2465.324  2463.329  2505.962  2495.657  2488.817
  2489.638  2489.288  2466.687  2593.265  2420.309  2484.501]
[ 1458.14610535  1438.37460712  1462.71377452  1424.46610967  1430.90833112
  1422.55679451  1437.44824312  1376.93559294  1445.63865048  1453.03107504
  1458.69616191  1433.19234807  1441.45483986  1428.95206061  1441.20110427
  1458.50085124  1432.52430172  1406.83820917  1457.61184734  1451.89710448]
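
The shapes themselves can also be checked directly. The sketch below is self-contained, so it builds its own stand-in for `X`; the name `ave_rows` is introduced here only to contrast `axis=0` (one value per column) with `axis=1` (one value per row).

```python
import numpy as np

# Stand-in for the lab's X: 1000 rows and 20 columns of random integers in [0, 5001)
X = np.random.randint(0, 5001, size=(1000, 20))

ave_cols = X.mean(axis=0)   # collapse the rows: one average per column
std_cols = X.std(axis=0)    # one standard deviation per column
ave_rows = X.mean(axis=1)   # for contrast: one average per row

print(ave_cols.shape)   # (20,)
print(std_cols.shape)   # (20,)
print(ave_rows.shape)   # (1000,)
```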


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below.

In [4]:
# Mean normalize X
X_norm = (X-ave_cols)/std_cols
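
To make the broadcasting step easier to follow, here is a small sketch on a 3 x 2 array. The names `X_small`, `col_means`, and `col_stds` are illustrative only. A vector of shape `(2,)` is stretched across all three rows, so each column is shifted and scaled by its own statistics.

```python
import numpy as np

# Hypothetical 3 x 2 array whose columns are on very different scales
X_small = np.array([[1.0, 10.0],
                    [3.0, 30.0],
                    [5.0, 50.0]])

col_means = X_small.mean(axis=0)   # shape (2,)
col_stds = X_small.std(axis=0)     # shape (2,)

# Shapes (3, 2) and (2,) broadcast: the (2,) vectors are applied to every row
X_small_norm = (X_small - col_means) / col_stds

print(X_small_norm)
print(X_small_norm.mean(axis=0))   # each column now averages to (approximately) zero
```

After normalization both columns hold the same values, because they differed only in scale.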

If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should lie in some small interval around zero. You can verify this by filling in the code below:

In [5]:
# Print X_norm
print(X_norm)

# Print the smallest of the column averages of X_norm (should be close to zero)
print(X_norm.mean(axis=0).min(axis=0))

# Print the same quantity a second time (the output below repeats the value)
print(X_norm.mean(axis=0).min(axis=0))

[[ 1.34905274 -0.27875144 -0.95559912 ..., -1.76158494 -0.84405804
  -0.66705898]
 [-0.78653572 -1.33271888  0.1902163  ...,  1.05323767 -1.14660772
  -0.79929976]
 [ 0.4266486  -0.6875462   1.21365645 ..., -0.81904585 -0.83651145
  -1.17398195]
 ..., 
 [-0.30304508 -0.28431328  0.76654231 ...,  0.20808007 -0.42762345
   1.08513131]
 [-1.34066401  0.81067268 -1.624903   ..., -0.53472034  0.60351526
   0.94118171]
 [-0.338021   -0.43448278  0.49717998 ...,  0.79947715 -1.06153706
   0.44734506]]
-1.5687451338e-16
-1.5687451338e-16


You should note that since $X$ was created using random integers, the above values will vary. 
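
For a more direct check of both claims (an overall average near zero, and values confined to a small interval around zero), one possible sketch is shown below. It is self-contained, so it builds its own data; `rng`, `X_demo`, `X_demo_norm`, and the seed `0` are arbitrary choices made for this example rather than part of the lab.

```python
import numpy as np

# Seeded generator so the sketch is repeatable; the seed is an arbitrary choice
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5001, size=(1000, 20))

# Mean normalize each column using broadcasting
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Average of all the values (zero up to floating-point rounding)
print(X_demo_norm.mean())

# Average of the minimum value in each column
print(X_demo_norm.min(axis=0).mean())

# Average of the maximum value in each column
print(X_demo_norm.max(axis=0).mean())
```

For uniformly distributed integers like these, the last two values should land at roughly -1.7 and 1.7, which shows how narrow the interval around zero is.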

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 
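
The arithmetic behind a 60/20/20 split can be sketched with integer division. The helper `split_sizes` is a hypothetical name introduced for this example. When the number of rows does not divide evenly, this particular choice lets the last set absorb the remainder.

```python
def split_sizes(n_rows):
    """Return (training, cross validation, test) sizes for a 60/20/20 split."""
    train_end = (n_rows * 60) // 100      # index where the first 60% ends
    crossval_end = (n_rows * 80) // 100   # index where the next 20% ends
    return train_end, crossval_end - train_end, n_rows - crossval_end

print(split_sizes(1000))   # (600, 200, 200)
print(split_sizes(1003))   # (601, 201, 201)
```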

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [6]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([0, 2, 4, 3, 1])
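
The property that matters for the split is that a permutation contains every index exactly once. A quick way to convince yourself of this, sketched here with the illustrative name `perm`, is to sort the result:

```python
import numpy as np

# The order changes from run to run
perm = np.random.permutation(5)
print(perm)

# Sorting recovers 0..4, so no index is missing or repeated
print(np.sort(perm))   # [0 1 2 3 4]
```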

# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that the `shape` attribute returns a tuple with two numbers in the form `(rows, columns)`.

In [7]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

[164 237  22 776 266 566 772 486  38  57 955  16 867 669 194  91 870 982
 705 540 925 578 661 395 740 227 532 734 733 857 709 523 561 359 556 101
  56 520 349  51 758 746 681 953 392 220 766 823 608 168 116 978 338 449
 174 492 113 484 162 854 300 883 915 286 213 551 205 747 437   8 429 757
 398  76 531 507 534 462 257 801 821 761 602 616 620 639 474 722  95 452
 769   2 380  24 994 223 331  81 876 775 549 765 671 827 730 641 767 535
 718 966 606 938 828 753   7 687 530 362 654 989 542 179 849 863 708  47
 559 478 424 726 406 252  39 421 146 371 509 195 472 664 198 910 784 517
 850  46 923 685 862 438 508 751 706 166 513 841 441 872 181 975 344 296
 260  99 526 467 185 156 936 754 297 952 676 545 450 434 313 629 468 896
 907 625 280 400 999 918 414 868  80 929 279  44 553 678 372 787 803 145
 563 673 518 833  64 397 892 482  49 110 595 126 725 554 628 969 604 976
 212 786 873 222 259  68 495 491 290 521 586 642 458 697 215 649 264 160
 701 281 737 968 605 694 417  45 768 107 519 139 29

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below.

In [8]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.
n_rows = row_indices.shape[0]
train_end = (n_rows * 60) // 100      # end of the first 60% of the shuffled rows
crossval_end = (n_rows * 80) // 100   # end of the next 20%
training_idx = row_indices[:train_end]
crossVal_idx = row_indices[train_end:crossval_end]
test_idx = row_indices[crossval_end:]

# Create a Training Set
X_train = X_norm[training_idx,:]

# Create a Cross Validation Set
X_crossVal = X_norm[crossVal_idx,:]

# Create a Test Set
X_test = X_norm[test_idx,:]

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [9]:
# Print the shape of X_train
print(X_train.shape)

# Print the shape of X_crossVal
print(X_crossVal.shape)

# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
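
Beyond the shapes, it can be worth confirming that the three sets share no rows and together cover every row. The sketch below does this at the level of the indices. It uses `np.split` as an alternative to slicing by hand; the names `train_idx`, `crossval_idx`, and `combined` are chosen for this example and are not the lab's variables.

```python
import numpy as np

n_rows = 1000
row_indices = np.random.permutation(n_rows)

# Cut the shuffled indices at positions 600 and 800, giving 60% / 20% / 20%
train_idx, crossval_idx, test_idx = np.split(row_indices, [600, 800])

print(len(train_idx), len(crossval_idx), len(test_idx))   # 600 200 200

# Every index 0..999 appears exactly once across the three pieces
combined = np.concatenate([train_idx, crossval_idx, test_idx])
print(np.array_equal(np.sort(combined), np.arange(n_rows)))   # True
```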
