# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
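As a rough illustration of the 0-to-1 rescaling described above, here is a minimal sketch of min-max scaling. The array `data` and its values are made up for the example and are not part of the lab's dataset.

```python
import numpy as np

# Hypothetical toy data with values between 0 and 5,000
data = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: (x - min) / (max - min) maps the smallest value to 0
# and the largest value to 1
scaled = (data - data.min()) / (data.max() - data.min())

print(scaled)  # the four values become 0, 0.25, 0.5 and 1
```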

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization also rescales the data, but instead of mapping the values to the range 0 to 1, it centers them in a small interval around zero. For example, if a dataset has values between 0 and 5,000, after mean normalization the values will lie in a small range around 0, for example between -3 and 3. Because the mean of each column is subtracted from that column, the average (mean) of all the elements is guaranteed to be zero, up to floating-point rounding. Therefore, when you perform *mean normalization*, your data will not only be scaled but will also have an average of zero.
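To make the idea concrete, the following sketch mean normalizes a single small column using the subtract-the-mean, divide-by-the-standard-deviation rule that this lab introduces later. The array `col` is a made-up example. The result has a mean of zero because the mean was subtracted, regardless of how the original values were spread out.

```python
import numpy as np

# Hypothetical toy column with values between 0 and 5,000
col = np.array([0.0, 1000.0, 2000.0, 5000.0])

# Subtract the mean, then divide by the standard deviation
col_norm = (col - col.mean()) / col.std()

print("Normalized values:", col_norm)
print("Mean after normalization:", col_norm.mean())
print("Standard deviation after normalization:", col_norm.std())
```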

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below.

In [1]:
import numpy as np

# The upper bound of randint is exclusive, so 5001 makes 5,000 a possible value
X = np.random.randint(0, 5001, size=(1000, 20))
print(np.empty((1000, 20)).shape if False else (1000, 20))

(1000, 20)


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [2]:
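import numpy as np

# Recreate the dataset so that this cell runs on its own
X = np.random.randint(0, 5001, size=(1000, 20))

# axis=0 computes one statistic per column
ave_cols = np.mean(X, axis=0)
std_cols = np.std(X, axis=0)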

print("Average:")
print(ave_cols)

print("Standard deviation:")
print(std_cols)


Average:
[2505.757 2548.357 2541.611 2494.835 2513.747 2490.874 2501.266 2507.763
 2437.261 2488.601 2395.173 2492.829 2492.835 2510.373 2516.088 2554.33
 2475.137 2485.368 2558.321 2508.813]
Standard deviation:
[1400.51143371 1418.47783823 1455.05424768 1465.070188   1437.80516587
 1407.30742275 1450.04965889 1464.39190753 1401.23806788 1458.13077527
 1412.92381857 1430.97334069 1411.28936571 1446.2511939  1457.87007455
 1432.10571995 1453.22482508 1444.16783463 1450.80345945 1422.94008167]
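The `axis=0` argument is what makes these per-column statistics. As a small sketch on a made-up 2 x 3 array `M` (not part of the lab's data), the difference between the two axes looks like this:

```python
import numpy as np

# Hypothetical toy array with 2 rows and 3 columns
M = np.array([[1, 2, 3],
              [4, 5, 6]])

# axis=0 averages down the rows, giving one value per column: shape (3,)
col_means = np.mean(M, axis=0)

# axis=1 averages across the columns, giving one value per row: shape (2,)
row_means = np.mean(M, axis=1)

print("Column means:", col_means, col_means.shape)
print("Row means:", row_means, row_means.shape)
```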


If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [3]:
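import numpy as np

# Recreate the dataset so that this cell runs on its own
X = np.random.randint(0, 5001, size=(1000, 20))
ave_cols = np.mean(X, axis=0)
std_cols = np.std(X, axis=0)

# Label each shape so the output shows which vector it belongs to
print("Shape of ave_cols:", ave_cols.shape)
print("Shape of std_cols:", std_cols.shape)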

Shape of ave_cols: (20,)
Shape of std_cols: (20,)


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below.

In [4]:
import numpy as np

# Recreate the dataset and its column statistics so that this cell runs on its own
X = np.random.randint(0, 5001, size=(1000, 20))
ave_cols = np.mean(X, axis=0)
std_cols = np.std(X, axis=0)

# Broadcasting applies the (20,) statistics to every row of the (1000, 20) array
X_norm = (X - ave_cols) / std_cols

print("Mean normalized X:")
print(X_norm)


The printed values change on every run because $X$ is random. With integers drawn uniformly between 0 and 5,000, every entry of `X_norm` should lie roughly between -1.7 and 1.7.
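Broadcasting is what lets a vector with one entry per column combine with every row of a 2-D array. The sketch below uses a made-up 3 x 2 array `A` to show that subtracting a shape `(2,)` vector from a shape `(3, 2)` array gives the same result as an explicit loop over the rows.

```python
import numpy as np

# Hypothetical toy array with 3 rows and 2 columns
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
col_means = A.mean(axis=0)  # shape (2,), one mean per column

# Broadcasting stretches the (2,) vector across all 3 rows
centered = A - col_means

# The same computation written as an explicit loop over the rows
centered_loop = np.empty_like(A)
for i in range(A.shape[0]):
    centered_loop[i] = A[i] - col_means

print("Shapes:", A.shape, col_means.shape, centered.shape)
print("Same result as the loop:", np.array_equal(centered, centered_loop))
```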


If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the elements should be spread over some small interval around zero. You can verify this by filling in the code below:

In [5]:
import numpy as np

# Recreate the normalized dataset so that this cell runs on its own
X = np.random.randint(0, 5001, size=(1000, 20))
X_norm = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

avg_all_values = np.mean(X_norm)
avg_min_column = np.mean(np.min(X_norm, axis=0))
avg_max_column = np.mean(np.max(X_norm, axis=0))

print("Average of all values in X_norm:", avg_all_values)
print("Average of the minimum value in each column of X_norm:", avg_min_column)
print("Average of the maximum value in each column of X_norm:", avg_max_column)


The average of all values should be zero up to floating-point rounding, and the average column minimum and maximum should be close to -1.7 and 1.7 respectively.


You should note that since $X$ was created using random integers, the above values will vary.
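If you want results that repeat from run to run, one option is a seeded random generator. The sketch below is an assumption about how you might do that, not part of the lab: it uses NumPy's `default_rng` with an arbitrary seed of 0 and the hypothetical names `X_demo` and `X_demo_norm`, then checks each column individually rather than the array as a whole.

```python
import numpy as np

# A seeded generator produces the same "random" data on every run
rng = np.random.default_rng(seed=0)  # the seed value 0 is arbitrary

# rng.integers excludes the upper bound, so 5001 makes 5,000 possible
X_demo = rng.integers(0, 5001, size=(1000, 20))

# Mean normalize each column
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Every column should have mean 0 and standard deviation 1 (up to rounding)
print("All column means near 0:", np.allclose(X_demo_norm.mean(axis=0), 0.0))
print("All column stds near 1:", np.allclose(X_demo_norm.std(axis=0), 1.0))
```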

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split the dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data.

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [6]:
import numpy as np


random_permutation = np.random.permutation(5)
print("Random permutation:", random_permutation)


Random permutation: [1 2 4 0 3]
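A permutation like this can be used directly as an index to reorder the rows of an array. The following sketch uses a made-up 5 x 2 array `data` to show that every row is picked exactly once, just in a different order.

```python
import numpy as np

# Hypothetical toy array with 5 rows: [0 1], [2 3], [4 5], [6 7], [8 9]
data = np.arange(10).reshape(5, 2)

# One random ordering of the row indices 0..4
perm = np.random.permutation(data.shape[0])

# Indexing with the permutation returns the rows in the shuffled order
shuffled = data[perm]

print("Permutation:", perm)
print("Shuffled rows:")
print(shuffled)

# Sorting the permutation recovers 0..4, so no index is repeated or missing
print("Each row used once:", np.array_equal(np.sort(perm), np.arange(5)))
```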


# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that the `shape` attribute returns a tuple with two numbers in the form `(rows, columns)`.

In [7]:
import numpy as np

# Recreate the normalized dataset so that this cell runs on its own
X = np.random.randint(0, 5001, size=(1000, 20))
X_norm = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

num_rows = X_norm.shape[0]

row_indices = np.random.permutation(num_rows)

print("Random permutation of row indices:", row_indices)


Random permutation of row indices: [601 165 434 381 571 272 312 908 361 512 787 835 873 376 323 235 294 322
 748 158 427 105 867 849 829 838 501 978 184 388  56 813 160 520 745 249
 188  36 815 827 525 387 794 620 930 557 789 800 705 562 992 693 540   3
  10 651 145 382 196 439 457 449 569 761  68 535  80 321 623 746 850 567
 647 264 593 668 931   6 499 928 963 547  25 563 451 230 712 480   1 528
 233 644 422 202 462 958 879 269 482 755 869 832 467 103 503 952 891 174
 327 899 556 220  73 377 589 155 310 344 630 232 853 676 190 315 887 237
 715 456 784 717 660 634 774 244 492 641 642  86 248 840 546 401 565 673
 286 130  88 366 868 979 126 719   5 304 263  27 886 211 391 795 995  44
 845 575 967 983 594 801  22 400 776 406 250 841 132 714 412 428 282 790
 179 955 739 696 821 836 669 256 515 917 334 903 350 798  38  67 409 576
 890 806 767 919 961 545   8 209 278 708  72 729 496 925 276 595 922 758
 635 341 372 440 884 659 804 208 345 135 986 365 732 463 893 258 197 519
 926 950 302 356

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below.

In [8]:
import numpy as np


# Recreate the normalized dataset so that this cell runs on its own
X = np.random.randint(0, 5001, size=(1000, 20))
X_norm = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Shuffle the row indices so that each row lands in exactly one set
num_rows = X_norm.shape[0]
row_indices = np.random.permutation(num_rows)

# 60% training and 20% cross validation; the remaining 20% is the test set
train_ratio = 0.6
cross_val_ratio = 0.2

num_train = int(train_ratio * num_rows)
num_cross_val = int(cross_val_ratio * num_rows)

X_train = X_norm[row_indices[:num_train]]
X_crossVal = X_norm[row_indices[num_train:num_train + num_cross_val]]
X_test = X_norm[row_indices[num_train + num_cross_val:]]

print("Shape of X_train:", X_train.shape)
print("Shape of X_crossVal:", X_crossVal.shape)
print("Shape of X_test:", X_test.shape)


Shape of X_train: (600, 20)
Shape of X_crossVal: (200, 20)
Shape of X_test: (200, 20)


If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [9]:
import numpy as np

# Rebuild the normalized data and the 60/20/20 split so that this cell runs on its own
X = np.random.randint(0, 5001, size=(1000, 20))
X_norm = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
row_indices = np.random.permutation(X_norm.shape[0])

X_train = X_norm[row_indices[:600]]
X_crossVal = X_norm[row_indices[600:800]]
X_test = X_norm[row_indices[800:]]

print("Shape of X_train:", X_train.shape)
print("Shape of X_crossVal:", X_crossVal.shape)
print("Shape of X_test:", X_test.shape)


Shape of X_train: (600, 20)
Shape of X_crossVal: (200, 20)
Shape of X_test: (200, 20)
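Beyond checking shapes, it can be worth confirming that the three sets really partition the data, with no row missing or used twice. The sketch below is one possible way to do this for the 60/20/20 split described above. It uses `np.split` to cut a shuffled index array at two boundaries, and the stand-in array `data` and the names `train_idx`, `val_idx`, and `test_idx` are hypothetical.

```python
import numpy as np

n_rows = 1000
# Stand-in for a normalized dataset: 1000 rows, 20 columns
data = np.arange(n_rows * 20).reshape(n_rows, 20)

# Shuffle the row indices, then cut them at the 60% and 80% marks
indices = np.random.permutation(n_rows)
train_end = int(0.6 * n_rows)
val_end = train_end + int(0.2 * n_rows)
train_idx, val_idx, test_idx = np.split(indices, [train_end, val_end])

print("Set sizes:", data[train_idx].shape, data[val_idx].shape, data[test_idx].shape)

# Together the three index sets should contain every row index exactly once
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print("Valid partition:", np.array_equal(np.sort(all_idx), np.arange(n_rows)))
```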
