# Exercise Background

This short, application-based coding exercise is meant to introduce you to the NumPy library and give you a taste of the tasks you may need to perform in machine learning.

Machine learning usually involves working with large data sets. This notebook walks you through normalising a data set and then dividing it into smaller subsets. While attempting each task, consult the NumPy documentation to find the most appropriate function for the job; more often than not, the function you need already exists in the library. The **NumPy documentation** can be found [here](https://numpy.org/doc/stable/).

The first task is to mean normalise a data set. Mean normalisation is a transformation that puts every column of the data on a common scale. For example, consider a data set of integers between 0 and 10000. Values spread over such a wide range can make it difficult to build ML algorithms on the data. After the transformation, each column has a mean of zero and a standard deviation of 1. Although the actual values change considerably, the relative variation within the data is preserved. If the concept of normalisation feels a bit unclear, don't worry: it will be covered in later sections of this program. For now, let’s concentrate on the tasks at hand.


# Task 1: Mean Normalisation

**Question 1.1** Create a 2D array of random integers between 0 and 10,000 (inclusive) with 25000 rows and 15 columns. This will be the dataset you use throughout the notebook.

In [5]:
10000/15

666.6666666666666

In [6]:
help(np.random.random_integers)

Help on built-in function random_integers:

random_integers(...) method of numpy.random.mtrand.RandomState instance
    random_integers(low, high=None, size=None)
    
    Random integers of type `np.int_` between `low` and `high`, inclusive.
    
    Return random integers of type `np.int_` from the "discrete uniform"
    distribution in the closed interval [`low`, `high`].  If `high` is
    None (the default), then results are from [1, `low`]. The `np.int_`
    type translates to the C long integer type and its precision
    is platform dependent.
    
    This function has been deprecated. Use randint instead.
    
    .. deprecated:: 1.11.0
    
    Parameters
    ----------
    low : int
        Lowest (signed) integer to be drawn from the distribution (unless
        ``high=None``, in which case this parameter is the *highest* such
        integer).
    high : int, optional
        If provided, the largest (signed) integer to be drawn from the
        distribution (see above for 

In [17]:

# import NumPy into Python
import numpy as np

# Create a 25000 x 15 ndarray with random integers in the interval [0, 10000].
# np.random.random_integers is deprecated; np.random.randint excludes `high`,
# so pass 10001 to include 10000.
x = np.random.randint(low=0, high=10001, size=(25000, 15))

# print the shape of x
x.shape


(25000, 15)
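
Because the help text above marks `random_integers` as deprecated, another option is NumPy's newer `Generator` interface. The sketch below is not part of the original exercise; the names `rng` and `x_demo` and the seed value are illustrative assumptions.

```python
import numpy as np

# Seeded generator so the example is reproducible; the seed 0 is arbitrary.
rng = np.random.default_rng(0)

# endpoint=True makes the upper bound inclusive, so values lie in [0, 10000].
x_demo = rng.integers(low=0, high=10000, size=(25000, 15), endpoint=True)

print(x_demo.shape)                               # (25000, 15)
print(x_demo.min() >= 0, x_demo.max() <= 10000)   # True True
```

Passing a seed is optional; it only makes repeated runs produce the same array.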

In [19]:
x

array([[8568, 8156, 7320, ..., 8925,  756, 1458],
       [ 768, 3006, 1068, ..., 1453,  231,  500],
       [2510, 1713, 5529, ..., 9796, 2163, 3081],
       ...,
       [5550, 6063, 2320, ..., 4707, 6891, 1502],
       [4380,  844, 4468, ..., 6111, 2468, 3675],
       [7440, 1464, 4217, ...,  785, 1072,  212]])

In [20]:
# print the first row of X
x[0, :]

array([8568, 8156, 7320, 5141,  567, 1062, 1825, 1773, 2646, 3857, 2477,
        719, 8925,  756, 1458])

Now that you have created the array, we will mean normalise it. The equation for normalising the data is given below:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. Put simply, to find the new value of each element, subtract the mean of its column from that value and divide the result by the standard deviation of that column. Why are these operations done column-wise? In ML datasets, each column usually represents one feature and each row one sample, so most procedures operate on columns. It is therefore beneficial to develop the habit of thinking about data column-wise.
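
To make the formula concrete, here is a minimal sketch on a tiny, made-up array (the name `toy` and its values are assumptions for illustration only). Each column is normalised with its own mean and standard deviation.

```python
import numpy as np

# Hypothetical toy data: 3 rows (samples) and 2 columns (features).
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

mu = toy.mean(axis=0)      # column means: [2., 20.]
sigma = toy.std(axis=0)    # column standard deviations (population form, ddof=0)

toy_norm = (toy - mu) / sigma

print(toy_norm.mean(axis=0))   # each column mean is (approximately) 0
print(toy_norm.std(axis=0))    # each column standard deviation is (approximately) 1
```

Although the two columns differ in scale by a factor of ten, they end up with identical normalised values, which is the point of the transformation.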

**Question 1.2** Find the mean and the standard deviation of each of the columns in the dataset. The result will be two 1D arrays with 15 elements each, representing the mean and standard deviation for each of the columns in the dataset.  

In [33]:

# Average of the values in each column of X
ave_cols = np.average(x, axis=0)

# Print ave_cols
print(ave_cols)

# Standard deviation of the values in each column of X
std_cols = np.std(x, axis=0)

# Print std_cols
print(std_cols)



[4993.36724 5021.52808 5011.8084  4988.22872 4991.45392 5001.738
 4993.825   5001.13148 5002.64684 4997.70424 4992.01484 4992.34448
 5014.85396 5013.34464 4977.57408]
[2890.27591031 2886.52501128 2889.83500685 2879.32132191 2884.60041702
 2887.04516314 2880.92550098 2881.44853103 2893.52762969 2884.04337766
 2887.13601496 2882.45392965 2881.07901546 2889.69070118 2888.02465196]


**Question 1.3** Print the shape of both arrays. Each should have 15 elements.

In [34]:
# Print the shape of ave_cols
print(ave_cols.shape)
# Print the shape of std_cols
print(std_cols.shape)

(15,)
(15,)

**Question 1.4** Now that you have mean and standard deviation calculated, it is time to apply the transformation to the dataset. 
 
**HINT** NumPy's broadcasting can make this a lot easier. You can read about it [here](https://numpy.org/doc/stable/user/basics.broadcasting.html).
All you have to do is create one row of transformation values (the column means and standard deviations); NumPy then applies that row to every row of the dataset.
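
As a rough illustration of what broadcasting does here, the sketch below subtracts a row of column means from a small stand-in matrix and compares the result with explicitly repeating that row. The names `data`, `col_means`, `centered`, and `centered_manual` are hypothetical.

```python
import numpy as np

# Hypothetical small matrix standing in for the dataset: 4 rows, 3 columns.
data = np.arange(12, dtype=float).reshape(4, 3)
col_means = data.mean(axis=0)            # shape (3,)

# Broadcasting: the (3,) array is treated as a (1, 3) row and reused for every row.
centered = data - col_means

# The same result built by hand, repeating the row of means 4 times.
centered_manual = data - np.tile(col_means, (4, 1))

print(centered.shape)                              # (4, 3)
print(np.array_equal(centered, centered_manual))   # True
```

Broadcasting avoids materialising the repeated rows, which matters more as the dataset grows.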

In [35]:
ave_cols

array([4993.36724, 5021.52808, 5011.8084 , 4988.22872, 4991.45392,
       5001.738  , 4993.825  , 5001.13148, 5002.64684, 4997.70424,
       4992.01484, 4992.34448, 5014.85396, 5013.34464, 4977.57408])

In [36]:
# Mean normalize X
x_norm = (x-ave_cols)/std_cols

x_norm

array([[ 1.23677907,  1.08589806,  0.79872781, ...,  1.35718112,
        -1.47328731, -1.21867868],
       [-1.46192522, -0.69825415, -1.3647175 , ..., -1.23629166,
        -1.65496765, -1.5503933 ],
       [-0.85921459, -1.14619761,  0.17896925, ...,  1.65949841,
        -0.98638399, -0.65670287],
       ...,
       [ 0.1925881 ,  0.36080474, -0.93147477, ..., -0.1068537 ,
         0.64977728, -1.20344336],
       [-0.21221754, -1.44725165, -0.18817974, ...,  0.38046372,
        -0.88083636, -0.45102596],
       [ 0.84650491, -1.23246051, -0.27503591, ..., -1.46814924,
        -1.36393305, -1.65011544]])

**Question 1.5** If the transformation has been performed correctly, the mean of the elements in each column will be approximately 0. Also, the average of the **minimum** values of the columns of X_norm and the average of the **maximum** values of the columns of X_norm will have almost the same magnitude with opposite signs. Let’s confirm that the transformation has been applied correctly.

In [44]:
# Print the average of all the values of X_norm

print(np.average(x_norm))
# Print the average of the minimum value in each column of X_norm
print(np.average( x_norm.min(axis = 0) ) )
# Print the average of the maximum value in each column of X_norm
print(np.average( x_norm.max(axis = 0) ) )

-1.1615005253891771e-17
-1.7324719815643366
1.732817544359596


In [69]:
print(np.mean(x_norm))

-1.1615005253891771e-17


In [46]:
 x_norm.max(axis = 0) 

array([1.7322335 , 1.72472849, 1.7261164 , 1.74060854, 1.73630498,
       1.73127254, 1.73769679, 1.73484567, 1.72707981, 1.73447314,
       1.73458581, 1.73728901, 1.73030521, 1.72567097, 1.7390523 ])

Be mindful that the exact values might not match since the dataset was initialized using the random function. 
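
The extremes near ±1.73 seen above are what one would expect for uniformly distributed integers. As a sanity check (assuming the data is drawn uniformly from the integers 0 to 10000), the standard deviation of a discrete uniform distribution over $n$ consecutive integers is $\sqrt{(n^2 - 1)/12}$, so the normalised extremes can be estimated without any sampling. The variable names below are illustrative.

```python
import numpy as np

n = 10001                                   # number of distinct integers in [0, 10000]
mean_theory = (0 + 10000) / 2               # 5000.0
std_theory = np.sqrt((n**2 - 1) / 12)       # std of a discrete uniform distribution

z_min = (0 - mean_theory) / std_theory      # normalised value of the smallest integer
z_max = (10000 - mean_theory) / std_theory  # normalised value of the largest integer

print(std_theory)      # approximately 2887.04
print(z_min, z_max)    # approximately -1.7319 and 1.7319
```

The sample values differ slightly from these figures because each column's mean and standard deviation are estimated from 25000 random draws.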

# Data Splitting

After data processing, it is common practice in ML to split the dataset into three subsets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The ratio in which the data is split varies from case to case, but a common choice is 6:2:2 for the training, cross validation, and test sets respectively. That is, 60% of the data is used for training, and so on. Why is the data split, and what is the significance of these smaller data sets? These questions are better left unanswered for now.
The task assigned to you is to randomly split the data in the given proportions.
For instance, if the data set had ten elements, this is how you would do it.

In [None]:
# We create a random permutation of integers 0 to 9
np.random.permutation(10)

array([8, 3, 7, 5, 2, 6, 1, 9, 0, 4])

1. Training Set = 8, 3, 7, 5, 2, 6
2. Cross Validation Set = 1, 9
3. Test Set = 0, 4
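
The ten-element split above can be sketched with plain slicing. This is only an illustration: the seed and the names `order`, `train_idx`, `val_idx`, and `test_idx` are assumptions, and the actual permutation will differ from the one shown above.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed, only for reproducibility
order = rng.permutation(10)        # shuffled indices 0..9

train_idx = order[:6]     # first 60%
val_idx = order[6:8]      # next 20%
test_idx = order[8:]      # last 20%

print(train_idx, val_idx, test_idx)
```

Every index lands in exactly one of the three subsets, so no element is shared between them.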

**Question 2.1** Similarly, create a 1D array representing the indexes of the rows in the dataset X_norm. You can use the `np.random.permutation()` function to randomise the indexes.

In [51]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`

# Passing an integer n to np.random.permutation shuffles the range 0..n-1
row_indices = np.random.permutation(x_norm.shape[0])

In [52]:
# Print the shape of row_indices
row_indices.shape

(25000,)

**Question 2.2** Split the row indexes in the needed proportions. You can use the slicing methods you have learnt in this session to make the job easier.  

In [64]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.
train = row_indices[:15000]
test = row_indices[15000:20000]
val  = row_indices[20000:  ]

In [65]:
val

array([15447,   659, 23439, ..., 12782,  7019, 15500])

**Question 2.3** Now use the indexes you created to split the data in the same way. Once the data is split, print the shape of each of the smaller data sets. `X_train` should have 15000 rows and 15 columns. `X_test` should have 5000 rows and 15 columns. `X_val` should have 5000 rows and 15 columns.

In [66]:
# Create a Training Set
x_train = x_norm[train]

# Create a Cross Validation Set
x_val = x_norm[val]

# Create a Test Set
x_test = x_norm[test]

In [67]:
x_val

array([[ 0.98351605, -0.45297653, -1.10138066, ...,  0.32944117,
         0.08224249, -1.38176593],
       [-0.16862308,  0.83888825, -0.873686  , ..., -1.6875115 ,
        -1.5726059 , -0.17367375],
       [ 0.71329964,  1.59238943,  0.96240498, ..., -1.72499745,
        -0.11328017, -0.82256018],
       ...,
       [-0.91526461, -1.08453177, -1.68895746, ...,  0.16075437,
        -1.68645891,  0.55104305],
       [ 0.787687  , -1.67659316, -1.16436004, ..., -0.14399257,
        -0.49359768,  0.12029881],
       [ 0.69876815,  0.04901115, -1.22111068, ..., -0.16655356,
        -0.82615923,  1.44230968]])

In [70]:
# Print the shape of X_train
print(x_train.shape)

# Print the shape of X_crossVal
print(x_val.shape)

# Print the shape of X_test
print(x_test.shape)

(15000, 15)
(5000, 15)
(5000, 15)
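
The split above hard-codes the boundaries 15000 and 20000. One possible generalisation, sketched here under the assumption of a 6:2:2 split, derives the boundaries from the number of rows and lets `np.split` do the slicing. The array `data` is a random stand-in for `x_norm`, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
data = rng.standard_normal((25000, 15))         # stand-in for x_norm

n_rows = data.shape[0]
n_train = n_rows * 6 // 10                      # 60% of the rows
n_val = n_rows * 2 // 10                        # 20% of the rows

shuffled = rng.permutation(n_rows)
# np.split cuts at the given positions: [0:n_train], [n_train:n_train+n_val], and the rest
train_idx, val_idx, test_idx = np.split(shuffled, [n_train, n_train + n_val])

data_train, data_val, data_test = data[train_idx], data[val_idx], data[test_idx]
print(data_train.shape, data_val.shape, data_test.shape)   # (15000, 15) (5000, 15) (5000, 15)
```

Computing the boundaries from `data.shape[0]` keeps the split correct if the number of rows changes.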

In [72]:
a = np.array([1,2,3,4,5,6,7,8])
a.reshape(2,2,2)

array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])

In [75]:
a.reshape(1,1,-1)

array([[[1, 2, 3, 4, 5, 6, 7, 8]]])