# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [1]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import numpy as np

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

### **Seed a Random Number Generator (RNG)**

<span style="color:black">Many machine learning models generate random numbers and update them iteratively during training. Many programming languages, including Python, use pseudo-random number generators, which produce "almost random" numbers that are fully determined by a starting value called the seed. Setting the seed makes results reproducible.</span>

<span style="color:black">Set the random seed to zero and generate arrays of random integers.</span>

<span style="color:black">See the NumPy documentation for more information on [`seed()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html) and [`randint()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html), or view a docstring directly with `help(np.random.seed)`.</span>

In [2]:
n, p = 10, 4  # observations x features

np.random.seed(0)                      # we seed the random number generator (RNG)
Y = np.random.randint(2, size=n)       # 1D array of n outputs from values {0,1}
X = np.random.randint(10, size=(n, p)) # 2D array data matrix with n observations from values {0,1,2,3,4,5,6,7,8,9}

print('Y = ', Y)
print('X = \n', X)

Y =  [0 1 1 0 1 1 1 1 1 1]
X = 
 [[5 2 4 7]
 [6 8 8 1]
 [6 7 7 8]
 [1 5 9 8]
 [9 4 3 0]
 [3 5 0 2]
 [3 8 1 3]
 [3 3 7 0]
 [1 9 9 0]
 [4 7 3 2]]
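
To see why seeding matters, the short sketch below draws the same labels twice with the same seed, then once more without reseeding. The variable names `first`, `second`, and `third` are illustrative and do not appear in the original notebook.

```python
import numpy as np

np.random.seed(0)                        # seed the global RNG
first = np.random.randint(2, size=10)

np.random.seed(0)                        # reseed: the generator returns to the same state
second = np.random.randint(2, size=10)

third = np.random.randint(2, size=10)    # no reseed: the stream simply continues

print(np.array_equal(first, second))     # True: same seed, same sequence
print('first =', first)
print('third =', third)                  # continues the stream, so it need not match first
```

Reseeding before each draw is what makes the two arrays identical. Without it, each call consumes the next values in the stream.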


### **NumPy library and `RandomState()` object**

Often we'd like to generate random values from a specific distribution, such as uniform or Gaussian. These values can be integers, floats, or any other objects drawn from a specified collection. Many established packages offer random number generation (RNG), and they are often built on the NumPy library. One way to generate such values is to create a [`numpy.random.RandomState(...)`](https://numpy.org/doc/stable/reference/random/legacy.html#numpy.random.RandomState) object, which can take an integer to seed the RNG. Seeding ensures that the same seed always produces exactly the same sequence of random numbers. This makes random results reproducible, which helps when troubleshooting and debugging code.
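
As a minimal sketch of the idea above, the following draws floats from a uniform and a Gaussian distribution and samples objects from a small collection. The variable names and the color list are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)                      # seeded generator object

u = rng.uniform(low=0.0, high=1.0, size=5)          # five floats, uniform on [0, 1)
g = rng.normal(loc=0.0, scale=1.0, size=(2, 3))     # 2x3 array of Gaussian (normal) floats
c = rng.choice(['red', 'green', 'blue'], size=4)    # four objects drawn from a collection

print('uniform:', u)
print('normal:\n', g)
print('choice:', c)
```

Rerunning the cell reproduces the same values, because the generator is recreated with the same seed each time.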

The `RandomState()` object offers several methods that return a sequence of random values of a desired size. For example, [`.randint(low=2, high=10, size=5)`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html) returns a 1D NumPy array of five random integers from 2 to 9 (that is, up to 10-1, because `high` is exclusive). If `size` is a tuple of two numbers, such as `size=(3,4)`, then a 2D NumPy array is returned, which resembles a 3 by 4 matrix (i.e., 3 rows and 4 columns).
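
The bounds and shapes described above can be checked with a small sketch. The names `v`, `M`, and `big` are illustrative only.

```python
import numpy as np

rng = np.random.RandomState(0)

v = rng.randint(low=2, high=10, size=5)          # 1D array of five integers in {2, ..., 9}
M = rng.randint(low=2, high=10, size=(3, 4))     # 2D array with 3 rows and 4 columns
big = rng.randint(low=2, high=10, size=10_000)   # many draws, to inspect the bounds

print(v.shape, M.shape)                          # (5,) (3, 4)
print(big.min(), big.max())                      # low is inclusive, high is exclusive
```

When only one positional argument is given, as in `rng.randint(10, size=5)`, it is treated as `high` and `low` defaults to 0.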

Let's see this in the following example, where we use a seeded RNG to generate a 1D array `Y` of `n` random target values and a `10x4` 2D array of observations, `X`. Each of the ten observations corresponds to the target value with the same index. The data matrix `X` has four features (columns). Later, these arrays are split into training and validation sets.

In [3]:
n, p = 10, 4                     # n observations and p features
rng = np.random.RandomState(0)   # seeded random state object whose methods sample from distributions
Y, X = rng.randint(2, size=n), rng.randint(10, size=(n, p))

print('Y = ', Y)
print('X = \n', X)

Y =  [0 1 1 0 1 1 1 1 1 1]
X = 
 [[5 2 4 7]
 [6 8 8 1]
 [6 7 7 8]
 [1 5 9 8]
 [9 4 3 0]
 [3 5 0 2]
 [3 8 1 3]
 [3 3 7 0]
 [1 9 9 0]
 [4 7 3 2]]
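
One practical reason to prefer a `RandomState` object is isolation: drawing from it does not advance NumPy's global generator. The sketch below illustrates this under that assumption. The seed 123 and the variable names are arbitrary choices for the example.

```python
import numpy as np

np.random.seed(0)                          # seed the global RNG
local_rng = np.random.RandomState(123)     # independent generator with its own state
_ = local_rng.randint(10, size=1000)       # heavy use of the local generator
Y_after = np.random.randint(2, size=10)    # global draw made after the local draws

np.random.seed(0)                          # reference: global draw with no local activity
Y_ref = np.random.randint(2, size=10)

print(np.array_equal(Y_after, Y_ref))      # True: the local draws left the global state untouched
```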


### **Training and Validation Data Sets**

<span style="color:black">You can use scikit-learn's `train_test_split()` function to split observations (i.e., the rows of matrix $X$) into training and validation subsets. By default, the function shuffles the rows before splitting the data, and specifying the `random_state` parameter makes that shuffle reproducible. Notice that the row labels, which are stored in array $Y$, are split in the same way, so every row of the training matrix `tX` stays associated with its training label in `tY`. The same goes for the validation rows and labels, `vX` and `vY`.</span>

<span style="color:black">Note that `random_state` seeds only the shuffling. The values in `X` and `Y` were already drawn from the seeded generator in the previous code block.</span>

In [4]:
from sklearn.model_selection import train_test_split

tX, vX, tY, vY = train_test_split(X, Y, test_size=0.2, random_state=0) 

print(f'tX.shape = {tX.shape}\n', tX)  # training inputs
print(f'vX.shape = {vX.shape}\n', vX)  # validation  inputs
print(f'tY.shape = {tY.shape}\n', tY)  # training outputs
print(f'vY.shape = {vY.shape}\n', vY)  # validation  outputs

tX.shape = (8, 4)
 [[9 4 3 0]
 [4 7 3 2]
 [6 8 8 1]
 [3 8 1 3]
 [3 3 7 0]
 [1 5 9 8]
 [5 2 4 7]
 [3 5 0 2]]
vX.shape = (2, 4)
 [[6 7 7 8]
 [1 9 9 0]]
tY.shape = (8,)
 [1 1 1 1 1 0 0 1]
vY.shape = (2,)
 [1 1]
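
To confirm that rows and labels stay aligned, one option is to split an array of row indices alongside `X` and `Y`. This is a sketch rather than part of the original notebook: `idx`, `t_idx`, and `v_idx` are hypothetical names, and it relies on `train_test_split()` accepting any number of arrays of equal length.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n, p = 10, 4
rng = np.random.RandomState(0)
Y, X = rng.randint(2, size=n), rng.randint(10, size=(n, p))
idx = np.arange(n)                                   # original row positions

# All three arrays are shuffled and split with the same permutation
tX, vX, tY, vY, t_idx, v_idx = train_test_split(
    X, Y, idx, test_size=0.2, random_state=0)

print(np.array_equal(X[t_idx], tX))                  # training rows match their source rows
print(np.array_equal(Y[t_idx], tY))                  # training labels match the same rows
print('validation rows came from positions', sorted(v_idx.tolist()))

# The same random_state reproduces the same split
tX_again = train_test_split(X, Y, idx, test_size=0.2, random_state=0)[0]
print(np.array_equal(tX, tX_again))
```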


<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now, equipped with these concepts and tools, you will tackle a few related tasks.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell to see whether you get the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 1

In the following example you will practice seeding an RNG for each matrix in a sequence (for reproducibility). The goal is to find a matrix with the largest sum of its elements.

Create a loop for $k$ in 0..10000. In each iteration:
1. seed the random number generator with $k$
1. sample a matrix $X_{9\times 5}$ with integer values $X_{ij}$ from the set $\{0,\pm1,\pm2,\pm3,\pm4,\pm5\}$
1. compute the sum of the matrix values, $S_k:=\sum_{i,j}X_{ij}(k)$, where $X(k)$ denotes the matrix $X$ sampled with seed $k$

Find the $k$ that yields the highest $S_k$.

Note: This $k$ is called a **maximizer** of $S$, and may not be unique. This is a small exercise with a random number generator and sampling of random values from a discrete distribution, as above.

<b>Hint:</b> Check out <code>low</code> and <code>high</code> parameters of <code>np.random.randint()</code>.


In [5]:
# returns a 9x5 matrix (i.e. numpy 2D array) with random integers in range from -5 to 5:
GetX = lambda i: np.random.RandomState(i).randint(low=-5, high=6, size=(9, 5)) 
ArraySums = [GetX(i).sum() for i in range(10001)] # list of sums of elements of 10K matrices
i = np.argmax(ArraySums)  # returns the maximizer, i.e. the index that results in largest sum of 10k matrices 
X = GetX(i)               # returns the matrix for the given maximizer
print(f'i={i};\t', f'X.sum()={X.sum()}','\n', X)

i=1097;	 X.sum()=83 
 [[ 4  4 -4 -2  2]
 [ 2  5  1  5  4]
 [-3  5  1  2  5]
 [ 4 -2  0  3  5]
 [-3  4  5  3  0]
 [ 4  5 -4  4  4]
 [ 5  3  5 -2  4]
 [ 4 -1 -4  5 -1]
 [ 5 -2  3 -2 -2]]
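
Because a maximizer may not be unique, it can help to know that `np.argmax()` returns only the first index at which the maximum occurs. The sketch below lists every maximizer over a smaller range of seeds (1,000 instead of 10,001, to keep it quick). The helper `get_sum` and the names `sums`, `first_max`, and `all_max` are hypothetical.

```python
import numpy as np

def get_sum(seed):
    """Sum of a 9x5 matrix of integers in [-5, 5] sampled with the given seed."""
    return np.random.RandomState(seed).randint(low=-5, high=6, size=(9, 5)).sum()

sums = np.array([get_sum(k) for k in range(1000)])   # S_k for k = 0..999

first_max = np.argmax(sums)                          # first seed that attains the maximum
all_max = np.flatnonzero(sums == sums.max())         # every seed that attains the maximum

print('first maximizer:', first_max)
print('all maximizers: ', all_max)
print('maximum sum:    ', sums.max())
```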



<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
# returns a 9x5 matrix (i.e. numpy 2D array) with random integers in range from -5 to 5:
GetX = lambda i: np.random.RandomState(i).randint(low=-5, high=6, size=(9, 5)) 
ArraySums = [GetX(i).sum() for i in range(10001)] # list of sums of elements of 10K matrices
i = np.argmax(ArraySums)  # returns the maximizer, i.e. the index that results in largest sum of 10k matrices 
X = GetX(i)               # returns the matrix for the given maximizer
print(f'i={i};\t', f'X.sum()={X.sum()}','\n', X)
</pre>
</details> 
</font>
<hr>

## Task 2

Now split $X$ into a training set of six observations with the rest allocated to the validation set. See [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) documentation. 


In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
tX, vX = train_test_split(X, train_size=6, random_state=0) 
print(f'tX.shape = {tX.shape}\n', tX)  # training inputs
print(f'vX.shape = {vX.shape}\n', vX)  # validation  inputs
</pre>
</details> 
</font>
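
As a side note on the `train_size` parameter, an integer is read as an absolute number of rows rather than a fraction. The sketch below uses a stand-in array `A` (hypothetical, not the matrix from Task 1) with nine rows to show that `train_size=6` and `test_size=3` produce subsets of the same shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

A = np.arange(9 * 5).reshape(9, 5)                           # stand-in 9x5 matrix

tA, vA = train_test_split(A, train_size=6, random_state=0)   # six rows for training
tB, vB = train_test_split(A, test_size=3, random_state=0)    # three rows for validation

print(tA.shape, vA.shape)                                    # (6, 5) (3, 5)
print(tB.shape, vB.shape)                                    # (6, 5) (3, 5)

# Every original row lands in exactly one of the two subsets
recovered = np.sort(np.concatenate([tA[:, 0], vA[:, 0]]))
print(np.array_equal(recovered, A[:, 0]))
```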
<hr>
