# Lab 02 - numpy

Name: Harrison Halesworth  
Class: CSCI 349 - Intro to Data Mining  
Section: 01 - 11am
Semester: Spring 2024
Instructors: Brian King, Joshua Stough

**Submission:** 
* `lab02.ipynb`, pushed to your Git repo on master.   
* `lab02.pdf` generated and turned in on Gradescope

**Points**: 10

**Due**: Friday, January 26, *before* class

# Objectives

-   numpy. More numpy. Because, one can never have enough numpy.

# Introduction

The first lab helped you set up your tools and briefly introduced you to numpy. Hopefully you found good reference material for numpy as you were working through the previous lab. A common challenge for Python programmers who are new to numpy is understanding some of the fundamental differences between Python lists and NumPy arrays (`ndarray`). They are completely *different*! Understanding their underlying representation makes it very clear just how different they are. NumPy arrays are designed to be ultra-fast on vector or matrix calculations, with most of the underlying library developed in C. This lab is going to dive into NumPy, and by the end, you should have a good foundation in this fundamental Python package.
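To make the list-versus-array difference concrete, here is a minimal sketch. The variable names are made up for illustration and are not part of the lab.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

# `*` repeats a list, but multiplies an ndarray element by element
list_times_two = py_list * 2
arr_times_two = np_arr * 2

# `+` concatenates lists, but adds arrays element by element
list_plus = py_list + py_list
arr_plus = np_arr + np_arr

# A list can mix types; an ndarray stores one dtype for every element,
# so the integers here are upcast to floating point
mixed = np.array([1, 2.5, 3])

print(list_times_two, arr_times_two)
print(list_plus, arr_plus)
print(mixed.dtype)
```

The same operator means something different for each type, which is why code written for lists often behaves unexpectedly when handed an array.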

## Useful References

-   <https://numpy.org/doc/stable/>

    - The most useful chapter for all beginners is <https://numpy.org/doc/stable/user/absolute_beginners.html>

-   Python for Data Analysis, 2nd ed. by Wes McKinney. 
    - Excellent book. Bookmark this one. The URL is in Moodle in the references section at the bottom.
    - Chapter 4 covers NumPy basics

# Getting Started

Notice that every lab and assignment should have the top cell that has identifying information. Make sure this is your first cell for every Python notebook you submit.

Next, you should have a cell that sets up the imports that you'll be using. You should become accustomed to doing this for future labs. These instructions will be reduced in the future, expecting you to understand what you need to do.


In [2]:
# Setting things up
import sys
import numpy as np
print(f'which python: {sys.executable}.')
print(f'python: {sys.version}.')
print(f'numpy: {np.__version__}.')

which python: C:\Users\hhale\anaconda3\python.exe.
python: 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)].
numpy: 1.26.3.


You'll notice that version numbers of important packages are printed out, including Python itself. Why? From personal experience, we use Python for almost all of our data science research, studies and experiments that we are involved with. Packages are *frequently* updated, and sometimes behaviors change that are unintended and dependent on a particular package, resulting in broken code. Including this output in generated reports can ensure others who want to run your code will install matching versions in their own environment.

We will usually provide a skeleton `.ipynb` file much like this file, with general background and introductory material to cover what you need, followed by numerous questions to gauge your understanding. 

* Questions with **[P]** are questions you need to write Python code that produces your answer. **No answer should just show hard coded output!** If there is output required, then your code should clearly print the answer desired. 
* Questions with **[M]** are questions that require a written response in markdown. You should double click on the cell to edit it, and write your response in markdown.

Generally, any variable on a line by itself as the last line of a cell will automatically be shown in the output. When you have multiple lines of output, however, use either the `display()` function, or the Python `print()` function. If the answer is going to produce multiple pages of output, then please do **not** print all of the data, but only the first few observations in your data.

Let's start with an example. If a line states

**[P] "Create a 2x3 matrix of zeros, single precision floating point format, stored as X"** 

Your answer should be something like:

In [3]:
X = np.zeros((2,3), dtype=np.float32)
X

array([[0., 0., 0.],
       [0., 0., 0.]], dtype=float32)

Notice that we did not use the `print` function. The variable `X` on the last line of the cell is enough. Jupyter will automatically print the contents of the last expression in every cell. However, more interestingly (especially when we get into Pandas), Jupyter will try to do some formatting. If you used print, notice you lose a lot of formatting:

In [4]:
print(X)

[[0. 0. 0.]
 [0. 0. 0.]]


Finally, remember not to show the output of enormous tables of data. For example:

**[P] Create a numpy array of 10000 rows and 100 columns of 32-bit integer zeros:**

I could show the first couple of rows using `print`:

In [5]:
X = np.zeros((10000,100), dtype=np.int32)
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Better, just show the shape and the first couple of lines or so of the data. For example:

In [6]:
print(X.shape)
print(X.size)
X[0:2,:]

(10000, 100)
1000000


array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

Additionally, you could put a useful function somewhere to return useful info about an array.

In [7]:
def arr_info(arr):
    assert isinstance(arr, np.ndarray)
    return f'shape: {arr.shape}, size: {arr.size}, dtype: {arr.dtype}, min: {arr.min()}, max: {arr.max()}'

arr_info(X)

'shape: (10000, 100), size: 1000000, dtype: int32, min: 0, max: 0'

---
# Exercises

### Some of you did not fill out your README.md. Please do that and push that file to your remote Git repo ASAP.

**1. [P]** Create a 100000 x 75 matrix of zeros, stored as `X`. Then print out the shape of the matrix, the base data type, the individual item size in the array, and the total size of the array in bytes (as an integer). Also print out the total size in megabytes with 3 places of significance.

(Note: Megabyte is a loosely defined term these days. A megabyte used to mean $2^{20}$ bytes, before companies selling storage devices started skimming by using $10^6$ as the measure. Now that the marketers stole our word, $2^{20}$ bytes is a *mebibyte*. Use either measure, but be consistent.)
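As a quick illustration of the two measures, the sketch below converts a made-up byte count (not the lab's array) both ways. Note that Python writes exponentiation as `**`; the `^` operator is bitwise XOR, so `2^20` evaluates to 22 rather than 1048576.

```python
n_bytes = 8 * 1_000_000      # hypothetical: one million 8-byte values

size_mb = n_bytes / 10**6    # decimal megabytes (MB)
size_mib = n_bytes / 2**20   # binary mebibytes (MiB)

print(f"{size_mb:.3f} MB")
print(f"{size_mib:.3f} MiB")
print(2**20, 2^20)           # exponentiation vs. bitwise XOR
```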

In [8]:
# https://numpy.org/doc/stable/reference/arrays.ndarray.html

X = np.zeros((100000,75))
print("shape = " + str(X.shape))
print("type = " + str(X.dtype))
print("item size = " + str(X.itemsize))
print("size = " + str(X.nbytes))
print("size (in MB) = " + "{:.3f}".format(X.nbytes/(2**20)))

shape = (100000, 75)
type = float64
item size = 8
size = 60000000
size (in MB) = 57.220


**2. [P]** Resize X to have the same number of elements, but with 100 columns. Show the shape.  Show the number of bytes (it should be the same as the previous answer)

In [9]:
# https://numpy.org/doc/stable/reference/generated/numpy.resize.html#numpy-resize

X = np.resize(X,(75000, 100))
print("shape = " + str(X.shape))
print("size = " + str(X.nbytes))

shape = (75000, 100)
size = 60000000


**3. [P]** Redo #1, but use a base datatype of 16-bit integers.

**You should have a reduction in memory by a factor of 4.** There are times when this type of change is simple, but highly effective in speeding up your algorithm. *Integer computations will always outperform floating point computations!* Use the simplest data type available that can accurately store your data.
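The memory effect of the base type can be checked directly. The sketch below uses a smaller, made-up shape than the exercise and also prints the range of `int16`, since a narrower type only helps when the data actually fits in it.

```python
import numpy as np

shape = (1000, 75)  # hypothetical shape, smaller than the lab's matrix

# Same number of elements, different bytes per element
for dtype in (np.float64, np.float32, np.int16, np.int8):
    arr = np.zeros(shape, dtype=dtype)
    print(f"{arr.dtype}: itemsize={arr.itemsize}, nbytes={arr.nbytes}")

# The trade-off: a 16-bit integer can only hold values in this range
int16_limits = np.iinfo(np.int16)
print(int16_limits.min, int16_limits.max)
```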



In [10]:
X = np.zeros((100000,75), dtype = "int16")
print("shape = " + str(X.shape))
print("type = " + str(X.dtype))
print("item size = " + str(X.itemsize))
print("size = " + str(X.nbytes))
print("size (in MB) = " + "{:.3f}".format(X.nbytes/(2**20)))

shape = (100000, 75)
type = int16
item size = 2
size = 15000000
size (in MB) = 14.305


**4. [P]** How many dimensions does X have? Answer using the appropriate property of np.ndarray objects.

In [11]:
X.ndim

2

**5. [P]** Enter the following Python list in your cell:

`str_nums = ['2.14', '-9.300', '42']`

Convert this to a numpy array named `X`. What is the base type? Show the contents of X.

Then, convert X to an array of single precision floating point numbers. (HINT: use `astype`). Show X again.

In [12]:
# https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html#numpy-ndarray-astype

str_nums = ['2.14', '-9.300', '42']
X = np.array(str_nums)
print(X.dtype)
print(X)
X = X.astype("float32")
print(X.dtype)
print(X)

<U6
['2.14' '-9.300' '42']
float32
[ 2.14 -9.3  42.  ]


---
Let's assume you have two weeks' worth of quiz scores. Quizzes are out of 10 points, and are given every day. Suppose you had a rather simplistic way of storing the data using Python lists:  

```
days = ["Mon","Tue","Wed","Thu","Fri"]  
scores_1 = [9.5, 8.75, 8, 10, 7.75]  
scores_2 = [9, 8, 9.5, 8.75, 7.25]  
```

The array `scores_1` represents quiz scores from week 1, and `scores_2` is week 2. The `days` array will be used for data selection purposes.

**6. [P]** Copy the three definitions above for `days`, `scores_1`, and `scores_2`. Create a numpy array called `scores` that has `scores_1` as the first row and `scores_2` as the second row, using `np.concatenate`. Then, change `days` into a `np.array` from the list `days`. Show the contents of `scores` and `days`, and output their shape.

In [13]:
# https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html#numpy-concatenate

days = ["Mon","Tue","Wed","Thu","Fri"]
scores_1 = [9.5, 8.75, 8, 10, 7.75]
scores_2 = [9, 8, 9.5, 8.75, 7.25]

scores = np.concatenate(([scores_1], [scores_2]))
days = np.array(days)
print(scores)
print(scores.shape)
print(days)
print(days.shape)

[[ 9.5   8.75  8.   10.    7.75]
 [ 9.    8.    9.5   8.75  7.25]]
(2, 5)
['Mon' 'Tue' 'Wed' 'Thu' 'Fri']
(5,)


**7. [P]** Repeat the previous problem with the creation of `scores` from `scores_1` and `scores_2`, but repeat it with `np.vstack`. The array should be identical.

> REMINDER – As stressed in earlier courses, do not simply answer questions using functions or methods you know nothing about! If you have never used vstack before, then look it up! Understand it!



In [14]:
# https://numpy.org/doc/stable/reference/generated/numpy.vstack.html#numpy-vstack

scores = np.vstack(([scores_1],[scores_2]))
scores

array([[ 9.5 ,  8.75,  8.  , 10.  ,  7.75],
       [ 9.  ,  8.  ,  9.5 ,  8.75,  7.25]])

For the next several questions, you are going to exercise some advanced data selection techniques using the days array to assist with looking up data. For example, if you are asked to reveal the scores from only Friday, then you must write the numpy selection expression to do this. For example, the following shows the scores only from Friday:

`scores[:, days == "Fri"]`

**8. [P/M]** Compare the result of the expression `days == "Fri"` if the variable `days` was a Python list as entered above, vs. `days` being a numpy array. What is the difference in the result? In general, how does numpy deal with standard comparison operators?
      
 **Be sure `days` is your numpy array before going to the next problem.**



In [15]:
print(scores[:, days == "Fri"])
days = ["Mon","Tue","Wed","Thu","Fri"]
print(scores[:, days == "Fri"])

[[7.75]
 [7.25]]
[]


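When `days` is a regular Python list, `days == "Fri"` compares the whole list to the string and returns the single value `False`, so the selection comes back empty. When `days` is an ndarray, numpy applies the comparison element by element and returns a boolean array that is `True` only at the position holding "Fri". That boolean array can then be used as a mask to select the matching columns of `scores`. In general, numpy applies the standard comparison operators elementwise and returns a boolean array of the same shape.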
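The two comparison results can be inspected side by side. This is a small sketch with illustrative variable names (`days_list`, `days_arr`) so that neither overwrites the lab's `days`.

```python
import numpy as np

days_list = ["Mon", "Tue", "Wed", "Thu", "Fri"]
days_arr = np.array(days_list)

list_result = days_list == "Fri"   # one bool for the whole list
arr_result = days_arr == "Fri"     # one bool per element

print(type(list_result), list_result)
print(type(arr_result), arr_result)
```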

**9. [P]** The scores array represents quizzes that are out of 10 pts. Show the scores array scaled to be out of 100 pts.

HINT: Your first row should be 95.0, 87.5, 80.0, 100.0, 77.5

In [16]:
scores = scores*10
scores

array([[ 95. ,  87.5,  80. , 100. ,  77.5],
       [ 90. ,  80. ,  95. ,  87.5,  72.5]])

**10. [P]** Select the scores that fell on Tuesday

In [17]:
scores = scores/10
days = np.array(days)
tue = scores[:, days == "Tue"]
tue

array([[8.75],
       [8.  ]])

**11. [P]** Select all of the scores that are NOT on Tuesday (Hint – look up the ~ operator)

In [18]:
not_tue = scores[:, days != "Tue"]
not_tue

array([[ 9.5 ,  8.  , 10.  ,  7.75],
       [ 9.  ,  9.5 ,  8.75,  7.25]])
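For reference, the `~` operator from the hint inverts a boolean mask and should give the same selection as `!=`. A minimal sketch, with the arrays rebuilt so it runs on its own:

```python
import numpy as np

days = np.array(["Mon", "Tue", "Wed", "Thu", "Fri"])
scores = np.array([[9.5, 8.75, 8, 10, 7.75],
                   [9, 8, 9.5, 8.75, 7.25]])

is_tue = days == "Tue"
not_tue_a = scores[:, ~is_tue]         # invert the boolean mask with ~
not_tue_b = scores[:, days != "Tue"]   # compare with != directly

print(np.array_equal(not_tue_a, not_tue_b))
```

The `~` form is handy when the mask already exists or combines several conditions, for example `~((days == "Tue") | (days == "Thu"))`.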

**12. [P]** Select the scores that were on Tuesday or Thursday

In [19]:
tue_thu = scores[:, (days == "Tue") | (days == "Thu")]
tue_thu

array([[ 8.75, 10.  ],
       [ 8.  ,  8.75]])

**13. [P]** Show the minimum and maximum scores for the entire array of scores

A correct result should look like:
```
min =  7.25
max =  10.0
```

In [20]:
print("min = " + str(np.min(scores)))
print("max = " + str(np.max(scores)))

min = 7.25
max = 10.0


**14. [P]** Show the maximum scores for each week as a new array. The result should have the same dimensions (hint – use the `keepdims` parameter of `np.max`). Your result should be something like:

|   | 0    |
|---|------|
| 0 | 10.0 |
| 1 | 9.5  |

In [24]:
# https://numpy.org/doc/stable/reference/generated/numpy.max.html

new_arr = np.max(scores, axis=1, keepdims=True)
new_arr

array([[10. ],
       [ 9.5]])

**15. [P]** Report the day that the maximum score occurred each week. (HINT: use `argmax` and use that result to index `days`.)

In [28]:
# https://numpy.org/doc/stable/reference/generated/numpy.argmax.html#numpy-argmax

max_wk1 = np.argmax(scores[0])
print("Week 1 max occurs on " + days[max_wk1])

max_wk2 = np.argmax(scores[1])
print("Week 2 max occurs on " + days[max_wk2])

Week 1 max occurs on Thu
Week 2 max occurs on Wed
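The per-week lookups can also be done in one step. As a sketch, `argmax` with `axis=1` returns one column index per row, and that index array can be used directly to index `days`:

```python
import numpy as np

days = np.array(["Mon", "Tue", "Wed", "Thu", "Fri"])
scores = np.array([[9.5, 8.75, 8, 10, 7.75],
                   [9, 8, 9.5, 8.75, 7.25]])

max_idx = np.argmax(scores, axis=1)   # column index of each week's maximum
max_days = days[max_idx]              # integer-array indexing into days

for week, day in enumerate(max_days, start=1):
    print(f"Week {week} max occurs on {day}")
```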


**16. [P]** Report the mean of the scores of each week

In [29]:
# https://numpy.org/doc/stable/reference/generated/numpy.mean.html#numpy.mean

mean_wk1 = np.mean(scores[0])
print("The mean of week 1 is : " + str(mean_wk1))

mean_wk2 = np.mean(scores[1])
print("The mean of week 2 is : " + str(mean_wk2))

The mean of week 1 is : 8.8
The mean of week 2 is : 8.5


**17. [P]** Suppose the lowest score was dropped from each week. Report the mean of each week, but without the minimum score for that week.

In [40]:
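# https://numpy.org/doc/stable/reference/generated/numpy.delete.html#numpy-delete

# Both weekly minimums fall on Friday, so deleting that single column drops each week's lowest score
alt_scores = np.delete(scores, np.argmin(scores, axis=1), axis=1)
print("The scores without each week's minimum are : " + str(alt_scores))
print("The means without each week's minimum are : " + str(np.mean(alt_scores, axis=1)))

The scores without each week's minimum are : [[ 9.5   8.75  8.   10.  ]
 [ 9.    8.    9.5   8.75]]
The means without each week's minimum are : [9.0625 8.8125]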
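Note that `np.delete` with `axis=1` removes whole columns, so it only drops each row's minimum when the minimums share a column. One possible alternative that does not depend on that, sketched here on a made-up array whose minimums sit in different columns, is to sort each row and skip its first element:

```python
import numpy as np

# Hypothetical data: row 0's minimum is in column 1, row 1's is in column 0
demo = np.array([[3.0, 1.0, 2.0],
                 [4.0, 6.0, 5.0]])

# Sort each row ascending, drop the smallest value, then average the rest
means_without_min = np.sort(demo, axis=1)[:, 1:].mean(axis=1)
print(means_without_min)
```

Sorting loses the original column order, which does not matter for a mean but would matter if the remaining scores had to stay aligned with `days`.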


**18. [P]** Show the number of scores that were > 9 each week as an array.

In [62]:
scores = np.concatenate(([[9.5, 8.75, 8, 10, 7.75]],[[9, 8, 9.5, 8.75, 7.25]]))
# The comparison gives a boolean array; summing along each row counts the True values
greater_than_nine = np.sum(scores > 9, axis=1, keepdims=True)
print(greater_than_nine)

[[2]
 [1]]


---
For the next exercise, we will set up a new `X` array of random integers between 1 and 100, with 10 rows and 5 columns. We will also set the random seed to 1234 to ensure everyone gets the same results.

In [87]:
# https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html#numpy-random-randint

np.random.seed(1234)
X = np.random.randint(1, 101, size=(10,5))

You may use standard selection techniques with integers and slicing for the following exercises. Be sure to show your results of each. (Most of these are one-line statements, and thus you won't need a `print` or `display` call.) For each of these, you should only show the result of the selection. Do NOT reassign `X`.

**19. [P]** Select the first row of X

In [88]:
X[0]

array([48, 84, 39, 54, 77])

**20. [P]** Select the last column of X

In [89]:
print(X[:,[-1]])

[[77]
 [27]
 [59]
 [48]
 [39]
 [81]
 [66]
 [61]
 [14]
 [32]]


**21. [P]** Select the first AND last column of X

In [90]:
X[:, [0,-1]]

array([[48, 77],
       [25, 27],
       [31, 59],
       [93, 48],
       [51, 39],
       [68, 81],
       [ 4, 66],
       [76, 61],
       [47, 14],
       [97, 32]])

**22. [P]** Select every other row of X

In [91]:
# https://numpy.org/doc/stable/user/basics.indexing.html

X[::2]

array([[48, 84, 39, 54, 77],
       [31, 44, 31, 27, 59],
       [51, 77, 38, 35, 39],
       [ 4,  3, 20, 13, 66],
       [47, 29, 82, 88, 14]])

**23. [P]** Show the transpose of X, but don't change X itself

In [92]:
np.transpose(X)

array([[48, 25, 31, 93, 51, 68,  4, 76, 47, 97],
       [84, 16, 44, 70, 77, 12,  3, 82, 29, 13],
       [39, 50, 31, 81, 38,  1, 20, 15, 82, 70],
       [54, 24, 27, 74, 35, 76, 13, 72, 88, 96],
       [77, 27, 59, 48, 39, 81, 66, 61, 14, 32]])

**24. [P]** Select the first column of X and set the result to Y.

In [93]:
Y = X[:, [0]]
Y

array([[48],
       [25],
       [31],
       [93],
       [51],
       [68],
       [ 4],
       [76],
       [47],
       [97]])

**25. [P]** Increment the first value of Y, then show the corresponding value of X. Did both values in X and Y change?

In [94]:
Y[0,0] += 1
X
# X value did not change

array([[48, 84, 39, 54, 77],
       [25, 16, 50, 24, 27],
       [31, 44, 31, 27, 59],
       [93, 70, 81, 74, 48],
       [51, 77, 38, 35, 39],
       [68, 12,  1, 76, 81],
       [ 4,  3, 20, 13, 66],
       [76, 82, 15, 72, 61],
       [47, 29, 82, 88, 14],
       [97, 13, 70, 96, 32]])
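Whether `X` changes here depends on how `Y` was selected. Basic slicing such as `X[:, 0]` returns a view that shares memory with `X`, while indexing with a list such as `X[:, [0]]` returns a new array. A small sketch on a made-up array `A` shows both cases:

```python
import numpy as np

A = np.arange(6).reshape(3, 2)   # hypothetical array: [[0, 1], [2, 3], [4, 5]]

view_col = A[:, 0]      # basic slicing: a view onto A's memory
copy_col = A[:, [0]]    # indexing with a list: an independent copy

print(np.shares_memory(A, view_col))
print(np.shares_memory(A, copy_col))

view_col[0] += 100      # writes through to A[0, 0]
copy_col[1, 0] += 100   # leaves A[1, 0] untouched
print(A)
```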

**26. [P]** Repeat exercise 23, but ensure that Y is assigned a copy of the selected data. Increment the first value of Y again and ensure that the corresponding value of X did not change.

In [95]:
# https://numpy.org/doc/stable/reference/generated/numpy.copy.html#numpy-copy

Y = np.copy(np.transpose(X))
Y[0,0] += 1
X
#X value did not change

array([[48, 84, 39, 54, 77],
       [25, 16, 50, 24, 27],
       [31, 44, 31, 27, 59],
       [93, 70, 81, 74, 48],
       [51, 77, 38, 35, 39],
       [68, 12,  1, 76, 81],
       [ 4,  3, 20, 13, 66],
       [76, 82, 15, 72, 61],
       [47, 29, 82, 88, 14],
       [97, 13, 70, 96, 32]])

**27. [P]** Create an array that contains the sequence of numbers 0, 0.1, 0.2, … 9.8, 9.9 using arange, as a 10x10 matrix, stored as X.

In [99]:
# https://numpy.org/doc/stable/reference/generated/numpy.arange.html#numpy-arange

X = np.arange(0, 10, 0.1).reshape(10,10)
X

array([[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
       [1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9],
       [2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
       [3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9],
       [4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9],
       [5. , 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9],
       [6. , 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9],
       [7. , 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9],
       [8. , 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9],
       [9. , 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9]])
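The exercise asks for `arange`, but with a non-integer step the number of elements depends on floating point rounding, and the numpy documentation suggests `np.linspace` for such cases. As a sketch, two alternatives that fix the element count explicitly:

```python
import numpy as np

X_ar = np.arange(0, 10, 0.1).reshape(10, 10)        # the approach used above
X_lin = np.linspace(0, 9.9, 100).reshape(10, 10)    # endpoints and count given
X_int = (np.arange(100) / 10).reshape(10, 10)       # integer range, then scale

print(np.allclose(X_ar, X_lin), np.allclose(X_ar, X_int))
```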

**28. [P]** Set the RNG seed to 1234. Then create an array X of 100 uniformly distributed numbers, with all values between 1.0 and 10.0. Then, show the mean, the median, the minimum and maximum values of X.

In [103]:
np.random.seed(1234)
X = np.random.uniform(1,10, size=(1,100))
print("The mean of X is : " + str(np.mean(X)))
print("The median of X is : " + str(np.median(X)))
print("The max of X is : " + str(np.max(X)))
print("The min of X is : " + str(np.min(X)))

The mean of X is : 5.665266170909222
The median of X is : 5.827284349552116
The max of X is : 9.928733195695253
The min of X is : 1.0558766492841647
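The exercise uses the legacy `np.random.seed` interface. Current numpy documentation recommends the `Generator` API for new code; a minimal sketch is below. The numbers it draws will not match the legacy output above, even with the same seed, because the underlying generator is different.

```python
import numpy as np

rng = np.random.default_rng(1234)            # a seeded Generator object
X_new = rng.uniform(1.0, 10.0, size=100)     # 100 values in [1.0, 10.0)

print(X_new.shape)
print(np.mean(X_new), np.median(X_new), np.min(X_new), np.max(X_new))
```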


**29. [M]** Define what is meant by a normal distribution. What are the parameters of a normal distribution?

**ANSWER**: A normal distribution is a continuous probability distribution with a symmetric bell shape: values close to the mean are the most likely, and values become less likely the farther they fall from the mean in either the positive or negative direction. Its parameters are the mean, which sets the center, and the standard deviation (or, equivalently, the variance), which sets the spread.

https://www.techtarget.com/whatis/definition/normal-distribution
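To see the two parameters at work, the sketch below draws samples with a made-up mean and standard deviation and measures how many fall within one and two standard deviations of the mean. For a normal distribution these fractions should come out near 68% and 95%.

```python
import numpy as np

mu, sd = 10.0, 2.0                    # hypothetical parameters
rng = np.random.default_rng(0)
samples = rng.normal(mu, sd, size=100_000)

# Fraction of samples within one and two standard deviations of the mean
within_1sd = np.mean(np.abs(samples - mu) < sd)
within_2sd = np.mean(np.abs(samples - mu) < 2 * sd)

print(f"within 1 sd: {within_1sd:.3f}")
print(f"within 2 sd: {within_2sd:.3f}")
```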

**30. [M]** In simple terms, using a normal distribution, what does the Law of Large Numbers tell us?

**ANSWER:** From my understanding, the Law of Large Numbers tells us that as the sample size grows, the sample mean converges to the population mean. For samples drawn from a normal distribution, this means the mean of a larger sample will tend to land closer to the distribution's mean.

**31. [P]** Write a function called `test_normal_dist`. The purpose of this function is to evaluate the law of large numbers. It should have four parameters:

* `mu` = mean of distribution  
* `sd` = standard deviation  
* `vec_length` = length of the vector to generate randomly from a normal distribution, with mu and sd as parameters
* `num_trials` = number of times to repeat the experiment  

The function should behave as follows. First, initialize the seed value to 1000 before your loop. Then, loop for num_trials, generating vec_length numbers from a normal distribution. Compute the mean and the deviation of the vector, where deviation is the absolute value of (observed mean - expected mean). This should be repeated for all trials. Return the *average* of this deviation over all trials.


In [104]:
# https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html#numpy-random-normal

def test_normal_dist(mu, sd, vec_length, num_trials):
    """
    Test out a normal distribution, and return the average deviation

    Parameters:
    mu = mean
    sd = standard deviation
    vec_length = size of each individual vector to generate
    num_trials = number of vectors to generate
    """
    
    # Set seed to 1000 and initialize total deviation
    np.random.seed(1000)
    tot_deviation = 0

    # Randomly generate vec_length numbers num_trials times, 
    # calculate mean of current trial, then finally
    # take deviation each time and add to total deviation
    for trial in range(num_trials):
        vec = np.random.normal(mu, sd, vec_length)
        curr_mean = np.mean(vec)
        curr_deviation = np.abs(mu - curr_mean)

        tot_deviation += curr_deviation

    # total deviation divided by the number of trials will give you
    # the average trial mean deviation from the actual mean
    return tot_deviation / num_trials

**32. [P]** Use `test_normal_dist` to obtain the deviation for vector lengths of 10, 100, 1000, 10000, and 100000. Use a fixed number of trials of 100 for each experiment. Report the results as a two-dimensional numpy array with two columns, the first being the vector length, and the second being the average deviation resulting from your `test_normal_dist` function.
  
Example: Using a mu = 10 and sd = 2, your resulting array should look something like:

|        |        |
|--------|--------|
| 10     | 0.4474 |
| 100    | 0.1638 |
| 1000   | 0.0427 |
| 10000  | 0.015  |
| 100000 | 0.0047 |



In [135]:
mu = 10
sd = 2

# For each vec_length in [10, 100, ..., 100000], run 100 trials of
# test_normal_dist and pair the length with its average deviation
lengths = [10**i for i in range(1, 6)]
results = np.array([[n, test_normal_dist(mu, sd, n, 100)] for n in lengths])

print(results)

[[1.00000000e+01 4.47356805e-01]
 [1.00000000e+02 1.63805884e-01]
 [1.00000000e+03 4.26737793e-02]
 [1.00000000e+04 1.50443904e-02]
 [1.00000000e+05 4.72044942e-03]]
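These deviations can be compared with what theory predicts. The mean of `n` independent normal draws is itself normally distributed with standard deviation `sd / sqrt(n)`, and the expected absolute deviation of a normal variable with standard deviation `s` is `s * sqrt(2 / pi)`. The sketch below evaluates that formula for the same lengths; it is a sanity check added for illustration, not part of the assignment.

```python
import numpy as np

sd = 2
lengths = np.array([10, 100, 1000, 10_000, 100_000])

# Expected |sample mean - mu| for n normal draws: sd * sqrt(2 / (pi * n))
expected_dev = sd * np.sqrt(2 / (np.pi * lengths))

print(np.column_stack((lengths, expected_dev)))
```

The predicted values shrink by a factor of 10 for every 100-fold increase in vector length, which is the same pattern as the measured deviations above.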
