<center> <h1> Programming for Data Analytics 2021 NumPy Assignment </h1> </center>

<div align="center"> <b> Student Name: Kate McGrath </b> </div>
<div align="center"> <b> Student Number: G00398908 </b>  </div>
<div align="center"> <b> Submission Date: 19/11/2021 </b> </div>

<center> <h2> Part One: Purpose of NumPy Random Package </h2> </center>

<center> <h3> 1.1. Introduction to NumPy (Numerical Python) </h3> </center>





NumPy, or Numerical Python, is one of the most widely used scientific libraries in Python programming [1].

An open-source library, created in 2005 by Travis Oliphant [2], it has become the de facto standard for working with numerical data in Python, and is used by everyone from beginners to experienced researchers at the cutting edge of scientific and commercial R&D [3]. Additionally, several other Python analytical libraries have been built on top of NumPy, including Pandas, Scikit-learn and Matplotlib [1]. 

<center> <h4>Multidimensional Arrays</h4> </center>

NumPy's core object is the multidimensional array (ndarray), which is essentially a fast and flexible container, built to facilitate batch numerical operations on blocks of data. The key advantage of using NumPy arrays for numerical computations rather than Python lists is that the former are designed for efficiency on large arrays of data.

NumPy arrays can be thought of as a grid of values. In NumPy, each direction of the grid (for example, rows or columns) is referred to as a dimension. A vector consists of a single dimension, a matrix contains two dimensions, and a tensor comprises three or more dimensions. Each dimension has a corresponding axis, numbered from 0. Axes are used to locate and operate on elements along a specific dimension.

Run the below code to see examples of one, two and three-dimensional arrays.


In [None]:
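import numpy as np

#One dimension: a vector of three elements
oneDArray = np.array([1, 2, 3])
print(oneDArray)

#Two dimensions: a 2 x 2 matrix
twoDArray = np.array([(1, 2), (3, 4)])
print(twoDArray)

#Three dimensions: two 2 x 2 matrices stacked together (shape 2 x 2 x 2)
threeDArray = np.array([[(1, 2), (3, 4)], [(5, 6), (7, 8)]])
print(threeDArray)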
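
The relationship between dimensions and axes can be sketched with a small example. The below code is illustrative only; the array `matrix` is an assumed example and is not used elsewhere in this notebook. It shows how the `ndim` and `shape` attributes describe an array and how the `axis` argument selects the dimension an operation runs along.

```python
import numpy as np

# A 2 x 3 matrix: axis 0 runs down the rows, axis 1 runs across the columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.ndim)         # number of dimensions: 2
print(matrix.shape)        # length of each axis: (2, 3)
print(matrix.sum(axis=0))  # collapses axis 0, one total per column: [5 7 9]
print(matrix.sum(axis=1))  # collapses axis 1, one total per row: [ 6 15]
```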

<center> <h4>NumPy Arrays vs Python Lists</h4> </center>

NumPy arrays are significantly faster than Python lists for the following reasons:

1. NumPy arrays are homogeneous, i.e. consisting of a single data type, and are stored in a contiguous block of memory. To access the next element stored in an array, the programme simply needs to move to the next memory address. In contrast, Python lists are heterogeneous (not confined to a single data type) and hold references to objects stored in non-consecutive memory locations; both of these factors contribute to processing overhead.
2. NumPy operations are vectorised: the loop over the elements runs in compiled code rather than in the Python interpreter, and some routines can additionally process several elements in parallel.
3. NumPy's core is implemented in compiled languages such as C, C++ and Fortran rather than in Python. Compiled code typically has a much shorter execution time than interpreted Python.
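
The first point above can be checked directly from an array's attributes. The below code is a minimal sketch; the variable names `numbers` and `mixed` are assumed for illustration. It shows that every element of an array shares one data type and that the data sits in a single contiguous buffer, whereas a list can hold objects of different types.

```python
import numpy as np

numbers = np.arange(5, dtype=np.int64)

print(numbers.dtype)                  # one data type for every element: int64
print(numbers.itemsize)               # bytes per element: 8
print(numbers.nbytes)                 # total bytes in the data buffer: 40
print(numbers.flags["C_CONTIGUOUS"])  # stored in one contiguous block: True

# A Python list can mix types, so each slot holds a reference to a separate object
mixed = [1, "two", 3.0]
print([type(item).__name__ for item in mixed])  # ['int', 'str', 'float']
```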

In addition to the speed advantages, NumPy arrays allow element-wise operations across the whole array without requiring explicit loops, and consume less memory than their Python list counterparts. The below code illustrates the performance advantages of NumPy arrays over Python lists, in terms of speed and memory usage.

In [None]:



import numpy as np
import time
from sys import getsizeof

size = 1000000

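#range objects are lazy sequences, so they are converted to real lists for a fair comparison
list1 = list(range(size))
list2 = list(range(size))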

#Need to get current time to second to subtract from time taken to complete operation
startTime = time.time()

newlist = []

#This code is zipping list 1 and 2 together - mapping each list1 value to its corresponding list2 value and multiplying them
#Then adding them to new list
for a,b in zip(list1,list2):
    newlist.append(a*b)

TimeNow = time.time()
TimeTaken = TimeNow - startTime
print("Time taken to complete in seconds: {}".format(TimeTaken))

#Declaring Arrays
array1 = np.arange(size)  
array2 = np.arange(size)

startTimeNumpy = time.time()

NewArray = array1* array2

TimeNowNumpy = time.time()
TimeTakenNumpy = TimeNowNumpy - startTimeNumpy
print("Time taken to complete in seconds: {}".format(TimeTakenNumpy))
print("Time difference in seconds: {}".format(TimeTaken - TimeTakenNumpy))


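#Checking difference in memory consumption between numpy array and python list
#getsizeof on a list only counts its references, so the size of each integer object is added to approximate the total
totalMemoryList = getsizeof(newlist) + sum(getsizeof(item) for item in newlist)
totalMemoryNumpy = getsizeof(NewArray)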

print("Total memory consumed by Python List: {} bytes".format(totalMemoryList))
print("Total memory consumed by Numpy Array: {} bytes".format(totalMemoryNumpy))
print("Memory difference in bytes: {}".format(totalMemoryList - totalMemoryNumpy))

<center> <h4>Getting Started with NumPy</h4> </center>


The only prerequisite to installing NumPy is Python itself. For convenience, the Anaconda distribution is recommended for beginners as it comes with NumPy, Python and other statistical Python packages pre-installed. 

To install NumPy on its own, the following command can be used:

In [None]:
pip install numpy

In order to use NumPy Random, the library should be imported using the following command:

In [None]:
import numpy as np
from numpy import random

<center> <h3> 1.2. NumPy Random Package </h3> </center>

The NumPy random module provides methods of obtaining random numbers following any of several statistical distributions, as well as methods of choosing random elements from and shuffling arrays. 

<center> <h4>Applications of Random Numbers to Data Science</h4> </center>

Randomness is one of the key concepts underpinning the field of statistics. It is a phenomenon where the outcome of a single repetition is uncertain, yet in a large number of repetitions a regular distribution of outcomes is observed, e.g. the bell curve that characterizes the normal distribution.

A random number is one drawn from a set of possible values in such a way that it cannot be predicted in advance; in the uniform case, every value in the set has an equal probability of being selected. The ability to generate random numbers is paramount within the fields of data science and machine learning.

Sample use cases for random number generation are listed below.

1. The collection of sample data generally entails picking truly random data points from the population; the more random these points are, the more representative they are of the population. However, if the distribution of data for which the model is being developed is already known, it may not be necessary to collect data if random numbers matching the distribution can be generated. For example, if we know the mileage of a car follows a normal distribution, we can generate random numbers following a normal distribution for a study. This process is called simulation.
2. When developing a machine learning model, data is split into training and testing data, commonly with around 80% going to the former category and 20% to the latter. This splitting of data should be random so that both sets are representative of the data as a whole.
3. Neural network algorithms, which are sets of algorithms modelled loosely on the human brain and designed to recognize patterns, also rely on randomness. A neural network is made up of interconnected nodes. The strength of the connection between two nodes is the weight, i.e. the influence that the input should have on the output. Neural network algorithms are initialized with random weights and with each epoch (passing of the entire dataset through the neural network) the weights are adjusted to reduce error and increase accuracy.
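
The random train/test split described in the second use case can be sketched as follows. This is a minimal illustration rather than a recommended workflow: the array `data` is an assumed stand-in for 100 observations, and the seed of 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen arbitrarily for reproducibility

data = np.arange(100)             # stand-in for 100 observations
shuffled = rng.permutation(data)  # random reordering, original left untouched

split_point = int(0.8 * len(data))
train, test = shuffled[:split_point], shuffled[split_point:]

print(len(train), len(test))      # 80 20
```

Because the rows are shuffled before slicing, each observation has the same chance of ending up in either set.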


<center> <h4>True Random and Pseudorandom Number Generation</h4> </center>

Two types of random number generation exist: True Random and Pseudorandom Number Generation (TRNG/PRNG).

The majority of computer programs, including the NumPy random package, rely on Pseudorandom Number Generation. This is because computers are deterministic, which means that the same input will always produce the same output. Pseudorandom number generators, such as those used by the random module, are algorithms that generate apparently random, yet reproducible data. The pseudorandom number generator starts with an integer called a seed, and then generates numbers in succession according to the specified distribution. The same seed will provide the same sequence of random numbers, hence if the seed is set at the start of a programme, the numbers generated will be reproducible. The below code sets the seed and creates an array of five random integers from 1 to 9 (the upper bound of 10 passed to randint is exclusive). Each time this code is run the output will be the same.

In [None]:
import numpy as np
from numpy import random
np.random.seed(55)
random_numbers = np.random.randint(1, 10, size = 5)
print(random_numbers)
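
The same reproducibility can be sketched with the newer Generator interface used later in this notebook. In the below code the variable names are assumed for illustration; two generators created with the same seed produce identical output, while an unseeded generator differs from run to run.

```python
import numpy as np

first = np.random.default_rng(55).integers(1, 10, size=5)
second = np.random.default_rng(55).integers(1, 10, size=5)
unseeded = np.random.default_rng().integers(1, 10, size=5)

print(first)
print(second)
print(np.array_equal(first, second))  # True: same seed, same sequence
print(unseeded)                       # varies from run to run
```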

Conversely, true random number generators cannot rely on computer algorithms as by their nature these will always be deterministic. Instead, TRNGs extract randomness from physical phenomena and introduce it to a computer. Potential sources may include time between keystrokes or the movement pattern of a mouse, or naturally occurring phenomena like radioactive decay. The generation of truly random numbers from these sources relies on the identification of small, unpredictable changes within a dataset. Some fields, like cryptography, require random numbers that cannot be predicted or reproduced. However, TRNGs are inefficient compared to PRNGs, and for applications like data simulation and modelling reproducibility can be an advantage. Hence, PRNGs such as those used in the NumPy random package are more applicable to the field of data science.
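
As a rough illustration of drawing on unpredictable sources, the below sketch uses Python's standard `secrets` module, which reads from the operating system's entropy source. This is an assumption made for illustration: it is not strictly a TRNG, but it shows the difference between unpredictable bytes and a seeded, reproducible sequence.

```python
import secrets
import numpy as np

# secrets draws from the operating system's entropy source
token = secrets.token_bytes(16)
print(token.hex())

# default_rng() with no seed is also seeded from OS entropy,
# but the numbers that follow are still produced by a PRNG algorithm
rng = np.random.default_rng()
print(rng.integers(1, 10, size=5))
```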

<center> <h2> Part 2:  Simple Random Data and Permutations </h2> </center>

<center> <h3> 2.1 Simple Random Data </h3> </center>

Simple random data refers to a set of functions that can be used for simple random sampling. A simple random sample takes a small, random portion of the entire population, where each member has an equal chance of being selected. This portion is considered representative of the dataset as a whole.

Four main functions for simple random sampling are included in NumPy. These are as follows:

1. integers: Returns random integers from the discrete uniform distribution.
2. random: Returns random floats from the continuous uniform distribution.
3. choice: Generates a random sample from within a given NumPy array.
4. bytes: Generates a random string of bytes.

<center> <h4>1.  numpy.random.Generator.integers </h4> </center>

This function generates a sample of integers from the discrete uniform distribution. 

The uniform distribution is defined by two parameters, a and b, where a is the minimum and b is the maximum, and there is an equal probability that any value between a and b can be chosen.

In the discrete uniform distribution, the values between a and b are finite/countable, as illustrated by the below probability distribution.

![Discrete Uniform Probability Distribution](https://www.statisticshowto.com/wp-content/uploads/2013/09/Uniform_discrete_pmf_svg.svg_.png)

The syntax for the function is as follows:

***random.Generator.integers(low, high=None, size=None, dtype=np.int64, endpoint=False)***

1. Generator: Generates random numbers within the given distribution and value range. Relies on a bit generator, which generates the random numbers and manages the state, and uses the generated numbers to sample from different probability distributions.
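2. low = Mandatory parameter, lowest integer to be drawn (inclusive). If high is not supplied, integers are instead drawn from 0 (inclusive) up to low (exclusive by default)
3. high = Default None, upper bound for sampling (exclusive unless endpoint is True)
4. size = Default None, shape of the output, i.e. the number of elements to return. If size is not specified, a single integer will be returned
5. dtype = Data type of returned elements, defaults to int64, a signed integer data type stored in 64 bits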
6. endpoint = Boolean, if true sample from low (inclusive) to high (inclusive), if false sample from low (inclusive) to high (exclusive)
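
The effect of the high and endpoint parameters can be sketched as follows. The variable names are assumed for illustration; `np.unique` is used to list the distinct values that were drawn.

```python
import numpy as np

rng = np.random.default_rng(12345)

exclusive = rng.integers(low=1, high=5, size=10_000)                 # values 1 to 4
inclusive = rng.integers(low=1, high=5, size=10_000, endpoint=True)  # values 1 to 5
only_low = rng.integers(5, size=10_000)                              # values 0 to 4

print(np.unique(exclusive))  # [1 2 3 4]
print(np.unique(inclusive))  # [1 2 3 4 5]
print(np.unique(only_low))   # [0 1 2 3 4]
```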

The below code uses the random integers function to generate sets of numbers between 1 and 5 of various sample sizes and plots the returned arrays on histograms.

In [None]:
import numpy as np
from numpy import random
import seaborn as sns
import matplotlib.pyplot as plt

#Setting up generator to use default bit generator - PCG64
rng = np.random.default_rng(12345)

#Uniform Distribution
#Generating multiple plots to show effect of sample size
#Expect that as sample size increases, plot will more closely resemble a uniform distribution - 
#equal probability that any of the numbers between 1 and 5 will be chosen
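#endpoint = True makes the upper bound inclusive, so that 5 can be drawn
uniformdist10 = rng.integers(size = 10, low = 1, high = 5, endpoint = True)
uniformdist1000 = rng.integers(size = 1000, low = 1, high = 5, endpoint = True)
uniformdist10000 = rng.integers(size = 10000, low = 1, high = 5, endpoint = True)
uniformdist100000 = rng.integers(size = 100000, low = 1, high = 5, endpoint = True)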

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
plt.suptitle("Effect of Sample Size on Uniform Distribution")
ax1.hist(uniformdist10, color = '#7A82AB')
ax1.set_title("10 Samples")
ax2.hist(uniformdist1000, color = '#307473')
ax2.set_title("1000 Samples")
ax3.hist(uniformdist10000, color = '#12664F')
ax3.set_title("10000 Samples")
ax4.hist(uniformdist100000, color = '#C6D4FF')
ax4.set_title("100000 Samples")
plt.tight_layout()
plt.savefig("uniform distribution.png")
plt.close()

As seen in the resulting plot, the greater the sample size, the more closely the samples match the discrete uniform distribution. 

![uniform distribution.png](attachment:ddb3a6e1-37bf-410f-8408-99835d25ddbb.png)

<center> <h4>2.  numpy.random.Generator.random </h4> </center>

This function draws samples from the continuous uniform distribution.

The syntax for this function is as follows:

***random.Generator.random(size=None, dtype=np.float64, out=None)***

out = array in which to place the returned output if required. 

As with the discrete uniform distribution, this probability distribution is characterised by a minimum and maximum value, and there is an equal probability that any value between these two values will be selected. However, the values between a and b in a continuous distribution are not countable, as any real number in the range is possible, resulting in a probability distribution without defined points that resembles a rectangle in shape.

This is illustrated by the below distribution, where a = 1 and b = 3. 

![Continuous Uniform Probability Distribution](https://www.statisticshowto.com/wp-content/uploads/2013/09/uniform-distribution-a-b.jpg)

By default the function will return floats in the half-open interval [0.0, 1.0). This means that the endpoint, 1.0, will not be returned.

To sample numbers from a different range, the elements in the output array should be multiplied by (b-a) and then a should be added, where a is the minimum value and b is the maximum. For example, the below code returns a two-dimensional array with three rows, each containing two floats between 1 and 5.

In [None]:
rng = np.random.default_rng(12345)
#Scale by (b - a) = 4 and shift by a = 1 to move samples from [0, 1) to [1, 5)
random_floats = (5 - 1) * rng.random((3, 2)) + 1
print(random_floats)
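
The rescaling rule can be checked with a short sketch. The below code is illustrative; the names `scaled` and `direct` are assumed. It compares the manual (b-a) scaling with `Generator.uniform`, which accepts the bounds directly.

```python
import numpy as np

rng = np.random.default_rng(12345)
a, b = 1.0, 5.0

scaled = (b - a) * rng.random(1000) + a  # manual rescaling of samples from [0, 1)
direct = rng.uniform(a, b, size=1000)    # bounds passed to the generator directly

print(scaled.min(), scaled.max())        # both fall between 1 and 5
print(direct.min(), direct.max())        # both fall between 1 and 5
```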

The below code uses the random function to create floats between 1 and 5 of increasing sample sizes. As seen with the integers function, as the sample size increases the distribution more closely resembles the rectangular shape of the continuous uniform distribution. 

![ct_uniform distribution.png](attachment:59d3dd68-7a3e-4f41-a09a-7066387815c2.png)

In [None]:
import matplotlib
from matplotlib import pyplot as plt

#Scale by (b - a) = 4 and shift by a = 1 so that samples fall between 1 and 5
ct_uniform10 = (5 - 1) * rng.random(10) + 1
ct_uniform1000 = (5 - 1) * rng.random(1000) + 1
ct_uniform10000 = (5 - 1) * rng.random(10000) + 1
ct_uniform100000 = (5 - 1) * rng.random(100000) + 1

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
plt.suptitle("Effect of Sample Size on Uniform Distribution")
ax1.hist(ct_uniform10, color = '#7A82AB')
ax1.set_title("10 Samples")
ax2.hist(ct_uniform1000, color = '#307473')
ax2.set_title("1000 Samples")
ax3.hist(ct_uniform10000, color = '#12664F')
ax3.set_title("10000 Samples")
ax4.hist(ct_uniform100000, color = '#C6D4FF')
ax4.set_title("100000 Samples")
plt.tight_layout()
plt.savefig("ct_uniform distribution.png")
plt.close()

<center> <h4>3.  numpy.random.Generator.choice</h4> </center>

This function is used to obtain a random sample of elements from a given NumPy array and returns a new array with the selected values. 

The syntax for this function is as follows:

***random.Generator.choice(a, size=None, replace=True, p=None, axis=0, shuffle=True)***

1. a = array from which samples should be taken. If an integer is supplied, samples are taken from np.arange(a)
2. replace = if this is set to True (the default), sampling is done with replacement, meaning the same element may be selected multiple times. If set to False, each element can be selected at most once. The original array is not modified in either case.
3. p = probability for each element in the array. If this is supplied, elements will be selected from the array based on the assigned probability of that element. If not supplied, a uniform probability distribution is assumed for each of the elements in the array.
4. axis = The axis of the ndarray along which selection is performed. By default, this parameter is set to 0, meaning whole rows of a two-dimensional array are selected. To select columns instead, axis should be set to 1.
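
The replace and axis parameters can be sketched as follows. The arrays `values` and `grid` are assumed examples and are not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(12345)
values = np.array([10, 20, 30, 40, 50])

with_replacement = rng.choice(values, size=5)                    # repeats possible
without_replacement = rng.choice(values, size=5, replace=False)  # each value once

print(with_replacement)
print(without_replacement)
print(values)  # the source array is unchanged: [10 20 30 40 50]

# For a 2-D array, axis=0 picks whole rows and axis=1 picks whole columns
grid = np.arange(1, 10).reshape(3, 3)
picked_rows = rng.choice(grid, size=2, axis=0, replace=False)
picked_columns = rng.choice(grid, size=2, axis=1, replace=False)
print(picked_rows)
print(picked_columns)
```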

The below code generates two single dimensional arrays containing 1000 elements each, ranging between 1 and 5. The first array retains the standard uniform distribution and the probability values for each element are set for the second array. The two arrays are plotted to illustrate the effect of the probability settings. 

In [None]:
import matplotlib
from matplotlib import pyplot as plt

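#Passing np.arange(1, 6) samples from the values 1 to 5; passing the integer 5 would sample from 0 to 4
selected_probability = rng.choice(np.arange(1, 6), 1000, p = [0.0, 0.0, 0.2, 0.0, 0.8])
standard_probability = rng.choice(np.arange(1, 6), 1000)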

fig, ((ax1), (ax2)) = plt.subplots(1, 2)
ax1.hist(standard_probability, color = '#7A82AB')
ax1.set_title("Uniform Probability")
ax2.hist(selected_probability, color = '#307473')
ax2.set_title("Non Uniform Probability")
plt.savefig("prob_choice.png")

As seen below, in the second plot only the third and fifth elements are selected, as the probability for the other three elements is set to 0. Approximately 200 of the 1000 samples fall on the third element and approximately 800 on the fifth, in line with their assigned probabilities of 0.2 and 0.8 respectively.

![prob_choice.png](attachment:c9caf467-fb37-494f-a4a2-b72d7f5950ff.png)

<center> <h4>4.  numpy.random.Generator.bytes</h4> </center>

This function returns a Python bytes object containing a specified number of random bytes. The bytes themselves carry no text encoding; ISO-8859-1 is used for decoding below because it maps every possible byte value to a single character. The below code generates 20 random bytes and decodes them to show the corresponding characters.

In [None]:
import numpy as np
from numpy import random
random_byte_array  = np.random.default_rng().bytes(20)
print(random_byte_array)
print(random_byte_array.decode('ISO-8859-1'))
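
The raw values behind the decoded characters can be inspected with a short sketch. The below code is illustrative; the seed and the name `raw` are assumed.

```python
import numpy as np

rng = np.random.default_rng(12345)
raw = rng.bytes(4)

print(type(raw))  # <class 'bytes'>
print(len(raw))   # 4
print(list(raw))  # four integers, each between 0 and 255
print(raw.hex())  # the same bytes written as hexadecimal
```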

<center> <h3> 2.2 Permutations </h3> </center>

The term permutations refers to the different orders in which elements can be arranged. Within the domain of machine learning, permutation can be used to test a model. Permutation importance is calculated after the model has been trained and assesses the effect on accuracy if the elements of a single attribute are shuffled. If the attribute is important to the model, the random reordering of its elements should reduce accuracy, as the new data is no longer representative of real-world statistics.

The NumPy random package contains three functions for permutation. These are as follows:

1. **numpy.random.Generator.permutation**: Randomly permutes a given NumPy array and returns a new array, leaving the original unchanged. For multi-dimensional arrays, whole slices are reordered along the specified axis (the first axis by default).
2. **numpy.random.Generator.permuted**: Randomly permutes a given NumPy array along an axis, shuffling each slice along that axis independently of the others, and returns a new array. Optionally, a destination array can be specified with the out parameter.
3. **numpy.random.Generator.shuffle**: Shuffles the contents of a given NumPy array in place. Unlike the previous two methods, it modifies the original array and returns None.

The below code generates a two-dimensional array: three groups, each containing three elements between 1 and 9, and demonstrates the difference in behaviour between the permutation, permuted and shuffle functions.

In [1]:
#Code to show differences between shuffle, permutation and permuted
import numpy as np
myarray = np.arange(1,10).reshape(3,3)
rng = np.random.default_rng()
print("Original Array : {} ".format(myarray))

#permutation shuffles whole slices along the chosen axis and returns a new array
#Along axis 0 - changes the order of the rows but keeps the elements of each row together
#Along axis 1 - changes the order of the columns but keeps the elements of each column together
permute_ax0 = rng.permutation(myarray, axis = 0)
permute_ax1 = rng.permutation(myarray, axis = 1)
print("Permutation Axis 0: {}\n Permutation Axis 1: {}".format(permute_ax0,permute_ax1))

#permuted shuffles each slice along the chosen axis independently of the others
#Specifying myarray as the out array means myarray itself is altered rather than only returning a new array
#Axis 0 - shuffles the elements within each column independently, so rows are no longer kept together
#Axis 1 - takes the already shuffled myarray and shuffles the elements within each row independently
permuted_ax0 = rng.permuted(myarray, axis = 0, out = myarray)
print("Permuted Axis 0: {}".format(myarray))
permuted_ax1 = rng.permuted(myarray, axis = 1, out = myarray)
print("Permuted Axis 1 : {}".format(myarray))

#The shuffle method alters the original array in place without needing an out parameter, and returns None
#Axis 0 - shuffles the order of the rows but not the element order within them
#Axis 1 - shuffles the order of the columns, applying the same reordering to every row

rng.shuffle(myarray, axis = 0)
print("Array shuffled along axis 0: {}".format(myarray))

rng.shuffle(myarray, axis = 1)
print("Array shuffled along axis 1: {}".format(myarray))

Original Array : [[1 2 3]
 [4 5 6]
 [7 8 9]] 
Permutation Axis 0: [[7 8 9]
 [1 2 3]
 [4 5 6]]
 Permutation Axis 1: [[3 2 1]
 [6 5 4]
 [9 8 7]]
Permuted Axis 0: [[1 8 3]
 [7 5 6]
 [4 2 9]]
Permuted Axis 1 : [[3 1 8]
 [7 5 6]
 [2 4 9]]
Array shuffled along axis 0: [[2 4 9]
 [3 1 8]
 [7 5 6]]
Array shuffled along axis 1: [[2 4 9]
 [3 1 8]
 [7 5 6]]
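
The difference in return values between the three functions can be sketched in a few lines. The below code is illustrative; the variable names and seed are assumed.

```python
import numpy as np

rng = np.random.default_rng(12345)
original = np.arange(1, 10).reshape(3, 3)

copy_result = rng.permutation(original)  # new array, original untouched
print(np.array_equal(original, np.arange(1, 10).reshape(3, 3)))  # True

in_place = original.copy()
return_value = rng.shuffle(in_place)     # reorders in_place itself
print(return_value)                      # None

# permutation moves whole rows, so every row of the result is still an original row
print(sorted(copy_result.tolist()) == original.tolist())  # True

# permuted shuffles each column independently, so rows are generally broken up
print(rng.permuted(original, axis=0))
```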


<center> <h2> Part 3:  NumPy Random Distributions </h2> </center>

<center> <h3> Poisson Distribution </h3> </center>

The Poisson Distribution is primarily used to show how often a given event is likely to occur over a specified period. It is used when the variable of interest is a discrete count, i.e. can only take non-negative whole-number values such as 0, 1, 2, 3 etc.

It is used to predict the probability of a given number of events occurring when the average rate at which they occur is known. For example, if the average number of TV units sold in an electrical store is 500 per day, the Poisson Distribution could be used to calculate the probability of more units being sold on a given day, such as the last weekend before Christmas. Therefore, a common commercial application of the Poisson distribution is to make forecasts about the number of customers or sales on certain days/seasons of the year, to ensure appropriate allocation of stock and resources.

The formula for the Poisson distribution is as follows:

$$P(x; \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$

1. x: The number of times the event occurs within the timeframe
2. μ: The average number of times the event has occurred historically within the same timeframe
3. e: Euler's number, a constant approximately equal to 2.71828
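
The formula can be evaluated directly with the standard library. The below code is a minimal sketch: the helper `poisson_pmf` is a hypothetical function written for illustration, and the average of 2 events per period is an assumed value.

```python
import math

def poisson_pmf(x, mu):
    """Probability of observing exactly x events when the mean is mu."""
    return math.exp(-mu) * mu**x / math.factorial(x)

mu = 2  # hypothetical average of 2 events per period
for x in range(5):
    print(x, round(poisson_pmf(x, mu), 4))

# The probabilities over all possible counts sum to 1
print(round(sum(poisson_pmf(x, mu) for x in range(50)), 6))  # 1.0
```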

The syntax for generating Poisson data in NumPy is as follows:

**random.poisson(lam=1.0, size=None)**

1. lam: Expected number of occurrences within the interval, e.g. if calculating storms in a given area over the course of a year and 2 storms took place last year, lam would correspond to 2
2. size: Shape of returned array. If not specified and lam is a single number, a single value is returned
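
The effect of these two parameters can be sketched as follows. The below code uses the Generator version of the function, `Generator.poisson`, which takes the same lam and size parameters; the seed and the value of lam are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)
samples = rng.poisson(lam=2, size=100_000)

print(samples[:10])
print(samples.mean())  # close to lam
print(samples.var())   # for a Poisson distribution the variance is also close to lam

grid_shape = rng.poisson(lam=2, size=(2, 3)).shape
print(grid_shape)      # (2, 3)
```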

The below code generates NumPy arrays following the Poisson Distribution with varying values for lam. A KDE (Kernel Density Estimation) curve for each array is plotted on the same graph to visualize how the lam values impact the distribution. The KDE plots the values of a given distribution on the X axis and the estimated probability density at each value on the Y axis. The area under the entire KDE curve adds up to 1.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
poissondist1 = np.random.poisson(1, 10)
poissondist2 = np.random.poisson(2, 10)
poissondist3 = np.random.poisson(3, 10)

fig, ax = plt.subplots()

plt.suptitle("KDE Plot Poisson Distributions")
sns.kdeplot(poissondist1, ax = ax, color = '#FF7B9C', label = "Lambda: 1")
ax2 = ax.twinx()
ax2.get_yaxis().set_ticks([])
sns.kdeplot(poissondist2, ax = ax2, color = '#607196', label = "Lambda: 2")
ax3 = ax.twinx()
ax3.get_yaxis().set_ticks([])
sns.kdeplot(poissondist3, ax = ax3, color = '#FFC759', label = "Lambda: 3")
lines, labels = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
lines3, labels3 = ax3.get_legend_handles_labels()
ax2.legend(lines + lines2 + lines3, labels + labels2 + labels3, loc=0)
plt.savefig("poisson.png")
plt.close()

The resulting KDE plot is given below.

![poisson.png](attachment:515a8b8b-b683-45de-aabe-ae0dfe8f25a6.png)

In [None]:
As seen, the highest probability values are clustered around the Lamb