<center> <h1> Programming for Data Analytics 2021 NumPy Assignment </h1> </center>

<div align="center"> <b> Student Name: Kate McGrath </b> </div>
<div align="center"> <b> Student Number: G00398908 </b>  </div>
<div align="center"> <b> Submission Date: 19/11/2021 </b> </div>

<center> <h2> Part One: Purpose of NumPy Random Package </h2> </center>

<center> <h3> 1.1. Introduction to NumPy (Numerical Python) </h3> </center>





NumPy, or Numerical Python, is one of the most widely used scientific libraries in Python programming [1].

An open-source library, created in 2005 by Travis Oliphant [2], it has become the de facto standard for working with numerical data in Python, and is used by everyone from beginners to experienced researchers at the cutting edge of scientific and commercial R&D [3]. Additionally, several other Python analytical libraries have been built on top of NumPy, including Pandas, Scikit-learn and Matplotlib [1]. 

<center> <h4>Multidimensional Arrays</h4> </center>

NumPy's core object is the multidimensional array (ndarray), a fast and flexible container built to facilitate batch numerical operations on blocks of data. The key advantage of using NumPy arrays rather than Python lists for numerical computation is that arrays are designed for efficiency on large volumes of data.

NumPy arrays can be considered as a grid of values. In NumPy, the directions along which the grid extends (for example, rows and columns) are referred to as dimensions. A vector has a single dimension, a matrix has two dimensions, and a tensor has three or more dimensions. Each dimension has a corresponding axis, numbered from 0. Axes are used to locate and operate on elements along a specific dimension.

Run the code below to see examples of one, two and three-dimensional arrays.


In [8]:
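import numpy as np

oneDArray = np.array([1, 2, 3])
print(oneDArray)

twoDArray = np.array([(1, 2), (3, 4)])
print(twoDArray)

# A three-dimensional array needs three levels of nesting: here, two 2 x 2 blocks
threeDArray = np.array([[(1, 2), (3, 4)], [(5, 6), (7, 8)]])
print(threeDArray)

[1 2 3]
[[1 2]
 [3 4]]
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]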
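
As a small sketch of how dimensions and axes relate (the array names here are illustrative and not part of the original notebook), the `ndim` and `shape` attributes report the number of dimensions and the length of each axis, while the `axis` argument selects the dimension an operation runs along:

```python
import numpy as np

# A 2 x 3 matrix: axis 0 runs down the rows, axis 1 runs across the columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.ndim)     # number of dimensions: 2
print(matrix.shape)    # length of each axis: (2, 3)

# Summing along axis 0 collapses the rows, giving one total per column
column_totals = matrix.sum(axis=0)
print(column_totals)   # [5 7 9]

# Summing along axis 1 collapses the columns, giving one total per row
row_totals = matrix.sum(axis=1)
print(row_totals)      # [ 6 15]

# A tensor with three dimensions: 2 blocks, each with 2 rows and 2 columns
tensor = np.arange(8).reshape(2, 2, 2)
print(tensor.ndim, tensor.shape)   # 3 (2, 2, 2)
```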


<center> <h4>NumPy Arrays vs Python Lists</h4> </center>

NumPy arrays are significantly faster than Python lists for the following reasons:

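1. NumPy arrays are homogeneous, i.e. they consist of a single data type, and their elements are stored in one contiguous block of memory. To access the next element, the programme simply moves to the next memory address. In contrast, Python lists are heterogeneous (not confined to a single data type): a list stores references to separate objects that may be scattered across memory, and each element's type must be checked at run time. Both of these factors contribute to processing overhead.
2. NumPy operations are vectorized: an operation is applied to a whole array in a single call that runs a compiled loop, rather than being interpreted element by element in Python. On suitable hardware these loops can use SIMD instructions to process several elements at once, and some linear algebra routines may run on multiple threads through the underlying BLAS library.
3. NumPy's core routines are implemented in compiled languages, chiefly C, and it can call linear algebra libraries written in C or Fortran. Compiled code of this kind generally has a much shorter execution time than the equivalent interpreted Python.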

In addition to the speed advantages, NumPy arrays allow element-wise operations across the whole array without requiring explicit loops, and consume less memory than their Python list counterparts. The code below illustrates the performance advantages of NumPy arrays over Python lists, in terms of speed and memory usage.

In [5]:



import numpy as np
import time
from sys import getsizeof

size = 1000000

# Build two real Python lists (range() on its own returns a lazy range object, not a list)
list1 = list(range(size))
list2 = list(range(size))

# Record the start time so the elapsed time can be calculated afterwards
startTime = time.time()

newlist = []

# zip pairs each list1 value with its corresponding list2 value;
# each pair is multiplied and the product is appended to newlist
for a, b in zip(list1, list2):
    newlist.append(a * b)

TimeNow = time.time()
TimeTaken = TimeNow - startTime
print("Time taken to complete in seconds: {}".format(TimeTaken))

#Declaring Arrays
array1 = np.arange(size)  
array2 = np.arange(size)

startTimeNumpy = time.time()

NewArray = array1* array2

TimeNowNumpy = time.time()
TimeTakenNumpy = TimeNowNumpy - startTimeNumpy
print("Time taken to complete in seconds: {}".format(TimeTakenNumpy))
print("Time difference in seconds: {}".format(TimeTaken - TimeTakenNumpy))


# Compare the memory consumption of the NumPy array and the Python list
# (getsizeof on a list counts only the list's references, not the integer objects themselves)
totalMemoryList = getsizeof(newlist)
totalMemoryNumpy = getsizeof(NewArray)

print("Total memory consumed by Python List: {} bytes".format(totalMemoryList))
print("Total memory consumed by Numpy Array: {} bytes".format(totalMemoryNumpy))
print("Memory difference in bytes: {}".format(totalMemoryList - totalMemoryNumpy))

Time taken to complete in seconds: 0.5237135887145996
Time taken to complete in seconds: 0.0019936561584472656
Time difference in seconds: 0.5217199325561523
Total memory consumed by Python List: 8697456 bytes
Total memory consumed by Numpy Array: 4000104 bytes
Memory difference in bytes: 4697352
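
Because `sys.getsizeof` on a list reports only the list object itself (its array of references) and not the integer objects it points to, the figure above arguably understates the list's footprint. The following is a rough sketch of a fuller comparison. The variable names are illustrative, the `int64` data type is chosen explicitly as an assumption, and exact byte counts will vary with platform and Python version.

```python
import numpy as np
from sys import getsizeof

size = 100_000

values_list = [a * a for a in range(size)]

# Element-wise square with no explicit loop; int64 avoids overflow on
# platforms where NumPy's default integer is 32-bit
values_array = np.arange(size, dtype=np.int64) ** 2

# Size of the list container plus every integer object it references
# (an approximation: CPython caches and shares some small integers)
list_bytes = getsizeof(values_list) + sum(getsizeof(v) for v in values_list)

# nbytes is the size of the array's data buffer: item count x bytes per item
array_bytes = values_array.nbytes

print("List (container + elements): {} bytes".format(list_bytes))
print("Array data buffer: {} bytes".format(array_bytes))
```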


<center> <h4>Getting Started with NumPy</h4> </center>


The only prerequisite to installing NumPy is Python itself. For convenience, the Anaconda distribution is recommended for beginners as it comes with NumPy, Python and other statistical Python packages pre-installed. 

To install NumPy on its own, the following command can be used:

In [19]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In order to use NumPy Random, the library should be imported using the following command:

In [17]:
import numpy as np
from numpy import random

<center> <h3> 1.2. NumPy Random Package </h3> </center>

The NumPy random module provides methods of obtaining random numbers following any of several statistical distributions, as well as methods of choosing random elements from and shuffling arrays. 
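
The three kinds of functionality mentioned above can be sketched as follows. This example assumes the `default_rng` Generator interface (available from NumPy 1.17 onwards) rather than the older `np.random.*` functions; the seed value and array contents are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily

# Draw five samples from a normal distribution (mean 0, standard deviation 1)
normal_samples = rng.normal(loc=0.0, scale=1.0, size=5)

# Choose three elements at random from an existing array, without replacement
letters = np.array(["a", "b", "c", "d", "e"])
picked = rng.choice(letters, size=3, replace=False)

# Shuffle an array in place
numbers = np.arange(10)
rng.shuffle(numbers)

print(normal_samples)
print(picked)
print(numbers)
```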

<center> <h4>Applications of Random Numbers to Data Science</h4> </center>

Randomness is one of the key concepts underpinning the field of statistics. It is a phenomenon where the outcome of a single repetition is uncertain, yet over a large number of repetitions a regular distribution of outcomes is observed, e.g. the bell curve that characterizes the normal distribution.

A random number is a value drawn from a specified set or distribution in such a way that individual outcomes cannot be predicted in advance; in the uniform case, every value in the range has an equal probability of being selected. In practice, computers generate such numbers using a mathematical algorithm. The ability to generate random numbers is paramount within the fields of data science and machine learning.

Sample use cases for random number generation are listed below.

1. The collection of sample data generally entails picking truly random data points from the population; the more random these points are, the more representative they are of the population. However, if the distribution of the data for which the model is being developed is already known, it may not be necessary to collect data if random numbers matching the distribution can be generated. For example, if we know the mileage of a car follows a normal distribution, we can generate random numbers following a normal distribution for a study. This process is called simulation.
2. When developing a machine learning model, the data is split into training and testing sets, commonly with around 80% going to the former and 20% to the latter. This split should be random so that both sets are representative of the data and the estimate of the model's performance is not biased.
3. Neural network algorithms, which are sets of algorithms modelled loosely on the human brain and designed to recognize patterns, also rely on randomness. A neural network is made up of interconnected nodes. The strength of the connection between two nodes is the weight, i.e. the influence that the input should have on the output. Neural network algorithms are initialized with random weights, and with each epoch (one pass of the entire dataset through the neural network) the weights are adjusted to reduce error and increase accuracy.
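
The three use cases above can be sketched in a few lines. This is an illustration only: the mileage mean and standard deviation, the row count and the layer dimensions are made-up figures, and the `default_rng` Generator interface is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# 1. Simulation: 1,000 hypothetical mileage readings, assumed to be
#    normally distributed (mean 15, standard deviation 2 are invented values)
mileage = rng.normal(loc=15.0, scale=2.0, size=1000)

# 2. Random train/test split: shuffle the row indices, then cut at 80%
n_rows = 100
indices = rng.permutation(n_rows)
cut = int(0.8 * n_rows)
train_idx, test_idx = indices[:cut], indices[cut:]

# 3. Random weight initialisation for a layer with 4 inputs and 3 outputs
weights = rng.normal(loc=0.0, scale=0.1, size=(4, 3))

print(mileage.mean(), mileage.std())
print(len(train_idx), len(test_idx))
print(weights.shape)
```

Splitting on shuffled indices rather than on the data itself means the same index arrays can be reused to split features and labels consistently.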


<center> <h4>True Random and Pseudorandom Number Generation</h4> </center>

Two types of random number generation exist: True Random Number Generation (TRNG) and Pseudorandom Number Generation (PRNG).

The majority of computer programs, including the NumPy random package, rely on Pseudorandom Number Generation. This is because computers are deterministic, which means that the same input will always produce the same output. Pseudorandom number generators, such as those used by the random module, are algorithms that generate apparently random, yet reproducible data. The pseudorandom number generator starts with an integer called a seed, and then generates numbers in succession according to the specified distribution. The same seed will provide the same sequence of random numbers, hence if the seed is set at the start of a programme, the numbers generated will be reproducible. The code below sets the seed and creates an array of 5 random integers from 1 to 9 (the upper bound of 10 is exclusive). Each time this code is run the output will be the same.

In [18]:
import numpy as np
from numpy import random
np.random.seed(55)
random_numbers = np.random.randint(1, 10, size = 5)
print(random_numbers)

[8 9 6 8 6]
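
A minimal sketch of the same reproducibility idea, assuming the newer `default_rng` Generator interface rather than the global `np.random.seed` used above: two generators created with the same seed produce the same sequence.

```python
import numpy as np

# Two independent generators created with the same seed
first = np.random.default_rng(55)
second = np.random.default_rng(55)

draw_a = first.integers(1, 10, size=5)
draw_b = second.integers(1, 10, size=5)
print(np.array_equal(draw_a, draw_b))   # True: same seed, same sequence

# Drawing again from a generator continues its sequence from the current
# internal state rather than restarting from the seed
draw_c = first.integers(1, 10, size=5)
print(draw_c)
```

Keeping the seed inside a generator object, instead of setting it globally, means that other code drawing random numbers elsewhere in the programme does not disturb the sequence.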


Conversely, true random number generators cannot rely on computer algorithms alone, as by their nature these will always be deterministic. Instead, TRNGs extract randomness from physical phenomena and introduce it to a computer. Potential sources include the time between keystrokes, the movement pattern of a mouse, or naturally occurring phenomena like radioactive decay. The generation of truly random numbers from these sources relies on the identification of small, unpredictable changes in the measured data. Some fields, like cryptography, require numbers that are unpredictable and so rely on true random sources. However, TRNGs are inefficient compared to PRNGs, and for applications like data simulation and modelling, reproducibility can be an advantage. Hence, PRNGs such as those used in the NumPy random package are more applicable to the field of data science.
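
As a rough illustration of drawing on such entropy sources from Python, the sketch below uses the standard library's `secrets` module and an unseeded NumPy generator. Strictly speaking, what the operating system exposes is a cryptographically secure generator seeded from collected entropy rather than a pure TRNG, so this is an approximation of the idea and not a true hardware source.

```python
import secrets
import numpy as np

# 16 bytes from the operating system's entropy-backed generator
token = secrets.token_bytes(16)
print(token.hex())

# With no seed supplied, default_rng pulls fresh entropy from the
# operating system, so this value will differ between runs
unseeded = np.random.default_rng()
value = unseeded.integers(0, 100)
print(value)
```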

<center> <h4>NumPy Random vs Python Random</h4> </center>

<center> <h2> Part 2:  Simple Random Data and Permutations </h2> </center>

<center> <h3> Simple Random Data </h3> </center>

Simple random data refers to a set of functions that can be used for simple random sampling. A simple random sample takes a small, random portion of the entire population, where each member has an equal chance of being selected. This portion is considered representative of the dataset as a whole.

Four main functions exist for simple random data sampling in NumPy. Using these functions, both discrete and continuous variables can be generated from the uniform distribution. Continuous variables (floats) can take on an infinite number of values within a range and therefore cannot be counted, whereas discrete variables (integers) are countable in a finite amount of time. In addition to generating data from the uniform distribution, the simple random data functions also allow for sampling of random elements from a given array and the generation of random bytes.
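
A brief sketch of the float, sampling and bytes functionality described above, assuming the four functions meant are the Generator methods `integers`, `random`, `choice` and `bytes` (integers are covered in their own subsection). The seed and the example array are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen arbitrarily

# Continuous: floats drawn uniformly from the half-open interval [0.0, 1.0)
floats = rng.random(size=3)

# Sampling: pick two elements at random from a given array
population = np.array([10, 20, 30, 40])
sample = rng.choice(population, size=2)

# Random bytes: a bytes object of the requested length
raw = rng.bytes(4)

print(floats)
print(sample)
print(raw)
```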

<center> <h4> Integers </h4> </center>

This function allows for the generation of discrete variables, i.e. integers, from the uniform distribution. The uniform distribution is defined by two parameters, a and b, where a is the minimum and b is the maximum, and there is an equal probability that any value between a and b will be chosen, i.e. constant probability. The function takes one mandatory parameter, low. If the optional high parameter is also supplied, values are drawn from low (inclusive) up to high (exclusive); if it is omitted, values are drawn from 0 (inclusive) up to low (exclusive). If the optional size parameter is not supplied, the function returns just one value.
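
The parameter behaviour can be sketched as follows, assuming the function in question is `Generator.integers` (the legacy `np.random.randint` treats `low`, `high` and `size` the same way but has no `endpoint` argument). The seed and bounds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)  # seed chosen arbitrarily

# Only low given: values come from 0 (inclusive) up to 5 (exclusive)
below_five = rng.integers(5, size=10)

# low and high given: values from 1 (inclusive) up to 7 (exclusive), like a die
dice = rng.integers(1, 7, size=10)

# endpoint=True makes the upper bound inclusive, so this is also a die roll
dice_inclusive = rng.integers(1, 6, size=10, endpoint=True)

# No size given: a single integer is returned rather than an array
single = rng.integers(1, 7)

print(below_five)
print(dice)
print(dice_inclusive)
print(single)
```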