# Programming for Data Analysis Assignment 2020

Module: Programming For Data Analysis <br>
Student Name: John Kavanagh  <br>
Lecturer: Brian McGinley  <br>

The purpose of this assignment is to dissect the structure of the numpy.random package and to explain the interdependencies of its sections through practical code examples as well as a visual component. 

The areas to be reviewed and analysed are laid out as follows, similar in pattern to the official Numpy documentation.

### Area for discussion

- Explain the overall purpose of the package.
- Explain the use of the simple random data functions. 
- Explain the use of the permutations functions. 
- Explain the use and purpose of at least five distribution functions. 
- Explain the use of seeds in generating pseudorandom numbers.


While extensive information and research are available in this area, the primary source for this investigation is the official Numpy documentation, listed [here](https://numpy.org/doc/stable/reference/random/index.html).

___

### Software Requirements

A number of pieces of software are required in order to run this analysis. These are: 

1. Anaconda
2. Jupyter Notebook
3. Numpy
4. MatPlotLib
5. Seaborn


In [2]:
import matplotlib.pyplot as plt

In [3]:
import numpy as np

___

Having reviewed the layout of the question and distilled what has been requested, and having reviewed the information laid out in the documentation on the Numpy site, the analysis is given the following structure.

### Structure of the Analysis

#### Section 1 Overview of Numpy.Random
- Structure of the package
    - State & Seeding

#### Section 2 Review of Simple Random Data
- Four Key Areas of Simple Random Data

#### Section 3 Review of Permutations
- Permutations and Shuffle

#### Section 4 Leveraging Numpy.Random for Distributions
- Deep dive into distributions and implementation of plotting

#### Section 5 Conclusion
- Recap on Sections 1-3

#### Section 6 Bibliography
- Reference Material

___

### Section 1 Overview of Numpy.Random

#####  Structure of the Numpy.Random Package


The principal purpose of the package is random number generation, as per [Numpy.org](https://numpy.org/doc/stable/reference/random/index.html?highlight=numpy%20random#quick-start). Recent releases of the package include several upgrades. Principal among these is the introduction of the Permuted Congruential Generator 64 (PCG64) as the default pseudorandom number generator. PCG64 is the BitGenerator that supplies the underlying stream of random bits from which the package produces its random numbers [Numpy.Org](https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html).


Within its hierarchical structure, the numpy.random package is divided into two core parts, *BitGenerators* and *Generators*. 
We will first review the BitGenerator, via some sample code to demonstrate the package's capabilities.<br>

The relationship between the BitGenerator and the Generator allows for random sampling to occur across numerous distributions.

##### BitGenerator

As per the official documentation on [numpy.org](https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.BitGenerator.html#numpy.random.BitGenerator), the BitGenerator performs a very limited number of tasks within the numpy.random package, but they are crucial nonetheless. Chief among these are the management of the state and the functions that produce random 32-bit and 64-bit values.

The BitGenerator creates sequences of random bits for the package. These sequences are in turn used by the Generator to sample from distributions such as the Zipf, Binomial and Gaussian distributions. These distributions will be explained later in the analysis under Section 4. 

The default BitGenerator for the most recent numpy package release is the PCG64. 

For further information on this, please review [here](https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html#numpy.random.PCG64)

###### State & Seeding
Now that we understand that the BitGenerator provides a stream of numbers, we need to understand how seeds are used to facilitate the generation of these numbers, which are known as pseudorandom numbers. In order for the BitGenerator to produce random numbers, it first needs to have its state initialised by a seed. This is handled by SeedSequence. <br>

SeedSequence is responsible for setting this initial state. It does so by mixing the sources of entropy it is given (the seed) into the values that initialise the BitGenerator, as described at this [link](https://numpy.org/doc/1.19/reference/random/bit_generators/generated/numpy.random.SeedSequence.html#numpy.random.SeedSequence).
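
The role of the seed can be illustrated with a minimal sketch. The seed value 1234 and the variable names are assumptions made for the example, and `generate_state` is used only to peek at the values that SeedSequence derives from the seed.

```python
import numpy as np
from numpy.random import Generator, PCG64, SeedSequence

# Arbitrary seed chosen for illustration only.
seed_seq = SeedSequence(1234)

# SeedSequence turns the seed into 32-bit words that can initialise a BitGenerator.
state_words = seed_seq.generate_state(4)

# Two Generators built from the same seed start in the same state...
rng_a = Generator(PCG64(SeedSequence(1234)))
rng_b = Generator(PCG64(SeedSequence(1234)))
draws_a = rng_a.random(3)
draws_b = rng_b.random(3)

# ...so they produce identical pseudorandom numbers.
same_stream = np.array_equal(draws_a, draws_b)
print(state_words)
print(same_stream)
```

Because the output depends only on the seed, reusing a seed is what makes a pseudorandom analysis repeatable.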

In [59]:
import numpy as np

In [60]:
np.random.BitGenerator

numpy.random.bit_generator.BitGenerator

In [61]:
from numpy.random import Generator, PCG64, SeedSequence
sg = SeedSequence(1234)
rg = [Generator(PCG64(s)) for s in sg.spawn(10)]

This is the source for a BitGenerator test.
Source [numpy.org](https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.BitGenerator.random_raw.html#numpy.random.BitGenerator.random_raw)<br> 
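
As a hedged illustration of what the BitGenerator itself produces, the sketch below asks PCG64 for raw values through `random_raw` and inspects its state. The seed 1234 and the variable names are assumptions made for the example.

```python
import numpy as np
from numpy.random import PCG64

# Arbitrary seed chosen for illustration only.
bit_gen = PCG64(1234)

# random_raw returns the unprocessed unsigned integers behind every distribution.
raw_values = bit_gen.random_raw(3)

# The state is exposed as a dictionary that names the algorithm in use.
state = bit_gen.state
print(raw_values.dtype)
print(state["bit_generator"])
```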

In order to initiate a new instance of a Generator, we use the following code and then call a method on it to draw values from a distribution.

In [62]:
from numpy.random import default_rng

In [63]:
# The next cells create Generators with and without a fixed seed and draw sample values from them.

In [64]:
rg = default_rng(12)
rg.random()

0.2508244581084461

In [65]:
rng = np.random.default_rng()

In [66]:
vals = rng.standard_normal(10)
more_vals = rng.standard_normal(10)

In [67]:
from numpy.random import Generator, PCG64
rg = Generator(PCG64(12345))
rg.standard_normal()

-1.4238250364546312

##### Generator



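Following on from the BitGenerator, we have the Generator, which is actually a container for the BitGenerator. The Generator takes the stream of bits that the BitGenerator provides and transforms it into samples from the requested distribution.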

In [68]:
np.random.default_rng()

Generator(PCG64) at 0x23BA0291040

In [69]:
# default_rng is a function that constructs a new Generator using the default BitGenerator (PCG64)
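
The container relationship between the two objects can be sketched as follows. The seed 42 and the variable names are assumptions made for the example.

```python
from numpy.random import Generator, PCG64, default_rng

# Build the BitGenerator first, then wrap it in a Generator.
bit_gen = PCG64(42)
gen = Generator(bit_gen)

# The Generator keeps a reference to the BitGenerator it was given.
holds_same_object = gen.bit_generator is bit_gen

# default_rng performs both steps at once, using PCG64 as the BitGenerator.
default_gen = default_rng(42)
default_kind = type(default_gen.bit_generator).__name__
print(holds_same_object, default_kind)
```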

___

### Section 2 Review of Simple Random Data
#### Simple Random Data

We have seen from the structured nature of this package that there are a number of interdependencies that allow it to run successfully. Next, as part of our analysis of the package, we have the area of simple random data. 
There are four core functions in this part of the package. They are:  

1. Integers
2. Random
3. Choice
4. Bytes

Each function has its own capabilities that distinguish it from the last, as we will see in due course.

These can be reviewed more extensively from [Numpy.org](https://numpy.org/doc/stable/reference/random/generator.html)

In [70]:
rng = np.random.default_rng()

We use the code in the cell above to set ourselves up to test the code that follows. The rng on the left-hand side stands for random number generator.

##### 1.Integers 

Integers are a data type that can be generated with the numpy.random package. As with a lot of other functions, the integers function depends on an understanding of its parameters, i.e. what information we place into the function in order to generate an output. 

In [10]:
rng.integers(0, 10, size=4)

array([7, 4, 5, 4], dtype=int64)

In the code above, we can see that three parameters have been passed into the function.
Firstly, we have the 'low' integer, which is the lowest value that can be generated in the output. In this case it is 0. Next, we have the 'high' integer, which is exclusive. This means that the random numbers generated will not include, in this instance, 10. Lastly, we have the third parameter, size. Size, in this context, is the number of random numbers to be generated in the output.
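
The effect of the exclusive upper bound can be checked with a short sketch. The seed and the sample size of 1000 are arbitrary assumptions; `endpoint=True` is the keyword that makes the upper bound inclusive.

```python
import numpy as np

rng = np.random.default_rng(2020)  # arbitrary seed for illustration

# Default behaviour: high is exclusive, so values fall between 0 and 9.
exclusive = rng.integers(0, 10, size=1000)

# With endpoint=True the value 10 may also be returned.
inclusive = rng.integers(0, 10, size=1000, endpoint=True)

print(exclusive.min(), exclusive.max())
print(inclusive.min(), inclusive.max())
```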

For further examples of this case, please review the [Numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.integers.html#numpy.random.Generator.integers) site.

##### 2.Random

As was true for Integers, we depend on the parameters in order to get the return value from the Random function. In this instance, the random function returns floating point numbers in the half-open interval [0.0, 1.0). Should you require more information on floating points, they can be reviewed at [TechTerms](https://techterms.com/definition/floatingpoint#:~:text=As%20the%20name%20implies%2C%20floating,decimal%20places%20are%20called%20integers.). 

For further examples of this case, please review the [Numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.random.html#numpy.random.Generator.random)

In [11]:
type(rng.random())

float

In [12]:
3 * rng.random((5,)) 

array([0.57653239, 1.10099779, 1.40455853, 2.44300135, 2.84391021])
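
Multiplying the result, as in the cell above, stretches the interval, so `3 * rng.random((5,))` gives five floats between 0 and 3. A minimal sketch of the general pattern for an interval from `low` to `high` (both hypothetical names chosen for this example) might look like this:

```python
import numpy as np

rng = np.random.default_rng(11)  # arbitrary seed for illustration

# Hypothetical bounds for the target interval.
low, high = 2.0, 5.0

# random() returns floats in [0.0, 1.0)...
unit = rng.random(1000)

# ...which are stretched by the width of the interval and shifted by its lower bound.
scaled = (high - low) * unit + low

print(unit.min(), unit.max())
print(scaled.min(), scaled.max())
```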

##### 3.Choice

Choice generates a random sample from a given array, returning either a single item or an array as its output. What differentiates it from the previous functions is that we can pass an array into the first parameter for it to sample from. If an integer is passed instead, the sample is taken from np.arange of that integer, so the function will not take in an integer of negative value.  

For more information on the Choice subsection , please review the following link from [Numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice)

In [13]:
rng.choice(9, (2,5))

array([[0, 4, 2, 5, 2],
       [1, 6, 0, 6, 0]], dtype=int64)

In the example listed above, we have our integer value inserted in the code, in this case 9, so the values are drawn from 0 to 8. In the second parameter we have a tuple of type int, which forms the structure of the output. Working with the brackets, we have the number of arrays (rows) that will be generated, 2 in this case, as well as the number of values, 5, that will be randomly chosen within each of these arrays.

As mentioned, we can also pass an array into the function, including one made up of strings. If we create a list of data type string, as follows:

In [14]:
Cars_Manu = ['Audi', 'Seat', 'VW']

In [15]:
rng.choice(Cars_Manu, 5)

array(['Seat', 'VW', 'VW', 'Seat', 'Audi'], dtype='<U4')

We can see from the array created above that there are multiple instances of the same manufacturer's name. This is because choice samples with replacement by default, and the requested size of 5 exceeds the three items in the variable Cars_Manu.
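
The sampling behaviour of choice can be explored further with the `replace` and `p` parameters. The sketch below is illustrative only: the seed, the probabilities and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed for illustration
cars = ['Audi', 'Seat', 'VW']

# replace=False guarantees that no manufacturer is picked twice.
unique_pick = rng.choice(cars, 3, replace=False)

# p assigns a probability to each item; these weights are made up for the example.
weighted = rng.choice(cars, 1000, p=[0.8, 0.1, 0.1])
audi_share = np.mean(weighted == 'Audi')

# Asking for more unique items than exist raises a ValueError.
try:
    rng.choice(cars, 5, replace=False)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False

print(unique_pick, audi_share, oversample_allowed)
```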

##### 4.Bytes

The last of the four functions within the simple random data section, bytes, is used to return just that, random bytes. 
This function takes only one parameter, length, which is of type integer. 

For more information on this subsection, please review the link via [numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.bytes.html#numpy.random.Generator.bytes)


In [16]:
rng.bytes(7)

b'\xdd\xa5\xed\xad:U\xc4'

This is a bytes object of length 7, as hard coded into the line.
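
A brief sketch of how the returned bytes can be inspected follows. The seed and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed for illustration

# Request 7 random bytes.
raw = rng.bytes(7)

# Each byte is an integer between 0 and 255 when viewed as a list.
as_ints = list(raw)

print(type(raw).__name__, len(raw))
print(as_ints)
```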

____

### Section 3 Review of Permutations



Within this segment of the package, we are going to review two ways of reordering an array. 
They are:
- Shuffle
- Permutations

To permute, simply means to change the order of something, as per [Merriam Webster](https://www.merriam-webster.com/dictionary/permutate#:~:text=%3A%20change%2C%20interchange%20especially%20%3A%20to%20arrange%20in%20a%20different%20order)

##### Shuffle

In order to shuffle a list of integers, we first create an array of integers and then pass it to the shuffle function, as follows.

For more information on this, please review [Numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html#numpy.random.shuffle)

For further reading on the Generator version of the shuffle function, please visit [here](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.shuffle.html#numpy.random.Generator.shuffle).

In [17]:
Shuf = np.arange(20)

In [18]:
np.random.shuffle(Shuf)

In [19]:
Shuf

array([ 8,  6, 11,  0, 14, 10, 18,  5, 13,  4,  7,  1, 12,  3,  2, 19, 16,
       17,  9, 15])

We can see from the array that has been returned from the codeblock, that there are 20 integers in the array. You will also see that there is no number 20 in this list, as the list has commenced from the number 0.

In [20]:
Shuf = np.arange(15).reshape((5, 3))
np.random.shuffle(Shuf)
Shuf

array([[12, 13, 14],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [ 0,  1,  2],
       [ 3,  4,  5]])

We can also add in the reshape function, so that the integers are arranged into multiple arrays, based on the factors of the total number of elements. In the example displayed above, we have divided the 15 numbers into 5 separate arrays of 3 numbers each and then shuffled the order of those arrays. The numbers within each array keep their order, as shuffle only reorders along the first axis. The reshape is only possible as 5 multiplied by 3 is equal to the number set out in the arange parameter.
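
The row-wise behaviour of shuffle can be verified with a short sketch. Note that it uses the Generator's `shuffle` method rather than the `np.random.shuffle` function used in the cells above; the seed and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for illustration
grid = np.arange(15).reshape((5, 3))

# shuffle works in place and returns None.
returned = rng.shuffle(grid)

# Within every row the numbers are still consecutive, so only the row order changed.
rows_intact = bool(np.all(np.diff(grid, axis=1) == 1))

print(returned, rows_intact)
print(grid)
```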

##### Permutations

Similar in its basis to shuffle, the permutation function will accept an integer as well as an array as a parameter. Unlike shuffle, it returns a reordered copy rather than changing the original array in place. 

For more information on this topic, please review [Numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.permutation.html#numpy.random.Generator.permutation)

In [21]:
# If we take the basis for our analysis as: 
rng = np.random.default_rng()

In [22]:
# and we try to run through a permutation of an integer, 7. 
rng.permutation(7) 

array([1, 3, 4, 5, 2, 0, 6])

In [23]:
# we can also perform the permutation on an array [1, 2, 3, 4, 5, 6, 7]

In [24]:
rng.permutation([1, 2, 3, 4, 5, 6, 7])

array([4, 2, 6, 1, 3, 5, 7])
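
A short sketch of the difference from shuffle: permutation leaves its input untouched and hands back a new array. The seed and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed for illustration
original = np.arange(1, 8)

# permutation returns a reordered copy.
permuted = rng.permutation(original)

print(original)  # still in ascending order
print(permuted)  # the same values in a random order
```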

___

### Section 4 Leveraging Numpy.Random for Distributions

The objective of this exercise is to review at least five different types of distribution.

The five types of distribution that we are going to review are: 

- Binomial
- Chi-Square
- Standard T
- Normal (Gaussian) Distribution
- Zipf


##### Binomial Distribution Overview

For more information [Numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.binomial.html#numpy.random.Generator.binomial)

The binomial distribution describes the number of successes in a series of trials where each trial has two possible outcomes. The most common example of a binomial distribution is the toss of a coin. We can set up our analysis as follows:

In [25]:
# number of trials
n = 10

In [26]:
# Probability of event occurring
p = 0.5

In [27]:
result = rng.binomial(n, p, 100)

In [28]:
result

array([6, 2, 3, 7, 3, 1, 5, 4, 5, 6, 7, 3, 6, 4, 6, 3, 5, 3, 6, 3, 4, 4,
       7, 6, 5, 6, 4, 6, 4, 5, 5, 3, 5, 4, 5, 6, 3, 5, 6, 3, 5, 6, 3, 5,
       5, 2, 8, 5, 4, 3, 2, 5, 4, 7, 6, 6, 9, 4, 4, 6, 4, 4, 6, 4, 7, 6,
       5, 3, 7, 4, 6, 6, 5, 4, 4, 5, 6, 7, 6, 6, 6, 7, 5, 3, 3, 4, 5, 6,
       5, 6, 5, 7, 8, 6, 6, 4, 5, 4, 6, 6], dtype=int64)

In [29]:
# We can see from the parameters above the number 100. This represents the number of times the experiment was repeated. 
# That is, there were 10 tosses of a coin completed 100 times, and each value in the result is the number of successes in one set of 10 tosses.
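
As a rough check on the coin toss example, the results can be tabulated and compared with the expected mean of n multiplied by p. The sketch below uses a larger sample of 10,000 experiments and an arbitrary seed, both assumptions made to keep the comparison stable.

```python
import numpy as np

rng = np.random.default_rng(10)  # arbitrary seed for illustration
n, p = 10, 0.5

# 10 tosses of a fair coin, repeated 10,000 times.
result = rng.binomial(n, p, 10_000)

# counts[k] is the number of experiments that produced exactly k successes.
counts = np.bincount(result, minlength=n + 1)
sample_mean = result.mean()

print(counts)
print(sample_mean, n * p)
```

The counts are the heights a histogram of the result would show, peaking around 5 successes.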

##### Plotting the data

##### Chi-Square Distribution Overview

The chi-square distribution is defined by its degrees of freedom. A chi-square distribution with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal deviates. 

For more information on this, please review [OnlineStatbook](http://onlinestatbook.com/2/chi_square/distribution.html). The code in the block below was retrieved from: [numpy.Org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.chisquare.html#numpy.random.Generator.chisquare)

In [30]:
rng.chisquare(12,4)

array([ 8.70964887, 11.86211936,  4.83195326,  6.00539237])
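
The link between the chi-square distribution and squared standard normal values can be sketched numerically. The seed, the sample size and the variable names are assumptions made for the example; both sample means should land close to the degrees of freedom, 12.

```python
import numpy as np

rng = np.random.default_rng(30)  # arbitrary seed for illustration
k = 12  # degrees of freedom, as in the cell above

# Build chi-square values by hand: sum the squares of k standard normal draws.
normals = rng.standard_normal((100_000, k))
manual = (normals ** 2).sum(axis=1)

# Draw directly from the chi-square distribution for comparison.
direct = rng.chisquare(k, 100_000)

print(manual.mean(), direct.mean())
```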

##### Plotting the data

##### Standard T 

The Standard T distribution builds in some ways on Chi-Square, in that it also encompasses degrees of freedom. 
The T-Distribution is also referred to as the Student's T Distribution, according to [Stat Trek](https://stattrek.com/probability-distributions/t-distribution.aspx). This distribution is brought in when the sample is not as large as would be expected and the population variance is unknown. 

> Side Note: "The derivation of the t-distribution was first published in 1908 by William Gosset while working for the Guinness Brewery in Dublin. Due to proprietary issues, he had to publish under a pseudonym, and so he used the name Student." as retrieved from [numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.standard_t.html#numpy.random.Generator.standard_t) 

The t score information can be retrieved from the site [Numpy](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.standard_t.html#numpy.random.Generator.standard_t)

In [31]:
Collection = np.array([123, 456, 789, 987, 654, 321])

In [32]:
s = rng.standard_t(10, size=100000)

# where 10 is the degrees of freedom

In [33]:
np.mean(Collection)

555.0

In [34]:
Collection.std(ddof=1)

316.9738159533055

In [35]:
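# t statistic: (sample mean - hypothesised mean) / standard error of the mean
# 7725 is the hypothesised mean carried over from the example in the Numpy documentation
t = (np.mean(Collection)-7725)/(Collection.std(ddof=1)/np.sqrt(len(Collection)))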
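
One way the t statistic and the simulated t values could be combined is to estimate a p-value. The sketch below is illustrative only: the hypothesised mean of 500 is a made-up value, the seed is arbitrary, and the degrees of freedom are set to the sample size minus one (5) rather than the 10 used above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for illustration
collection = np.array([123, 456, 789, 987, 654, 321])

# Hypothetical population mean to test against (an assumption for this example).
hypothesised_mean = 500

# t statistic: distance of the sample mean from the hypothesis in standard errors.
standard_error = collection.std(ddof=1) / np.sqrt(len(collection))
t = (collection.mean() - hypothesised_mean) / standard_error

# Simulate the t distribution with n - 1 degrees of freedom.
s = rng.standard_t(len(collection) - 1, size=100_000)

# Two-sided p-value: share of simulated values at least as extreme as t.
p_value = np.mean(np.abs(s) >= abs(t))

print(t, p_value)
```

A large p-value such as this one would suggest the sample is consistent with the hypothesised mean.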

##### Plotting the data

##### Normal (Gaussian) Distribution

The normal distribution is used when there is a need to describe a large array of observations spread across a spectrum. These, when placed in a plot, are observed as a bell curve. Such is the importance of this bell curve that, in order to draw it, we need to define the centre and the spread of the curve, i.e. set a mean and a standard deviation.

For more information on the normal distribution, please review [numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.normal.html#numpy.random.Generator.normal)

In [36]:
a = 0

# this can be our mean

In [37]:
b = 0.2

# this can be our standard deviation


In [38]:
# Therefore

c = rng.normal(a, b, 100)

In [39]:
# Validate mean and the variance 

abs(a - np.mean(c))

0.02096754907380681

In [40]:
# Compare the requested standard deviation b with the sample standard deviation of c
abs(b - np.std(c, ddof=1))
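
The validation of the mean and standard deviation can be sketched with a larger sample, which brings both estimates closer to the requested values. The seed, the sample size and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(36)  # arbitrary seed for illustration
mean, std_dev = 0, 0.2

sample = rng.normal(mean, std_dev, 100_000)

# Both differences should be close to zero.
mean_error = abs(mean - sample.mean())
std_error = abs(std_dev - sample.std(ddof=1))

# Roughly 68% of a normal sample lies within one standard deviation of the mean.
within_one_sd = np.mean(np.abs(sample - mean) <= std_dev)

print(mean_error, std_error, within_one_sd)
```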

##### Plotting the data

##### Zipf Distribution

The Zipf distribution is based around the principles of Zipf's Law, which states that the frequency of an item is inversely proportional to its rank in a frequency table. 

For more information on the Zipf Distribution, please review [numpy.org](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.zipf.html#numpy.random.Generator.zipf)

In [41]:
d = 4

In [42]:
# The distribution parameter must be greater than 1, so we pass d, which is 4
Zipf = rng.zipf(d, 1000)
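
Numpy requires the Zipf distribution parameter to be greater than 1. A minimal sketch of a valid call, and of the error raised by an invalid one, might look like this; the seed and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for illustration
d = 4  # distribution parameter, must be greater than 1

samples = rng.zipf(d, 1000)

# With a parameter of 4 the value 1 dominates, in line with Zipf's Law.
share_of_ones = np.mean(samples == 1)

# A parameter of 1 or less is rejected with a ValueError.
try:
    rng.zipf(0, 10)
    accepted = True
except ValueError:
    accepted = False

print(share_of_ones, accepted)
```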

##### Plotting the data

___

### Section 5 Conclusion

___

### Section 6 Bibliography


<br>

Official Numpy site, November 2020

https://numpy.org/doc/stable/reference/random/index.html?highlight=numpy%20random#module-numpy.random
<br>
Markdown Style Guide, November 2020

https://www.markdownguide.org/basic-syntax/
<br>
Binomial Distribution 

https://www.datacamp.com/community/tutorials/probability-distributions-python
<br>
Explanation of Tuple 

https://www.programiz.com/python-programming/tuple
<br>
Explanation of Chi-Square Rule

http://onlinestatbook.com/2/chi_square/distribution.html
<br>

Explanation of Student's T Distribution

https://stattrek.com/probability-distributions/t-distribution.aspx
<br>

Explanation of where PCG64 originates
https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html
<br>


 