# `numpy`

The following exercises test some of your new `numpy` skills.  In some cases there is more than one solution.  Feel free to experiment with different methods.

## Imports

In [4]:
import numpy as np

## Exercise 1

In this exercise, you will use `numpy` to generate a random variable following a standard normal distribution.  You will then use the statistical functions built into numpy to analyse the distribution of the variable.

**Task**:
* Take 10,000 samples from the standard normal distribution
* Create a function called `basic_descriptives` that returns the mean, standard deviation, and 1st/99th percentiles of a numpy array parameter.
* Call your function and print out the results.

**Hints**:
* You can assume the numpy array passed to the function contains float data, or you could check the type and raise an exception.


In [None]:
# your code here...

In [10]:
# example solution

SAMPLE_SIZE = 10_000

# create a random number generator
rng = np.random.default_rng(42)

# generate numpy array of size SAMPLE_SIZE using standard normal
samples = rng.normal(size=SAMPLE_SIZE)
print(type(samples))
print(samples.shape)

<class 'numpy.ndarray'>
(10000,)


In [13]:
def basic_descriptives(data):
    """
    Returns mean, stdev, and 1st and 99th percentile of a 1D numpy.ndarray
    
    Assumes `data` is numpy array of floats.
    
    Parameters:
    ------------
    data: numpy.ndarray 
        numeric data to analyse
        
    Returns:
    --------
    (float, float, float, float)
    """
    mean = data.mean()
    std = data.std()
    per_1st = np.percentile(data, 1) 
    per_99th = np.percentile(data, 99)
    
    return mean, std, per_1st, per_99th


results = basic_descriptives(samples)
print(results)

(-0.01024987541401165, 1.006285768537041, -2.3738111979173713, 2.3558409670159173)
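The hint about checking the input could be handled along the following lines. This is only a sketch: the function name `basic_descriptives_checked` and the choice of `TypeError` are assumptions made for illustration and are not part of the exercise. It also passes a list of percentiles to `np.percentile` so that both values are computed in a single call.

```python
import numpy as np


def basic_descriptives_checked(data):
    """
    Hypothetical variant of basic_descriptives that validates its input.

    Raises a TypeError if `data` is not a numpy.ndarray of floats.
    """
    if not isinstance(data, np.ndarray):
        raise TypeError("data must be a numpy.ndarray")
    if not np.issubdtype(data.dtype, np.floating):
        raise TypeError("data must contain float values")

    # np.percentile accepts a sequence of percentiles
    per_1st, per_99th = np.percentile(data, [1, 99])
    return data.mean(), data.std(), per_1st, per_99th


rng = np.random.default_rng(42)
samples = rng.normal(size=10_000)
print(basic_descriptives_checked(samples))
```

Note that `data.std()` uses `ddof=0` by default, i.e. it returns the population standard deviation.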


## Exercise 2

Reuse the data generated in exercise 1.  You are going to analyse the tails of the distribution you generated.

**Task:**
* Select only the samples that have a value greater than or equal to +1.96 or less than or equal to -1.96
* Determine the proportion of data that falls into these tails.
* Are these proportions what you expect for the standard normal distribution?

**Hints**:
* You may want to create one or two general functions so that you can vary the cut-offs.

In [None]:
# your code here ...

**Example solution:**

It is very simple to work with `numpy` arrays containing numeric data. For example, if we want to find all of our samples that are greater than or equal to +1.96, we use:
    

In [22]:
# result is an array of bools
result = samples >= 1.96

print(result.shape)
print(type(result))
print(result)

(10000,)
<class 'numpy.ndarray'>
[False False False ... False False False]


The code returns a new `numpy.ndarray` that contains boolean (True/False) values. The value at array index i is True if the corresponding value at index i in `samples` is >= 1.96 and False otherwise. If we had used a Python list we would have needed to loop through all of the list items and perform the check ourselves.
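A boolean array like this can also be used as a mask to select the tail samples themselves. The sketch below regenerates the data with the same seed as exercise 1 so that it is self-contained; the variable names `in_tails`, `tail_samples` and `tail_list` are chosen here for illustration. It also shows the loop-based equivalent for a Python list.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(size=10_000)

# combine two boolean arrays with | (element-wise OR)
# the parentheses are required because | binds more tightly than >= and <=
in_tails = (samples >= 1.96) | (samples <= -1.96)

# boolean indexing keeps only the elements where the mask is True
tail_samples = samples[in_tails]

# the equivalent with a Python list needs an explicit loop
tail_list = [x for x in samples.tolist() if x >= 1.96 or x <= -1.96]

print(tail_samples.shape)
print(len(tail_list))
```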

Let's create some generalised functions that return the proportion of values in our data set that are greater than or less than a user-specified value.

To do that we need to know that

```python
False == 0
True == 1
```

Therefore we can take the sum of our boolean array to find out how many array elements are greater or less than a user-specified value, i.e.

In [23]:
(samples >= 1.96).sum()

257
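Because `True` counts as 1, the mean of a boolean array gives the proportion of `True` values directly. The following sketch, which regenerates the data with the same seed so it can run on its own, checks that the sum, `np.count_nonzero` and the mean all agree. It is one possible shortcut, not the only way to do it.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(size=10_000)

mask = samples >= 1.96

# two ways to count the True values
count = mask.sum()
count_nz = np.count_nonzero(mask)

# proportion of True values: count / n, or simply the mean of the mask
prop_from_count = count / samples.shape[0]
prop_from_mean = mask.mean()

print(count, count_nz)
print(prop_from_count, prop_from_mean)
```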

In [24]:
def prob_great_than_or_equal_to(data, x):
    '''
    Return the proportion of the dataset that is greater than or equal to x
    
    Parameters:
    -----------
    data: numpy.ndarray 
        numeric data to analyse
    x: float
        Lower cut-off. 
        
    Returns:
    --------
    float
    '''
    return (data >= x).sum()/data.shape[0]


def prob_less_than_or_equal_to(data, x):
    '''
    Return the proportion of the dataset that is less than or equal to x
    
    Parameters:
    -----------
    data: numpy.ndarray 
        numeric data to analyse
    x: float
        Upper cut-off. 
        
    Returns:
    --------
    float
    '''
    return (data <= x).sum()/data.shape[0]

p1 = prob_great_than_or_equal_to(samples, 1.96)
p2 = prob_less_than_or_equal_to(samples, -1.96)

print(p1, p2)
print(1 - (p1+p2))


0.0257 0.0257
0.9486


Our test of these functions shows us that around 95% of the data lie between -1.96 and +1.96, with roughly 2.5% in each tail, which is what we would expect for the standard normal distribution.
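To see where the expected figures come from, the theoretical tail probability of the standard normal can be computed with the standard library alone, using the identity P(Z >= x) = 0.5 * erfc(x / sqrt(2)). The helper name `normal_upper_tail` is an assumption made for this sketch.

```python
import math


def normal_upper_tail(x):
    """Return P(Z >= x) for a standard normal Z using the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))


upper = normal_upper_tail(1.96)

# the distribution is symmetric, so the lower tail has the same probability
both_tails = 2 * upper

print(round(upper, 4))           # 0.025
print(round(1 - both_tails, 4))  # 0.95
```

The sample proportions will differ slightly from these values because of sampling variation.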