In [1]:
import numpy as np

import numpy
np = numpy

## Statistics Functions on Arrays

**Numpy** is a Python package that, among other things, provides many useful statistics **functions**. These accept any array-like object (a list, a tuple, or an array) as input and are available in the **np** namespace. Often the same functionality exists both as a Numpy function and as an array method, so you can choose whichever form you prefer.


```python
>>> np.mean([1, 2, 3, 4])
2.5

>>> np.ptp([1, 2, 3, 4])
3
```
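As a small sketch of the function-versus-method choice described above (the variable names here are only illustrative):

```python
import numpy as np

values = np.array([1, 2, 3, 4])

# The same statistic, written as a NumPy function and as an array method.
mean_from_function = np.mean(values)
mean_from_method = values.mean()
print(mean_from_function, mean_from_method)

# The function form also accepts a plain Python list.
# The method form needs an array, because lists have no .mean() method.
mean_from_list = np.mean([1, 2, 3, 4])
print(mean_from_list)
```

Both forms give the same result; which one reads better is mostly a matter of taste.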

The full list of functions in Numpy can be found here: https://docs.scipy.org/doc/numpy/reference/

**Exercise**: Calculate the statistics on the following numbers:

In [3]:
data = [2, 8, 5, 9, 2, 4, 6]
data

[2, 8, 5, 9, 2, 4, 6]

1. Get the mean of the data.

In [4]:
np.mean(data)

5.142857142857143

2. What is the sum of the data?

In [5]:
np.sum(data)

36

3. What is the minimum of the data?

In [6]:
np.min(data)

2

4. The standard deviation?

In [7]:
np.std(data)

2.531435020952764

## Arrays and Vectorization

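**Arrays** are sequences of same-type data points (most often numbers). Numpy lets us operate on the whole sequence without writing a for-loop, a technique called **vectorization**. When the operands have different shapes (for example, an array and a single number), Numpy stretches the smaller one to fit, which is called **broadcasting**.

Besides the array class (**ndarray**, created with **np.array()**), Numpy also includes many math functions, which make analysis much easier. Let's try some out!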
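To make the idea concrete, here is a minimal sketch comparing a vectorized expression with the equivalent explicit loop. The temperature values and variable names are made up for illustration.

```python
import numpy as np

temperatures_c = np.array([0.0, 12.5, 25.0, 37.5])

# Vectorized: one expression converts every element at once.
# The scalars 9, 5, and 32 are broadcast across the whole array.
temperatures_f = temperatures_c * 9 / 5 + 32

# Equivalent explicit loop, shown only for comparison.
looped = []
for c in temperatures_c:
    looped.append(c * 9 / 5 + 32)

print(temperatures_f)
print(looped)
```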

## Numpy Exercises

In [8]:
np.arange(1, 10)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

### Building Arrays

Numpy has some convenient array-building functions as well.  Some commonly used examples are **arange()**, **linspace()**, **zeros()**, **ones()**, and the random number generation functions in **random**.

| Function | Purpose | Example |
| :-----------: | :-------------: | :-------------: |
| **np.arange()** | Makes an array of the integers from a start value up to, but not including, a stop value | np.arange(2, 7) |
| **np.linspace()** | Makes an array of a given length, evenly spaced between two values (both included) | np.linspace(2, 3, 10) |
| **np.zeros()** | Makes an array of all zeros | np.zeros(5) |
| **np.ones()** | Makes an array of all ones | np.ones(3) |
| **np.random.random()** | Makes an array of uniformly distributed random numbers between 0 and 1 | np.random.random(100) |
| **np.random.randn()** | Makes an array of normally-distributed random numbers | np.random.randn(100) |
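
The calls in the table can be tried together in one short sketch. The variable names are illustrative, and the random arrays will differ on every run.

```python
import numpy as np

# arange: the stop value is excluded, so this gives 2, 3, 4, 5, 6.
integers = np.arange(2, 7)

# linspace: both endpoints are included by default; the third argument is the length.
evenly_spaced = np.linspace(2, 3, 10)

zeros = np.zeros(5)
ones = np.ones(3)

# random.random draws uniformly from [0, 1);
# random.randn draws from the standard normal distribution.
uniform = np.random.random(100)
normal = np.random.randn(100)

for name, arr in [("arange", integers), ("linspace", evenly_spaced),
                  ("zeros", zeros), ("ones", ones),
                  ("random", uniform), ("randn", normal)]:
    print(name, arr.shape, arr.dtype)
```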


1. Make an array containing the integers from 1 to 15.

In [9]:
np.arange(1, 16)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

1.5 Make an array of only 6 numbers between 1 and 10, evenly-spaced between them.

In [13]:
np.linspace(1, 10, 6)

array([ 1. ,  2.8,  4.6,  6.4,  8.2, 10. ])

2. Make an array containing 20 zeros.

In [15]:
np.zeros(20)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.])

3. Make an array containing 20 ones!

In [17]:
np.ones(20)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1.])

In [18]:
np.ones(20) * 2

array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
       2., 2., 2.])

4. Generate an array of 20 random numbers from Numpy's **random** submodule, using any function you want.

In [19]:
np.random.random(20)

array([0.75401775, 0.62111712, 0.85247307, 0.42064688, 0.91949136,
       0.13000003, 0.91832225, 0.17010639, 0.05117292, 0.17732298,
       0.88743926, 0.14877498, 0.59106721, 0.70752668, 0.63087351,
       0.02412888, 0.50210328, 0.2904242 , 0.1487731 , 0.93430677])

In [20]:
np.random.choice(["Lela", "Anna"])

'Anna'

5. What is the standard deviation of the integers from 2 to 20 (inclusive)?

In [22]:
np.std(np.arange(2, 21))

5.477225575051661

6. What is the standard deviation of the numbers generated from the np.random.randn() function?  

In [45]:
np.std(np.random.randn(2000000))

1.0004438722182996

## Statistics Methods on Arrays

Arrays have many useful math methods.  For example, to get the mean of an array of numbers:

```python
data = np.random.random(100)
data.mean()
```

**Exercise**: Calculate the statistics on the following numbers:

In [46]:
data = np.arange(2, 7)
data

array([2, 3, 4, 5, 6])

1. Get the mean of the data.

In [47]:
data.mean()

4.0

2. What is the sum of the data?

In [48]:
data.sum()

20

3. The maximum of the data?

In [49]:
data.max()

6

4. The standard deviation of the data?

In [50]:
data.std()

1.4142135623730951

## Arithmetic with Arrays

Arrays can also be added, subtracted, multiplied, and divided.  

For example, to add 10 to all values in an array:

```python
data = np.random.randn(5)
data + 10
```

Two arrays of the same shape can also be multiplied together, element by element:

```python
data * data
```



**Exercises**: Modify the following arrays using the math operators  (+, -, *, /)

In [53]:
data = np.arange(-3, 5)
data

array([-3, -2, -1,  0,  1,  2,  3,  4])

1. Multiply the data by 100

In [54]:
data * 100

array([-300, -200, -100,    0,  100,  200,  300,  400])

2. Add 40 to each value in the array.

In [55]:
data + 40

array([37, 38, 39, 40, 41, 42, 43, 44])

3. Divide the numbers by 100

In [56]:
data / 100

array([-0.03, -0.02, -0.01,  0.  ,  0.01,  0.02,  0.03,  0.04])

4. Subtract the data from itself.

In [57]:
data - data

array([0, 0, 0, 0, 0, 0, 0, 0])

In [60]:
data2 = np.array([4, 5, 6, 7, 8, 1, 2])
data - data2

ValueError: operands could not be broadcast together with shapes (8,) (7,) 
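
The error above arises because element-wise arithmetic needs operands whose shapes match or can be broadcast to match. A minimal sketch of what works and what does not is below; trimming to a common length is just one possible workaround, and whether it makes sense depends on what the values mean.

```python
import numpy as np

data = np.arange(-3, 5)                   # shape (8,)
data2 = np.array([4, 5, 6, 7, 8, 1, 2])   # shape (7,)
print(data.shape, data2.shape)

# Element-wise subtraction fails for mismatched lengths.
try:
    data - data2
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# A single number is stretched (broadcast) to fit, so this works.
shifted = data - 1
print(shifted)

# Trim both arrays to the shorter length before subtracting.
n = min(len(data), len(data2))
difference = data[:n] - data2[:n]
print(difference)
```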

5. Subtract the mean of the data from each value (a.k.a. "mean-centering" the values)

In [61]:
data - data.mean()

array([-3.5, -2.5, -1.5, -0.5,  0.5,  1.5,  2.5,  3.5])

6. Make an array of 10 values, all of them 2!

In [62]:
np.ones(10) * 2

array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])

In [63]:
np.ones(10) + 1

array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])

In [65]:
np.ones(10) + np.ones(10) + np.zeros(10)

array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])

7. Calculate the square of all the numbers from 0 to 8

In [67]:
np.arange(9) ** 2

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64])

In [68]:
np.square(np.arange(9))

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64])

In [69]:
np.arange(9) * np.arange(9)

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64])

8. Calculate the square roots of all the numbers from 0 to 8.

In [71]:
np.sqrt(np.arange(9))

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712])

In [73]:
data2 = np.arange(9)
data2 ** (1/2)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712])

In [76]:
np.sqrt(np.linspace(0, 8, 9))

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712])

### Translating Algorithms into Code

Calculate the standard deviation of an array's values, without using the numpy.std() function.  (Formula can be found here: http://www.mathsisfun.com/data/standard-deviation-formulas.html)

1. Work out the Mean (the simple average of the numbers)
2. Then for each number: subtract the Mean and square the result
3. Then work out the mean of those squared differences.
4. Take the square root of that and we are done!
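
The four steps above can be sketched one line at a time. The sample values and variable names are illustrative. Note that this is the population formula, which is also what `np.std` computes by default (`ddof=0`).

```python
import numpy as np

values = np.array([2, 8, 5, 9, 2, 4, 6])

mean = values.mean()                   # Step 1: the mean
squared_diffs = (values - mean) ** 2   # Step 2: subtract the mean, then square
variance = squared_diffs.mean()        # Step 3: mean of the squared differences
std_dev = np.sqrt(variance)            # Step 4: square root

print(std_dev)
print(np.isclose(std_dev, np.std(values)))
```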


In [85]:
data = np.random.randn(1000000)

In [88]:
def std(data):
    return np.sqrt(((data - data.mean()) ** 2).mean())

In [93]:
%%timeit
std(np.arange(2000000))

16.4 ms ± 77.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [94]:
%%timeit
np.std(np.arange(2000000))

16.4 ms ± 78.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Broadcasting vs. For-Loops
All of these tasks can also be done without NumPy, using the built-in Python ints, floats, lists, etc.  To tell Python to do the same task repeatedly, however, there is some extra syntax: The For-Loop.

Here, we'll look at a special version of the loop called a **list comprehension**, which builds a new list out of an old list, performing some modification to each value along the way:

```python
>>> data = [1, 2, 3, 4, 5]
>>> [x ** 2 for x in data]
[1, 4, 9, 16, 25]
```

```python
>>> data2 = [x - 2 for x in data]
>>> data2
[-1, 0, 1, 2, 3]
```
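
For comparison, here is a sketch of the explicit for-loop that the first list comprehension above replaces. The names `squares_loop` and `squares_comp` are illustrative.

```python
data = [1, 2, 3, 4, 5]

# Explicit for-loop: start with an empty list and append one value at a time.
squares_loop = []
for x in data:
    squares_loop.append(x ** 2)

# List comprehension: the same loop written as a single expression.
squares_comp = [x ** 2 for x in data]

print(squares_loop)                  # [1, 4, 9, 16, 25]
print(squares_loop == squares_comp)  # True
```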

### Exercises

Using the following array, perform the following transformations using both a List Comprehension and a Numpy operation:

In [95]:
data = np.arange(1, 11)
data

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

1. Multiply all the values by 10

with broadcasting:

In [96]:
data * 10

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

with a list comprehension:

In [98]:
y = [x * 10 for x in data]
y

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

2. Add each value to itself

with broadcasting:

In [99]:
data + data

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

with a list comprehension:

In [100]:
[x + x for x in data]

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

3. Calculate the square root of each value

with broadcasting:

In [101]:
np.sqrt(data)

array([1.        , 1.41421356, 1.73205081, 2.        , 2.23606798,
       2.44948974, 2.64575131, 2.82842712, 3.        , 3.16227766])

with a list comprehension:

In [102]:
[np.sqrt(x) for x in data]

[1.0,
 1.4142135623730951,
 1.7320508075688772,
 2.0,
 2.23606797749979,
 2.449489742783178,
 2.6457513110645907,
 2.8284271247461903,
 3.0,
 3.1622776601683795]

# Filtering Data

Sometimes you want to remove certain values from your dataset.  In Numpy, this can be done with **Logical Indexing**, and in normal Python it is done with an **If Statement**.

## With Logical Indexing

### Step 1: Create a Logical Numpy Array

We can compare all of the values in an array against a condition at once with a single logical expression, producing an array of True/False values.  This is broadcasting, just like the math operations we saw earlier:

```python
>>> data = np.array([1, 2, 3, 4, 5])
>>> data < 3
array([ True,  True, False, False, False])
```

**Exercises**: Make arrays of True/False values that answer the following questions about the dataset below for each element.

In [103]:
import numpy as np

list_of_values = [3, 7, 10, 2, 1, 7, 20, -5]
data = np.array(list_of_values)

1. Which values are greater than zero?

In [104]:
data > 0

array([ True,  True,  True,  True,  True,  True,  True, False])

2. Which values are equal to 7?

In [105]:
data == 7

array([False,  True, False, False, False,  True, False, False])

3. Which values are greater than or equal to 7?

In [106]:
data >= 7

array([False,  True,  True, False, False,  True,  True, False])

4. Which values are not equal to 7?

In [107]:
data != 7

array([ True, False,  True,  True,  True, False,  True,  True])

### Step 2: Filter with Logical Indexing

If an array of True/False values is used to *index* another array of the same size, the result contains only the values at the positions where the indexing array is True:

```python
>>> data = np.array([1, 2, 3, 4, 5])
>>> is_big = data > 3
>>> is_big
array([False, False, False,  True,  True])

>>> data[is_big]
array([4, 5])
```


**Exercises**:  Using the data below, extract only the values that correspond to each question.

In [112]:
data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])
data

array([ 3,  1, -6,  8, 20,  2,  7,  1,  9,  7,  7, -7])

1. The values that are less than 0

In [113]:
mask = data < 0
data[mask]

array([-6, -7])

2. The values that are greater than 3

In [115]:
mask = data > 3
data[mask]

array([ 8, 20,  7,  9,  7,  7])

3. The values equal to 7

4. The values not equal to 7

5. The values equal to 20

### Step 2.5: Combine Step 1 and Step 2 into a single line

Both steps can be done in a single expression.  Sometimes this can make things clearer!


```python
>>> data = np.array([1, 2, 3, 4, 5])
>>> data[data > 3]
array([4, 5])
```



**Exercises**: Do the same as in the previous section, this time in a single line.

In [28]:
data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])
data

array([ 3,  1, -6,  8, 20,  2,  7,  1,  9,  7,  7, -7])

1. The values that are less than 0

In [117]:
data[data < 0]

array([-6, -7])

2. The values that are greater than 3

In [118]:
data[data > 3]

array([ 8, 20,  7,  9,  7,  7])

3. The values equal to 7

In [119]:
data[data == 7]

array([7, 7, 7])

4. The values not equal to 7

In [120]:
data[data != 7]

array([ 3,  1, -6,  8, 20,  2,  1,  9, -7])

5. The values equal to 20

**Extra Exercises**: Using the following dataset, have Python calculate the answers to the questions below:

In [121]:
data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])
data

array([ 3,  1, -6,  8, 20,  2,  7,  1,  9,  7,  7, -7])

1. How many values are greater than 4 in this dataset?  (*Hint:* the len() function is useful here)

In [123]:
len(data[data > 4])

6

2. How many values are equal to 7 in this dataset?

In [124]:
len(data[data == 7])

3

3. What is the mean value of the positive numbers in this dataset?

In [125]:
np.mean(data[data > 0])

6.5

4. What is the mean value of the negative numbers in this dataset?

In [126]:
np.mean(data[data < 0])

-6.5

5. What proportion of the values in this dataset are positive?

In [127]:
len(data[data > 0]) / len(data)

0.8333333333333334

In [128]:
np.mean(data > 0)

0.8333333333333334

In [131]:
(data > 0)

array([ True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True, False])
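
Taking the mean of a True/False array gives a proportion because Numpy treats True as 1 and False as 0 in arithmetic. A short sketch of this, using the same dataset (the variable names are illustrative):

```python
import numpy as np

data = np.array([3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7])
is_positive = data > 0

# Summing the mask counts the True values.
count_positive = is_positive.sum()

# The mean of the mask is that count divided by the number of elements.
proportion_positive = is_positive.mean()

print(count_positive)                 # 10
print(proportion_positive)
print(np.count_nonzero(is_positive))  # another way to count the True values
```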

6. What proportion of the values in this dataset are less than or equal to 8?

In [132]:
np.mean(data <= 8)

0.8333333333333334

## With If-Statements

List comprehensions also support an **if** clause, which lets you do the filtering in a single step as well:

```python
>>> data = [1, 2, 3, 4, 5]
>>> [x for x in data if x > 3]
[4, 5]
```
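
For reference, the comprehension above is shorthand for a for-loop containing an if-statement. A minimal sketch (the name `big_values` is illustrative):

```python
data = [1, 2, 3, 4, 5]

# Keep a value only when the condition is true.
big_values = []
for x in data:
    if x > 3:
        big_values.append(x)

print(big_values)  # [4, 5]
```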


**Exercises**: Filter the list below, this time using list comprehensions instead.

In [None]:
[x ** 2 for x in data if x > 3]

In [133]:
data = [3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7]
data

[3, 1, -6, 8, 20, 2, 7, 1, 9, 7, 7, -7]

1. The values that are less than 0

In [135]:
[x for x in data if x < 0]

[-6, -7]

2. The values that are greater than 3

In [136]:
[x for x in data if x > 3]

[8, 20, 7, 9, 7, 7]

3. The values equal to 7

In [137]:
[x for x in data if x == 7]

[7, 7, 7]

4. The values not equal to 7

In [138]:
[x for x in data if x != 7]

[3, 1, -6, 8, 20, 2, 1, 9, -7]

5. The values equal to 20

In [139]:
[x for x in data if x == 20]

[20]

# Comparing NumPy with Core Python: Demonstration and Discussion

## Speed
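
As a possible starting point for the demonstration, the sketch below times the same squaring task done with a list comprehension and with a Numpy expression, using the standard-library `timeit` module. The array size, repeat count, and function names are arbitrary choices, and the measured times will vary from machine to machine.

```python
import timeit
import numpy as np

values_list = list(range(10_000))
values_array = np.arange(10_000)

def square_with_comprehension():
    # Pure Python: one multiplication per loop iteration.
    return [x ** 2 for x in values_list]

def square_with_numpy():
    # Numpy: the whole array is squared in one vectorized operation.
    return values_array ** 2

# Run each version 200 times and report the total time in seconds.
loop_seconds = timeit.timeit(square_with_comprehension, number=200)
numpy_seconds = timeit.timeit(square_with_numpy, number=200)

print(f"list comprehension: {loop_seconds:.4f} s")
print(f"Numpy:              {numpy_seconds:.4f} s")
```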

## Memory

## Flexibility