# Understanding Memory Efficiency of Numpy Arrays

In [None]:
import numpy as np
import pandas as pd
import random
import sys

In [None]:
my_list = [10,12,14]
my_list

<div class="alert alert-block alert-info">
<p><b>`sys.getsizeof()`</b></p>

<p>Returns the size of an object in bytes. For a container such as a list, this counts only the container itself, not the objects it refers to. It is good practice to check the size of the containers your program creates. <b>Make sure you `import sys` to use this function.</b></p>

<p> [More info](https://docs.python.org/3/library/sys.html#sys.getsizeof)</p>
</div>

In [None]:
sys.getsizeof?

In [None]:
print(sys.getsizeof(my_list))

In [None]:
my_numpy = np.array(my_list)
my_numpy

In [None]:
print(sys.getsizeof(my_numpy))

In [None]:
my_list = [x for x in range(0,1000000)]

In [None]:
list_bytes = sys.getsizeof(my_list)
print(list_bytes/1000000)

In [None]:
my_numpy = np.array(my_list)
my_numpy

In [None]:
numpy_bytes = sys.getsizeof(my_numpy)
print(numpy_bytes/1000000)

In [None]:
numpy_bytes / list_bytes

### Python lists are not memory efficient

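As you saw in the example above, once the list reaches a decent size, the Python list occupies more memory (more bytes) than the NumPy array. In fact, the measured gap understates the difference, because `sys.getsizeof` does not count the integer objects that the list refers to. This is one of the main reasons to use NumPy arrays.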
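
To see the fuller picture, the sketch below adds the size of every element to the size of the list container and compares the total with the array's `nbytes` attribute. The helper `total_list_bytes` and the variable names are hypothetical, and the array dtype is fixed to 64-bit integers as an assumption so the numbers do not depend on the platform default.

```python
import sys
import numpy as np

def total_list_bytes(lst):
    """Size of the list container plus the size of every element it references."""
    return sys.getsizeof(lst) + sum(sys.getsizeof(item) for item in lst)

small_list = list(range(1000))
# dtype is set explicitly (an assumption) so the result is platform independent
small_array = np.array(small_list, dtype=np.int64)

list_bytes_total = total_list_bytes(small_list)
array_data_bytes = small_array.nbytes  # bytes used by the raw element buffer only

print(list_bytes_total, array_data_bytes)
```

Each Python integer is a full object with its own overhead, while the array stores only the raw 8-byte values side by side.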

# Time Efficiency: Numpy Universal Functions (UFuncs) to the Rescue

## The Slow Python Lists 

We saw earlier that Python lists are **not memory efficient**, but we'll also see that they are **not time efficient** when performing operations on a large number of data elements.

This is **very bad news** for us, since that's pretty much the core of what we do as data scientists. Thankfully, NumPy provides us a way to perform repetitive operations with lightning speed.

<div class="alert alert-block alert-info">

Python has a <strong>huge</strong> community of developers and users who create awesome libraries like NumPy and give them away for free.

</div> 


Before we show how awesome NumPy is, let's show how bad the problem can be in normal Python. We'll start by using an example that is similar to your textbook.


### Reciprocals with Python Lists

In [None]:
# Define a function that will take an argument (parameter) called `lst`
# It will return another list with the reciprocal values
def compute_reciprocals_list(lst):
    
    #Create an empty list that gets appended with reciprocal values one at a time
    reciproc = []
    
    # For each element 'elem' in the 'lst', compute the reciprocal and append it to 
    # the output list
    for elem in lst:
        reciproc.append(1/elem)
    return reciproc

list_one = [1,2,3,4,5,6,7]
compute_reciprocals_list(list_one)

In [None]:
list_one = range(1,10)
%timeit -n 1 compute_reciprocals_list(list_one)

#### Timing Code Execution: `%timeit` 

When dealing with large amounts of data, you are going to want to learn how to make your code run fast. To be able to make it faster, you have to be able to see how long the various parts of your code take to execute.

IPython (Jupyter) makes this extremely easy to do with the **`%timeit` magic command.**

<div class="alert alert-block alert-info">
`%timeit -n 1` asks IPython to execute the statement once per timing run. The run is still repeated several times (7 by default, controlled with `-r`), and the reported time summarizes those runs.
</div>
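
Outside IPython, roughly the same measurement can be sketched with the standard-library `timeit` module. The function `reciprocals` and the variable names below are illustrative stand-ins, not part of this notebook; `number=1` plays the role of `-n 1`, and `repeat=7` is chosen to mirror the default repeat count mentioned above.

```python
import timeit

def reciprocals(values):
    """Return a list holding 1/x for every x in values."""
    return [1 / x for x in values]

data = list(range(1, 10))

# Each of the 7 runs calls the function once and records the elapsed seconds.
run_times = timeit.repeat(lambda: reciprocals(data), number=1, repeat=7)

print(f"best of {len(run_times)} runs: {min(run_times) * 1e6:.2f} microseconds")
```

The numbers will differ from machine to machine, so treat them as relative, not absolute.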


In [None]:
%timeit?

<div class="alert alert-block alert-info">
<h5> Measures of execution time </h5>
<p>$ ns $ - Nanosecond: 1/1,000,000,000 of a second (one billionth of a second)</p>
<p>$\mu s$ - Microsecond: 1/1,000,000 of a second (one millionth of a second)</p>
<p>$ ms $ - Millisecond: 1/1,000 of a second (one thousandth of a second)</p>
<p>$ s $ - Second</p>
</div>

### Reciprocals with Numpy Arrays and For loops

In [None]:
# Define a function that will take an argument (parameter) called `array_one`
# It will return another array with the reciprocal values
def compute_reciprocals_numpy(array_one):
    
    # Create an `output` array that starts with the same number 
    # of elements that are in the `array_one` parameter.
    output = np.empty(len(array_one)) 
    
    # For each item in the `array_one` parameter
    # Retrieve its value and index.
    for index, value in enumerate(array_one):
        
        # Update the same index position in the `output` object
        # With 1.0 divided by the current iteration value
        output[index] = 1.0 / value
        
    # Return the updated `output` array.    
    return output

array_one = np.arange(1,10)
compute_reciprocals_numpy(array_one)

In [None]:
array_one = np.arange(1,10)
%timeit -n 1 compute_reciprocals_numpy(array_one)

In [None]:
big_list = [random.randint(1,100) for x in range(1000000)]
%timeit -n 1 compute_reciprocals_list(big_list)

In [None]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit -n 1 compute_reciprocals_numpy(big_array)

### Wait! Why is a NumPy array slower than a list? 

You might be wondering why the NumPy array is slower than the Python list. Although we created a memory-efficient NumPy array, we are still processing it one element at a time in a traditional loop, and **LOOPS ARE SLOW**.

## UFuncs to the Rescue <a name="ufuncs"></a>

The NumPy package has **UFuncs**, or **Universal Functions**, which can dramatically improve the speed of operations on array elements. They are also referred to as **vectorized** operations.

Basically, these functions push the loop processing into the C code that lies underneath Python/NumPy so that the operations are performed much faster than normal.

This only works because all the data elements of an array are of the same type.
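
A small sketch of that last point: every NumPy array carries a single `dtype`, and mixed inputs are converted to one common type when the array is built. The variable names are illustrative, and the integer dtype is set explicitly as an assumption so the output does not depend on the platform.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)
mixed = np.array([1, 2.5, 3])  # the ints are converted so all elements share one type

# Every element of a given array occupies the same number of bytes
print(ints.dtype, ints.itemsize)
print(mixed.dtype, mixed.itemsize)
```

Because the element type and size are known up front, the compiled loop does not need to inspect each value's type as it goes.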

In [None]:
value = 10
print(1/value)

In [None]:
array_one = np.arange(1,10)

print(array_one)
print(compute_reciprocals_numpy(array_one))

In [None]:
# UFunc / Vectorized Version
# This notation says: take 1, divide it by each element
# of `array_one`, and return the results as a new array
print(1 / array_one)

In [None]:
# You can also assign the reciprocal values to a new variable
new_array = 1 / array_one
print(new_array)

In [None]:
# Time it takes using a list
%timeit -n 1 compute_reciprocals_list(big_list)

In [None]:
# Time it takes using a np array
%timeit -n 1 compute_reciprocals_numpy(big_array)

In [None]:
# Now time the UFunc approach
# Remember, the other way took a looooong time.
%timeit -n 1 (1.0 / big_array)


### Takeaway: Loops are a big NO! NO!

As you can see, there is a dramatic improvement in speed when using vectorized functions (UFuncs) instead of loops. This is very important when working with large datasets. Hence, **avoid writing loops and use the built-in functions in NumPy** to improve the speed.
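
As a quick sanity check, the loop and the vectorized expression should produce the same numbers; only the speed differs. The sketch below uses `time.perf_counter` instead of `%timeit` so that it also runs in plain Python. `loop_reciprocals` and `sample` are hypothetical names, and the measured times will vary by machine.

```python
import time
import numpy as np

def loop_reciprocals(values):
    """Element-by-element reciprocal using an explicit Python loop."""
    output = np.empty(len(values))
    for index, value in enumerate(values):
        output[index] = 1.0 / value
    return output

sample = np.random.randint(1, 100, size=100_000)

start = time.perf_counter()
looped = loop_reciprocals(sample)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized = 1.0 / sample
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
print("same results:", np.allclose(looped, vectorized))
```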

### Arithmetic UFuncs
As we just demonstrated, there is a UFunc for division operations. It probably will not surprise you then to discover that all the normal Python arithmetic operations are replicated with UFuncs.

Here are some examples:

In [None]:
simple_int_array = np.arange(1, 6)
print(simple_int_array)

In [None]:
# Add 5 to each array element
simple_int_array + 5

In [None]:
# Subtract each element from 10
# Notice the somewhat subtle difference here.  It's important.
10 - simple_int_array

In [None]:
# You can perform multiple operations.
# Standard math order of operations is followed

# Raise each element to the 3rd power and subtract 10
new_math_operation_array = simple_int_array ** 3 - 10
print(new_math_operation_array)

#### An Alternative Syntax
In addition to using standard mathematical operators (i.e. `+, -, *, /, **`), you can accomplish the same thing by invoking the UFuncs by name.

For example:

In [None]:
# Add 3.5 to each element of our `simple_int_array`
# Notice how the ints are "upcast" to floats?
np.add(3.5, simple_int_array)

In [None]:
# Divide each array element by 4
np.divide(simple_int_array, 4)

In [None]:
# And notice that the order of parameters is important
# when dividing and subtracting...
np.divide(4, simple_int_array)

In [None]:
4 / simple_int_array
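
To confirm that the operator form and the named form are interchangeable, the sketch below compares a few pairs with `np.array_equal`. The array `values` is an illustrative stand-in for `simple_int_array`, and the `checks` dictionary is a hypothetical helper.

```python
import numpy as np

values = np.arange(1, 6)  # array([1, 2, 3, 4, 5])

# Each entry is True when the operator and the named UFunc agree element by element
checks = {
    "add": np.array_equal(values + 5, np.add(values, 5)),
    "subtract": np.array_equal(10 - values, np.subtract(10, values)),
    "divide": np.array_equal(4 / values, np.divide(4, values)),
    "power": np.array_equal(values ** 3, np.power(values, 3)),
    "mod": np.array_equal(values % 2, np.mod(values, 2)),
}
print(checks)
```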

#### Summary Table
Here is a summary table of the common arithmetic UFuncs available to you.


| Operator      | Equivalent ufunc    | Description |                         
|---------------|---------------------|---------------------------------------|
|``+``          |``np.add``           |Addition (e.g., ``1 + 1 = 2``)         |
|``-``          |``np.subtract``      |Subtraction (e.g., ``3 - 2 = 1``)      |
|``-``          |``np.negative``      |Unary negation (e.g., ``-2``)          |
|``*``          |``np.multiply``      |Multiplication (e.g., ``2 * 3 = 6``)   |
|``/``          |``np.divide``        |Division (e.g., ``3 / 2 = 1.5``)       |
|``//``         |``np.floor_divide``  |Floor division (e.g., ``3 // 2 = 1``)  |
|``**``         |``np.power``         |Exponentiation (e.g., ``2 ** 3 = 8``)  |
|``%``          |``np.mod``           |Modulus/remainder (e.g., ``9 % 4 = 1``)|

#### Operations between Two NumPy Arrays
Your textbook discusses how you can invoke arithmetic UFuncs with either `scalar` values or other arrays.

For those who are not programming experts, a **scalar** is simply an object that holds a single value, like a number. This is in contrast to a **container**-type object like a `list` or `ndarray`, which holds multiple values.

Let's see how you can use UFuncs where both objects are arrays.



In [None]:
# Let's create two new arrays.
# One will have the numbers 1-5 and the other 6-10
one_to_five = np.arange(1, 6)
six_to_ten = np.arange(6, 11)

print(one_to_five, six_to_ten)

In [None]:
# Now lets add them together.
# Notice how it takes the 1st element of both and adds them together
# then the second and so on...
print(np.add(one_to_five, six_to_ten))
print(one_to_five + six_to_ten)

In [None]:
# The same thing will happen with other operations.
# Here will we divide each element of `one_to_five` by `six_to_ten`
print(np.divide(one_to_five, six_to_ten))
print(one_to_five / six_to_ten)

<div class="alert alert-block alert-warning">
<h5>Important Note!</h5>

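<p>Being able to perform mathematical operations between two arrays is a really powerful tool. But take note that, in the element-by-element form shown here, it only works when the two arrays have the same shape (or shapes that NumPy can make compatible through its broadcasting rules).</p>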

</div>

In [None]:
# Shape mismatched arrays will cause problems...
np.arange(5) + np.arange(10)
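
The cell above is expected to fail. Here is a sketch of catching that failure so the error message can be inspected without halting the notebook. The variable names are illustrative.

```python
import numpy as np

shorter = np.arange(5)
longer = np.arange(10)

try:
    shorter + longer
    error_message = None
except ValueError as err:
    # NumPy raises ValueError when the two shapes cannot be combined
    error_message = str(err)

print(error_message)
```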

## Many more mathematical operations

* **`np.abs`**: get the absolute value
* **`np.sin`, `np.cos`, `np.tan`**: trigonometric operations
* **`np.power`, `np.exp`, `np.exp2`**: exponent operations
* **`np.log`, `np.log2`, `np.log10`**: logarithmic operations
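
A brief sketch of a few of the functions listed above, applied to a small array. The array `x` and the name `round_trip` are made up for illustration.

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])

print(np.abs(x))          # absolute value of each element
print(np.sin(x))          # sine of each element (input in radians)
print(np.exp(x))          # e raised to each element
print(np.log(np.abs(x)))  # natural log; abs() keeps the input positive

# exp and log are inverses, so this round trip should recover x
round_trip = np.log(np.exp(x))
print(np.allclose(round_trip, x))
```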

# Array Aggregation with NumPy

We can use NumPy to compute summary statistics for a dataset. Below, we look at some important summary statistics computed with NumPy functions.

## `np.sum`

In [None]:
# Let's get our familiar int array with 1 to 10 in it.
simple_int_array = np.arange(1, 11)
simple_int_array

In [None]:
# Here's how easy it is to get the sum of all the element values.
print(np.sum(simple_int_array))
print(simple_int_array.sum())

<div class="alert alert-block alert-warning">
<h3>NumPy Aggregations vs. Built-in Aggregations</h3>
<p>We saw in an earlier class that Python has a built-in `sum` function (as well as functions like `min` and `max`).</p>

<p>
However, you will almost always want to use the NumPy versions of these functions on arrays. <b>The standard Python versions won't have the speed advantages of the NumPy ones</b>, and they do not always support multi-dimensional arrays.
</p>

</div>

In [None]:
big_array = np.random.rand(1000000)
%timeit -n 1 sum(big_array)
%timeit -n 1 np.sum(big_array)
%timeit -n 1 big_array.sum()

<div class="alert alert-block alert-info">
<p>
Though not discussed here, <b>practice `np.sum` on two-dimensional arrays using the optional `axis` parameter</b>. See PDSH page 60 and the [online resources](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html).
</p>
</div>
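
As a starting point for that practice, here is a minimal sketch of the `axis` parameter on a small two-dimensional array. The array `grid` and the result names are hypothetical.

```python
import numpy as np

grid = np.arange(1, 7).reshape(2, 3)  # [[1, 2, 3], [4, 5, 6]]

total = np.sum(grid)                # sum of every element
column_sums = np.sum(grid, axis=0)  # collapse the rows: one sum per column
row_sums = np.sum(grid, axis=1)     # collapse the columns: one sum per row

print(total)
print(column_sums)
print(row_sums)
```

The `axis` argument names the dimension that is collapsed, so `axis=0` leaves one value per column and `axis=1` leaves one value per row.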

### Aggregation functions available in NumPy

| Function            | Description                           |
|---------------------|---------------------------------------|
|``np.sum``           |Compute sum of elements                |
|``np.prod``          |Compute product of elements            |
|``np.mean``          |Compute mean of elements               |
|``np.std``           |Compute standard deviation of elements |
|``np.var``           |Compute variance of elements           |
|``np.min``           |Find the minimum value                 |
|``np.max``           |Find the maximum value                 |
|``np.argmin``        |Find the index of minimum value        |
|``np.argmax``        |Find the index of maximum value        |
|``np.median``        |Compute median of elements             |
|``np.percentile``    |Compute rank-based statistics of elements|
|``np.any``           |Evaluate whether any elements are true |
|``np.all``           |Evaluate whether all elements are true |

## Revisit: ND Football Roster Example

In [None]:
# At this point you don't need to know the details of the following data loading.
# Just understand that it loads the weights, names, and heights of all the athletes.
roster = pd.read_csv('./data/nd-football-2018-roster.csv')  # read the file once
nd_player_weights = np.array(roster['Weight'])
nd_player_names = np.array(roster['Name'])
nd_player_heights = np.array(roster['Height'])
print(nd_player_heights[:5])
print(nd_player_weights[:5])
print(nd_player_names[:5])

## Activity:

Use the NumPy arrays `nd_player_weights`, `nd_player_heights` and `nd_player_names` to compute the following details. Note that these arrays are aligned so that the $i^{th}$ element in one array corresponds to the $i^{th}$ element in the others.

* Average weight, average height `232.xx / 73.xx`
* The median weight, height `222 / 74`
* Variance of weights, heights `1789 / 5.78xx`
* Name of lightest player (**Hint**: Use np.argmin) `Lawrence Keys`
* Name of tallest player (**Hint**: Use np.argmax) `Josh Lugg`
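
As a hint for the last two items, here is a sketch of the `np.argmin` / `np.argmax` lookup pattern on made-up data. The names and numbers below are invented for illustration and are not taken from the roster.

```python
import numpy as np

# Hypothetical, aligned arrays: position i describes the same person in each
names = np.array(["Ann", "Ben", "Cal", "Dee"])
weights = np.array([180, 240, 165, 210])
heights = np.array([70, 76, 68, 74])

lightest = names[np.argmin(weights)]  # index of the smallest weight, used to look up the name
tallest = names[np.argmax(heights)]   # index of the largest height, used to look up the name

print(lightest, tallest)
```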

# NumPy Array Comparisons & Masking

Now, we will learn another set of NumPy functions that compare the value of each element to a given condition and (generally) return a new array specifying whether each element did or did not meet that condition.

## Available NumPy Comparison Functions
You can invoke NumPy's comparison functions either through an operator or by an explicit function call. You need to be familiar with both styles as you will see both in other people's code. 

Here are the available functions:

| Operator    | Equivalent ufunc    |
|---------------|---------------------|
|``==``         |``np.equal``         |
|``!=``         |``np.not_equal``     |
|``<``          |``np.less``          |
|``<=``         |``np.less_equal``    |
|``>``          |``np.greater``       |
|``>=``         |``np.greater_equal`` |

In [None]:
nd_player_weights

In [None]:
# Which players weigh more than 200lbs?
# I'll use the operator syntax this time.
nd_player_weights > 200

Interesting. It returns a new array that is full of `boolean` values. If the value is `True` at a given index, it means that specific player's weight is over 200 lbs.

<div class="alert alert-block alert-info">
<p>
For our purposes here, a boolean just means it is either true or false.
</p>
</div> 

In [None]:
# Which players are not 6ft (72 in) tall?
# This time I'll explicitly call the UFunc.
np.not_equal(nd_player_heights, 72)

### Comparison UFuncs + `np.sum`, `np.all`, or `np.any`
Above, we answered the question, *which players weigh more than 200 lbs?* Now we will combine that information with additional functions to answer the following:

In [None]:
nd_player_weights > 200

In [None]:
# Are any of the players > 200 lbs?
# The `np.any` function will return true if any array values are `True`.
np.any( nd_player_weights > 200 )

In [None]:
# Are ALL of the players > 200 lbs?
# The `np.all` function returns true if ALL the array values are true.
np.all( nd_player_weights > 200 )

In [None]:
# How many players weigh > 200 lbs?
np.sum( nd_player_weights > 200 )

<div class="alert alert-block alert-info">
<h5>Where is `np.sum` getting a number from?</h5>
<p>
In an earlier tutorial we learned that the `np.sum` aggregate function adds all the values of an array together. But, there are no numeric values in an array full of `True/False` so where do these come from?
</p>
<p>
It turns out that in Python the boolean `True` has a corresponding numeric value of `1` (and `False` a value of `0`). So, each time `np.sum` encounters `True` in the boolean array, it adds `1` to its running total.
</p>
</div> 

In [None]:
np.mean( nd_player_weights > 200 )

<div class="alert alert-block alert-info">
<h5>What is `np.mean` giving us?</h5>
<p>
Since `nd_player_weights > 200` returns boolean values, and we know from above that `True` counts as `1` and `False` counts as `0`, the average is the fraction of players who weigh more than 200 lbs.
</p>
</div> 
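
The counting and averaging behavior described above can be sketched on a tiny made-up array. The weights below are hypothetical, not roster data.

```python
import numpy as np

weights = np.array([180, 240, 165, 210, 255])  # hypothetical weights in lbs
over_200 = weights > 200                       # [False, True, False, True, True]

count = np.sum(over_200)      # each True contributes 1, so this is 3
fraction = np.mean(over_200)  # 3 of 5 elements are True, so this is 0.6
also_count = np.count_nonzero(over_200)  # another way to count the True values

print(count, fraction, also_count)
```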

### Comparison UFuncs + Bitwise Boolean Operators
This one might be a little bit confusing at first, so we'll start with a practical example.

Let's say that we wanted to know which players are between 72 and 75 inches tall. **Bitwise boolean operators** allow us to combine comparisons and get the net result.

Let's demonstrate.

In [None]:
# Which players are between 72 and 75 inches tall?
(nd_player_heights >= 72) & (nd_player_heights <= 75)

<div class="alert alert-block alert-danger">
<h5>Parentheses Are Important Here</h5>
<p>
The parentheses here are important because of 
<a href="https://docs.python.org/3/reference/expressions.html#operator-precedence" target="_blank">
Python's operator precedence rules</a>. The `&` operator binds more tightly than the comparison operators, so without the parentheses the expression would be evaluated as `nd_player_heights >= (72 & nd_player_heights) <= 75`.
</p>
<p>
That is a different expression, and with arrays it typically fails with a `ValueError` instead of giving the intended result. So, be mindful to use parentheses to force the correct order of operations when combining comparison UFuncs with bitwise boolean operators.
</p>
</div> 
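
The effect of the parentheses can be sketched on a small made-up array. The heights below are hypothetical, and the `raised` flag is only there to record what happened.

```python
import numpy as np

heights = np.array([70, 72, 74, 76])  # hypothetical heights in inches

# With parentheses: two boolean arrays combined element by element
with_parens = (heights >= 72) & (heights <= 75)
print(with_parens)

try:
    # Parsed as `heights >= (72 & heights) <= 75`, a chained comparison
    # that needs a single True/False answer from a whole array
    without_parens = heights >= 72 & heights <= 75
    raised = False
except ValueError:
    raised = True

print("error without parentheses:", raised)
```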

In [None]:
# Ok, now let's bring back in `np.sum` to get a 
# count of the players that match this criteria
np.sum(  (nd_player_heights >= 72) & (nd_player_heights <= 75)  )

In [None]:
nd_player_heights > 72

In [None]:
~(nd_player_heights > 72)

What we've done here is use a couple of **bitwise boolean operators**. These operators work element by element: for each pair of corresponding elements in the two arrays, the operator is applied and the result for that pair is `True` or `False`.

Yes, that is a mouthful of a sentence. So, practice how this works with a smaller set of arrays. But first, here is the full list of operators:

| Operator   | Equivalent ufunc  |
|------------|-------------------|
|`&`         |`np.bitwise_and`   |
|&#124;      |`np.bitwise_or`    |
|`^`         |`np.bitwise_xor`   |
|`~`         |`np.bitwise_not`   |

## Activity:

Use the NumPy arrays `nd_player_weights`, `nd_player_heights` and `nd_player_names` to compute the following details. Note that these arrays are aligned so that the $i^{th}$ element in one array corresponds to the $i^{th}$ element in the others.

* How many players are above 72 inches in Height?  `77`
* Are there any players between 250 lbs to 260 lbs? `True`
* How many players are either above 75 inches or below 70 inches? `34` 
* How many players are not below 250 lbs? `31`
* What percentage of players are above 75 inches in Height? `25.8`

#### Special Note for `np.bitwise_not` (`~`)
Up above, I said that the bitwise boolean operators evaluate two arrays. Well, in the case of `np.bitwise_not`, that isn't true.

Unlike the other bitwise operators, this one simply reverses the values in a boolean array.
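
To practice with a smaller set of arrays, here is a sketch that applies all four operators to two short boolean arrays. The arrays `a` and `b` are made up so that every combination of `True` and `False` appears once.

```python
import numpy as np

a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

and_result = a & b  # True only where both are True
or_result = a | b   # True where at least one is True
xor_result = a ^ b  # True where exactly one is True
not_result = ~a     # takes a single array and flips every value

print(and_result)
print(or_result)
print(xor_result)
print(not_result)
```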