![numpy](https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/numpy.png)

[NumPy](https://www.numpy.org/) is the fundamental package for scientific computing with Python. Many other data science packages, especially those that work with matrices, rely on it for its speed and utility.

# Objectives

- Use NumPy to create arrays and perform efficient operations with them
- Use NumPy's other mathematical tools relevant to data analysis

The standard alias for NumPy is `np`.

In [1]:
#LN: NumPy offers many functions and fast operations for collections of data, especially numeric data
#LN: an array is a multidimensional collection of numbers
import numpy as np

# NumPy Arrays

Python lists and NumPy arrays can both hold numbers. However, Python lists have limited functionality for mathematical operations. NumPy arrays make it easy and fast to do math with a collection of numbers.

In [2]:
x = np.array([1, 2, 3])
print(x)
print(type(x))

[1 2 3]
<class 'numpy.ndarray'>


Note that there is an [`array` class in base Python](https://docs.python.org/3/library/array.html), but we will not be using it. It is essentially a list constrained to one type (e.g. int).

In [3]:
#LN: using this is not recommended
import array
x = array.array('i',[1,2,3])
print(x)
print(type(x))

array('i', [1, 2, 3])
<class 'array.array'>


## Character Arrays

NumPy arrays can either hold strings or numbers, not both. There is a special `chararray` object you can use with strings.
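As a small illustration of the "not both" rule, the sketch below (the name `mixed` is made up for this example) shows what typically happens when numbers and strings are passed to `np.array` together: rather than raising an error, NumPy converts every element to a string.

```python
import numpy as np

# Hypothetical example: mix an integer and a string in one array
mixed = np.array([1, 'Bob'])

# Both elements are now strings; the integer 1 was converted to '1'
print(mixed)

# dtype.kind is 'U' for Unicode string arrays
print(mixed.dtype.kind)
```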

In [5]:
#LN:numpy for text data in array 
names_list = ['Bob', 'John', 'Sally']

# Use numpy.array for numbers and numpy.char.array for strings.

names_array = np.char.array(['Bob', 'John', 'Sally'])

print(names_list)
print(names_array)
type(names_array)
#LN: the printed array has no commas between elements
#LN: cannot mix data types in a NumPy array

['Bob', 'John', 'Sally']
['Bob' 'John' 'Sally']


numpy.chararray

In [6]:
# The character array has string-functionality that numeric
# arrays don't have.

names_array.endswith('b')

array([ True, False, False])

## Numeric Arrays

Let's make a list and an array of three numbers, and see how they function differently.

In [7]:
numbers_list = [0, 5, 7]
numbers_array = np.array([0, 5, 7])

### Type

Numeric arrays have the `ndarray` type, which is short for N-Dimensional Array.

In [8]:
print(type(numbers_list))
print(type(numbers_array))

<class 'list'>
<class 'numpy.ndarray'>


## Arithmetic Operations

Arithmetic operators (e.g. +, -, * and /) work according to mathematical principles for arrays, unlike with lists. These operations are done "element-wise".

In [9]:
# multiply the array by 3
#LN:multiply each number by 3
numbers_array * 3

array([ 0, 15, 21])

In [10]:
# multiply the list by 3
#LN:list shows 3 times
numbers_list * 3

[0, 5, 7, 0, 5, 7, 0, 5, 7]

In [11]:
#LN:add 20 to each element
numbers_array + 20

array([20, 25, 27])

In [13]:
#LN: this raises an error because 20 is a number; put brackets around 20 to add it to the list
print(numbers_list + 20)
print(numbers_list + [20])

TypeError: can only concatenate list (not "int") to list
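The same operators also work between two arrays of the same shape. A minimal sketch, using two made-up arrays `a` and `b`, pairs up the elements by position:

```python
import numpy as np

a = np.array([0, 5, 7])
b = np.array([1, 2, 3])

total = a + b      # element-wise sum
product = a * b    # element-wise product
ratio = a / b      # element-wise true division; the result holds floats

print(total)
print(product)
print(ratio)
```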

### Speed

Below, you will find a piece of code we will use to compare the speed of operations on lists vs arrays.

In [14]:
size_of_vec = 1000

X = list(range(size_of_vec))
Y = list(range(size_of_vec))

In [15]:
#LN: %timeit reports how long a statement takes to run; it only works on a single line
%timeit [X[i] + Y[i] for i in range(size_of_vec)]

136 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [16]:
X = np.array(range(size_of_vec))
Y = np.array(range(size_of_vec))

In [17]:
%timeit X + Y

1.02 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
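`%timeit` is an IPython magic command, so it only works in a notebook or IPython session. A rough equivalent for a plain Python script, sketched here with the standard library's `timeit` module, might look like the following. The helper names `add_lists` and `add_arrays` are invented for this example, and the timings will vary by machine.

```python
import timeit

import numpy as np

size_of_vec = 1000

list_x = list(range(size_of_vec))
list_y = list(range(size_of_vec))
array_x = np.arange(size_of_vec)
array_y = np.arange(size_of_vec)


def add_lists():
    # Add the two lists one pair of elements at a time
    return [list_x[i] + list_y[i] for i in range(size_of_vec)]


def add_arrays():
    # Add the two arrays in a single vectorized operation
    return array_x + array_y


# Total time in seconds for 1000 calls of each function
list_seconds = timeit.timeit(add_lists, number=1000)
array_seconds = timeit.timeit(add_arrays, number=1000)

print(f"lists:  {list_seconds:.4f} s for 1000 runs")
print(f"arrays: {array_seconds:.4f} s for 1000 runs")
```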


## Array Attributes and Methods

Type `numbers_list.` and then hit `TAB`. What options do you have?

In [None]:
numbers_list.

The names of standard Python list methods appear:

- `append(x)` (add x to the end of the list)
- `clear()` (delete all elements of the list)
- `copy()` (make a copy of the list)
- `count(x)` (return the number of instances of x in the list)
- `extend([x, y])` (add x and y to the end of the list)
- `index(x)` (return the position in the list of x)
- `insert(x, y)` (insert y into position x in the list)
- `pop(i=-1)` (remove and return the element at position i in the list)
- `remove(x)` (remove x from the list)
- `reverse()` (reverse the order of the elements of the list)
- `sort()` (sort the elements of the list)

Now type `numbers_array.` and then hit `TAB`. What options do you have?

In [None]:
#LN:many more attributes for arrays vs lists
numbers_array.

Turns out, there are a _bunch_ of new tools!

### Numeric Methods

- `max()` (return the greatest value in the array)

In [18]:
numbers_array.max()

7

- `mean()` (return the arithmetic mean of the array)

In [19]:
numbers_array.mean()

4.0

- `min()` (return the smallest value in the array)

In [20]:
numbers_array.min()

0

- `round()` (round each entry in the array to a specified number of decimal places)

In [21]:
np.array([9.5, 1.2, 6.3]).round()

array([10.,  1.,  6.])

- `std()` (return the standard deviation of the array)

In [22]:
numbers_array.std()

2.943920288775949

- `sum()` (return the sum of the array's elements)

In [23]:
numbers_array.sum()

12

### Boolean Methods

- `all()` (returns True iff bool(element) == True for all elements in the array)

In [24]:
#LN:all test all are true vs any test if any are true
numbers_array.all()

False

- `any()` (returns True iff bool(element) == True for some element in the array)

In [25]:
numbers_array.any()

True
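In practice, `all()` and `any()` are often applied to the result of a comparison rather than to the raw numbers. A short sketch (the variable names are chosen for this example only):

```python
import numpy as np

numbers_array = np.array([0, 5, 7])

# A comparison produces a boolean array, which all() and any() then summarize
all_non_negative = (numbers_array >= 0).all()   # every element is >= 0
any_above_six = (numbers_array > 6).any()       # 7 is > 6
all_above_six = (numbers_array > 6).all()       # 0 and 5 are not > 6

print(all_non_negative, any_above_six, all_above_six)
```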

# Multi-Dimensional Indexing

Arrays are especially powerful when dealing with numbers in multiple dimensions - for example, a dataset with rows and columns. We will primarily work with such 2-dimensional arrays.

In [26]:
#LN:write how you might write a list of lists
#LN:in numpy arrange so easy to see
nums = np.array([[1, 2, 3], [4, 5, 6]])
print(nums)

[[1 2 3]
 [4 5 6]]


In [27]:
#LN: returns the length of each dimension
print(nums.shape)

(2, 3)


In [28]:
#LN:make sure start counting from 0
nums[0, 2]

3

This is more efficient than `nums[0][2]`. Why?

In [29]:
#LN: use a comma instead of two sets of brackets to speed things up
%timeit nums[0, 2]

224 ns ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [30]:
%timeit nums[0][2]

335 ns ± 3.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
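One plausible answer to the question above: `nums[0][2]` performs two indexing operations. `nums[0]` first builds an intermediate array for the whole row, and `[2]` then indexes into that row, while `nums[0, 2]` goes to the element in a single step. The sketch below makes the intermediate object visible (the name `row` is introduced only for illustration).

```python
import numpy as np

nums = np.array([[1, 2, 3], [4, 5, 6]])

# Step one of nums[0][2]: an intermediate array holding row 0
row = nums[0]
print(type(row))
print(row.shape)

# Step two indexes into that row; the value matches the one-step form
print(nums[0, 2] == row[2])
```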


### Slicing

Use a trailing comma or colon to [slice](https://numpy.org/doc/stable/reference/arrays.indexing.html) sections of an array.

In [35]:
nums = np.array([[1, 2, 3], [4, 5, 6]])
print(nums)

[[1 2 3]
 [4 5 6]]


In [31]:
nums[1]
#LN:row in 1st position

array([4, 5, 6])

In [32]:
nums[1,]
#LN:trailing , gives same thing

array([4, 5, 6])

In [33]:
nums[1,:]

array([4, 5, 6])

In [34]:
nums[:,1]

array([2, 5])
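Slices can also take start and stop positions, just as with lists. The sketch below (variable names are made up) also shows one thing worth knowing: a basic slice is a view onto the original array, so writing through it changes the original.

```python
import numpy as np

nums = np.array([[1, 2, 3], [4, 5, 6]])

last_two_columns = nums[:, 1:]   # every row, columns 1 and 2
first_row_start = nums[0, :2]    # row 0, columns 0 and 1

# Basic slices are views: writing through the slice changes nums too
first_row_start[0] = 99
print(nums)

# Use .copy() when an independent array is wanted
independent = nums[:, 1:].copy()
independent[0, 0] = -1
print(nums[0, 1])   # still 2
```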

### Reshaping

In [36]:
new = np.array([[1, 2, 3], [4, 5, 6]])
new

array([[1, 2, 3],
       [4, 5, 6]])

- `shape` (stores the dimension of the array)

In [37]:
new.shape

(2, 3)

- `ravel()` (reduce array to one dimension)

In [38]:
new.ravel()

array([1, 2, 3, 4, 5, 6])

- `reshape()` (return an array with the specified dimensions)

In [39]:
new.reshape(3, 2)

array([[1, 2],
       [3, 4],
       [5, 6]])

- `T` (stores the transpose of the array)

In [40]:
#LN: can think of it as flipping along the diagonal, or as an array of the columns
#LN: no () needed because T is an attribute, not a method
new.T

array([[1, 4],
       [2, 5],
       [3, 6]])
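Two details of `reshape()` are easy to miss, so here is a small sketch of them (the names `three_rows` and `one_column` are invented for this example). Passing `-1` lets NumPy work out one dimension for you. Reshaping to `(3, 2)` is also not the same as transposing, even though the shapes match.

```python
import numpy as np

new = np.array([[1, 2, 3], [4, 5, 6]])

# -1 asks NumPy to infer that dimension from the array's size
three_rows = new.reshape(3, -1)
one_column = new.reshape(-1, 1)
print(three_rows.shape)
print(one_column.shape)

# The total number of elements must match, otherwise reshape raises ValueError
try:
    new.reshape(4, 2)
except ValueError as err:
    print("reshape failed:", err)

# Same shape as the transpose, but the elements are in a different order
print(np.array_equal(new.reshape(3, 2), new.T))
```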

# NumPy Functions

NumPy has a bunch of functions, besides array methods, that are helpful for working with arrays.

## Array Constructors

In [None]:
#LN: these functions create new arrays for you
#LN: handy when you need a starting array of zeros or ones to add to
#LN: arange is "array range"
#LN: linspace gives equally spaced values (start, end, number of values)
print(np.zeros(10))
print(np.ones(10))
print(np.arange(10, dtype=float))
print(np.linspace(0.1, 1, 10))
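The cell above is shown without output, so the following sketch spells out what each constructor returns. The variable names are chosen for this example. The last line is an addition: passing a tuple as the shape gives a multi-dimensional array.

```python
import numpy as np

zeros = np.zeros(10)                  # ten 0.0 values
ones = np.ones(10)                    # ten 1.0 values
counts = np.arange(10, dtype=float)   # 0.0, 1.0, ..., 9.0 (the stop value is excluded)
steps = np.linspace(0.1, 1, 10)       # 10 evenly spaced values, both endpoints included

# A tuple as the first argument gives a multi-dimensional array
grid = np.zeros((2, 3))

print(counts)
print(steps)
print(grid.shape)
```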

### `np.concatenate()`

In [41]:
#LN: stick 2 arrays together
np.concatenate([[1, 2], [3, 4]])

array([1, 2, 3, 4])
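With 2-dimensional inputs, the `axis` argument of `np.concatenate()` controls the direction in which the arrays are joined. A minimal sketch with two made-up arrays, `top` and `bottom`:

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6], [7, 8]])

# axis=0 (the default) stacks the rows of one array under the other
stacked_rows = np.concatenate([top, bottom])

# axis=1 places the arrays side by side
stacked_columns = np.concatenate([top, bottom], axis=1)

print(stacked_rows)
print(stacked_columns)
```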

## Filtering

In [42]:
data = np.array([10, 3, 4, 7, 6])

In [43]:
data < 5

array([False,  True,  True, False, False])

In [44]:
#LN: get out all data < 5
data[data < 5]

array([3, 4])
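Several conditions can be combined into one filter. The sketch below uses NumPy's element-wise operators `&`, `|` and `~`, since Python's `and`, `or` and `not` do not work element-wise on arrays. The variable names are illustrative.

```python
import numpy as np

data = np.array([10, 3, 4, 7, 6])

# & is element-wise "and", | is element-wise "or", ~ is element-wise "not".
# The parentheses are required because & and | bind tighter than comparisons.
middle = data[(data > 3) & (data < 10)]
extremes = data[(data <= 3) | (data >= 10)]
not_small = data[~(data < 5)]

print(middle)
print(extremes)
print(not_small)
```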

### `np.where()`

In [45]:
#LN: elements that meet the condition get the 2nd argument; those that don't get the 3rd
np.where(data < 5, "Low", "High")

array(['High', 'Low', 'Low', 'High', 'High'], dtype='<U4')

### `np.select()`

In [46]:
#LN:can have more than 2 conditions
conditions = [data < 5, data > 9]

choices = ['small', 'big!']

In [47]:
np.select(conditions, choices, default='other')

array(['big!', 'small', 'small', 'other', 'other'], dtype='<U5')
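When the conditions overlap, `np.select()` uses the first condition that is true for each element, so the order of the list matters. A small sketch with deliberately overlapping conditions (the labels are made up for this example):

```python
import numpy as np

data = np.array([10, 3, 4, 7, 6])

# 10 satisfies both conditions; the first matching condition wins
conditions = [data > 5, data > 9]
choices = ['medium', 'big!']
labels = np.select(conditions, choices, default='small')
print(labels)

# Listing the stricter condition first lets 10 be labelled 'big!'
reordered = np.select([data > 9, data > 5], ['big!', 'medium'], default='small')
print(reordered)
```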

## Broadcasting

Two arrays can be combined with mathematical operations via [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html).  

From the [docs](https://numpy.org/doc/stable/user/basics.broadcasting.html):

    When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward. Two dimensions are compatible when
    - they are equal, or
    - one of them is 1
    
Let's try to figure out what will happen in the operations below.

In [48]:
#LN: NumPy checks whether the shapes are compatible and, if they are, performs the operation
#LN: compatible = the dimensions are equal, or one of them has length 1
#LN: start by comparing the last dimension of each shape
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([10, 100, 1000])
print(arr1)
print(arr2)

[[1 2 3]
 [4 5 6]]
[  10  100 1000]


In [49]:
print(arr1.shape)
print(arr2.shape)
#LN: the second one has only 1 dimension

(2, 3)
(3,)


In [52]:
#LN: multiplies each row by 10, 100, 1000
arr1 * arr2

array([[  10,  200, 3000],
       [  40,  500, 6000]])

In [50]:
arr3 = np.array([10, 100])
arr3.shape

(2,)

In [51]:
arr1 * arr3

ValueError: operands could not be broadcast together with shapes (2,3) (2,) 

In [54]:
#LN: to try to fix this, add an extra set of brackets; this makes a 2-dimensional array
#LN: whose first dimension has length 1
#LN: still get the same problem
arr4 = np.array([[10, 100]])
arr4.shape

(1, 2)

In [55]:
arr1 * arr4

ValueError: operands could not be broadcast together with shapes (2,3) (1,2) 

In [56]:
#LN: column array
#LN: put each number in its own bracket
arr5 = np.array([[10],[100]])
arr5.shape

(2, 1)

In [57]:
arr1 * arr5

array([[ 10,  20,  30],
       [400, 500, 600]])
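The compatibility rule quoted from the docs can be written out as a short function. Note that `shapes_are_compatible` is a hypothetical helper written for this notebook, not part of NumPy; it is a sketch of the rule, applied to the shapes tried above. The last lines show one possible way to turn `arr3` into a column without retyping it, using `np.newaxis`.

```python
import numpy as np


def shapes_are_compatible(shape_a, shape_b):
    """Apply the quoted rule: compare dimensions starting from the trailing one."""
    # zip stops at the shorter shape, so missing leading dimensions are skipped
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True


print(shapes_are_compatible((2, 3), (3,)))     # arr1 and arr2
print(shapes_are_compatible((2, 3), (2,)))     # arr1 and arr3
print(shapes_are_compatible((2, 3), (1, 2)))   # arr1 and arr4
print(shapes_are_compatible((2, 3), (2, 1)))   # arr1 and the column array

# np.newaxis adds a dimension of length 1, turning shape (2,) into (2, 1)
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr3 = np.array([10, 100])
column = arr3[:, np.newaxis]
print(arr1 * column)
```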

# Other NumPy Tools

NumPy comes with an assortment of mathematical tools that can come in handy, separate from arrays.

## Trigonometry 

- `np.pi` for $\pi$

In [58]:
np.pi

3.141592653589793

- `np.sin()` for the sine function

In [59]:
np.sin(np.pi / 6)

0.49999999999999994
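The result above is not exactly 0.5 because of floating-point rounding. When comparing such results, a tolerance-based check like `np.isclose()` is usually safer than `==`. A brief sketch (the names `half` and `angles` are made up), which also shows that `np.sin()` works element-wise on arrays:

```python
import numpy as np

half = np.sin(np.pi / 6)

print(half == 0.5)             # likely False: the result above is 0.4999..., not 0.5
print(np.isclose(half, 0.5))   # comparison within a small tolerance

# np.sin also works element-wise on arrays of angles (in radians)
angles = np.array([0, np.pi / 2, np.pi])
print(np.sin(angles).round(3))
```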

## Sequences


- `np.cumsum()` to calculate the cumulative (running) sum of sequence terms

In [60]:
#LN: add in cumulative sum 
np.cumsum([1, 4, 9, 16])

array([ 1,  5, 14, 30])

- `np.diff()` to calculate the differences between consecutive sequence terms

In [61]:
np.diff([1, 4, 9, 16])

array([3, 5, 7])
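`np.cumsum()` and `np.diff()` are close to being inverses of each other. The sketch below, using the same numbers as above with illustrative variable names, shows the relationship, along with the `n` argument of `np.diff()` for repeated differencing.

```python
import numpy as np

squares = np.array([1, 4, 9, 16])

running_total = np.cumsum(squares)   # [1, 5, 14, 30]
steps = np.diff(running_total)       # differences between neighbouring totals

# diff undoes cumsum, except that the first term is lost
print(steps)
print(np.array_equal(steps, squares[1:]))

# n=2 applies diff twice (differences of the differences)
print(np.diff(squares, n=2))
```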

## Logarithms

- `np.exp()` for Euler's number $e$ raised to a given exponent

In [62]:
np.exp(2)

7.38905609893065

- `np.log()` for natural (base $e$) logarithms

In [63]:
np.log(10)

2.302585092994046
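Since `np.log()` is the natural logarithm, it undoes `np.exp()`. For other bases, NumPy provides `np.log10()` and `np.log2()`, and any remaining base can be handled with the change-of-base formula. A short sketch (the values are arbitrary examples):

```python
import numpy as np

x = 2.0
round_trip = np.log(np.exp(x))   # the natural log undoes exp
print(round_trip)

print(np.log10(1000))   # base-10 logarithm
print(np.log2(8))       # base-2 logarithm

# A logarithm in another base via change of base (base 3 is just an example)
log_base_3 = np.log(81) / np.log(3)
print(log_base_3)
```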

## np.nan

NaN stands for "not a number". NumPy's `np.nan` constant is a handy way of representing such values, and it is very useful for marking missing data.

Since `np.nan` is a float, we don't get errors when doing operations on arrays that have missing data.

In [64]:
#LN: Not a number, represent missing data so can work with data w/o errors
#LN: gives you float
type(np.nan)

float

In [65]:
arr5 = np.array([1, 10, np.nan])

In [66]:
arr5.mean()

nan

Even though the array has a NaN, we don't get an error in calculating its mean; the result is simply `nan`. Moreover, we can do this:

In [67]:
#LN: workaround for the mean: nansum skips the NaN, but len still counts it
np.nansum(arr5) / len(arr5)

3.6666666666666665

Is this the right measure of the mean? Well, maybe. But if not, we also have this:

In [68]:
#LN: does not include nan 
np.nanmean(arr5)

5.5
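To find out where the missing values are, use `np.isnan()`. A comparison with `==` does not work, because NaN is not equal to anything, including itself. The sketch below (variable names are illustrative) builds a mask and filters out the missing entries by hand.

```python
import numpy as np

values = np.array([1, 10, np.nan])

print(np.nan == np.nan)   # False: NaN is not equal to anything, itself included

# Boolean mask marking the NaN entries
missing = np.isnan(values)
print(missing)

# Keep only the non-missing values, then take an ordinary mean
observed = values[~missing]
print(observed.mean())
```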

## np.inf

Sometimes you will end up with values of infinity represented by `np.inf`, such as if you divide by zero. `np.inf` is a float, like `np.nan` is.

In [69]:
np.array([2])/np.array([0])

array([inf])

In [70]:
type(np.inf)

float

In [71]:
#LN: is finite
np.isfinite(np.inf)

False

This can also be useful when handling edge cases in custom functions.

In [72]:
def inv(x):
    return x**(-2)

In [73]:
inv(0)

ZeroDivisionError: 0.0 cannot be raised to a negative power

In [74]:
#LN: now it works: passing in 0 gives back inf instead of an error
def inverse(x):
    if x == 0:
        val = np.inf
    else:
        val = x**(-2)
    return val

In [75]:
inverse(0)

inf
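The `inverse` function above handles one number at a time. A possible array-friendly variant is sketched below. The name `inverse_square` and the approach are assumptions of this example, not part of the original notebook. It converts the input to floats so that division by zero yields `inf`, and uses `np.errstate` to silence the warning NumPy would otherwise print.

```python
import numpy as np


def inverse_square(x):
    """Return x**(-2) element-wise, giving inf where x is 0."""
    # Convert to floats so that dividing by zero produces inf
    x = np.asarray(x, dtype=float)
    # Silence the divide-by-zero warning for this calculation only
    with np.errstate(divide='ignore'):
        return 1.0 / x**2


result = inverse_square([1, 2, 0])
print(result)
print(inverse_square(0))
```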