# <center> DATA PROCESSING IN PYTHON USING NUMPY<br/><br/> CSCAR WORKSHOP <br/><br/> 11/01/2018
## <center> Marcio Duarte Albasini Mourao

Thank you Prof. Kerby Shedden for making available part of the material for this workshop!

# <center> Setup

* Go to the page https://marcio-mourao.github.io/ <br/><br/>
* Click on the html file under "Data Processing in Python using Numpy" <br/><br/>
* If you really want to use the jupyter-notebook: <br/><br/>
    
    * First <br/><br/>
        * Click on the ipynb file (this should open a new tab)
        * Click on the 'Raw' tab
        * Save page as 'Workshop.ipynb' to your 'username/Documents' <br/><br/>
    
    * Second <br/><br/>
        * Click the Windows button (Bottom Left Corner)
        * Click "All apps"
        * Click "Anaconda3"
        * Click "Anaconda Prompt"
        * Enter "conda update numpy" <br/><br/>
    
    * Third <br/><br/>
        * Click the Windows button (Bottom Left Corner)
        * Click "All apps"
        * Click "Anaconda3"
        * Click "Jupyter Notebook"
        * Click "Workshop.ipynb" (this should open a new tab in the browser)

# <center> Introduction

  * Visit http://cscar.research.umich.edu/ to see what CSCAR offers.
  * Send any questions or feedback by email to <a href="mailto:mdam@umich.edu" target="_top">Marcio.</a>

# <center> Summary of this workshop


* Summary of Python Data Types <br/><br/>
* Numpy <br/><br/>
    * Introduction
    * Arithmetic Operations
    * Indexing and Slicing
    * Vectorization
    * Reducing functions
    * Broadcasting

# <center> References

* https://www.continuum.io/anaconda-overview
* http://www.numpy.org/

# <center> Summary of Python Data Types

## Python Simple Data Types
### Integers
### Floats
### Strings
### Booleans

## Python Complex Data Structures

### Lists

In [1]:
example_list = [2,4,'fg',8,[3,4]]
print(example_list)
print(example_list[0])
print(example_list[2:4])
print(example_list[-2])

[2, 4, 'fg', 8, [3, 4]]
2
['fg', 8]
8


In [2]:
example_list[2]=20
print(example_list)
print(example_list[4][0])

[2, 4, 20, 8, [3, 4]]
3


### Tuples

In [3]:
example_tuple = (2,4,6,8,10)
print(example_tuple)
print(example_tuple[1])
# example_tuple[2] = 20  # Tuples are immutable: uncommenting this line raises a TypeError.

(2, 4, 6, 8, 10)
4


### Dictionary

In [4]:
example_dictionary = {'A':20,'B':40,'C':60}
print(example_dictionary)
print(example_dictionary['B'])
example_dictionary['C']=100
print(example_dictionary)
print(example_dictionary.keys())
print(example_dictionary.values())

{'A': 20, 'B': 40, 'C': 60}
40
{'A': 20, 'B': 40, 'C': 100}
dict_keys(['A', 'B', 'C'])
dict_values([20, 40, 100])


# <center> Numpy

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides several array-like data structures, including a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

The most commonly used array-like data structure is the ndarray (“n-dimensional array”) object. An ndarray is a Python wrapper around a contiguous chunk of memory that allows it to be manipulated like an array.

Conventionally 'Numpy' is abbreviated as np:

In [5]:
import numpy as np

Numpy arrays are homogeneous, contiguous, typed arrays. This makes them dramatically faster than core Python lists for many operations, since the Python list stores all values by indirection and is dynamically typed. The main exception to this would be if you need to store heterogeneous data, and/or you need to shrink or grow the array frequently, in which case the Python list type may actually be more efficient.
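As a small illustration of what "homogeneous" and "typed" mean in practice, the sketch below builds an array from mixed numeric input and inspects its storage. The variable names are only for illustration:

```python
import numpy as np

# A list can hold anything; each slot is a reference to a separate Python object.
mixed_list = [1, 2.5, 3]

# An ndarray has a single dtype, so mixed numeric input is upcast to a common type.
arr = np.array(mixed_list)
print(arr.dtype)     # float64
print(arr)           # [1.  2.5 3. ]

# Every element occupies the same number of bytes in one buffer.
print(arr.itemsize)  # 8 bytes per float64 element
print(arr.nbytes)    # 24 = 3 elements * 8 bytes
```

Because every element has the same size and type, Numpy can loop over the buffer in compiled code without inspecting each value.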

There are currently 24 Numpy data types, called “dtypes”, described in the NumPy documentation. These include the usual 12 numeric types (1, 2, 4, and 8 byte signed and unsigned integers, 4 and 8 byte floating point values, and 8 and 16 byte complex number values). In addition there are string, date/time, and Python object dtypes. The default type for many array creation operations is float64, which is an 8 byte floating point value that is mostly interchangeable with a regular Python float value.

The np.zeros function creates an array of zeros, defaulting to float64 type:

In [6]:
m = 10
x = np.zeros(m) # Sets all values to 0.
print(type(x))
print(x.dtype)

<class 'numpy.ndarray'>
float64


A few other ways to create arrays. Note that each of these functions can take the dtype argument specifying any dtype:

In [7]:
x = np.ones(m) # Sets all values to 1.
print(x, x.dtype)
x = np.arange(m) # 0, 1, 2, ..., m-1.
print(x, x.dtype)
x = np.arange(m, dtype=np.float64) # 0, 1, 2, ..., m-1.
print(x, x.dtype)

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] float64
[0 1 2 3 4 5 6 7 8 9] int64
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.] float64


In [8]:
y = x.astype(np.int8) # Convert to int8.
print(y, y.dtype)

[0 1 2 3 4 5 6 7 8 9] int8
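The dtype fixes both the number of bytes per element and the range of values an element can hold. The following sketch inspects a few dtypes with `itemsize` and `np.iinfo`, and checks that the int8 conversion above did not lose information for these particular values:

```python
import numpy as np

# Bytes used by a single element of each dtype.
for dt in (np.int8, np.int64, np.float32, np.float64):
    print(np.dtype(dt).name, np.dtype(dt).itemsize)

# Range of values representable by an 8-bit signed integer.
info = np.iinfo(np.int8)
print(info.min, info.max)             # -128 127

x = np.arange(10, dtype=np.float64)
# float64 -> int8 is not a "safe" cast in general, so check before relying on it.
print(np.can_cast(x.dtype, np.int8))  # False
y = x.astype(np.int8)
print(np.array_equal(x, y))           # True: 0..9 all fit in int8
```

Values outside the target range are not preserved by `astype`, so it is worth checking the range before converting to a smaller type.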


Numpy provides two main ways to work with string data. The first approach, which is much more common, leaves the strings as ordinary Python string objects and simply stores references to them in the ndarray. This produces an array of dtype object, e.g.

In [9]:
x = np.array(["cat", "dog", "fish"], dtype = 'O')
x.dtype

dtype('O')

You can see that this array only holds references to the original string objects by running the following:

In [10]:
s = "lion"
x[0] = s
print(id(s), id(x[0]))
x

4483767184 4483767184


array(['lion', 'dog', 'fish'], dtype=object)

The other way to store strings in a ndarray is to use a fixed string width, in which case the string data is actually packed into the array directly:

In [11]:
x = np.array(["cat", "dog"])
print(x)
x.dtype

['cat' 'dog']


dtype('<U3')

The dtype “<U3” refers to a Unicode string of 3 characters. Note that in this setting, if you attempt to assign a string that does not fit into the allotted storage, the string is truncated:

In [12]:
x[0] = "fish"
print(x)

['fis' 'dog']
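Two possible ways to avoid the truncation are sketched below: reserving a wider fixed width up front, or falling back to the object dtype. The width of 10 characters is an arbitrary choice for this example:

```python
import numpy as np

# Option 1: reserve more room with an explicit fixed-width Unicode dtype.
wide = np.array(["cat", "dog"], dtype="U10")
wide[0] = "fish"
print(wide)        # ['fish' 'dog']

# Option 2: use the object dtype, which places no limit on string length.
obj = np.array(["cat", "dog"], dtype=object)
obj[0] = "fish"
print(obj)         # ['fish' 'dog']
```

The fixed-width array spends 10 characters of storage on every element whether or not it is needed, while the object array gives up the packed layout.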


## Arithmetic operations

Let's remind ourselves what happens when you add two lists:

In [13]:
operand1 = [1,2,3]
operand2 = [4,5]

In [14]:
print(type(operand1))
print(type(operand2))
operand1 + operand2

<class 'list'>
<class 'list'>


[1, 2, 3, 4, 5]

Unlike Python lists, Numpy arrays behave like mathematical vectors and matrices with respect to arithmetic operations, e.g. you can do something like this:

In [15]:
x = np.arange(5)
y = np.arange(1, 6)
print(x)
print(y)

[0 1 2 3 4]
[1 2 3 4 5]


In [16]:
print(x + y)  # Pointwise sum.
print(x - y)  # Pointwise difference.
print(x / y)  # Pointwise quotient.
print(x ** y) # Pointwise exponentiation.
print(x % y)  # Pointwise remainder.
print(x * y)  # Pointwise product.

[1 3 5 7 9]
[-1 -1 -1 -1 -1]
[0.         0.5        0.66666667 0.75       0.8       ]
[   0    1    8   81 1024]
[0 1 2 3 4]
[ 0  2  6 12 20]


In [17]:
x = np.ones((3,4)) # Creates a two dimensional numpy array.
x, x.dtype

(array([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]), dtype('float64'))

In [18]:
print(x.ndim)  # Returns number of dimensions in an array.
print(x.shape) # Return tuple describing array shape.
print(x.size)  # Returns number of elements.

2
(3, 4)
12


In [19]:
y = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])

In [20]:
print(y)
print(z)

[[1 2]
 [3 4]]
[[5 6]
 [7 8]]


In [21]:
y + z # Sums element-wise.

array([[ 6,  8],
       [10, 12]])

In [22]:
np.round(y / z, 2) # Divides element-wise, rounding to 2 decimals.

array([[0.2 , 0.33],
       [0.43, 0.5 ]])

An easy way to avoid making copies when performing array arithmetic in Numpy is to use the in-place arithmetic operators +=, *=, etc. When we use x = x + y, a new allocation is made to hold the value x + y, and this allocated memory is then assigned to x, with the previous memory block of x (eventually) being garbage collected. But x += y does not result in a new allocation, as seen below:

In [23]:
y = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])

print(id(y))
y = y + z   # Regular sum.
print(id(y))
print(y)

4503799408
4503799568
[[ 6  8]
 [10 12]]


In [24]:
y = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])

print(id(y))
y += z        # In-place sum.
print(id(y))
print(y)

4503799248
4503799248
[[ 6  8]
 [10 12]]
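The same distinction can be checked without comparing `id` values by eye. This sketch assumes the `out=` argument that Numpy ufuncs such as `np.add` accept, which writes the result into an existing array:

```python
import numpy as np

y = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])

alias = y                      # a second name for the same array object

new_array = y + z              # allocates a fresh array for the result
print(new_array is y)          # False

result = np.add(y, z, out=y)   # writes the sum into y's existing memory
print(result is y)             # True
print(alias)                   # alias sees the update
# [[ 6  8]
#  [10 12]]
```

One consequence of in-place updates is that every other name referring to the same array (here `alias`) observes the change.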


## Indexing and Slicing

Indexing and slicing numpy arrays behave similarly to indexing and slicing Python lists. However, slices will normally return a “view” of the underlying data, meaning that if you change a slice, the same values will change in the parent array.

Let's start with an example of a list and then move on to a numpy array:

In [25]:
x = list(range(10))
y = x[3:6]
y[0] = 99

In [26]:
print(x,y)
print(id(x),id(y))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [99, 4, 5]
4503702600 4484391560


In [27]:
x = np.arange(10)
y = x[3:6]
y[0] = 99

In [28]:
print(x,y)
print(id(x),id(y))

[ 0  1  2 99  4  5  6  7  8  9] [99  4  5]
4503798768 4503800688
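The `id` values differ because the slice is a new array object, even though it points at the same data. A more direct way to test for a view is sketched here, using the `base` attribute and `np.shares_memory`:

```python
import numpy as np

x = np.arange(10)
y = x[3:6]                     # basic slice: a view onto x's data

print(y.base is x)             # True: y borrows its memory from x
print(np.shares_memory(x, y))  # True

c = x[3:6].copy()              # explicit copy: owns its own data
print(c.base is None)          # True
print(np.shares_memory(x, c))  # False
```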


In [29]:
x = np.array([[1, 2], [3, 4]])
x

array([[1, 2],
       [3, 4]])

In [30]:
print(x[0]) # Returns the first row.
print(x[1:]) # Returns a two dimensional array containing only the second row.
print(x[0][1]) # Returns the second element of the first row.
print(x[0,1]) # A more concise way of writing the above lookup.

[1 2]
[[3 4]]
2
2


Again, slices will normally return a “view” of the underlying data, meaning that if you change a slice, the same values will change in the parent array. The same holds for higher dimensional arrays:

In [31]:
x = np.array([[1, 2, 3], [4, 5, 6]])
y = x[:,0] # retrieves the first column.
y[1]=100 # modifies its element.

In [32]:
print(x)
print(y)

[[  1   2   3]
 [100   5   6]]
[  1 100]


What if you would like to not have such behavior?

In [33]:
x = np.array([[1, 2, 3], [4, 5, 6]])
y = x[:,0].copy() # retrieves the first column (now a copy of it)
y[1]=100 # modifies its element.

In [34]:
print(x)
print(y)

[[1 2 3]
 [4 5 6]]
[  1 100]


Now suppose you want to retrieve elements from the first and the third columns of x:

In [35]:
x = np.array([[1, 2, 3], [4, 5, 6]])
y = x[:,[0,2]] # We use a list to index - this is also called advanced or fancy indexing.
y[1]=100 # sets the whole second row of y to 100.

In [36]:
print(x)
print(y)

[[1 2 3]
 [4 5 6]]
[[  1   3]
 [100 100]]


When you use advanced indexing to read from an array, no view is provided; a copy of the selected data is made instead, so changing y leaves x untouched.
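A short sketch of the copy behavior, and of the one case where advanced indexing does change the parent array (when the indexed expression is itself the target of an assignment). The name `picked` is only for illustration:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

picked = x[:, [0, 2]]               # advanced indexing on the right-hand side
print(np.shares_memory(x, picked))  # False: picked is a copy
picked[:] = 0
print(x)                            # x is unchanged
# [[1 2 3]
#  [4 5 6]]

x[:, [0, 2]] = 0                    # advanced indexing as an assignment target
print(x)                            # now x itself is modified
# [[0 2 0]
#  [0 5 0]]
```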

In [37]:
x = np.array([[1,2], [3,4], [5,6]])
y = x[[-1,-2],:] # Another example of advanced indexing with negative indices.

In [38]:
print(x)
print(y)

[[1 2]
 [3 4]
 [5 6]]
[[5 6]
 [3 4]]


In [39]:
x = np.array([[1,-2], [3,4], [-5,6]])
x

array([[ 1, -2],
       [ 3,  4],
       [-5,  6]])

In [40]:
x < 0 # x < 0 is a boolean array.

array([[False,  True],
       [False, False],
       [ True, False]])

In [41]:
x[x<0]=0 # This is called boolean indexing. Here I am setting negative entries to zero.
x

array([[1, 0],
       [3, 4],
       [0, 6]])
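Boolean masks can also be used to select values or to build a modified copy while leaving the original untouched. The sketch below uses `np.where` for the second case, as one possible alternative to assigning through the mask:

```python
import numpy as np

x = np.array([[1, -2], [3, 4], [-5, 6]])
mask = x < 0

print(x[mask])       # [-2 -5]: selected values, flattened in row order
print(mask.sum())    # 2: True counts as 1, so this counts the negatives

# Build a new array with negatives replaced by 0; x is not modified.
clipped = np.where(mask, 0, x)
print(clipped)
# [[1 0]
#  [3 4]
#  [0 6]]
```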

## Vectorization

Vectorization within NumPy means expressing operations as acting on entire arrays rather than on their individual elements. Here’s a definition from Wes McKinney:

"
This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact seen in any kind of numerical computations. (see [here](https://www.safaribooksonline.com/library/view/python-for-data/9781449323592/ch04.html?orpq))
"

Let's see an example:

In [42]:
values = np.random.choice([True,False], size=500)
values[:10] # Only showing a few values.

array([False,  True,  True,  True, False,  True,  True,  True, False,
        True])

In [43]:
def count_transitions1(values):
    """Returns the number of transitions from either False to True or from True to False"""
    output = 0
    for x,y in zip(values[:-1],values[1:]):
        if x!=y:
            output+=1
    return output

In [44]:
def count_transitions2(values):
    """Returns the number of transitions from either False to True or from True to False"""
    output = np.sum(values[:-1]!=values[1:])
    return output

In [45]:
print(count_transitions1(values))
print(count_transitions2(values))

244
244


In [46]:
%timeit -n 1000 count_transitions1(values)
%timeit -n 1000 count_transitions2(values)

743 µs ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
5.6 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
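To see why the vectorized version gives the same count, it can help to trace it on an input small enough to check by hand. The six values below are made up for this purpose:

```python
import numpy as np

values = np.array([True, True, False, False, True, False])

# Compare each element with its right-hand neighbor in a single array expression.
changed = values[:-1] != values[1:]
print(changed)                    # [False  True False  True  True]

# Each True marks one transition, so counting them gives the answer.
print(np.count_nonzero(changed))  # 3
```

The two shifted slices play the role of the `zip` in the loop version, and the comparison runs over all pairs at once in compiled code.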


## Reducing functions

Numpy has reducing functions that collapse an array along a chosen axis, so the result has one fewer dimension. The axes are numbered 0 (running down the rows), 1 (running across the columns), etc. The following is an example with a two dimensional array:

In [47]:
x = np.random.normal(size=(5,10))
x

array([[ 0.15361834, -0.9323961 ,  1.75365335,  0.6873561 , -0.71543811,
        -0.87553558, -0.2796889 , -0.81073661,  0.61154874, -0.32439873],
       [-0.84302238,  1.42273214,  0.64274908,  0.37132493, -1.11154913,
        -1.17341762, -0.08164711, -1.65344367,  0.54347979, -0.74886795],
       [ 0.1511163 ,  0.45450572, -1.0101531 ,  1.03196016,  0.25439743,
        -2.65277939, -0.29970784, -1.99011286, -1.4167083 ,  0.57062254],
       [-0.84188876, -0.18655635,  1.69137886, -0.12858569, -0.33067509,
         0.32722261,  0.17815516,  0.30888332,  0.97169305,  0.30049232],
       [-0.83882225,  1.4915064 ,  1.68740924, -0.78893662, -2.45055652,
         0.32969387, -0.96120084,  0.63815445,  0.93880032, -0.57658747]])

In [48]:
print(x.mean(0)) # column-wise means, size=10.
print(x.mean(1)) # row-wise means, size=5.

[-0.44379975  0.44995836  0.95300749  0.23462378 -0.87076428 -0.80896322
 -0.28881791 -0.70145107  0.32976272 -0.15574786]
[-0.07320175 -0.26316619 -0.49068593  0.22901194 -0.05305394]
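With random data it is hard to verify which axis was collapsed, so here is the same idea on a tiny array whose sums can be checked by hand:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

print(x.sum(axis=0))  # [3 5 7]: axis 0 is collapsed, one value per column
print(x.sum(axis=1))  # [ 3 12]: axis 1 is collapsed, one value per row
print(x.sum())        # 15: with no axis given, all elements are reduced
print(x.max(axis=0))  # [3 4 5]: other reductions follow the same convention
```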


## Broadcasting

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python.

One use of broadcasting is if we want to center or scale an array by column:

In [49]:
x = np.random.random(size=(10000, 4))
print(id(x))

print(x.mean(0))
print(x.std(0))

print(x.shape)
print(x.mean(0).shape)

x -= x.mean(0)
x /= x.std(0)

print(x.mean(0))
print(x.std(0))

print(id(x))

4503846624
[0.49848957 0.50341735 0.49999011 0.50023852]
[0.28743465 0.28775098 0.29027135 0.28880233]
(10000, 4)
(4,)
[-2.34228192e-15 -3.42936790e-15 -4.23601154e-15  4.32884839e-15]
[1. 1. 1. 1.]
4503846624


In the example above, x.mean(0) returns an array with shape (4,), which matches the shape of x, (10000, 4), when the two are aligned from the right. Therefore the shapes are compatible for broadcasting. The result of x.mean(0) is computed only once, and the same values are used to center each row of x.
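The right-alignment rule can be explored without doing any arithmetic. In this sketch, `broadcast_shape` is a hypothetical helper written for this example (not part of Numpy); it relies on `np.broadcast`, which raises a `ValueError` for incompatible shapes:

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    try:
        return np.broadcast(np.empty(shape_a), np.empty(shape_b)).shape
    except ValueError:
        return None

print(broadcast_shape((10000, 4), (4,)))  # (10000, 4): trailing 4 matches 4
print(broadcast_shape((3,), (1,)))        # (3,): a length of 1 stretches to 3
print(broadcast_shape((10, 3), (10,)))    # None: trailing 3 does not match 10
```

The last case shows that shapes are compared starting from the last axis, so a (10,) array cannot be combined directly with the rows of a (10, 3) array.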

There is a special case of the broadcasting rules that applies when a dimension’s length is equal to 1. In this case, the values along that dimension are repeated (conceptually, without an actual copy being made) to match the length of the same axis in the other array:

In [50]:
a = np.array([10.0, 15.0, 20.0])
b = np.array([5.0, 5.0, 5.0])
a / b

array([2., 3., 4.])

In [51]:
a = np.array([10.0, 15.0, 20.0])
print(a.shape)
b = np.array([5.0])
print(b.shape)
a / b

(3,)
(1,)


array([2., 3., 4.])

In [52]:
x = np.random.normal(size=(10, 2))
y = np.random.normal(size=(10, 1))
print(x + y)

[[-1.42709134 -1.23543264]
 [-1.95143571  0.83955878]
 [-0.79188457 -1.46913012]
 [ 0.16114464 -0.31243844]
 [-2.30882339  1.04068704]
 [ 2.17504989  1.87153516]
 [ 0.50688296 -0.32216532]
 [-1.39755324 -0.09783976]
 [-1.11557     0.09074254]
 [-0.67935513  1.755363  ]]


There is also a special syntax for adding a new axis of length 1 to an array:

In [53]:
x = np.zeros(10)  # shape is (10,).
x

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [54]:
y = x[:, None]    # shape is (10,1).
y

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]])

In [55]:
z = x[None, :]    # shape is (1,10).
z

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

A common setting where this is useful is when you want to center or scale a two-dimensional array by row. Adding a new axis of length 1 allows the broadcasting rules to apply when they otherwise would not:

In [56]:
x = np.random.normal(size=(10, 3))
x -= x.mean(1)[:, None]
x /= x.std(1)[:, None]

See the intermediate values below:

In [57]:
x.mean(1), x.mean(1).shape

(array([-2.96059473e-16, -1.11022302e-16, -3.70074342e-17,  2.22044605e-16,
         0.00000000e+00,  7.40148683e-17, -7.40148683e-17,  2.22044605e-16,
         0.00000000e+00,  0.00000000e+00]), (10,))

In [58]:
x.mean(1)[:, None], x.mean(1)[:, None].shape

(array([[-2.96059473e-16],
        [-1.11022302e-16],
        [-3.70074342e-17],
        [ 2.22044605e-16],
        [ 0.00000000e+00],
        [ 7.40148683e-17],
        [-7.40148683e-17],
        [ 2.22044605e-16],
        [ 0.00000000e+00],
        [ 0.00000000e+00]]), (10, 1))
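An alternative to indexing with `None` is the `keepdims=True` argument of the reducing methods, which leaves the collapsed axis in place with length 1. The sketch below uses a small made-up array so the row means (2 and 30) are easy to verify:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])

print(np.newaxis is None)   # True: np.newaxis is just a readable alias for None

row_means = x.mean(axis=1, keepdims=True)
print(row_means.shape)      # (2, 1): ready to broadcast against (2, 3)
print(np.array_equal(row_means, x.mean(1)[:, None]))  # True

# Standardize each row without modifying x.
z = (x - row_means) / x.std(axis=1, keepdims=True)
print(np.allclose(z.mean(1), 0.0))  # True
print(np.allclose(z.std(1), 1.0))   # True
```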

## Conclusion

Numpy is one of many tools for array manipulation designed between the 1980s and the early 2000s. These tools tend to follow a common set of design principles, namely:

* Arrays are contiguous in memory.
* Memory management is dynamic and mostly invisible.
* Mathematical operations are expressed in the same syntax that we usually use to write mathematics on paper, e.g. x = y + z assigns to x the pointwise sum of y and z.