In this lecture we go over some common modules in Python that are widely used in data science and automation tasks.

We first go over the 'collections' module. It is a built-in module that implements specialized container data types, providing alternatives to Python’s general-purpose built-in containers. We've already gone over the basics: dictionaries, lists, sets, and tuples. Simply put, this module adds more advanced data types on top of them.

To introduce new types of objects, we first study a new concept. In computer science, a **container** is a class, a data structure, or an abstract data type (ADT) whose instances are collections of other objects. In other words, they store objects in an organized way that follows specific access rules. The size of the container depends on the number of objects (elements) it contains. Underlying (inherited) implementations of various container types may vary in size and complexity, and provide flexibility in choosing the right implementation for any given scenario.

We first go over the 'Counter' data type. A 'Counter' is a dictionary subclass that counts hashable objects. Elements are stored as dictionary keys and their counts are stored as the values. To create a 'Counter' object, call the Counter() class from the 'collections' module.

In [1]:
from collections import Counter
l1=[1,1,1,1,2,2,2,4,0,55,5,5,5,5,5,7,7,7]
c1=Counter(l1)
print(c1)
print(type(c1))

Counter({5: 5, 1: 4, 2: 3, 7: 3, 4: 1, 0: 1, 55: 1})
<class 'collections.Counter'>


In [2]:
s = 'How many times does each word show up in this sentence word times each each word'
words = s.split()
Counter(words)

Counter({'How': 1,
         'many': 1,
         'times': 2,
         'does': 1,
         'each': 3,
         'word': 3,
         'show': 1,
         'up': 1,
         'in': 1,
         'this': 1,
         'sentence': 1})

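Essentially, you can think of 'Counter' as a cheap 'proc freq' in SAS. Note that a key that was never counted (such as 'e' or 'q' below) returns 0 instead of raising a 'KeyError':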

In [3]:
c2 = Counter('abcdaajhhhjb')
print(c2)
for letter in 'abcdjeqq':
    print ('%s : %d' % (letter, c2[letter]))

Counter({'a': 3, 'h': 3, 'b': 2, 'j': 2, 'c': 1, 'd': 1})
a : 3
b : 2
c : 1
d : 1
j : 2
e : 0
q : 0
q : 0


There are also methods associated with 'Counter' objects. For example, the most_common(n) method returns the n most common elements together with their counts:

In [4]:
c3 = Counter(words)
print(c3)
print(c3.most_common(2)) # find the 2 most common words
print(c3.most_common(3)) # find the 3 most common words
print(c3.most_common()[:-3:-1]) # find the 2 least common words

Counter({'each': 3, 'word': 3, 'times': 2, 'How': 1, 'many': 1, 'does': 1, 'show': 1, 'up': 1, 'in': 1, 'this': 1, 'sentence': 1})
[('each', 3), ('word', 3)]
[('each', 3), ('word', 3), ('times', 2)]
[('sentence', 1), ('this', 1)]


Beyond most_common(), a 'Counter' supports the usual dictionary methods and works with built-in functions such as sum(), list(), set(), and dict(). Below are some examples:

In [5]:
print(sum(c3.values())) # calculating the total of all counts
print(list(c3)) # this creates a list of unique elements 
print(set(c3)) # this creates a set of unique elements 
print(dict(c3)) # this creates a dictionary for elements in the 'Counter' object
print(c3.items()) # a view of the (elem, cnt) pairs; wrap it in list() to get a list
print(c3.clear()) # clear() removes all elements in place and returns None

16
['How', 'many', 'times', 'does', 'each', 'word', 'show', 'up', 'in', 'this', 'sentence']
{'word', 'sentence', 'up', 'many', 'this', 'How', 'does', 'times', 'show', 'in', 'each'}
{'How': 1, 'many': 1, 'times': 2, 'does': 1, 'each': 3, 'word': 3, 'show': 1, 'up': 1, 'in': 1, 'this': 1, 'sentence': 1}
dict_items([('How', 1), ('many', 1), ('times', 2), ('does', 1), ('each', 3), ('word', 3), ('show', 1), ('up', 1), ('in', 1), ('this', 1), ('sentence', 1)])
None
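
As a small complement to the methods above, the sketch below shows how counts from a second batch of data can be merged into an existing 'Counter' with update(), and how two counters can be added. The fruit names are made-up sample data, used only for illustration.

```python
from collections import Counter

# Hypothetical sample data, chosen only for illustration.
first_batch = ['apple', 'pear', 'apple']
second_batch = ['pear', 'plum', 'apple']

tally = Counter(first_batch)
tally.update(second_batch)      # adds the new counts to the existing ones

missing = tally['kiwi']         # a key that was never seen counts as 0
combined = Counter(first_batch) + Counter(second_batch)

print(tally)                    # apple: 3, pear: 2, plum: 1
print(missing)                  # 0
print('kiwi' in tally)          # False: the lookup did not create the key
print(combined == tally)        # True
```

Unlike the dictionary update() method, which replaces values, the update() method of a 'Counter' adds to the existing counts.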


Besides the 'Counter' object, there are other interesting objects in the 'collections' module. One particular example is the 'defaultdict' object, a dictionary subclass that provides all the methods of a regular dictionary but takes as its first argument (default_factory) a callable that supplies the default value for missing keys. To motivate this type of object, let's first go back to the dictionary object. Recall that dictionaries are a convenient way to store data for later retrieval by names (keys). Keys must be unique and hashable (in practice, immutable objects), and are typically strings. The values in a dictionary can be anything. For many applications the values are simple types such as integers and strings.

It gets more interesting when the values in a dictionary are collections (lists, dicts, etc.). In this case, the value (an empty list or dict) must be initialized the first time a given key is used. While this is relatively easy to do manually, the 'defaultdict' type automates and simplifies these kinds of operations. A 'defaultdict' object works like a normal dictionary, but it is initialized with a function ('default factory') that takes no arguments and provides the default value for a nonexistent key. As long as a default factory is set, looking up a missing key with d[key] does not raise a 'KeyError'; instead, the key is inserted with the value returned by the default factory.

Let's see some examples. We first create a normal dictionary to illustrate what a 'KeyError' is. Below, we create a normal dictionary called 'd1' with three keys. As usual, if we index with an existing key, we get its corresponding value. But if we reference a key that does not exist (say 'k4'), Python raises a 'KeyError':

In [6]:
d1={'k1':1, 'k2':94, 'k3':'Sarah'}
print(d1['k1'])
print(d1['k3'])
try:
    print(d1['k4'])
except KeyError:
    print('You are printing a key that does not exist at this moment!')
for item in d1:
    print(item)

1
Sarah
You are printing a key that does not exist at this moment!
k1
k2
k3


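Now by using the 'defaultdict' object, we don't have to worry about this type of error. The argument of defaultdict() must be a callable that takes no arguments (or None). Recall that calling object() returns a featureless object which is a base for all classes. Using it as the default factory means a missing key gets a new featureless object instead of raising an error. Note that in the example below the three names inside the brackets form a single tuple key, ('k4', 'k5', 'k6'), not three separate keys: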

In [7]:
from collections import defaultdict
d2 = defaultdict(object)
print(d2['k4', 'k5', 'k6'] )
for item in d2:
    print(item)

<object object at 0x00000202CB394820>
('k4', 'k5', 'k6')


Now as another example, suppose we use int as the argument of defaultdict(). Recall that int() returns an integer object built from a number or string. Called with no arguments, int() returns 0, so missing keys default to 0. Similarly, if we use list, missing keys default to an empty list, because list() with no arguments returns an empty list. Any keyword arguments after the default factory are used to initialize the dictionary, just as with dict():

In [8]:
d3 = defaultdict(int)
print(d3['k4'] )
d4 = defaultdict(list)
print(d4['k4'] )
d5 = defaultdict(int, a=10, b=12, c=13)
print(d5['b'])

0
[]
12


The 'defaultdict' object is often used in conjunction with lambda expressions, which make it easy to supply an arbitrary constant as the default value:

In [9]:
d6 = defaultdict(lambda: 9) # the default factory returns 9 for any missing key
print(d6['k1'] )

9
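
The motivating case mentioned earlier, a dictionary whose values are lists, can be sketched as follows. The (city, name) records are hypothetical sample data; the point is only to compare manual initialization with defaultdict(list).

```python
from collections import defaultdict

# Hypothetical (city, name) pairs used only for illustration.
records = [('Boston', 'Ann'), ('Austin', 'Bob'), ('Boston', 'Cy')]

# Plain dict: the empty list has to be created the first time a key appears.
manual = {}
for city, name in records:
    if city not in manual:
        manual[city] = []
    manual[city].append(name)

# defaultdict(list): the missing-key case is handled by the default factory.
grouped = defaultdict(list)
for city, name in records:
    grouped[city].append(name)

print(dict(grouped))          # {'Boston': ['Ann', 'Cy'], 'Austin': ['Bob']}

# .get() does not call the default factory, so no key is created here.
lookup = grouped.get('Denver')
print(lookup)                 # None
print('Denver' in grouped)    # False
```

Only indexing with square brackets triggers the default factory; get() and the 'in' operator behave exactly as they do for a normal dictionary.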


We now move on to ordered dictionaries. An 'OrderedDict' is a dictionary subclass that remembers the order in which its contents are added. Before Python 3.7, a normal dictionary was just a mapping and did not guarantee any particular order; since Python 3.7, normal dictionaries also preserve insertion order. Let's compare an ordinary dictionary with an ordered dictionary:

In [10]:
print('Normal Dictionary:')
d = {}
d['a'] = 'A' # assigning values with keys
d['b'] = 'B'
d['e'] = 'E'
d['c'] = 'C'
d['d'] = 'D'
for k, v in d.items():
    print(k, v)
print(d)

print('\nAn Ordered Dictionary:')
import collections
od = collections.OrderedDict()
od['a'] = 'A'
od['b'] = 'B'
od['c'] = 'C'
od['d'] = 'D'
od['e'] = 'E'
for k, v in od.items():
    print(k, v)
print(od)

Normal Dictionary:
a A
b B
e E
c C
d D
{'a': 'A', 'b': 'B', 'e': 'E', 'c': 'C', 'd': 'D'}

An Ordered Dictionary:
a A
b B
c C
d D
e E
OrderedDict([('a', 'A'), ('b', 'B'), ('c', 'C'), ('d', 'D'), ('e', 'E')])


In the example above, the normal dictionary 'd' has its values added in the order a, b, e, c, d. Since Python 3.7 (and in CPython 3.6 as an implementation detail), a normal dictionary preserves insertion order, which is why the output shows {'a': 'A', 'b': 'B', 'e': 'E', 'c': 'C', 'd': 'D'}; in older versions the order was arbitrary. The ordered dictionary remembers insertion order in every version. A difference that remains is how equality is tested. To illustrate, we can use a simple Boolean test to check whether two dictionaries with the same items in different orders are equal. For normal dictionaries the order does not matter, whereas for ordered dictionaries it does:

In [11]:
d1 = collections.OrderedDict()
d1['a'] = 'A'
d1['b'] = 'B'
d2 = collections.OrderedDict()
d2['b'] = 'B'
d2['a'] = 'A'
print('\nCompare ordered dictionary: ', d1, d2, ' Is d1==d2?')
print(d1 == d2)

d3={}
d3['a'] = 'A'
d3['b'] = 'B'
d4={}
d4['b'] = 'B'
d4['a'] = 'A'
print('\nCompare normal dictionary: ', d3, d4, ' Is d3==d4?')
print(d3==d4)


Compare ordered dictionary:  OrderedDict([('a', 'A'), ('b', 'B')]) OrderedDict([('b', 'B'), ('a', 'A')])  Is d1==d2?
False

Compare normal dictionary:  {'a': 'A', 'b': 'B'} {'b': 'B', 'a': 'A'}  Is d3==d4?
True


As you see above, d3 and d4 are just mappings, so order does not matter when comparing them. But d1 and d2 are more than just mappings: equality between two 'OrderedDict' objects also takes order into account.
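
Besides order-sensitive equality, an 'OrderedDict' offers methods for rearranging its items that a normal dictionary does not have. The following is a minimal sketch using move_to_end() and popitem(last=False); the keys and values are arbitrary.

```python
from collections import OrderedDict

od = OrderedDict([('a', 'A'), ('b', 'B'), ('c', 'C')])

od.move_to_end('a')                  # 'a' becomes the last key
order_after_move = list(od)
print(order_after_move)              # ['b', 'c', 'a']

oldest = od.popitem(last=False)      # remove and return the first item
print(oldest, list(od))              # ('b', 'B') ['c', 'a']

# Comparing an OrderedDict with a plain dict ignores order.
mixed_equal = OrderedDict([('x', 1), ('y', 2)]) == {'y': 2, 'x': 1}
print(mixed_equal)                   # True
```

These two methods are what make 'OrderedDict' convenient for queue-like or cache-like bookkeeping, where items are repeatedly moved to one end and removed from the other.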

We now examine another new object called 'namedtuple'. To understand what this is about, let's start with a standard tuple object as an example. The standard tuple uses numerical indexes to access its members, for example:

In [12]:
t = (12,13,14)
for i in [0,1,2]:
    print(i, t[i])

0 12
1 13
2 14


For a small tuple like the example above, we can easily index it to extract information from it (say t[0]). However, sometimes the tuple may be big and complex and it's hard to find that index. This is where the 'namedtuple' comes in handy.

A 'namedtuple' assigns a name as well as a numeric index to each member of a tuple. One way to think of a 'namedtuple' is as a very quick way of creating a class with some attribute fields. Each kind of 'namedtuple' is represented by its own class, created by using the namedtuple() factory function. The arguments are the name of the new class and a string containing the names of the fields. For example, below we create a 'namedtuple' class called 'Dog'. Its fields (given as a single string, separated by spaces) are age, breed, and name:

In [13]:
from collections import namedtuple
Dog = namedtuple('Dog','age breed name') # the first argument is the name of the class created and the second is a string of attributes separated by a blank space
print(type(Dog))
dog1 = Dog(age=2,breed='Bulldog',name='Sammy')
dog2 = Dog(age=16,breed='Shepard',name="Frankie")
print(dog1)
print(dog2)
print(type(dog1))
print(type(dog2))

<class 'type'>
Dog(age=2, breed='Bulldog', name='Sammy')
Dog(age=16, breed='Shepard', name='Frankie')
<class '__main__.Dog'>
<class '__main__.Dog'>


Basically, we have constructed a 'namedtuple' class by first passing the type name ('Dog') and then a single string with the field names separated by spaces. We can then access the various fields either by attribute name or by numeric index:

In [14]:
print(dog1.age)
print(dog1.name)
print(dog1[1])
print(dog2[0])

2
Sammy
Bulldog
16


In a nutshell, 'namedtuple' instances are just as memory efficient as regular tuples because they do not have per-instance dictionaries, while their named fields make code easier to read.
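
A few further features of named tuples can be sketched as follows, reusing the same 'Dog' definition as above. The example shows the _fields attribute, tuple unpacking, the _replace() and _asdict() methods, and the fact that instances are immutable.

```python
from collections import namedtuple

Dog = namedtuple('Dog', 'age breed name')
dog = Dog(age=2, breed='Bulldog', name='Sammy')

print(Dog._fields)               # ('age', 'breed', 'name')
age, breed, name = dog           # unpacks like any tuple
older = dog._replace(age=3)      # returns a new instance; dog is unchanged
print(older, dog.age)
print(dog._asdict())             # field names mapped to values

try:
    dog.age = 5                  # named tuples are immutable
    assigned = True
except AttributeError as err:
    assigned = False
    print('cannot assign:', err)
```

Because instances cannot be modified in place, _replace() is the usual way to obtain an updated copy.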

We now go over one of the most important modules for data scientists, called 'numpy'. The 'numpy' (Numeric Python) module is an open-source add-on module to Python that provides common mathematical and numerical routines as pre-compiled, fast functions. It has grown into a highly mature package that provides functionality that meets, or perhaps exceeds, that associated with common commercial software like MATLAB. The package provides basic routines for manipulating large arrays and matrices of numeric data.

A good resource for the numpy introduction can be found at the following link: https://engineering.ucsb.edu/~shell/che210d/numpy.pdf. 

In [15]:
import numpy

The central feature of 'numpy' is the 'ndarray' object class, or simply **arrays** as we will call it. Arrays are similar to lists in Python, except that every element of an array must be of the same type, typically a numeric type like 'float' or 'int'. Arrays make operations with large amounts of numeric data very fast and are generally much more efficient than lists.

An array can be created from a list. Array elements can be accessed and sliced with the same bracket notation as lists. The class of the resulting object is 'ndarray':

In [16]:
a = numpy.array([1, 4, 5, 8], float)
print(a)
print(type(a))
print(a[0])
print(a[1])
print(a[2])
print(a[:2]) # the first two elements (indices 0 and 1; index 2 is excluded)
print(a[1:])

[1. 4. 5. 8.]
<class 'numpy.ndarray'>
1.0
4.0
5.0
[1. 4.]
[4. 5. 8.]


Arrays are by nature multidimensional. A two-dimensional array is a matrix, familiar from college linear algebra. A numpy array is therefore a grid of values, all of the same type, indexed by a tuple of nonnegative integers. In numpy, dimensions are called **axes**. The number of axes is called the **rank of the array** in older numpy documentation and is available as the 'ndim' attribute (note that this is not the same as the rank of a matrix in linear algebra). The **shape** of an array is a tuple of integers giving the size of the array along each dimension. Unlike nested lists, different axes are accessed using commas inside a single pair of brackets.

Below are some examples of matrices. Notice that array slicing works with multiple dimensions in the same way as usual, applying each slice specification as a filter to a specified dimension. Use of a single colon in a dimension indicates the use of everything along that dimension:

In [17]:
A = numpy.array([[1, 9, 3], [4, 8, 6]], float)
print(A)

[[1. 9. 3.]
 [4. 8. 6.]]


In [18]:
print(A[0,0])
print(A[0,1])
print(A[0,2])
print(A[1,0])
print(A[1,1])
print(A[1,2])
print(A[-1:,-2:])

1.0
9.0
3.0
4.0
8.0
6.0
[[8. 6.]]


Let's see a more complicated example:

In [19]:
B=numpy.array([[1.4,25,5.98,8.76,1], [2.11,3.434,4,7,8], [9,0,0,6.2,7.73], [11,45,6.98,0.93,0.9]])
print(B)
print(B[2,3])
print(B[2,:])
print(B[:,1])

[[ 1.4   25.     5.98   8.76   1.   ]
 [ 2.11   3.434  4.     7.     8.   ]
 [ 9.     0.     0.     6.2    7.73 ]
 [11.    45.     6.98   0.93   0.9  ]]
6.2
[9.   0.   0.   6.2  7.73]
[25.     3.434  0.    45.   ]


Since arrays are objects, they have properties associated with them. For example, the 'shape' property of an array returns a tuple with the size of each array dimension, whereas the 'dtype' property tells you what type of values are stored by the array. 

In [20]:
print(A.shape) # A has 2 rows and 3 columns 
print(A.dtype) #  float64 is a numeric type that NumPy uses to store double-precision (8-byte) real numbers
print(B.shape) # B has 4 rows and 5 columns
print(B.dtype)

(2, 3)
float64
(4, 5)
float64
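
Alongside 'shape' and 'dtype', a few other attributes describe an array. The sketch below uses a hypothetical 2 x 3 array M with the same values as the matrix A above, so that it can be run on its own.

```python
import numpy

# A hypothetical 2 x 3 array with the same values as the matrix A above.
M = numpy.array([[1, 9, 3], [4, 8, 6]], float)

print(M.ndim)       # 2: the number of axes
print(M.shape)      # (2, 3): the size along each axis
print(M.size)       # 6: the total number of elements
print(M.itemsize)   # 8: bytes per element for float64
```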


When used with an array, the len() function returns the length of the first axis. Moreover, the 'in' operator can be used to test if values are present in an array:

In [21]:
print(len(A)) # the number of rows
print(len(B)) # the number of rows
print(2 in A)

2
4
False


Arrays can be reshaped using tuples that specify new dimensions. In the following example, we turn a ten-element one-dimensional array into a two-dimensional one whose first axis has five elements and whose second axis has two elements:

In [22]:
C = numpy.array(range(10), float)
print(C)
C=C.reshape((5, 2))
print(C)

[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[[0. 1.]
 [2. 3.]
 [4. 5.]
 [6. 7.]
 [8. 9.]]


We know we can create arrays from existing lists. Conversely, lists can be created from one-dimensional arrays, either with the tolist() method or with the built-in list() function:

In [23]:
D = numpy.array([1, 2, 3], float)
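D.tolist() # returns a new list; the result is discarded here because it is not printed
print(type(D), D) # D itself is unchanged and is still an array
list(D) # the last expression in the cell, so the notebook displays its value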

<class 'numpy.ndarray'> [1. 2. 3.]


[1.0, 2.0, 3.0]

One can fill an array with a single value (by using the fill() method). This way, every value in that array is filled with a particular value. 

In [24]:
E = numpy.array([[1, 2, 3, 5, 9], [0, 4, 4, 7, 5]], float)
print(E, "\n")
E.fill(818)
print(E, "\n") 

[[1. 2. 3. 5. 9.]
 [0. 4. 4. 7. 5.]] 

[[818. 818. 818. 818. 818.]
 [818. 818. 818. 818. 818.]] 



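Transposed versions of arrays can also be generated with the transpose() method (or the equivalent 'T' attribute). For a two-dimensional array this switches the two axes, turning rows into columns; in general, transpose() with no arguments reverses the order of all axes. The result is a view of the same data rather than an independent copy: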

In [25]:
print('A is: \n', A, '\n')
A1=A.transpose()
print('A1 is the transposed matrix of A: \n', A1, '\n')

A is: 
 [[1. 9. 3.]
 [4. 8. 6.]] 

A1 is the transposed matrix of A: 
 [[1. 4.]
 [9. 8.]
 [3. 6.]] 
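
One consequence worth knowing is that transpose() does not copy the data. The following sketch, using a hypothetical array M with the same values as A, checks this with numpy.shares_memory() and shows how to obtain an independent copy when one is needed.

```python
import numpy

M = numpy.array([[1, 9, 3], [4, 8, 6]], float)
Mt = M.transpose()              # M.T is an equivalent shorthand

shared = numpy.shares_memory(M, Mt)
print(shared)                   # True: no data was copied

Mt[0, 1] = -4.0                 # writing through the view...
print(M[1, 0])                  # ...changes the original array: -4.0

Mc = M.transpose().copy()       # an independent transposed copy
Mc[0, 1] = 100.0
print(M[1, 0])                  # still -4.0
```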



We can also flatten and concatenate arrays. One-dimensional versions of multidimensional arrays can be generated with the flatten() method. Two or more arrays can be concatenated together using the concatenate() function with a tuple of the arrays to be joined. If an array has more than one dimension, it is possible to specify the axis along which multiple arrays are concatenated. By default (without specifying the axis), 'numpy' concatenates along the first dimension. We recommend always specifying the axis argument. Below are some examples:

In [26]:
A2= numpy.array([[1, 2, 3], [4, 5, 6]], float)
print(A2.shape)
print('Flattening A2: ', A2.flatten())

(2, 3)
Flattening A2:  [1. 2. 3. 4. 5. 6.]


In [27]:
A3 = numpy.array([1,2], float)
A4 = numpy.array([3,4,5,6], float)
A5 = numpy.array([7,8,9,10,11,12,13], float)
print('Concatenating all arrays (A3-A5) together: ', numpy.concatenate((A3, A4, A5)), '\n')
B1 = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], float)
B2 = numpy.array([[10, 11, 12], [13, 14, 15], [16, 17, 18]], float)
print('Concatenating along the first axis: \n', numpy.concatenate((B1,B2), axis=0)) # default is axis=0
print('Concatenating along the second axis: \n', numpy.concatenate((B1,B2), axis=1))

Concatenating all arrays (A3-A5) together:  [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13.] 

Concatenating along the first axis: 
 [[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]
 [13. 14. 15.]
 [16. 17. 18.]]
Concatenating along the second axis: 
 [[ 1.  2.  3. 10. 11. 12.]
 [ 4.  5.  6. 13. 14. 15.]
 [ 7.  8.  9. 16. 17. 18.]]


Finally, the dimensionality of an array can be increased using the 'newaxis' constant in bracket notation:

In [28]:
C1 = numpy.array([1, 2, 3], float)
print(C1)
print(C1[:,numpy.newaxis])
print(C1[:,numpy.newaxis].shape)

[1. 2. 3.]
[[1.]
 [2.]
 [3.]]
(3, 1)


The above commands are basic for creating mathematical matrices that we usually deal with in college-level linear algebra classes. There are other ways to create arrays too, such as using the arange() function. We can also create some special matrices in the 'numpy' module. For example, the functions zeros() and ones() create new arrays of specified dimensions filled with these values. The identity() function creates identity matrices. The eye() function returns matrices with ones along the kth diagonal. Below are some examples:

In [29]:
print(numpy.arange(7))
print('ones: \n', numpy.ones((2,3), dtype=float)) # creating a 2 by 3 matrix with ones
print('zeros: \n', numpy.zeros(7, dtype=int)) # creating a row vector of 0's with a length of 7
print('identity matrix: \n', numpy.identity(4, dtype=float)) # creating an identity matrix
print('eye matrix: \n', numpy.eye(5, k=2, dtype=float)) # ones along the diagonal two positions above the main diagonal (k=2)

[0 1 2 3 4 5 6]
ones: 
 [[1. 1. 1.]
 [1. 1. 1.]]
zeros: 
 [0 0 0 0 0 0 0]
identity matrix: 
 [[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
eye matrix: 
 [[0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]


Based on the previous examples, you may have suspected that the 'numpy' module is something like a mini-version of MATLAB or PROC IML in SAS. Indeed, 'numpy' provides a powerful collection of objects and functions that handle basic matrix algebra. We now go over the arithmetic aspects of arrays. We will take a look at how we add, subtract, and perform other more complicated operations on matrices and arrays.

When standard mathematical operations are used with arrays, they are applied on an element-by-element basis. This means that the arrays should normally have the same shape for addition, subtraction, etc. In particular, multiplication with * gives the Hadamard (element-wise) product, not the matrix product. If the two shapes are incompatible, 'numpy' raises a 'ValueError'. Below are some examples:

In [30]:
D1= numpy.array([1,2,3,4], float)
D2= numpy.array([5,2,6,1], float)
print(D1+D2)
print(D1-D2)
print(D1*D2) # multiplication elementwise
print(D1/D2)
print(D1%D2) # modulus
print(D1**D2) # exponent
D3 = numpy.array([[1,2], [3,4]], float)
D4 = numpy.array([[2,0], [1,3]], float)
print('for more than one dimensions: \n', D3*D4)

[6. 4. 9. 5.]
[-4.  0. -3.  3.]
[ 5.  4. 18.  4.]
[0.2 1.  0.5 4. ]
[1. 0. 3. 0.]
[  1.   4. 729.   4.]
for more than one dimensions: 
 [[ 2.  0.]
 [ 3. 12.]]
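
To see what happens when the shapes are incompatible, here is a minimal sketch with two hypothetical one-dimensional arrays of lengths 3 and 2. The addition is wrapped in try/except so that the error can be inspected.

```python
import numpy

x = numpy.array([1, 2, 3], float)
y = numpy.array([10, 20], float)

try:
    x + y
    outcome = 'no error'
except ValueError as err:
    outcome = 'ValueError'
    print('shapes', x.shape, 'and', y.shape, 'cannot be combined:', err)

print(outcome)   # ValueError
```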


There are some caveats here worth mentioning. Arrays whose shapes do not match are not always rejected: when the shapes are compatible, 'numpy' **broadcasts** the smaller array to perform the operation. Two shapes are compatible when, comparing dimensions from the last one backwards, each pair is either equal or one of them is 1. Broadcasting means that the smaller array is conceptually repeated as necessary to match the larger one. Consider the following:

In [31]:
D5 = numpy.array([[1, 2], [3, 4], [5, 6]], float)
D6 = numpy.array([-1, 3], float)
print(D5.shape)
print(D6.shape)
print((D5+D6).shape)
print(D5+D6)

(3, 2)
(2,)
(3, 2)
[[0. 5.]
 [2. 7.]
 [4. 9.]]


Here, the one-dimensional array D6 was broadcast to a two-dimensional array that matched the size of D5. In essence, D6 was repeated for each row of D5. The bottom line is that we highly recommend making sure arrays are conformable before operating on them, so that broadcasting never happens by accident.
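
Broadcasting along the other direction needs one extra step. In the sketch below, the array 'col' is a hypothetical length-3 vector that we want to add to each column of a (3, 2) array; its shape (3,) is not compatible with (3, 2), so an axis is added first with 'newaxis'.

```python
import numpy

D5 = numpy.array([[1, 2], [3, 4], [5, 6]], float)   # shape (3, 2)
col = numpy.array([10, 20, 30], float)              # shape (3,)

# D5 + col would fail: the trailing dimensions 2 and 3 do not match.
# Adding a second axis gives shape (3, 1), which broadcasts across columns.
result = D5 + col[:, numpy.newaxis]
print(result)
# [[11. 12.]
#  [23. 24.]
#  [35. 36.]]
```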

In addition to the standard operators, the 'numpy' module offers a large library of common mathematical functions that can be applied elementwise to arrays. Among these are abs, sign, sqrt, log, log10, exp, sin, cos, tan, arcsin, arccos, arctan, sinh, cosh, tanh, arcsinh, arccosh, and arctanh.

In addition, the functions floor(), ceil(), and rint() give the lower, upper, or nearest (rounded) integer. Also included in the module are two important mathematical constants 'pi' and 'e'. 

Below are some examples:

In [32]:
D7= numpy.array([1.12, 1.50, 1.92], float)
print(numpy.floor(D7))
print(numpy.e)

[1. 1. 1.]
2.718281828459045
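
A few more of the functions listed above can be tried in the same way. This is a small sketch on the same values as D7; the variable names are chosen only for this example.

```python
import numpy

vals = numpy.array([1.12, 1.50, 1.92], float)
upper = numpy.ceil(vals)         # smallest integer >= each value
nearest = numpy.rint(vals)       # nearest integer; exact halves go to the even one
roots = numpy.sqrt(numpy.array([1, 4, 9], float))
peak = numpy.sin(numpy.pi / 2)   # trigonometric functions work in radians

print(upper)     # [2. 2. 2.]
print(nearest)   # [1. 2. 2.]
print(roots)     # [1. 2. 3.]
print(peak)      # 1.0
```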


Now we continue with our discussion of array operations. Many functions exist for extracting whole-array properties. The items in an array can be summed or multiplied with sum() and prod(). A number of routines compute statistical quantities, such as the mean (average), variance, and standard deviation; by default var() and std() compute the population versions (dividing by n). In addition, the argmin() and argmax() functions return the array indices of the minimum and maximum values:

In [33]:
D8 = numpy.array([2, 1, 9], float)
print(D8.mean(), D8.var(), D8.std(), D8.max(), sep=', ')

4.0, 12.666666666666666, 3.559026084010437, 9.0
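
The remaining functions mentioned above can be sketched on the same array D8. The 'ddof' argument shown at the end is the standard way to request the sample variance (dividing by n-1) instead of the default.

```python
import numpy

D8 = numpy.array([2, 1, 9], float)

total = D8.sum()
product = D8.prod()
print(total, product)                 # 12.0 18.0

lowest = D8.argmin()
highest = D8.argmax()
print(lowest, highest)                # 1 2 (positions of the values 1 and 9)

sample_var = D8.var(ddof=1)           # divides by n-1 instead of n
print(sample_var)                     # approximately 19.0
```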


For multidimensional arrays, each of the functions thus far described can take an optional argument axis that will perform an operation along only the specified axis, placing the results in a return array:

In [34]:
D9 = numpy.array([[0, 2], [3, -1], [3, 5]], float)
print(D9.mean(axis=0), D9.mean(axis=1), sep=', ')
print(D9.min(axis=0), D9.min(axis=1), sep=', ')

[2. 2.], [1. 1. 4.]
[ 0. -1.], [ 0. -1.  3.]


Arrays and lists bear a lot of similarities. For example, they both can be sorted and they both can be iterated over. For multidimensional arrays, iteration proceeds over the first axis such that each loop returns a subsection of the array. Moreover, multiple assignment can also be used with array iteration. Below are some examples:

In [35]:
D10 = numpy.array([6.1, 2, 5, -1, 0.02], float)
print('\nThe one-dimensional array D10 is given by: \n', D10)
print('\nThe multidimensional array D9 is given by: \n', D9)
print('\nMultiple assignment:')
for (i,j) in D9:
    print(i+j)


The one-dimensional array D10 is given by: 
 [ 6.1   2.    5.   -1.    0.02]

The multidimensional array D9 is given by: 
 [[ 0.  2.]
 [ 3. -1.]
 [ 3.  5.]]

Multiple assignment:
2.0
2.0
8.0
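
Sorting, which was mentioned above but not shown, can be sketched as follows on the same values as D10. The example contrasts numpy.sort(), which returns a sorted copy, with the sort() method, which works in place, and also shows argsort().

```python
import numpy

D10 = numpy.array([6.1, 2, 5, -1, 0.02], float)

sorted_copy = numpy.sort(D10)     # returns a sorted copy; D10 is unchanged
order = D10.argsort()             # the indices that would sort the array
print(sorted_copy)                # [-1.    0.02  2.    5.    6.1 ]
print(order)                      # [3 4 1 2 0]

D10.sort()                        # sorts in place and returns None
print(D10)
```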


Unique elements can be extracted from a one-dimensional array, and for two-dimensional arrays, the diagonal can be extracted:

In [36]:
D11 = numpy.array([1,1,2,3,3,5,1,1], float)
print('unique elements in the one-dimensional array: ', numpy.unique(D11))
D12 = numpy.array([[1, 2], [3, 4]], float)
print('diagonal elements in multidimensional array: ', D12.diagonal())

unique elements in the one-dimensional array:  [1. 2. 3. 5.]
diagonal elements in multidimensional array:  [1. 4.]


Unlike lists, which are compared as a whole and yield a single True or False, arrays of equal size are compared element-wise by Boolean comparisons. The return value is an array of Boolean True/False values. Moreover, the results of a Boolean comparison can be stored in an array:

In [37]:
D13 = numpy.array([1, 3, 0], float)
D14 = numpy.array([0, 3, 2], float)
print(D13>D14)
judge = D13>D14
print(type(judge))

[ True False False]
<class 'numpy.ndarray'>


The built-in any() and all() functions can be used to determine whether any or all elements of a one-dimensional Boolean array are true:

In [38]:
D15 = numpy.array([ True, False, False], bool)
print(any(D15))
print(all(D15))

True
False
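
For arrays with more than one dimension, the array methods any() and all() are the safer choice, since the built-in functions iterate over the first axis only. A minimal sketch with a hypothetical 2 x 2 Boolean array:

```python
import numpy

flags = numpy.array([[True, False], [True, True]], bool)

any_true = flags.any()            # over the whole array
all_true = flags.all()
by_column = flags.all(axis=0)     # one result per column
by_row = flags.all(axis=1)        # one result per row

print(any_true, all_true)         # True False
print(by_column)                  # [ True False]
print(by_row)                     # [False  True]
```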


Compound Boolean expressions can be applied to arrays on an element-by-element basis using the special functions logical_and(), logical_or(), and logical_not(). The outputs of these functions are all arrays:

In [39]:
B3 = numpy.array([1, 3, 0], float)
print(numpy.logical_and(B3 > 0, B3 < 3))
B4 = numpy.array([True, False, True], bool)
print(numpy.logical_not(B4))
B5 = numpy.array([False, True, False], bool)
print(numpy.logical_or(B4, B5))

[ True False False]
[False  True False]
[ True  True  True]


The where() function forms a new array from two arrays of equivalent size using a Boolean filter to choose between elements of the two. Its basic syntax is where(boolarray, truearray, falsearray); scalars may be used in place of either array, as in the example below:

In [40]:
print('D1 =', D1)
numpy.where(D1>=2, 1, 0) # if True then assign 1, 0 otherwise

D1 = [1. 2. 3. 4.]


array([0, 1, 1, 1])
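
The three-argument form with two full arrays, and the one-argument form, can be sketched like this. The arrays x and y are hypothetical and chosen only for illustration.

```python
import numpy

x = numpy.array([1, 2, 3, 4], float)
y = numpy.array([10, 20, 30, 40], float)

picked = numpy.where(x >= 3, x, y)   # take from x where True, from y otherwise
print(picked)                        # [10. 20.  3.  4.]

idx = numpy.where(x >= 3)            # one argument: same as (x >= 3).nonzero()
print(idx[0])                        # [2 3]
```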

A number of functions allow testing of the values in an array. The nonzero() function gives a tuple of indices of the nonzero values in an array. The number of items in the tuple equals the number of axes of the array:

In [41]:
B6 = numpy.array([0,0,3,4,9,33,21,2,56,0,1])
print(B6.nonzero())
B7 = numpy.array([[0, 1], [3, 0], [9, 8]], float)
print(B7.nonzero())
print(type(B3.nonzero())) # tuple

(array([ 2,  3,  4,  5,  6,  7,  8, 10], dtype=int64),)
(array([0, 1, 2, 2], dtype=int64), array([1, 0, 0, 1], dtype=int64))
<class 'tuple'>


It is also possible to test whether or not values are NaN ("not a number") or finite:

In [42]:
B8 = numpy.array([1, numpy.nan, numpy.inf], float) # lowercase names; the NaN and Inf aliases were removed in NumPy 2.0
print(B8)
B9 = numpy.isnan(B8)
print(B9)
B10 = numpy.isfinite(B8)
print(B10)

[ 1. nan inf]
[False  True False]
[ True False False]
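
A common use of these tests is to filter out unusable values before computing statistics. The sketch below uses hypothetical data; it relies on Boolean selection, which is covered in more detail shortly, and on the nanmean() function.

```python
import numpy

data = numpy.array([1.0, numpy.nan, 3.0, numpy.inf], float)

finite_only = data[numpy.isfinite(data)]   # keep only the finite entries
print(finite_only)                         # [1. 3.]

same = (numpy.nan == numpy.nan)            # NaN never compares equal to itself
print(same)                                # False, so use isnan() instead of ==

mean_ignoring_nan = numpy.nanmean(numpy.array([1.0, numpy.nan, 3.0]))
print(mean_ignoring_nan)                   # 2.0
```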


We have already seen that, like lists, individual elements and slices of arrays can be selected using bracket notation. Unlike lists, however, arrays also permit selection using other arrays. That is, we can use 'array selectors' to filter for specific subsets of elements of other arrays.

For example, Boolean arrays can be used as array selectors. For instance, suppose a is an arbitrary array. If we pass the Boolean array given by a>=6 inside the brackets of a, a one-dimensional array containing only the elements where the condition is True is returned. Below is an example:

In [43]:
B11 = numpy.array([[6, 4, 8, 6], [5, 9, 3, 9]], float)
sel = (B11 >= 6)
print(B11[sel])

[6. 8. 6. 9. 9.]


In addition to Boolean selection, it is possible to select using integer arrays. Here, the integer arrays contain the indices of the elements to be taken from an array. In addition, lists can also be used as selection arrays. Consider the following one-dimensional example:

In [44]:
C2= numpy.array([2, 4, 6, 8], float)
C3 = numpy.array([0, 0, 1, 3, 2, 1], int) # this has to be integer because we are using it to index
print(C2[C3]) # using integer arrays to index
print(C2[[0, 0, 1, 3, 2, 1]]) # using lists to index

[2. 2. 4. 8. 6. 4.]
[2. 2. 4. 8. 6. 4.]


For multidimensional arrays, we have to send multiple one-dimensional integer arrays to the selection bracket, one for each axis. Then, each of these selection arrays is traversed in sequence. Below is an example that shows step-by-step indexing when we encounter multidimensional arrays:

In [45]:
C4 = numpy.array([[1, 4], [9, 16]], float)
C5 = numpy.array([0, 0, 1, 1, 0], int)
C6 = numpy.array([0, 1, 1, 1, 1], int)
print(C4[C5,])
print(C4[C5,C6])

[[ 1.  4.]
 [ 1.  4.]
 [ 9. 16.]
 [ 9. 16.]
 [ 1.  4.]]
[ 1.  4. 16. 16.  4.]


A special function take() is also available to perform selection with integer arrays. This works in the same way as bracket selection. The take() function also provides an axis argument such that subsections of a multidimensional array can be taken across a given dimension.

In [46]:
C7 = numpy.array([2, 4, 6, 8], float)
C8 = numpy.array([0, 0, 1, 3, 2, 1], int)
print('take() for one-dimensional array: \n', C7.take(C8))
C9 = numpy.array([[0, 1], [2, 3]], float)
C10 = numpy.array([0, 0, 1], int)
print('take() using the axis argument: \n', C9.take(C10, axis=0))
print('take() using the axis argument: \n', C9.take(C10, axis=1))

take() for one-dimensional array: 
 [2. 2. 4. 8. 6. 4.]
take() using the axis argument: 
 [[0. 1.]
 [0. 1.]
 [2. 3.]]
take() using the axis argument: 
 [[0. 0. 1.]
 [2. 2. 3.]]


The opposite of the take() function is the put() function, which will take values from a source array and place them at specified indices in the array calling put():

In [47]:
C11 = numpy.array([0, 1, 2, 3, 4, 5], float)
C12 = numpy.array([9, 8, 7], float)
C11.put([0, 3], C12)
print(C11)

[9. 1. 2. 8. 4. 5.]


Note that the value 7 from the source array C12 is not used, since only two indices [0, 3] are specified. The source array will be repeated as necessary if not the same size, see below:

In [48]:
C11 = numpy.array([0, 1, 2, 3, 4, 5], float)
C11.put([0, 3], 5)
print(C11)

[5. 1. 2. 5. 4. 5.]


We now turn to matrix algebra. The package 'numpy' provides many functions for performing standard vector and matrix multiplication routines, such as dot products and inner, outer, and cross products. In addition, the package comes with a number of built-in routines for linear algebra calculations. These can be found in the sub-module 'linalg'. Among these are routines for computing inverses, determinants, eigenvalues, and singular value decompositions. Below are some examples:

In [49]:
v1 = numpy.array([1, 2, 3], float)
v2 = numpy.array([0, 1, 1], float)
print('v1.v2: ', numpy.dot(v1, v2), '\n')
v3 = numpy.array([[0, 1], [2, 3]], float)
v4 = numpy.array([2, 3], float)
v5 = numpy.array([[1, 1], [4, 0]], float)
print('v3.v4: \n', numpy.dot(v3, v4), '\n')
print('v3.v5: \n', numpy.dot(v3, v5), '\n')
print('v5.v3: \n', numpy.dot(v5, v3), '\n')

v1.v2:  5.0 

v3.v4: 
 [ 3. 13.] 

v3.v5: 
 [[ 4.  0.]
 [14.  2.]] 

v5.v3: 
 [[2. 4.]
 [0. 4.]] 



In [50]:
a = numpy.array([1, 4, 0], float)
b = numpy.array([2, 2, 1], float)
print(numpy.outer(a, b))
print(numpy.inner(a, b))
print(numpy.cross(a, b))

[[2. 2. 1.]
 [8. 8. 4.]
 [0. 0. 0.]]
10.0
[ 4. -1. -6.]


In [51]:
A = numpy.array([[4, 2, 0], [9, 3, 7], [1, 2, 1]], float)
print(A)
print(numpy.linalg.det(A))
print(numpy.linalg.inv(A))
vals, vecs = numpy.linalg.eig(A) # getting the eigenvalues and eigenvectors
print('eigenvalues: ', vals)
print('eigenvectors: ', vecs)


[[4. 2. 0.]
 [9. 3. 7.]
 [1. 2. 1.]]
-48.00000000000003
[[ 0.22916667  0.04166667 -0.29166667]
 [ 0.04166667 -0.08333333  0.58333333]
 [-0.3125      0.125       0.125     ]]
eigenvalues:  [ 8.85591316  1.9391628  -2.79507597]
eigenvectors:  [[-0.3663565  -0.54736745  0.25928158]
 [-0.88949768  0.5640176  -0.88091903]
 [-0.27308752  0.61828231  0.39592263]]
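
A quick way to sanity-check these results is to substitute them back into their defining equations. The sketch below does so for the inverse and the eigen-decomposition, and also illustrates the singular value decomposition mentioned earlier. The names `inverse_ok`, `eig_ok`, and `svd_ok` are illustrative, and numpy.allclose() is used because floating-point results are only accurate up to rounding error:

```python
import numpy

A = numpy.array([[4, 2, 0], [9, 3, 7], [1, 2, 1]], float)

# A times its inverse should be the identity, up to rounding error.
A_inv = numpy.linalg.inv(A)
inverse_ok = numpy.allclose(A @ A_inv, numpy.eye(3))

# Each column of vecs is an eigenvector: A v = lambda v.
vals, vecs = numpy.linalg.eig(A)
eig_ok = numpy.allclose(A @ vecs, vecs * vals)

# Singular value decomposition: A = U diag(s) Vt.
U, s, Vt = numpy.linalg.svd(A)
svd_ok = numpy.allclose(U @ numpy.diag(s) @ Vt, A)

print(inverse_ok, eig_ok, svd_ok)
```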


The 'numpy' package also supplies functions for working with polynomials. For example, given a set of roots, the poly() function returns the coefficients of the corresponding polynomial:

In [52]:
print(numpy.poly([-1, 1, 1, 10]))

[  1. -11.   9.  11. -10.]


Here, the returned array gives the coefficients of $x^4 - 11x^3 + 9x^2 + 11x - 10$, ordered from the highest power down to the constant term.

The opposite operation can also be performed: given a set of coefficients, the roots() function returns all of the polynomial roots:

In [53]:
print(numpy.roots([1, 4, -2, 3]))

[-4.5797401 +0.j          0.28987005+0.75566815j  0.28987005-0.75566815j]


Notice here that two of the roots of $x^3+4x^2-2x+3$ are complex. Because the coefficients are real, they form a conjugate pair.
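
Since poly() and roots() are inverse operations, one can be used to check the other. The following is a minimal sketch of such a round trip; the variable names are illustrative:

```python
import numpy

coeffs = [1, 4, -2, 3]            # x^3 + 4x^2 - 2x + 3
roots = numpy.roots(coeffs)

# Rebuilding the polynomial from its roots recovers the coefficients.
rebuilt = numpy.poly(roots)
print(rebuilt)

# Evaluating the polynomial at each root gives (numerically) zero.
residuals = numpy.polyval(coeffs, roots)
print(numpy.allclose(residuals, 0))
```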

Coefficient arrays of polynomials can be integrated. Consider integrating $x^2-2x+1$. By default, the constant after integration is set to zero:

In [54]:
print(numpy.polyint([1, -2, 1]))

[ 0.33333333 -1.          1.          0.        ]
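
A different integration constant can be supplied through the optional `k` argument of polyint(), and the function polyder() performs the reverse operation, differentiation. A brief sketch (the names `p`, `P`, and `dP` are illustrative):

```python
import numpy

p = [1, -2, 1]                      # x^2 - 2x + 1

# The optional k argument sets the integration constant.
P = numpy.polyint(p, k=3)           # x^3/3 - x^2 + x + 3
print(P)

# polyder() differentiates, recovering the original coefficients.
dP = numpy.polyder(P)
print(dP)
```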


The functions polyadd(), polysub(), polymul(), and polydiv() also handle proper addition, subtraction, multiplication, and division of polynomial coefficients, respectively. 
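
As a small illustration, take $x^2 - 1$ and $x - 1$. The sketch below applies each of the four functions to their coefficient arrays; the variable names are illustrative. Note that polydiv() returns two arrays, the quotient and the remainder:

```python
import numpy

p = [1, 0, -1]    # x^2 - 1
q = [1, -1]       # x - 1

total = numpy.polyadd(p, q)                   # x^2 + x - 2
difference = numpy.polysub(p, q)              # x^2 - x
product = numpy.polymul(p, q)                 # x^3 - x^2 - x + 1
quotient, remainder = numpy.polydiv(p, q)     # x + 1, remainder 0

print(total, difference, product)
print(quotient, remainder)
```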

The function polyval() evaluates a polynomial at a particular point. The polyfit() function can be used to fit a polynomial of specified order to a set of data using a least-squares approach. The return value is a set of polynomial coefficients:

In [55]:
print(numpy.polyval([1,5,-6,12],1))
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 2, 1, 3, 7, 10, 11, 19]
print(numpy.polyfit(x, y, 2)) # degree of 2 (so up to x^2)

12
[ 0.375      -0.88690476  1.05357143]
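
The two functions combine naturally: the coefficients returned by polyfit() can be passed to polyval() to obtain fitted values. The sketch below does this for the same data and computes the sum of squared errors that the least-squares fit minimizes. The names `fitted` and `sse` are illustrative:

```python
import numpy

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 2, 1, 3, 7, 10, 11, 19]

coeffs = numpy.polyfit(x, y, 2)       # least-squares quadratic
fitted = numpy.polyval(coeffs, x)     # fitted values at each x
sse = numpy.sum((numpy.array(y) - fitted) ** 2)

print('fitted values: ', fitted)
print('sum of squared errors: ', sse)
```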


Lastly, we go over the statistical functionalities of the package. Here are some examples:

In [56]:
a = numpy.array([1, 4, 3, 8, 9, 2, 3], float)
print('Median: ', numpy.median(a))
a = numpy.array([[1, 2, 1, 3], [5, 3, 1, 8]], float) 
c = numpy.corrcoef(a) # correlation coefficients matrix
print('Correlation coefficients: \n', c)
print('Covariance: \n', numpy.cov(a))

Median:  3.0
Correlation coefficients: 
 [[1.         0.72870505]
 [0.72870505 1.        ]]
Covariance: 
 [[0.91666667 2.08333333]
 [2.08333333 8.91666667]]
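
To see how these quantities relate, note that cov() treats each row as a variable and divides by $n-1$ by default, whereas var() divides by $n$ unless `ddof=1` is given. The sketch below checks that the diagonal of the covariance matrix holds the sample variances, and that the correlation matrix is the covariance matrix scaled by the standard deviations. The variable names are illustrative:

```python
import numpy

a = numpy.array([[1, 2, 1, 3], [5, 3, 1, 8]], float)

row_means = numpy.mean(a, axis=1)            # mean of each row
sample_var = numpy.var(a, axis=1, ddof=1)    # divides by n - 1
cov = numpy.cov(a)

# The diagonal of the covariance matrix holds the sample variances.
print(numpy.allclose(numpy.diag(cov), sample_var))

# Correlation = covariance scaled by the two standard deviations.
sd = numpy.sqrt(sample_var)
corr = cov / numpy.outer(sd, sd)
print(numpy.allclose(corr, numpy.corrcoef(a)))
```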


In [57]:
numpy.random.seed(293423)
r1=numpy.random.rand(5) # generating 5 numbers from a uniform distribution on [0,1)
print(r1)
print(type(r1))
r2=numpy.random.rand(2,5) # generating a random matrix
print(r2)

[0.33677247 0.52693437 0.79529578 0.78867702 0.02147624]
<class 'numpy.ndarray'>
[[0.84612516 0.0704939  0.1526965  0.77831701 0.80821151]
 [0.82198398 0.90239653 0.8385685  0.02638565 0.33681448]]


In [58]:
print(numpy.random.poisson(6.0)) # generating a single random number from a Poisson distribution with lambda=6
print(numpy.random.normal(1.5, 4.0, size=5)) # generating 5 numbers from a normal distribution with mean=1.5 and std=4
print(numpy.random.randn(5)) # generating 5 numbers from a standard normal distribution

5
[10.55299897 -2.27683792  6.80054692 -1.70011158  0.15193698]
[ 0.00696034  0.02353519 -1.87609929  0.59609852  0.30850929]
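
Newer versions of 'numpy' (1.17 and later) also offer a generator-based interface, numpy.random.default_rng(), which keeps the random state in an object rather than in a global seed. The sketch below assumes such a version is installed. The names `rng_a` and `rng_b` are illustrative, and the numbers produced will differ from those obtained with numpy.random.seed() above:

```python
import numpy

# Two generators created with the same seed yield identical streams.
rng_a = numpy.random.default_rng(293423)
rng_b = numpy.random.default_rng(293423)

u_a = rng_a.random(5)                       # uniform on [0, 1)
u_b = rng_b.random(5)
print(numpy.array_equal(u_a, u_b))

normals = rng_a.normal(1.5, 4.0, size=5)    # mean 1.5, std 4
count = rng_a.poisson(6.0)                  # one Poisson draw, lambda = 6
print(normals, count)
```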


The 'numpy.random' sub-module can also be used to randomly shuffle the order of items in a list. Note that shuffle() reorders the list in place and does not return a new one:

In [59]:
l=list(range(10))
print('before shuffle: ', l)
numpy.random.shuffle(l)
print('after shuffle: ', l)

before shuffle:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
after shuffle:  [7, 5, 2, 9, 4, 0, 8, 3, 6, 1]
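
If the original order must be preserved, numpy.random.permutation() can be used instead: it returns a shuffled copy (as an array) and leaves its input untouched. A minimal sketch, with `shuffled` as an illustrative name:

```python
import numpy

l = list(range(10))

# permutation() returns a shuffled copy and leaves the input unchanged.
shuffled = numpy.random.permutation(l)
print('original: ', l)
print('shuffled copy: ', shuffled)
```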


In all, the 'numpy' package contains many other built-in functions that we have not covered here. In particular, there are routines for discrete Fourier transforms, more complex linear algebra operations, size / shape / type testing of arrays, splitting and joining arrays, histograms, creating arrays of numbers spaced in various ways, creating and evaluating functions on grid arrays, treating arrays with special (NaN, Inf) values, set operations, creating various kinds of special matrices, and evaluating special mathematical functions (e.g., Bessel functions). In addition, the closely related package 'SciPy' greatly extends the functionality of the 'numpy' routines. Together, these two packages are heavily used by data scientists.