# Data Analysis in Python

> Data analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.



## Numpy

> Python by itself doesn't have any built-in data types for multidimensional arrays or matrices. Any direct (matrix multiplication) or indirect (image manipulation) operations that require arrays are usually done with the numpy package.

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. Numpy is one of the most commonly used and best supported libraries in Python and is the basis of most of its numerical tools (most tools that we'll address from now on require it).

Numpy is similar to MATLAB in that it is a fairly low-level tool (many packages, such as pandas, are built on top of numpy and offer higher-level tools).

### Arrays

Arrays are the most common data types in data science. A numpy array is a grid of values, all of the **same type**, and is **indexed by a tuple of nonnegative integers**.

Contrary to lists, arrays have a fixed size, which is set when the array is created. This allows them to be much faster and more memory-efficient than other Python data structures.

An array has two main attributes:

- Its `dtype`, indicating the data type of **all elements** in the array.
- Its `shape`, which is a tuple showing the size (or length) of each dimension.
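
As a minimal sketch of these two attributes (the arrays `M` and `mixed` are hypothetical examples, not part of the tutorial's running example), we can inspect `dtype` and `shape` directly:

```python
import numpy as np

# Hypothetical example array with 2 rows and 3 columns.
M = np.array([[1.5, 2.0, 3.5],
              [4.0, 5.5, 6.0]])

print(M.dtype)   # data type shared by all elements
print(M.shape)   # (rows, columns)

# Mixing ints and floats still yields a single dtype: the ints are upcast.
mixed = np.array([1, 2, 3.0])
print(mixed.dtype)
```

Note that `mixed` ends up with a floating point dtype even though two of its values were written as ints.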

In mathematical terms:

- A 0D array, i.e. a number, is called a **scalar**.
- A 1D array is called a **vector**.
- A 2D array is called a **matrix**.
- A 3D (or higher-dimensional) array is called a **tensor**.
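
The terms above map onto the `ndim` attribute of an array. The following is only an illustrative sketch; the variable names are made up for this example:

```python
import numpy as np

scalar = np.array(5)                   # 0D
vector = np.array([1, 2, 3])           # 1D
matrix = np.array([[1, 2], [3, 4]])    # 2D
tensor = np.zeros((2, 3, 4))           # 3D

for name, arr in [('scalar', scalar), ('vector', vector),
                  ('matrix', matrix), ('tensor', tensor)]:
    print(name, arr.ndim, arr.shape)
```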

#### 1D Arrays

A one-dimensional array can be thought of as a list that has a predefined size (i.e. length) and can contain only elements of a single data type. 

Let's dive right in. We'll first define the following array in numpy:

$$ A =\left( \begin{array}{ccc}
1 & 2 & 3\end{array} \right) $$

In [1]:
from __future__ import print_function  # for python 2-3 compatibility
import numpy as np  # because we use numpy often, it's more convenient to refer to it as np

A = np.array([1, 2, 3])  # this is how we define and initialize a known array
A

array([1, 2, 3])

But, what **type** of an object is *A*?

In [2]:
print(type(A))

<class 'numpy.ndarray'>


`ndarray` stands for **N-dimensional array**. This is a generalized array that can have any number of dimensions and is numpy's main data structure. We'll see more further down.

How can I access a **single element** in *A*?

In [3]:
print(A[0], A[1], A[2])

1 2 3


Indexing works like it did with lists. As in most data types in python, the indexing in numpy arrays starts from 0.

We said before that an array must contain elements of the same type.  

What is the **data type of the elements** in *A*?

In [4]:
print(type(A[0]))

<class 'numpy.int32'>


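They're not regular Python integers, but numpy's own 32-bit integer data type (the default integer width depends on the platform, so you may see `int64` instead). We could also see this from the array's built-in `dtype` attribute.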

In [5]:
print(A.dtype)

int32


Numpy supports a lot of different **data types**: signed and unsigned integers (8, 16, 32 and 64 bit), floats (half precision - 16, single - 32 and double - 64 bit), complex numbers (64 and 128 bit) and others.
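
A data type can be requested explicitly through the `dtype` argument of `np.array`. The sketch below uses three hypothetical arrays to show how the choice affects the size of each element:

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int8)       # 8-bit signed integers
real = np.array([1, 2, 3], dtype=np.float32)     # single precision floats
cplx = np.array([1, 2, 3], dtype=np.complex128)  # two 64-bit floats per element

for arr in (small, real, cplx):
    print(arr.dtype, arr.itemsize)  # itemsize is the size of one element in bytes
```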

Now that we got that out of the way, how can we **change an element** in *A*?

In [6]:
A[0] = 2.9; A[1] = 5; A[2] = 7.1
print(A)

[2 5 7]


Note that because `A` is an array of ints, it casts the values we assigned to it to ints, dropping the fractional part (rather than changing its data type to suit the values).
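
If the fractional part matters, one option is to convert the array to a floating point type first. This is a small sketch with hypothetical arrays `ints` and `floats`:

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 2.9                # the value is cast to fit the integer dtype
print(ints)

floats = ints.astype(float)  # astype returns a new array with the requested dtype
floats[0] = 2.9              # now the fractional part is kept
print(floats)
```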

Slicing works like lists too.

In [12]:
print(A[:2])   # print the first two elements

print(A[-2:])  # print the last two elements

print(A[-3:1])

[2 5]
[5 7]
[2]


#### 2D Arrays

Two-dimensional arrays are essentially matrices and are the most common form of arrays we will use. We usually refer to the first dimension as *rows* and the second as *columns*. Dimensions are referred to as **axes** in numpy.

Let's create the array:
$$ B = \left( \begin{array}{ccc}
1 & 2 & 3 \\
4 & 5 & 6 \end{array} \right) $$

In [14]:
import numpy as np
B = np.array([[1, 2, 3], [4, 5, 6]])  # we can initialize arrays with known values with a list of lists
print(B)

[[1 2 3]
 [4 5 6]]


The initialization is done through a list for the rows, containing lists of values for the columns.

Now say we want to retrieve the bottom right element (i.e. 6). The second row's index is 1 while the third column's index is 2.

In [14]:
print(B[1,2])

6


In [15]:
print(B[1,1])

5


In [16]:
print(B[0,2])

3


We just separate the two indices with a comma.

Slicing works on each dimension separately.

In [7]:
print(B[-1, ::2])  # print every other element of the last row (columns 0 and 2)

[4 6]


#### N-D Arrays

As N grows larger, these arrays become increasingly difficult to visualize and comprehend. However, in numpy a 10D array is indexed and manipulated in exactly the same way as a 3D one. This helps us a lot.

In [3]:
C = np.array([[[0, 1, 2], [3, 4, 5]], [[6, 7, 8], [9, 10, 11]], [[12, 13, 14], [15, 16, 17]]])  # list of list of lists
print(C)

[[[ 0  1  2]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]

 [[12 13 14]
  [15 16 17]]]


In [8]:
C[0,0,0]

C[0,0,1]

C[0,1,0]

C[2,0,1]

C[2][0][1]

13

Indexing an element in an n-dimensional array isn't that hard.

Suppose that:

- **i** is the index of the element's 1st dimension (axis=0).
- **j** is the index of the element's 2nd dimension (axis=1).
- **k** is the index of the element's 3rd dimension (axis=2).
- **l** is the index of the element's 4th dimension (axis=3).
- **m** is the index of the element's 5th dimension (axis=4).  
...

The element with the above indices in array `Arr` is:
```python
element = Arr[i, j, k, l, m, ...]
```

As an example, for the element *13* in array *C*:
- **i** is 2.
- **j** is 0.
- **k** is 1.

In [4]:
print(C[2, 0, 1])

13


#### Slicing 

Slicing works separately for each axis.

If we want the middle sub-array of *C*:
$$ slc = \left( \begin{array}{ccc}
6 & 7 & 8 \\
9 & 10 & 11 \end{array} \right) $$

In [15]:
C

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]]])

In [5]:
C[:,:,2].reshape(1,6)

array([[ 2,  5,  8, 11, 14, 17]])

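In [10]:
A = [[1,2,3],[4,5,6]]
A = np.array(A)

# A[:,:] and A[::,::] both select the whole array
# A[::,::2] selects all the rows and every other column

A[:1,:]   # the first row with all its columns (still a 2D array)

A[:,:1]   # all the rows of the first column (still a 2D array); only this last value is displayed

array([[1],
       [4]])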

In [33]:
slc = C[1,:,:]  # we want the index 1 of the 1st axis and all the values from the rest
# this can also be written as C[1, ...]
print(slc)

[[ 6  7  8]
 [ 9 10 11]]


In [35]:
slc[1,2]  # bottom right element of the slice
# slc[0,0] would return the top left element, i.e. 6

11

Likewise if we wanted just the rightmost column of each sub-array of *C*:
$$slc = \left( \begin{array}{c}
2  \\
5  \\
8  \\
11  \\
14  \\
17  \end{array} \right) $$

In [7]:
# print(C)

slc = C[:,:,2]

# or equivalently C[..., 2]
print(slc)
C[..., 2]

[[ 2  5]
 [ 8 11]
 [14 17]]


array([[ 2,  5],
       [ 8, 11],
       [14, 17]])

In [42]:
slc.shape # row, column

(3, 2)

In [4]:
C.shape

(3, 2, 3)

In order to turn `slc` into the column format we wanted, we must reshape it.

In [39]:
slc = slc.reshape((6,1))
print(slc)
print(slc.shape)

[[ 2]
 [ 5]
 [ 8]
 [11]
 [14]
 [17]]
(6, 1)


More on that later though.
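
As a small preview, here is a sketch of reshaping the last-column slice into a single column. It rebuilds an array with the same values as *C* through `np.arange`, and the names `last`, `column` and `column2` are only illustrative:

```python
import numpy as np

C = np.arange(18).reshape((3, 2, 3))  # same values as the array C used above
last = C[:, :, 2]                     # shape (3, 2)

column = last.reshape((6, 1))         # explicit target shape
column2 = last.reshape((-1, 1))       # -1 lets numpy infer that dimension from the size

print(column.shape, column2.shape)
print(column.ravel())                 # the six values, in order
```

Using `-1` is convenient when only one of the dimensions is known in advance.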

Lastly, if we want the whole array as a slice:

In [10]:
slc = C[:]       # slice containing C as a whole
slc2 = C[:,:,:]  # same thing
slc3 = C[...] # we need 3 dots

print(slc3)
print(np.array_equal(C, slc), np.array_equal(slc, slc2), np.array_equal(slc2, slc3))
# check if C == slc, slc == slc2 and if slc2 == slc3 

[[[ 0  1  2]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]

 [[12 13 14]
  [15 16 17]]]
True True True


In [46]:
C

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]]])

In [12]:
S = C.reshape((1,18))
S
np.array_equal(S,C)

False

#### Indexing arbitrary elements

By passing a list of indices, we can get a slice of any arbitrary positions we want!

In [67]:
A = np.array([1,2,3])

In [15]:
print(A[[0, 2]])         # 1D array: get the elements in positions 0 and 2 of A

print(B[:, [0, 2]])      # 2D array: get all the rows and the columns in positions 0 and 2 of B

print(C[[0, 2], 0, :2])  # 3D array: sub-arrays 0 and 2, row 0, first two columns

[1 3]
[[1 3]
 [4 6]]
[[ 0  1]
 [12 13]]


### Built-in Functions

Numpy arrays also come with a lot of built-in attributes and methods. The most important ones are the following:

#### Array Information: 

- `A.dtype` returns the type of the elements in *A*.
- `A.shape` returns the shape of *A*, i.e. a tuple containing the size of its dimensions.
- `A.ndim` returns the rank of *A*, i.e. the number of its dimensions.
- `A.size` returns the number of elements in *A*, i.e. the product of the values in `A.shape`.
- `A.nbytes` returns the total bytes consumed by the elements of the array, i.e. `A.itemsize * A.size`, where `A.itemsize` is the size of one element in bytes.
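
The attributes above can be checked on an array with the same values as *C*. In this sketch the `int32` dtype is set explicitly (an assumption made here so that the byte counts do not depend on the platform):

```python
import numpy as np

C = np.arange(18, dtype=np.int32).reshape((3, 2, 3))

print('shape   :', C.shape)
print('ndim    :', C.ndim)
print('size    :', C.size)
print('dtype   :', C.dtype)
print('itemsize:', C.itemsize)   # bytes per element
print('nbytes  :', C.nbytes)     # equals C.itemsize * C.size
```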

#### Array Conversions:

- `A.tolist()` returns the array as a (possibly nested) list.
- `A.tofile(file)` writes *A* to the specified *file* as text or binary (default).
- `A.dump(file)` dumps a pickle of *A* to the specified *file*.
- `A.astype(dtype)` returns a copy of the array, cast to the specified type.
- `A.copy()` returns a copy of *A*.

#### Shape Manipulation:

- `np.transpose(A)` or `A.T` returns transposed *A*.
- `A.reshape(shape)` returns an array with the same data as *A* and its shape changed to *shape* (*A* itself keeps its shape).
- `A.swapaxes(axis1, axis2)` returns a view of the array with axis1 and axis2 interchanged.
- `A.flatten()` returns a copy of the array collapsed into one dimension.
- `A.repeat(n)` repeats each element of an array *n* times, e.g. $ A =\left( \begin{array}{ccc}
1 & 2 & 3\end{array} \right) $ for $n=2$ becomes $ \left( \begin{array}{cccccc}
1 & 1 & 2 & 2 & 3 & 3\end{array} \right) $
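
The shape manipulation methods listed above can be tried on a small array. This is only a sketch; `M` is a hypothetical 2x3 array used for illustration:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

print(M.T.shape)                      # transposed: 3 rows, 2 columns
print(M.reshape((3, 2)))              # same six values laid out in 3 rows
print(M.swapaxes(0, 1).shape)         # for a 2D array this matches the transpose
print(M.flatten())                    # 1D copy of the values
print(np.array([1, 2, 3]).repeat(2))  # each element repeated twice
print(M.shape)                        # M itself is unchanged
```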

#### Item Manipulation:

- `A.sort()` sorts *A*, in-place.
- `A.argsort()` returns the indices that would sort *A*.
- `A.nonzero()` returns the indices of the elements that are non-zero.
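
A short sketch of these three methods on a hypothetical vector `v`. Note the difference between `argsort`, which leaves the array alone, and `sort`, which modifies it:

```python
import numpy as np

v = np.array([30, 0, 20, 10])

order = v.argsort()      # indices that would sort v
print(order)
print(v[order])          # indexing with them gives the sorted values

print(v.nonzero())       # a tuple with one index array per dimension

result = v.sort()        # sorts v itself and returns None
print(result, v)
```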

#### Calculations:

- `A.max()` returns the largest element in *A*.
- `A.min()` returns the smallest element in *A*.
- `A.sum()` returns the sum of the elements in *A*.
- `A.mean()` returns the mean of the elements in *A*.
- `A.var()` returns the variance of the elements in *A*.
- `A.std()` returns the standard deviation of the elements in *A*.
- `A.prod()` returns the product of the elements in *A*.
- `A.all()` returns True if all elements in *A* are True.
- `A.any()` returns True if any of the elements in *A* are True.
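
The calculations above, sketched on a hypothetical vector `v`. With their default arguments, `var` and `std` compute the population variance and standard deviation (dividing by the number of elements):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0])

print(v.max(), v.min())
print(v.sum(), v.prod())
print(v.mean())
print(v.var(), v.std())   # std is the square root of var

flags = v > 2             # boolean array: [False False  True  True]
print(flags.any(), flags.all())
```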

In [11]:
C = C.astype(float)  # typecasting: astype returns a cast copy, so we reassign it

D = C.copy()         # an independent copy of C
D = D + 111
print(D)
print(C)             # C is unaffected by the change to D
print('Transposed:\n', C.T)

print(C.sort())      # sort() works in-place and returns None
print(C.argsort())
print(C.nonzero())

# Other things to try:
# print('There are', C.size, 'elements in C.')
# print('The shape of C is', C.shape)
# print('The rank of C is', C.ndim)
# print('dtype:', C.dtype, 'bytes:', C.nbytes)
# print('List equivalent:\n', C.tolist())
# print('max =', C.max(), 'min =', C.min(), 'sum =', C.sum(), 'mean =', C.mean())
# print('sum / size = ', C.sum() / float(C.size))

[[[111. 112. 113.]
  [114. 115. 116.]]

 [[117. 118. 119.]
  [120. 121. 122.]]

 [[123. 124. 125.]
  [126. 127. 128.]]]
[[[ 0.  1.  2.]
  [ 3.  4.  5.]]

 [[ 6.  7.  8.]
  [ 9. 10. 11.]]

 [[12. 13. 14.]
  [15. 16. 17.]]]
Transposed:
 [[[ 0.  6. 12.]
  [ 3.  9. 15.]]

 [[ 1.  7. 13.]
  [ 4. 10. 16.]]

 [[ 2.  8. 14.]
  [ 5. 11. 17.]]]
None
[[[0 1 2]
  [0 1 2]]

 [[0 1 2]
  [0 1 2]]

 [[0 1 2]
  [0 1 2]]]
(array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], dtype=int64), array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], dtype=int64), array([1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2], dtype=int64))


### Array Creation

There are a lot of ways to create arrays. The basic way we saw up till now (which creates and populates the array) serves only for small arrays. What we tend to do in practice, is to create an array, either empty or containing placeholder values. Then we populate that array through iteration.

In [59]:
shp = (3, 3, 3)           # The shape we want our array to have
E = np.empty(shp, dtype='f')  # We create an empty array with the predefined shape and data type
print(E[0,:10,:])             # let's print a small slice of E

[[5.7857652e-39 8.4489539e-39 5.3265527e-39]
 [7.8061461e-39 9.2755463e-39 1.0561242e-38]
 [7.3469686e-39 9.0000426e-39 9.6428784e-39]]


So, `E` is not really initialized: `np.empty` only allocates memory, so the array contains whatever arbitrary values happened to be there. Other options include:

In [90]:
shp = (3, 2, 3)
# Other constructors (uncomment one at a time to try it):
# E = np.random.random(shp)       # array containing random values in [0,1)
# E = np.zeros(shp, dtype='f')    # array containing zeros
# E = np.ones(shp, dtype='f')     # array containing ones
# E = np.full(shp, 7, dtype='f')  # array containing sevens
E = np.arange(np.prod(shp), dtype='f').reshape(shp)  # array with values from range(size)
print(E)
# The line that creates E chains three steps:
#   np.prod(shp) returns the product of the values in shp, i.e. 3*2*3 = 18.
#   np.arange(18, dtype='f') creates a 1D array containing the values 0, 1, 2, ..., 17.
#   .reshape(shp) returns that array with its shape changed to shp.

[[[ 0.  1.  2.]
  [ 3.  4.  5.]]

 [[ 6.  7.  8.]
  [ 9. 10. 11.]]

 [[12. 13. 14.]
  [15. 16. 17.]]]


In [103]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

Once we have an array of the shape and dtype of our choice, we can populate it with the values that we want.
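
A minimal sketch of this create-then-populate pattern follows. The shape and the fill rule (a small multiplication table) are hypothetical choices made only for illustration:

```python
import numpy as np

shape = (3, 4)                      # hypothetical shape for illustration
table = np.zeros(shape, dtype=int)  # placeholder values

# Fill each cell with the product of its (1-based) row and column numbers.
for i in range(shape[0]):
    for j in range(shape[1]):
        table[i, j] = (i + 1) * (j + 1)

print(table)
```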

### Array operations

By default, most operations performed on numpy arrays are elementwise.

$$
Arr1 = \left( \begin{array}{ccc}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9 \end{array} \right), \:
Arr2 = \left( \begin{array}{ccc}
11 & 12 & 13 \\
14 & 15 & 16 \\
17 & 18 & 19 \end{array} \right)
$$

For example in numpy the product of these two arrays is:

$$
Arr1 \cdot Arr2 =
\left( \begin{array}{ccc}
11 & 24 & 39 \\
56 & 75 & 96 \\
119 & 144 & 171 \end{array} \right)
$$
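
The result above can be confirmed in numpy. The sketch below builds the two arrays with `np.arange` (a convenience chosen here, not something the equations require) and contrasts the elementwise product with the matrix product:

```python
import numpy as np

Arr1 = np.arange(1, 10).reshape((3, 3))    # values 1..9
Arr2 = np.arange(11, 20).reshape((3, 3))   # values 11..19

elementwise = Arr1 * Arr2           # multiplies matching positions
matrix_product = np.dot(Arr1, Arr2) # row-by-column matrix multiplication

print(elementwise)
print(matrix_product)
```

The two results differ: the top left entry is 11 in the elementwise product but 90 (1\*11 + 2\*14 + 3\*17) in the matrix product.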

#### Elementwise operations

In [12]:
print('A =               ', A)
print('A + 1 =           ', A + 1)
print('A ** 2 =          ', A ** 2)
O = np.arange((3))
print('O =               ', O)
print('A + O =           ', A + O)
print('A * O =           ', A * O)
print('2 ** (O + 1) - A =', 2 ** (O + 1) - A)

A =                [1 2 3]
A + 1 =            [2 3 4]
A ** 2 =           [1 4 9]
O =                [0 1 2]
A + O =            [1 3 5]
A * O =            [0 2 6]
2 ** (O + 1) - A = [1 2 5]


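The arithmetic operators are shorthand for the numpy functions below, which are listed together with other elementwise functions such as `np.exp` and `np.sqrt`.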

```python
np.add(A, O)
np.subtract(A, O)
np.multiply(A, O)
np.divide(A, O)
np.exp(A)
np.sqrt(A)
```

Note that `A * O` is **not** matrix multiplication! It is an **elementwise multiplication**.

#### Vector/Matrix operations

In [None]:
print('dot product: A . O = ', np.dot(A, O))  # for 1D arrays np.dot is the inner product
print('same as: ', A.dot(O))
print('outer product: A x O\n', np.outer(A,O))
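
For reference, here is a self-contained sketch of the same operations, assuming `A` and `O` hold the values used in the elementwise examples:

```python
import numpy as np

A = np.array([1, 2, 3])
O = np.arange(3)            # [0 1 2]

inner = np.dot(A, O)        # 1*0 + 2*1 + 3*2
outer = np.outer(A, O)      # outer[i, j] == A[i] * O[j]

print(inner)
print(outer)
print(outer.shape)
```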

#### Broadcasting

Elementwise operations are only defined mathematically for arrays of the same shape. To handle operands with different shapes, numpy uses a technique called broadcasting: the array that has fewer axes (or an axis of size 1) is repeated to match the shape of the other. This is illustrated in the image below.

![](http://www.scipy-lectures.org/_images/numpy_broadcasting.png)

Note that the dimensions must be compatible for broadcasting to work: shapes are compared axis by axis, starting from the last one, and two sizes are compatible only if they are equal or one of them is 1 (the axis of size 1 is the one that gets repeated).

```python
x  # array with shape (6, 1)
y  # array with shape (6, 3)
z  # array with shape (6, 8)

x + y  # array with shape (6, 3)
y + z  # error
x + z  # array with shape (6, 8)
```
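
Numpy's broadcasting rule treats two axis sizes as compatible when they are equal or when one of them is 1. The sketch below checks this on a few shapes; `try_add` is a hypothetical helper written for this example, not a numpy function:

```python
import numpy as np

def try_add(a, b):
    """Return the shape of a + b, or None if the shapes cannot be broadcast."""
    try:
        return (a + b).shape
    except ValueError:
        return None

col = np.zeros((6, 1))   # a single column: can be stretched along axis 1
y = np.zeros((6, 3))
z = np.zeros((6, 8))
row = np.zeros(8)        # 1D: treated as if its shape were (1, 8)

print(try_add(col, y))   # (6, 3)
print(try_add(col, z))   # (6, 8)
print(try_add(y, z))     # None: 3 and 8 differ and neither is 1
print(try_add(row, z))   # (6, 8)
```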

### Exercise 1: Try to confirm the above image in numpy

### Solution:

In [13]:
A = np.array([[0], [10], [20], [30]]).repeat(4, axis=1)
# Creates a column array with 4 values and repeats it 4 times along the horizontal axis.
B = np.array([[0, 1, 2, 4]])
# Creates a single row with 4 values; broadcasting stretches it along the vertical axis.
print('{} \n + \n {} \n = \n {}'.format(A, B, A+B))

[[ 0  0  0  0]
 [10 10 10 10]
 [20 20 20 20]
 [30 30 30 30]] 
 + 
 [[0 1 2 4]] 
 = 
 [[ 0  1  2  4]
 [10 11 12 14]
 [20 21 22 24]
 [30 31 32 34]]


In [None]:
B = np.array([0, 1, 2, 4])  # 1D array; its length must match the 4 columns of A
print('{} \n + \n {} \n = \n {}'.format(A, B, A+B))      

In [None]:
A = np.array([[0], [10], [20], [30]])
print('{} \n + \n {} \n = \n {}'.format(A, B, A+B))      

#### Logical Operations

An array can also contain boolean values in it. Logical operations are operations between two such arrays.

In [120]:
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])
# print(np.logical_or(a, b))
print(np.logical_and(a, b))

[ True False False False]


We can also compare the two arrays with all the usual comparison operators. Like all `numpy` operators, they perform their operations elementwise.

In [None]:
print(a > b)  # True if the element of a is larger than the corresponding element of b 
print(a == b)
print(a != b)

**Tip:** You can use `np.all()` or `np.any()` for comparing arrays as a whole (python's built-in `all()` and `any()` functions may not work properly with arrays, especially multidimensional ones).
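
A small sketch of the tip above, using two hypothetical 2D arrays. On a 2D boolean array the built-in `all()` iterates over rows, and the truth value of a whole row is ambiguous, so numpy raises an error:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 2], [3, 5]])

equal = a == b            # elementwise comparison, a boolean array
print(equal)
print(np.all(equal))      # False: one pair of elements differs
print(np.any(equal))      # True: at least one pair matches

try:
    all(equal)            # built-in all() on a 2D array
except ValueError as err:
    print('built-in all() failed:', err)
```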

#### Transcendental Functions

These apply the numpy equivalents of the `math` package functions to every element of an array.

In [125]:
a = np.arange(-1, 10)

print(a)
print(np.sin(a))
print(np.log(a))
print(np.exp(a))

[-1  0  1  2  3  4  5  6  7  8  9]
[-0.84147098  0.          0.84147098  0.90929743  0.14112001 -0.7568025
 -0.95892427 -0.2794155   0.6569866   0.98935825  0.41211849]
[       nan       -inf 0.         0.69314718 1.09861229 1.38629436
 1.60943791 1.79175947 1.94591015 2.07944154 2.19722458]
[3.67879441e-01 1.00000000e+00 2.71828183e+00 7.38905610e+00
 2.00855369e+01 5.45981500e+01 1.48413159e+02 4.03428793e+02
 1.09663316e+03 2.98095799e+03 8.10308393e+03]
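
To illustrate the relation to the `math` package, the sketch below compares one vectorized call with an explicit loop. The input values are hypothetical and chosen to be positive so that the logarithm is defined everywhere:

```python
import math
import numpy as np

x = np.array([0.5, 1.0, 2.0])   # hypothetical positive inputs

vectorized = np.log(x)                                # one call for the whole array
looped = np.array([math.log(value) for value in x])   # one element at a time

print(vectorized)
print(np.allclose(vectorized, looped))
```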


Note that `np.log` returns `nan` for the negative input and `-inf` for zero, and numpy emits a `RuntimeWarning` in both cases.
