# Workshop 2: Matrix and Data Frame

## Matrix and Array

Both arrays and matrices store numbers (integers and floats) and support arithmetic on them, but they behave differently under multiplication. Here we use `test_list` to create an array `test_array` and a matrix `test_matrix` to see the differences.

In [185]:
import numpy as np

### Array

In [186]:
test_list = [[1,2],[3,4]]
test_array = np.array(test_list)

Here we create a simple list and use `np.array()` to create an array.

In [187]:
type(test_array)

numpy.ndarray

In [188]:
test_array.shape

(2, 2)

In [189]:
test_array.ndim

2

In [190]:
test_array.size

4

Remember that you can use `type()` to check an object's type; here it is a `numpy.ndarray`. You can also check its number of dimensions, its shape (number of rows and columns), and its size (number of elements) with the attributes `.ndim`, `.shape`, and `.size`.
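As a small sketch of how these three attributes relate, the example below builds a hypothetical 3-dimensional array (`cube` is an illustrative name, not part of the workshop data) and checks that the size equals the product of the shape entries.

```python
import numpy as np

# Hypothetical 2 x 3 x 4 array, used only for illustration
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)   # number of axes
print(cube.shape)  # length of each axis
print(cube.size)   # total number of elements

# The size is the product of the entries of the shape tuple
product_of_shape = cube.shape[0] * cube.shape[1] * cube.shape[2]
print(product_of_shape == cube.size)
```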

In [191]:
test_array + test_array

array([[2, 4],
       [6, 8]])

In [192]:
test_array * test_array

array([[ 1,  4],
       [ 9, 16]])

We can see that multiplying two arrays with `*` multiplies them element by element.

In [193]:
test_array.dot(test_array)

array([[ 7, 10],
       [15, 22]])

In [194]:
np.dot(test_array, test_array)

array([[ 7, 10],
       [15, 22]])

To perform matrix multiplication, use the method `.dot()`, the function `np.dot()`, or the matrix data type.
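Another option, not used elsewhere in this workshop, is Python's `@` operator, which NumPy arrays support for matrix multiplication. The sketch below compares it with `*` on the same 2 x 2 array.

```python
import numpy as np

test_array = np.array([[1, 2], [3, 4]])

elementwise = test_array * test_array      # multiplies matching cells
matrix_product = test_array @ test_array   # matrix multiplication on plain arrays

print(elementwise)
print(matrix_product)

# For 2-D arrays, @ gives the same result as np.dot
print(np.array_equal(matrix_product, np.dot(test_array, test_array)))
```

NumPy's own documentation no longer recommends `np.matrix` for new code, so `@` on plain arrays is worth knowing as an alternative.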

In [195]:
test_vector = np.array([5,-6])
test_array * test_vector

array([[  5, -12],
       [ 15, -24]])

If we multiply `test_array` by `test_vector`, an array with only two elements, each row of `test_array` is multiplied element by element with the vector (NumPy calls this broadcasting). The negative sign in the second column makes this easy to see.
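One way to picture this is to repeat the vector once per row by hand and multiply element by element. The sketch below does that with `np.tile` (the name `stretched` is only illustrative) and confirms that the result matches the direct product.

```python
import numpy as np

test_array = np.array([[1, 2], [3, 4]])
test_vector = np.array([5, -6])

# Repeat the vector once for every row of test_array
stretched = np.tile(test_vector, (test_array.shape[0], 1))
print(stretched)

# The direct product matches the product with the explicitly repeated vector
same = np.array_equal(test_array * test_vector, test_array * stretched)
print(same)
```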

In [196]:
test_vector = np.array([5,-6,7])
# test_array * test_vector ### error

In [197]:
test_vector.shape

(3,)

In [198]:
type(test_vector.shape)

tuple

If the shapes are not compatible, you'll receive an error. To check an array's shape, use the attribute `.shape`.
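If you would like to see the error without stopping the notebook, you can catch it. This is a minimal sketch that assumes the failure surfaces as a `ValueError`, which is what NumPy raises for incompatible shapes.

```python
import numpy as np

test_array = np.array([[1, 2], [3, 4]])
test_vector = np.array([5, -6, 7])

message = None
try:
    test_array * test_vector
except ValueError as err:
    # Shapes (2, 2) and (3,) cannot be lined up, so NumPy refuses to multiply
    message = str(err)

print(message)
```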

### Matrix

In [199]:
test_matrix = np.matrix(test_list)

In [200]:
type(test_matrix)

numpy.matrixlib.defmatrix.matrix

In [201]:
test_matrix + test_matrix

matrix([[2, 4],
        [6, 8]])

In [202]:
test_matrix * test_matrix

matrix([[ 7, 10],
        [15, 22]])

Here we observe true matrix multiplication. Otherwise, a matrix behaves much like an array.

In [203]:
test_matrix.dot(test_matrix)

matrix([[ 7, 10],
        [15, 22]])

In [204]:
np.dot(test_matrix, test_matrix)

matrix([[ 7, 10],
        [15, 22]])

We can still use the method `.dot()` or the function `np.dot()` to make it explicit that we're doing matrix multiplication. This is not necessary, but it can make our code more readable.

In [205]:
test_matrix.shape

(2, 2)

The attribute `.shape` also works on a matrix.

### Difference Between Array and Matrix

#### How To Compare

To compare two objects, we use `==`, which returns a `bool` object (or, for arrays and matrices, one `bool` value per element). The full list of comparison operators is as follows:

* `==` (equality)
* `!=` (inequality)
* `>` (greater than)
* `<` (less than)
* `>=` (greater than or equal to)
* `<=` (less than or equal to)
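Because comparing arrays gives one `bool` per element, it is often handy to reduce the result to a single answer. The sketch below uses two hypothetical arrays, `left` and `right`, together with `.all()`, `.any()`, and `np.array_equal()`.

```python
import numpy as np

# Hypothetical arrays that agree in the first column only
left = np.array([[1, 2], [3, 4]])
right = np.array([[1, 0], [3, 9]])

mask = left == right                 # element-wise comparison
print(mask)
print(mask.all())                    # True only if every element matches
print(mask.any())                    # True if at least one element matches
print(np.array_equal(left, right))   # single True/False for the whole array
```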


In [206]:
test_array == test_matrix

matrix([[ True,  True],
        [ True,  True]], dtype=bool)

In [207]:
(test_array * test_array) == (test_matrix * test_matrix)

matrix([[False, False],
        [False, False]], dtype=bool)

In [208]:
(test_array * test_array) < (test_matrix * test_matrix)

matrix([[ True,  True],
        [ True,  True]], dtype=bool)

In [209]:
np.dot(test_array, test_array) == (test_matrix * test_matrix)

matrix([[ True,  True],
        [ True,  True]], dtype=bool)

In [210]:
(test_array * test_array) < 5

array([[ True,  True],
       [False, False]], dtype=bool)

In [211]:
test_array[0] == test_matrix[0]

matrix([[ True,  True]], dtype=bool)

In [212]:
test_matrix[0,1] == test_array[0,1]

True

Recall that an array is a `numpy.ndarray` object, which means **n-dimensional array**. In fact, a matrix is a special case of an array, with exactly two dimensions and special multiplication rules.

By default, the multiplication of two matrices IS matrix multiplication, but the multiplication of two arrays IS NOT. You can always use the method `.dot()` to perform matrix multiplication. In other respects they're very similar. Both are useful for storing and processing numerical data.

In [213]:
np.array(test_matrix)

array([[1, 2],
       [3, 4]])

In [214]:
np.matrix(test_array)

matrix([[1, 2],
        [3, 4]])

And yes, you can use `np.array()` and `np.matrix()` to switch between matrix and array.

### Indexing

In [215]:
test_matrix

matrix([[1, 2],
        [3, 4]])

In [216]:
test_matrix[0]

matrix([[1, 2]])

By default, a single index selects along rows, so `[0]`, which points to the first object, returns the first row.

In [217]:
type(test_matrix[1])

numpy.matrixlib.defmatrix.matrix

In [218]:
str(test_matrix)

'[[1 2]\n [3 4]]'

A subset of a matrix is still a matrix. If we turn it into a string, the result shows its structure literally, with `\n` marking a line break.

In [219]:
test_matrix[1,1]

4

Now we can use [**row**, **column**] to subset a matrix, an array, and (later) a data frame! From now on, it is recommended that you always specify both row and column instead of using only `[number]` to subset a matrix or array.
The usual indexing rules still apply here, so remember that the first object is at index 0.

In [220]:
type(test_matrix[1,1])

numpy.int64

In [221]:
str(test_matrix[1,1])

'4'

### Advanced Indexing

Just a little **recap of the subsetting rules**:
 
* A slice has up to three parts, separated by `:`
* The first is the starting index
* The second is the ending index, which is not included in the result
* The third is the step (the gap between selected indices)
* Use a negative step (`-`) to select backwards
* Use only `:` to select every element

So for a single vector, either a row or column, here are some examples:

* `[1:4:1]` selects every element (the step is 1) from the second object (at index 1) to the fourth object (at index 3)
* `[2::2]` selects every second element (the step is 2) from the third object (at index 2) through the end of the sequence
* `[::-3]` selects every third element (the step is 3) backwards (because of the `-`), starting from the last object
* `[::]`, like `[:]`, selects every element
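The four slices above can be tried on any sequence. Here is a short sketch using a hypothetical list of letters, so that each selected position is easy to spot.

```python
# Hypothetical sample data: eight letters at indices 0 through 7
letters = list("abcdefgh")

second_to_fourth = letters[1:4:1]   # indices 1, 2, 3
every_other = letters[2::2]         # indices 2, 4, 6
backwards = letters[::-3]           # indices 7, 4, 1
everything = letters[::]            # same as letters[:]

print(second_to_fourth)
print(every_other)
print(backwards)
print(everything == letters[:])
```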


In [222]:
test_matrix[:,1]

matrix([[2],
        [4]])

Remember how `:` works in subsetting? It works the same way for matrices and arrays, except that now we can use it for both rows and columns. If you place `:` for the row and `1` for the column, it returns **every row of the second column** (index 1).

In [223]:
test_matrix = np.matrix([[1,2,3],[4,5,6],[7,8,9]])
test_matrix

matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

In [224]:
test_matrix.shape

(3, 3)

To demonstrate the advanced indexing, we need a bigger matrix.

In [225]:
test_matrix[::1,0]

matrix([[1],
        [4],
        [7]])

In [226]:
test_matrix[::2,0]

matrix([[1],
        [7]])

In [227]:
test_matrix[1::2,0]

matrix([[4]])

In [228]:
test_matrix[::-2,:]

matrix([[7, 8, 9],
        [1, 2, 3]])

Note the difference: `[::2]` starts picking elements from the first object (on index 0), and `[1::2]` starts from the second object (on index 1).

In [229]:
test_matrix[:]

matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

In [230]:
test_matrix[:,:]

matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

Use only `:` to return all the elements.

### Shape of A Matrix

Remember that we can use the attribute `.shape` to return the dimensions of a matrix or array. The value of `.shape` is actually a `tuple`, which is similar to a `list`, so we can index it and use the values to subset our matrix efficiently, for example to return the value in the last row and last column.

In [231]:
test_matrix.shape

(3, 3)

In [232]:
type(test_matrix.shape)

tuple

In [233]:
test_matrix.shape + test_matrix.shape

(3, 3, 3, 3)

As you can see, a `tuple` behaves like a `list` in some aspects.

In [234]:
test_matrix.shape[0] # row first

3

In [235]:
test_matrix.shape[1] # column second

3

So if we want to select the last value in our `test_matrix`, which is `9`, we can take advantage of `.shape`.

In [236]:
test_matrix[(test_matrix.shape[0]-1), (test_matrix.shape[1]-1)]

9

For the sake of readability, you can assign the shape values to variables first and then use them to subset the matrix, which avoids nested expressions.

In [237]:
nrow = test_matrix.shape[0]
ncol = test_matrix.shape[1]
test_matrix[nrow-1, ncol-1]

9
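As an aside, negative indices give a shorter route to the same cell. The sketch below uses a plain array named `grid` (an illustrative name) and shows both approaches side by side.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

nrow, ncol = grid.shape                    # unpack the shape tuple in one step
last_by_shape = grid[nrow - 1, ncol - 1]   # last row, last column via the shape
last_by_negative = grid[-1, -1]            # -1 counts from the end of each axis

print(last_by_shape, last_by_negative)
```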

### Basic Manipulation

As we did with `string` and `list`, we can manipulate `array` and `matrix`. On top of that, there are some useful methods that can help you with your math homework.

#### Change Values

In [238]:
test_matrix[0,0] = 100
test_matrix

matrix([[100,   2,   3],
        [  4,   5,   6],
        [  7,   8,   9]])

In [239]:
test_matrix[:2,:2] = np.array([[1,2],[4,5]])
test_matrix

matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

In [240]:
test_matrix = test_matrix * 2
test_matrix

matrix([[ 2,  4,  6],
        [ 8, 10, 12],
        [14, 16, 18]])

In [241]:
test_matrix = test_matrix / 4
test_matrix

matrix([[ 0.5,  1. ,  1.5],
        [ 2. ,  2.5,  3. ],
        [ 3.5,  4. ,  4.5]])

#### Transpose and Inverse

Here we use the simple matrix again. Besides constructing a matrix from a list, we can also create one from a string, using `;` to separate rows.

In [242]:
test_matrix = np.matrix('[1,2;3,4]')
test_matrix

matrix([[1, 2],
        [3, 4]])

In [243]:
test_matrix.getT()

matrix([[1, 3],
        [2, 4]])

In [244]:
test_matrix.getH()

matrix([[1, 3],
        [2, 4]])

Use the method `.getT()` or `.getH()` to transpose a matrix. The difference is that `.getH()` returns the **conjugate transpose** of the matrix; for a real-valued matrix like this one, the two results are identical. For further details, please check out [the official documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.H.html#numpy.matrix.H) and [this Wikipedia article](https://en.wikipedia.org/wiki/Conjugate_transpose).

In [245]:
test_matrix.getI()

matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

Use the method `.getI()` to invert a matrix (if it's not singular).

In [246]:
test_matrix.getI() * test_matrix

matrix([[  1.00000000e+00,   4.44089210e-16],
        [  0.00000000e+00,   1.00000000e+00]])

In [247]:
test_matrix * test_matrix.getI()

matrix([[  1.00000000e+00,   1.11022302e-16],
        [  0.00000000e+00,   1.00000000e+00]])
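The tiny off-diagonal values above come from floating-point rounding, so the product is the identity matrix only approximately. A minimal sketch of how one might check this, using `np.linalg.inv` on a plain array and `np.allclose` for a tolerance-based comparison (neither is used in the cells above):

```python
import numpy as np

square = np.array([[1.0, 2.0], [3.0, 4.0]])
inverse = np.linalg.inv(square)   # array-based counterpart of .getI()

identity = np.eye(2)
product = square @ inverse

# Compare within a small tolerance instead of demanding exact equality
close_enough = np.allclose(product, identity)
print(close_enough)
```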

#### More Methods

You can learn more about the methods of matrices and arrays from:

* [ndarray](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#arrays-ndarray)
* [numpy.matrix](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html)
* [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)

The most useful ones are probably:

* `.getA()` returns the `ndarray` form of the matrix, which is just like applying `np.array()` to the matrix
* `.getA1()` returns the flat `ndarray` form of the matrix
* `.max()`, `.min()`, `.mean()`, `.cumsum()`, etc. return summary statistics
* `.reshape()` reshapes the dimension of the matrix
* `.tolist()` turns the matrix into a (nested) list

In [248]:
test_matrix.getA()

array([[1, 2],
       [3, 4]])

In [249]:
test_matrix.getA1()

array([1, 2, 3, 4])

In [250]:
test_matrix.max()

4

In [251]:
test_matrix.min()

1

In [252]:
test_matrix.mean()

2.5

In [253]:
test_matrix.cumsum()

matrix([[ 1,  3,  6, 10]])

In [254]:
test_matrix.reshape(4,1)

matrix([[1],
        [2],
        [3],
        [4]])

In [255]:
test_matrix.tolist()

[[1, 2], [3, 4]]
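Most of these methods also accept an `axis` argument when you want one result per column or per row rather than one for the whole object. The following sketch uses a plain array named `values` (an illustrative name) to show the idea.

```python
import numpy as np

values = np.array([[1, 2], [3, 4]])

column_max = values.max(axis=0)    # one maximum per column
row_mean = values.mean(axis=1)     # one mean per row
flat = values.reshape(-1)          # -1 lets NumPy work out the length

print(column_max)
print(row_mean)
print(flat)
```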

## Data Frame

### Create a Data Frame From an Array/Matrix

In [256]:
import pandas as pd

In [257]:
test_matrix

matrix([[1, 2],
        [3, 4]])

In [258]:
test_df = pd.DataFrame(test_matrix)
test_df

   0  1
0  1  2
1  3  4


In [259]:
type(test_df)

pandas.core.frame.DataFrame

### Rename the Columns

In [260]:
test_df.columns = ['I','II']
test_df

   I  II
0  1   2
1  3   4


In [261]:
test_df = pd.DataFrame(test_matrix, columns = ['A', 'B'])
test_df

   A  B
0  1  2
1  3  4


### Indexing and Selection

#### Brackets

You can always use brackets to index a list, an array, and, of course, a data frame. Note that there are two ways to use brackets on a data frame:

* `[]` single bracket version returns a Pandas Series
* `[[]]` double bracket version returns a Pandas DataFrame.

In [262]:
test_df['A']

0    1
1    3
Name: A, dtype: int64

In [263]:
test_df[['A', 'B']]

   A  B
0  1  2
1  3  4


In [264]:
test_df[[1]] # older pandas: returns the second column by position; recent versions raise a KeyError (use test_df.iloc[:, [1]] instead)

   B
0  2
1  4
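To see the single versus double bracket difference directly, you can inspect the type and shape of each result. This sketch rebuilds the same small data frame so that it runs on its own; `single` and `double` are illustrative names.

```python
import pandas as pd

test_df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

single = test_df['A']     # one column as a Series (one-dimensional)
double = test_df[['A']]   # one column as a DataFrame (two-dimensional)

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```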


You can also use a slice to select observations (rows). Note that a single number, as we used for lists and arrays, raises an error on this data frame because pandas looks for a column with that label, so you'll have to specify a slice with a start, an end, or a step.

In [265]:
#test_df[1] ### error

In [266]:
test_df[:1] # returns the first observation

   A  B
0  1  2


In [267]:
test_df[:2] # returns the first two observations

   A  B
0  1  2
1  3  4


In [268]:
test_df[::-1]

   A  B
1  3  4
0  1  2


### `.loc` and `.iloc`

You can also use the indexers `.loc` and `.iloc` to select rows and columns by label or by integer position.

* `.loc`: label-based
* `.iloc`: integer position-based

I suggest always using `.iloc` to specify rows and columns by position; `.loc` and brackets can be confusing at times.
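The difference between labels and positions is easiest to see when the row labels are not 0, 1, 2, and so on. The sketch below uses a hypothetical data frame, `labelled_df`, whose row labels are 10, 20, and 30.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
labelled_df = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]}, index=[10, 20, 30])

by_label = labelled_df.loc[20, 'A']    # row labelled 20, column labelled 'A'
by_position = labelled_df.iloc[1, 0]   # second row, first column: the same cell

# Label slices include the end label; position slices exclude the end position
rows_by_label = len(labelled_df.loc[10:20])
rows_by_position = len(labelled_df.iloc[0:1])

print(by_label, by_position)
print(rows_by_label, rows_by_position)
```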

In [269]:
test_df.loc[0:2]

   A  B
0  1  2
1  3  4


In [270]:
test_df.loc[1,'A']

3

In [271]:
test_df.loc[[0,1,2]] # label 2 does not exist: older pandas fills the row with NaN; pandas 1.0 and later raise a KeyError

     A    B
0  1.0  2.0
1  3.0  4.0
2  NaN  NaN


In [272]:
test_df.iloc[1,1]

4

In [273]:
test_df.iloc[:,1]

0    2
1    4
Name: B, dtype: int64

In [274]:
test_df.iloc[::-1,1]

1    4
0    2
Name: B, dtype: int64

In [275]:
# test_df.iloc[[0,1,2]] ### error

### Import Data

In [276]:
iris_df = pd.read_csv("iris.data.csv", header = None)
print(iris_df)

       0    1    2    3               4
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
5    5.4  3.9  1.7  0.4     Iris-setosa
6    4.6  3.4  1.4  0.3     Iris-setosa
7    5.0  3.4  1.5  0.2     Iris-setosa
8    4.4  2.9  1.4  0.2     Iris-setosa
9    4.9  3.1  1.5  0.1     Iris-setosa
10   5.4  3.7  1.5  0.2     Iris-setosa
11   4.8  3.4  1.6  0.2     Iris-setosa
12   4.8  3.0  1.4  0.1     Iris-setosa
13   4.3  3.0  1.1  0.1     Iris-setosa
14   5.8  4.0  1.2  0.2     Iris-setosa
15   5.7  4.4  1.5  0.4     Iris-setosa
16   5.4  3.9  1.3  0.4     Iris-setosa
17   5.1  3.5  1.4  0.3     Iris-setosa
18   5.7  3.8  1.7  0.3     Iris-setosa
19   5.1  3.8  1.5  0.3     Iris-setosa
20   5.4  3.4  1.7  0.2     Iris-setosa
21   5.1  3.7  1.5  0.4     Iris-setosa
22   4.6  3.6  1.0  0.2     Iris-setosa
23   5.1  3.3  1.7  0.5     Iris-setosa


In [277]:
type(iris_df)

pandas.core.frame.DataFrame

In [278]:
iris_df.columns

Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [279]:
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris_df.head()

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [280]:
iris_df.tail()

     sepal_length  sepal_width  petal_length  petal_width           class
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica
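Instead of renaming the columns after the import, the names can also be passed to `pd.read_csv()` through its `names` argument. The sketch below reads two rows copied from the output above out of an in-memory string, which stands in for `iris.data.csv` so that the example runs without the file.

```python
import io
import pandas as pd

# Two rows copied from the data above, standing in for iris.data.csv
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None says the file has no header row; names supplies the column labels
sample_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample_df.shape)
print(list(sample_df.columns))
```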


### Indexing

In [281]:
# iris_df[0] ### error

In [282]:
iris_df.iloc[:3,:5] # rows at positions 0-2 and columns at positions 0-4 (the end positions 3 and 5 are excluded)

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa


### Basic Inspection

#### Head and Tail

In [283]:
iris_df['sepal_length'].head()

0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64

In [284]:
iris_df.iloc[:].tail(7)

     sepal_length  sepal_width  petal_length  petal_width           class
143           6.8          3.2           5.9          2.3  Iris-virginica
144           6.7          3.3           5.7          2.5  Iris-virginica
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica


#### Extract Values into a `numpy.ndarray` Object

In [285]:
iris_df.head(3).values

array([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
       [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
       [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']], dtype=object)
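Notice the `dtype=object` above: because the rows mix numbers and strings, NumPy cannot store them as a purely numeric array. If you only need the measurements, one option is to select the numeric columns first. The sketch below uses a small hypothetical frame, `sample_df`, and the `.to_numpy()` method, which returns the same kind of array as `.values`.

```python
import pandas as pd

# Hypothetical two-row frame with the same mix of numbers and text
sample_df = pd.DataFrame({
    'sepal_length': [5.1, 4.9],
    'sepal_width': [3.5, 3.0],
    'class': ['Iris-setosa', 'Iris-setosa'],
})

mixed = sample_df.to_numpy()                                     # numbers and text together
numeric = sample_df[['sepal_length', 'sepal_width']].to_numpy()  # numbers only

print(mixed.dtype)
print(numeric.dtype)
```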