### Outline 
10. NumPy
11. Pandas

## 10. NumPy
- NumPy (Numerical Python) is one of the fundamental packages for numerical computation and data handling in Python.

In [3]:
import numpy as np
# np? # check documentation
# np. # press TAB to see available entries

### Making a NumPy array
- A NumPy array is a multidimensional container of items of the same type.
- One can make an array in the following way.

In [103]:
array1 = np.array([1, 2, 3])

- There are some other ways to make an array.

In [102]:
array2 = np.array(list(range(10)))
array3 = np.zeros(10)
array4 = np.ones(10)
array5 = np.full((2,3),1.0)
array6 = np.arange(0,10,3)
array7 = np.linspace(0,1,10) 
array8 = np.random.randn(2,2)
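
To see what a few of these constructors return, the following minimal sketch prints shapes and values. The variable names (`zeros`, `full`, `stepped`, `spaced`) are illustrative and not part of the original cell.

```python
import numpy as np

zeros = np.zeros(10)             # ten items, all 0.0
full = np.full((2, 3), 1.0)      # 2 rows x 3 columns, all 1.0
stepped = np.arange(0, 10, 3)    # start at 0, stop before 10, step 3
spaced = np.linspace(0, 1, 10)   # 10 evenly spaced points, both ends included

print(zeros.shape)            # (10,)
print(full.shape)             # (2, 3)
print(stepped)                # [0 3 6 9]
print(spaced[0], spaced[-1])  # 0.0 1.0
```

Note that `arange` excludes the stop value, while `linspace` includes it by default.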

### Types
- An array can contain only one type. If there are multiple types, they are converted to a single type.
    - The conversion priority is: strings > floats > integers > booleans.
- Check the type of the first item in each array. Hint: Use `myarray[index]`.

In [101]:
array1 = np.array([1, 1.0, "1", True])
array2 = np.array([1, 1.0, True])
array3 = np.array([1, True])
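
One possible way to inspect the result of the conversion is sketched below. It uses `dtype.kind`, a one-letter code for the family of the array's type; the variable names are illustrative.

```python
import numpy as np

mixed_with_str = np.array([1, 1.0, "1", True])
mixed_numeric = np.array([1, 1.0, True])
int_and_bool = np.array([1, True])

# dtype.kind: 'U' = Unicode string, 'f' = float, 'i' = signed integer
for arr in (mixed_with_str, mixed_numeric, int_and_bool):
    print(arr.dtype.kind, type(arr[0]))
```

Each array ends up with the highest-priority type among its items: a string wins over numbers, and a float wins over integers and booleans.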

- You can also explicitly set the type of an array with `dtype`, provided that every item can be converted to that type.
- Check the type of the third item of `array1`.
    - NumPy has more types than Python's built-in ones, for example `int8` and `float64`.
- Change the third item to `"abc"` and execute it again.

In [24]:
array1 = np.array([1, 1.0, "1", True], dtype='float64')
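
The sketch below shows both outcomes under the same setup: the string `"1"` can be parsed as a number, while `"abc"` cannot. The names `converted` and `conversion_failed` are illustrative.

```python
import numpy as np

converted = np.array([1, 1.0, "1", True], dtype="float64")
print(converted)           # every item is now a float
print(type(converted[2]))  # <class 'numpy.float64'>

conversion_failed = False
try:
    # "abc" cannot be parsed as a number, so NumPy raises a ValueError
    np.array([1, 1.0, "abc", True], dtype="float64")
except ValueError as err:
    conversion_failed = True
    print("conversion failed:", err)
```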

### Checking Dimension, Shape, Size, and Length
- You can check the dimension of an array using `np.ndim(myarray)`, the shape using `np.shape(myarray)`, the size (= the total number of items) using `np.size(myarray)`, and the length (= the number of items along the first axis) using `len(myarray)`.
    - You can also access `ndim`, `shape`, and `size` as attributes of the array, as in `myarray.ndim`, `myarray.shape`, and `myarray.size`. (They are attributes of the instance, not methods, so no parentheses are needed.)
- A two dimensional NumPy array is also called a 2D NumPy array.
- Check the dimension, shape, and size of `array1` and `array2`.

In [7]:
array1 = np.array([[1, 2 ,3], [4, 5, 6]])
array2 = np.array([range(i,i + 3) for i in [2, 4, 6]])
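
As a minimal sketch of the checks described above, the attribute form can be used as follows. Note that `len()` counts rows only, so it differs from `size` for a 2D array.

```python
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6]])
array2 = np.array([range(i, i + 3) for i in [2, 4, 6]])

print(array1.ndim, array1.shape, array1.size, len(array1))  # 2 (2, 3) 6 2
print(array2.ndim, array2.shape, array2.size, len(array2))  # 2 (3, 3) 9 3
```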

### Getting items
- To get the items of an array, use `myarray[index]`, `myarray[index1][index2]`, etc.
    - You can also write `myarray[index1, index2]`, which is more common.
- You can use **slicing** as well (see `python_basics1_ipynb`).
    - Use `myarray[:, index]` to select all rows of a specific column and `myarray[index, :]` to select all columns of a specific row.
- Get the item in the first row and the second column of `array1`.
- Get all columns of the first row of `array1`.

In [99]:
array1 = np.array([[1, 2, 3], [4, 5, 6]])
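
The indexing forms described above can be sketched as follows. The variable names are illustrative; remember that indices start at 0, so the "second column" is index 1.

```python
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6]])

item_chained = array1[0][1]  # take the first row, then its second item
item_tuple = array1[0, 1]    # the same item, selected in one step
first_row = array1[0, :]     # all columns of the first row
second_col = array1[:, 1]    # all rows of the second column

print(item_chained, item_tuple)  # 2 2
print(first_row)                 # [1 2 3]
print(second_col)                # [2 5]
```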

#### - Masking
- You can use operators like `==`, `>`, and `!=` to get items of an array.
- For example, when you write `myarray > 0`, this returns a boolean array where `True` means that the associated item satisfies the condition `> 0`.
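- To combine conditions, use `&` and `|`, not `and` and `or`. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators.
    - The reason: `&` and `|` evaluate item by item and return a boolean array, while `and` and `or` evaluate the truth value of each whole array at once, which is ambiguous for an array with more than one item.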

In [None]:
array1 = np.array([[1, -1, -5], [2, -4, -3]])
print(array1)
print(array1 > 0)
print((array1 > -3) & (array1 < 2))
# print((array1 > -3) and (array1 < 2)) # this raises a ValueError
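
To make the difference between `&` and `and` concrete, here is a small sketch that catches the error instead of commenting the line out. The names `mask` and `error_message` are illustrative.

```python
import numpy as np

array1 = np.array([[1, -1, -5], [2, -4, -3]])

# Item-by-item AND: the result is a boolean array of the same shape
mask = (array1 > -3) & (array1 < 2)
print(mask)

error_message = None
try:
    # `and` asks for the truth value of a whole array, which is ambiguous
    (array1 > -3) and (array1 < 2)
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```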

- A boolean array can be used for counting and testing, because `True` counts as 1 and `False` as 0.

In [None]:
array1 = np.array([[1, -1, -5], [2, -4, -3]])
print(np.sum(array1 > 0)) # count the number of positive items 
print(np.any(array1 > 0)) # check whether any item in array1 is positive
print(np.all(array1 > 0)) # check whether all items in array1 are positive

- If you apply a boolean array to the original `myarray`, you can get the items that satisfy the condition.
- This is called a **masking** operation where the condition (e.g. `myarray > 0`) is called a **mask**.
- Get all the items that satisfy `array1 > 0` in `array1`. 
- Calculate the sum of them.

In [50]:
array1 = np.array([[1, -1, -5], [2, -4, -3]])
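
One possible solution is sketched below. The names `mask` and `positives` are illustrative. Masking a 2D array returns a one-dimensional array of the selected items.

```python
import numpy as np

array1 = np.array([[1, -1, -5], [2, -4, -3]])

mask = array1 > 0         # boolean array with the same shape as array1
positives = array1[mask]  # items where the mask is True

print(positives)        # [1 2]
print(positives.sum())  # 3
```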

#### - Fancy Indexing
- You can apply a list or an array of indices to get items of an array.
- This style of selection is called **fancy indexing**.
- Write `myarray[list/array]`.
- Apply `ind` and `ind_array` to `array1`, respectively.

In [97]:
array1 = np.array([i + 1 for i in range(6)])
ind = [0, 3] # a list of indices
ind_array = np.array([[0, 3], [1, 2]]) # an array of indices
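
A minimal sketch of the exercise follows. The point to notice is that the result takes the shape of the index object, not the shape of `array1`. The names `picked` and `picked_2d` are illustrative.

```python
import numpy as np

array1 = np.array([i + 1 for i in range(6)])  # [1 2 3 4 5 6]
ind = [0, 3]
ind_array = np.array([[0, 3], [1, 2]])

picked = array1[ind]           # items at positions 0 and 3
picked_2d = array1[ind_array]  # same lookup, arranged like ind_array

print(picked)           # [1 4]
print(picked_2d.shape)  # (2, 2)
print(picked_2d)
```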

### Changing items
- Similar to a list, you can change an item of an array using `myarray[index1, index2] = new_item`.
- Change `1` to `10` in `array1`.

In [96]:
array1 = np.array([[1, 2, 3], [4, 5, 6]])

#### - Reshaping an array
- To reshape an array, use `reshape()`. It returns the reshaped array and does not change the shape of the original.

In [95]:
array1 = np.array([[1, 2, 3], [4, 5, 6]])
array1.reshape(3,2)
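
The following sketch shows that the result of `reshape()` has to be assigned to a name to be kept, and that `-1` can be used to let NumPy infer a dimension. The names `reshaped` and `flat` are illustrative.

```python
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6]])

reshaped = array1.reshape(3, 2)  # keep the result under a new name
flat = array1.reshape(-1)        # -1: NumPy infers the length (6 here)

print(array1.shape)    # (2, 3)
print(reshaped.shape)  # (3, 2)
print(flat)            # [1 2 3 4 5 6]
```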

#### - Concatenating arrays
- To combine multiple arrays, use `concatenate()`, `hstack()`, or `vstack()`.

In [93]:
array1 = np.array([[1, 2], [3, 4]])
array2 = np.array([[5, 6], [7, 8]])
array3 = np.concatenate([array1, array2], axis=0)
array4 = np.hstack([array1, array2])
array5 = np.vstack([array1, array2])
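
For 2D arrays such as these, `vstack()` and `hstack()` give the same results as `concatenate()` along axis 0 and axis 1, respectively. A small check, with illustrative variable names:

```python
import numpy as np

array1 = np.array([[1, 2], [3, 4]])
array2 = np.array([[5, 6], [7, 8]])

stacked_rows = np.concatenate([array1, array2], axis=0)  # shape (4, 2)
stacked_cols = np.concatenate([array1, array2], axis=1)  # shape (2, 4)

print(stacked_rows.shape, stacked_cols.shape)  # (4, 2) (2, 4)
print(np.array_equal(stacked_rows, np.vstack([array1, array2])))  # True
print(np.array_equal(stacked_cols, np.hstack([array1, array2])))  # True
```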

#### - Splitting an array
- To split an array, use `split()`, `hsplit()`, or `vsplit()`.

In [92]:
array1 = np.array([[1, 2, 3], [4, 5, 6]])
array2, array3 = np.split(array1, [1], axis=0)
array4, array5, array6 = np.hsplit(array1, [1, 2])
array7, array8 = np.vsplit(array1, [1])
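
The list passed to the split functions gives the positions at which to cut, so `[1]` yields two pieces and `[1, 2]` yields three. A sketch with illustrative names:

```python
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6]])

top, bottom = np.split(array1, [1], axis=0)      # cut before row 1
left, middle, right = np.hsplit(array1, [1, 2])  # cut before columns 1 and 2

print(top)     # [[1 2 3]]
print(bottom)  # [[4 5 6]]
print(left.shape, middle.shape, right.shape)  # (2, 1) (2, 1) (2, 1)
```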

### Vectorized computation
- Arithmetic operators do not work item by item on a list, but they do on a NumPy array!
- When you try to make a loop for an item-by-item calculation, check whether it is possible to rewrite it using a vectorized expression.

In [104]:
height = [1.80, 1.76, 1.64]
weight = [80, 75, 60]
np_height = np.array(height) # convert a list to a NumPy array
np_weight = np.array(weight)
# bmi = weight / height ** 2 # this raises a TypeError
bmi = np_weight / np_height ** 2
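
To illustrate the advice about replacing loops, the sketch below computes the same BMI values with an explicit loop over the lists and with a vectorized expression. The names `bmi_loop` and `bmi_vec` are illustrative.

```python
import numpy as np

height = [1.80, 1.76, 1.64]
weight = [80, 75, 60]

# Loop version: works on plain lists, one item at a time
bmi_loop = [w / h ** 2 for w, h in zip(weight, height)]

# Vectorized version: one expression over whole arrays
bmi_vec = np.array(weight) / np.array(height) ** 2

print(np.round(bmi_vec, 2))
print(np.allclose(bmi_loop, bmi_vec))  # True
```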

#### - UFuncs
- A ufunc (universal function) applies an operation to every item of an array at once, without a Python loop.
- For example, write `np.abs(myarray)`.
- Useful ufuncs include `abs()` and `log()`; `round()` is applied item by item in the same way.

In [None]:
array1 = np.array([-1, 2, -5, 1])
print(np.abs(array1))

array2 = np.random.normal(1.75,0.20,50)
print(np.round(array2, 2))

### Summarizing items
- Useful functions for summarizing items include: `min()`, `max()`, `mean()`, `median()`, `std()`, `sum()`, and `corrcoef()`.
- For example, write `np.min(myarray)`.
    - You can also write `myarray.min()` for some operations, but not all of them are available as methods. Try it with `corrcoef()`.

In [None]:
array1 = np.array([[1, 2, 3], [-1, -2, -3]])
print(np.min(array1)) # min of all items
#print(array1.min()) # alternative way of writing it
print(np.min(array1[:, 2])) # min of the column at index 2 (the third column)
print(np.min(array1, axis=1)) # min of each row
print(np.min(array1, axis=0)) # min of each column
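
As a hedged sketch of the function form versus the method form: `min()` is available both ways, while `corrcoef()` exists only as a NumPy function. By default `np.corrcoef()` treats each row as one variable.

```python
import numpy as np

array1 = np.array([[1, 2, 3], [-1, -2, -3]])

print(array1.min() == np.min(array1))  # True
print(hasattr(array1, "corrcoef"))     # False: there is no such method

corr = np.corrcoef(array1)  # 2 x 2 correlation matrix of the two rows
print(corr.shape)           # (2, 2)
print(np.round(corr, 2))
```

The two rows are exact negatives of each other, so their correlation is -1.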

## 11. Pandas
- Pandas is a very useful Python package made for handling data. It's built on NumPy.

In [1]:
import pandas as pd
# pd? # check documentation
# pd. # press TAB to see available entries

- Pandas has several core objects: `Series`, `DataFrame`, and `Index`.
    - `Series`: A **one-dimensional array** of indexed data
    - `DataFrame`: A **two-dimensional array** with indices (for rows) and names (for columns)
    - `Index`: An index object itself

### Series
- A Pandas Series is like a generalized one-dimensional NumPy array.
    - It is generalized in the sense that it can have explicit indices.

In [None]:
series1 = pd.Series([0.1, 0.2, 0.3], index=['a', 'b', 'c'])
print(series1.values)
print(series1.index)
print(series1.head(2)) # get first two items 
print(series1.tail(2)) # get last two items
print(series1[0:2])

- It is also like a generalized Python dictionary.
    - It is generalized in the sense that array-style item selection such as slicing and fancy indexing is possible.

In [None]:
dic1 = {'tom':1.75, 'jerry':1.82, 'spike':1.65}
series1 = pd.Series(dic1)
print(series1)
print(series1['tom']) # indexing
print(series1[['tom','jerry']]) # fancy indexing
print(series1['tom':'jerry']) # slicing

### DataFrame
- A Pandas DataFrame is like a generalized two-dimensional NumPy array.
    - It is generalized in the sense that it can have explicit indices and column names.    

In [None]:
dic1 = {'tom':1.75, 'jerry':1.82, 'spike':1.65}
dic2 = {'tom':65, 'jerry':72, 'spike':58}
data1 = pd.DataFrame({'height': dic1, 'weight':dic2})
print(data1)
print(data1.values) # get values
print(data1.index) # get indices
print(data1.columns) # get column names
print(data1.head(2)) # get the first two rows
print(data1.tail(2)) # get the last two rows
print(data1.T) # transpose data

- It is also like a generalized Python dictionary.
    - It is generalized in the sense that array-style item selection such as slicing and fancy indexing is possible.
- Write:
    - `mydata[col]` for indexing (not `mydata[row]`).
    - `mydata[[col,col]]` for fancy indexing (not `mydata[[row,row]]`).
    - `mydata[row:row]` for slicing (not `mydata[col:col]`).
    - In other words, (fancy) indexing refers to columns, while slicing refers to rows.

In [None]:
list1 = [{'var1':x, 'var2':x + 1, 'var3':x + 2} for x in range(3)]
data1 = pd.DataFrame(list1, index=['a', 'b', 'c'])
print(data1['var1']) # (dictionary-style) indexing
# print(data1.var1) # attribute-style access
print(data1[['var1','var2']]) # fancy indexing
print(data1['a':'b']) # slicing
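
The rule that indexing refers to columns while slicing refers to rows can be checked with the sketch below. The names `col`, `rows`, and `row_lookup_failed` are illustrative.

```python
import pandas as pd

list1 = [{'var1': x, 'var2': x + 1, 'var3': x + 2} for x in range(3)]
data1 = pd.DataFrame(list1, index=['a', 'b', 'c'])

col = data1['var1']    # indexing selects a column (returned as a Series)
rows = data1['a':'b']  # slicing selects rows 'a' through 'b'

row_lookup_failed = False
try:
    data1['a']  # 'a' is a row label, not a column name
except KeyError:
    row_lookup_failed = True

print(col.tolist())        # [0, 1, 2]
print(list(rows.index))    # ['a', 'b']
print(row_lookup_failed)   # True
```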

- You can make a DataFrame from (a list of) dictionaries, Pandas Series, and NumPy arrays.

In [91]:
array1 = np.random.rand(3,3)
data1 = pd.DataFrame(array1, columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])

### Checking Dimension, Shape, Size, and Length
- Similar to NumPy, you can use `ndim`, `shape`, `size`, and `len()` for DataFrames.
    - To get the number of rows, use `len(mydata)` or `len(mydata.index)`. For columns, use `len(mydata.columns)`.
- You can also use `mydata.info()` to get more detailed information.
- Check the dimension, shape, size, and length of `data1`.
- Also, check the detailed information of `data1` using `info()`.

In [36]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])

### Getting items
- You can use indexing, slicing, masking, and fancy indexing for Pandas DataFrames.

In [None]:
series1 = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(data1)
print(series1[series1 > 0.5]) 
print(data1[data1 > 0.5])

- You can also use **indexers** for selecting items: `loc` and `iloc`.
    - `loc` refers to explicit indices (labels), while `iloc` refers to implicit indices (integer positions).
    - Write `mydata.loc[inclusive:INCLUSIVE]` and `mydata.iloc[inclusive:EXCLUSIVE]`.
    - There was another indexer, `ix`, but it was deprecated and then removed in pandas 1.0.

In [None]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(data1)
print(data1.loc['a':'b'])
print(data1.iloc[0:2])
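
The inclusive/exclusive difference is easy to miss with random data, so here is a sketch with fixed values (an assumption made for reproducibility; the cell above uses random numbers). The names `by_label` and `by_position` are illustrative.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                     columns=['var1', 'var2', 'var3'],
                     index=['a', 'b', 'c'])

by_label = data1.loc['a':'b']  # label slice: the end label 'b' is included
by_position = data1.iloc[0:1]  # position slice: the end position 1 is excluded

print(list(by_label.index))     # ['a', 'b']
print(list(by_position.index))  # ['a']
```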

- Array-style access becomes possible if you use `loc` or `iloc`.
- Try the following example with/without `loc` or `iloc`.

In [None]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(data1.loc[:, 'var1':'var2'])
print(data1.loc['a':'b', :])
print(data1.iloc[:, 0:2])
print(data1.iloc[0:2, :])

- Combining masking with fancy indexing? Of course.

In [None]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(data1)
print(data1.loc[data1['var2'] > 0.2, ['var1', 'var3']])
print(data1.loc[(data1['var2'] > 0.2) & (data1['var3'] < 0.6), :])

### Sorting items
- To sort rows by a column, use `mydata.sort_values(by=col)`. To sort columns by a row, use `mydata.sort_values(by=row, axis=1)`.
    - Available options: `axis`, `ascending`, etc.
- You can also use `mydata.sort_index()` to sort by index.

In [None]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(data1)
print(data1.sort_values(by='var3'))
print(data1.sort_values(by='var3', ascending=False))
print(data1.sort_values(by=['var1', 'var2']))
print(data1.sort_values(by='a', axis=1))
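
A small sketch contrasting `sort_index()` with `sort_values()`, using a hypothetical one-column DataFrame whose index is deliberately out of order. Both calls return a new DataFrame and leave the original as it is.

```python
import pandas as pd

# Hypothetical data: the index is not in alphabetical order
data = pd.DataFrame({'var1': [1, 3, 2]}, index=['c', 'a', 'b'])

by_index = data.sort_index()             # order rows by index label
by_value = data.sort_values(by='var1')   # order rows by the values in var1

print(list(by_index.index))  # ['a', 'b', 'c']
print(list(by_value.index))  # ['c', 'b', 'a']
print(list(data.index))      # ['c', 'a', 'b'] (unchanged)
```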

### Changing and adding items
- You can change and add items by specifying columns and/or rows.

In [None]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
data1.loc[data1['var2'] > 0.2, ['var1','var3']] = -1/3
print(data1)
data1['var4'] = 1/9
data1.loc['d'] = -1/2
print(data1)

- To rename column names/indices, write `mydata.rename(columns={old1: new1, ...}, index={old1: new1, ...})`.
    - `rename()` returns a new DataFrame, so assign the result to a variable to keep the change.
- Change `'varA'` to `'var1'` and `'d'` to `'a'` in `data1`.

In [37]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['varA', 'var2', 'var3'], index=['d', 'b', 'c'])
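
One possible solution is sketched below, using zeros instead of random numbers (an assumption, so that the output is reproducible). The name `renamed` is illustrative.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.zeros((3, 3)),
                     columns=['varA', 'var2', 'var3'],
                     index=['d', 'b', 'c'])

# rename() maps old labels to new ones and returns a new DataFrame
renamed = data1.rename(columns={'varA': 'var1'}, index={'d': 'a'})

print(list(renamed.columns))  # ['var1', 'var2', 'var3']
print(list(renamed.index))    # ['a', 'b', 'c']
print(list(data1.columns))    # ['varA', 'var2', 'var3'] (unchanged)
```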

### Deleting items
- Deleting items is like subsetting.
    - You can either take a subset or use `mydata.drop(row)` or `mydata.drop(col, axis=1)`.
    - `drop()` returns a new DataFrame, so assign the result to a variable to keep the change.
- You can also delete a column in place using `del`.

In [None]:
data1 = pd.DataFrame(np.random.rand(4,4), columns=['var1', 'var2', 'var3', 'var4'], index=['a', 'b', 'c', 'd'])
data1 = data1.loc[:, 'var1':'var3']
print(data1)
data1 = data1.drop('var3', axis=1)
print(data1)
del data1['var2']
print(data1)
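
The difference between `drop()` and `del` can be seen in this sketch, which uses fixed values (an assumption for reproducibility). The names `without_col` and `without_row` are illustrative.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(6).reshape(2, 3),
                     columns=['var1', 'var2', 'var3'],
                     index=['a', 'b'])

without_col = data1.drop('var3', axis=1)  # new DataFrame without column var3
without_row = data1.drop('a')             # new DataFrame without row a
print(list(data1.columns))                # ['var1', 'var2', 'var3'] (unchanged)

del data1['var2']                         # removes the column from data1 itself
print(list(data1.columns))                # ['var1', 'var3']
```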

### Vectorized computation
- Similar to NumPy, an item-by-item calculation is possible.

In [None]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
data1['var4'] = data1['var1'] / data1['var2']
print(data1)

### Summarizing items
- Similar to NumPy, you can summarize items using `min`, `max`, `mean`, `std`, `sum`, etc.
    - For example, write `mydata.sum()`.

In [None]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(data1.sum()) # for each column, or print(data1.sum(axis=0))
print(data1.sum(axis=1)) # for each row

- You can also use `describe()` to get a summary table.
- Get the summary table of `data1`.

In [38]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
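
A sketch of `describe()` with fixed values (an assumption, so that the numbers are reproducible). The result is itself a DataFrame, so individual statistics can be selected with `loc`. The name `summary` is illustrative.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.arange(9.0).reshape(3, 3),
                     columns=['var1', 'var2', 'var3'],
                     index=['a', 'b', 'c'])

summary = data1.describe()  # one column of statistics per variable
print(summary)
print(summary.loc['mean', 'var1'])  # 3.0, the mean of 0, 3, and 6
```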