# Numpy and Pandas

## Numpy

NumPy, short for Numerical Python, is the fundamental package required for high performance scientific computing and data analysis.

Why NumPy?

* Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
* Common array algorithms like sorting, unique, and set operations
* Efficient descriptive statistics and aggregating/summarizing data
* Data alignment and relational data manipulations for merging and joining together heterogeneous data sets
* Expressing conditional logic as array expressions instead of loops with if-elif-else branches
* Group-wise data manipulations (aggregation, transformation, function application)


In [2]:
import numpy as np

## ndarray

In [3]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

In [4]:
arr1.shape

(5,)

In [5]:
arr1.dtype

dtype('float64')

In [6]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [None]:
np.empty((1, 2, 3))

Indexing and slicing work as you would expect from Python:
* `[idx]`
* `[begin:end:stepsize]`
  * Default values
    * begin = 0
    * end = one past the last element, so the slice runs through the end of the array
    * stepsize = 1
    * the second colon can be omitted when stepsize is left at its default
* Negative indices are counted from the last element.
  * `-i` is the short form of `n - i`, with `n` being the number of elements in the array
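
The slicing rules listed above can be tried out on a small one-dimensional array. This is only an illustrative sketch; the array `a` and the result names are made up for the example.

```python
import numpy as np

a = np.arange(10)        # illustrative data: the integers 0..9

first_three = a[:3]      # begin defaults to 0
from_seven = a[7:]       # end defaults to the end of the array
every_third = a[::3]     # stepsize of 3
last = a[-1]             # same element as a[len(a) - 1]
last_two = a[-2:]        # the final two elements
reversed_a = a[::-1]     # a negative stepsize walks the array backwards
```

Unlike list slices, slices of a NumPy array are views on the original data, so assigning into a slice changes the original array.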

In [2]:
X = np.random.randn(3, 5)
X

array([[-0.40038669,  1.11829379, -1.07499997, -0.20100177,  1.24745862],
       [-0.65041872,  1.03785775, -0.91727384,  1.84144098,  0.57501708],
       [-0.98545769, -1.35682004, -0.61046597,  1.43044743,  0.14167787]])

Looks like a list of lists. And indeed, if we use a single index into the array, we will obtain rows:

In [3]:
X[0]

array([-0.40038669,  1.11829379, -1.07499997, -0.20100177,  1.24745862])

In [4]:
X[0, 1]

1.118293792421547

In [5]:
X[0, 0:2]

array([-0.40038669,  1.11829379])

In [6]:
X[0, :]

array([-0.40038669,  1.11829379, -1.07499997, -0.20100177,  1.24745862])

In [7]:
X[-1, :]

array([-0.98545769, -1.35682004, -0.61046597,  1.43044743,  0.14167787])

## Boolean Indexing

**Boolean indexing** allows you to select data subsets of an array that satisfy a given condition.

A **boolean index mask** is a NumPy array of type `bool`: each element of the indexed array is selected (True) or not (False) depending on the value of the mask at that element's position.

In [8]:
#simple example
arr = np.array([10, 20])
idx = np.array([True, False])
arr[idx]

array([10])

In [9]:
#creating test data (despite the name, arr_2d is one-dimensional)
arr_2d = np.random.randn(5)
arr_2d

array([ 1.10493847, -0.17036643,  0.376578  ,  2.38746806, -0.04587725])

In [10]:
#getting a boolean index array
arr_2d < 0

array([False,  True, False, False,  True])

In [11]:
#using a boolean index array inplace
arr_2d[arr_2d < 0]

array([-0.17036643, -0.04587725])

In [12]:
#complex boolean expressions
arr_2d[(arr_2d > -0.5) & (arr_2d < 0)]

array([-0.17036643, -0.04587725])

In [13]:
#setting the value based on a boolean indexing array
arr_2d[arr_2d < 0] = 0
arr_2d

array([1.10493847, 0.        , 0.376578  , 2.38746806, 0.        ])
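
Assigning through a boolean mask changes the array in place. When the original values should be kept, `np.where` is one way to express the same condition as an array expression and get a new array back. The sketch below uses made-up values, and the names `values` and `clipped` are illustrative.

```python
import numpy as np

values = np.array([1.5, -0.2, 0.4, -3.0])   # illustrative data

# np.where(condition, x, y) takes x where the condition holds and y elsewhere
clipped = np.where(values < 0, 0.0, values)

# values is left untouched; clipped holds the result
```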

### List-of-locations indexing

In [14]:
#the data. 18 elements in 6 rows and 3 columns
arr = np.arange(18).reshape(6,3)
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17]])

In [15]:
# fancy selection of rows in a particular order
arr[[0,4,4]]

array([[ 0,  1,  2],
       [12, 13, 14],
       [12, 13, 14]])

In [16]:
#select elements [5,2], [3,1],[1,0]
arr[[5,3,1],[2,1,0]]

array([17, 10,  3])
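
Note that `arr[[5,3,1],[2,1,0]]` pairs the two index lists element by element. If the goal is instead the rectangular block formed by those rows and columns, `np.ix_` is one option. The following sketch contrasts the two and also shows that list-of-locations indexing returns a copy; the names `pairs`, `block`, and `picked` are illustrative.

```python
import numpy as np

arr = np.arange(18).reshape(6, 3)

# Paired indices: elements (5,2), (3,1), (1,0)
pairs = arr[[5, 3, 1], [2, 1, 0]]

# Cross product: rows 5, 3, 1 combined with columns 2, 1, 0
block = arr[np.ix_([5, 3, 1], [2, 1, 0])]

# Indexing with a list copies the data, so the original stays unchanged
picked = arr[[0, 4]]
picked[0, 0] = 99
```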

## Universal Functions

| Function                | Description                                                                                                                                          |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| abs, fabs               | Compute the absolute value element-wise for integer, floating point, or complex values. Use fabs as a faster alternative for non-complex-valued data |
| sqrt                    | Compute the square root of each element. Equivalent to arr ** 0.5                                                                                    |
| square                  | Compute the square of each element. Equivalent to arr ** 2                                                                                           |
| exp                     | Compute the exponential e^x of each element                                                                                                          |
| log, log10, log2, log1p | Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively                                                                    |
| sign                    | Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative)                                                                           |
| ceil                    | Compute the ceiling of each element, i.e. the smallest integer greater than or equal to each element                                                 |
| floor                   | Compute the floor of each element, i.e. the largest integer less than or equal to each element                                                       |

In [3]:
arr = np.arange(10)

In [None]:
np.sqrt(arr)

In [4]:
np.exp(arr)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])
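
A few more functions from the table above, applied to a small hand-picked array so the results are easy to check by eye. The array `x` and the result names are illustrative only.

```python
import numpy as np

x = np.array([-1.5, 0.0, 2.25])   # illustrative data

absolute = np.abs(x)      # element-wise absolute value
signs = np.sign(x)        # -1, 0 or 1 per element
ceilings = np.ceil(x)     # smallest integer >= each element
floors = np.floor(x)      # largest integer <= each element
squares = np.square(x)    # same as x ** 2
```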

### Mathematical and Statistical Methods

A set of mathematical functions that compute statistics over an entire array, or over the data along one axis, are accessible as array methods.

In [6]:
arr = np.random.randn(5, 4)

In [7]:
arr.mean()

-0.12010048164196137

In [8]:
arr.sum()

-2.4020096328392273

In [9]:
arr.mean(axis=1)

array([ 0.07695205,  0.06638701,  0.34703253, -0.69824656, -0.39262745])
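
Because the random data above makes the effect of `axis` hard to verify by hand, here is a small sketch with fixed values. The array `m` and the result names are illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])     # illustrative 2 x 3 array

total = m.sum()               # over all elements
col_sums = m.sum(axis=0)      # collapses the rows: one value per column
row_sums = m.sum(axis=1)      # collapses the columns: one value per row
row_means = m.mean(axis=1)    # mean of each row
```

The `axis` argument names the dimension that is reduced away, which is why `axis=1` on the 5 x 4 array above yields five values.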

## Matrix algebra

What about matrix multiplication? On arrays, the `*` operator multiplies element-wise, so there are two ways. We can either use the `dot` function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments:

In [27]:
Y = np.random.randn(5, 4)

In [28]:
X, Y

(array([[-0.40038669,  1.11829379, -1.07499997, -0.20100177,  1.24745862],
        [-0.65041872,  1.03785775, -0.91727384,  1.84144098,  0.57501708],
        [-0.98545769, -1.35682004, -0.61046597,  1.43044743,  0.14167787]]),
 array([[ 0.19959726, -0.52817142, -0.37757318,  0.14013078],
        [ 0.36362986,  0.42031749,  0.10042907,  0.30814386],
        [ 0.20266257,  1.96200371,  0.09868671, -0.1448721 ],
        [-0.39149893, -0.9814651 , -0.17533847, -1.31771535],
        [ 2.25418205,  1.12518   ,  1.96686984, -1.95197939]]))

In [29]:
Z = np.dot(X, Y)
Z

array([[ 2.99955747,  0.17324903,  2.64622835, -1.72592403],
       [ 0.63694822, -2.18024467,  1.06739733, -3.18736334],
       [-1.05444445, -2.49206222,  0.20342282, -2.62922406]])

In [30]:
X.shape, Y.shape, Z.shape

((3, 5), (5, 4), (3, 4))

In [31]:
Z = X.dot(Y) #same as above

In [32]:
# Matrix-Vector multiplication

In [33]:
A = np.array([[1,1],[2,2]])
A

array([[1, 1],
       [2, 2]])

In [34]:
v1 = np.arange(0, 2)
v1

array([0, 1])

In [35]:
np.dot(A, v1)

array([1, 2])

### Matrix multiplication via casts

Alternatively, we can cast the array objects to the type `matrix`. This changes the behavior of the standard arithmetic operators `+, -, *` to use matrix algebra.

In [36]:
M = np.matrix(A)
v = np.matrix(v1).T # make it a column vector (T is the transpose operation)
v

matrix([[0],
        [1]])

In [37]:
M*M

matrix([[3, 3],
        [6, 6]])

In [38]:
M*v

matrix([[1],
        [2]])

In [39]:
# inner product
v.T * v

matrix([[1]])

In [40]:
# with matrix objects, standard matrix algebra applies
v + M*v

matrix([[1],
        [3]])
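
Beyond the two approaches shown above, Python 3.5 and later provide the `@` operator, which NumPy arrays support for matrix multiplication. The NumPy documentation no longer recommends the `matrix` class, so the sketch below repeats the products from this section on plain arrays. The result names `AA`, `Av`, and `inner` are illustrative.

```python
import numpy as np

A = np.array([[1, 1], [2, 2]])
v1 = np.arange(0, 2)

AA = A @ A        # matrix-matrix product, same values as M*M above
Av = A @ v1       # matrix-vector product, same values as np.dot(A, v1)
inner = v1 @ v1   # inner product of two 1-D arrays gives a scalar
```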

# Getting started with Pandas

Pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to use in NumPy-centric applications.

## Pandas data structures

In [2]:
import pandas as pd

### Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

In [3]:
obj = pd.Series([4, 7, -5, 3])

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

* You can create a Series with an index that labels each data point.
* Unlike with a regular NumPy array, you can use these index labels to select single values or a set of values:

In [8]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [9]:
obj2['a']

-5

In [10]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

A Series can also behave like a dict that maps index labels to values, so `in` tests membership in the index:

In [11]:
'b' in obj2

True

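You can build a Series from a dict. When an explicit index is passed, values are matched by key: index labels missing from the dict (here `California`) become NaN, and dict keys that are not in the index (here `Utah`) are dropped: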

In [16]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [17]:
obj3 = pd.Series(sdata, index=states)

In [18]:
obj3

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
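
The NaN entry marks missing data. As a brief sketch, `isnull` and `dropna` are two common ways to detect and remove such entries; the names `missing` and `present` are illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = pd.Series(sdata, index=states)

missing = obj3.isnull()   # boolean Series, True where the value is NaN
present = obj3.dropna()   # new Series without the NaN entries
```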

## DataFrame

* A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
* The DataFrame has both a row and column index; it can be thought of as a dict of Series that all share the same index.
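
To make the "dict of Series" picture concrete, the sketch below builds a DataFrame from two Series with a common index and pulls one column back out. The labels `'a'` and `'b'` and the variable names are made up for illustration.

```python
import pandas as pd

# Two Series sharing the same (illustrative) index labels
year = pd.Series([2000, 2001], index=['a', 'b'])
pop = pd.Series([1.5, 1.7], index=['a', 'b'])

df = pd.DataFrame({'year': year, 'pop': pop})

col = df['pop']   # selecting a column returns a Series with the shared index
```

Bracket access is used here because `pop` is also the name of a DataFrame method, so attribute-style access would not return the column.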

### Create a DataFrame

In [20]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

In [22]:
frame

|   | state  | year | pop |
|---|--------|------|-----|
| 0 | Ohio   | 2000 | 1.5 |
| 1 | Ohio   | 2001 | 1.7 |
| 2 | Ohio   | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |


The columns to include, and their order, can be specified with the `columns` argument:

In [23]:
frame1 = pd.DataFrame(data, columns=['state', 'pop'])

In [25]:
frame1.columns

Index(['state', 'pop'], dtype='object')
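
The `columns` argument can also reorder columns, and a name that does not occur in the data produces a column filled with NaN. In this sketch the column name `debt` is hypothetical and used only to show that behavior.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is a hypothetical column that is not a key of data
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

all_missing = frame2['debt'].isnull().all()   # every entry of 'debt' is NaN
```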

### Import a DataFrame

In [32]:
df_adm = pd.read_csv('csvs/ADMISSIONS.csv')

In [34]:
df_adm.head()

|   | row_id | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admission_location | discharge_location | insurance | language | religion | marital_status | ethnicity | edregtime | edouttime | diagnosis | hospital_expire_flag | has_chartevents_data |
|---|--------|------------|---------|-----------|-----------|-----------|----------------|--------------------|--------------------|-----------|----------|----------|----------------|-----------|-----------|-----------|-----------|----------------------|----------------------|
| 0 | 12258 | 10006 | 142345 | 2164-10-23 21:09:00 | 2164-11-01 17:15:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Medicare | NaN | CATHOLIC | SEPARATED | BLACK/AFRICAN AMERICAN | 2164-10-23 16:43:00 | 2164-10-23 23:00:00 | SEPSIS | 0 | 1 |
| 1 | 12263 | 10011 | 105331 | 2126-08-14 22:32:00 | 2126-08-28 18:59:00 | 2126-08-28 18:59:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Private | NaN | CATHOLIC | SINGLE | UNKNOWN/NOT SPECIFIED | NaN | NaN | HEPATITIS B | 1 | 1 |
| 2 | 12265 | 10013 | 165520 | 2125-10-04 23:36:00 | 2125-10-07 15:13:00 | 2125-10-07 15:13:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Medicare | NaN | CATHOLIC | NaN | UNKNOWN/NOT SPECIFIED | NaN | NaN | SEPSIS | 1 | 1 |
| 3 | 12269 | 10017 | 199207 | 2149-05-26 17:19:00 | 2149-06-03 18:42:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Medicare | NaN | CATHOLIC | DIVORCED | WHITE | 2149-05-26 12:08:00 | 2149-05-26 19:45:00 | HUMERAL FRACTURE | 0 | 1 |
| 4 | 12270 | 10019 | 177759 | 2163-05-14 20:43:00 | 2163-05-15 12:00:00 | 2163-05-15 12:00:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Medicare | NaN | CATHOLIC | DIVORCED | WHITE | NaN | NaN | ALCOHOLIC HEPATITIS | 1 | 1 |


You can check all the column names:

In [60]:
df_adm.columns

Index(['row_id', 'subject_id', 'hadm_id', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data'],
      dtype='object')
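
By default `read_csv` reads timestamp columns such as `admittime` as plain strings. One possible next step, sketched here, is to pass `parse_dates` so that date arithmetic works. To keep the example runnable without the file, two rows from the preview above are inlined through `io.StringIO`; the variable `csv_text` and the derived `stay` column are illustrative and not part of the data set.

```python
import io
import pandas as pd

# Two rows copied from the preview above, reduced to five columns
csv_text = """row_id,subject_id,hadm_id,admittime,dischtime
12258,10006,142345,2164-10-23 21:09:00,2164-11-01 17:15:00
12269,10017,199207,2149-05-26 17:19:00,2149-06-03 18:42:00
"""

# parse_dates converts the named columns to datetime values while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['admittime', 'dischtime'])

# Illustrative derived column: length of each hospital stay as a Timedelta
df['stay'] = df['dischtime'] - df['admittime']
```

With the real file, the same `parse_dates` argument could be passed alongside the path used above.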