# NumPy

NumPy, short for Numerical Python, is the fundamental package for high-performance scientific computing and data analysis in Python.

Why NumPy?

* Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
* Common array algorithms like sorting, unique, and set operations
* Efficient descriptive statistics and aggregating/summarizing data
* Data alignment and relational data manipulations for merging and joining together heterogeneous data sets
* Expressing conditional logic as array expressions instead of loops with if-elif-else branches
* Group-wise data manipulations (aggregation, transformation, function application)


Import the library. By convention it is aliased as `np`:

import numpy as np

## ndarray

*  A Multidimensional Array Object
*  A generic multidimensional container for **homogeneous** data


Creating an ndarray


In [56]:
# From a Python list
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

In [57]:
arr1.shape

(5,)

In [6]:
# Create an array filled with zeros
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [60]:
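# Create an uninitialized multidimensional array
# (np.empty does not set the values, so the contents are arbitrary and only happen to be zeros here)
np.empty((1, 2, 3))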

array([[[0., 0., 0.],
        [0., 0., 0.]]])
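
Because an ndarray holds homogeneous data, NumPy settles on one dtype for all elements. The following is a minimal sketch (the names `mixed` and `grid` are illustrative, not from the notebook) showing how a list of ints and floats is upcast and how to inspect the result:

```python
import numpy as np

# Mixed ints and floats are upcast to one common dtype (float64 here)
mixed = np.array([6, 7.5, 8, 0, 1])
print(mixed.dtype)   # float64

# A nested list of equal-length rows becomes a 2-D array
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.ndim)     # 2
print(grid.shape)    # (2, 3)
print(grid.dtype)    # the platform's default integer type
```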

## Basic indexing and slicing

As you would expect from Python:
* `[idx]`
* `[begin:end:stepsize]`
  * Default values
    * begin = 0
    * end = length of the array (one past the last element)
    * stepsize = 1
    * the second colon is optional when stepsize is omitted
* Negative indices count from the last element.
  * `-i` is short for `n - i`, with `n` being the number of elements in the array
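
The rules above can be tried on a small one-dimensional array. This is only a sketch; the array `a` is made up for illustration:

```python
import numpy as np

a = np.arange(10)            # the integers 0 through 9

print(a[3])                  # 3
print(a[2:5].tolist())       # [2, 3, 4], the end index is exclusive
print(a[:3].tolist())        # [0, 1, 2], begin defaults to 0
print(a[7:].tolist())        # [7, 8, 9], end defaults to the array length
print(a[::2].tolist())       # [0, 2, 4, 6, 8], every second element
print(a[-1])                 # 9, same as a[len(a) - 1]
print(a[-3:].tolist())       # [7, 8, 9], the last three elements
```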

We create an ndarray filled with random values:

In [2]:
X = np.random.randn(3, 5)
X

array([[-0.40038669,  1.11829379, -1.07499997, -0.20100177,  1.24745862],
       [-0.65041872,  1.03785775, -0.91727384,  1.84144098,  0.57501708],
       [-0.98545769, -1.35682004, -0.61046597,  1.43044743,  0.14167787]])

This looks like a list of lists, and indeed a single index into the array returns a row:

The first row

In [3]:
X[0]

array([-0.40038669,  1.11829379, -1.07499997, -0.20100177,  1.24745862])

The first row, second column

In [4]:
X[0, 1]

1.118293792421547

The first two columns of the first row

In [5]:
X[0, 0:2]

array([-0.40038669,  1.11829379])

Q: Could you retrieve the first 3 elements of the last row?

In [77]:
# Your code

## Boolean Indexing

**Boolean indexing** allows you to select data subsets of an array that satisfy a given condition.

A **boolean index mask** is a NumPy array of type `bool` with the same shape as the data. Each element of the data is selected if the mask holds `True` at that position and skipped if it holds `False`.

In [8]:
#simple example
arr = np.array([10, 20])
idx = np.array([True, False])
arr[idx]

array([10])

In [9]:
#creating test data (despite its name, arr_2d is one-dimensional)
arr_2d = np.random.randn(5)
arr_2d

array([ 1.10493847, -0.17036643,  0.376578  ,  2.38746806, -0.04587725])

In [10]:
#getting a boolean index array
arr_2d < 0

array([False,  True, False, False,  True])

In [11]:
#using the boolean expression directly as the index
arr_2d[arr_2d < 0]

array([-0.17036643, -0.04587725])

In [12]:
#complex boolean expressions
arr_2d[(arr_2d > -0.5) & (arr_2d < 0)]

array([-0.17036643, -0.04587725])

In [13]:
#setting the value based on a boolean indexing array
arr_2d[arr_2d < 0] = 0
arr_2d

array([1.10493847, 0.        , 0.376578  , 2.38746806, 0.        ])
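
Assigning through a mask, as above, changes the array in place. When the original should stay untouched, `np.where(condition, x, y)` builds a new array instead, taking `x` where the condition holds and `y` elsewhere. A small sketch with made-up values:

```python
import numpy as np

values = np.array([1.5, -0.2, 0.4, -3.0])

# New array: 0.0 where the value is negative, the original value elsewhere
clipped = np.where(values < 0, 0.0, values)

print(clipped.tolist())   # [1.5, 0.0, 0.4, 0.0]
print(values.tolist())    # [1.5, -0.2, 0.4, -3.0], the input is unchanged
```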

## Universal Functions

| Function                | Description                                                                                                                                          |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| abs, fabs               | Compute the absolute value element-wise for integer, floating point, or complex values. Use fabs as a faster alternative for non-complex-valued data |
| sqrt                    | Compute the square root of each element. Equivalent to arr ** 0.5                                                                                    |
| square                  | Compute the square of each element. Equivalent to arr ** 2                                                                                           |
| exp                     | Compute the exponential e^x of each element                                                                                                          |
| log, log10, log2, log1p | Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively                                                                    |
| sign                    | Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative)                                                                           |
| ceil                    | Compute the ceiling of each element, i.e. the smallest integer greater than or equal to each element                                                 |
| floor                   | Compute the floor of each element, i.e. the largest integer less than or equal to each element                                                       |
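
Each function in the table works element-wise and returns a new array. A minimal sketch on a small hand-picked array (`x` is an illustrative name) makes the results easy to check by eye:

```python
import numpy as np

x = np.array([-4.0, 0.0, 9.0])

print(np.abs(x).tolist())      # [4.0, 0.0, 9.0]
print(np.square(x).tolist())   # [16.0, 0.0, 81.0]
print(np.sign(x).tolist())     # [-1.0, 0.0, 1.0]

# sqrt of a negative real yields nan with a warning, so select valid inputs first
print(np.sqrt(x[x >= 0]).tolist())   # [0.0, 3.0]
```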

In [76]:
# Preparing the test data
arr = 10 * np.random.rand(10)
arr

array([2.28990449, 9.04799767, 3.2915055 , 1.55574984, 7.05520455,
       9.37876505, 3.65223823, 6.2686591 , 7.98948308, 2.45177094])

In [None]:
np.sqrt(arr)

In [4]:
np.exp(arr)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

Q: Can you calculate the ceiling and floor of each element in `arr`?

In [None]:
# Your code

### Mathematical and Statistical Methods

A set of mathematical functions that compute statistics over an entire array, or over the data along one axis, is available as array methods:

| Method         | Description                                                                                                           |
|----------------|-----------------------------------------------------------------------------------------------------------------------|
| sum            | Sum of all the elements in the array or along an axis. Zero-length arrays have sum 0.                                 |
| mean           | Arithmetic mean. Zero-length arrays have NaN mean.                                                                    |
| std, var       | Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default denominator n).   |
| min, max       | Minimum and maximum.                                                                                                  |
| argmin, argmax | Indices of minimum and maximum elements, respectively.                                                                |
| cumsum         | Cumulative sum of elements starting from 0                                                                            |
| cumprod        | Cumulative product of elements starting from 1                                                                        |

In [79]:
arr = np.random.randn(5, 4)
arr

array([[ 0.24402909, -0.10710692, -1.69824272, -1.14598319],
       [ 1.2945923 , -0.61233272, -1.07752928,  0.31053422],
       [ 0.48618324,  0.03837849, -0.69905448, -1.09340228],
       [ 0.46073165, -0.07068223,  0.21529544,  1.39293254],
       [ 0.91574104,  0.70246449,  0.21456679, -0.18463992]])

In [7]:
arr.mean()

-0.12010048164196137

In [8]:
arr.sum()

-2.4020096328392273

In [9]:
arr.mean(axis=1)

array([ 0.07695205,  0.06638701,  0.34703253, -0.69824656, -0.39262745])
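
The `axis` argument decides which dimension is collapsed. The sketch below uses a small fixed array (names `m` and `v` are illustrative) so the results can be verified by hand, and also shows the `ddof` adjustment mentioned in the table:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum())                  # 21, over all elements
print(m.sum(axis=0).tolist())   # [5, 7, 9], one value per column
print(m.sum(axis=1).tolist())   # [6, 15], one value per row
print(m.cumsum().tolist())      # [1, 3, 6, 10, 15, 21], running total over the flattened array

# var divides by n by default; ddof=1 divides by n - 1 instead
v = np.array([2.0, 4.0, 6.0])
print(v.var())                  # 8/3, about 2.667
print(v.var(ddof=1))            # 4.0
```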

Q: Can you calculate the maximum and minimum values of `arr`?

In [80]:
# Your code

# Pandas

pandas contains high-level data structures and manipulation tools designed to make data analysis in Python fast and easy. It is built on top of NumPy, which makes it easy to use in NumPy-centric applications.

![pandas-data-structure](pics/pandas-data-structure.png)


## Pandas data structures

In [5]:
import pandas as pd

### Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

In [3]:
obj = pd.Series([4, 7, -5, 3])

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

* Create a Series with an index identifying each data point:
* Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values:

In [8]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [9]:
obj2['a']

-5

In [10]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64
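
The index stays attached to the values through selection and arithmetic. The following sketch rebuilds the same Series under the illustrative name `s` and adds a second, made-up Series `other` to show that operations align on labels rather than on position:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Select several labels at once; the result follows the requested order
print(s[['c', 'a', 'd']].tolist())   # [3, -5, 4]

# Arithmetic keeps the index attached to the values
doubled = s * 2
print(doubled['b'])                  # 14

# Operations between two Series align on labels, not on position
other = pd.Series([10, 20], index=['a', 'b'])
total = s + other
print(total['a'])                    # 5.0, from -5 + 10
print(total['d'])                    # nan, because 'd' is missing from `other`
```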

A Series can also be used like a dict that maps index labels to values:

In [11]:
'b' in obj2

True

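You can build a Series from a Python dictionary. When an explicit index is passed, values are matched by key: index labels missing from the dict (here 'California') get NaN, and dict keys not listed in the index (here 'Utah') are left out.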

In [81]:

pop_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [83]:
obj3 = pd.Series(pop_data, index=states)

In [84]:
obj3

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
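
Missing entries such as the NaN above can be detected and dropped. A short sketch, assuming the same dictionary and index as in the cells above (the variable name `s` is illustrative):

```python
import pandas as pd

pop_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
s = pd.Series(pop_data, index=states)

# True where a value is missing
print(s.isna().tolist())           # [True, False, False, False]

# Keep only the entries that have a value
print(s.dropna().index.tolist())   # ['Ohio', 'Oregon', 'Texas']

# 'Utah' was not in the requested index, so it is not part of the Series
print('Utah' in s)                 # False
```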

Q: What's the population of Oregon?

In [86]:
# Your code

## DataFrame

* A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
* The DataFrame has both a row and a column index; it can be thought of as a dict of Series that all share the same index.

### Create a DataFrame

In [20]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

In [22]:
frame

|   | state  | year | pop |
|---|--------|------|-----|
| 0 | Ohio   | 2000 | 1.5 |
| 1 | Ohio   | 2001 | 1.7 |
| 2 | Ohio   | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |


You can specify which columns to include, and in which order:

In [23]:
frame1 = pd.DataFrame(data, columns=['state', 'pop'])

To retrieve all the column names:

In [25]:
frame1.columns

Index(['state', 'pop'], dtype='object')
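
The "dict of Series" view of a DataFrame can be checked directly. This is a sketch on a shortened version of the data above; the column name `pop_thousands` is made up for illustration:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}
frame = pd.DataFrame(data)

# Selecting one column returns a Series that shares the frame's index
col = frame['pop']
print(type(col).__name__)              # Series
print(col.index.equals(frame.index))   # True

# Each column keeps its own dtype
print(frame.dtypes['pop'])             # float64

# Assigning to a new key adds a column, much like adding a key to a dict
frame['pop_thousands'] = frame['pop'] * 1000
print(frame.columns.tolist())          # ['state', 'year', 'pop', 'pop_thousands']
```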

### Import a dataframe

In [6]:
df_adm = pd.read_csv('csvs/ADMISSIONS.csv')

Take a look at the first 5 rows using the `head()` method:

In [34]:
df_adm.head()

|   | row_id | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admission_location | discharge_location | insurance | language | religion | marital_status | ethnicity | edregtime | edouttime | diagnosis | hospital_expire_flag | has_chartevents_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12258 | 10006 | 142345 | 2164-10-23 21:09:00 | 2164-11-01 17:15:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Medicare | NaN | CATHOLIC | SEPARATED | BLACK/AFRICAN AMERICAN | 2164-10-23 16:43:00 | 2164-10-23 23:00:00 | SEPSIS | 0 | 1 |
| 1 | 12263 | 10011 | 105331 | 2126-08-14 22:32:00 | 2126-08-28 18:59:00 | 2126-08-28 18:59:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Private | NaN | CATHOLIC | SINGLE | UNKNOWN/NOT SPECIFIED | NaN | NaN | HEPATITIS B | 1 | 1 |
| 2 | 12265 | 10013 | 165520 | 2125-10-04 23:36:00 | 2125-10-07 15:13:00 | 2125-10-07 15:13:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Medicare | NaN | CATHOLIC | NaN | UNKNOWN/NOT SPECIFIED | NaN | NaN | SEPSIS | 1 | 1 |
| 3 | 12269 | 10017 | 199207 | 2149-05-26 17:19:00 | 2149-06-03 18:42:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Medicare | NaN | CATHOLIC | DIVORCED | WHITE | 2149-05-26 12:08:00 | 2149-05-26 19:45:00 | HUMERAL FRACTURE | 0 | 1 |
| 4 | 12270 | 10019 | 177759 | 2163-05-14 20:43:00 | 2163-05-15 12:00:00 | 2163-05-15 12:00:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Medicare | NaN | CATHOLIC | DIVORCED | WHITE | NaN | NaN | ALCOHOLIC HEPATITIS | 1 | 1 |


Let’s have a look at data dimensionality, feature names, and feature types.

In [87]:
df_adm.shape

(129, 19)

The table contains 129 rows and 19 columns.

You can print all the column names:

In [60]:
df_adm.columns

Index(['row_id', 'subject_id', 'hadm_id', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data'],
      dtype='object')

Use the `info()` method to print general information about the DataFrame:

In [88]:
df_adm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 19 columns):
row_id                  129 non-null int64
subject_id              129 non-null int64
hadm_id                 129 non-null int64
admittime               129 non-null object
dischtime               129 non-null object
deathtime               40 non-null object
admission_type          129 non-null object
admission_location      129 non-null object
discharge_location      129 non-null object
insurance               129 non-null object
language                81 non-null object
religion                128 non-null object
marital_status          113 non-null object
ethnicity               129 non-null object
edregtime               92 non-null object
edouttime               92 non-null object
diagnosis               129 non-null object
hospital_expire_flag    129 non-null int64
has_chartevents_data    129 non-null int64
dtypes: int64(5), object(14)
memory usage: 19.3+ KB
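
The non-null counts reported by `info()` imply how many values are missing in each column. A way to get those numbers directly is sketched below on a hypothetical three-column miniature of the admissions table (the rows are invented, only the column names come from the data above):

```python
import pandas as pd

# Hypothetical miniature of the admissions table; None marks a missing value
df = pd.DataFrame({
    'row_id': [1, 2, 3, 4],
    'deathtime': [None, '2126-08-28 18:59:00', None, None],
    'language': ['ENGL', None, 'ENGL', 'RUSS'],
})

# Number of missing values per column (row count minus non-null count)
missing = df.isna().sum()
print(missing['row_id'])      # 0
print(missing['deathtime'])   # 3
print(missing['language'])    # 1

# Column dtypes, the same per-column type information that info() lists
print(df.dtypes)
```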


## Indexing

| Type                         | Notes                                                                                                                                                                                                    |
|------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| obj\[val\]                   | Select single column or sequence of columns from the DataFrame. Special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion). |
| obj.loc\[val\]               | .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. |
| obj.iloc\[val\]              | .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing. |
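
The difference between label-based and position-based access is easiest to see when the row labels are not integers. The following is a sketch on a small hypothetical frame (the labels `x`, `y`, `z` and columns `a`, `b` are made up):

```python
import pandas as pd

# Hypothetical frame with string row labels so labels and positions differ
df = pd.DataFrame({'a': [10, 20, 30], 'b': [1.0, 2.0, 3.0]},
                  index=['x', 'y', 'z'])

print(df.loc['y', 'a'])                # 20, by row label and column label
print(df.iloc[1, 0])                   # 20, by row position and column position

# Label slices include the end label; position slices exclude the end
print(df.loc['x':'y', 'a'].tolist())   # [10, 20]
print(df.iloc[0:1, 0].tolist())        # [10]

# Out-of-bounds slices are allowed with iloc, single positions are not
print(len(df.iloc[1:10]))              # 2
try:
    df.iloc[10]
except IndexError as err:
    print('IndexError:', err)

# Unknown labels raise KeyError with loc
try:
    df.loc['missing']
except KeyError as err:
    print('KeyError:', err)
```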

### Select by label

In [9]:
df_adm['row_id']

0      12258
1      12263
2      12265
3      12269
4      12270
       ...  
124    41055
125    41070
126    41087
127    41090
128    41092
Name: row_id, Length: 129, dtype: int64

In [19]:
df_adm.loc[:,'row_id']

0      12258
1      12263
2      12265
3      12269
4      12270
       ...  
124    41055
125    41070
126    41087
127    41090
128    41092
Name: row_id, Length: 129, dtype: int64

### Select by position

In [21]:
df_adm.iloc[5,2]

103770

Slicing

In [23]:
df_adm.iloc[1:5, 2:4]

|   | hadm_id | admittime           |
|---|---------|---------------------|
| 1 | 105331  | 2126-08-14 22:32:00 |
| 2 | 165520  | 2125-10-04 23:36:00 |
| 3 | 199207  | 2149-05-26 17:19:00 |
| 4 | 177759  | 2163-05-14 20:43:00 |


### Boolean Indexing

You can use boolean vectors to filter the data. The operators are `|` for or, `&` for and, and `~` for not. Each comparison must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `<`.

In [7]:
df_adm.columns

Index(['row_id', 'subject_id', 'hadm_id', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data'],
      dtype='object')

In [31]:
# &
df_adm[(df_adm['marital_status']=='SINGLE') & (df_adm['discharge_location'] == 'HOME')]

|   | row_id | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admission_location | discharge_location | insurance | language | religion | marital_status | ethnicity | edregtime | edouttime | diagnosis | hospital_expire_flag | has_chartevents_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46 | 12368 | 10117 | 187023 | 2138-06-05 17:23:00 | 2138-06-11 10:16:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private | NaN | CATHOLIC | SINGLE | UNKNOWN/NOT SPECIFIED | 2138-06-05 11:42:00 | 2138-06-05 21:20:00 | FEVER | 0 | 1 |
| 101 | 40512 | 42292 | 138503 | 2162-01-16 13:56:00 | 2162-01-19 13:45:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private | ENGL | CATHOLIC | SINGLE | WHITE | 2162-01-16 11:28:00 | 2162-01-16 16:12:00 | PNEUMONIA/HYPOGLCEMIA/SYNCOPE | 0 | 1 |
| 124 | 41055 | 44083 | 198330 | 2112-05-28 15:45:00 | 2112-06-07 16:50:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private | ENGL | CATHOLIC | SINGLE | WHITE | 2112-05-28 13:16:00 | 2112-05-28 17:30:00 | PERICARDIAL EFFUSION | 0 | 1 |
| 127 | 41090 | 44222 | 192189 | 2180-07-19 06:55:00 | 2180-07-20 13:00:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Medicare | ENGL | CATHOLIC | SINGLE | WHITE | 2180-07-19 04:50:00 | 2180-07-19 08:23:00 | BRADYCARDIA | 0 | 1 |


In [33]:
# ~
df_adm[~(df_adm['marital_status']=='SINGLE')]

|   | row_id | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admission_location | discharge_location | insurance | language | religion | marital_status | ethnicity | edregtime | edouttime | diagnosis | hospital_expire_flag | has_chartevents_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12258 | 10006 | 142345 | 2164-10-23 21:09:00 | 2164-11-01 17:15:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Medicare | NaN | CATHOLIC | SEPARATED | BLACK/AFRICAN AMERICAN | 2164-10-23 16:43:00 | 2164-10-23 23:00:00 | SEPSIS | 0 | 1 |
| 2 | 12265 | 10013 | 165520 | 2125-10-04 23:36:00 | 2125-10-07 15:13:00 | 2125-10-07 15:13:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Medicare | NaN | CATHOLIC | NaN | UNKNOWN/NOT SPECIFIED | NaN | NaN | SEPSIS | 1 | 1 |
| 3 | 12269 | 10017 | 199207 | 2149-05-26 17:19:00 | 2149-06-03 18:42:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Medicare | NaN | CATHOLIC | DIVORCED | WHITE | 2149-05-26 12:08:00 | 2149-05-26 19:45:00 | HUMERAL FRACTURE | 0 | 1 |
| 4 | 12270 | 10019 | 177759 | 2163-05-14 20:43:00 | 2163-05-15 12:00:00 | 2163-05-15 12:00:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Medicare | NaN | CATHOLIC | DIVORCED | WHITE | NaN | NaN | ALCOHOLIC HEPATITIS | 1 | 1 |
| 5 | 12277 | 10026 | 103770 | 2195-05-17 07:39:00 | 2195-05-24 11:45:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | REHAB/DISTINCT PART HOSP | Medicare | NaN | OTHER | NaN | WHITE | 2195-05-17 01:49:00 | 2195-05-17 08:29:00 | STROKE/TIA | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 118 | 40993 | 43881 | 172454 | 2104-09-24 17:31:00 | 2104-09-30 16:17:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Private | ENGL | NOT SPECIFIED | MARRIED | WHITE | 2104-09-24 12:07:00 | 2104-09-24 18:50:00 | ACUTE PULMONARY EMBOLISM | 0 | 1 |
| 119 | 40994 | 43881 | 167021 | 2104-10-24 09:44:00 | 2104-11-01 11:59:00 | NaN | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private | ENGL | NOT SPECIFIED | MARRIED | WHITE | 2104-10-24 07:17:00 | 2104-10-24 11:10:00 | UPPER GI BLEED | 0 | 1 |
| 120 | 40998 | 43909 | 167612 | 2152-10-09 19:05:00 | 2152-10-09 22:33:00 | 2152-10-09 22:33:00 | EMERGENCY | EMERGENCY ROOM ADMIT | DEAD/EXPIRED | Medicare | RUSS | UNOBTAINABLE | NaN | WHITE | 2152-10-09 17:00:00 | 2152-10-09 19:54:00 | PNEUMONIA;TELEMETRY | 1 | 1 |
| 121 | 41005 | 43927 | 110958 | 2175-10-02 12:30:00 | 2175-10-06 15:00:00 | NaN | ELECTIVE | PHYS REFERRAL/NORMAL DELI | SNF | Medicare | ENGL | CATHOLIC | WIDOWED | WHITE | NaN | NaN | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... | 0 | 1 |
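
The `|` operator works the same way as `&` and `~`. The sketch below uses a hypothetical four-row frame with two of the admissions columns (the rows are invented) and also shows that conditions can be stored in variables before being combined:

```python
import pandas as pd

# Hypothetical rows using two column names from the admissions table
df = pd.DataFrame({
    'insurance': ['Medicare', 'Private', 'Medicaid', 'Private'],
    'hospital_expire_flag': [0, 1, 0, 0],
})

# | keeps rows where at least one condition is true
either = df[(df['insurance'] == 'Medicare') | (df['hospital_expire_flag'] == 1)]
print(either.index.tolist())                      # [0, 1]

# Conditions can be stored in variables and combined afterwards
is_private = df['insurance'] == 'Private'
survived = df['hospital_expire_flag'] == 0
print(df[is_private & survived].index.tolist())   # [3]
print(df[~is_private].index.tolist())             # [0, 2]
```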


Q: Could you find the admission records whose `admission_type` is 'EMERGENCY'?

In [89]:
# Your code

Q: And how many of them are there?

In [90]:
# Your code

## Descriptive Statistics

The `describe` method shows basic statistical characteristics of each numerical feature (`int64` and `float64` types): number of non-missing values, mean, standard deviation, minimum and maximum, median, and the 0.25 and 0.75 quartiles.

In [34]:
df_adm.describe()

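|       | row_id       | subject_id   | hadm_id      | hospital_expire_flag | has_chartevents_data |
|-------|--------------|--------------|--------------|----------------------|----------------------|
| count | 129.0        | 129.0        | 129.0        | 129.0                | 129.0                |
| mean  | 28036.44186  | 28010.410853 | 152343.44186 | 0.310078             | 0.992248             |
| std   | 14036.548988 | 16048.502883 | 27858.788248 | 0.464328             | 0.088045             |
| min   | 12258.0      | 10006.0      | 100375.0     | 0.0                  | 0.0                  |
| 25%   | 12339.0      | 10088.0      | 128293.0     | 0.0                  | 1.0                  |
| 50%   | 39869.0      | 40310.0      | 157235.0     | 0.0                  | 1.0                  |
| 75%   | 40463.0      | 42135.0      | 174739.0     | 1.0                  | 1.0                  |
| max   | 41092.0      | 44228.0      | 199395.0     | 1.0                  | 1.0                  |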


In [36]:
df_adm['row_id'].max()

41092
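
`describe()` on a frame summarizes numeric columns by default. Called on a column of text values it reports a different set of statistics: count, number of unique values, most frequent value, and its frequency. A minimal sketch on a made-up sample (the Series below is not taken from the admissions file):

```python
import pandas as pd

# Hypothetical sample of a text column
admission_type = pd.Series(['EMERGENCY', 'EMERGENCY', 'ELECTIVE', 'EMERGENCY'])

summary = admission_type.describe()
print(summary['count'])    # 4
print(summary['unique'])   # 2
print(summary['top'])      # EMERGENCY
print(summary['freq'])     # 3
```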

`unique()` returns the distinct values of a column in order of first appearance (the result is not sorted):

In [38]:
df_adm['admission_type'].unique()

array(['EMERGENCY', 'ELECTIVE', 'URGENT'], dtype=object)

For categorical (type `object`) and boolean (type `bool`) features we can use the `value_counts` method to show how often each value occurs:

In [37]:
df_adm['admission_type'].value_counts()

EMERGENCY    119
ELECTIVE       8
URGENT         2
Name: admission_type, dtype: int64
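
When proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A small sketch on a hypothetical four-element sample:

```python
import pandas as pd

# Hypothetical sample of a categorical column
s = pd.Series(['EMERGENCY', 'EMERGENCY', 'EMERGENCY', 'ELECTIVE'])

counts = s.value_counts()                 # absolute frequencies, largest first
shares = s.value_counts(normalize=True)   # relative frequencies, summing to 1

print(counts['EMERGENCY'])   # 3
print(shares['EMERGENCY'])   # 0.75
print(shares['ELECTIVE'])    # 0.25
```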

Q: Can you find the top 3 diagnoses among all the admission records?

In [97]:
# Your code


---

## *Apply

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument:

In [45]:
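import numpy as np
# On a Series, apply calls the function once per element, so np.mean of a single
# value just returns that value as a float (see the output below).
# This is NOT the same as df_adm['row_id'].mean(), which returns one number.
df_adm['row_id'].apply(np.mean)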

0      12258.0
1      12263.0
2      12265.0
3      12269.0
4      12270.0
        ...   
124    41055.0
125    41070.0
126    41087.0
127    41090.0
128    41092.0
Name: row_id, Length: 129, dtype: float64

You can also pass a function you define yourself; extra positional arguments are supplied through `args`:

In [46]:
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

In [50]:
df_adm['row_id'].apply(subtract_and_divide, args=(1000,8))


0      1407.250
1      1407.875
2      1408.125
3      1408.625
4      1408.750
         ...   
124    5006.875
125    5008.750
126    5010.875
127    5011.250
128    5011.500
Name: row_id, Length: 129, dtype: float64

The same computation written as a lambda function:

In [51]:
df_adm['row_id'].apply(lambda x: (x-1000)/8)

0      1407.250
1      1407.875
2      1408.125
3      1408.625
4      1408.750
         ...   
124    5006.875
125    5008.750
126    5010.875
127    5011.250
128    5011.500
Name: row_id, Length: 129, dtype: float64
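
The examples above call `apply` on a single column. On a whole DataFrame, the `axis` argument decides whether the function receives columns or rows. A minimal sketch on a hypothetical numeric frame (columns `a` and `b` are made up):

```python
import pandas as pd

# Hypothetical numeric frame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())
print(col_range.tolist())   # [2, 20]

# axis=1: the function receives each row as a Series
row_sum = df.apply(lambda row: row['a'] + row['b'], axis=1)
print(row_sum.tolist())     # [11, 22, 33]
```

For simple arithmetic like this, plain column expressions such as `df['a'] + df['b']` give the same result and are usually faster; `apply` is most useful when the logic cannot be expressed with vectorized operations.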