# <center> STATS 607 - LECTURE 9
## <center> 10/03/2018

Pandas is a Python library that provides indexed, column-oriented data structures. Extensive Pandas documentation is available on the web, including the [official documentation](http://pandas.pydata.org/pandas-docs/stable/), the [API reference](http://pandas.pydata.org/pandas-docs/stable/api.html), and the [cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html).

The main Pandas data structures are the Series, which holds a one-dimensional sequence of values, and the DataFrame, which holds a rectangular, two-dimensional dataset. Typically, the Series is used to store data for a single variable or for a time series, while the DataFrame is used to store a dataset in which the columns are variables and the rows are cases.

In addition to these data structures, Pandas provides a large number of functions and methods to manipulate and summarize Series and DataFrame objects. Pandas data structures can sometimes, but not always, be used interchangeably with, or in combination with, NumPy ndarrays and other Python data structures, as we will see further below.

In [1]:
# Import necessary modules.
import numpy as np
import pandas as pd

In [2]:
# Print the versions of the modules with which this notebook was built.
print('Numpy version: ', np.__version__)
print('Pandas version: ', pd.__version__)

Numpy version:  1.15.2
Pandas version:  0.23.4


## Pandas Series - A one-dimensional labeled data structure

### Example 1

Let's create a pandas Series from a list:

In [3]:
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])  # np.nan marks a missing value.
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [4]:
s.index # Access the index of the series.

RangeIndex(start=0, stop=6, step=1)

In [5]:
s.values # Access the values of the series.

array([ 1.,  3.,  5., nan,  6.,  8.])

In [6]:
s.dtype # Access the type of the series, similar to a Numpy array object.

dtype('float64')

In [7]:
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8], index=list('abcdef'))  # Specify an index in the creation of the series.
s

a    1.0
b    3.0
c    5.0
d    NaN
e    6.0
f    8.0
dtype: float64

Pandas objects support two kinds of indexing: position-based indexing and label-based indexing. This means that 1) you have fast access to elements based on their position, and 2) even if you modify the series, for example by changing the order of its elements, you can still access the same elements through their labels. As the following instructions demonstrate, one can access elements of the series above by position or by label:

In [8]:
print(s[2])
print(s['d'])

5.0
nan
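
The bare `s[2]` form relies on pandas guessing that an integer means a position. The explicit accessors `.iloc` (position) and `.loc` (label) remove that ambiguity. A minimal sketch, rebuilding the series above (the name `r` is introduced here only for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=list('abcdef'))

by_position = s.iloc[2]   # third element, whatever its label is
by_label = s.loc['c']     # element labeled 'c', wherever it sits

# Reverse the order: positions change, but labels travel with their values.
r = s.iloc[::-1]
print(by_position, by_label)
print(r.iloc[2], r.loc['c'])
```

After the reversal, position 2 holds the missing value that was labeled 'd', while the label 'c' still points at 5.0.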


### Example 2

Let's create a pandas Series from a dictionary:

In [9]:
d = {'A':20,'B':40,'C':60,'D':55} # Let's first create a dictionary.
d['B']

40

In [10]:
d[1] # Dictionaries are accessed by key, not by position, and 1 is not a key.

KeyError: 1

In [13]:
d[:'B'] # For the same reason, a dictionary cannot be sliced.

TypeError: unhashable type: 'slice'

In [14]:
s = pd.Series(d) # Create a pandas series out of the dictionary.
s

A    20
B    40
C    60
D    55
dtype: int64

In [15]:
s[1]

40

In [16]:
s['B':]

B    40
C    60
D    55
dtype: int64

Note that the values in a pandas Series share a single dtype, so if one of the values is a string, the whole series is upcast to the generic `object` dtype.

In [17]:
d = {'A':20,'B':40,'C':60,'D':'55'} # Notice that the value associated with 'D' is '55' (a string).
s = pd.Series(d)
s['B':]

B    40
C    60
D    55
dtype: object
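
An `object` series behaves less like a numeric array, so it is often worth converting it back. A minimal sketch, assuming the stray string is a well-formed number, using `pd.to_numeric` (the name `s_num` is introduced here only for illustration):

```python
import pandas as pd

s = pd.Series({'A': 20, 'B': 40, 'C': 60, 'D': '55'})
print(s.dtype)                # object: the values are stored as generic Python objects

# pd.to_numeric parses the string '55' and returns a numeric series.
s_num = pd.to_numeric(s)
print(s_num.dtype)
print(s_num.sum())
```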

### Example 3

Now let's look at some of the operations you can do with a pandas Series:

In [18]:
# These are state populations as of July 2016.
statePopsDict = {'California': 39250017, 'Texas': 27862596, 'Florida': 20612439, 
                  'New York': 19745289, 'Ohio': 11614373, 'Michigan': 9928300}

In [19]:
statePops = pd.Series(statePopsDict)
statePops

California    39250017
Texas         27862596
Florida       20612439
New York      19745289
Ohio          11614373
Michigan       9928300
dtype: int64

In [20]:
statePops['Texas'] # Retrieves the population of 'Texas'.

27862596

In [21]:
statePops[['California','Ohio']] # Retrieves the population of 'California' and 'Ohio'.

California    39250017
Ohio          11614373
dtype: int64

In [22]:
statePops/1e6 # Returns a new pandas series in units of millions.

California    39.250017
Texas         27.862596
Florida       20.612439
New York      19.745289
Ohio          11.614373
Michigan       9.928300
dtype: float64

In [23]:
statePops[statePops > 20e6] # Filter states using a boolean index.

California    39250017
Texas         27862596
Florida       20612439
dtype: int64

In [24]:
# These are state areas - we will use them to compute the state population per unit area.
stateAreaDict = {'California': 163696, 'Alaska': 665384, 'Arizona': 113990, 
                  'New York': 54554, 'Ohio': 44825, 'Michigan': 96713}

In [25]:
stateArea = pd.Series(stateAreaDict)
stateArea

California    163696
Alaska        665384
Arizona       113990
New York       54554
Ohio           44825
Michigan       96713
dtype: int64

Arithmetic between two Series aligns them on their labels. Labels that appear in only one of the two Series produce NaN in the result:

In [26]:
statePops / stateArea # Calculates the state population per unit area.

Alaska               NaN
Arizona              NaN
California    239.773831
Florida              NaN
Michigan      102.657347
New York      361.940261
Ohio          259.104808
Texas                NaN
dtype: float64
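
If only the states present in both series are of interest, the NaN entries can be dropped afterwards, or both series can be restricted to their common labels first. A minimal sketch on shortened copies of the two series (the names `pops`, `area`, `matched`, and `common` are introduced here only for illustration):

```python
import pandas as pd

pops = pd.Series({'California': 39250017, 'Texas': 27862596, 'Michigan': 9928300})
area = pd.Series({'California': 163696, 'Alaska': 665384, 'Michigan': 96713})

density = pops / area          # result covers the union of labels; unmatched ones are NaN
matched = density.dropna()     # keep only states present in both series

# Alternatively, restrict both series to their common labels before dividing.
common = pops.index.intersection(area.index)
density_common = pops[common] / area[common]
print(density_common)
```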

In [27]:
statePops.name = 'State Population' # You can also add a name to the series.
statePops.index.name = 'State' # And also give a label for the index of the series.
statePops

State
California    39250017
Texas         27862596
Florida       20612439
New York      19745289
Ohio          11614373
Michigan       9928300
Name: State Population, dtype: int64

In [28]:
statePops.index = [name.upper() for name in statePops.index]  # The index can be modified (assigning a plain list drops the index name).
statePops

CALIFORNIA    39250017
TEXAS         27862596
FLORIDA       20612439
NEW YORK      19745289
OHIO          11614373
MICHIGAN       9928300
Name: State Population, dtype: int64

In [29]:
statePops.sort_index() # Return a copy with indices in sorted order.

CALIFORNIA    39250017
FLORIDA       20612439
MICHIGAN       9928300
NEW YORK      19745289
OHIO          11614373
TEXAS         27862596
Name: State Population, dtype: int64

In [30]:
statePops.sort_values()  # Return a copy with values in sorted order.

MICHIGAN       9928300
OHIO          11614373
NEW YORK      19745289
FLORIDA       20612439
TEXAS         27862596
CALIFORNIA    39250017
Name: State Population, dtype: int64

In [31]:
statePops # Notice that neither of the two sorts above changed 'statePops'.

CALIFORNIA    39250017
TEXAS         27862596
FLORIDA       20612439
NEW YORK      19745289
OHIO          11614373
MICHIGAN       9928300
Name: State Population, dtype: int64
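
Because the sorting methods return new objects, the result has to be stored if it is needed later. A minimal sketch on a shortened series (the names `pops` and `ordered` are introduced here only for illustration):

```python
import pandas as pd

pops = pd.Series({'Texas': 27862596, 'California': 39250017, 'Michigan': 9928300})

ordered = pops.sort_values(ascending=False)   # new object; pops itself is untouched
print(list(pops.index))
print(list(ordered.index))

pops = pops.sort_index()                      # rebind the name to keep the sorted result
print(list(pops.index))
```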

## Pandas DataFrames - A two-dimensional labeled data structure

### Example 1

In [32]:
dates = pd.date_range('20130101', periods=6) # Returns a fixed frequency DatetimeIndex.

In [33]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [34]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) # Note the columns keyword argument.

In [35]:
df

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -0.484735 | -0.435565 | 0.76508 | 1.170851 |
| 2013-01-02 | -1.577203 | 0.778749 | -0.663895 | 0.636429 |
| 2013-01-03 | 0.101133 | -1.903688 | -0.593115 | 0.454841 |
| 2013-01-04 | 0.470261 | 2.726863 | -1.90186 | -0.001302 |
| 2013-01-05 | 0.83906 | -1.014997 | 0.865013 | 0.680557 |
| 2013-01-06 | -0.115438 | -0.158306 | 1.923634 | -1.521699 |


In [36]:
df.shape

(6, 4)

In [37]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [38]:
df.values # Note that this returns a two-dimensional NumPy array.

array([[-4.84734859e-01, -4.35565412e-01,  7.65080445e-01,
         1.17085107e+00],
       [-1.57720328e+00,  7.78749095e-01, -6.63894852e-01,
         6.36429080e-01],
       [ 1.01133382e-01, -1.90368828e+00, -5.93114943e-01,
         4.54840908e-01],
       [ 4.70260660e-01,  2.72686260e+00, -1.90186016e+00,
        -1.30169201e-03],
       [ 8.39059691e-01, -1.01499723e+00,  8.65012736e-01,
         6.80557467e-01],
       [-1.15438117e-01, -1.58305673e-01,  1.92363411e+00,
        -1.52169858e+00]])

In [39]:
df.dtypes # Reports the type of each column.

A    float64
B    float64
C    float64
D    float64
dtype: object

In [40]:
df.describe() # Returns a statistical summary of the pandas dataframe.

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.12782 | -0.001157 | 0.06581 | 0.236613 |
| std | 0.845367 | 1.606938 | 1.372413 | 0.940811 |
| min | -1.577203 | -1.903688 | -1.90186 | -1.521699 |
| 25% | -0.392411 | -0.870139 | -0.6462 | 0.112734 |
| 50% | -0.007152 | -0.296936 | 0.085983 | 0.545635 |
| 75% | 0.377979 | 0.544485 | 0.84003 | 0.669525 |
| max | 0.83906 | 2.726863 | 1.923634 | 1.170851 |


### Example 2

In [41]:
df = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'), # Pandas replacement for datetime.
                    'C' : pd.Series(1, index=list(range(2,6)), dtype='float32'),
                    'D' : np.array([3] * 4, dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })

In [42]:
df

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 4 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 5 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


In [43]:
df.shape

(4, 6)

In [44]:
df.index

Int64Index([2, 3, 4, 5], dtype='int64')

In [45]:
df.values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

In [46]:
df.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [47]:
df.describe() # Note that this only describes the numerical columns of the data (int, float, ...).

| | A | C | D |
|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 |
| mean | 1.0 | 1.0 | 3.0 |
| std | 0.0 | 0.0 | 0.0 |
| min | 1.0 | 1.0 | 3.0 |
| 25% | 1.0 | 1.0 | 3.0 |
| 50% | 1.0 | 1.0 | 3.0 |
| 75% | 1.0 | 1.0 | 3.0 |
| max | 1.0 | 1.0 | 3.0 |


### Example 3

In [48]:
# Define the column names to read from the hospital.csv file and load them into a dataframe named 'patients'.
col_names=['id','name','sex','age','wgt','smoke','sys','dia']
patients = pd.read_csv('hospital.csv', usecols=col_names)
patients

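| | id | name | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|---|---|
| 0 | YPL-320 | SMITH | m | 38 | 176 | 1 | 124 | 93 |
| 1 | GLI-532 | JOHNSON | m | 43 | 163 | 0 | 109 | 77 |
| 2 | PNI-258 | WILLIAMS | f | 38 | 131 | 0 | 125 | 83 |
| 3 | MIJ-579 | JONES | f | 40 | 133 | 0 | 117 | 75 |
| 4 | XLK-030 | BROWN | f | 49 | 119 | 0 | 122 | 80 |
| 5 | TFP-518 | DAVIS | f | 46 | 142 | 0 | 121 | 70 |
| 6 | LPD-746 | MILLER | f | 33 | 142 | 1 | 130 | 88 |
| 7 | ATA-945 | WILSON | m | 40 | 180 | 0 | 115 | 82 |
| 8 | VNL-702 | MOORE | m | 28 | 183 | 0 | 115 | 78 |
| 9 | LQW-768 | TAYLOR | f | 31 | 132 | 0 | 118 | 86 |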


In [49]:
patients.shape # Obtains the number of rows and columns of the dataframe.

(100, 8)

In [50]:
patients.dtypes # Reports the dtype of each column of the dataframe.

id       object
name     object
sex      object
age       int64
wgt       int64
smoke     int64
sys       int64
dia       int64
dtype: object

In [51]:
patients.head(10) # Displays the first 10 rows of the dataframe.

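| | id | name | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|---|---|
| 0 | YPL-320 | SMITH | m | 38 | 176 | 1 | 124 | 93 |
| 1 | GLI-532 | JOHNSON | m | 43 | 163 | 0 | 109 | 77 |
| 2 | PNI-258 | WILLIAMS | f | 38 | 131 | 0 | 125 | 83 |
| 3 | MIJ-579 | JONES | f | 40 | 133 | 0 | 117 | 75 |
| 4 | XLK-030 | BROWN | f | 49 | 119 | 0 | 122 | 80 |
| 5 | TFP-518 | DAVIS | f | 46 | 142 | 0 | 121 | 70 |
| 6 | LPD-746 | MILLER | f | 33 | 142 | 1 | 130 | 88 |
| 7 | ATA-945 | WILSON | m | 40 | 180 | 0 | 115 | 82 |
| 8 | VNL-702 | MOORE | m | 28 | 183 | 0 | 115 | 78 |
| 9 | LQW-768 | TAYLOR | f | 31 | 132 | 0 | 118 | 86 |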


In [52]:
patients.tail(3) # Displays the last 3 rows of the dataframe.

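| | id | name | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|---|---|
| 97 | MEZ-469 | GRIFFIN | m | 49 | 186 | 0 | 119 | 74 |
| 98 | BEZ-311 | DIAZ | m | 45 | 172 | 1 | 136 | 93 |
| 99 | ZZB-405 | HAYES | m | 48 | 177 | 0 | 114 | 86 |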


In [53]:
patients.index.values # Returns a numpy array with the index values.

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [54]:
patients.columns.values # Returns a NumPy array with the column names.

array(['id', 'name', 'sex', 'age', 'wgt', 'smoke', 'sys', 'dia'],
      dtype=object)

In [55]:
patients.values[0:5,] # Returns the first five rows of the two-dimensional NumPy array.

array([['YPL-320', 'SMITH', 'm', 38, 176, 1, 124, 93],
       ['GLI-532', 'JOHNSON', 'm', 43, 163, 0, 109, 77],
       ['PNI-258', 'WILLIAMS', 'f', 38, 131, 0, 125, 83],
       ['MIJ-579', 'JONES', 'f', 40, 133, 0, 117, 75],
       ['XLK-030', 'BROWN', 'f', 49, 119, 0, 122, 80]], dtype=object)

In [56]:
patients.describe() # Provides a statistical summary of the patients data.

| | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 38.28 | 154.0 | 0.34 | 122.78 | 82.96 |
| std | 7.215416 | 26.571421 | 0.476095 | 6.71284 | 6.932459 |
| min | 25.0 | 111.0 | 0.0 | 109.0 | 68.0 |
| 25% | 32.0 | 130.75 | 0.0 | 117.75 | 77.75 |
| 50% | 39.0 | 142.5 | 0.0 | 122.0 | 81.5 |
| 75% | 44.0 | 180.25 | 1.0 | 127.25 | 89.0 |
| max | 50.0 | 202.0 | 1.0 | 138.0 | 99.0 |


In [57]:
patients.describe(include='all') # Provides a statistical summary of the patients data (includes non-numerical data).

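| | id | name | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|---|---|
| count | 100 | 100 | 100 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| unique | 100 | 100 | 2 | NaN | NaN | NaN | NaN | NaN |
| top | HJQ-495 | SCOTT | f | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | 1 | 53 | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | 38.28 | 154.0 | 0.34 | 122.78 | 82.96 |
| std | NaN | NaN | NaN | 7.215416 | 26.571421 | 0.476095 | 6.71284 | 6.932459 |
| min | NaN | NaN | NaN | 25.0 | 111.0 | 0.0 | 109.0 | 68.0 |
| 25% | NaN | NaN | NaN | 32.0 | 130.75 | 0.0 | 117.75 | 77.75 |
| 50% | NaN | NaN | NaN | 39.0 | 142.5 | 0.0 | 122.0 | 81.5 |
| 75% | NaN | NaN | NaN | 44.0 | 180.25 | 1.0 | 127.25 | 89.0 |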


In [58]:
patients.sort_index(axis=1).head() # Sorts the columns alphabetically by name (axis=1).

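| | age | dia | id | name | sex | smoke | sys | wgt |
|---|---|---|---|---|---|---|---|---|
| 0 | 38 | 93 | YPL-320 | SMITH | m | 1 | 124 | 176 |
| 1 | 43 | 77 | GLI-532 | JOHNSON | m | 0 | 109 | 163 |
| 2 | 38 | 83 | PNI-258 | WILLIAMS | f | 0 | 125 | 131 |
| 3 | 40 | 75 | MIJ-579 | JONES | f | 0 | 117 | 133 |
| 4 | 49 | 80 | XLK-030 | BROWN | f | 0 | 122 | 119 |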


In [59]:
patients.sort_values(by=['age','sex'],ascending=[False,False]).head(10) # Sorts the data by age and then by sex, both in descending order.

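| | id | name | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|---|---|
| 19 | XBA-581 | ROBINSON | m | 50 | 172 | 0 | 125 | 76 |
| 54 | DAU-529 | REED | m | 50 | 186 | 1 | 129 | 89 |
| 50 | FLJ-908 | STEWART | m | 49 | 170 | 1 | 129 | 95 |
| 97 | MEZ-469 | GRIFFIN | m | 49 | 186 | 0 | 119 | 74 |
| 4 | XLK-030 | BROWN | f | 49 | 119 | 0 | 122 | 80 |
| 87 | GGU-691 | HUGHES | f | 49 | 123 | 1 | 128 | 96 |
| 15 | KOQ-996 | MARTIN | m | 48 | 181 | 1 | 130 | 92 |
| 93 | FCD-425 | GONZALES | m | 48 | 174 | 0 | 123 | 79 |
| 99 | ZZB-405 | HAYES | m | 48 | 177 | 0 | 114 | 86 |
| 20 | BKD-785 | CLARK | f | 48 | 133 | 0 | 121 | 75 |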


In [60]:
patients=patients.drop(['name'],axis=1) # De-identifies the data by removing the 'name' column from the dataframe.
patients.head()

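| | id | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|---|
| 0 | YPL-320 | m | 38 | 176 | 1 | 124 | 93 |
| 1 | GLI-532 | m | 43 | 163 | 0 | 109 | 77 |
| 2 | PNI-258 | f | 38 | 131 | 0 | 125 | 83 |
| 3 | MIJ-579 | f | 40 | 133 | 0 | 117 | 75 |
| 4 | XLK-030 | f | 49 | 119 | 0 | 122 | 80 |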


In [61]:
patients=patients.set_index(patients['id'].values) # Sets the row index of the dataframe equal to the values in the 'id' column.
patients.head()

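| | id | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|---|
| YPL-320 | YPL-320 | m | 38 | 176 | 1 | 124 | 93 |
| GLI-532 | GLI-532 | m | 43 | 163 | 0 | 109 | 77 |
| PNI-258 | PNI-258 | f | 38 | 131 | 0 | 125 | 83 |
| MIJ-579 | MIJ-579 | f | 40 | 133 | 0 | 117 | 75 |
| XLK-030 | XLK-030 | f | 49 | 119 | 0 | 122 | 80 |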


In [62]:
patients=patients.drop('id',axis=1) # Removes the 'id' column from the dataframe.
patients.head()

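| | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|
| YPL-320 | m | 38 | 176 | 1 | 124 | 93 |
| GLI-532 | m | 43 | 163 | 0 | 109 | 77 |
| PNI-258 | f | 38 | 131 | 0 | 125 | 83 |
| MIJ-579 | f | 40 | 133 | 0 | 117 | 75 |
| XLK-030 | f | 49 | 119 | 0 | 122 | 80 |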


In [63]:
patients.dtypes # Reports the dtype of each remaining column.

sex      object
age       int64
wgt       int64
smoke     int64
sys       int64
dia       int64
dtype: object
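
The two steps above (copying the 'id' values into the index, then dropping the column) can also be done in a single call by passing the column name to `set_index`. A minimal sketch on three rows copied from the table above (the names `toy` and `indexed` are introduced here only for illustration):

```python
import pandas as pd

# Three rows copied from the patients table; the CSV file is not needed here.
toy = pd.DataFrame({'id': ['YPL-320', 'GLI-532', 'PNI-258'],
                    'sex': ['m', 'm', 'f'],
                    'age': [38, 43, 38]})

indexed = toy.set_index('id')     # 'id' becomes the row index and leaves the columns
print(indexed.index.name)
print(list(indexed.columns))
print(indexed.loc['GLI-532', 'age'])
```

One visible difference from the two-step version is that the index keeps the name 'id', which shows up as a header above the row labels.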

#### Indexing and Slicing

In [64]:
patients['smoke'].head() # You can access data from just one column of the dataframe.

YPL-320    1
GLI-532    0
PNI-258    0
MIJ-579    0
XLK-030    0
Name: smoke, dtype: int64

In [65]:
patients.smoke.head() # Another way of accessing the column data.

YPL-320    1
GLI-532    0
PNI-258    0
MIJ-579    0
XLK-030    0
Name: smoke, dtype: int64

In [66]:
type(patients['smoke']) # Notice the type returned.

pandas.core.series.Series

In [67]:
patients['smoke'].describe() # Summarizes just the column 'smoke'.

count    100.000000
mean       0.340000
std        0.476095
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: smoke, dtype: float64

In [68]:
patients['smoke']=patients['smoke'].astype('category') # Convert the 'smoke' column from integer to categorical type.
patients.dtypes

sex        object
age         int64
wgt         int64
smoke    category
sys         int64
dia         int64
dtype: object

In [69]:
patients['smoke'].describe() # Summarizes just the column 'smoke' (now reported as a categorical variable).

count     100
unique      2
top         0
freq       66
Name: smoke, dtype: int64
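
For a categorical column, a frequency table is often more informative than `describe()`. A minimal sketch using the first ten 'smoke' values shown in the table earlier (the names `smoke` and `counts` are introduced here only for illustration):

```python
import pandas as pd

# First ten 'smoke' values from the patients table, converted to a categorical series.
smoke = pd.Series([1, 0, 0, 0, 0, 0, 1, 0, 0, 0]).astype('category')

print(list(smoke.cat.categories))   # the distinct levels of the variable
counts = smoke.value_counts()       # number of observations per level
print(counts)
```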

In [70]:
patients.head()

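| | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|
| YPL-320 | m | 38 | 176 | 1 | 124 | 93 |
| GLI-532 | m | 43 | 163 | 0 | 109 | 77 |
| PNI-258 | f | 38 | 131 | 0 | 125 | 83 |
| MIJ-579 | f | 40 | 133 | 0 | 117 | 75 |
| XLK-030 | f | 49 | 119 | 0 | 122 | 80 |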


In [71]:
patients[:3] # Displays the first three rows of the dataframe.

| | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|
| YPL-320 | m | 38 | 176 | 1 | 124 | 93 |
| GLI-532 | m | 43 | 163 | 0 | 109 | 77 |
| PNI-258 | f | 38 | 131 | 0 | 125 | 83 |


In [72]:
patients.iat[1,1] # Fast access to a single element of the dataframe. The 'i' indicates position-based indexing.

43

In [73]:
patients.at['GLI-532','age'] # Fast access to a single element of the dataframe. Indexing is based on labels.

43

In [74]:
# Retrieve the entire first row of the dataframe (specified using a position-based index).
patients.iloc[0,:]

sex        m
age       38
wgt      176
smoke      1
sys      124
dia       93
Name: YPL-320, dtype: object

In [75]:
# Retrieve the 'age' column of the dataframe (specified using a label-based index).
patients.loc[:,'age'].head()

YPL-320    38
GLI-532    43
PNI-258    38
MIJ-579    40
XLK-030    49
Name: age, dtype: int64
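
What `.iloc` and `.loc` return depends on the selectors: a scalar for two single selectors, a Series for one full slice, and a DataFrame for two lists. A minimal sketch on three rows copied from the table above (the names `toy`, `first_row`, `age_col`, `one_cell`, and `block` are introduced here only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['m', 'm', 'f'], 'age': [38, 43, 38], 'wgt': [176, 163, 131]},
                   index=['YPL-320', 'GLI-532', 'PNI-258'])

first_row = toy.iloc[0, :]     # Series: one row, indexed by column name
age_col = toy.loc[:, 'age']    # Series: one column, indexed by row label
one_cell = toy.iloc[0, 0]      # scalar: a single element
block = toy.loc[['YPL-320', 'PNI-258'], ['age', 'wgt']]   # DataFrame: 2 rows x 2 columns

print(type(first_row).__name__, type(age_col).__name__, type(block).__name__)
```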

In [76]:
patients.head()

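| | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|
| YPL-320 | m | 38 | 176 | 1 | 124 | 93 |
| GLI-532 | m | 43 | 163 | 0 | 109 | 77 |
| PNI-258 | f | 38 | 131 | 0 | 125 | 83 |
| MIJ-579 | f | 40 | 133 | 0 | 117 | 75 |
| XLK-030 | f | 49 | 119 | 0 | 122 | 80 |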


In [77]:
# Displays two specific rows and columns of the data (two ways of going about it).
print(patients.index[[0,1]])
print(patients.columns.get_indexer(['sex','smoke']))

print(patients.loc[patients.index[[0,1]],['sex','smoke']])
print(patients.iloc[[0,1],patients.columns.get_indexer(['sex','smoke'])])

Index(['YPL-320', 'GLI-532'], dtype='object')
[0 3]
        sex smoke
YPL-320   m     1
GLI-532   m     0
        sex smoke
YPL-320   m     1
GLI-532   m     0


In [78]:
(patients['age']>48).head(10) # Check which patients are over the age of 48 (can be used as a boolean index).

YPL-320    False
GLI-532    False
PNI-258    False
MIJ-579    False
XLK-030     True
TFP-518    False
LPD-746    False
ATA-945    False
VNL-702    False
LQW-768    False
Name: age, dtype: bool

In [79]:
patients.loc[patients.age>48,:] # Creates a boolean index and uses it to identify those with age greater than 48.

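| | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|
| XLK-030 | f | 49 | 119 | 0 | 122 | 80 |
| XBA-581 | m | 50 | 172 | 0 | 125 | 76 |
| FLJ-908 | m | 49 | 170 | 1 | 129 | 95 |
| DAU-529 | m | 50 | 186 | 1 | 129 | 89 |
| GGU-691 | f | 49 | 123 | 1 | 128 | 96 |
| MEZ-469 | m | 49 | 186 | 0 | 119 | 74 |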
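
Boolean indexes can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch on the six patients over 48 listed above (the names `toy`, `older_smokers`, and `female_or_50` are introduced here only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['f', 'm', 'm', 'm', 'f', 'm'],
                    'age': [49, 50, 49, 50, 49, 49],
                    'smoke': [0, 0, 1, 1, 1, 0]},
                   index=['XLK-030', 'XBA-581', 'FLJ-908', 'DAU-529', 'GGU-691', 'MEZ-469'])

# Both conditions must hold.
older_smokers = toy.loc[(toy['age'] > 48) & (toy['smoke'] == 1), :]

# At least one condition must hold.
female_or_50 = toy.loc[(toy['sex'] == 'f') | (toy['age'] == 50), :]

print(list(older_smokers.index))
print(list(female_or_50.index))
```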


#### Apply Functions


In [80]:
patients.head() # Displays the first few lines of the dataframe.

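| | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|
| YPL-320 | m | 38 | 176 | 1 | 124 | 93 |
| GLI-532 | m | 43 | 163 | 0 | 109 | 77 |
| PNI-258 | f | 38 | 131 | 0 | 125 | 83 |
| MIJ-579 | f | 40 | 133 | 0 | 117 | 75 |
| XLK-030 | f | 49 | 119 | 0 | 122 | 80 |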


In [81]:
patients.describe(include='all') # Summarizes all columns of the dataframe, including the non-numerical ones.

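| | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|
| count | 100 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| unique | 2 | NaN | NaN | 2.0 | NaN | NaN |
| top | f | NaN | NaN | 0.0 | NaN | NaN |
| freq | 53 | NaN | NaN | 66.0 | NaN | NaN |
| mean | NaN | 38.28 | 154.0 | NaN | 122.78 | 82.96 |
| std | NaN | 7.215416 | 26.571421 | NaN | 6.71284 | 6.932459 |
| min | NaN | 25.0 | 111.0 | NaN | 109.0 | 68.0 |
| 25% | NaN | 32.0 | 130.75 | NaN | 117.75 | 77.75 |
| 50% | NaN | 39.0 | 142.5 | NaN | 122.0 | 81.5 |
| 75% | NaN | 44.0 | 180.25 | NaN | 127.25 | 89.0 |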


In [82]:
patients.mean(numeric_only=True) # Obtains the mean of each numerical column of the dataframe ('sex' and 'smoke' are skipped).

age     38.28
wgt    154.00
sys    122.78
dia     82.96
dtype: float64

In [83]:
numColNames = patients.select_dtypes('int64').columns # Select the int64 columns, i.e. the numerical columns ('smoke' is now categorical).
numColNames

Index(['age', 'wgt', 'sys', 'dia'], dtype='object')

In [84]:
patients.loc[:,numColNames].apply(np.mean, axis=0) # Obtains the mean of each numerical column (axis=0 applies the function column by column).

age     38.28
wgt    154.00
sys    122.78
dia     82.96
dtype: float64
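
With `axis=1` the function receives one row at a time instead of one column. A minimal sketch on three rows of blood pressure readings copied from the table above (the names `toy`, `col_means`, `row_gap`, and `same_gap` are introduced here only for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'sys': [124, 109, 125], 'dia': [93, 77, 83]},
                   index=['YPL-320', 'GLI-532', 'PNI-258'])

col_means = toy.apply(np.mean, axis=0)                             # one value per column
row_gap = toy.apply(lambda row: row['sys'] - row['dia'], axis=1)   # one value per row
same_gap = toy['sys'] - toy['dia']                                 # vectorized equivalent

print(col_means)
print(row_gap)
```

When a vectorized expression such as `toy['sys'] - toy['dia']` exists, it is usually simpler and faster than a row-wise `apply`.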

In [85]:
patients.loc[:,numColNames].apply(np.cumsum).head() # Obtains the cumulative sum along the columns.

| | age | wgt | sys | dia |
|---|---|---|---|---|
| YPL-320 | 38 | 176 | 124 | 93 |
| GLI-532 | 81 | 339 | 233 | 170 |
| PNI-258 | 119 | 470 | 358 | 253 |
| MIJ-579 | 159 | 603 | 475 | 328 |
| XLK-030 | 208 | 722 | 597 | 408 |


In [86]:
# Obtains the difference between the max and min for each one of the columns.
patients.loc[:,numColNames].apply(lambda x: x.max() - x.min())

age    25
wgt    91
sys    29
dia    31
dtype: int64

In [87]:
patients['age'].max()-patients['age'].min() # Confirms the difference above for the column age.

25

In [88]:
patients.loc[:,numColNames].apply(lambda x: (x-np.mean(x))/np.std(x)).head() # Centers and standardizes columns (np.std divides by n, i.e. ddof=0).

| | age | wgt | sys | dia |
|---|---|---|---|---|
| YPL-320 | -0.039001 | 0.832128 | 0.182657 | 1.455556 |
| GLI-532 | 0.65745 | 0.340416 | -2.063124 | -0.864055 |
| PNI-258 | -0.039001 | -0.869952 | 0.332376 | 0.005799 |
| MIJ-579 | 0.239579 | -0.794304 | -0.865374 | -1.154006 |
| XLK-030 | 1.493193 | -1.323841 | -0.116781 | -0.429128 |
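
The standardized values above are consistent with dividing by the population standard deviation (ddof=0), whereas `Series.std()` and `describe()` default to the sample version (ddof=1). A minimal sketch of the difference, using the first five ages from the table (the names `age`, `z_pop`, and `z_sample` are introduced here only for illustration):

```python
import numpy as np
import pandas as pd

age = pd.Series([38, 43, 38, 40, 49])   # first five ages from the patients table

z_pop = (age - age.mean()) / age.std(ddof=0)   # population std: divide by n
z_sample = (age - age.mean()) / age.std()      # sample std: pandas default, divide by n - 1

# NumPy's default on a plain array matches the population version.
print(np.isclose(np.std(age.to_numpy()), age.std(ddof=0)))
print(z_pop.round(3).tolist())
print(z_sample.round(3).tolist())
```

Either convention is defensible; what matters is stating which one is used, since the two differ noticeably for small samples.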


In [89]:
patients.head() # Note that none of the apply operations above modified the patients dataframe.

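| | sex | age | wgt | smoke | sys | dia |
|---|---|---|---|---|---|---|
| YPL-320 | m | 38 | 176 | 1 | 124 | 93 |
| GLI-532 | m | 43 | 163 | 0 | 109 | 77 |
| PNI-258 | f | 38 | 131 | 0 | 125 | 83 |
| MIJ-579 | f | 40 | 133 | 0 | 117 | 75 |
| XLK-030 | f | 49 | 119 | 0 | 122 | 80 |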
