In this lecture, we work through some difficult data manipulation problems and explain the techniques in a more applied setting.

Our first example comes from the healthcare industry, where data scientists often work with administrative (billing) data. Examples include diagnosis codes, present-on-admission (POA) codes, procedure codes, revenue codes, and admission/discharge dates. Healthcare data usually contain many categorical variables. Before running most machine learning algorithms, we must convert these categorical variables into 0/1 indicators (flags). This becomes tedious when there are many columns to handle: say we receive data with 99 diagnosis code fields, where the same code can appear in any field.

In [1]:
import numpy as np
import pandas as pd

Let's create a pseudo-dataset. Suppose we have a snippet of healthcare data like the one below, with three diagnosis fields (dx1-dx3). For simplicity and illustration, the fields hold pseudo-diagnosis codes written as the letters 'a', 'b', 'c', etc., and the string 'NA' marks a missing code. In real data, these could be real diagnosis codes (e.g. A06.89), procedure codes, POA codes, or revenue codes. You can think of each record as one encounter in the inpatient setting:

In [2]:
# Build the frame from a plain list of rows. Passing mixed types through
# np.array would coerce every value (including x and y) to a string.
my_data = [[111, 5.5, 'a', 'b', 'c', 1.3],
           [111, 3.0, 'a', 'b', 'e', 3.2],
           [222, 1.2, 'b', 'a', 'c', 2.4],
           [222, 3.5, 'a', 'NA', 'd', 1.5],
           [333, 7.2, 'a', 'NA', 'NA', 3.5],
           [333, 2.1, 'a', 'b', 'NA', 6.6],
           [333, 2.2, 'a', 'b', 'd', 6.9],
           [333, 2.1, 'a', 'b', 'd', 9.2]]
df = pd.DataFrame(data=my_data, columns=['PatientID', 'x', 'dx1', 'dx2', 'dx3', 'y'])
df

  PatientID    x dx1 dx2 dx3    y
0       111  5.5   a   b   c  1.3
1       111  3.0   a   b   e  3.2
2       222  1.2   b   a   c  2.4
3       222  3.5   a  NA   d  1.5
4       333  7.2   a  NA  NA  3.5
5       333  2.1   a   b  NA  6.6
6       333  2.2   a   b   d  6.9
7       333  2.1   a   b   d  9.2


The simplest way to get dummy variables is to use the get_dummies() function. We first create a table of flags for each diagnosis field and then concatenate the tables on the row index, creating a final table that contains all diagnosis code flags (called 'DX' in our example):

In [3]:
dx1= pd.get_dummies(df['dx1'], prefix='dx1')
dx2 = pd.get_dummies(df['dx2'], prefix='dx2')
dx3 = pd.get_dummies(df['dx3'], prefix='dx3')
# print('dx1:\n', dx1, '\n')
# print('dx2:\n', dx2, '\n')
# print('dx3:\n', dx3, '\n')
DX = pd.concat([dx1, dx2, dx3], axis=1)
del dx1,dx2,dx3
DX

   dx1_a  dx1_b  dx2_NA  dx2_a  dx2_b  dx3_NA  dx3_c  dx3_d  dx3_e
0      1      0       0      0      1       0      1      0      0
1      1      0       0      0      1       0      0      0      1
2      0      1       0      1      0       0      1      0      0
3      1      0       1      0      0       0      0      1      0
4      1      0       1      0      0       1      0      0      0
5      1      0       0      0      1       1      0      0      0
6      1      0       0      0      1       0      0      1      0
7      1      0       0      0      1       0      0      1      0
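
With 99 diagnosis fields, writing one get_dummies() call per column does not scale. As a minimal sketch (the frame `toy` and its values are made up for illustration), the diagnosis columns can be selected by name pattern and encoded in a single call, since get_dummies() also accepts a DataFrame and uses each column name as the prefix:

```python
import pandas as pd

# Hypothetical toy data mirroring the dx1-dx3 layout (values are illustrative)
toy = pd.DataFrame({
    'dx1': ['a', 'a', 'b'],
    'dx2': ['b', 'NA', 'a'],
    'dx3': ['c', 'NA', 'c'],
})

# Pick every diagnosis column by its name instead of listing them by hand
dx_cols = [c for c in toy.columns if c.startswith('dx')]

# One call encodes all selected columns; dtype=int gives 0/1 instead of booleans
flags = pd.get_dummies(toy[dx_cols], dtype=int)
print(flags)
```

The same two lines work unchanged whether there are 3 or 99 diagnosis columns, as long as they share the 'dx' naming convention assumed here.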


Now let's roll up the encounters for each patient by summing the flags, so that we get the count of each diagnosis code per patient (a missing code, 'NA', is counted as a dummy diagnosis code). First, we combine the flag table with the original data and use a group-by to sum the flags per patient:

In [4]:
mydata = pd.concat([df,DX], axis=1)
mydata.drop(['dx1','dx2','dx3'], axis=1, inplace=True)
varlist1=mydata.columns.tolist()
varlist2=[v for v in varlist1 if str(v).startswith('dx')] # column names that are diagnosis
varlist3=[v for v in varlist1 if not str(v).startswith('dx') and v != 'PatientID'] # column names that are not diagnosis
print(varlist1, varlist2, varlist3, sep='\n')
df2=mydata.groupby(['PatientID'])[varlist2].sum()
df2

['PatientID', 'x', 'y', 'dx1_a', 'dx1_b', 'dx2_NA', 'dx2_a', 'dx2_b', 'dx3_NA', 'dx3_c', 'dx3_d', 'dx3_e']
['dx1_a', 'dx1_b', 'dx2_NA', 'dx2_a', 'dx2_b', 'dx3_NA', 'dx3_c', 'dx3_d', 'dx3_e']
['x', 'y']


           dx1_a  dx1_b  dx2_NA  dx2_a  dx2_b  dx3_NA  dx3_c  dx3_d  dx3_e
PatientID
111            2      0       0      0      2       0      1      0      1
222            1      1       1      1      0       0      1      1      0
333            4      0       1      0      3       2      0      2      0


Above, we created the 'df2' object, which contains all the diagnosis code counts rolled up to the patient level. However, the counts are still split by diagnosis sequence (principal, secondary, tertiary, etc.). We need to combine them further by dropping the coding hierarchy and keeping just the total count per code:

In [5]:
dxset1=set(df['dx1'].tolist())
dxset2=set(df['dx2'].tolist())
dxset3=set(df['dx3'].tolist())
varlist4=dxset1|dxset2|dxset3
print('All diag codes in all diag codes fields: ', varlist4)
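for j in varlist4:
    # match the whole code after the 'dxN_' prefix, so that a code such as 'a'
    # cannot also pick up an unrelated column whose name merely ends in 'a'
    idx = df2.columns.str.endswith('_' + str(j))
    df2[str(j)]=df2.iloc[:,idx].sum(axis=1)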
df2.drop(varlist2, axis=1, inplace=True)
df2

All diag codes in all diag codes fields:  {'d', 'c', 'NA', 'a', 'e', 'b'}


           d  c  NA  a  e  b
PatientID
111        0  1   0  2  1  2
222        1  1   1  2  0  1
333        2  0   3  4  0  3
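
The same position-independent counts can also be reached without building dummy columns first. The following is only a sketch of an alternative route, not the method used above: it stacks the diagnosis fields into one column with melt() and then cross-tabulates patients against codes with pd.crosstab(). The frame `toy` and the column names 'position' and 'code' are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: three encounters for two patients
toy = pd.DataFrame({
    'PatientID': [111, 111, 222],
    'dx1': ['a', 'a', 'b'],
    'dx2': ['b', 'b', 'a'],
    'dx3': ['c', 'e', 'c'],
})

# Stack dx1-dx3 into a single 'code' column: one row per (encounter, position)
long = toy.melt(id_vars='PatientID', value_vars=['dx1', 'dx2', 'dx3'],
                var_name='position', value_name='code')

# Count how often each code occurs per patient; the position is ignored
counts = pd.crosstab(long['PatientID'], long['code'])
print(counts)
```

This avoids matching column names by suffix altogether, at the cost of creating a longer intermediate table.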


Now let's merge all the data together, taking the maximum value of x and y for each patient. Our final dataset (see 'df4' below) has three properties: 1) each row represents one patient; 2) we have a count for each diagnosis code, regardless of whether it was coded as principal, secondary, or tertiary; 3) the other variables (x and y in our example) are summarized by their maximum value over the patient's history. Datasets of this type are often used in deep learning problems:

In [6]:
df3=df.groupby(['PatientID'])[varlist3].max()
df4=pd.concat([df2,df3], axis=1)
df4

           d  c  NA  a  e  b    x    y
PatientID
111        0  1   0  2  1  2  5.5  3.2
222        1  1   1  2  0  1  3.5  2.4
333        2  0   3  4  0  3  7.2  9.2
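
A per-patient maximum is only meaningful when the column is numeric. The sketch below uses made-up values to show what can go wrong when numbers arrive as text (a common situation with billing extracts), and how pd.to_numeric() plus a named aggregation gives the intended summary. The names `toy`, `x_max` and `n_encounters` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data where the measurement x is stored as text
toy = pd.DataFrame({
    'PatientID': ['111', '111', '222'],
    'x': ['9.5', '10.0', '1.2'],
})

# On strings, max() compares character by character, so '9.5' beats '10.0'
text_max = toy.groupby('PatientID')['x'].max()

# Convert to numbers first, then summarize each patient in one pass
toy['x'] = pd.to_numeric(toy['x'])
summary = toy.groupby('PatientID').agg(
    x_max=('x', 'max'),           # largest value over the patient's history
    n_encounters=('x', 'count'),  # number of non-missing encounters
)
print(text_max, summary, sep='\n\n')
```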


Now, let's learn how to reproduce the first-dot (FIRST.variable) syntax of SAS, that is, how to get the first row of each group. Here is the dataset we will work on. We group it by 'id' and take the first (and last) row of each group:

In [7]:
df5 = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                    'value': ["first","second","third","first",
                              "second","first","second","third",
                              "fourth","first","second","first",
                              "first","second","third","first","second"]})
print(df5, '\n')
print(df5.groupby('id').first(), '\n')
print(df5.groupby('id').last())

    id   value
0    1   first
1    1  second
2    1   third
3    2   first
4    2  second
5    3   first
6    3  second
7    3   third
8    3  fourth
9    4   first
10   4  second
11   5   first
12   6   first
13   6  second
14   6   third
15   7   first
16   7  second 

    value
id       
1   first
2   first
3   first
4   first
5   first
6   first
7   first 

     value
id        
1    third
2   second
3   fourth
4   second
5    first
6    third
7   second
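
In SAS, FIRST.id and LAST.id are 0/1 flags attached to every row rather than a filtered table. A rough pandas analogue, sketched here on a shortened made-up version of the data, builds such flags with cumcount(); the column names `first_id` and `last_id` are assumptions chosen to echo the SAS names. Note also that groupby().first() returns the first non-null value of each column, so on data with missing values it may differ from the literal first row, which drop_duplicates() or head(1) would return.

```python
import pandas as pd

# Hypothetical shortened version of the id/value data
toy = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                    'value': ['first', 'second', 'third',
                              'first', 'second', 'first']})

# cumcount() numbers the rows within each group starting at 0
toy['first_id'] = (toy.groupby('id').cumcount() == 0).astype(int)
# counting from the end of each group marks the last row instead
toy['last_id'] = (toy.groupby('id').cumcount(ascending=False) == 0).astype(int)

# Filtering on the flag keeps the literal first row of every group
first_rows = toy[toy['first_id'] == 1]
# drop_duplicates with keep='first' selects the same rows
same_rows = toy.drop_duplicates('id', keep='first')
print(toy)
```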


The melt() function is very similar to 'proc transpose' in SAS: it reshapes the data from wide to long, gathering columns into rows:

In [8]:
print(mydata, '\n')
print(pd.melt(mydata))

  PatientID    x    y  dx1_a  dx1_b  dx2_NA  dx2_a  dx2_b  dx3_NA  dx3_c  \
0       111  5.5  1.3      1      0       0      0      1       0      1   
1       111  3.0  3.2      1      0       0      0      1       0      0   
2       222  1.2  2.4      0      1       0      1      0       0      1   
3       222  3.5  1.5      1      0       1      0      0       0      0   
4       333  7.2  3.5      1      0       1      0      0       1      0   
5       333  2.1  6.6      1      0       0      0      1       1      0   
6       333  2.2  6.9      1      0       0      0      1       0      0   
7       333  2.1  9.2      1      0       0      0      1       0      0   

   dx3_d  dx3_e  
0      0      0  
1      0      1  
2      0      0  
3      1      0  
4      0      0  
5      0      0  
6      1      0  
7      1      0   

     variable value
0   PatientID   111
1   PatientID   111
2   PatientID   222
3   PatientID   222
4   PatientID   333
..        ...   ...
91      dx3
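
Called without arguments, melt() stacks every column, including the patient identifier. In practice one usually keeps the identifier on each row via id_vars. The sketch below, on made-up data, shows this and also the reverse step with pivot(); the names `wide`, `measure` and `reading` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical wide table: one row per patient, one column per measurement
wide = pd.DataFrame({'PatientID': [111, 222],
                     'x': [5.5, 1.2],
                     'y': [1.3, 2.4]})

# Keep PatientID as an identifier and gather x and y into rows
long = wide.melt(id_vars='PatientID', var_name='measure', value_name='reading')

# pivot() undoes the reshaping: one column per measure again
back = long.pivot(index='PatientID', columns='measure', values='reading')
print(long, back, sep='\n\n')
```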

Lastly, let's learn how to manipulate text. Suppose we have a dataset that contains free text, and our goal is to replace certain substrings in every row. Here we have a list of day-name tokens and want to replace each of them with 'blah'. We create a user-defined function to do this:

In [9]:
remove_tokens=['monday','tuesday','wednesday','thursday','friday','saturday', 'friday-saturday']
df0 = np.array([[1, "today is monday, and after today is Wednesday;\n wednesday? \r don't you love Wednesday!"],
                [2, "yesterday is thursday; tomorrow is saturday, we, we are so excited, so excited"],
                [3, "which seat should I take"],
                [4, "I am Rebecca Black;\nI like singing Friday, or Friday-Saturday"],
                [5, ""],
                [6, "---------------"]])
df=pd.DataFrame(data=df0, columns=['Ticket Number', 'message'])
df                 

  Ticket Number                                            message
0             1  today is monday, and after today is Wednesday;...
1             2  yesterday is thursday; tomorrow is saturday, w...
2             3                           which seat should I take
3             4  I am Rebecca Black;\nI like singing Friday, or...
4             5
5             6                                    ---------------


In [10]:
def clean_text(s):
    for i in remove_tokens:
        s=s.lower().replace(i, 'blah')
    return s

In [11]:
df['clean_msg']=df['message'].apply(clean_text)

for j in range(6):
    print(j, ':', df.clean_msg.iloc[j],'\n')

0 : today is blah, and after today is blah;
 blah?  don't you love blah! 

1 : yesterday is blah; tomorrow is blah, we, we are so excited, so excited 

2 : which seat should i take 

3 : i am rebecca black;
i like singing blah, or blah-blah 

4 :  

5 : --------------- 



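Notice that there is a problem: we expected the string 'friday-saturday' to be replaced by a single 'blah', but because 'friday' and 'saturday' come earlier in the list, they were replaced first and we got 'blah-blah'. The trick is to sort the list by token length, longest first: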

In [12]:
# longest tokens first, so 'friday-saturday' is replaced before 'friday'
remove_tokens.sort(key=len, reverse=True)
print(remove_tokens)

def clean_text2(s):
    for i in remove_tokens:
        s=s.lower().replace(i, 'blah')
    return s

['friday-saturday', 'wednesday', 'thursday', 'saturday', 'tuesday', 'monday', 'friday']


In [13]:
df['clean_msg']=df['message'].apply(clean_text2)

for j in range(6):
    print(j, ':', df.clean_msg.iloc[j],'\n')

0 : today is blah, and after today is blah;
 blah?  don't you love blah! 

1 : yesterday is blah; tomorrow is blah, we, we are so excited, so excited 

2 : which seat should i take 

3 : i am rebecca black;
i like singing blah, or blah 

4 :  

5 : --------------- 
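
The loop of replace() calls can also be expressed as a single regular expression. This is a sketch of an alternative, not the approach used above: the tokens are escaped with re.escape() and joined with '|', still longest first, because regex alternation tries the options from left to right. The shortened token list and the messages are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical shortened token list
tokens = ['monday', 'friday', 'saturday', 'friday-saturday']

# Longest first, so 'friday-saturday' is tried before 'friday'
ordered = sorted(tokens, key=len, reverse=True)
# re.escape protects characters such as '-' that may be special in a regex
pattern = '|'.join(re.escape(t) for t in ordered)

msgs = pd.Series(['I like singing Friday, or Friday-Saturday',
                  'today is monday'])

# Lower-case once, then replace every token in one vectorized pass
clean = msgs.str.lower().str.replace(pattern, 'blah', regex=True)
print(clean.tolist())
```

Compared with apply() and a Python loop, this scans each message once, which may matter on large text columns.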



Additional Resources:
   - https://stackoverflow.com/questions/28236305/how-do-i-sum-values-in-a-column-that-match-a-given-condition-using-pandas
   - https://stackoverflow.com/questions/11587782/creating-dummy-variables-in-pandas-for-python
   - http://pandas.pydata.org/pandas-docs/stable/merging.html
   - https://stackoverflow.com/questions/34797774/summing-up-values-for-rows-per-columns-starting-with-col
   - https://stackoverflow.com/questions/35101426/return-subset-of-list-that-matches-condition
   - https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes
   - https://stackoverflow.com/questions/20067636/pandas-dataframe-get-first-row-of-each-group
   - https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf