Import the data and remove rows without a time stamp.

In [1]:
import pandas as pd
df = pd.read_csv('observations.csv')

In [2]:
#Drop rows without a visit date, then change the column to datetime type
df = df.dropna(subset = ['Visit date [EUPATH_0000091]'])
df['Visit date [EUPATH_0000091]'] = pd.to_datetime(df['Visit date [EUPATH_0000091]'])
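
A minimal sketch of the same step on toy data, in case some time stamps are present but unparseable. It parses first with `errors='coerce'` (so bad strings become `NaT`) and drops afterwards; this is a variation on the cell above, not the notebook's method, and the column name `visit_date` and the values are assumptions made up for illustration.

```python
import pandas as pd

# Toy frame; 'visit_date' is a hypothetical stand-in for the real date column
toy = pd.DataFrame({
    'Observation_Id': [1, 2, 3],
    'visit_date': ['2015-03-01', None, 'not a date'],
})

# Parse first: missing and unparseable entries both become NaT
toy['visit_date'] = pd.to_datetime(toy['visit_date'], errors='coerce')

# Then drop every row whose time stamp is NaT
toy = toy.dropna(subset=['visit_date'])
print(toy)
```

Parsing before dropping catches malformed dates as well as empty ones, at the cost of silently discarding them.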

In [3]:
def non_na_rate(df):
    '''
    For each column in a data frame, return the proportion of non-NaN data (in %)
    
    Args:
    df: the data frame
    
    Returns:
    na_rate: a series whose index holds the columns of the data frame and whose values are the corresponding non-NaN rates, sorted in descending order
    '''
    import numpy as np
    keys = (df.keys()).tolist()
    n_col = len(keys)
    n_row = len(df)
    p = np.zeros(n_col)
    for i in range(n_col):
        column = keys[i]
        p[i] = df[column].count()*100/n_row
    na_rate = pd.Series(p,index = keys)
    na_rate = na_rate.sort_values(ascending = False)
    return na_rate
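
The same per-column rate can also be computed without an explicit loop. The sketch below is an equivalent formulation on a toy frame; the function name `non_na_rate_vectorized` and the toy columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def non_na_rate_vectorized(df):
    # notna() gives a boolean frame; the column mean is the non-NaN fraction
    return (df.notna().mean() * 100).sort_values(ascending=False)

toy = pd.DataFrame({
    'a': [1, 2, 3, 4],                       # no missing values
    'b': [1, np.nan, 3, np.nan],             # half missing
    'c': [np.nan, np.nan, np.nan, 4],        # three quarters missing
})
rates = non_na_rate_vectorized(toy)
print(rates)
```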

Because we handle NaNs differently depending on the proportion of NaN in a column and on the 'importance' of that column, we first display the non-NaN rate of each column.

In [4]:
non_na_series = non_na_rate(df)
non_na_series

Observation_Id                                                          100.000000
Household_Id                                                            100.000000
Malaria diagnosis and parasite status [EUPATH_0000338]                  100.000000
Malaria treatment [EUPATH_0000740]                                      100.000000
Fever, subjective duration (days) [EUPATH_0000164]                      100.000000
Febrile [EUPATH_0000097]                                                100.000000
Age at visit (years) [EUPATH_0000113]                                   100.000000
Days since enrollment [EUPATH_0000191]                                  100.000000
Malaria diagnosis [EUPATH_0000090]                                      100.000000
Subjective fever [EUPATH_0000100]                                       100.000000
Participant_Id                                                          100.000000
Temperature (C) [EUPATH_0000110]                                        100.000000
Visi

We group the columns by how we intend to use them. Basically, we want to predict 'diagnosis' from 'attribute', predict 'medication' from 'diagnosis', and study the dependency relationships among columns within 'diagnosis' or within 'attribute'.

In [5]:
classification = {'ID':['Observation_Id','Participant_Id','Household_Id'],\
                 'attribute':['Fever, subjective duration (days) [EUPATH_0000164]','Febrile [EUPATH_0000097]','Age at visit (years) [EUPATH_0000113]','Days since enrollment [EUPATH_0000191]','Subjective fever [EUPATH_0000100]','Temperature (C) [EUPATH_0000110]','Visit type [EUPATH_0000311]','Vomiting [HP_0002013]','Anorexia [SYMP_0000523]','Abdominal pain [HP_0002027]','Fatigue [SYMP_0019177]','Diarrhea [DOID_13250]','Cough [SYMP_0000614]','Diarrhea duration (days) [EUPATH_0000157]','Headache [HP_0002315]','Cough duration (days) [EUPATH_0000156]','Jaundice [HP_0000952]','Vomiting duration (days) [EUPATH_0000165]','Seizures [SYMP_0000124]','Joint pains [SYMP_0000064]','Muscle aches [EUPATH_0000252]','Jaundice duration (days) [EUPATH_0000160]','Seizures duration (days) [EUPATH_0000163]','Anorexia duration (days) [EUPATH_0000155]','ITN last night [EUPATH_0000216]','Abdominal pain duration (days) [EUPATH_0000154]','Headache duration (days) [EUPATH_0000159]','Joint pains duration (days) [EUPATH_0000161]','Fatigue duration (days) [EUPATH_0000158]','Muscle aches duration (days) [EUPATH_0000162]','Asexual Plasmodium parasite density, by microscopy [EUPATH_0000092]','Plasmodium gametocytes present, by microscopy [EUPATH_0000207]','Asexual Plasmodium parasites present, by microscopy [EUPATH_0000048]','Weight (kg) [EUPATH_0000732]','Height (cm) [EUPATH_0010075]','Hemoglobin (g/dL) [EUPATH_0000047]','Submicroscopic Plasmodium present, by LAMP [EUPATH_0000487]'],\
                 'time_stamp':['Visit date [EUPATH_0000091]'],\
                 'prob_unimportant':['Non-malaria medication [EUPATH_0000059]','Other diagnosis [EUPATH_0000317]','Hospital admission date [EUPATH_0000319]','Admitting hospital [EUPATH_0000318]','Other medical complaint [EUPATH_0020002]','Diagnosis at hospitalization [EUPATH_0000638]','Hospital discharge date [EUPATH_0000320]'],\
                 'medication':['Malaria treatment [EUPATH_0000740]',],\
                 'diagnosis':['Malaria diagnosis and parasite status [EUPATH_0000338]','Malaria diagnosis [EUPATH_0000090]','Complicated malaria [EUPATH_0000040]','Basis of complicated diagnosis [EUPATH_0000316]','Severe malaria criteria [EUPATH_0000046]']}


From the result of 'non_na_series', we find that the columns with a non-NaN rate > 97% (we call them 'good columns') are the 'important' columns. Even after removing all rows with NaN in one or more of the good columns, we still have a considerable number of samples to study. Hence, we remove all such rows from the data frame and call the new data frame "df_c".

In [6]:
good_columns = (non_na_series[non_na_series>97].keys()).tolist()
df_c = df.dropna(subset = good_columns).copy()
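
As a small illustration of this thresholding step, the sketch below applies the same idea to a toy frame and reports how many rows survive. The column names, the values and the 75% threshold are assumptions chosen so the effect is visible on five rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id':   [1, 2, 3, 4, 5],                                   # 100% non-NaN
    'temp': [36.7, np.nan, 37.0, 36.9, 37.2],                  # 80% non-NaN
    'lamp': [np.nan, np.nan, np.nan, 'Negative', np.nan],      # 20% non-NaN
})

threshold = 75  # percent; the notebook uses 97 on the real data
rate = toy.notna().mean() * 100
good = rate[rate > threshold].index.tolist()

# Drop only the rows that have NaN in one of the well-populated columns
cleaned = toy.dropna(subset=good).copy()
retained = len(cleaned) / len(toy)
print(good, retained)
```

Sparse columns such as `lamp` are left untouched here, so their NaNs can be handled separately later.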

First we deal with the 'attribute' columns, which we further divide into columns with 'float64' type data and columns with 'object' type data. Looking at the non-NaN rate of the 'attribute-object' columns, only three of them contain NaN values. NaN in these columns can carry information (it probably indicates a test that was not done, and the reason it was not done is worth investigating), so we turn NaN into the string 'NaN'.

## Do 'integer encoding' on some categorical data

In [7]:
df_c_attribute = {'float':df_c[classification['attribute']].select_dtypes(include = ['float64']).copy(),\
                  'object':df_c[classification['attribute']].select_dtypes(include = ['object']).copy()}
                                
display(non_na_rate(df_c_attribute['object']))
df_attribute_c2 = df_c[classification['attribute']].copy()
filled_attribute_object = df_c_attribute['object'].copy().fillna('NaN')

Cough [SYMP_0000614]                                                    100.000000
Diarrhea [DOID_13250]                                                   100.000000
Subjective fever [EUPATH_0000100]                                       100.000000
Visit type [EUPATH_0000311]                                             100.000000
Vomiting [HP_0002013]                                                   100.000000
Anorexia [SYMP_0000523]                                                 100.000000
Abdominal pain [HP_0002027]                                             100.000000
Fatigue [SYMP_0019177]                                                  100.000000
Febrile [EUPATH_0000097]                                                100.000000
Headache [HP_0002315]                                                   100.000000
Jaundice [HP_0000952]                                                   100.000000
Seizures [SYMP_0000124]                                                 100.000000
Join

In [8]:
class cat_imputation:
    def __init__(self,df):
        self.col = (df.keys()).tolist()
        self.df = df
    
    def get_df(self):
        return self.df
    
    def get_column(self,column):
        return self.df[column]
    
    def count_col_cate(self):
        for column in self.col:
            a = self.df[column].value_counts()
            display(a)
        
    
    def all_category(self):
        '''
        returns all the categories in all columns of the data, useful when we want to encode the data frame
        with an ordinal encoder.
        '''
        s = set()
        for column in self.col:
            a = self.df[column].groupby(self.df[column]).count()
            cat = (a.keys()).tolist()
            s = s|set(cat)
        return s
            

In [9]:
t = cat_imputation(filled_attribute_object)
t.all_category()
# t.column_print('Plasmodium gametocytes present, by microscopy [EUPATH_0000207]')

{'NaN',
 'Negative',
 'No',
 'Positive',
 'Scheduled visit',
 'Unable to assess',
 'Unscheduled visit',
 'Yes',
 'no result'}

We encode the 'attribute-object' data frame according to replace_dic, and the encoded data frame is filled_attribute_object.

In [10]:
replace_dic = {'Negative':-1,'No':-1,'Positive':1,'Scheduled visit':1,'Unable to assess':0.5,'Unscheduled visit':-1,\
               'Yes':1,'no result':0}
filled_attribute_object.replace(replace_dic,inplace = True)

def partial_sub(df,part_df):
    #substitute the columns in df that also appear in part_df with the values in part_df (df is modified in place)
    df.loc[:,(part_df.keys()).tolist()]=part_df
    return df
    
df_attribute_c2 = partial_sub(df_attribute_c2,filled_attribute_object)
df_attribute_c2 = df_attribute_c2.dropna()
display(df_attribute_c2) # The categorical entries are turned into integer, labeled as the replace_dic

df_attri_imputed = {'object':filled_attribute_object,\
                   'float':df_c_attribute['float'].copy().dropna()} 
#For 'attribute-float', we simply drop the rows with NaN for now,
#as these columns are less useful than 'attribute-object' when predicting the diagnosis result; we mainly use
#'attribute-float' to study the column dependency within 'attribute' columns
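
Before relying on an integer encoding like this, it can help to check that every category actually has an entry in the mapping. In the cell above, the placeholder string 'NaN' (one of the categories listed by `t.all_category()`) has no key in `replace_dic`, so it would presumably stay a string after `replace`. The helper below is a hypothetical sketch of such a check on toy data; `unmapped_categories` and the toy columns are not part of the notebook.

```python
import pandas as pd

def unmapped_categories(frame, mapping):
    """Return the set of values in `frame` that have no key in `mapping`."""
    seen = set()
    for column in frame.columns:
        seen |= set(frame[column].dropna().unique())
    return seen - set(mapping)

# Toy object columns; 'NaN' here is the placeholder string, not a missing value
toy = pd.DataFrame({
    'Cough': ['Yes', 'No', 'NaN'],
    'LAMP':  ['Negative', 'NaN', 'Positive'],
})
mapping = {'Yes': 1, 'No': -1, 'Positive': 1, 'Negative': -1}

missing = unmapped_categories(toy, mapping)
print(missing)
```

An empty result would mean every category is covered by the mapping.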

Unnamed: 0,"Fever, subjective duration (days) [EUPATH_0000164]",Febrile [EUPATH_0000097],Age at visit (years) [EUPATH_0000113],Days since enrollment [EUPATH_0000191],Subjective fever [EUPATH_0000100],Temperature (C) [EUPATH_0000110],Visit type [EUPATH_0000311],Vomiting [HP_0002013],Anorexia [SYMP_0000523],Abdominal pain [HP_0002027],...,Joint pains duration (days) [EUPATH_0000161],Fatigue duration (days) [EUPATH_0000158],Muscle aches duration (days) [EUPATH_0000162],"Asexual Plasmodium parasite density, by microscopy [EUPATH_0000092]","Plasmodium gametocytes present, by microscopy [EUPATH_0000207]","Asexual Plasmodium parasites present, by microscopy [EUPATH_0000048]",Weight (kg) [EUPATH_0000732],Height (cm) [EUPATH_0010075],Hemoglobin (g/dL) [EUPATH_0000047],"Submicroscopic Plasmodium present, by LAMP [EUPATH_0000487]"
1,0.0,-1,37.24,1271.0,-1,36.7,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,72.0,164.0,13.4,
2,0.0,-1,37.47,1355.0,-1,37.0,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,72.0,164.0,14.1,
3,0.0,-1,37.01,1187.0,-1,36.9,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,73.0,164.0,13.4,
4,0.0,-1,36.02,824.0,-1,36.2,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,74.0,164.0,13.8,
5,0.0,-1,35.25,545.0,-1,36.9,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,75.0,164.0,13.3,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48708,0.0,-1,3.36,110.0,-1,37.0,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,13.0,90.0,9.7,-1
48710,0.0,-1,3.20,54.0,-1,37.3,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,12.0,89.0,13.1,-1
48711,0.0,-1,3.67,223.0,-1,36.8,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,15.0,94.0,12.7,-1
48712,0.0,-1,3.28,82.0,-1,37.7,1,-1,-1,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,13.0,90.0,12.2,-1


## Deal with Diagnosis columns
For NaN in the Diagnosis columns, we don't want to remove the rows but instead treat NaN as a category of its own, as it really means something. For example, Complicated Malaria is left 'NaN' when the Malaria diagnosis is No, because there is then no point in making a Complicated Malaria diagnosis.

In [11]:
#First, turn NaN in the diagnosis columns into strings so that they are kept as a category
df_c3 = partial_sub(df,df_attribute_c2)
df_c3_dia = df_c3[classification['diagnosis']].astype(str)
df_c3 = partial_sub(df_c3,df_c3_dia)
df_c3_attri_dia = df_c3[classification['attribute']+classification['diagnosis']].dropna()

In [39]:
def RF_diagnosis(df,column):
    #Fit a single decision tree (not yet a random forest) that predicts a diagnosis column from the attribute columns
    #the df should contain both the attribute columns and the diagnosis column to be predicted
    
    X = df[classification['attribute']]
    from sklearn.preprocessing import OrdinalEncoder
    enc_mal = OrdinalEncoder()
    data_mal = df[column].values.reshape(-1,1)
    data_mal = data_mal.astype(str)
    enc_mal.fit(data_mal)
    target_array = enc_mal.transform(data_mal)
    
    from sklearn.tree import DecisionTreeClassifier
    
    tree_simple = DecisionTreeClassifier()
    display(X[:100])
    display(target_array.shape)
    tree_simple.fit(X,target_array)
    
    from sklearn.tree import export_graphviz
    file_name = 'simple_tree_'+column+'.dot'
    #To view the dot graph, type the following
    #dot -Tpng 'simple_tree_Malaria diagnosis [EUPATH_0000090].dot' -o 'simple_tree_Malaria diagnosis [EUPATH_0000090].png'
    # in the terminal (make sure you are in the directory of the dot file)
    export_graphviz(tree_simple, out_file =file_name,feature_names=(X.keys()).tolist(),\
                    class_names = enc_mal.categories_[0],rounded = True,filled = True)

In [30]:
np.any(X.isnull())

False

In [37]:
X[:100].shape

(100, 42)

In [40]:
X = df_c3_attri_dia
RF_diagnosis(X,'Malaria diagnosis [EUPATH_0000090]')


Unnamed: 0,"Fever, subjective duration (days) [EUPATH_0000164]",Febrile [EUPATH_0000097],Age at visit (years) [EUPATH_0000113],Days since enrollment [EUPATH_0000191],Subjective fever [EUPATH_0000100],Temperature (C) [EUPATH_0000110],Visit type [EUPATH_0000311],Vomiting [HP_0002013],Anorexia [SYMP_0000523],Abdominal pain [HP_0002027],...,Joint pains duration (days) [EUPATH_0000161],Fatigue duration (days) [EUPATH_0000158],Muscle aches duration (days) [EUPATH_0000162],"Asexual Plasmodium parasite density, by microscopy [EUPATH_0000092]","Plasmodium gametocytes present, by microscopy [EUPATH_0000207]","Asexual Plasmodium parasites present, by microscopy [EUPATH_0000048]",Weight (kg) [EUPATH_0000732],Height (cm) [EUPATH_0010075],Hemoglobin (g/dL) [EUPATH_0000047],"Submicroscopic Plasmodium present, by LAMP [EUPATH_0000487]"
1,0.0,-1.0,37.24,1271.0,-1.0,36.7,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,72.0,164.0,13.4,
2,0.0,-1.0,37.47,1355.0,-1.0,37.0,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,72.0,164.0,14.1,
3,0.0,-1.0,37.01,1187.0,-1.0,36.9,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,73.0,164.0,13.4,
4,0.0,-1.0,36.02,824.0,-1.0,36.2,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,74.0,164.0,13.8,
5,0.0,-1.0,35.25,545.0,-1.0,36.9,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,75.0,164.0,13.3,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124,0.0,-1.0,32.11,1443.0,-1.0,37.5,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,74.0,163.0,12.2,
125,0.0,-1.0,30.68,920.0,-1.0,36.4,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,66.0,163.0,13.8,
127,0.0,-1.0,29.69,560.0,-1.0,36.9,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,64.0,163.0,12.4,-1
128,2.0,1.0,29.79,594.0,1.0,36.8,-1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,64.5,163.0,14.3,


(30060, 1)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
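
One plausible explanation for this error, given that `X.isnull()` reported no missing values above: `replace_dic` has no entry for the placeholder string 'NaN', so that string may still be present in the encoded columns. pandas treats it as an ordinary string, but a conversion to float parses it as a real NaN. The sketch below reproduces the effect on a toy column; the column name and values are assumptions, and this has not been verified against the actual data.

```python
import numpy as np
import pandas as pd

# Toy object column: encoded values plus a leftover 'NaN' placeholder string
toy = pd.DataFrame({'lamp': pd.Series([-1, 1, 'NaN'], dtype=object)})

# pandas sees a string here, not a missing value
has_missing_in_pandas = bool(toy.isnull().any().any())

# Casting to float (which a numeric estimator has to do) parses 'NaN' as nan
as_float = np.asarray(toy, dtype=float)
has_nan_after_cast = bool(np.isnan(as_float).any())

print(has_missing_in_pandas, has_nan_after_cast)
```

If this is indeed the cause, giving 'NaN' its own numeric code in the mapping would be one way to keep the "test not done" information while making the input numeric.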

In [16]:
df_diagnosis = cat_imputation(df_c[classification['diagnosis']].fillna('no data'))
df_diagnosis.count_col_cate()

Blood smear negative / LAMP negative    17048
Blood smear negative / LAMP not done    11154
Symptomatic malaria                      5958
Blood smear not indicated                5911
Blood smear negative / LAMP positive     4181
Blood smear positive / no malaria        2988
Blood smear indicated but not done          2
Name: Malaria diagnosis and parasite status [EUPATH_0000338], dtype: int64

No     41284
Yes     5958
Name: Malaria diagnosis [EUPATH_0000090], dtype: int64

no data    41284
No          5874
Yes           84
Name: Complicated malaria [EUPATH_0000040], dtype: int64

no data                            47158
only parasite density > 500,000       46
only danger signs                     24
severe malaria                        14
Name: Basis of complicated diagnosis [EUPATH_0000316], dtype: int64

no data                    47228
severe anemia                  5
generalized convulsions        5
respiratory distress           2
cerebral malaria               2
Name: Severe malaria criteria [EUPATH_0000046], dtype: int64

In [18]:
def dic_to_arrays(d, fillna = True):
    #Designed for sub_category as in the next section; the returned [list1, list2] can be used as the
    #hierarchical multi-column index for data cleaning
    list1 = []
    list2 = []
    for k in d:
        list2 += d[k]
        if fillna:
            #append an extra 'NaN' sub-category under every top-level key
            list2 += ['NaN']
            list1 += [k]*(len(d[k])+1)
        else:
            list1 += [k]*(len(d[k]))
        
    return list1, list2
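
To show what the two returned lists look like and how they become a hierarchical column index, here is a small self-contained sketch. `dict_to_index_arrays` is a hypothetical compact equivalent of the function above, and the toy dictionary is an assumption.

```python
import pandas as pd

def dict_to_index_arrays(d, add_nan=True):
    # top: the key repeated once per sub-category; sub: the sub-categories
    top, sub = [], []
    for key, labels in d.items():
        labels = list(labels) + (['NaN'] if add_nan else [])
        top += [key] * len(labels)
        sub += labels
    return top, sub

groups = {'LAMP': ['LAMP negative', 'LAMP positive'],
          'malaria': ['Symptomatic malaria']}
top, sub = dict_to_index_arrays(groups)

# Two aligned lists are exactly what MultiIndex.from_arrays expects
columns = pd.MultiIndex.from_arrays([top, sub])
print(columns.tolist())
```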
        

In [20]:
def category_split(s,sub_category = None):
    #Split a compound categorical series (e.g. 'Blood smear negative / LAMP positive') into indicator
    #columns under a hierarchical column index; a group that is not mentioned in an entry gets its 'NaN' column set
    if sub_category is None:
        sub_category = {'Blood smear':['Blood smear negative','not indicated','indicated but not done','Blood smear positive'],\
                        'malaria':['Symptomatic malaria','no malaria'],\
                        'LAMP':['LAMP negative','LAMP not done','LAMP positive']}
    
    list1,list2 = dic_to_arrays(sub_category)
    column_index = pd.MultiIndex.from_arrays([list1, list2])
    r_l = len(s)
    c_l = len(list2)
    
    #use the index of s so that the row labels used below exist in final_df
    final_df = pd.DataFrame(np.zeros((r_l,c_l)),index = s.index,columns = column_index)
    
    for j in s.index:
        for k1 in sub_category:
            if k1 in s[j]:
                for k2 in sub_category[k1]:
                    if k2 in s[j]:
                        #.loc avoids chained assignment on a copy
                        final_df.loc[j,(k1,k2)] = 1
            else:
                final_df.loc[j,(k1,'NaN')] = 1
    
    return final_df
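
The row-by-row loop can also be expressed with vectorized substring tests. The sketch below keeps the same rule (a sub-category is flagged when its label occurs in the entry, and the 'NaN' column is flagged when the group key does not occur at all), but `split_by_substring`, the shortened dictionary and the toy series are assumptions for illustration.

```python
import pandas as pd

def split_by_substring(s, sub_category):
    pieces = {}
    for group, labels in sub_category.items():
        # One indicator column per label, via literal substring matching
        block = pd.DataFrame(
            {label: s.str.contains(label, regex=False).astype(float) for label in labels},
            index=s.index,
        )
        # 'NaN' marks entries that do not mention this group at all
        block['NaN'] = (~s.str.contains(group, regex=False)).astype(float)
        pieces[group] = block
    # Dict keys become the top level of the column MultiIndex
    return pd.concat(pieces, axis=1)

s = pd.Series(['Blood smear negative / LAMP positive', 'Symptomatic malaria'])
sub = {'Blood smear': ['Blood smear negative', 'Blood smear positive'],
       'LAMP': ['LAMP negative', 'LAMP positive'],
       'malaria': ['Symptomatic malaria', 'no malaria']}
wide = split_by_substring(s, sub)
print(wide)
```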
        
    

In [21]:
def OHE_df(df):
    r_l = len(df)
    col = (df.keys()).tolist()
    
    dic = dict()
    first_iteration = True
    
    from sklearn.preprocessing import OneHotEncoder
    
    col_sub_cat= []
    for column in col:
        enc = OneHotEncoder(handle_unknown='ignore')
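        #NOTE: values are read from the global df_diagnosis (NaN already filled with 'no data'), not from the df argument
        diag = np.array(df_diagnosis.get_column(column).values).reshape(-1,1)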
        enc.fit(diag)
        col_sub_cat.append(enc.categories_[0].tolist())
        a = enc.transform(diag).toarray()
        if first_iteration:
            diag_array = a
            first_iteration = False
        else:
            diag_array = np.concatenate((diag_array,a),axis=1)
    
    
    for column_number in range(len(col)):
        column = col[column_number]
        dic = merge_two_dicts(dic,{column:col_sub_cat[column_number]})
    list1,list2 = dic_to_arrays(dic,fillna = False)
    
    column_index = pd.MultiIndex.from_arrays([list1, list2])       
    final_df = pd.DataFrame(diag_array,columns = column_index)
    return final_df
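
For comparison, a similar one-hot table with a two-level column index can be built with pandas alone. This is an alternative sketch, not the encoder-based approach used above; the toy frame is an assumption, with missing values already replaced by 'no data'.

```python
import pandas as pd

toy = pd.DataFrame({
    'Malaria diagnosis':   ['No', 'Yes', 'No'],
    'Complicated malaria': ['no data', 'No', 'no data'],
})

# get_dummies one-hot encodes a single column; concat with a dict and axis=1
# stacks the blocks side by side and uses the dict keys as the top column level
wide = pd.concat(
    {col: pd.get_dummies(toy[col], dtype=float) for col in toy.columns},
    axis=1,
)
print(wide)
```

Each row then has exactly one 1.0 per original column, which is a quick sanity check on the encoding.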


In [22]:
def merge_two_dicts(x, y):
  z = x.copy()   # start with x's keys and values
  z.update(y)    # modifies z with y's keys and values & returns None
  return z

In [23]:
ooc = df.loc[:,classification['diagnosis'][1:]]
ooc.keys()

Index(['Malaria diagnosis [EUPATH_0000090]',
       'Complicated malaria [EUPATH_0000040]',
       'Basis of complicated diagnosis [EUPATH_0000316]',
       'Severe malaria criteria [EUPATH_0000046]'],
      dtype='object')

In [24]:
def ToOneHotEncoder(series, name):
    a = series.value_counts()
    dic = {name:list(a.index)}
    return dic

        

## Cleaned Dataset: with hierarchical index and one-hot encoded categorical columns

In [26]:
import numpy as np
u = OHE_df(ooc)
u

Unnamed: 0_level_0,Malaria diagnosis [EUPATH_0000090],Malaria diagnosis [EUPATH_0000090],Complicated malaria [EUPATH_0000040],Complicated malaria [EUPATH_0000040],Complicated malaria [EUPATH_0000040],Basis of complicated diagnosis [EUPATH_0000316],Basis of complicated diagnosis [EUPATH_0000316],Basis of complicated diagnosis [EUPATH_0000316],Basis of complicated diagnosis [EUPATH_0000316],Severe malaria criteria [EUPATH_0000046],Severe malaria criteria [EUPATH_0000046],Severe malaria criteria [EUPATH_0000046],Severe malaria criteria [EUPATH_0000046],Severe malaria criteria [EUPATH_0000046]
Unnamed: 0_level_1,No,Yes,No,Yes,no data,no data,only danger signs,"only parasite density > 500,000",severe malaria,cerebral malaria,generalized convulsions,no data,respiratory distress,severe anemia
0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47237,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
47238,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
47239,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
47240,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [258]:
def RF_diagnosis(column):
    from sklearn.preprocessing import OrdinalEncoder
    enc_mal = OrdinalEncoder()
    data_mal = df_diagnosis.get_column(column).values.reshape(-1,1)
    data_mal = data_mal.astype(str)
    enc_mal.fit(data_mal)
    target_array = enc_mal.transform(data_mal)
    
    from sklearn.tree import DecisionTreeClassifier
    X = df_attri_imputed['object']
    tree_simple = DecisionTreeClassifier()
    tree_simple.fit(X,target_array)
    
    from sklearn.tree import export_graphviz
    file_name = 'simple_tree_'+column+'.dot'
    export_graphviz(tree_simple, out_file =file_name,feature_names=(X.keys()).tolist(),\
                    class_names = enc_mal.categories_[0],rounded = True,filled = True)

In [28]:
RF_diagnosis(X,'Basis of complicated diagnosis [EUPATH_0000316]')


Unnamed: 0,"Fever, subjective duration (days) [EUPATH_0000164]",Febrile [EUPATH_0000097],Age at visit (years) [EUPATH_0000113],Days since enrollment [EUPATH_0000191],Subjective fever [EUPATH_0000100],Temperature (C) [EUPATH_0000110],Visit type [EUPATH_0000311],Vomiting [HP_0002013],Anorexia [SYMP_0000523],Abdominal pain [HP_0002027],...,Joint pains duration (days) [EUPATH_0000161],Fatigue duration (days) [EUPATH_0000158],Muscle aches duration (days) [EUPATH_0000162],"Asexual Plasmodium parasite density, by microscopy [EUPATH_0000092]","Plasmodium gametocytes present, by microscopy [EUPATH_0000207]","Asexual Plasmodium parasites present, by microscopy [EUPATH_0000048]",Weight (kg) [EUPATH_0000732],Height (cm) [EUPATH_0010075],Hemoglobin (g/dL) [EUPATH_0000047],"Submicroscopic Plasmodium present, by LAMP [EUPATH_0000487]"
1,0.0,-1.0,37.24,1271.0,-1.0,36.7,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,72.0,164.0,13.4,
2,0.0,-1.0,37.47,1355.0,-1.0,37.0,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,72.0,164.0,14.1,
3,0.0,-1.0,37.01,1187.0,-1.0,36.9,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,73.0,164.0,13.4,
4,0.0,-1.0,36.02,824.0,-1.0,36.2,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,74.0,164.0,13.8,
5,0.0,-1.0,35.25,545.0,-1.0,36.9,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,75.0,164.0,13.3,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124,0.0,-1.0,32.11,1443.0,-1.0,37.5,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,74.0,163.0,12.2,
125,0.0,-1.0,30.68,920.0,-1.0,36.4,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,66.0,163.0,13.8,
127,0.0,-1.0,29.69,560.0,-1.0,36.9,1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,64.0,163.0,12.4,-1
128,2.0,1.0,29.79,594.0,1.0,36.8,-1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,-1,-1,64.5,163.0,14.3,


(30060, 1)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').