<p style="font-family: Arial; font-size:4.15em; color:navy; font-style:bold"><br>
Intelligent Data Analysis</p><br>

## Part III Decision Trees  
Pavol Grofčík  
Dennis Sobolev

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import math
import json as js
import seaborn as sns
import scipy.stats as stats
import datetime as dt
import re
import statsmodels.api as sm
from sklearn import neighbors
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import Normalizer,LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
#from graphviz import Source    Dennis you need to install this one to show your Decision Tree
from IPython.display import SVG

#Importing our preprocessing scripts - Preprocesing_Scripts.py
import Preprocesing_Scripts as ps

#Filter out warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

%matplotlib inline

<p style="font-family: Arial; font-size:2.95em; color:gold; font-style:bold"><br>
Content</p><br>

## Following sections

In this NB we provide our finish goal - **prediction of health state** of patients based on hormone values and other factors.
The prediction is made by **Decision Trees** and at the end of our NB we conclude different approaches using different techniques dealing with missing values.

**The following sections include:**

* **Loading and preprocessing train & valid datasets**  
* **Manual classifying based on chosen parameters found out in Analysis**  
* **Traning a model and doing cross validation**  
* **Comparison of models and their scores using different approach**  
* **Statistics & final results**

# I. Loading and preprocessing train/valid datasets

In [2]:
#In this phase we compare two techniques - median/mean method with dealing with missing values
#Lists with train & valid datasets preprocessed with the chosen technique
Medians = []
Means = []

In [3]:
#Loading train sample
df_personal = pd.read_csv("Datasets/personal_train.csv", index_col=0)
df_other = pd.read_csv("Datasets/other_train.csv", index_col=0)
df_personal.head()

Unnamed: 0,name,address,age,sex,date_of_birth
0,Terry Terry,"11818 Lori Crossing Apt. 802\nPughstad, DC 78165",68.0,M,1949-11-16
1,Edith Boudreaux,"PSC 4657, Box 5446\nAPO AP 58412",75.0,F,1943-08-10
2,Stephen Lalk,Unit 9759 Box 9470\nDPO AP 45549,67.0,M,1951-05-28
3,Abraham Bruce,"137 Lewis Flat Suite 762\nWest Elizabeth, AL 3...",34.0,?,1984-02-13
4,Janet Washington,"995 Frank Stravenue\nSouth Matthewport, TX 81402",65.0,F,1953/06/24


In [4]:
#The same process applied to valid sample
df_val_personal = pd.read_csv("Datasets/personal_valid.csv", index_col=0)
df_val_other = pd.read_csv("Datasets/other_valid.csv", index_col=0)
df_val_other.head()

Unnamed: 0,name,address,query hyperthyroid,FTI measured,education,lithium,TT4,T4U,capital-loss,capital-gain,...,hypopituitary,medical_info,on antithyroid medication,referral source,education-num,occupation,TBG measured,TBG,race,FTI
0,Lisa Maguire,"3124 Brian Green Suite 008\nEast Jennifer, RI ...",f,t,Assoc-voc,f,140.0,0.88,0.0,0.0,...,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,SVI,-1100.0,Prof-specialty,f,?,White,160.0
1,Margie Hunter,"4549 Richardson Ridge\nMooremouth, ID 75154",f,t,Some-college,f,139.0,0.86,0.0,0.0,...,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,SVI,10.0,Exec-managerial,f,?,White,161.0
2,Starla Totman,"53807 John Brooks Apt. 773\nDonaldchester, MI ...",f,t,Some-college,f,131.0,1.18,0.0,0.0,...,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,other,10.0,Adm-clerical,f,?,White,
3,Mark Deberry,"791 Kylie Island Suite 926\nCruzbury, MN 75035",f,t,Assoc-acdm,f,50.0,0.25,0.0,0.0,...,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,other,12.0,Tech-support,f,?,White,205.0
4,Jesica Gonzalez,Unit 1922 Box 3740\nDPO AP 33674,f,t,Some-college,f,67.0,0.96,,0.0,...,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,other,10.0,Adm-clerical,f,?,White,69.0


## Merging and preprocessing datasets

In [5]:
#In this section we need to preprocess train & valid datasets twice - using mean and median method for both part
df_merged = pd.merge(df_personal,df_other, on=["name", "address"], how="inner", sort = True, copy = True)

In [6]:
df_merged_valid = pd.merge(df_val_personal, df_val_other, on=["name", "address"], how = "inner", sort = True, copy = True)

In [7]:
df_merged[df_merged["name"] == "Flora Jackson"].head()

Unnamed: 0,name,address,age,sex,date_of_birth,query hyperthyroid,FTI measured,education,lithium,TT4,...,hypopituitary,medical_info,on antithyroid medication,referral source,education-num,occupation,TBG measured,TBG,race,FTI
754,Flora Jackson,"4798 Carter Turnpike\nJosemouth, WV 19340",39.0,F,1979-02-26 00 00 00,t,t,HS-grad,f,84.0,...,,"{'query hypothyroid':'f','T4U measured':'t','p...",f,other,9.0,Adm-clerical,f,?,White,85
755,Flora Jackson,"4798 Carter Turnpike\nJosemouth, WV 19340",39.0,F,1979-02-26 00 00 00,t,t,HS-grad,f,84.0,...,f,"{'query hypothyroid':'f','T4U measured':'t','p...",,other,9.0,Adm-clerical,,?,White,85


In [8]:
#We create common mask to identify duplicates
mask = df_merged.duplicated(subset=["name", "address"])

In [9]:
#Here we select only the rows that are duplicated
duplicates = df_merged[mask == True]

In [10]:
#Here we merge our datasets into one separate for median/mean method
ps.fill_from_duplicates(df=df_merged, duplicates=duplicates, indexes=duplicates.index)
ps.fill_from_duplicates(df=df_merged_valid, duplicates=duplicates, indexes=duplicates.index)

Done


In [11]:
df_merged[df_merged["name"] == "Flora Jackson"]

Unnamed: 0,name,address,age,sex,date_of_birth,query hyperthyroid,FTI measured,education,lithium,TT4,...,hypopituitary,medical_info,on antithyroid medication,referral source,education-num,occupation,TBG measured,TBG,race,FTI
741,Flora Jackson,"4798 Carter Turnpike\nJosemouth, WV 19340",39.0,F,1979-02-26 00 00 00,t,t,HS-grad,f,84.0,...,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,other,9.0,Adm-clerical,f,?,White,85


In [12]:
#Formatting the date
df_merged["date_of_birth"] = df_merged["date_of_birth"].apply(lambda x: ps.format_date(date=x))
df_merged_valid["date_of_birth"] = df_merged_valid["date_of_birth"].apply(lambda x: ps.format_date(date=x))

In [13]:
#Parsing JSON medical info
df_merged = ps.add_columns(data=df_merged, json_col = "medical_info")
df_merged_valid = ps.add_columns(data = df_merged_valid, json_col = "medical_info")

In [14]:
#Noramalizing categorical columns
ps.normalize_columns_booleans(data = df_merged, df_new= df_merged)
ps.normalize_columns_booleans(data = df_merged_valid, df_new= df_merged_valid)

Normalized query hyperthyroid
Normalized FTI measured
Normalized lithium
Normalized tumor
Normalized sick
Normalized TT4 measured
Normalized goitre
Normalized hypopituitary
Normalized on antithyroid medication
Normalized TBG measured
Normalized query hypothyroid
Normalized T4U measured
Normalized pregnant
Normalized thyroid surgery
Normalized TSH measured
Normalized query on thyroxine
Normalized I131 treatment
Normalized on thyroxine
Normalized T3 measured
Normalized psych
Normalized query hyperthyroid
Normalized FTI measured
Normalized lithium
Normalized tumor
Normalized sick
Normalized TT4 measured
Normalized goitre
Normalized hypopituitary
Normalized on antithyroid medication
Normalized TBG measured
Normalized query hypothyroid
Normalized T4U measured
Normalized pregnant
Normalized thyroid surgery
Normalized TSH measured
Normalized query on thyroxine
Normalized I131 treatment
Normalized on thyroxine
Normalized T3 measured
Normalized psych


In [15]:
#Special kind of normalization
#Stripping the class column for both - train and valid dataset
df_merged["class"] = df_merged["class"].str.replace("_","")
df_merged["class"] = df_merged["class"].str.replace("|","")
df_merged[["class", "class_num"]] = df_merged["class"].str.split(".",expand = True)

df_merged_valid["class"] = df_merged_valid["class"].str.replace("_", "")
df_merged_valid["class"] = df_merged_valid["class"].str.replace("|", "")
df_merged_valid[["class", "class_num"]] = df_merged_valid["class"].str.split(".", expand = True)

In [16]:
df_merged["class"].head()

0                     negative
1    increased binding protein
2                     negative
3                     negative
4                     negative
Name: class, dtype: object

In [17]:
df_merged[df_merged.columns[15:]].head()

Unnamed: 0,T3,fnlwgt,hours-per-week,relationship,sick,workclass,TT4 measured,class,marital-status,goitre,...,T4U measured,pregnant,thyroid surgery,TSH measured,query on thyroxine,I131 treatment,on thyroxine,T3 measured,psych,class_num
0,2.0,240810.0,45.0,Husband,f,private,t,negative,Married-civ-spouse,f,...,t,f,f,t,f,f,f,t,f,511
1,3.0,55743.0,45.0,Wife,f,Private,t,increased binding protein,Married-civ-spouse,f,...,t,f,f,t,f,f,f,t,f,1799
2,2.0,193868.0,50.0,Not-in-family,f,Self-emp-inc,t,negative,Never-married,f,...,t,f,f,t,f,f,f,t,f,2741
3,1.6,119069.0,40.0,Husband,f,Self-emp-not-inc,t,negative,Married-civ-spouse,f,...,t,f,f,t,f,f,f,t,f,542
4,1.8,195189.0,40.0,Unmarried,f,Private,t,negative,Divorced,f,...,t,f,f,t,f,f,f,t,f,72


## Here name cols that you want to use in your manual prediction, because the same cols we will use to build a model and compare it with your manual one  that are not used as follows below

# Check it out I do it for yours selected cols

# Creating mean and median datasets

In [18]:
df_mean=df_merged.copy()
df_med=df_merged.copy()

df_mean_valid=df_merged_valid.copy()
df_med_valid=df_merged_valid.copy()

In [19]:
#Here we fill missing values by chosen technique for columns we will use in next phase
Filled_cols = ["TT4", "T4U", "TSH", "T3"]

#Iterating over columns and filling it 
for column in Filled_cols:

    ps.fill_numeric_miss_values(data=df_mean, column =column, method="mean")
    ps.fill_numeric_miss_values(data=df_mean_valid, column = column, method="mean")
    ps.fill_numeric_miss_values(data=df_med, column = column, method= "med")
    ps.fill_numeric_miss_values(data=df_med_valid, column = column, method= "med")


MEAN  Value is: 109.32809956917185
Filling missing values by MEAN done!
MEAN  Value is: 105.93978723404254
Filling missing values by MEAN done!
MED  Value is: 109.32809956917185
Filling missing values by MED done!
MED  Value is: 105.93978723404254
Filling missing values by MED done!
MEAN  Value is: 0.9961508491508485
Filling missing values by MEAN done!
MEAN  Value is: 0.9874944320712697
Filling missing values by MEAN done!
MED  Value is: 0.9961508491508485
Filling missing values by MED done!
MED  Value is: 0.9874944320712697
Filling missing values by MED done!
MEAN  Value is: 4.501820448877802
Filling missing values by MEAN done!
MEAN  Value is: 6.209628603104214
Filling missing values by MEAN done!
MED  Value is: 4.501820448877802
Filling missing values by MED done!
MED  Value is: 6.209628603104214
Filling missing values by MED done!
MEAN  Value is: 2.024237288135594
Filling missing values by MEAN done!
MEAN  Value is: 1.9864356435643564
Filling missing values by MEAN done!
MED  Valu

In [20]:
#Function calculates current age from date of birth
def calculate_age(born):
    today = dt.datetime.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

In [21]:
#Script for parsing date to appropriate value
def parse_date(df, date_col):
    date = df[date_col].copy()
    
    years = 100
    days_per_year = 365.24
    
    for i in range(0,len(date)):
        date[i] = str(date[i])
        if(len(date[i].split("-")[0]) > 2):
            date[i] = date[i][2:]
            if(len(date[i]) > 2):
                date[i] = dt.datetime.strptime(date[i], "%y-%m-%d")
                date[i] = date[i] - dt.timedelta(days = (years*days_per_year))
                
        else:
            if(len(date[i]) > 2):
                date[i] = dt.datetime.strptime(date[i], "%y-%m-%d")
                date[i] = date[i] - dt.timedelta(days = (years*days_per_year))

        
        
    df[date_col] = date  

In [22]:
#Parsing dates
parse_date(df_mean, "date_of_birth")
parse_date(df_med, "date_of_birth")
#parse_date(df_med_valid, "date_of_birth")
#parse_date(df_mean_valid, "date_of_birth")

In [23]:
df_mean_valid["date_of_birth"].unique()

array(['1985-08-10', '1959-09-19', '1989-04-07', '1990-04-21',
       '1939-10-01', '62-04-21', '1949-12-23', '1953-01-25', '1960-08-29',
       '1964-08-26', '1943-02-07', '1993-12-09', '1985-02-27',
       '2002-02-08', '1960-06-19', '1972-01-19', '1952-05-12',
       '1949-01-12', '1975-12-06', '1986-07-13', '1980-09-19',
       '1983-06-22', '1980-03-08', '1999-01-25', '1946-12-12', '48-03-24',
       '1992-05-04', '1935-06-18', '1958-12-15', '1944-05-23',
       '1985-10-07', '1997-02-16', '1935-07-04', '1929-12-02',
       '1951-07-26', '1997-09-12', '1987-05-17', '91-03-20', '1950-05-12',
       '1958-03-01', '1953-05-06', '1943-01-31', '1982-06-15',
       '1953-12-13', '1939-08-27', '48-03-13', '1965-09-08', '1952-12-25',
       '1997-09-30', '1997-09-06', '1976-10-30', '1944-03-01',
       '1952-04-11', '1979-09-07', '1970-06-11', '1949-07-29',
       '1982-06-22', '1968-05-09', '1972-06-17', '92-12-01', '1999-08-23',
       '1953-04-19', '1990-08-18', '1965-11-07', '1985-12-

## One hot encoding

In [24]:
#Check what number for "class" represents which diagnosis
df_med[df_med['class']=='negative']['name'].head(1)

0    Aaron Johansen
Name: name, dtype: object

In [25]:
df_med[df_med['class']=='increased binding protein']['name'].head(1)

1    Abigail Martinez
Name: name, dtype: object

In [26]:
df_med[df_med['class']=='decreased binding protein']['name'].head(1)

483    Daniel Horner
Name: name, dtype: object

In [27]:
#Selecting columns that will be one-hot encoded
Labeled_colums = ["pregnant", "psych", "I131 treatment", "goitre", 
                 "tumor", "thyroid surgery", "class", "sex", "on thyroxine",
                 "query hyperthyroid", "query hypothyroid"]

a=pd.DataFrame()

#Label encoding for all columns
for col in Labeled_colums:
    lbl_encoder = LabelEncoder()    
    
    df_med[col] = df_med[col].apply(lambda x: str(x))
    df_med_valid[col] = df_med_valid[col].apply(lambda x: str(x))
    df_mean[col] = df_mean[col].apply(lambda x: str(x))
    df_mean_valid[col] = df_mean_valid[col].apply(lambda x: str(x))
    
    df_med[col + "_en"] = lbl_encoder.fit_transform(df_med[col])
    a[col+"_en"] = lbl_encoder.fit_transform(df_med[col])
    df_med_valid[col + "_en"] = lbl_encoder.fit_transform(df_med_valid[col])
    df_mean[col + "_en"] = lbl_encoder.fit_transform(df_mean[col])
    df_mean_valid[col + "_en"] = lbl_encoder.fit_transform(df_mean_valid[col])
    
    print(col + " encoded!")

pregnant encoded!
psych encoded!
I131 treatment encoded!
goitre encoded!
tumor encoded!
thyroid surgery encoded!
class encoded!
sex encoded!
on thyroxine encoded!
query hyperthyroid encoded!
query hypothyroid encoded!


In [28]:
#After column label encoding we

Labeled_colums = ["pregnant", "psych", "I131 treatment", "goitre", 
                 "tumor", "thyroid surgery", "sex", "on thyroxine",
                 "query hyperthyroid", "query hypothyroid"]

for col in Labeled_colums:
    ohe = OneHotEncoder()
    
    Xcol = ohe.fit_transform(df_mean[col].values.reshape(-1,1)).toarray()
    dfOneHot = pd.DataFrame(Xcol, columns = [str(col)+str(int(i)) for i in range(Xcol.shape[1])])
    pd.concat([df_mean, dfOneHot], axis=1)

In [29]:
#Check encoded values for "class" column
df_mean[df_mean['name']=='Aaron Johansen'][['class','class_en']]

Unnamed: 0,class,class_en
0,negative,2


In [30]:
df_mean[df_mean['name']=='Abigail Martinez'][['class','class_en']]

Unnamed: 0,class,class_en
1,increased binding protein,1


In [31]:
df_mean[df_mean['name']=='Daniel Horner'][['class','class_en']]

Unnamed: 0,class,class_en
483,decreased binding protein,0


## Here is the encoded data, so we can build a model 

In [32]:
df_mean.head()

Unnamed: 0,name,address,age,sex,date_of_birth,query hyperthyroid,FTI measured,education,lithium,TT4,...,psych_en,I131 treatment_en,goitre_en,tumor_en,thyroid surgery_en,class_en,sex_en,on thyroxine_en,query hyperthyroid_en,query hypothyroid_en
0,Aaron Johansen,"744 Sandoval Causeway\nEast Robertburgh, NC 69901",59.0,M,1959-08-06 00:00:00,f,t,Assoc-acdm,f,118.0,...,0,0,0,0,0,2,2,0,0,0
1,Abigail Martinez,"59351 Craig Courts\nGordonbury, WI 53797",18.0,F,1899-12-09 00:00:00,f,t,Bachelors,f,143.0,...,0,0,0,0,0,1,1,0,0,0
2,Abraham Bruce,"137 Lewis Flat Suite 762\nWest Elizabeth, AL 3...",34.0,?,1884-02-13 00:00:00,f,t,HS-grad,f,95.0,...,0,0,0,1,0,2,0,0,0,0
3,Abraham Hicks,"031 Wood Wall Apt. 152\nVictorburgh, CA 40253",29.0,M,1889-10-04 00:00:00,f,t,Bachelors,f,135.0,...,0,0,0,0,0,2,2,0,0,0
4,Ada Jeffries,"66827 Ortiz Radial\nWest Justin, IL 04779",70.0,F,1948-04-25 00:00:00,f,t,Some-college,f,122.0,...,0,0,0,0,0,2,1,0,0,0


# II. Manual Prediction

According to medical compendium for thyroid diagnostics, parameters we ultimately need are: TSH, TT4, T4U and T3 hormones. T3U could also be useful, but unfortunately, we don't have T3U measurements in our data.

Here are the rules for categorizing:
> 1. If **TSH <0.4**:  
>> Check the level of T3 and T4. If **T3 >2.8** _nanomole/liter_ OR/AND **TT4 >140** _nanomole/liter_ OR/AND **T4U >65** _nanomole/liter_, then the diagnosis is highly close to **hyperthyroidism**.  
> 2. If **TSH >4.0**:  
>> Check the level of T4. If **TT4 <60** _nanomole/liter_ OR/AND **T4U <20** _nanomole/liter_, then the diagnosis is highly close to **hypothyroidism**.  

Normal values for the hormones are:

> **TSH**: 0.4-4.0 nanomole/liter  
> **TT4**: 60-140 nanomole/liter  
> **T4U**: 20-65 nanomole/liter  
> **T3**: 0.9-2.8 nanomole/liter  

In order to avoid false positive predictions due to very strict bounds, let's give a 5-15% deviation for each parameter.

In [55]:
def manual_predict(dframe=None):
    decreased=[]
    increased=[]
    predicted=[]
    for index, row in dframe.iterrows():
        if row['TSH'] < 0.4:
            if row['TT4'] > 140 and row['T4U'] > 0.65 or row['T3'] > 2.8:
                #increased.append(index)
                predicted.append("increased binding protein")
            else:
                predicted.append("negative")
        elif row['TSH'] > 4:
            if row['TT4'] < 60 and row['T4U'] < 0.2 or row['T3'] < 0.9:
                #decreased.append(index)
                predicted.append("decreased binding protein")
            else:
                predicted.append("negative")
        else:
            predicted.append("negative")
    print(len(predicted),len(dframe))
    #x=0
    #y=0
    #for i in range(len(dframe)):
    #    if x<len(decreased) and i==decreased[x]:
    #        dframe['class_predicted_manual'].iloc[i]="decreased binding protein"
    #        x+=1
    #    elif y<len(increased) and i==increased[y]:
    #        dframe['class_predicted_manual'].iloc[i]="increased binding protein"
    #        y+=1
    return predicted

### Counting Accuracy, Precision and Recall:

In [59]:
def apr_manual(dfr=None):
    dframe=None
    dframe=dfr
    true_positive=0
    true_negative=0
    false_positive=0
    false_negative=0
    accuracy=0
    precision=0
    recall=0
    for index, row in dframe.iterrows():
        if row['class']==row['class_predicted_manual']:
            if row['class']=="negative":
                true_negative+=1
            else:
                true_positive+=1
        else:
            if row['class']=="negative" and (row['class_predicted_manual']=="increased binding protein" 
                                             or row['class_predicted_manual']=="decreased binding protein"):
                false_positive+=1
            elif row['class_predicted_manual']=="negative" and (row['class']=="increased binding protein" 
                                             or row['class']=="decreased binding protein"):
                false_negative+=1
    
    accuracy = (true_positive+true_negative)/(true_positive+true_negative+false_positive+false_negative)
    precision = true_positive/(true_positive+false_positive)
    recall = true_positive/(true_positive+false_negative)
    
    print("Accuracy:",accuracy,"Precision:",precision,"Recall:",recall)

In [64]:
pred_manual=manual_predict(df_med)
df_med['class_predicted_manual']=pred_manual
apr_manual(df_med)

2237 2237
Accuracy: 0.8815377738042021 Precision: 0.14666666666666667 Recall: 0.3113207547169811


In [65]:
pred_manual=manual_predict(df_mean)
df_mean['class_predicted_manual']=pred_manual
apr_manual(df_mean)

2237 2237
Accuracy: 0.8815377738042021 Precision: 0.14666666666666667 Recall: 0.3113207547169811


# III. Decision Tree model

In [77]:
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold
df_target=df_med.copy()

features = ["TSH","TT4","T4U","T3","pregnant_en", "psych_en", "I131 treatment_en", "goitre_en", 
                 "tumor_en", "thyroid surgery_en", "sex_en", "on thyroxine_en",
                 "query hyperthyroid_en", "query hypothyroid_en"]

kf = StratifiedKFold(n_splits=10)
fold_count = 0
score=[]
for train, test in kf.split(df_target[features],df_target['class']):

    print("Processing fold %s" % fold_count)
    train_fold = df_target.iloc[train]
    test_fold = df_target.iloc[test]
    
#    # find best features
#    corr = train_fold.corr()['class'][train_fold.corr()['class'] < 1].abs()
#    corr.sort(ascending=False)
#    features = corr.index[[0,1]].values
    
    # Get training examples
    train_fold_input = train_fold[features].values
    train_fold_output = train_fold['class']
    
    # Fit logistic regression
    tree = DecisionTreeClassifier(criterion="entropy")
    tree.fit(train_fold_input, train_fold_output)
    
    # Check MSE on test set
    pred = tree.predict(test_fold[features])
    score.append(classification_report(test_fold['class'], pred, target_names=test_fold['class'].unique()))
    
    # Done with the fold
    fold_count += 1

print(DataFrame(score))

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


Processing fold 0
Processing fold 1
Processing fold 2
Processing fold 3
Processing fold 4
Processing fold 5
Processing fold 6
Processing fold 7
Processing fold 8
Processing fold 9


ValueError: Number of classes, 3, does not match size of target_names, 2. Try specifying the labels parameter

In [38]:
#Here we train our decision tree model
Tree = DecisionTreeClassifier(criterion="entropy")
#df[pd.notnull(df['class'])]
df_med_train=df_med.copy()
df_med_valid =df_med_valid.copy()
df_mean_train =df_mean.copy()
df_mean_valid =df_mean_valid.copy()

In [40]:
#Here we build our Tree using encoded columns
df_med_train["class_predicted_tree"]='negative'
df_mean_train["class_predicted_tree"]='negative'
Labeled_colums = ["TSH","TT4","T4U","T3","pregnant_en", "psych_en", "I131 treatment_en", "goitre_en", 
                 "tumor_en", "thyroid surgery_en", "sex_en", "on thyroxine_en",
                 "query hyperthyroid_en", "query hypothyroid_en"]
Predicted_col = ["class"] #This is the column we want to predict- does not have to be encoded (One-hot encoding)
X_med = df_med_train[Labeled_colums]
Y_med = df_med_train[Predicted_col]
X_mean = df_mean_train[Labeled_colums]
Y_mean = df_mean_train[Predicted_col]
Tree.fit(X=X_med,y=Y_med) #Here we train our model
#Tree.fit(X=X_mean,y=Y_mean)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [41]:
df_med_train[df_med_train.columns[22:]].head()

Unnamed: 0,class,marital-status,goitre,native-country,hypopituitary,medical_info,on antithyroid medication,referral source,education-num,occupation,...,goitre_en,tumor_en,thyroid surgery_en,class_en,sex_en,on thyroxine_en,query hyperthyroid_en,query hypothyroid_en,class_predicted_manual,class_predicted_tree
0,negative,Married-civ-spouse,f,United-States,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,SVHC,12.0,Craft-repair,...,0,0,0,2,2,0,0,0,negative,negative
1,increased binding protein,Married-civ-spouse,f,United-States,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,other,13.0,Exec-managerial,...,0,0,0,1,1,0,0,0,negative,negative
2,negative,Never-married,f,United-States,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,STMW,9.0,Sales,...,0,1,0,2,0,0,0,0,negative,negative
3,negative,Married-civ-spouse,f,United-States,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,other,13.0,Adm-clerical,...,0,0,0,2,2,0,0,0,negative,negative
4,negative,Divorced,f,United-States,f,"{'query hypothyroid':'f','T4U measured':'t','p...",f,SVI,10.0,Other-service,...,0,0,0,2,1,0,0,0,negative,negative


In [42]:
prdtree=Tree.predict(df_med_valid[Labeled_colums])

In [46]:
df_med_valid['class_predicted_tree_en']=prdtree
##df_med_valid['class_predicted_tree']=lbl_encoder.inverse_transform(df_med_valid['class_predicted_tree_en'])
#for i in range(len(df_med)):
#    df_med['class_predicted_tree_en']

In [47]:
df_med_valid[df_med_valid['class']=='nan']

Unnamed: 0,name,address,age,sex,date_of_birth,query hyperthyroid,FTI measured,education,lithium,TT4,...,I131 treatment_en,goitre_en,tumor_en,thyroid surgery_en,class_en,sex_en,on thyroxine_en,query hyperthyroid_en,query hypothyroid_en,class_predicted_tree_en
278,Dorothy Mosier,Unit 4826 Box 7150\nDPO AA 16892,29,F,89-03-12,f,t,HS-grad,f,105.939787,...,0,0,0,0,2,1,0,0,0,negative
334,Evelyn Pittman,"9397 Rice Island Suite 974\nMartinmouth, GA 24315",54,F,1963-11-17,f,t,HS-grad,f,60.0,...,0,1,0,0,2,1,0,0,0,negative
404,Helen Leblanc,"322 Brown Springs\nRobertstad, ND 59568",37,F,1981-05-10,f,t,Assoc-acdm,f,84.0,...,0,0,0,0,2,1,0,0,0,negative
442,James Shelton,"809 Kristin Neck\nLeslieview, WY 39222",34,M,1984-06-07,f,t,Some-college,,96.0,...,0,0,0,0,2,2,0,0,0,negative
683,Marvin Gaudio,"09527 Flynn Ways\nMartinburgh, OK 06507",23,M,1995-06-06,,t,HS-grad,f,102.0,...,0,0,0,0,2,2,0,1,0,negative
688,Mary Campbell,"116 Breanna Vista\nMahoneyberg, NJ 47403",69,F,49-04-27,f,t,Some-college,f,105.939787,...,0,1,0,0,2,1,0,0,0,negative
818,Regina Loftin,Unit 4047 Box 0400\nDPO AE 56453,68,F,1950-06-22,f,t,Some-college,f,105.939787,...,0,0,0,0,2,1,0,0,0,negative
883,Santos Young,"7397 Angel Union\nMarkberg, AK 34042",50,F,1968-04-29,,t,Bachelors,f,25.0,...,0,0,0,0,2,1,0,1,0,negative


In [None]:
df_med_valid[['name','class','class_en','class_predicted_tree_en']].head(50)

In [None]:
# Graph vizualization
from sklearn.tree import export_graphviz
from graphviz import Source
from IPython.display import SVG

graph = Source(export_graphviz(cls, 
                               out_file=None,
                               feature_names=encoded.columns,
                               class_names=['no', 'yes'],
                               filled = True))

display(SVG(graph.pipe(format='svg')))

from IPython.display import HTML
style = "<style>svg{width:70% !important;height:70% !important;}</style>"
HTML(style)

In [None]:
def apr_automated(dframe):
    true_positive=0
    true_negative=0
    false_positive=0
    false_negative=0
    accuracy=0
    precision=0
    recall=0
    for index, row in dframe.iterrows():
        if row['class']==row['class_predicted_tree']:
            if row['class']=="negative":
                true_negative+=1
            else:
                true_positive+=1
        else:
            if row['class']=="negative" and (row['class_predicted_tree']=="increased binding protein" 
                                             or row['class_predicted_tree']=="decreased binding protein"):
                false_positive+=1
            elif row['class_predicted_tree']=="negative" and (row['class']=="increased binding protein" 
                                             or row['class']=="decreased binding protein"):
                false_negative+=1
    
    accuracy = (true_positive+true_negative)/(true_positive+true_negative+false_positive+false_negative)
    precision = true_positive/(true_positive+false_positive)
    recall = true_positive/(true_positive+false_negative)
    
    print("Accuracy:",accuracy,"Precision:",precision,"Recall:",recall)

# Conclusion

During the final phase we created a pipeline to clean data from all future datasets by using our own functions, stored in separate script file and in this IPYNB. After that, we created two methods to predict values in possible future datasets:
> 1) By manual classification.  
> 2) By decision tree predictor row by row.  

Also we created a method to calculate Accuracy, Prediction and Recall values for each type of predictionm which may be used upon datasets to control the process: apr_manual and apr_automated.