## Predicting Chronic Kidney Disease in Patients
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus on exploring the data, building models, and evaluating the models we build.

There are three links you may find important:
- [A set of chronic kidney disease (CKD) data and other biological factors](./chronic_kidney_disease_full.csv).
- [The CKD data dictionary](./chronic_kidney_disease_header.txt).
- [An article comparing the use of k-nearest neighbors and support vector machines on predicting CKD](./chronic_kidney_disease.pdf).

In [143]:
# Note: impute some columns, plus set a threshold for dropping columns with too many missing values, e.g. >25% of the rows missing?

## Step 1: Define the problem.

Suppose you're working for Mayo Clinic, widely recognized to be the top hospital in the United States. In your work, you've overheard nurses and doctors discuss test results, then arrive at a conclusion as to whether or not someone has developed a particular disease or condition. For example, you might overhear something like:

> **Nurse**: Male 57 year-old patient presents with severe chest pain. FDP _(short for fibrin degradation product)_ was elevated at 13. We did an echo _(echocardiogram)_ and it was inconclusive.

> **Doctor**: What was his interarm BP? _(blood pressure)_

> **Nurse**: Systolic was 140 on the right; 110 on the left.

> **Doctor**: It's an aortic dissection! Get to the OR _(operating room)_ now!

> _(intense music playing)_

In this fictitious scenario, you might imagine the doctor going through a series of steps like a [flowchart](https://en.wikipedia.org/wiki/Flowchart), or a series of if-this-then-that steps to diagnose a patient. The first steps made the doctor ask what the interarm blood pressure was. Because interarm blood pressure took on the values it took on, the doctor diagnosed the patient with an aortic dissection.

Your goal, as a research biostatistical data scientist at the nation's top hospital, is to develop a medical test that can improve upon our current diagnosis system for [chronic kidney disease (CKD)](https://www.mayoclinic.org/diseases-conditions/chronic-kidney-disease/symptoms-causes/syc-20354521).

**Real-world problem**: Develop a medical diagnosis test that is better than our current diagnosis system for CKD.

**Data science problem**: Develop a medical diagnosis test that reduces both the number of false positives and the number of false negatives.

---

## Step 2: Obtain the data.

### 1. Read in the data.

In [146]:
#imports
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [147]:
df = pd.read_csv('chronic_kidney_disease_full.csv')

### 2. Check out the data dictionary. What are a few features or relationships you might be interested in checking out?

In [149]:
# Answer here: I would check every column against "class", which is either ckd or notckd.
# ckd means chronic kidney disease; notckd means no chronic kidney disease.

""" The attributes in this KFT dataset are age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, bacteria, 
blood glucose random, blood urea, serum creatinine, 
sodium, potassium, haemoglobin, packed cell volume, white blood cell count, red blood cell count, 
hypertension, diabetes mellitus, coronary artery disease, appetite, pedal edema, anaemia, and class. """ 

df.columns

Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'class'],
      dtype='object')

---

## Step 3: Explore the data.

### 3. How much of the data is missing from each column?

In [151]:
# Answer here:
df.isnull().sum() #all columns except 'class' have nulls

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64

In [152]:
df.shape

(400, 25)

In [153]:
df['rbc'].unique() #152/400 is a lot of data lost. 

array([nan, 'normal', 'abnormal'], dtype=object)

In [154]:
df['rbcc'].unique() #131/400 is a lot of data lost.

array([5.2, nan, 3.9, 4.6, 4.4, 5. , 4. , 3.7, 3.8, 3.4, 2.6, 2.8, 4.3,
       3.2, 3.6, 4.1, 4.9, 2.5, 4.2, 4.5, 3.1, 4.7, 3.5, 6. , 2.1, 5.6,
       2.3, 2.9, 2.7, 8. , 3.3, 3. , 2.4, 4.8, 5.4, 6.1, 6.2, 6.3, 5.1,
       5.8, 5.5, 5.3, 6.4, 5.7, 5.9, 6.5])
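
One way to act on these counts is to express them as a fraction of rows and compare each column against a cutoff. The sketch below runs on a small made-up frame standing in for `df`; the 25% cutoff and the names `toy`, `missing_threshold`, `cols_to_drop` and `cols_to_impute` are assumptions for illustration, not part of the lab.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical values); swap in the real dataframe.
toy = pd.DataFrame({
    "age": [48, 7, np.nan, 48, 51, 60, 68, 24],
    "rbc": [np.nan, np.nan, "normal", "normal", np.nan, np.nan, "normal", np.nan],
    "class": ["ckd"] * 5 + ["notckd"] * 3,
})

missing_threshold = 0.25  # assumed cutoff, tune to taste

# Fraction of missing values per column, largest first
missing_frac = toy.isnull().mean().sort_values(ascending=False)

# Columns above the cutoff are candidates to drop; the rest are candidates to impute
cols_to_drop = missing_frac[missing_frac > missing_threshold].index.tolist()
cols_to_impute = missing_frac[
    (missing_frac > 0) & (missing_frac <= missing_threshold)
].index.tolist()

print(missing_frac)
print("drop:", cols_to_drop, "impute:", cols_to_impute)
```

Applied to the counts above, `rbc` (152/400 = 38%), `rbcc` (131/400 = 32.75%) and `wbcc` (106/400 = 26.5%) would exceed a 25% cutoff.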

### 4. Suppose that I dropped every row that contained at least one missing value. (In the context of analysis with missing data, we call this a "complete case analysis," because we keep only the complete cases!) How many rows would remain in our dataframe? What are at least two downsides to doing this?

In [156]:
# Answer here:

# If every row with at least one missing value were dropped, a large share of the data (60.5%) would be lost.
rows_with_nans = df.isna().any(axis=1).sum()  # rows containing at least one NaN
rows_remaining = len(df) - rows_with_nans     # rows that survive a complete case analysis
percent_data_loss = (rows_with_nans / len(df)) * 100
print(f'The remaining rows are {rows_remaining} and the percentage of data lost is {percent_data_loss}%.')

The remaining rows are 158 and the percentage of data lost is 60.5%.


In [157]:
# Downsides to doing this:
#1. A large percentage of the data is lost, which reduces the model's ability to capture real patterns and lowers its performance.
#2. The remaining rows may not be representative: if missingness is related to the class or to patient characteristics, dropping rows can shift the class balance and introduce bias.
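
Both downsides can be checked directly. The sketch below, on a small made-up frame standing in for `df`, counts the rows that survive `dropna()` and compares the class mix before and after; the names `toy`, `complete` and `balance` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical values); swap in the real dataframe.
toy = pd.DataFrame({
    "sc": [1.2, np.nan, 1.8, np.nan, 1.4, 0.9, 1.0, np.nan],
    "htn": ["yes", "no", None, "yes", "no", "no", "no", "no"],
    "class": ["ckd", "ckd", "ckd", "ckd", "ckd", "notckd", "notckd", "notckd"],
})

complete = toy.dropna()  # complete case analysis: keep rows with no missing values

rows_remaining = len(complete)
pct_lost = 100 * (1 - rows_remaining / len(toy))

# Compare the class mix before and after dropping rows
balance = pd.DataFrame({
    "before": toy["class"].value_counts(normalize=True),
    "after": complete["class"].value_counts(normalize=True),
})

print(rows_remaining, pct_lost)
print(balance)
```

If the `after` column differs noticeably from `before`, the complete cases are not a representative sample of the original patients.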

### 5. Thinking critically about how our data were gathered, it's likely that these records were gathered by doctors and nurses. Brainstorm three potential areas (in addition to the missing data we've already discussed) where this data might be inaccurate or imprecise.

In [159]:
# Answer here:

#1. Contaminated samples: the sample was compromised by other substances, so the measured count was invalid.

#2. Insufficient sample: e.g. not enough blood could be drawn from the patient (for instance due to old age) to run all 22 tests.

#3. Incomplete testing: e.g. the patient changed hospitals for treatment and left partway through the testing.

---

## Step 4: Model the data.

### 6. Suppose that I want to construct a model where no person who has chronic kidney disease (CKD) will ever be told that they do not have CKD. What (very simple, no machine learning needed) model can I create that will never tell a person with CKD that they do not have CKD?

> Hint: Don't think about `statsmodels` or `scikit-learn` here.

In [161]:
# Answer here:
# Predict that everyone has CKD. With no negative predictions, false negatives are 0, so no person with CKD is ever told they do not have it.


### 7. In problem 6, what common classification metric did we optimize for? Did we minimize false positives or negatives?

In [163]:
# Answer here: CKD = 1, no-CKD = 0

# We optimized for sensitivity (recall): by predicting CKD for everyone, we minimized false negatives (to zero).


### 8. Thinking ethically, what is at least one disadvantage to the model you described in problem 6?

In [165]:
# Answer here:
# Because the model tells everyone they have CKD, every healthy person is a false positive: they face unnecessary stress, follow-up tests and possibly treatment, and the hospital spends resources on patients who do not need them.


### 9. Suppose that I want to construct a model where a person who does not have CKD will never be told that they do have CKD. What (very simple, no machine learning needed) model can I create that will accomplish this?

In [167]:
# Answer here: 
# Predict that nobody has CKD. With no positive predictions, false positives are 0, so no healthy person is ever told they have CKD.

### 10. In problem 9, what common classification metric did we optimize for? Did we minimize false positives or negatives?

In [169]:
# Answer here: we optimized for specificity; we minimized false positives (to zero).
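
The two trivial models from problems 6 and 9 can be compared side by side. This is a minimal sketch using scikit-learn's `DummyClassifier` with a constant prediction; the labels reuse the class counts from the data dictionary (250 `ckd`, 150 `notckd`), and the placeholder feature matrix is an assumption because a constant model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Class counts from the data dictionary: 250 ckd, 150 notckd
y = np.array(["ckd"] * 250 + ["notckd"] * 150)
X = np.zeros((len(y), 1))  # placeholder features; a constant model never looks at them

results = {}
for label in ["ckd", "notckd"]:
    # Always predict the same label, regardless of the features
    model = DummyClassifier(strategy="constant", constant=label).fit(X, y)
    preds = model.predict(X)
    results[label] = {
        # sensitivity: share of true ckd patients predicted ckd
        "sensitivity": recall_score(y, preds, pos_label="ckd"),
        # specificity: share of true notckd patients predicted notckd
        "specificity": recall_score(y, preds, pos_label="notckd"),
    }

print(results)
```

Always predicting `ckd` gives perfect sensitivity and zero specificity; always predicting `notckd` gives the reverse. Neither model is useful on its own, which is why both error types matter.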

### 11. Thinking ethically, what is at least one disadvantage to the model you described in problem 9?

In [171]:
# Answer here: 
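# Every person with CKD is told they are healthy (a false negative), so they miss timely treatment and have no reason to change habits that could slow the disease.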

### 12. Construct a logistic regression model in `sklearn` predicting class from the other variables. You may scale, select/drop, and engineer features as you wish - build a good model! Make sure, however, that you include at least one categorical/dummy feature and at least one quantitative feature.

> Hint: Remember to do a train/test split!

In [173]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Import Standard Scaler
from sklearn.preprocessing import StandardScaler

# Import Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Import metrics 
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay,\
accuracy_score, roc_curve,  RocCurveDisplay, roc_auc_score, recall_score, \
precision_score, f1_score, classification_report

import warnings
# Suppress FutureWarning messages
warnings.simplefilter(action='ignore', category=FutureWarning)

In [174]:
df.isnull().sum().sort_values(ascending=False)

rbc      152
rbcc     131
wbcc     106
pot       88
sod       87
pcv       71
pc        65
hemo      52
su        49
sg        47
al        46
bgr       44
bu        19
sc        17
bp        12
age        9
ba         4
pcc        4
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64

In [175]:
"""['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'class']"""

"['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',\n       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad',\n       'appet', 'pe', 'ane', 'class']"

## Data dictionary reminder
Relevant Information:
		age		-	age	
		bp		-	blood pressure
		sg		-	specific gravity
		al		-   	albumin
		su		-	sugar
		rbc		-	red blood cells
		pc		-	pus cell
		pcc		-	pus cell clumps
		ba		-	bacteria
		bgr		-	blood glucose random
		bu		-	blood urea
		sc		-	serum creatinine
		sod		-	sodium
		pot		-	potassium
		hemo		-	hemoglobin
		pcv		-	packed cell volume
		wc		-	white blood cell count
		rc		-	red blood cell count
		htn		-	hypertension
		dm		-	diabetes mellitus
		cad		-	coronary artery disease
		appet		-	appetite
		pe		-	pedal edema
		ane		-	anemia
		class		-	class	
4.Number of Instances:  400 (250 CKD, 150 notckd)
5.Number of Attributes: 24 + class = 25(11 numeric,14 nominal) 
6.Attribute Information :
	1.Age(numerical)
 	  	age in years
	2.Blood Pressure(numerical)
      	bp in mm/Hg
	3.Specific Gravity(nominal)
  	sg - (1.005,1.010,1.015,1.020,1.025)
	4.Albumin(nominal)
	al - (0,1,2,3,4,5)
	5.Sugar(nominal)
	su - (0,1,2,3,4,5)
	6.Red Blood Cells(nominal)
	rbc - (normal,abnormal)
	7.Pus Cell (nominal)
	pc - (normal,abnormal)
	8.Pus Cell clumps(nominal)
	pcc - (present,notpresent)
	9.Bacteria(nominal)
	ba  - (present,notpresent)
	10.Blood Glucose Random(numerical)		
	bgr in mgs/dl
	11.Blood Urea(numerical)	
	bu in mgs/dl
	12.Serum Creatinine(numerical)	
	sc in mgs/dl
	13.Sodium(numerical)
	sod in mEq/L
	14.Potassium(numerical)	
	pot in mEq/L
	15.Hemoglobin(numerical)
	hemo in gms
	16.Packed  Cell Volume(numerical)
	17.White Blood Cell Count(numerical)
	wc in cells/cumm
	18.Red Blood Cell Count(numerical)	
	rc in millions/cmm
	19.Hypertension(nominal)	
	htn - (yes,no)
	20.Diabetes Mellitus(nominal)	
	dm - (yes,no)
	21.Coronary Artery Disease(nominal)
	cad - (yes,no)
	22.Appetite(nominal)	
	appet	 - (good,poor)
	23.Pedal Edema(nominal)
	pe - (yes,no)	
	24.Anemia(nominal)
	ane	- (yes,no)
	25.Class (nominal)		
	class	 - (ckd,notckd)
7. Missing Attribute Values: Yes
8. Class Distribution: ( 2 classes)
   		Class 	  Number of instances
   		ckd          	  250
   		notckd       	  150

In [177]:
# I decided to impute the column mean rather than drop rows, since dropping would lose too many rows of data.

In [178]:
df['al'].isnull().sum()

46

In [179]:
df['al'].mean()

1.0169491525423728

In [180]:
# Assign the result back; chained inplace fillna is deprecated in recent pandas
df['al'] = df['al'].fillna(df['al'].mean())

In [181]:
df['al'].unique()

array([1.        , 4.        , 2.        , 3.        , 0.        ,
       1.01694915, 5.        ])
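
The mean produced the value 1.01694915, which is not one of the albumin grades (0 to 5) listed in the data dictionary, where `al` is nominal. A possible alternative is to impute the median or the most frequent value. The sketch below uses scikit-learn's `SimpleImputer` on made-up grades; the toy values and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy albumin grades (hypothetical values) on the dictionary's 0-5 scale
al = pd.DataFrame({"al": [1, 4, 2, 4, 2, 0, 0, np.nan, 0, np.nan, 3]})

median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a 2-D array; ravel() flattens it to one value per row
al_median = median_imputer.fit_transform(al).ravel()
al_mode = mode_imputer.fit_transform(al).ravel()

print("median fill:", al_median[7], "mode fill:", al_mode[7])
```

The median can still fall between two grades when the number of observed values is even, so the most frequent value is the safer choice for a nominal column. In a full workflow the imputer would be fit on the training split only and then applied to the test split.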

In [182]:
df['sc'].isnull().sum()

17

In [183]:
df['sc'].unique()

array([ 1.2 ,  0.8 ,  1.8 ,  3.8 ,  1.4 ,  1.1 , 24.  ,  1.9 ,  7.2 ,
        4.  ,  2.7 ,  2.1 ,  4.6 ,  4.1 ,  9.6 ,  2.2 ,  5.2 ,  1.3 ,
        1.6 ,  3.9 , 76.  ,  7.7 ,   nan,  2.4 ,  7.3 ,  1.5 ,  2.5 ,
        2.  ,  3.4 ,  0.7 ,  1.  , 10.8 ,  6.3 ,  5.9 ,  0.9 ,  3.  ,
        3.25,  9.7 ,  6.4 ,  3.2 , 32.  ,  0.6 ,  6.1 ,  3.3 ,  6.7 ,
        8.5 ,  2.8 , 15.  ,  2.9 ,  1.7 ,  3.6 ,  5.6 ,  6.5 ,  4.4 ,
       10.2 , 11.5 ,  0.5 , 12.2 ,  5.3 ,  9.2 , 13.8 , 16.9 ,  6.  ,
        7.1 , 18.  ,  2.3 , 13.  , 48.1 , 14.2 , 16.4 ,  2.6 ,  7.5 ,
        4.3 , 18.1 , 11.8 ,  9.3 ,  6.8 , 13.5 , 12.8 , 11.9 , 12.  ,
       13.4 , 15.2 , 13.3 ,  0.4 ])

In [184]:
df['bu'] = df['bu'].fillna(df['bu'].mean())

In [185]:
df['bu'].isnull().sum()

0

In [186]:
df['sc'] = df['sc'].fillna(df['sc'].mean())

In [187]:
df['bu'].unique()

array([ 36.        ,  18.        ,  53.        ,  56.        ,
        26.        ,  25.        ,  54.        ,  31.        ,
        60.        , 107.        ,  55.        ,  72.        ,
        86.        ,  90.        , 162.        ,  46.        ,
        87.        ,  27.        , 148.        , 180.        ,
       163.        ,  57.42572178,  50.        ,  75.        ,
        45.        ,  28.        , 155.        ,  33.        ,
        39.        , 153.        ,  29.        ,  65.        ,
       103.        ,  70.        ,  80.        ,  20.        ,
       202.        ,  77.        ,  89.        ,  24.        ,
        17.        ,  32.        , 114.        ,  66.        ,
        38.        , 164.        , 142.        ,  96.        ,
       391.        ,  15.        , 111.        ,  73.        ,
        19.        ,  92.        ,  35.        ,  16.        ,
       139.        ,  48.        ,  85.        ,  98.        ,
       186.        ,  37.        ,  47.        ,  52.  

In [188]:
df[['al', 'sc', 'bu']]

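| | al | sc | bu |
|---|---|---|---|
| 0 | 1.0 | 1.2 | 36.0 |
| 1 | 4.0 | 0.8 | 18.0 |
| 2 | 2.0 | 1.8 | 53.0 |
| 3 | 4.0 | 3.8 | 56.0 |
| 4 | 2.0 | 1.4 | 26.0 |
| ... | ... | ... | ... |
| 395 | 0.0 | 0.5 | 49.0 |
| 396 | 0.0 | 1.2 | 31.0 |
| 397 | 0.0 | 0.6 | 26.0 |
| 398 | 0.0 | 1.0 | 50.0 |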


In [189]:
# For the categorical (yes/no) variables, I will impute NaNs as 'no'.

In [190]:
df['dm'].unique()

array(['yes', 'no', nan], dtype=object)

In [191]:
df['htn'].unique()

array(['yes', 'no', nan], dtype=object)

In [192]:
df['cad'].unique()

array(['no', 'yes', nan], dtype=object)

In [193]:
df['dm'].isnull().sum()

2

In [194]:
df['htn'].isnull().sum()

2

In [195]:
df['cad'].isnull().sum()

2

In [196]:
df['dm'] = df['dm'].fillna('no')

In [197]:
df['htn'] = df['htn'].fillna('no')

In [198]:
df['cad'] = df['cad'].fillna('no')

In [199]:
df['dm'].isnull().sum()

0

In [200]:
df['htn'].isnull().sum()

0

In [201]:
df['cad'].isnull().sum()

0

In [202]:
#create feature matrix X 
X_features = df[['al', 'sc', 'bu']]

In [203]:
# Get dummies for 'dm', 'htn', 'cad'
X_dummy = pd.get_dummies(df, columns=['dm', 'htn', 'cad'], drop_first=True)


In [204]:
X_features.head()

| | al | sc | bu |
|---|---|---|---|
| 0 | 1.0 | 1.2 | 36.0 |
| 1 | 4.0 | 0.8 | 18.0 |
| 2 | 2.0 | 1.8 | 53.0 |
| 3 | 4.0 | 3.8 | 56.0 |
| 4 | 2.0 | 1.4 | 26.0 |


In [205]:
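# Preview the numeric features side by side with the dummy-encoded frame
X = pd.concat([X_features, X_dummy], axis=1)
X.head()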

Unnamed: 0,al,sc,bu,age,bp,sg,al.1,su,rbc,pc,...,pcv,wbcc,rbcc,appet,pe,ane,class,dm_yes,htn_yes,cad_yes
0,1.0,1.2,36.0,48.0,80.0,1.02,1.0,0.0,,normal,...,44.0,7800.0,5.2,good,no,no,ckd,True,True,False
1,4.0,0.8,18.0,7.0,50.0,1.02,4.0,0.0,,normal,...,38.0,6000.0,,good,no,no,ckd,False,False,False
2,2.0,1.8,53.0,62.0,80.0,1.01,2.0,3.0,normal,normal,...,31.0,7500.0,,poor,no,yes,ckd,True,False,False
3,4.0,3.8,56.0,48.0,70.0,1.005,4.0,0.0,normal,abnormal,...,32.0,6700.0,3.9,poor,yes,yes,ckd,False,True,False
4,2.0,1.4,26.0,51.0,80.0,1.01,2.0,0.0,normal,normal,...,35.0,7300.0,4.6,good,no,no,ckd,False,False,False


In [209]:
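# Combine the quantitative features with the dummy-encoded categorical features,
# so the model includes at least one of each as problem 12 requires
X = pd.concat([X_features, X_dummy[['dm_yes', 'htn_yes', 'cad_yes']].astype(int)], axis=1)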

In [211]:
# Baseline: the class proportions in the 'class' column (ckd vs. notckd)
y = df['class'] 
mean_of_outcome = round(y.value_counts(normalize=True)*100, 2)
mean_of_outcome

class
ckd       62.5
notckd    37.5
Name: proportion, dtype: float64

In [213]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)

In [215]:
# Scale Features
sc = StandardScaler() # transformer

# Fit/transform
X_train_sc = sc.fit_transform(X_train)

# Transform
X_test_sc = sc.transform(X_test)
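
With the features split and scaled, the remaining step for problem 12 is to fit the model. The sketch below walks through split, scale and fit end to end on synthetic data, since the real CSV is not bundled here; the generated values and the `htn_yes` dummy are hypothetical stand-ins, and `logreg` is the name the later cells assume for the fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the imputed CKD frame (hypothetical values)
rng = np.random.default_rng(42)
n = 400
is_ckd = rng.random(n) < 0.625
toy = pd.DataFrame({
    "al": np.where(is_ckd, rng.integers(0, 6, n), 0).astype(float),
    "sc": np.where(is_ckd, rng.normal(4.0, 2.0, n), rng.normal(0.9, 0.2, n)),
    "bu": np.where(is_ckd, rng.normal(70, 30, n), rng.normal(33, 10, n)),
    "htn_yes": (is_ckd & (rng.random(n) < 0.6)).astype(int),
    "class": np.where(is_ckd, "ckd", "notckd"),
})

X = toy.drop(columns="class")
y = toy["class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_sc, y_train)

print("classes:", logreg.classes_)
print("train accuracy:", logreg.score(X_train_sc, y_train))
print("test accuracy:", logreg.score(X_test_sc, y_test))
```

Note that scikit-learn sorts string labels, so `logreg.classes_` is `['ckd', 'notckd']` and the model treats `notckd` as the positive class unless the target is recoded.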

---

## Step 5: Evaluate the model.

### 13. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your quantitative features.

In [218]:
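# Template: as xxxxx increases by one unit (one standard deviation, since the features were standardized), the log-odds of the model's positive class change by that feature's coefficient (coef_), holding the other features constant.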

### 14. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your categorical/dummy features.

In [221]:
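# Template: comparing a patient with xxxxx = yes to one with xxxxx = no, the log-odds of the model's positive class differ in proportion to that feature's coefficient (coef_), holding the other features constant.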
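
To fill in templates like these, it can help to tabulate each coefficient next to its odds ratio. The sketch below is a minimal example on synthetic data, with the target encoded as 1 = CKD so the coefficients describe the log-odds of CKD; the feature values, the simulated relationship and the names `model` and `summary` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
sc_raw = rng.normal(2.0, 1.0, n)  # hypothetical serum creatinine
htn = rng.integers(0, 2, n)       # hypothetical hypertension dummy (1 = yes)

# Simulated target, encoded 1 = ckd so coefficients describe the log-odds of CKD
logit = 1.5 * (sc_raw - 2.0) + 1.0 * htn - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pd.DataFrame({"sc": sc_raw, "htn_yes": htn})
X_sc = StandardScaler().fit_transform(X)

model = LogisticRegression().fit(X_sc, y)

# One row per feature: coefficient (log-odds) and its exponent (odds ratio)
summary = pd.DataFrame({
    "feature": X.columns,
    "coef_log_odds": model.coef_[0],
    "odds_ratio": np.exp(model.coef_[0]),
})
print(summary)
```

Because the features are standardized, each coefficient is the change in log-odds per one standard deviation of that feature, and the odds ratio is the multiplicative change in the odds. With the lab's string labels, the coefficients would instead describe the log-odds of `logreg.classes_[1]`.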

### 15. Despite being a relatively simple model, logistic regression is very widely used in the real world. Why do you think that's the case? Name at least two advantages to using logistic regression as a modeling technique.

Answer: Logistic regression has several advantages as a modeling technique:
- It is a classification algorithm that shares many properties with linear regression, so it is familiar and easy to explain.
- Its coefficients are interpretable. (They represent the change in log-odds associated with a one-unit change in each input variable.)
- It is very fast to fit and to generate predictions from.

### 16. Does it make sense to generate a confusion matrix on our training data or our test data? Why?

> Hint: Once you've generated your predicted $y$ values and you have your observed $y$ values, then it will be easy to [generate a confusion matrix using sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [None]:
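# It makes sense to generate the confusion matrix on the test data: the model has already seen the training data,
# so only held-out data shows how its false positives and false negatives will look on new patients.
# Assumes `logreg` is the fitted logistic regression from problem 12.

# Confusion matrix at the default 0.5 threshold, treating 'ckd' as the positive class
preds = logreg.predict(X_test_sc)
tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=['notckd', 'ckd']).ravel()
confusion_matrix(y_test, preds, labels=['notckd', 'ckd'])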

### 17. In this hospital case, we want to predict CKD. Do we want to optimize for sensitivity, specificity, or something else? Why? (If you don't think there's one clear answer, that's okay! There rarely is. Be sure to defend your conclusion!)

Answer:

### 18 (BONUS). Write a function that will create an ROC curve for you, then plot the ROC curve.

Here's a strategy you might consider:
1. In order to even begin, you'll need some fit model. Use your logistic regression model from problem 12.
2. We want to look at all values of your "threshold" - that is, anything where `.predict_proba()` gives you a probability above your threshold falls in the "positive class," and anything that is below your threshold falls in the "negative class." Start the threshold at 0.
3. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
4. Increment your threshold by some "step." Maybe set your step to be 0.01, or even smaller.
5. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
6. Repeat steps 4 and 5 until you get to the threshold of 1.
7. Plot the values of sensitivity and 1 - specificity.
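
The strategy above can be sketched as a small function. This is a hand-rolled illustration rather than the lab's reference solution: the name `roc_points`, the `>=` comparison and the toy labels and probabilities are assumptions, and it expects both classes to be present so that neither denominator is zero.

```python
import numpy as np

def roc_points(y_true, proba, step=0.01):
    """Sweep thresholds from 0 to 1 and return (thresholds, sensitivity, 1 - specificity)."""
    y_true = np.asarray(y_true).astype(bool)   # True marks the positive class
    proba = np.asarray(proba, dtype=float)     # predicted probability of the positive class
    thresholds = np.linspace(0.0, 1.0, int(round(1 / step)) + 1)
    tpr, fpr = [], []
    for t in thresholds:
        pred = proba >= t                      # positive when probability reaches the threshold
        tp = np.sum(pred & y_true)
        fn = np.sum(~pred & y_true)
        fp = np.sum(pred & ~y_true)
        tn = np.sum(~pred & ~y_true)
        tpr.append(tp / (tp + fn))             # sensitivity
        fpr.append(fp / (fp + tn))             # 1 - specificity
    return thresholds, np.array(tpr), np.array(fpr)

# Toy labels (1 = positive) and predicted probabilities, hypothetical values
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
proba = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])

thresholds, tpr, fpr = roc_points(y_true, proba)
print(tpr[50], fpr[50])  # sensitivity and 1 - specificity at threshold 0.5
```

Plotting `fpr` on the x-axis against `tpr` on the y-axis, for example with `plt.plot(fpr, tpr)`, draws the curve. With a fitted scikit-learn model, the probabilities would come from `predict_proba`, taking the column that matches the positive class in `classes_`.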

In [None]:
RocCurveDisplay.from_estimator(logreg, X_test_sc, y_test)
plt.plot([0,1], [0,1], label='baseline', linestyle= '--')
plt.legend();

### 19. Suppose you're speaking with the biostatistics lead at Mayo Clinic, who asks you "Why are unbalanced classes generally a problem? Are they a problem in this particular CKD analysis?" How would you respond?

Answer: Unbalanced classes are those where one class is much more common than the other. They make prediction harder because a model can score high accuracy by mostly predicting the majority class while performing poorly on the minority class.
Slight imbalances are fine, as in this CKD data (250 people with CKD vs. 150 without). When the imbalance is more severe, techniques such as these can reduce its impact:

1. Weighting observations.
2. Stratified cross-validation. If we use $k$-fold cross-validation entirely randomly, we may run into issues where some of our folds have no observations from the minority class. Stratifying is almost always a good idea.
3. Change the threshold for classification. By adjusting our classification threshold, we might find a better fit for our particular use-case.
4. Bias correction. Gary King wrote a whitepaper on this topic. This is a rigorous approach, and while it can provide good results, it's a bit of work.
5. Create synthetic data of the minority class.
6. Oversample the minority class.
7. Undersample the majority class.
8. Combine oversampling the minority and undersampling the majority classes.
9. Optimize for a specific metric.
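
Two of these techniques, weighting observations and stratified cross-validation, are built into scikit-learn. The sketch below compares minority-class recall with and without `class_weight="balanced"` on synthetic data; the dataset from `make_classification` and the roughly 10% minority share are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (hypothetical): roughly 10% minority class (label 1)
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

plain = LogisticRegression(max_iter=1000)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")

# Recall on the minority class for each model, one score per fold
plain_recall = cross_val_score(plain, X, y, cv=cv, scoring="recall")
weighted_recall = cross_val_score(weighted, X, y, cv=cv, scoring="recall")

print("plain:   ", plain_recall.mean())
print("weighted:", weighted_recall.mean())
```

Balanced weights typically raise minority-class recall at the cost of more false positives, so the choice depends on which error is more expensive.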

### 20. Suppose you're speaking with a doctor at Mayo Clinic who, despite being very smart, doesn't know much about data science or statistics. How would you explain why unbalanced classes are generally a problem to this doctor?

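Answer: When one group of patients vastly outnumbers the other, the model sees many more examples of the larger group, so it learns to recognize that group well and the smaller group poorly. In the extreme, it can look very accurate just by always guessing the larger group while missing nearly everyone in the smaller group. In this dataset the larger group is patients with CKD (250 of 400), so the risk is that the model becomes worse at recognizing healthy patients.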

### 21. Let's create very unbalanced classes just for the sake of this example! Generate very unbalanced classes by [bootstrapping](http://stattrek.com/statistics/dictionary.aspx?definition=sampling_with_replacement) (a.k.a. random sampling with replacement) the majority class.

1. The majority class are those individuals with CKD.
2. Generate a random sample of size 200,000 of individuals who have CKD **with replacement**. (Consider setting a random seed for this part!)
3. Create a new dataframe with the original data plus this random sample of data.
4. Now we should have a dataset with around 200,000 observations, of which only about 0.075% are non-CKD individuals.
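
The steps above can be sketched with `DataFrame.sample`. The frame below is a toy stand-in with the lab's class counts and a made-up feature; with the real data, `toy` would be replaced by the imputed dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the lab's class counts (hypothetical feature values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "sc": rng.normal(2.0, 1.0, 400),
    "class": ["ckd"] * 250 + ["notckd"] * 150,
})

# Bootstrap: sample the majority class with replacement, with a fixed seed
ckd_rows = toy[toy["class"] == "ckd"]
ckd_boot = ckd_rows.sample(n=200_000, replace=True, random_state=42)

# Original data plus the bootstrapped majority rows
unbalanced = pd.concat([toy, ckd_boot], ignore_index=True)

share_notckd = (unbalanced["class"] == "notckd").mean()
print(unbalanced.shape, f"{share_notckd:.4%}")
```

The result has 200,400 rows, of which 150 are non-CKD, which is where the figure of about 0.075% comes from.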

### 22. Build a logistic regression model on the unbalanced class data and evaluate its performance using whatever method(s) you see fit. How would you describe the impact of unbalanced classes on logistic regression as a classifier?
> Be sure to look at how well it performs on non-CKD data.
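
One way to approach this is to compare overall accuracy with recall on each class. The sketch below is self-contained and uses a single synthetic feature plus a smaller bootstrap (20,000 rows instead of 200,000) to keep it quick; all values are hypothetical. Because bootstrapped rows are duplicates, some test rows also appear in the training split, so the scores are optimistic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical single feature with overlapping class distributions
toy = pd.DataFrame({
    "sc": np.r_[rng.normal(2.5, 1.0, 250), rng.normal(1.0, 1.0, 150)],
    "class": ["ckd"] * 250 + ["notckd"] * 150,
})

# Inflate the majority class by sampling with replacement
boot = toy[toy["class"] == "ckd"].sample(n=20_000, replace=True, random_state=42)
unbalanced = pd.concat([toy, boot], ignore_index=True)

X_train, X_test, y_train, y_test = train_test_split(
    unbalanced[["sc"]], unbalanced["class"],
    test_size=0.2, stratify=unbalanced["class"], random_state=42,
)

model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)

accuracy = accuracy_score(y_test, preds)
recall_ckd = recall_score(y_test, preds, pos_label="ckd")        # CKD patients found
recall_notckd = recall_score(y_test, preds, pos_label="notckd")  # non-CKD patients found

print(f"accuracy={accuracy:.3f} ckd recall={recall_ckd:.3f} notckd recall={recall_notckd:.3f}")
```

A high accuracy paired with a low `notckd` recall would indicate that the classifier is leaning on the majority class, which is the effect this problem asks you to describe.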

---

## Step 6: Answer the problem.

At this step, you would generally answer the problem! In this situation, you would likely present your model to doctors or administrators at the hospital and show how your model results in reduced false positives/false negatives. Next steps would be to find a way to roll this model and its conclusions out across the hospital so that the outcomes of patients with CKD (and without CKD!) can be improved!