## Week 4, Lab 2: Predicting Chronic Kidney Disease in Patients
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus on exploring data, building models, and evaluating the models we build.

There are three links you may find important:
- [A set of chronic kidney disease (CKD) data and other biological factors](./chronic_kidney_disease_full.csv).
- [The CKD data dictionary](./chronic_kidney_disease_header.txt).
- [An article comparing the use of k-nearest neighbors and support vector machines on predicting CKD](./chronic_kidney_disease.pdf).

## Step 1: Define the problem.

Suppose you're working for Mayo Clinic, widely recognized to be the top hospital in the United States. In your work, you've overheard nurses and doctors discuss test results, then arrive at a conclusion as to whether or not someone has developed a particular disease or condition. For example, you might overhear something like:

> **Nurse**: Male 57 year-old patient presents with severe chest pain. FDP _(short for fibrin degradation product)_ was elevated at 13. We did an echo _(echocardiogram)_ and it was inconclusive.

> **Doctor**: What was his interarm BP? _(blood pressure)_

> **Nurse**: Systolic was 140 on the right; 110 on the left.

> **Doctor**: Dammit, it's an aortic dissection! Get to the OR _(operating room)_ now!

> _(intense music playing)_

In this fictitious but [Shonda Rhimes-esque](https://en.wikipedia.org/wiki/Shonda_Rhimes#Grey's_Anatomy,_Private_Practice,_Scandal_and_other_projects_with_ABC) scenario, you might imagine the doctor going through a series of steps like a [flowchart](https://en.wikipedia.org/wiki/Flowchart), or a series of if-this-then-that steps to diagnose a patient. The first steps made the doctor ask what the interarm blood pressure was. Because interarm blood pressure took on the values it took on, the doctor diagnosed the patient with an aortic dissection.

Your goal, as a research biostatistical data scientist at the nation's top hospital, is to develop a medical test that can improve upon our current diagnosis system for [chronic kidney disease (CKD)](https://www.mayoclinic.org/diseases-conditions/chronic-kidney-disease/symptoms-causes/syc-20354521).

**Real-world problem**: Develop a medical diagnosis test that is better than our current diagnosis system for CKD.

**Data science problem**: Develop a medical diagnosis test that reduces both the number of false positives and the number of false negatives.

---

## Step 2: Obtain the data.

### 1. Read in the data.

In [180]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [140]:
pd.set_option('display.max_columns', 30);

In [141]:
df = pd.read_csv('./chronic_kidney_disease_full.csv')

In [142]:
df.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,36.0,1.2,,,15.4,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,18.0,0.8,,,11.3,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,423.0,53.0,1.8,,,9.6,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,56.0,3.8,111.0,2.5,11.2,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,106.0,26.0,1.4,,,11.6,35.0,7300.0,4.6,no,no,no,good,no,no,ckd


### 2. Check out the data dictionary. What are a few features or relationships you might be interested in checking out?

Answer:
- The `class` column is the natural target (y) variable, since it records whether someone has CKD or not.
- Several columns contain NaN values, so the extent and pattern of the missing data are worth investigating.
- Check the distribution of each variable. Some variables might overwhelmingly take on one value and thus might not be predictive.

---

## Step 3: Explore the data.

### 3. How much of the data is missing from each column?

In [143]:
df.isnull().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64

In [144]:
df.shape

(400, 25)

In [145]:
df.isnull().sum() / df.shape[0] * 100

age       2.25
bp        3.00
sg       11.75
al       11.50
su       12.25
rbc      38.00
pc       16.25
pcc       1.00
ba        1.00
bgr      11.00
bu        4.75
sc        4.25
sod      21.75
pot      22.00
hemo     13.00
pcv      17.75
wbcc     26.50
rbcc     32.75
htn       0.50
dm        0.50
cad       0.50
appet     0.25
pe        0.25
ane       0.25
class     0.00
dtype: float64

For columns `rbc`, `pc`, `sod`, `pot`, `pcv`, `wbcc` and `rbcc`, the percentage of missing values is greater than 15%.
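A list like this can also be produced programmatically by filtering the missing-value percentages against a threshold. The sketch below uses a small made-up dataframe (`toy`) and a hypothetical helper, `columns_missing_above`; neither is part of the lab data.

```python
import numpy as np
import pandas as pd


def columns_missing_above(frame, threshold_pct=15.0):
    """Return the percentage of missing values for columns above threshold_pct."""
    pct_missing = frame.isnull().mean() * 100
    return pct_missing[pct_missing > threshold_pct].sort_values(ascending=False)


# Toy data for illustration only (not the CKD data).
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],        # 25% missing
    'b': [1.0, 2.0, 3.0, 4.0],           # 0% missing
    'c': [np.nan, np.nan, np.nan, 4.0],  # 75% missing
})

print(columns_missing_above(toy))
```

On the lab data, the same helper could be called as `columns_missing_above(df)`.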

### 4. Suppose that I dropped every row that contained at least one missing value. (In the context of analysis with missing data, we call this a "complete case analysis," because we keep only the complete cases!) How many rows would remain in our dataframe? What are at least two downsides to doing this?

> There's a good visual on slide 15 of [this deck](https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf) that shows what a complete case analysis looks like if you're interested.

In [146]:
df.dropna(axis = 0, inplace = False).shape

(158, 25)

Answer: 158 rows would remain in our dataframe, which means we would drop 242 of the 400 rows. One downside is that the sample becomes much smaller, so any later estimates will be less precise. Another is that the missingness may itself be informative (the data may not be missing at random), so the complete cases could be a biased sample. We could instead keep those rows by imputing values, for example with the median.
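To make the trade-off concrete, here is a minimal sketch comparing a complete case analysis with median imputation on a made-up dataframe (`toy` is illustrative, not the CKD data). Median imputation keeps every row, though it has drawbacks of its own, such as shrinking the variance of the imputed columns.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'age': [48.0, 7.0, np.nan, 51.0, 60.0],
    'bp':  [80.0, np.nan, 80.0, 70.0, 90.0],
})

# Complete case analysis: keep only rows with no missing values.
complete_cases = toy.dropna(axis=0)

# Median imputation: fill each column's gaps with that column's median.
imputed = toy.fillna(toy.median())

print(complete_cases.shape)
print(imputed.shape)
print(imputed)
```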

### 5. Thinking critically about how our data were gathered, it's likely that these records were gathered by doctors and nurses. Brainstorm three potential areas (in addition to the missing data we've already discussed) where this data might be inaccurate or imprecise.

Answer:
- The `appet` (appetite) column takes the values good and poor. This is a subjective judgment rather than a measurement, and different people may have different ideas of what a good appetite is.
- The `pc` (pus cell) column takes the values normal and abnormal. Similarly, it is unclear what cutoff or criteria were used to decide.
- The same applies to the `rbc` (red blood cells) column.

---

## Step 4: Model the data.

### 6. Suppose that I want to construct a model where no person who has CKD will ever be told that they do not have CKD. What (very simple, no machine learning needed) model can I create that will never tell a person with CKD that they do not have CKD?

> Hint: Don't think about `statsmodels` or `scikit-learn` here.

Answer: We can tell everyone that they have CKD. Then, no person would be told they do not have CKD.

### 7. In problem 6, what common classification metric did we optimize for? Did we minimize false positives or negatives?

Answer: Assuming that CKD is the positive class, we minimized false negatives (there are none), which means that we maximized sensitivity.

$$
\begin{align}
\text{Sensitivity} &= \frac{\text{TP}}{\text{TP + FN}} \\
\Rightarrow 1 &= \frac{\text{TP}}{\text{TP + FN}} \\
\Rightarrow \text{TP + FN} &= \text{TP} \\
\Rightarrow \text{FN} &= 0
\end{align}
$$

### 8. Thinking ethically, what is at least one disadvantage to the model you described in problem 6?

Answer: Telling everyone they have CKD is not ethical: healthy people would be pushed toward expensive, unnecessary treatment, and the false diagnosis would cause a lot of anxiety.

### 9. Suppose that I want to construct a model where a person who does not have CKD will never be told that they do have CKD. What (very simple, no machine learning needed) model can I create that will accomplish this?

Answer: We can tell nobody that they have CKD. Thus, no person would ever be told they have CKD.

### 10. In problem 9, what common classification metric did we optimize for? Did we minimize false positives or negatives?

Answer: Assuming that CKD is the positive class, we minimized false positives (there are none), which means that we maximized specificity.

$$
\begin{align}
\text{Specificity} &= \frac{\text{TN}}{\text{TN + FP}} \\
\Rightarrow 1 &= \frac{\text{TN}}{\text{TN + FP}} \\
\Rightarrow \text{TN + FP} &= \text{TN} \\
\Rightarrow \text{FP} &= 0
\end{align}
$$
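The two trivial models from problems 6 and 9 can be checked numerically. The sketch below uses made-up labels and a hypothetical helper, `sensitivity_specificity`, to show that "always CKD" reaches a sensitivity of 1 at the cost of a specificity of 0, and "never CKD" does the reverse.

```python
import numpy as np


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity), treating 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels: 1 = CKD, 0 = not CKD.
y_true = np.array([1, 1, 1, 0, 0])

always_ckd = np.ones_like(y_true)   # problem 6: tell everyone they have CKD
never_ckd = np.zeros_like(y_true)   # problem 9: tell no one they have CKD

print(sensitivity_specificity(y_true, always_ckd))
print(sensitivity_specificity(y_true, never_ckd))
```

Neither model is useful on its own, which is why the lab's data science problem asks for both kinds of error to be reduced together.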

### 11. Thinking ethically, what is at least one disadvantage to the model you described in problem 9?

Answer: Telling everyone they do not have CKD is not ethical because those who actually have CKD would not know it and would not get treatment.

### 12. Construct a logistic regression model in `sklearn` predicting class from the other variables. You may scale, select/drop, and engineer features as you wish - build a good model! Make sure, however, that you include at least one categorical/dummy feature and at least one quantitative feature.

> Hint: Remember to do a train/test split!

In [147]:
df.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,36.0,1.2,,,15.4,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,18.0,0.8,,,11.3,38.0,6000.0,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,423.0,53.0,1.8,,,9.6,31.0,7500.0,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,56.0,3.8,111.0,2.5,11.2,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,106.0,26.0,1.4,,,11.6,35.0,7300.0,4.6,no,no,no,good,no,no,ckd


In [148]:
df['y'] = [1 if i == 'ckd' else 0 for i in df['class']]

In [149]:
df.columns

Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'class', 'y'],
      dtype='object')

In [150]:
df.dtypes

age      float64
bp       float64
sg       float64
al       float64
su       float64
rbc       object
pc        object
pcc       object
ba        object
bgr      float64
bu       float64
sc       float64
sod      float64
pot      float64
hemo     float64
pcv      float64
wbcc     float64
rbcc     float64
htn       object
dm        object
cad       object
appet     object
pe        object
ane       object
class     object
y          int64
dtype: object

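Use `pd.get_dummies` to one-hot encode the object columns into boolean indicator columns. Note that `pd.get_dummies` gives a missing value `False` in every indicator, so missing entries are treated the same as the reference category here.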

In [151]:
rbc_abnormal = pd.get_dummies(df['rbc'])['abnormal']
pc_abnormal = pd.get_dummies(df['pc'])['abnormal']
pcc_present = pd.get_dummies(df['pcc'])['present']
ba_present = pd.get_dummies(df['ba'])['present']
htn_yes = pd.get_dummies(df['htn'])['yes']
dm_yes = pd.get_dummies(df['dm'])['yes']
cad_yes = pd.get_dummies(df['cad'])['yes']
appet_poor = pd.get_dummies(df['appet'])['poor']
pe_yes = pd.get_dummies(df['pe'])['yes']
ane_yes = pd.get_dummies(df['ane'])['yes']

In [152]:
quant = df[['age', 'bp', 'sg', 'al', 'su', 'bgr',
                 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv',
                 'wbcc', 'rbcc']]

qual = pd.DataFrame([ane_yes, pe_yes, rbc_abnormal, pc_abnormal,
                     pcc_present, ba_present, htn_yes, dm_yes,
                     cad_yes, appet_poor], index=['ane_yes', 'pe_yes', 'rbc_abnormal',
                                                  'pc_abnormal', 'pcc_present',
                                                  'ba_present', 'htn_yes', 'dm_yes',
                                                  'cad_yes', 'appet_poor']).T

In [153]:
X = quant.merge(right = qual, left_index = True, right_index = True)
X.head()

Unnamed: 0,age,bp,sg,al,su,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,ane_yes,pe_yes,rbc_abnormal,pc_abnormal,pcc_present,ba_present,htn_yes,dm_yes,cad_yes,appet_poor
0,48.0,80.0,1.02,1.0,0.0,121.0,36.0,1.2,,,15.4,44.0,7800.0,5.2,False,False,False,False,False,False,True,True,False,False
1,7.0,50.0,1.02,4.0,0.0,,18.0,0.8,,,11.3,38.0,6000.0,,False,False,False,False,False,False,False,False,False,False
2,62.0,80.0,1.01,2.0,3.0,423.0,53.0,1.8,,,9.6,31.0,7500.0,,True,False,False,False,False,False,False,True,False,True
3,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,11.2,32.0,6700.0,3.9,True,True,False,True,True,False,True,False,False,True
4,51.0,80.0,1.01,2.0,0.0,106.0,26.0,1.4,,,11.6,35.0,7300.0,4.6,False,False,False,False,False,False,False,False,False,False


Fill the NaN values in each column with that column's mean.

In [154]:
X_train, X_test, y_train, y_test = train_test_split(X.fillna(X.mean()),
                                                    df['y'],
                                                    test_size = 0.3, 
                                                    random_state = 42)
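Note that `X.mean()` in the cell above is computed on all rows before the split. A stricter variant, sketched here on made-up data, computes the means on the training rows only and reuses them for the test rows, so no test-set information leaks into the imputation. The dataframe `toy` and the variable names are illustrative assumptions, and `DataFrame.sample` stands in for `train_test_split` to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Toy feature matrix for illustration only (not the CKD data).
toy = pd.DataFrame({
    'bgr': [121.0, np.nan, 423.0, 117.0, 106.0, 74.0, np.nan, 100.0],
    'sc':  [1.2, 0.8, np.nan, 3.8, 1.4, 1.1, 24.0, 1.9],
})

# Split first: 75% of rows for training, the rest for testing.
train = toy.sample(frac=0.75, random_state=42)
test = toy.drop(train.index)

# Learn the imputation values from the training rows only...
train_means = train.mean()

# ...then apply those same values to both sets.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means)
```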

In [155]:
parameters = {'C': [0.001, 0.01, 0.1, 1, 10],
              'class_weight': [None, 'balanced'],
              'penalty': ['l1', 'l2']}

In [156]:
import random
random.seed(42)

lr = LogisticRegression(solver = 'liblinear', 
                        max_iter = 1000,
                        random_state = 42)

gs_results = GridSearchCV(estimator = lr,
                          param_grid = parameters,
                          scoring = 'recall',
                          cv = 5).fit(X_train, y_train)

In [157]:
gs_results.best_estimator_.get_params()

{'C': 0.001,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l1',
 'random_state': 42,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

Based on the results of this grid search (scored on recall), our best model is that which:
- Has an inverse regularization strength of $C = 0.001$, i.e., very strong regularization.
- Has the `L1` penalty (i.e., a lasso-style penalty applied to logistic regression).

In [158]:
logit = LogisticRegression(solver = 'liblinear', 
                           max_iter = 1000,
                           C = 0.001,
                           class_weight = None,
                           penalty = 'l1',
                           random_state = 42)

In [159]:
logit.fit(X = X_train,
          y = y_train)

In [160]:
logit.score(X_train, y_train)

0.725

In [161]:
logit.score(X_test, y_test)

0.7416666666666667

---

## Step 5: Evaluate the model.

### 13. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your quantitative features.

In [162]:
list(zip(np.exp(logit.coef_[0]),X.columns))

[(1.0, 'age'),
 (1.0, 'bp'),
 (1.0, 'sg'),
 (1.0, 'al'),
 (1.0, 'su'),
 (1.008914067630876, 'bgr'),
 (1.0066357378006605, 'bu'),
 (1.0, 'sc'),
 (0.9891723642997939, 'sod'),
 (1.0, 'pot'),
 (1.0, 'hemo'),
 (1.0, 'pcv'),
 (1.000056892604744, 'wbcc'),
 (1.0, 'rbcc'),
 (1.0, 'ane_yes'),
 (1.0, 'pe_yes'),
 (1.0, 'rbc_abnormal'),
 (1.0, 'pc_abnormal'),
 (1.0, 'pcc_present'),
 (1.0, 'ba_present'),
 (1.0, 'htn_yes'),
 (1.0, 'dm_yes'),
 (1.0, 'cad_yes'),
 (1.0, 'appet_poor')]

Holding all other features constant, a 1-unit increase in random blood glucose (`bgr`) multiplies the odds of having CKD by about 1.009, i.e., an increase of roughly 0.9% in the odds.
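Because an exponentiated coefficient is an odds ratio, its effect compounds multiplicatively rather than additively. As a small worked sketch using the `bgr` value printed above (the 10-unit increase is just an illustrative choice):

```python
import math

odds_ratio_bgr = 1.008914067630876     # exp(coefficient) for bgr, from the output above
coef_bgr = math.log(odds_ratio_bgr)    # the raw coefficient, on the log-odds scale

# A 1-unit increase multiplies the odds by the odds ratio.
pct_change_1_unit = (odds_ratio_bgr - 1) * 100

# A 10-unit increase multiplies the odds by exp(10 * coef), i.e. odds_ratio ** 10.
odds_ratio_10_units = math.exp(10 * coef_bgr)

print(round(pct_change_1_unit, 2))     # percent change in the odds per unit
print(round(odds_ratio_10_units, 3))   # odds multiplier for a 10-unit increase
```

These are statements about odds, not probabilities; the change in probability depends on where the patient starts.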

### 14. Based on your logistic regression model constructed in problem 12, interpret the coefficient of one of your categorical/dummy features.

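The coefficient for abnormal pus cells (`pc_abnormal`) was shrunk to exactly zero by the L1 penalty, so its exponentiated value is 1.0: holding all other features constant, the model estimates that having abnormal pus cells does not change the odds of having CKD.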

### 15. Despite being a relatively simple model, logistic regression is very widely used in the real world. Why do you think that's the case? Name at least two advantages to using logistic regression as a modeling technique.

Answer:
- The model is interpretable: each coefficient tells us how a feature changes the odds of the outcome.
- Logistic regression usually does not suffer from high variance.

### 16. Does it make sense to generate a confusion matrix on our training data or our test data? Why? Generate it on the proper data.

> Hint: Once you've generated your predicted $y$ values and you have your observed $y$ values, then it will be easy to [generate a confusion matrix using sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

In [163]:
y_preds = logit.predict(X_test)

In [164]:
confusion_matrix(y_test, y_preds)

array([[17, 27],
       [ 4, 72]], dtype=int64)

In [165]:
tn, fp, fn, tp = confusion_matrix(y_test, y_preds).ravel()

In [166]:
print("True Negatives: " + str(tn))
print("False Positives: " + str(fp))
print("False Negatives: " + str(fn))
print("True Positives: " + str(tp))

True Negatives: 17
False Positives: 27
False Negatives: 4
True Positives: 72
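The usual summary metrics follow directly from these four counts. A minimal sketch, with the counts copied from the output above:

```python
tn, fp, fn, tp = 17, 27, 4, 72   # values printed above

sensitivity = tp / (tp + fn)     # recall for the CKD class
specificity = tn / (tn + fp)     # recall for the non-CKD class
precision = tp / (tp + fp)       # share of CKD predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"Accuracy:    {accuracy:.3f}")
```

The accuracy computed this way should agree with the test score reported by `logit.score(X_test, y_test)` earlier.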


### 17. In this hospital case, we want to predict CKD. Do we want to optimize for sensitivity, specificity, or something else? Why? (If you don't think there's one clear answer, that's okay! There rarely is. Be sure to defend your conclusion!)

Answer: We want to optimize sensitivity.

$$
\begin{align}
\text{Sensitivity} &= \frac{\text{TP}}{\text{TP + FN}}
\end{align}
$$

We want to minimize false negatives: we do not want a test that says someone does not have CKD when they in fact have it, because an untreated patient's condition could worsen and might eventually be fatal.

### 18 (BONUS). Write a function that will create an ROC curve for you, then plot the ROC curve.

Here's a strategy you might consider:
1. In order to even begin, you'll need some fit model. Use your logistic regression model from problem 12.
2. We want to look at all values of your "threshold" - that is, anything where `.predict_proba()` gives you a probability above your threshold falls in the "positive class," and anything that is below your threshold falls in the "negative class." Start the threshold at 0.
3. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
4. Increment your threshold by some "step." Maybe set your step to be 0.01, or even smaller.
5. At this value of your threshold, calculate the sensitivity and specificity. Store these values.
6. Repeat steps 3 and 4 until you get to the threshold of 1.
7. Plot the values of sensitivity and 1 - specificity.
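One possible sketch of the strategy above follows. The function names `roc_points` and `plot_roc`, the evenly spaced threshold grid, and the demo labels and probabilities are all assumptions made for illustration; the code also assumes both classes are present in `y_true`, since otherwise a denominator would be zero.

```python
import numpy as np


def roc_points(y_true, probs, n_thresholds=101):
    """Return (1 - specificity, sensitivity) arrays over thresholds from 0 to 1."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    fpr, tpr = [], []
    for t in thresholds:
        preds = (probs >= t).astype(int)   # positive class at or above the threshold
        tp = np.sum((preds == 1) & (y_true == 1))
        fn = np.sum((preds == 0) & (y_true == 1))
        tn = np.sum((preds == 0) & (y_true == 0))
        fp = np.sum((preds == 1) & (y_true == 0))
        tpr.append(tp / (tp + fn))         # sensitivity
        fpr.append(1 - tn / (tn + fp))     # 1 - specificity
    return np.array(fpr), np.array(tpr)


def plot_roc(fpr, tpr):
    """Plot sensitivity against 1 - specificity, with a diagonal baseline."""
    import matplotlib.pyplot as plt  # imported here so roc_points works without it
    plt.plot(fpr, tpr, label='ROC curve')
    plt.plot([0, 1], [0, 1], linestyle='--', label='baseline')
    plt.xlabel('1 - Specificity')
    plt.ylabel('Sensitivity')
    plt.legend()
    plt.show()


# Made-up labels and predicted probabilities for illustration.
y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_demo, p_demo)
# plot_roc(fpr, tpr)   # uncomment to draw the curve
```

With the model from problem 12, the inputs would likely be `y_test` and `logit.predict_proba(X_test)[:, 1]`, the predicted probability of the CKD class.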

### 19. Suppose you're speaking with the biostatistics lead at Mayo Clinic, who asks you "Why are unbalanced classes generally a problem? Are they a problem in this particular CKD analysis?" How would you respond?

Answer: 
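- With unbalanced classes, we might not have enough data to learn the pattern of the minority class, and a model can score well on accuracy simply by predicting the majority class.
- In this CKD analysis, the classes are only mildly unbalanced: 250 of the 400 patients (62.5%) have CKD and 150 (37.5%) do not, so imbalance is probably not a major problem here.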

### 20. Suppose you're speaking with a doctor at Mayo Clinic who, despite being very smart, doesn't know much about data science or statistics. How would you explain why unbalanced classes are generally a problem to this doctor?

Answer: Suppose we see very few patients with a rare disease and a great many patients with a fractured arm, and we build one model on all of them. The model gets little or no information about the rare disease, so it has a hard time learning to recognize it, and it can look accurate simply by predicting the common condition every time.

### 21. Let's create very unbalanced classes just for the sake of this example! Generate very unbalanced classes by [bootstrapping](http://stattrek.com/statistics/dictionary.aspx?definition=sampling_with_replacement) (a.k.a. random sampling with replacement) the majority class.

1. The majority class are those individuals with CKD.
2. Generate a random sample of size 200,000 of individuals who have CKD **with replacement**. (Consider setting a random seed for this part!)
3. Create a new dataframe with the original data plus this random sample of data.
4. Now we should have a dataset with around 200,000 observations, of which only about 0.075% (a proportion of roughly 0.00075) are non-CKD individuals.

In [167]:
ckd_sample = df[df['class'] == 'ckd'].sample(200_000,
                                               replace = True,
                                               random_state = 42)  

In [168]:
df_2 = pd.concat([df, ckd_sample])

In [169]:
df_2.shape

(200400, 26)

In [170]:
df_2.head()

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class,y
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,36.0,1.2,,,15.4,44.0,7800.0,5.2,yes,yes,no,good,no,no,ckd,1
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,18.0,0.8,,,11.3,38.0,6000.0,,no,no,no,good,no,no,ckd,1
2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,423.0,53.0,1.8,,,9.6,31.0,7500.0,,no,yes,no,poor,no,yes,ckd,1
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,56.0,3.8,111.0,2.5,11.2,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd,1
4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,106.0,26.0,1.4,,,11.6,35.0,7300.0,4.6,no,no,no,good,no,no,ckd,1


In [171]:
100 * df_2['class'].value_counts() / len(df_2['class'])

class
ckd       99.92515
notckd     0.07485
Name: count, dtype: float64
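With proportions like these, plain accuracy says very little. As a back-of-the-envelope sketch using the counts implied by the percentages above (150 non-CKD rows out of 200,400), a "model" that always predicts CKD would already look excellent on accuracy while never identifying a single non-CKD patient:

```python
n_total = 200_400
n_notckd = 150                 # 0.07485% of 200,400 rows
n_ckd = n_total - n_notckd

# Always predicting CKD gets every CKD row right and every non-CKD row wrong.
accuracy_always_ckd = n_ckd / n_total
specificity_always_ckd = 0 / n_notckd

print(f"Accuracy:    {accuracy_always_ckd:.5f}")
print(f"Specificity: {specificity_always_ckd:.5f}")
```

This is why per-class metrics, such as recall on the non-CKD class, are more informative than accuracy for data this unbalanced.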

### 22. Build a logistic regression model on the unbalanced class data and evaluate its performance using whatever method(s) you see fit. How would you describe the impact of unbalanced classes on logistic regression as a classifier?
> Be sure to look at how well it performs on non-CKD data.

In [173]:
quant = df_2[['age', 'bp', 'sg', 'al', 'su', 'bgr',
                 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv',
                 'wbcc', 'rbcc', 'y']].copy()

quant.loc[:,'rbc_abnormal_2'] = pd.get_dummies(df_2['rbc'])['abnormal']
quant.loc[:,'pc_abnormal_2'] = pd.get_dummies(df_2['pc'])['abnormal']
quant.loc[:,'pcc_present_2'] = pd.get_dummies(df_2['pcc'])['present']
quant.loc[:,'ba_present_2'] = pd.get_dummies(df_2['ba'])['present']
quant.loc[:,'htn_yes_2'] = pd.get_dummies(df_2['htn'])['yes']
quant.loc[:,'dm_yes_2'] = pd.get_dummies(df_2['dm'])['yes']
quant.loc[:,'cad_yes_2'] = pd.get_dummies(df_2['cad'])['yes']
quant.loc[:,'appet_poor_2'] = pd.get_dummies(df_2['appet'])['poor']
quant.loc[:,'pe_yes_2'] = pd.get_dummies(df_2['pe'])['yes']
quant.loc[:,'ane_yes_2'] = pd.get_dummies(df_2['ane'])['yes']

In [174]:
quant.head()

Unnamed: 0,age,bp,sg,al,su,bgr,bu,sc,sod,pot,hemo,pcv,wbcc,rbcc,y,rbc_abnormal_2,pc_abnormal_2,pcc_present_2,ba_present_2,htn_yes_2,dm_yes_2,cad_yes_2,appet_poor_2,pe_yes_2,ane_yes_2
0,48.0,80.0,1.02,1.0,0.0,121.0,36.0,1.2,,,15.4,44.0,7800.0,5.2,1,False,False,False,False,True,True,False,False,False,False
1,7.0,50.0,1.02,4.0,0.0,,18.0,0.8,,,11.3,38.0,6000.0,,1,False,False,False,False,False,False,False,False,False,False
2,62.0,80.0,1.01,2.0,3.0,423.0,53.0,1.8,,,9.6,31.0,7500.0,,1,False,False,False,False,False,True,False,True,False,True
3,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,11.2,32.0,6700.0,3.9,1,False,True,True,False,True,False,False,True,True,True
4,51.0,80.0,1.01,2.0,0.0,106.0,26.0,1.4,,,11.6,35.0,7300.0,4.6,1,False,False,False,False,False,False,False,False,False,False


In [175]:
quant.shape

(200400, 25)

In [176]:
quant.columns

Index(['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo',
       'pcv', 'wbcc', 'rbcc', 'y', 'rbc_abnormal_2', 'pc_abnormal_2',
       'pcc_present_2', 'ba_present_2', 'htn_yes_2', 'dm_yes_2', 'cad_yes_2',
       'appet_poor_2', 'pe_yes_2', 'ane_yes_2'],
      dtype='object')

In [177]:
X_train, X_test, y_train, y_test = train_test_split(quant.fillna(quant.mean()).drop(['y'],
                                                                                    axis = 1),
                                                    quant['y'],
                                                    test_size = 0.3, 
                                                    random_state = 42)

In [178]:
logit_2 = LogisticRegression(solver = 'liblinear',   # to compare with before
                             max_iter = 1000,
                             C = 10,
                             random_state = 42,
                             penalty = 'l2')

In [179]:
logit_2.fit(X_train, y_train)

In [181]:
print(classification_report(y_test,
                            logit_2.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      0.79      0.88        56
           1       1.00      1.00      1.00     60064

    accuracy                           1.00     60120
   macro avg       1.00      0.89      0.94     60120
weighted avg       1.00      1.00      1.00     60120



In [183]:
72/76

0.9473684210526315

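The classification report shows a recall (sensitivity) for the CKD class that rounds to 1.00, while recall for the non-CKD class (specificity) is 0.79. Earlier, the logistic regression model fit on the more balanced data had a sensitivity of 72/76 ≈ 0.947. In this case, it seems as though unbalanced classes actually make our model perform better on sensitivity. This will usually not happen, and the comparison is not clean: the two models use different hyperparameters, and because the CKD rows were sampled with replacement before the split, many test rows are exact copies of training rows. We also wouldn't want to compare models on just one metric, like sensitivity; multiple metrics should be used.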

---

## Step 6: Answer the problem.

At this step, you would generally answer the problem! In this situation, you would likely present your model to doctors or administrators at the hospital and show how your model results in reduced false positives/false negatives. Next steps would be to find a way to roll this model and its conclusions out across the hospital so that the outcomes of patients with CKD (and without CKD!) can be improved!