
# Lesson 79: Logistic Regression - Heart Disease Prediction

---

### Teacher-Student Activities

In the previous few classes, you learnt how a logistic regression model classifies labels behind the scenes.

In this class, we will continue to build a multivariate logistic regression model to predict whether a patient has heart disease. Let's quickly go through the activities covered in the previous classes and begin this class from the **Activity 1: Multivariate Logistic Regression** section.

---

#### Recap

Run the code below

In [None]:
# Import the required modules and load the heart disease dataset. Also, display the first five rows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

csv_file = 'https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/uci-heart-disease/heart.csv'
df = pd.read_csv(csv_file)
df.info()  # info() prints its summary itself and returns None, so it is not passed to print()
print("\n", df.head(), "\n")

# Print the number of records with and without heart disease
print("Number of records in each label are")
print(df['target'].value_counts())

# Print the percentage of each label
print("\nPercentage of records in each label are")
print(df['target'].value_counts() * 100 / df.shape[0])

# Split the training and testing data
X = df.drop(columns = 'target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

    age  sex  cp  trestbps  chol  fbs  ...  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1  ...      0      2.3      0   0     1       1
1   37    1   2       130   250    0  ...      0   

---

#### Activity 1: Multivariate Logistic Regression

Let's include all the features present in the heart disease dataset to build a multivariate logistic regression model using the `sklearn` module.

In [None]:
# S1.1: Create a multivariate logistic regression model. Also, predict the target values for the train set.



In [None]:
# S1.2: Predict the target values for the test set.


As you can see,
- The FP and FN values in the confusion matrix are low
- The precision and recall values are also good
- The f1-score is also greater than **0.7**

This shows that the decision boundary separates the labels (or classes) with good accuracy.

But this logistic regression model (refer to the object stored in the `lg_clf_1` variable) is created using all the features (or independent variables). It is quite possible that not all features are of importance for the classification of the labels in the `target` column. Therefore, we can still try to improve the model by reducing the number of features to obtain higher f1-scores.
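
The cells S1.1 and S1.2 are left for you to fill in. A minimal sketch of what they might contain is given below, reusing the `lg_clf_1` name mentioned above. To keep the block runnable on its own, it uses a synthetic stand-in dataset from `make_classification` (an assumption, not the heart disease data) with placeholder column names `f0` to `f12`, so its scores will differ from the ones discussed above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumption): 303 rows, 13 features.
X_arr, y_arr = make_classification(n_samples=303, n_features=13,
                                   n_informative=5, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{j}" for j in range(13)])
y = pd.Series(y_arr, name="target")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# S1.1: build the model on all the features and predict for the train set.
lg_clf_1 = LogisticRegression()
lg_clf_1.fit(X_train, y_train)
y_train_pred = lg_clf_1.predict(X_train)

# S1.2: predict for the test set and evaluate the predictions.
y_test_pred = lg_clf_1.predict(X_test)
test_conf_matrix = confusion_matrix(y_test, y_test_pred)
print(test_conf_matrix)
print(classification_report(y_test, y_test_pred))
```

In the notebook itself, you would skip the synthetic data and use the `X_train`, `X_test`, `y_train` and `y_test` objects created in the recap cell.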

---

#### Activity 2: Data Standardisation

As you must have observed, whenever logistic regression was applied, the warning message shown below appeared quite a few times:
```
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
```

The message is displayed because the **Limited-memory Broyden–Fletcher–Goldfarb–Shanno** (or L-BFGS) algorithm, which the `LogisticRegression` class of the `sklearn.linear_model` module uses to calculate the optimum values of the coefficients (betas) for a regularised cost function, reached the maximum number of iterations allowed (the `max_iter` parameter, `100` by default) before it converged. Unlike the plain gradient descent algorithm, which uses only first-order derivatives i.e. $\frac{\partial J}{\partial \beta}$, L-BFGS is a quasi-Newton optimiser: it approximates the second-order derivatives i.e. $\frac{\partial^2 J}{\partial \beta^2}$ from the first-order derivatives of the last few iterations, and stores only those few iterations to save memory.

Poorly scaled data is a common cause of this warning, as the optimiser then needs more iterations to converge. Here are a couple of ways to avoid the `ConvergenceWarning` message:

1. Increase the number of iterations i.e. set the `max_iter` parameter to a value greater than its default of `100`, say `max_iter = 1000`, in the `LogisticRegression` constructor.

2. Scale the data using one of the normalisation methods, say standard normalisation.
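
A minimal sketch of the first option is given below. It fits the model with two different `max_iter` values, records any `ConvergenceWarning` raised and reports the iterations used through the `n_iter_` attribute. The data is a synthetic stand-in (an assumption) with one column deliberately stretched, and `fit_and_count_warnings()` is a hypothetical helper name. Whether a warning actually appears depends on the data and on the `sklearn` version.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption); one column is stretched to mimic poor scaling.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_demo[:, 0] *= 1000

def fit_and_count_warnings(max_iter):
    """Fit a model and return (iterations used, number of ConvergenceWarnings raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = LogisticRegression(max_iter=max_iter)
        clf.fit(X_demo, y_demo)
    n_conv = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(clf.n_iter_[0]), n_conv

for max_iter in (100, 5000):
    iters, n_warn = fit_and_count_warnings(max_iter)
    print(f"max_iter={max_iter}: iterations used={iters}, ConvergenceWarnings={n_warn}")
```

The second option, scaling the data, is the one pursued in the rest of this activity.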

Therefore, let's create a function `standard_scalar()` to normalise the `X_train` and `X_test` data-frames using the standard normalisation method i.e.

$$x_{\text{std}} = \frac{(x_i - \mu)}{\sigma} $$
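
A minimal sketch of such a function is given below, assuming it is applied column by column with the `apply()` function of a data-frame. The small `demo_df` data-frame and its values are hypothetical and only stand in for `X_train` and `X_test`. Note that the `std()` function of pandas computes the sample standard deviation (`ddof = 1`).

```python
import pandas as pd

def standard_scalar(series):
    """Return (x - mean) / std for every value in a pandas Series."""
    return (series - series.mean()) / series.std()

# Hypothetical values, used only to demonstrate the function.
demo_df = pd.DataFrame({"age": [63, 37, 45, 52, 58],
                        "chol": [233, 250, 210, 240, 300]})

# axis=0 applies the function to each column in turn.
norm_demo_df = demo_df.apply(standard_scalar, axis=0)
print(norm_demo_df.describe().loc[["mean", "std"]].round(3))
```

In the notebook, the same call on the real data would look like `norm_X_train = X_train.apply(standard_scalar, axis = 0)`, and likewise for `X_test`.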



In [None]:
# S2.1: Normalise the train and test data-frames using the standard normalisation method.


In [None]:
# S2.2: Display descriptive statistics for the normalised values of the features for the test data-frames.


As we can observe in the output, the data is normalised because the mean and standard deviation values for each column are approximately 0 and 1 respectively.

---

#### Activity 3: Features Selection Using RFE

Our next task is to select, from all the features, the relevant ones that contribute to predicting whether a person has heart disease. Irrelevant features do not help in increasing the accuracy of a prediction model, and they also increase its training time. You want neither too few features nor too many of them in your prediction model.

So, the question is **how to select features?**

One simple way is trial and error. You can pick **any one feature** at a time, build a prediction model and evaluate it.

Similarly, you pick **any two features** at a time, build a prediction model and evaluate it. For example:
- 1, 2
- 1, 3
- 1, 4
etc.

Similarly, you pick **any three features** at a time, build a prediction model and evaluate it. For example:
- 1, 2, 3
- 1, 2, 4
- 2, 3, 4
etc.

And so on. However, doing all this manually is very time-consuming. Instead, you can use the `RFE` (Recursive Feature Elimination) class of the `sklearn.feature_selection` module. It is a backward feature selection technique and is based on **feature importance**. You have already learnt how to use RFE in the linear regression lesson(s).
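
To get a feel for how quickly the trial-and-error approach grows, the sketch below counts the feature combinations for the 13 feature columns listed in the `df.info()` output of the recap. It uses only the Python standard library (`math.comb` needs Python 3.8 or later).

```python
from itertools import combinations
from math import comb

# The 13 feature columns of the heart disease dataset (every column except 'target').
features = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# Number of models to build when picking 1, 2 or 3 features at a time.
for k in (1, 2, 3):
    print(f"{k} feature(s) at a time: {comb(len(features), k)} combinations")

# Number of models to build if every non-empty subset of features is tried.
total = sum(comb(len(features), k) for k in range(1, len(features) + 1))
print("All non-empty subsets:", total)

# The first few two-feature combinations.
first_pairs = list(combinations(features, 2))[:3]
print(first_pairs)
```

With 13 features there are 13, 78 and 286 combinations of one, two and three features respectively, and 8191 non-empty subsets in total.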

So let's use RFE to find the optimal number of features required to build a logistic regression model that predicts whether a person has heart disease. Here are the steps that we will follow:

1. Import the following modules
```
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
```

2. Create an empty dictionary and store it in a variable called `dict_rfe`.

3. Create a `for` loop that iterates through all the columns in the normalised training data-frame. Inside the loop:
   
   - Create an object of `LogisticRegression` class and store it in a variable called `lg_clf_2`.
   
   - Create an object of `RFE` class and store it in a variable called `rfe`. Inside the `RFE()` constructor, pass the object of logistic regression and the number of features to be selected by RFE as inputs.
   
   - Call the `fit()` function of the `RFE` class to train a logistic regression model on the train set with `i` number of features where `i` goes from `1` to `len(X_train.columns)`.
   
   - The `support_` attribute holds a boolean mask that is `True` for each feature selected by RFE. The rank values are held by the `ranking_` attribute, where rank `1` denotes a selected feature.
   
   - Create a list to store the important features in a variable called `rfe_features`.
   
   - Create a new data-frame having the features selected by RFE and store it in a variable called `rfe_X_train`.
   
   - Create another `LogisticRegression` object, store it in a variable called `lg_clf_3` and build a logistic regression model using the `rfe_X_train` data-frame and `y_train` series.
   
   - Predict the target values for the normalised test set (containing the feature(s) selected by RFE) by calling the `predict()` function on `lg_clf_3` object.
   
   - Calculate the f1-scores using the `f1_score()` function of the `sklearn.metrics` module, which returns a NumPy array containing the f1-scores for both the classes. Store the array in a variable called `f1_scores_array`. The **syntax** for the `f1_score()` function is `f1_score(y_true, y_pred, average = None)`
     where `y_true` and `y_pred` are the actual and predicted labels respectively, and the `average = None` parameter makes the function return the score for each class.

   - Add the number of selected features and corresponding features & f1-scores as key-value pairs in the `dict_rfe` dictionary.
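
The steps above can be sketched as follows. This is a sketch under stated assumptions rather than the reference solution: it runs on a small synthetic stand-in dataset from `make_classification` with placeholder column names `f0` to `f5`, it normalises the train and test data-frames separately, and it passes the number of features to `RFE` through the `n_features_to_select` keyword, which recent `sklearn` versions require.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split and normalised as in the activities above.
X_arr, y_arr = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=1)
X = pd.DataFrame(X_arr, columns=[f"f{j}" for j in range(6)])
X_train, X_test, y_train, y_test = train_test_split(X, y_arr, test_size=0.3, random_state=42)
norm_X_train = (X_train - X_train.mean()) / X_train.std()
norm_X_test = (X_test - X_test.mean()) / X_test.std()

dict_rfe = {}
for i in range(1, len(norm_X_train.columns) + 1):
    # Let RFE pick the i most important features.
    lg_clf_2 = LogisticRegression()
    rfe = RFE(lg_clf_2, n_features_to_select=i)
    rfe.fit(norm_X_train, y_train)

    # support_ is a boolean mask over the columns.
    rfe_features = list(norm_X_train.columns[rfe.support_])
    rfe_X_train = norm_X_train[rfe_features]

    # Build a logistic regression model using only the selected features.
    lg_clf_3 = LogisticRegression()
    lg_clf_3.fit(rfe_X_train, y_train)

    # Evaluate on the test set; average=None gives one f1-score per class.
    y_test_pred = lg_clf_3.predict(norm_X_test[rfe_features])
    f1_scores_array = f1_score(y_test, y_test_pred, average=None)
    dict_rfe[i] = {"features": rfe_features, "f1_score": f1_scores_array}

for i, result in dict_rfe.items():
    print(i, result["features"], result["f1_score"].round(3))
```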

In [None]:
# S3.1: Create a dictionary containing the different combination of features selected by RFE and their corresponding f1-scores.
# Import the libraries


# Create the empty dictionary.

# Create a loop


  # Build a logistic regression model using the features selected by RFE.


  # Predicting 'y' values only for the test set as generally, they are predicted quite accurately for the train set.



In the above code:

1. ```
   lg_clf_2 = LogisticRegression()
   rfe = RFE(lg_clf_2, n_features_to_select = i)
   rfe.fit(norm_X_train, y_train)
   ```
   part gets the most important features using RFE.

2. ```
   rfe_features = list(norm_X_train.columns[rfe.support_])
   rfe_X_train = norm_X_train[rfe_features]
   ```
   part creates a new data-frame containing the values of the most important feature(s) selected by RFE.

3. ```
   lg_clf_3 = LogisticRegression()
   lg_clf_3.fit(rfe_X_train, y_train)
   ```
   part builds a logistic regression model using the most important feature(s) selected by RFE.

4. ```
   y_test_pred = lg_clf_3.predict(norm_X_test[rfe_features])
   ```
   part predicts the target values for the test set only, because a machine learning model generally performs well on the training set it was fitted to.

5. ```
   f1_scores_array = f1_score(y_test, y_test_pred, average = None)
   ```
   part calculates the f1-scores for both the classes.

6. ```
   dict_rfe[i] = {"features": list(rfe_features), "f1_score": f1_scores_array}
   ```
   part adds the number of features, features and their corresponding f1-scores as key-value pairs to the dictionary stored in the `dict_rfe` variable.
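
As a small check on what `f1_score()` returns with `average = None`, the sketch below uses hypothetical labels that are short enough to verify by hand.

```python
from sklearn.metrics import f1_score

# Hypothetical actual and predicted labels.
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0]

# Class 0: TP = 1, FP = 1, FN = 1, so precision = recall = 0.5 and f1 = 0.5
# Class 1: TP = 3, FP = 1, FN = 1, so precision = recall = 0.75 and f1 = 0.75
demo_f1_scores = f1_score(y_true, y_pred, average=None)
print(demo_f1_scores)
```

The first element of the returned array is the f1-score for class `0` and the second is the f1-score for class `1`.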

Let's print the dictionary created.

In [None]:
# S3.2: Print the dictionary created in the previous exercise.

Let's convert the `dict_rfe` dictionary to a Pandas DataFrame using the `from_dict()` function of the `pandas.DataFrame` class. Pass the `orient = 'index'` parameter to the function so that the keys of the dictionary become the row indices of the DataFrame. Otherwise, the keys of the dictionary i.e. (1 through 12) will become columns.

Moreover, we need wider columns in the data-frame as the columns will contain lists and arrays as their values. To do this, you can set the `max_colwidth` display option.

**Syntax:** `pd.options.display.max_colwidth = W`

where `W` is the required column width.

Let's set the column widths to 100.
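
A minimal sketch of the conversion is given below. The `dict_rfe_demo` dictionary, its feature names and its scores are made up for illustration; it only mimics the shape of `dict_rfe`.

```python
import numpy as np
import pandas as pd

# Hypothetical dictionary in the same shape as dict_rfe (all values are made up).
dict_rfe_demo = {
    1: {"features": ["f0"], "f1_score": np.array([0.70, 0.75])},
    2: {"features": ["f0", "f1"], "f1_score": np.array([0.72, 0.78])},
}

# Widen the displayed columns so that the lists and arrays are not cut off.
pd.options.display.max_colwidth = 100

# orient='index' turns the dictionary keys into row indices.
rfe_df = pd.DataFrame.from_dict(dict_rfe_demo, orient="index")
print(rfe_df)
```

Each row of `rfe_df` then corresponds to one number of selected features, with the `features` and `f1_score` values in separate columns.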


In [None]:
# S3.3: Convert the dictionary to the dataframe


From the above data-frame, we can see that we get the best f1-scores for both the classes when we have 3 features, which are `cp`, `oldpeak` and `ca`. Beyond this point, the number of features increases but the f1-scores increase only marginally. Hence, it is best to use this many features to build a model that predicts whether a patient has heart disease.

Let's now rebuild a logistic regression model with the ideal number of features to predict whether a person has a heart disease.

In [None]:
# S3.4: Logistic Regression with the ideal number of features.


Let's stop here. In the next class, we will learn more metrics to evaluate a classification-based machine learning model.

----