<a href="https://colab.research.google.com/github/minnieteng/ivado-mila-dl-school-2021/blob/main/Assignment_2_Clinician_Champions_Regression_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Welcome to this Michener Institute & Vector Institute Course!

This is an optional Python tutorial in the 'AI for Clinician Champions Certificate' program! https://michener.ca/ce_course/ai-for-clinician-champions-certificate-program/
This program is offered by Michener Institute & Vector Institute for clinicians who wish to learn more about AI in Healthcare.

Instructor: Dr. Devin Singh (@DrDevSK) | Assignment Developer: Alex Yun | Course Tutors: Jianan Chen, Flora Wan, and Alex Yun | Course Director: Shingai Manjengwa (@Tjido) 
 
***Never stop learning!***



# Case Study 1 - Regression Models

We will be exploring a medical cost dataset with various demographic and health information on patients. We will use it to tackle two main tasks in supervised learning: (1) regression and (2) classification.

For a regression task, the goal is to predict medical costs using a linear regression model. For a (binary) classification task, the goal is to predict if the insurance beneficiary is a smoker or a non-smoker with a logistic regression model.

In [None]:
# Quick recap of Python syntax:
# the '#' sign denotes comments in the code - read all comments as they may include instructions for you to run in the code
# click on the 'play' button on the left hand side of the code to run each code block
# let's run some models!


In [None]:
# Import relevant Python packages
import matplotlib.pyplot as plt        # visualization
import numpy as np                     # matrices and high-level math functions
import pandas as pd                    # data manipulation
import seaborn as sns                  # visualization (based on matplotlib)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from scipy import stats                # scientific computing
# sklearn is a popular machine learning library
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from yellowbrick.cluster import KElbowVisualizer # visualize optimal number of clusters

## Import Dataset

In [None]:
url = 'https://raw.githubusercontent.com/salexyun/Michener-AI-for-Clinician-Champions/main/medical_cost.csv'
df = pd.read_csv(url)

# ***you try*** by uncommenting the following line, 'df.head()', you can view the first few lines of data in the data file
# you may specify the number of lines to reveal in the preview of your data set e.g., df.head(20)
# df.head() 

In [None]:
# run the code to view the data here 


We have just read a comma-separated values (csv) file into a pandas data structure called a DataFrame. This allows us to load the case study data into this programming environment.



## Exploratory Data Analysis (EDA)

In [None]:
print("Dimensionality of the DataFrame:", df.shape)
df.describe()
print("Data type of each feature:")
df.dtypes
print("\nAre there any missing datapoints in the dataset?", df.isnull().values.any())
print("Number of duplicated rows:", df.duplicated().sum())

There are 1338 individual datapoints with 7 columns or features in the dataset:
- age: age of the primary beneficiary
  - ratio (continuous variable)
- sex: sex of the beneficiary (male or female)
  - nominal (categorical variable)
- bmi: body mass index; a value derived from the mass and height of the beneficiary
  - ratio (continuous variable)
- children: number of children covered by the insurance
  - ratio (discrete variable)
- smoker: whether the beneficiary smokes or not (yes or no)
  - nominal (categorical variable)
- region: residential area of the beneficiary in the U.S.
  - nominal (categorical variable)
- charges: individual medical costs billed by the insurance
  - ratio (continuous variable)

**int64** refers to integer numbers; **float64** refers to floating point numbers; and **object** refers to texts or alphanumeric values.

Luckily there are no missing datapoints in the dataset. Missing data can be problematic when carrying out data analysis. To combat this, we can either drop the entire row or use data imputation strategies where the missing value is replaced by a substituted value.
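
The two strategies mentioned above can be sketched on a small, made-up table. The `toy` DataFrame and its values are assumptions for illustration only; they are not part of the course dataset. Median imputation is just one of several possible imputation choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not the course dataset)
toy = pd.DataFrame({
    'age': [25, 40, np.nan, 31],
    'bmi': [22.0, np.nan, 30.5, 27.1],
})

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: replace each missing value with the median of its column
imputed = toy.fillna(toy.median())

print(dropped.shape)  # (2, 2): only the two complete rows remain
print(imputed)
```

Dropping rows is simple but discards data, whereas imputation keeps every row at the cost of introducing estimated values.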

There is one duplicated row, which we will remove.

In [None]:
df.drop_duplicates(keep='first', inplace=True)

### Data visualization

In addition to the descriptive statistics, visualizing the distribution of the data can provide further information about the data itself and can guide how we carry out the analysis.

In [None]:
sns.histplot(df['charges'], kde=True)
print(stats.shapiro(df['charges']));

The distribution appears to be non-Gaussian and positively skewed (or right skewed). We can formally test the normality with the Shapiro–Wilk test and can confirm that the distribution is indeed not normal.
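
As a rough illustration of how to read the Shapiro-Wilk output, the sketch below runs the test on two synthetic samples (both are assumptions, generated here only for demonstration). The test returns a statistic W and a p-value; a small p-value (commonly below 0.05) is taken as evidence against normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)     # synthetic, bell-shaped
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=500) # synthetic, right-skewed

for name, sample in [('normal', normal_sample), ('skewed', skewed_sample)]:
    statistic, p_value = stats.shapiro(sample)
    verdict = 'reject normality' if p_value < 0.05 else 'no evidence against normality'
    print(f'{name}: W = {statistic:.3f}, p = {p_value:.3g} -> {verdict}')
```

With large samples the test can flag even small departures from normality, so it is usually read alongside the histogram rather than on its own.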

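***Note: linear regression assumes that the residuals (errors) are approximately normally distributed; a strongly skewed outcome such as this one often produces skewed residuals.***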

While it is not always necessary to transform data, it often helps with interpretability and to meet certain assumptions for statistical inference. In our case, we will be using the Box-Cox transformation.
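
As a minimal sketch of what the Box-Cox transformation does, the example below applies it to synthetic, strictly positive, right-skewed values (an assumption standing in for `charges`). For a fitted parameter $\lambda \neq 0$ the transform is $(y^{\lambda} - 1)/\lambda$, and for $\lambda = 0$ it is $\log y$. `scipy.special.inv_boxcox` maps transformed values back to the original scale.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Hypothetical positive, right-skewed values standing in for 'charges'
rng = np.random.default_rng(0)
values = rng.lognormal(mean=9.0, sigma=0.8, size=1000)

# stats.boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = stats.boxcox(values)

# inv_boxcox maps transformed values back to the original scale
recovered = inv_boxcox(transformed, fitted_lambda)

print('fitted lambda:', fitted_lambda)
print('skewness before:', stats.skew(values))
print('skewness after: ', stats.skew(transformed))
print('round trip ok:', np.allclose(values, recovered))
```

Keeping the fitted lambda is what makes it possible to convert predictions made on the transformed scale back into the original units.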

In [None]:
charges_transformed = stats.boxcox(df['charges'])[0]
sns.histplot(charges_transformed, kde=True);
plt.xlabel('charges (transformed)')

One of the major assumptions of linear regression is that there should be little to no multicollinearity; that is, the independent variables should be relatively independent of one another.

In [None]:
sns.heatmap(df.corr(numeric_only=True), cmap='Blues', annot=True);  # correlate numeric columns only

Checking the pairwise correlations, we can verify that the independent variables are not strongly correlated with one another.

***Note: low multicollinearity among the independent variables is an assumption of linear regression.***
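
Pairwise correlations can miss cases where one feature is nearly a combination of several others. A complementary check is the variance inflation factor (VIF), sketched below on synthetic data. The `features` table, including the deliberately redundant `weight` column, is an assumption for illustration. The sketch uses the identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 64, n)
bmi = rng.normal(30, 6, n)
# 'weight' is deliberately built to be almost a linear function of bmi
weight = 2.5 * bmi + rng.normal(0, 1, n)
features = pd.DataFrame({'age': age, 'bmi': bmi, 'weight': weight})

# VIF_j equals the j-th diagonal entry of the inverse correlation matrix
corr = features.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns)
print(vif.round(2))
```

A VIF close to 1 suggests little collinearity; values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb rather than a strict cutoff.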

We can now move on to visualizing the independent variables.

In [None]:
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(20,5))
sns.histplot(x=df['age'], kde=True, ax=ax0);
sns.histplot(x=df['bmi'], kde=True, ax=ax1);
sns.countplot(x=df['children'], ax=ax2);

Using prior knowledge, we may be able to come up with a few hypotheses. In particular, older individuals and/or individuals with higher BMI are likely to have more ailments, and thus may incur higher medical costs.

***Correlation is different from causation. We are not establishing causal relationships between variables.***

While BMI is a continuous variable, it is often described as an (ordinal) categorical variable with the following categories: (1) underweight; (2) normal weight; (3) overweight; and (4) obese. As such, we can create a new feature accordingly.
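
One possible way to express this kind of binning is `pd.cut`, sketched here on a handful of made-up BMI values (an assumption for illustration). Unlike plain string labels, the result is an ordered categorical, which preserves the ordinal nature of the categories.

```python
import numpy as np
import pandas as pd

bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2])  # hypothetical values
labels = ['underweight', 'normal weight', 'overweight', 'obese']

# right=False makes each bin closed on the left: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_categories = pd.cut(bmi, bins=[0, 18.5, 25, 30, np.inf], right=False, labels=labels)

print(bmi_categories.tolist())
# ['underweight', 'normal weight', 'normal weight', 'overweight', 'overweight', 'obese', 'obese']
```

The bin edges mirror the thresholds listed above, so a value of exactly 25 falls into 'overweight' and exactly 30 into 'obese'.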

In [None]:
conditions = [(df['bmi'] < 18.5),
              (df['bmi'] >= 18.5) & (df['bmi'] < 25),
              (df['bmi'] >= 25) & (df['bmi'] < 30),
              (df['bmi'] >= 30)]
labels = ['underweight', 'normal weight', 'overweight', 'obese']
df['bmi_categories'] = np.select(conditions, labels, default='unknown')  # string default keeps the result dtype consistent
df.head()

In [None]:
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(20,5))
sns.lineplot(x='age', y='charges', data=df, ax=ax0);
sns.barplot(x='bmi_categories', y='charges', data=df,
            order=['underweight', 'normal weight', 'overweight', 'obese'], ax=ax1);
sns.barplot(x='children', y='charges', data=df, ax=ax2);

As we suspected, medical costs tend to increase with age and BMI. On the other hand, it is difficult to draw any conclusion about medical costs from the number of children/dependents that the beneficiary has. Let us examine the rest of the independent variables.

In [None]:
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(20,5))
sns.countplot(data=df, x='sex', ax=ax0);
sns.countplot(data=df, x='smoker', ax=ax1);
sns.countplot(data=df, x='region', ax=ax2);

*N.b.*, the number of smokers vs. non-smokers is quite unbalanced, which may or may not bias the model during training. This can potentially be addressed with resampling or class-weighting techniques. We will ignore this for now.
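
For reference, two common ways of handling an unbalanced binary label are sketched below on synthetic data; the arrays `X` and `y` are assumptions, not the course dataset. The sketch keeps the class ratio identical in the training and test sets with `stratify`, and asks the classifier to weight classes inversely to their frequency with `class_weight='balanced'`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.2).astype(int)           # roughly 20% positives (synthetic)
X = rng.normal(size=(n, 2)) + y[:, None] * 1.5  # positives are shifted on both features

# stratify=y preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print('positive rate (train, test):', y_train.mean(), y_test.mean())

# class_weight='balanced' up-weights the minority class during fitting
weighted_model = LogisticRegression(class_weight='balanced')
weighted_model.fit(X_train, y_train)
print('test accuracy:', weighted_model.score(X_test, y_test))
```

Whether either step is needed depends on the task; for moderately unbalanced data a plain model is often adequate.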

In [None]:
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(20,5))
sns.pointplot(x='sex', y='charges', data=df, ax=ax0);
sns.pointplot(x='smoker', y='charges', data=df, ax=ax1);
sns.pointplot(x='region', y='charges', data=df, ax=ax2);

Initial thoughts on above plots:
- It appears that there may be sex differences in the medical costs; in particular, males, on average, incur higher medical charges.
- Not surprisingly, medical costs are likely to be higher for smokers than for non-smokers.
- Interestingly, where one lives seems to have an effect on the medical costs, most likely due to differences in state law, healthcare policies, and lifestyle across regions within the U.S.

### Aggregate plots and clusters

In [None]:
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(15,5))
sns.boxplot(x='smoker', y='charges', hue='sex', data=df, ax=ax0);
sns.boxplot(x='smoker', y='charges', hue='bmi_categories', hue_order=labels, data=df, ax=ax1);

## Unsupervised Learning: Clustering

In [None]:
partial_df = df[["bmi", "charges"]]

# Instantiate the clustering model and visualizer
visualizer = KElbowVisualizer(KMeans(), k=(2,5))
visualizer.fit(partial_df) # fit the data to the visualizer
visualizer.show()          # finalize and render the figure (replaces the deprecated poof())

From looking at the visualizer, we can conclude that the optimal number of clusters is 3.
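
The elbow heuristic itself can be sketched without the visualizer: fit K-Means for several values of k and look at where the within-cluster sum of squared distances (exposed by scikit-learn as `inertia_`) stops dropping sharply. The `points` array below is synthetic data drawn around three made-up centres, an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical 2-D data drawn around three centres
centres = np.array([[0, 0], [6, 6], [0, 8]])
points = np.vstack([c + rng.normal(scale=0.7, size=(100, 2)) for c in centres])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

In this synthetic case the decrease flattens after k = 3. Note that K-Means relies on Euclidean distance, so when features have very different scales it may be worth standardizing them first.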

In [None]:
kmeans = KMeans(n_clusters=3) # initialize K-Means with 3 clusters
kmeans.fit(partial_df)
print(kmeans.labels_)
print(kmeans.cluster_centers_)

# Visualize the clusters
sns.scatterplot(x=partial_df.values[:,0], y=partial_df.values[:,1], hue=kmeans.labels_, palette='Accent', s=30).set(title='Clusters Observed by the K-Means Clustering', xlabel='bmi', ylabel='charges');
sns.scatterplot(x=kmeans.cluster_centers_[:,0], y=kmeans.cluster_centers_[:,1], color='black', marker='x', s=300);


### Addendum: aggregate plots
We could have visualized additional clusters by manual plotting.

In [None]:
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(20,5))
sns.scatterplot(x='bmi', y='charges', hue='bmi_categories', data=df, ax=ax0);
sns.scatterplot(x='bmi', y='charges', hue='smoker', data=df, ax=ax1);

In [None]:
sns.scatterplot(x='bmi', y='charges', hue='smoker', data=df).set(title='Charges as a function of bmi, grouped by smoking status');

## Preprocessing

Given that we wish to use a regression model and some of the features are non-numeric, we must transform them via feature encoding. There are several encoding strategies, and the appropriate one depends on the nature of the feature (i.e., nominal vs. ordinal). In our case, we will be using a simple label encoder, which maps each category to an integer between 0 and n_classes-1.
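
Label encoding assigns integers, which implicitly suggests an order. For nominal features with more than two categories, a commonly used alternative is one-hot encoding, sketched below with `pd.get_dummies`. The `toy` table and its region names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows with made-up region names
toy = pd.DataFrame({'region': ['southwest', 'southeast', 'northwest', 'northeast', 'southeast']})

# One indicator column per category; drop_first removes one redundant column
one_hot = pd.get_dummies(toy['region'], prefix='region', drop_first=True, dtype=int)
print(one_hot)
```

Here the dropped category ('northeast', the first in alphabetical order) becomes the baseline, represented by a row of zeros.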

In [None]:
encoder = LabelEncoder()
df['sex_encoded'] = encoder.fit_transform(df['sex'])
df['smoker_encoded'] = encoder.fit_transform(df['smoker'])
df['region_encoded'] = encoder.fit_transform(df['region'])
df['charges_transformed'] = stats.boxcox(df['charges'])[0]
df.head()

## Modelling

We have more or less completed the heavy lifting already. Contrary to popular belief, most of data science and AI (machine learning) is about exploring data, cleaning it up, and feature engineering. That said, we are ready to define the model, train the model, make predictions, and evaluate the performance of our model.

### Linear regression

#### Build the model

In [None]:
# Dropping features that will not be used in the model
X = df.drop(['sex', 'smoker', 'region', 'charges', 'bmi_categories', 'charges_transformed'], axis=1)
y = df['charges_transformed']
X.head()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0) # 90-10 split

We split the dataset into two subsets: (1) a training set and (2) a test set. When we develop a model, we train it using the samples in the training set. Once the model has been trained, we evaluate its performance on the samples in the test set. By testing on unseen data, we avoid an overly optimistic performance estimate and can better gauge the generalizability and predictive power of our model. Here, we have split the dataset into a 90% training set and a 10% test set. Other common choices are 70-30 and 80-20 splits.
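
A single train/test split gives one performance estimate that depends on which rows happen to land in the test set. A related technique, k-fold cross-validation, repeats the split several times and averages the scores. The sketch below uses synthetic `X` and `y` (assumptions for illustration), not the course dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                         # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)  # linear signal plus noise

# cv=5: train on four fifths of the data, score on the remaining fifth, five times
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print('R^2 per fold:', scores.round(3))
print('mean R^2:', scores.mean().round(3))
```

The spread of the per-fold scores gives a rough sense of how much the estimate varies with the choice of test rows.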

In [None]:
linear_model = LinearRegression() # define the model
linear_model.fit(X_train, y_train) # fit our data into the model and train it

print(linear_model.intercept_)
print(linear_model.coef_)

#### Make predictions

We will now make predictions on data (test set) that was not part of the training.

In [None]:
y_pred = linear_model.predict(X_test)

#### Model evaluation

Mean Squared Error $(MSE) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Mean Absolute Error $(MAE) = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$

Both MSE and MAE measure the difference between the actual values and the predicted values. A lower value indicates a better fit.

$R^2 = 1 - \frac{RSS}{TSS}$, where

$RSS=\sum_i (y_i - \hat{y}_i)^2=$ sum of squares of residuals

$TSS=\sum_i (y_i - \bar{y})^2=$ total sum of squares

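The coefficient of determination ($R^2$) is a measure of how well the model explains the variance in the dependent variable. The value typically ranges from 0 to 1, with a higher value indicating a better fit; on held-out data it can be negative if the model predicts worse than always predicting the mean. $R^2$ may also be expressed as a percentage.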
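
To make the formulas concrete, the sketch below computes each metric by hand with NumPy on four made-up values (`y_true` and `y_hat` are assumptions for illustration) and checks the results against scikit-learn.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical actual values
y_hat = np.array([2.5, 5.0, 8.0, 8.5])   # hypothetical predictions

residuals = y_true - y_hat                    # [0.5, 0.0, -1.0, 0.5]
mse = np.mean(residuals ** 2)                 # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
mae = np.mean(np.abs(residuals))              # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
rss = np.sum(residuals ** 2)                  # 1.5
tss = np.sum((y_true - y_true.mean()) ** 2)   # mean is 6, so 9 + 1 + 1 + 9 = 20
r2 = 1 - rss / tss                            # 1 - 1.5 / 20 = 0.925

print(mse, mae, r2)
print(np.isclose(mse, metrics.mean_squared_error(y_true, y_hat)),
      np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat)),
      np.isclose(r2, metrics.r2_score(y_true, y_hat)))
```

Because MSE squares each residual, the single error of 1.0 contributes more to MSE than to MAE.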

In [None]:
print("Mean squared error (MSE) =", metrics.mean_squared_error(y_test, y_pred))
# print("Root mean squared error (RMSE) =", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("Mean absolute error (MAE) =", metrics.mean_absolute_error(y_test, y_pred))
print("R^2 =", metrics.r2_score(y_test, y_pred))

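Our model performs quite well! Keep in mind that the MSE and MAE here are measured on the Box-Cox-transformed scale, not in the original units of `charges`.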

### Logistic regression

Moving on to the classification problem, we will try to predict whether the insurance beneficiary is a smoker or a non-smoker with a logistic regression model.

#### Build the model

In [None]:
X = df[['age', 'bmi', 'children', 'sex_encoded', 'region_encoded', 'charges_transformed']]
y = df['smoker_encoded']
X.head()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
logit_model = LogisticRegression(max_iter=1000)  # allow more iterations so the solver can converge on unscaled features
logit_model.fit(X_train, y_train)

#### Make predictions

In [None]:
y_pred = logit_model.predict(X_test)

#### Model evaluation

For a classification task, it is often useful to visualize the results using a confusion matrix. Recall that in a confusion matrix:

|                        | Actual positive     | Actual negative     |
|------------------------|:-------------------:|:-------------------:|
| **Predicted positive** | true positive (tp)  | false positive (fp) |
| **Predicted negative** | false negative (fn) | true negative (tn)  |

- tp: correct result
- fp: unexpected result (type I error)
- fn: missing result (type II error)
- tn: correct absence of result

We can utilize several metrics, including precision, recall, and the $F_1$ measure:
$$\text{precision} = \frac{tp}{tp + fp}$$
$$\text{recall} = \frac{tp}{tp + fn}$$
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
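
As a small worked example of these formulas, the sketch below derives the three metrics from a confusion matrix built on eight made-up labels (`y_true` and `y_pred` are assumptions for illustration). Note that scikit-learn's `confusion_matrix` places actual classes on the rows and predicted classes on the columns, so for labels `[0, 1]` flattening it yields tn, fp, fn, tp in that order.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # hypothetical actual classes
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])  # hypothetical predicted classes

# Rows are actual classes and columns are predicted classes,
# so ravel() returns tn, fp, fn, tp for labels [0, 1]
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)                          # 3 / (3 + 1) = 0.75
recall = tp / (tp + fn)                             # 3 / (3 + 1) = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

print(tn, fp, fn, tp)  # 3 1 1 3
print(precision, recall, f1)
```

The same values are returned by `metrics.precision_score`, `metrics.recall_score`, and `metrics.f1_score` when called on these arrays.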

In [None]:
cm = metrics.confusion_matrix(y_test, y_pred, labels=logit_model.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=logit_model.classes_)
disp.plot(cmap='Blues')
plt.show();

print(metrics.classification_report(y_test, y_pred, target_names=['non-smoker', 'smoker']));

Again, our model performs very well!

### Assignment Notes

*   Suppose you are dealing with a hospital auditor whose main goal is to reduce hospital costs. Based on this data analysis, the auditor suggests that your hospital should have a capacity limit on patients with high BMI, as they are likely to incur higher medical costs.
*   Based on the data analysis, is this a fair assessment? Do you trust this data set? Do you trust this model? Using the data analysis and what you are learning in this course, how would you respond to the auditor?
*   While working with a data scientist, you concluded that BMI is a useful feature in predicting medical costs, as well as certain patient outcomes. What reservations do you have about using this feature in a model? Are you creating bias unknowingly?

Congratulations, you have completed an optional Python tutorial in the 'AI for Clinician Champions Certificate' program!  
https://michener.ca/ce_course/ai-for-clinician-champions-certificate-program/

