<a href="https://colab.research.google.com/github/jtmonroe/FreeCodeCamp-MachineLearning/blob/main/fcc_predict_health_costs_with_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import libraries. You may or may not use all of these.
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers, Sequential

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

from scipy.stats import norm

def first(iterable):
  return next(iter(iterable))

# Linear Regression Health Costs Calculator

## Introduction

In this challenge, you will predict healthcare costs using a regression algorithm.

You are given a dataset that contains information about different people including their healthcare costs. Use the data to predict healthcare costs based on new data.

The first two cells of this notebook import libraries and the data.

Make sure to convert categorical data to numbers. Use 80% of the data as the train_dataset and 20% of the data as the test_dataset.

Pop the "expenses" column off these datasets to create new datasets called train_labels and test_labels. Use these labels when training your model.

Create a model and train it with the train_dataset. Run the final cell in this notebook to check your model. The final cell will use the unseen test_dataset to check how well the model generalizes.

To pass the challenge, model.evaluate must return a Mean Absolute Error of under 3500. This means that, on average, the model's predicted health care costs are within $3500 of the true values.
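As a small illustration of the pass criterion, Mean Absolute Error is the average of the absolute differences between true and predicted values. The helper name `mean_absolute_error` and the numbers below are made up for this sketch; they are not taken from the dataset.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up expenses, for illustration only.
true_expenses = [12000.0, 3000.0, 45000.0]
predicted_expenses = [10000.0, 4000.0, 48000.0]

# Absolute errors are 2000, 1000 and 3000, so the mean is 2000.
mae = mean_absolute_error(true_expenses, predicted_expenses)
print(mae)         # 2000.0
print(mae < 3500)  # True
```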

The final cell will also predict expenses using the test_dataset and graph the results.

In [None]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
dataset.tail()

# Understanding the Data


## Data Munging

First, we try to understand the dataset through a series of plots and by inspecting the datatypes. We will transform `sex` and `smoker` to `0, 1`. The `region` variable, however, has 4 unique values, so we need to introduce `4-1` new indicator columns: one for each region except a baseline region, which is represented by all zeros.
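A minimal sketch of the `4-1` encoding on a toy frame, assuming four region names used purely as illustrative values. It uses the `drop_first=True` option of `pd.get_dummies`, which drops the first category in sorted order; that may be a different baseline than the one chosen in the notebook's own cell.

```python
import pandas as pd

# Toy frame; the region names are illustrative values.
toy = pd.DataFrame(
    {"region": ["northeast", "southwest", "southeast", "northwest"]}
)

# drop_first=True keeps k-1 indicator columns. The dropped category
# becomes the baseline: every indicator is 0 for it.
encoded = pd.get_dummies(
    toy, columns=["region"], prefix="", prefix_sep="", dtype=int, drop_first=True
)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one column avoids perfectly collinear features, since the four indicators would otherwise always sum to 1.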

Note that we use a small trick to avoid polluting the notebook's global namespace. We do not need ALL the variables from EVERY cell, so we wrap each cell's work in a `_func`-style helper and return only what we need.
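A tiny sketch of that pattern, with a hypothetical helper `_summarize` that is not part of this notebook. The temporaries `total` and `count` exist only inside the function, so only the returned value ends up in the notebook's namespace.

```python
def _summarize(values):
    # Temporaries stay local to the function and vanish when it returns.
    total = sum(values)
    count = len(values)
    return total / count

# Only the result is bound at the top level.
average = _summarize([1, 2, 3, 4])
print(average)  # 2.5
```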

In [None]:
print(f"Unique Regions: {dataset.region.unique()}")
dataset.apply(lambda x: x.dtype).T

In [None]:
def _mutate_dataset(df: pd.DataFrame) -> pd.DataFrame:
  cp_df = df.copy()
  # Encode the binary columns as 0/1 (male -> 1, smoker "yes" -> 1).
  cp_df['sex'] = df.sex.apply(lambda x: int(x.lower() == "male"))
  cp_df['smoker'] = df.smoker.apply(lambda x: int(x.lower() == "yes"))
  # One-hot encode region, then drop one region to serve as the baseline.
  cp_df = pd.get_dummies(cp_df, columns=["region"], prefix="", prefix_sep="", dtype=int)
  cp_df.drop(columns = first(df.region.unique()), inplace=True)
  return cp_df

prepped_dataset = _mutate_dataset(dataset)
prepped_dataset

## Plots

### Non-Categorical Data

In [None]:
def _non_categorical_plots(dataset):
  from itertools import product

  def plot(series, ax):
    series.plot.hist(ax=ax, bins=30, density=True, xlabel=series.name, cmap="Dark2", label="hist")
    mean, sd = norm.fit(series)
    x = np.linspace(series.min(), series.max(), 100)
    ax.plot(x, norm.pdf(x, mean, sd), label="normal")
    twinAx = ax.twinx()
    series.plot.kde(ax=twinAx, color="orange", ind=x, label="kde")

    lines1, labels1 = ax.get_legend_handles_labels()
    lines2, labels2 = twinAx.get_legend_handles_labels()

    ax.legend(lines1 + lines2, labels1 + labels2)

  fig, ((ax00, ax01), (ax10, ax11), (ax20, empty)) = plt.subplots(3, 2, figsize=(20, 12))
  empty.axis('off')

  plot(dataset.age, ax00)
  plot(dataset.bmi, ax01)
  plot(dataset.expenses, ax11)
  plot(dataset.children, ax10)


  columns = ["age", "bmi", "expenses", "children"]
  corr_mat = dataset[columns].corr()
  ax20.imshow(corr_mat, cmap="Dark2")

  axis_tics = np.arange(len(columns))
  ax20.set_xticks(axis_tics)
  ax20.set_yticks(axis_tics)
  ax20.set_xticklabels(columns)
  ax20.set_yticklabels(columns)

  for i, j in product(*map(range, corr_mat.shape)):
    ax20.text(j, i, round(corr_mat.iloc[i,j], 4), ha="center", va="center", color='w')

  fig.suptitle("Continuous Correlations", fontsize=24)
  return fig

_ = _non_categorical_plots(prepped_dataset)

Out of curiosity, we plot the data to get a sense of its distribution, and quickly see that the continuous variables are not normally distributed. We could run a formal test, but the plots make it clear that, for expenses and age, the fitted normal curve is VERY far from the KDE. More importantly, we see that the columns are at most weakly correlated: age and expenses are slightly correlated, but not enough to be meaningful.
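If we did want a formal check, one option is the D'Agostino-Pearson test in `scipy.stats.normaltest`. The sketch below is our own choice of test, run on a synthetic right-skewed sample rather than on the real columns, so the numbers it prints say nothing about the insurance data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed sample standing in for a column such as expenses.
skewed = rng.lognormal(mean=9.0, sigma=1.0, size=500)

# Null hypothesis: the sample was drawn from a normal distribution.
statistic, p_value = stats.normaltest(skewed)
print(f"statistic={statistic:.2f}, p-value={p_value:.3g}")

# A small p-value is evidence against normality.
print("reject normality at the 5% level:", p_value < 0.05)
```

On the notebook's data, the same call could be applied to `prepped_dataset.expenses` or `prepped_dataset.age`.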

### Discrete Data

Note that we use the original dataset here, since its string labels make the bar plots easier to read.

In [None]:
def _categorical_plots(dataset):
  def plot(series, ax):
    series.value_counts().plot.bar(ax=ax, xlabel=series.name)

  fig, ((ax00, ax01), (ax10, ax11)) = plt.subplots(2, 2, figsize=(12, 9))

  plot(dataset.children, ax00)
  plot(dataset.sex, ax01)
  plot(dataset.smoker, ax10)
  plot(dataset.region, ax11)

_categorical_plots(dataset)

In [None]:
labels = prepped_dataset.expenses.values
features = prepped_dataset.drop(columns = "expenses").values

tf_dataset = tf.data.Dataset.from_tensor_slices((features, labels))
train_percent = 0.8
(train, test) = keras.utils.split_dataset(
    tf_dataset, train_percent, 1 - train_percent
)
print(train)
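The final test cell expects `test_dataset` and `test_labels`, as described in the introduction. One possible way to produce those names is a pandas-based split, sketched here on a made-up toy frame. The `random_state` seed and the toy values are assumptions for illustration; this is an alternative to the `tf.data` split above, not a replacement chosen by the author.

```python
import pandas as pd

# Toy stand-in for the prepared frame; all values are made up.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61, 37, 24, 48, 56],
    "bmi": [27.9, 22.7, 30.1, 25.3, 33.0, 29.8, 24.6, 31.2, 26.4, 28.5],
    "expenses": [1700.0, 4500.0, 8200.0, 11000.0, 3900.0,
                 14000.0, 6100.0, 2500.0, 9300.0, 12500.0],
})

# Sample 80% of the rows for training; the remaining rows form the test set.
train_dataset = toy.sample(frac=0.8, random_state=0)
test_dataset = toy.drop(train_dataset.index)

# pop removes the column in place and returns it, giving the labels.
train_labels = train_dataset.pop("expenses")
test_labels = test_dataset.pop("expenses")

print(train_dataset.shape, test_dataset.shape)  # (8, 2) (2, 2)
```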


# Building the Model

In [None]:
model = Sequential([
    # One input per feature column.
    layers.Input(shape=(features.shape[1],)),
])
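A minimal sketch of what a complete regression model could look like, trained on synthetic data so that it runs on its own. The layer sizes, learning rate, epoch count, and the use of a `Normalization` layer are all assumptions for illustration, not the notebook's final architecture. Compiling with the `mae` and `mse` metrics makes `model.evaluate` return three values, matching the unpacking in the test cell.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic features and labels standing in for the real data.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8)).astype("float32")
weights = np.arange(1, 9, dtype="float32")
labels = (features @ weights + 5.0).astype("float32")

# Normalization learns each feature's mean and variance from the data.
normalizer = layers.Normalization(axis=-1)
normalizer.adapt(features)

model = keras.Sequential([
    layers.Input(shape=(features.shape[1],)),
    normalizer,
    layers.Dense(64, activation="relu"),
    layers.Dense(1),  # single output: the predicted value
])

# One loss plus two metrics, so evaluate returns [loss, mae, mse].
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss="mae",
    metrics=["mae", "mse"],
)
model.fit(features, labels, epochs=5, validation_split=0.2, verbose=0)

loss, mae, mse = model.evaluate(features, labels, verbose=0)
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}")
```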

In [None]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)
