<a href="https://colab.research.google.com/github/junwei2110/FreeCodeCamp_MachineLearning/blob/Linear-DNN-Regression-Health-Costs/fcc_predict_health_costs_with_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will predict healthcare costs using a regression algorithm.

You are given a dataset that contains information about different people including their healthcare costs. Use the data to predict healthcare costs based on new data.

The first two cells of this notebook import libraries and the data.

Make sure to convert categorical data to numbers. Use 80% of the data as the `train_dataset` and 20% of the data as the `test_dataset`.
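
As a minimal sketch of these two steps, assuming a tiny made-up table (the rows and the `toy_*` names below are illustrative, not taken from `insurance.csv`), one-hot encoding with `pd.get_dummies` followed by an 80/20 split with `DataFrame.sample` could look like this:

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from insurance.csv.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61, 37, 24, 48, 30],
    "smoker": ["yes", "no", "no", "yes", "no", "no", "yes", "no", "no", "yes"],
    "expenses": [16884.9, 4449.5, 8240.6, 24520.3, 3866.9,
                 13228.8, 20773.6, 2721.3, 9566.0, 18259.2],
})

# One indicator column per category value; dtype=float keeps every column numeric.
encoded = pd.get_dummies(toy, columns=["smoker"], dtype=float)

# 80% of the rows for training, the remaining 20% for testing.
toy_train = encoded.sample(frac=0.8, random_state=0)
toy_test = encoded.drop(toy_train.index)

print(encoded.columns.tolist())
print(len(toy_train), len(toy_test))
```

Any splitting method is fine as long as the two parts do not share rows; `sklearn.model_selection.train_test_split` with `test_size=0.2` is an equally valid choice.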

`pop` off the "expenses" column from these datasets to create new datasets called `train_labels` and `test_labels`. Use these labels when training your model.
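
The following sketch shows what `pop` does to a DataFrame. The small `toy_features` frame is a hypothetical stand-in for `train_dataset`, with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for train_dataset.
toy_features = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, 22.7, 30.1],
    "expenses": [16884.9, 4449.5, 8240.6],
})

# pop returns the column as a Series and removes it from the frame in place.
toy_labels = toy_features.pop("expenses")

print(toy_features.columns.tolist())  # "expenses" is no longer a feature column
print(toy_labels.tolist())
```

Because `pop` modifies the frame in place, running it twice on the same frame raises a `KeyError` the second time.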

Create a model and train it with the `train_dataset`. Run the final cell in this notebook to check your model. The final cell will use the unseen `test_dataset` to check how well the model generalizes.

To pass the challenge, `model.evaluate` must return a Mean Absolute Error of under 3500. This means that, on average, the predicted health care costs are within $3500 of the actual costs.
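
To make the metric concrete, here is a small worked example with made-up numbers (the arrays are assumptions for illustration, not model output). The Mean Absolute Error is the average of the absolute differences between predictions and true values:

```python
import numpy as np

# Made-up numbers for illustration only.
true_costs = np.array([1200.0, 8500.0, 23000.0, 4100.0])
predicted_costs = np.array([2000.0, 7000.0, 27000.0, 4600.0])

# Absolute errors are 800, 1500, 4000 and 500; their mean is the MAE.
mae = np.mean(np.abs(predicted_costs - true_costs))
print(mae)  # 1700.0
```

Note that one prediction here is off by 4000, yet the MAE of 1700 is still below the 3500 threshold, because the metric is an average over all examples.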

The final cell will also predict expenses using the `test_dataset` and graph the results.

In [None]:
# Import libraries. You may or may not use all of these.
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split # This is for splitting the dataset into training and testing

# Most of this solution follows the TensorFlow regression tutorial:
# https://www.tensorflow.org/tutorials/keras/regression#one_variable
# There are many ways to build a regression model: pure TensorFlow (Algo 1), scikit-learn, or Keras. This example uses Keras.

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

In [None]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
dataset.tail()

In [None]:
dataset.info()
# No null values
# 4 numerical (age, bmi, no. of children, expenses) and 3 categorical (sex, smoker, region)

In [None]:
# A pairplot shows the distribution of each feature and the pairwise relationships, which can reveal obvious patterns
sns.pairplot(dataset, hue='region')

In [None]:
# First, convert the categorical columns (sex, smoker, region) into numbers
# There are several ways to do this; see https://www.tensorflow.org/tutorials/keras/regression#one_variable
# Note that with the pure TensorFlow approach (Algo 1), the categorical columns have to be converted using tf.feature_column
# Since we are using Keras, the model accepts a pandas DataFrame directly

# dtype=float keeps the indicator columns numeric (newer pandas versions default to bool)
dataset = pd.get_dummies(dataset, columns=['sex', 'smoker', 'region'], prefix='', prefix_sep='', dtype=float)
# get_dummies replaces each categorical column with one indicator column per unique value

dataset.tail()

In [None]:
# Next, split the dataset into training and test sets in an 80:20 ratio
# Then separate the dependent variable (expenses) from the independent variables

train_dataset, test_dataset = train_test_split(dataset, test_size=0.2)

train_labels = train_dataset.pop('expenses')
test_labels = test_dataset.pop('expenses')



In [None]:
# Build and train the model (this cell uses a multiple linear regression model)

# It is good practice to normalize features that have different scales and ranges
# The features are multiplied by the model weights, so the scale of the outputs is affected by the scale of the inputs
normalizer = preprocessing.Normalization()
# Compute the mean and variance of the training features so the layer can rescale its inputs
normalizer.adapt(np.array(train_dataset, dtype='float32'))

# Build the model with the normalizer layer and dense layer
model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

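# Compile the model. The model has a single output, so it takes a single loss;
# the challenge is scored on Mean Absolute Error, so that is the loss used here
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error',
    metrics=['mae', 'mse'])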

history = model.fit(
    train_dataset, train_labels, 
    epochs=200,
    # suppress logging (meaning you cannot see the training progress of the model)
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2)



In [None]:
# Build and train the model (this cell uses a Deep Neural Network (DNN) with a linear output layer)
# The code is basically the same as for the linear model, except that the model is expanded to include some "hidden" non-linear layers.
# The name "hidden" here just means not directly connected to the inputs or outputs.
# This model contains a few more layers than the linear model:
#   - The normalization layer.
#   - Two hidden, non-linear Dense layers using the ReLU activation.
#   - A linear single-output layer.


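# Create the normalization layer and fit its mean and variance to the training features
normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_dataset, dtype='float32'))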

# Define the function to build and compile the model with the normalizer layer and dense layer
def build_and_compile_model(norm):
  model = keras.Sequential([
      norm,
      layers.Dense(64, activation='relu'),
      layers.Dense(64, activation='relu'),
      layers.Dense(1)
  ])

  # Single-output model, so a single loss; MAE matches the metric the challenge checks
  model.compile(
      optimizer=tf.keras.optimizers.Adam(0.001),
      loss='mean_absolute_error',
      metrics=['mae', 'mse'])
  
  return model

model = build_and_compile_model(normalizer)

history = model.fit(
    train_dataset, train_labels, 
    epochs=150,
    # suppress logging (meaning you cannot see the training progress of the model)
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2)



In [None]:
# Visualize model's training progress using the data stored in the history object (from model.fit)

def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.plot(history.history['val_loss'], label='val_loss')
  plt.ylim([0, 10000])
  plt.xlabel('Epoch')
  plt.ylabel('Error [expenses]')
  plt.legend()
  plt.grid(True)

plot_loss(history)

In [None]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)
