In this challenge, you will predict healthcare costs using a regression algorithm.

You are given a dataset that contains information about different people including their healthcare costs. Use the data to predict healthcare costs based on new data.

The first two cells of this notebook import libraries and the data.

In [None]:
# Import libraries. You may or may not use all of these.
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling
from sklearn.linear_model import LinearRegression

In [None]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
display(dataset.isna().sum()) # check for nan values

In [None]:
# Explore the data (interesting questions)
# How do BMI, smoker status, number of children and age differ between males and females? (make hist/bar plots with 4 subplots)
# How do BMI and being a smoker affect the expenses? Is the effect different for males and females?
# How does the region influence the expenses?
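
The questions above can be approached with grouped summaries before any plotting. The sketch below uses a tiny hand-made frame whose rows are made up for illustration (they are not taken from insurance.csv) but whose column names match the ones used in this notebook. On the real data, `toy` would presumably be replaced by `dataset`.

```python
import pandas as pd

# Hypothetical rows for illustration only; these values are NOT from insurance.csv.
toy = pd.DataFrame({
    "sex":      ["female", "male", "female", "male", "female", "male"],
    "smoker":   ["yes", "no", "no", "yes", "no", "no"],
    "region":   ["southwest", "southeast", "southeast", "northwest", "northeast", "southwest"],
    "bmi":      [27.9, 33.8, 33.0, 22.7, 28.9, 25.7],
    "expenses": [16000.0, 1700.0, 4400.0, 21000.0, 3900.0, 3700.0],
})

# Mean expenses by smoker status and sex (rows: smoker, columns: sex).
by_smoker_sex = toy.pivot_table(index="smoker", columns="sex",
                                values="expenses", aggfunc="mean")

# Mean expenses per region.
by_region = toy.groupby("region")["expenses"].mean()

print(by_smoker_sex)
print(by_region)
```

The same two tables can be passed to `.plot(kind='bar')` to get the bar plots mentioned in the questions.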

Make sure to convert categorical data to numbers. Use 80% of the data as the train_dataset and 20% of the data as the test_dataset.
Pop off the "expenses" column from these datasets to create new datasets called train_labels and test_labels. Use these labels when training your model.

In [None]:
categorical_col = ['sex','smoker','region']
numeric_col = ['age','bmi','children']
for col in categorical_col:
  dataset[col] = dataset[col].astype('category')

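# one-hot encode categorical columns; dtype=float keeps every column numeric,
# since newer pandas versions otherwise emit boolean indicator columns
dataset = pd.get_dummies(dataset, columns=['sex'], prefix='', prefix_sep='', dtype=float)
dataset = pd.get_dummies(dataset, columns=['smoker'], prefix='smoker_', prefix_sep='', dtype=float)
dataset = pd.get_dummies(dataset, columns=['region'], prefix='', prefix_sep='', dtype=float)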

# split the data: 80% for training, the remaining 20% for testing
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

# pop off the expenses column, used as labels
train_labels = train_dataset.pop('expenses')
test_labels = test_dataset.pop('expenses')

# normalize numerical columns: fit a normalization layer on age, bmi and children only
normalizer = layers.Normalization()
normalizer.adapt(np.array(train_dataset[numeric_col], dtype='float32'))
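
To make explicit what the normalization step computes: a normalization layer of this kind rescales each feature to roughly zero mean and unit variance, using statistics learned from the training data. The NumPy sketch below reproduces that arithmetic by hand on made-up numbers. The array values and the helper name `fit_standardizer` are illustrative assumptions, and the small `eps` guard is a choice of this sketch, not something taken from the notebook.

```python
import numpy as np

def fit_standardizer(train_features, eps=1e-7):
    """Return (mean, std) per column, computed from the training rows only."""
    mean = train_features.mean(axis=0)
    std = np.maximum(train_features.std(axis=0), eps)  # avoid dividing by zero
    return mean, std

# Hypothetical [age, bmi, children] rows for illustration only.
train = np.array([[20.0, 22.0, 0.0],
                  [40.0, 30.0, 2.0],
                  [60.0, 38.0, 4.0]])
test = np.array([[40.0, 26.0, 1.0]])

mean, std = fit_standardizer(train)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics on unseen rows

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
print(test_scaled)
```

Reusing the training mean and standard deviation on the test rows keeps information about the test set out of the preprocessing step.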

Create a model and train it with the train_dataset.

In [None]:
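numfeatures = train_dataset.shape[1]

# The Normalization layer needs adapt() to learn the feature means and variances,
# so fit it on all training features before placing it in the model.
feature_normalizer = layers.Normalization(axis=-1)
feature_normalizer.adapt(np.array(train_dataset, dtype='float32'))

model = keras.Sequential([
    feature_normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='linear')
])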

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss='mean_absolute_error', metrics=['mae', 'mse'])

# train the model on train_dataset
# (note: from 3000 epochs the loss doesn't get lower)
history = model.fit(train_dataset, train_labels,
                    epochs=100, verbose=0)
# optional: pass validation_split=0.2 to model.fit to also record val_loss

def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  #plt.plot(history.history['val_loss'], label='val_loss')
  #plt.ylim([0, 10])
  plt.xlabel('Epoch')
  plt.ylabel('Mean absolute error [expenses]')
  plt.legend()
  plt.grid(True)
plot_loss(history)
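
The note about the loss no longer getting lower can be checked from the recorded loss curve rather than by eye. The sketch below is a plain-Python patience rule, similar in spirit to what an early-stopping callback does. The function name `plateau_epoch`, the thresholds and the loss values are all hypothetical. With the model above, the input would presumably be `history.history['loss']`.

```python
def plateau_epoch(losses, patience=3, min_delta=0.0):
    """Return the index of the best epoch once `patience` consecutive epochs
    have passed without improving on it by more than `min_delta`,
    or None if the loss never stalls for that long."""
    best = float("inf")
    best_epoch = None
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch
    return None

# Made-up loss curve for illustration only.
example_losses = [9000.0, 5000.0, 3200.0, 3100.0, 3150.0, 3120.0, 3110.0]
print(plateau_epoch(example_losses, patience=3))
```

A larger `patience` tolerates noisier curves, which may matter here because dropout and a high learning rate tend to make the training loss jump around.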

Run the final cell in this notebook to check your model. The final cell will use the unseen test_dataset to check how well the model generalizes.
To pass the challenge, model.evaluate must return a Mean Absolute Error of under 3500. This means that, on average, the predicted health care costs are off by less than $3500.
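
To make the pass criterion concrete: mean absolute error is the average of the absolute differences between the true and the predicted values. The NumPy sketch below computes it by hand for made-up numbers (not model output), and the helper name `mean_absolute_error` is local to this example. It shows that a single prediction can miss by more than $3500 while the average still stays under the threshold.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical expenses and predictions for illustration only.
true_expenses = [2000.0, 12000.0, 30000.0, 8000.0]
predicted = [2500.0, 11000.0, 24000.0, 8500.0]

# Absolute errors are 500, 1000, 6000 and 500.
mae_value = mean_absolute_error(true_expenses, predicted)
print(mae_value)
print(mae_value < 3500)
```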

The final cell will also predict expenses using the test_dataset and graph the results.

In [None]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)
