<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/03_regression/09_regression_project/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression Project

We have learned about regression and how to build regression models using both scikit-learn and TensorFlow. We will now build a regression model from start to finish. We will acquire data and perform exploratory data analysis and data preprocessing. We will build and tune our model and measure how well our model generalizes.

## Framing the Problem

### Overview

*Friendly Insurance, Inc.* has asked us to do a study to help predict the costs incurred by their policyholders. They have provided us with sample [anonymized data](https://www.kaggle.com/mirichoi0218/insurance) about some of their policyholders for the previous year. The dataset includes the following information:

Column   | Description
---------|-------------
age      | age of primary beneficiary
sex      | gender of the primary beneficiary (male or female)
bmi      | body mass index of the primary beneficiary
children | number of children covered by the plan
smoker   | is the primary beneficiary a smoker (yes or no)
region   | geographic region of the beneficiaries (northeast, southeast, southwest, or northwest)
charges  | costs to the insurance company

We have been asked to create a model that, given the first six columns, can predict the charges the insurance company might incur.

The company wants to see how accurate we can get with our predictions. If we can make a case for our model, they will provide us with the full dataset of all of their customers for the last ten years to see if we can improve on our model and possibly even predict cost per client year over year.

### Exercise 1: Thinking About the Data

Before we dive into the data, let's think about the problem space and the dataset. Consider the questions below.

#### Question 1

Is this problem actually a good fit for machine learning? Why or why not?

##### **Student Solution**

> *Please Put Your Answer Here*

---

##### Answer Key

There are valid arguments for and against using machine learning for this scenario. Some examples are:

**Yes**

> This problem is likely a good fit for machine learning. We have a consistent set of input data that we can get from the insurance company. If we can make a reasonable model with this sample of data, we will get access to a decade of data for all of their clients, which should really help us build a powerful model.
>
> Regression is also a well-proven task of machine learning. There are strong mathematical models for linear regression and impressive cutting-edge deep learning models.

**No**

> Insurance companies invest heavily in well-trained actuaries to make these predictions. We'd be showing quite a bit of hubris to think we could completely replace their work. We might be able to create a model that could supplement their research, but there is already a lot of science around insurability and costs.
>
> Also, the data that we have isn't actually very comprehensive. We know information about the primary person covered by the insurance, but we know very little about their children and nothing about their spouse. We also don't know if the covered persons have other insurance that might be first in line to cover costs. There is so much data we don't have that it seems risky to make a model using this data.

---

#### Question 2

If we do build the machine learning model, what biases might exist in the data? Is there anything that might cause the model to have trouble generalizing to other data? If so, how might we make the model more resilient?

##### **Student Solution**

> *Please Put Your Answer Here*

---

##### Answer Key

An example answer might be:

> There are many potential sources of bias in the data that we have been given. For instance, was the sample pulled randomly from the pool of the insured? Is it actually representative of the entire population of insured? Was there some large-scale medical event (or non-event) in the year that we were given data for that might make our model not generalize well to other years? Are BMI and smoking status really the appropriate health information to partially base our model on?

---

#### Question 3

We have been asked to take input features about people who are insured and predict costs, but we haven't been given much information about how these predictions will be used. What effect might our predictions have on decisions made by the insurance company? How might this affect the insured?

##### **Student Solution**

> *Please Put Your Answer Here*

---

##### Answer Key

An example answer might be:

> Insurance companies need to make predictions about costs in order to set prices for insurance, and in some countries like the United States, to decide if they will insure a person or not. It is likely our predictions will be used by the insurance company to set plan premiums and to make decisions about who to insure.
>
> Given that our predictions will likely be used to determine plan costs and possibly even who gets insured, this model will have a very direct and dramatic effect on many people. The data might have an underlying bias against people with certain conditions, and the model might be able to target those people without even knowing that they have the condition.
>
> This model could have large and damaging effects for the insurance company and for consumers. Great care must be taken when building this type of model.

---

## Exploratory Data Analysis

Now that we have considered the societal implications of our model, we can start looking at the data to get a better understanding of what we are working with.

The data we'll be using for this project can be [found on Kaggle](https://www.kaggle.com/mirichoi0218/insurance). Upload your `kaggle.json` file and run the code block below.

In [0]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

### Exercise 2: EDA and Data Preprocessing

Using as many code and text blocks as you need, download the dataset, explore it, and do any model-independent preprocessing that you think is necessary. Feel free to use any of the tools for data analysis and visualization that we have covered in this course so far. Be sure to do individual column analysis and cross-column analysis. Explain your findings.

#### **Student Solution**

In [0]:
# Add code and text blocks to explore the data and explain your work

---

#### Answer Key

*There are many possible solutions for this exercise. An example solution follows.*

First things first, we need to download the data from Kaggle.

In [0]:
!kaggle datasets download mirichoi0218/insurance
!ls

Next, we load that data into a `DataFrame` and see the columns and data types that we'll be working with.

In [0]:
import pandas as pd

df = pd.read_csv('insurance.zip')

df.dtypes

Looks like we have both objects (strings) and numbers. Let's describe the data. We pass the `include="all"` argument so we can see the string column statistics, too.

In [0]:
df.describe(include="all")

At a glance, it doesn't look like we have any missing data.
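
The `count` row of `describe()` only hints at missing values. A more direct check is to count nulls per column. The sketch below uses a tiny made-up frame (`sample` is hypothetical, not the insurance data) so that the check has something to find:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
sample = pd.DataFrame({
    'age': [19, 33, 45],
    'bmi': [27.9, np.nan, 30.1],
    'smoker': ['yes', 'no', 'no'],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data the equivalent call would be `df.isna().sum()`.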

Next we can look at each of our columns. (In our examples we just use the `.hist()` method. You might have used a package like Matplotlib or seaborn.)

We can create a histogram for 'age'. The histogram shows that this dataset seems to be somewhat biased toward a younger crowd.

In [0]:
_ = df['age'].hist()

'sex' is a string column. Let's see what values we have.

In [0]:
_ = df['sex'].hist()

Looks like there are only two values: 'male' and 'female'. This matches the documentation. The values are also relatively evenly distributed.
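
For string columns, exact counts can be easier to read than a histogram. A minimal sketch using `value_counts()`, with made-up values standing in for the 'sex' column:

```python
import pandas as pd

# Hypothetical values standing in for a string column such as 'sex'.
sex = pd.Series(['male', 'female', 'female', 'male', 'female'])

# Absolute counts and relative shares of each category.
counts = sex.value_counts()
shares = sex.value_counts(normalize=True)
print(counts)
print(shares)
```

The same two calls could be applied to `df['sex']`, `df['smoker']`, or `df['region']`.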

Next we'll look at 'bmi', which is the body mass index.

In [0]:
_ = df['bmi'].hist()

We get a somewhat normal distribution centered around `30`. Upon [further research](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html) we find that a BMI of 30 is considered obese, so many of our insured are considered obese. Let's see how many.

In [0]:
(df['bmi'] >= 30).sum() / df['bmi'].count()

Wow, more than half of our dataset is considered obese (53%). With a little research we find that 40% of the adult US population is considered obese, so our dataset skews quite a bit heavier. Putting our ethics hat back on, we might want to double-check what the insurance company is planning on doing with this model.
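
To see more than the obese/not-obese split, BMI can be binned into the standard adult categories (underweight below 18.5, healthy from 18.5 to below 25, overweight from 25 to below 30, obese at 30 and above). The sketch below shows one way to do this with `pd.cut`. The BMI values are made up for illustration:

```python
import pandas as pd

# Hypothetical BMI values, for illustration only.
bmi = pd.Series([17.0, 22.5, 27.3, 30.0, 35.8])

# Standard adult BMI categories; right=False makes each bin [low, high).
bins = [0, 18.5, 25, 30, float('inf')]
labels = ['underweight', 'healthy', 'overweight', 'obese']
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)

# Share of rows falling in each category.
category_share = bmi_category.value_counts(normalize=True)
print(category_share)
```

Passing `df['bmi']` instead of the toy series would give the category shares for the real dataset.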

Moving on to the 'children' column, we see an unsurprising histogram. Many of our insured are in their 20s, so it isn't shocking that 0 children is the most common value. Likewise, we expect fewer United States families to have 3 or more children.

In [0]:
df['children'].hist()

'smoker' is the next column. We can see that we have two values: 'yes' and 'no'.

In [0]:
df['smoker'].hist()

'region' is our final feature column. We can see that our insured are fairly evenly distributed across the four regions. It would be interesting to know if this is representative of the entire client base for the company.

In [0]:
df['region'].hist()

Finally, we can look at 'charges', which is our target column. The charges range from around `0` to around `60000`, with a long tail toward the high end.

In [0]:
df['charges'].hist()

There are many cross-column analyses that can be performed. In the example below, we calculate average charges per region to see if any region stands out. No region seems to stand out.

In [0]:
# Average charges per region (sum divided by count is just the mean).
charges_per_region = df.groupby('region')['charges'].mean()
charges_per_region

We could also create a correlation matrix to see if there are any strong correlations. Only the numeric columns can be compared at this stage, and none of them are strongly correlated.

In [0]:
df.select_dtypes('number').corr()
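
The string columns are left out of the correlation matrix, so a yes/no column such as 'smoker' never appears in it. One way to include a binary column is to map it to 0/1 first. This is only a sketch: the `toy` frame and the `smoker_flag` column name are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    'age': [20, 30, 40, 50, 60, 25],
    'smoker': ['no', 'yes', 'no', 'yes', 'no', 'yes'],
    'charges': [2000.0, 20000.0, 6000.0, 30000.0, 10000.0, 18000.0],
})

# Encode the yes/no column as 1/0 so it can take part in the correlation.
toy['smoker_flag'] = (toy['smoker'] == 'yes').astype(int)

# Correlate only the numeric columns with the target.
correlations = toy.select_dtypes('number').corr()['charges']
print(correlations)
```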

---

## Modeling

Now that we understand our data a little better, we can build a model. We are trying to predict 'charges', which is a continuous variable. We'll use a regression model to predict 'charges'.

### Exercise 3: Modeling

Using as many code and text blocks as you need, build a model that can predict 'charges' given the features that we have available. To do this, feel free to use any of the toolkits and models that we have explored so far.

You'll be expected to:
1. Prepare the data for the model (or models) that you choose. Remember that some of the data is categorical. In order for your model to use it, you'll need to convert the data to some numeric representation.
1. Build a model or models and adjust parameters.
1. Validate your model with holdout data. Hold out some percentage of your data (10-20%), and use it as a final validation of your model. Print the root mean squared error. We were able to get an RMSE between `3500` and `4000`, but your final RMSE will likely be different.

#### **Student Solution**

In [0]:
# Add code and text blocks to build and validate a model and explain your work

---

#### Answer Key

We chose to create a neural network using TensorFlow 2's Keras API.

The first step for our model was to one-hot encode our categorical columns.

In [0]:
one_hot_features = []

df['is_female'] = df['sex'].apply(lambda x: int(x == 'female'))
df['is_male'] = df['sex'].apply(lambda x: int(x == 'male'))
one_hot_features.extend(['is_female', 'is_male'])

df['is_smoker'] = df['smoker'].apply(lambda x: int(x == 'yes'))
df['is_nonsmoker'] = df['smoker'].apply(lambda x: int(x == 'no'))
one_hot_features.extend(['is_smoker', 'is_nonsmoker'])

df['region_sw'] = df['region'].apply(lambda x: int(x == 'southwest'))
df['region_se'] = df['region'].apply(lambda x: int(x == 'southeast'))
df['region_nw'] = df['region'].apply(lambda x: int(x == 'northwest'))
df['region_ne'] = df['region'].apply(lambda x: int(x == 'northeast'))
one_hot_features.extend(['region_sw', 'region_se', 'region_nw', 'region_ne'])

df[one_hot_features].describe()
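
The hand-written encoding above makes every column explicit. pandas can also generate indicator columns with `pd.get_dummies`. A minimal sketch on made-up rows (the `toy` frame is hypothetical, and the generated column names differ from the ones used in this notebook):

```python
import pandas as pd

# Hypothetical rows shaped like the categorical columns in the dataset.
toy = pd.DataFrame({
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'northwest'],
})

# One indicator column per category value, stored as 0/1 integers.
encoded = pd.get_dummies(toy, columns=['sex', 'smoker', 'region'], dtype=int)
print(encoded.columns.tolist())
```

Note that `get_dummies` only creates columns for values that actually occur, so the toy frame produces no column for 'northeast'.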

Next, we normalized our numeric features.

In [0]:
numeric_features = ['age', 'bmi', 'children']

# Assign by column name so the integer columns can become floats.
df[numeric_features] = (
    (df[numeric_features] - df[numeric_features].min()) /
    (df[numeric_features].max() - df[numeric_features].min()))

df[numeric_features].describe()
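
The cell above computes the minimum and maximum over the whole dataset, including rows that later end up in the holdout set. A stricter alternative, sketched here on made-up data, is to learn the scaling statistics from the training rows only and reuse them for the holdout rows. The `train` and `holdout` frames are hypothetical:

```python
import pandas as pd

# Hypothetical training and holdout rows, for illustration only.
train = pd.DataFrame({'age': [18, 40, 64], 'bmi': [20.0, 30.0, 40.0]})
holdout = pd.DataFrame({'age': [29], 'bmi': [45.0]})

# Learn the min and max from the training rows only.
col_min = train.min()
col_max = train.max()

# Apply the same transformation to both sets.
train_scaled = (train - col_min) / (col_max - col_min)
holdout_scaled = (holdout - col_min) / (col_max - col_min)
print(holdout_scaled)
```

With this approach holdout values can fall outside the range 0 to 1, as the 'bmi' value does here. That is expected when the holdout data contains values the training data never saw.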

We then set our target and features. Since our target has a minimum value near zero, we didn't scale it at all.

In [0]:
target = 'charges'
features = numeric_features + one_hot_features

We then split out training and testing data. We shuffled our data before splitting.

In [0]:
record_count = df['charges'].count()

test_size = int(record_count * .2)

df = df.sample(frac=1.0)

test_df, train_df = df[:test_size], df[test_size:]

test_df['charges'].count(), train_df['charges'].count()
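
The shuffle above produces a different split on every run, so the final RMSE changes from run to run. A minimal sketch of a repeatable split, assuming a fixed seed is acceptable (the `toy` frame and the seed value `42` are placeholders, not part of the original solution):

```python
import pandas as pd

# Hypothetical frame with ten rows, for illustration only.
toy = pd.DataFrame({'charges': range(10)})

# A fixed random_state makes the shuffle, and so the split, repeatable.
shuffled = toy.sample(frac=1.0, random_state=42)

test_size = int(len(shuffled) * 0.2)
test_part, train_part = shuffled[:test_size], shuffled[test_size:]
print(len(test_part), len(train_part))
```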

We loaded TensorFlow 2.

In [0]:
%tensorflow_version 2.x

import tensorflow as tf
tf.__version__

And then built a sequential model that inputs our features and outputs a single value.

In [0]:
from tensorflow import keras
from tensorflow.keras import layers

feature_count = len(numeric_features + one_hot_features)

model = keras.Sequential([
  layers.Dense(128, input_shape=[feature_count], activation='relu'),
  layers.Dense(64, activation='relu'),
  layers.Dense(32, activation='relu'),
  layers.Dense(16, activation='relu'),
  layers.Dense(1, activation='relu')
])

model.summary()

We then compiled the model and trained it for `2500` epochs.

In [0]:
model.compile(
  loss='mse',
  optimizer='Adam',
  metrics=['mse'],
)

EPOCHS = 2500

history = model.fit(
  train_df[features],
  train_df[target],
  epochs=EPOCHS,
  verbose=0,
  validation_split=0.1,
)

We plotted the mean squared error per epoch. In many of our runs, the training and validation mean squared errors crossed as we began to overfit on our training data.

In [0]:
import math
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn import metrics

errors = history.history['mse']
validation_errors = history.history['val_mse']
epochs = np.arange(0, len(errors))
plt.figure(figsize=(14,10))
plt.xlabel('Epoch')
plt.ylabel('MSE')
sns.lineplot(x=epochs, y=errors)
sns.lineplot(x=epochs, y=validation_errors)
_ = plt.legend(['Mean Squared Error', 'Validation Mean Squared Error'])
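
Besides reading the plot, the point where overfitting begins can be located numerically. The sketch below finds the epoch with the lowest validation error and looks at the gap between the two curves after it. The error values are made up. In the notebook they would come from `history.history['mse']` and `history.history['val_mse']`:

```python
import numpy as np

# Hypothetical per-epoch errors, for illustration only.
train_errors = np.array([9.0, 6.0, 4.0, 3.0, 2.5, 2.0])
val_errors = np.array([9.5, 6.5, 5.0, 4.8, 5.2, 5.9])

# Epoch (0-indexed) with the lowest validation error.
best_epoch = int(np.argmin(val_errors))

# Gap between validation and training error at each epoch; a gap that
# keeps growing after best_epoch suggests overfitting.
gap = val_errors - train_errors
print(best_epoch, gap[best_epoch:])
```

Keras also offers the `keras.callbacks.EarlyStopping` callback, which can stop training once a monitored metric stops improving.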

Finally, we computed the root mean squared error on the holdout data.

In [0]:
predictions = model.predict(test_df[features])

# mean_squared_error expects the true values first, then the predictions.
root_mean_squared_error = math.sqrt(
    metrics.mean_squared_error(
      test_df[target],
      predictions
))

print("Root Mean Squared Error (on test data): %0.3f" %
  root_mean_squared_error)
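
An RMSE is easier to judge next to a naive baseline. The sketch below computes RMSE by hand and compares it to a baseline that always predicts the mean of the targets (in practice this would be the mean of the training targets). The `actual` and `predicted` arrays and the `rmse` helper are hypothetical:

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
actual = np.array([1000.0, 5000.0, 12000.0, 30000.0])
predicted = np.array([1500.0, 4000.0, 15000.0, 27000.0])

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

model_rmse = rmse(actual, predicted)

# Baseline: always predict the mean of the targets.
baseline_rmse = rmse(actual, np.full_like(actual, actual.mean()))
print(model_rmse, baseline_rmse)
```

A model whose RMSE is not clearly below the baseline has learned little from the features.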

---