# Insurance cost prediction using linear regression


We're going to use information such as a person's age, sex, BMI, number of children and smoking habit to predict their yearly medical charges. This kind of model is useful for insurance companies when setting a person's yearly insurance premium. The dataset for this problem is taken from [Kaggle](https://www.kaggle.com/mirichoi0218/insurance).


We will create a model with the following steps:
1. Download and explore the dataset
2. Prepare the dataset for training
3. Create a linear regression model
4. Train the model to fit the data
5. Make predictions using the trained model




In [None]:
# Uncomment and run the appropriate command for your operating system, if required
 
# Linux / Binder
# !pip install numpy matplotlib pandas torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
 
# Windows
# !pip install numpy matplotlib pandas torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
 
# MacOS
# !pip install numpy matplotlib pandas torch torchvision torchaudio

In [None]:
import torch
import jovian
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split

## Step 1: Download and explore the data

Let us begin by downloading the data. We'll use the `download_url` function from `torchvision` to get the data as a CSV (comma-separated values) file.

In [None]:
DATASET_URL = "https://hub.jovian.ml/wp-content/uploads/2020/05/insurance.csv"
DATA_FILENAME = "insurance.csv"
download_url(DATASET_URL, '.')

Downloading https://hub.jovian.ml/wp-content/uploads/2020/05/insurance.csv to ./insurance.csv



To load the dataset into memory, we'll use the `read_csv` function from the `pandas` library. The data will be loaded as a Pandas dataframe. See this short tutorial to learn more: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

In [None]:
dataframe_raw = pd.read_csv(DATA_FILENAME)
dataframe_raw.head()

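| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |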


We're going to customize the data slightly, so that every participant receives a slightly different version of the dataset. Fill in your name below as a string (enter at least 5 characters).

In [None]:
your_name = 'jeshlindonna' # at least 5 characters

The `customize_dataset` function will customize the dataset slightly using your name as a source of random numbers.

In [None]:
def customize_dataset(dataframe_raw, rand_str):
    dataframe = dataframe_raw.copy(deep=True)
    # drop some rows
    dataframe = dataframe.sample(int(0.95*len(dataframe)), random_state=int(ord(rand_str[0])))
    # scale input
    dataframe.bmi = dataframe.bmi * ord(rand_str[1])/100.
    # scale target
    dataframe.charges = dataframe.charges * ord(rand_str[2])/100.
    # drop column
    if ord(rand_str[3]) % 2 == 1:
        dataframe = dataframe.drop(['region'], axis=1)
    return dataframe
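To make the role of the name concrete, here is a small sketch that only evaluates the `ord(...)` expressions used inside `customize_dataset` for the name entered above. The variable names (`seed`, `bmi_scale`, and so on) are illustrative and do not appear in the notebook.

```python
# Illustration only: how the first four characters of the name drive the customization.
name = 'jeshlindonna'  # the name used in this notebook

seed = ord(name[0])                   # random_state used when sampling 95% of the rows
bmi_scale = ord(name[1]) / 100.       # multiplier applied to the bmi column
charges_scale = ord(name[2]) / 100.   # multiplier applied to the charges column
drop_region = ord(name[3]) % 2 == 1   # region is dropped only when this code point is odd

print(seed, bmi_scale, charges_scale, drop_region)
# → 106 1.01 1.15 False
```

For this name the `region` column is kept, which matches the customized dataframe shown next.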

In [None]:
dataframe = customize_dataset(dataframe_raw, your_name)
dataframe.head()

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 619 | 55 | female | 37.471 | 0 | no | southwest | 12320.6906 |
| 319 | 32 | male | 37.70835 | 1 | no | northeast | 5367.748797 |
| 34 | 28 | male | 36.764 | 1 | yes | southwest | 58873.743011 |
| 403 | 49 | male | 32.623 | 3 | no | northwest | 11809.879 |
| 113 | 21 | female | 36.0772 | 0 | no | northwest | 2765.44387 |


Let us answer some basic questions about the dataset. 


**Q: How many rows does the dataset have?**

In [None]:
num_rows = dataframe.shape[0]
print(num_rows)

1271


**Q: How many columns does the dataset have?**

In [None]:
num_cols = dataframe.shape[1]
print(num_cols)

7


**Q: What are the column titles of the input variables?**

In [None]:
input_cols = ['age' , 'sex' , 'bmi', 'children' , 'smoker', 'region']

**Q: Which of the input columns are non-numeric or categorical variables?**

Hint: `sex` is one of them. List the columns that are not numbers.

In [None]:
categorical_cols = ['sex', 'smoker', 'region']

**Q: What are the column titles of output/target variable(s)?**

In [None]:
output_cols = ['charges']
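The answers above can also be derived from the dataframe itself rather than typed by hand. The following is a minimal sketch on the five sample rows printed by `dataframe_raw.head()` earlier, inlined so that it runs without a download. The names `sample_df`, `sample_input_cols` and `sample_categorical_cols` are illustrative, and treating `charges` as the only target is the assumption stated in the notebook.

```python
import io
import pandas as pd

# The five rows shown by dataframe_raw.head(), inlined for a self-contained example
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample_df.shape  # (number of rows, number of columns)

# Every column except the target is an input
sample_input_cols = [c for c in sample_df.columns if c != 'charges']

# Inputs whose dtype is not numeric are the categorical ones
sample_categorical_cols = (
    sample_df[sample_input_cols].select_dtypes(exclude='number').columns.tolist()
)

print(n_rows, n_cols)
print(sample_input_cols)
print(sample_categorical_cols)
```

Running the same expressions on `dataframe` instead of `sample_df` should reproduce the values entered in the cells above.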

Remember to commit your notebook to Jovian after every step, so that you don't lose your work.

In [None]:
!pip install jovian --upgrade -q

In [None]:
import jovian

In [None]:
jovian.commit(project='my-project')

[jovian] Detected Colab notebook...
[jovian] Please enter your API key ( from https://jovian.ai/ ):
API KEY: ··········
[jovian] Uploading colab notebook to Jovian...
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/mm20b029/my-project


'https://jovian.ai/mm20b029/my-project'

## Step 2: Prepare the dataset for training

We need to convert the data from the Pandas dataframe into PyTorch tensors for training. The first step is to convert it to numpy arrays. If you've filled out `input_cols`, `categorical_cols` and `output_cols` correctly, the following function will perform the conversion to numpy arrays.

In [None]:
def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    dataframe1 = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        dataframe1[col] = dataframe1[col].astype('category').cat.codes
    # Extract inputs & outputs as numpy arrays
    inputs_array = dataframe1[input_cols].to_numpy()
    targets_array = dataframe1[output_cols].to_numpy()
    return inputs_array, targets_array

Read through the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) to understand how we're converting categorical variables into numbers.
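As a quick illustration of what `.astype('category').cat.codes` does, here is a small sketch on a hand-made series of region names (the series itself is made up for the example). Pandas sorts the distinct values and assigns each one its position in that sorted list.

```python
import pandas as pd

# A made-up series using the four region names from the dataset
regions = pd.Series(['southwest', 'northeast', 'southwest', 'northwest', 'southeast'])
as_category = regions.astype('category')

# Categories are the sorted unique values (alphabetical for strings)
print(list(as_category.cat.categories))
# → ['northeast', 'northwest', 'southeast', 'southwest']

# Each value is replaced by the index of its category
print(as_category.cat.codes.tolist())
# → [3, 0, 3, 1, 2]
```

This is why `southwest` appears as `3` and `northeast` as `0` in the region column of the arrays produced by `dataframe_to_arrays`.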

In [None]:
inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs_array, targets_array

(array([[55.     ,  0.     , 37.471  ,  0.     ,  0.     ,  3.     ],
        [32.     ,  1.     , 37.70835,  1.     ,  0.     ,  0.     ],
        [28.     ,  1.     , 36.764  ,  1.     ,  1.     ,  3.     ],
        ...,
        [54.     ,  1.     , 29.492  ,  1.     ,  0.     ,  3.     ],
        [18.     ,  1.     , 27.6336 ,  1.     ,  1.     ,  0.     ],
        [22.     ,  1.     , 37.4407 ,  2.     ,  1.     ,  2.     ]]),
 array([[12320.6906   ],
        [ 5367.7487975],
        [58873.743011 ],
        ...,
        [12001.5104   ],
        [19755.48476  ],
        [43107.116695 ]]))

**Q: Convert the numpy arrays `inputs_array` and `targets_array` into PyTorch tensors. Make sure that the data type is `torch.float32`.**

In [None]:
import numpy as np
inputs_array=inputs_array.astype(np.float32)
inputs = torch.from_numpy(inputs_array)
targets_array=targets_array.astype(np.float32)
targets = torch.from_numpy(targets_array)

In [None]:
inputs.dtype, targets.dtype

(torch.float32, torch.float32)
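The conversion above casts in NumPy first and then wraps the result with `torch.from_numpy`. An alternative, sketched here on a tiny made-up array, is to let `torch.tensor` copy and cast in a single call. Both routes should give the same `float32` values, so the choice is mostly a matter of taste.

```python
import numpy as np
import torch

# A small made-up array; NumPy uses float64 by default
arr64 = np.array([[55.0, 0.0, 37.471], [32.0, 1.0, 37.70835]])

via_from_numpy = torch.from_numpy(arr64.astype(np.float32))  # cast in NumPy, then wrap
via_tensor = torch.tensor(arr64, dtype=torch.float32)        # copy and cast in one call

print(via_from_numpy.dtype, via_tensor.dtype)
print(torch.equal(via_from_numpy, via_tensor))
```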

Next, we need to create PyTorch datasets & data loaders for training & validation. We'll start by creating a `TensorDataset`.

In [None]:
dataset = TensorDataset(inputs, targets)
len(dataset)

1271

**Q: Pick a number between `0.1` and `0.2` to determine the fraction of data that will be used for creating the validation set. Then use `random_split` to create training & validation datasets.**

In [None]:
val_percent = 0.15 # between 0.1 and 0.2
val_size = int(len(dataset) * val_percent)
train_size = len(dataset) - val_size

# Use the random_split function to split dataset into 2 parts of the desired length
train_ds, val_ds = random_split(dataset, [train_size, val_size])
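`random_split` shuffles differently on every run, so the validation loss can vary between runs for reasons unrelated to the model. If repeatable experiments are wanted, one option is to pass a seeded generator. The sketch below uses a small stand-in dataset, and the seed value `42` and all `demo_*` names are arbitrary choices, not part of the assignment.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with 40 samples (inputs and targets are just the numbers 0..39)
values = torch.arange(40, dtype=torch.float32).unsqueeze(1)
demo_ds = TensorDataset(values, values)

demo_val_size = int(len(demo_ds) * 0.15)
demo_train_size = len(demo_ds) - demo_val_size

def seeded_split(seed):
    # A fresh generator with a fixed seed yields the same permutation every time
    gen = torch.Generator().manual_seed(seed)
    return random_split(demo_ds, [demo_train_size, demo_val_size], generator=gen)

train_a, val_a = seeded_split(42)
train_b, val_b = seeded_split(42)

print(len(train_a), len(val_a))
print(list(val_a.indices) == list(val_b.indices))  # same seed, same validation rows
```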

Finally, we can create data loaders for training & validation.

**Q: Pick a batch size for the data loader.**

In [None]:
batch_size = 100

In [None]:
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)

Let's look at a batch of data to verify everything is working fine so far.

In [None]:
for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break

inputs: tensor([[18.0000,  0.0000, 37.2185,  0.0000,  1.0000,  2.0000],
        [18.0000,  0.0000, 21.8766,  0.0000,  1.0000,  0.0000],
        [51.0000,  1.0000, 40.0970,  1.0000,  0.0000,  3.0000],
        [32.0000,  0.0000, 31.8554,  1.0000,  0.0000,  0.0000],
        [56.0000,  0.0000, 26.8660,  1.0000,  0.0000,  1.0000],
        [61.0000,  0.0000, 44.4400,  0.0000,  0.0000,  3.0000],
        [34.0000,  1.0000, 25.5227,  1.0000,  0.0000,  1.0000],
        [30.0000,  0.0000, 20.1495,  3.0000,  0.0000,  1.0000],
        [26.0000,  0.0000, 42.8240,  1.0000,  0.0000,  3.0000],
        [53.0000,  1.0000, 34.4460,  0.0000,  1.0000,  0.0000],
        [50.0000,  1.0000, 32.1432,  0.0000,  1.0000,  0.0000],
        [60.0000,  0.0000, 30.8050,  0.0000,  0.0000,  3.0000],
        [41.0000,  1.0000, 31.0878,  3.0000,  1.0000,  0.0000],
        [33.0000,  1.0000, 36.1075,  1.0000,  1.0000,  2.0000],
        [57.0000,  0.0000, 22.4523,  0.0000,  0.0000,  0.0000],
        [24.0000,  0.0000, 34.32

Let's save our work by committing to Jovian.

In [None]:
project_name = '02-insurance-linear-regression'
jovian.commit(project=project_name, environment=None)

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Committed successfully! https://jovian.ai/mm20b029/02-insurance-linear-regression


'https://jovian.ai/mm20b029/02-insurance-linear-regression'

## Step 3: Create a Linear Regression Model

Our model itself is a fairly straightforward linear regression (we'll build more complex models in the next assignment). 


In [None]:
input_size = len(input_cols)
output_size = len(output_cols)

**Q: Complete the class definition below by filling out the constructor (`__init__`), `forward`, `training_step` and `validation_step` methods.**

Hint: Think carefully about picking a good loss function (it's not cross entropy). Maybe try 2-3 of them and see which one works best. See https://pytorch.org/docs/stable/nn.functional.html#loss-functions

In [None]:
class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # One linear layer mapping the input features to the predicted charges
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        out = self.linear(xb)
        return out

    def training_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss (mean squared error) from this model's own predictions
        loss = F.mse_loss(out, targets)
        return loss

    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss (mean squared error) from this model's own predictions
        loss = F.mse_loss(out, targets)
        return {'val_loss': loss.detach()}
        
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        return {'val_loss': epoch_loss.item()}
    
    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 20th epoch
        if (epoch+1) % 20 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))
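To get a feel for the loss-function hint above, the sketch below compares two regression losses from `torch.nn.functional` on two hand-picked predictions. The numbers are made up so the arithmetic can be checked by hand. Mean squared error penalizes the larger miss much more heavily than mean absolute error does.

```python
import torch
import torch.nn.functional as F

# Two made-up predictions and targets; the errors are 1 and -3
demo_preds = torch.tensor([[2.0], [4.0]])
demo_targets = torch.tensor([[1.0], [7.0]])

mse = F.mse_loss(demo_preds, demo_targets)  # mean of squared errors: (1 + 9) / 2
mae = F.l1_loss(demo_preds, demo_targets)   # mean of absolute errors: (1 + 3) / 2

print(mse.item(), mae.item())
# → 5.0 2.0
```

Because losses are computed on different scales, values obtained with different loss functions are not directly comparable with each other.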

Let us create a model using the `InsuranceModel` class. You may need to come back later and re-run the next cell to reinitialize the model, in case the loss becomes `nan` or `infinity`.

In [None]:
model = InsuranceModel()

Let's check out the weights and biases of the model using `model.parameters()`.

In [None]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.2416,  0.3834, -0.3133,  0.3621,  0.0868, -0.1043]],
        requires_grad=True), Parameter containing:
 tensor([-0.2827], requires_grad=True)]

One final commit before we train the model.

In [None]:
jovian.commit(project=project_name, environment=None)

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Committed successfully! https://jovian.ai/mm20b029/02-insurance-linear-regression


'https://jovian.ai/mm20b029/02-insurance-linear-regression'

## Step 4: Train the model to fit the data

To train our model, we'll use the same `fit` function explained in the lecture. That's the benefit of defining a generic training loop - you can use it for any problem.

In [None]:
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)
 
def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase 
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history

**Q: Use the `evaluate` function to calculate the loss on the validation set before training.**

In [None]:
result = evaluate(model, val_loader)  # Use the evaluate function
print(result)

{'val_loss': 393058432.0}



We are now ready to train the model. You may need to run the training loop many times, with different numbers of epochs and different learning rates, to get a good result. Also, if your loss becomes too large (or `nan`), you may have to re-initialize the model by running the cell `model = InsuranceModel()`. Experiment with this for a while, and try to get the loss as low as possible.

**Q: Train the model 4-5 times with different learning rates & for different number of epochs.**

Hint: Vary learning rates by orders of 10 (e.g. `1e-2`, `1e-3`, `1e-4`, `1e-5`, `1e-6`) to figure out what works.
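One way to act on this hint is a small sweep: train a fresh model for each candidate learning rate and compare the final losses. The sketch below does this on synthetic data (`y = 3x + 2` plus noise), not on the insurance dataset, and the helper `final_loss_for` is a hypothetical name introduced for the example. Rates that are too large may produce `inf` or `nan`, while rates that are too small barely reduce the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic stand-in data: y = 3x + 2 plus a little noise
x = torch.rand(200, 1) * 10
y = 3 * x + 2 + 0.1 * torch.randn(200, 1)

def final_loss_for(lr, epochs=50):
    """Train a fresh one-feature linear model with plain SGD and return the last loss."""
    demo_model = nn.Linear(1, 1)
    opt = torch.optim.SGD(demo_model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.mse_loss(demo_model(x), y)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return loss.item()

# Vary the learning rate by factors of 10, as the hint suggests
sweep = {lr: final_loss_for(lr) for lr in [1e-1, 1e-2, 1e-3, 1e-4]}
for lr, loss in sweep.items():
    print(f"lr={lr:g}  final loss={loss}")
```

The same pattern could be applied to the notebook by re-creating `InsuranceModel()` before each call to `fit`, so that every learning rate starts from freshly initialized weights.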

In [None]:
epochs = 25
lr = 0.00001
history1 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 169009088.0000
Epoch [25], val_loss: 168999504.0000


In [None]:
epochs = 41
lr = 0.00001
history2 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 89679144.0000
Epoch [40], val_loss: 89629096.0000
Epoch [41], val_loss: 89644296.0000


In [None]:
epochs = 900
lr = 0.00018
history3 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 93303640.0000
Epoch [40], val_loss: 89543960.0000
Epoch [60], val_loss: 87552360.0000
Epoch [80], val_loss: 86391944.0000
Epoch [100], val_loss: 85331216.0000
Epoch [120], val_loss: 84575728.0000
Epoch [140], val_loss: 85133000.0000
Epoch [160], val_loss: 83200168.0000
Epoch [180], val_loss: 82345752.0000
Epoch [200], val_loss: 81870344.0000
Epoch [220], val_loss: 80883248.0000
Epoch [240], val_loss: 80643808.0000
Epoch [260], val_loss: 79528552.0000
Epoch [280], val_loss: 78877080.0000
Epoch [300], val_loss: 78298040.0000
Epoch [320], val_loss: 78973864.0000
Epoch [340], val_loss: 78865704.0000
Epoch [360], val_loss: 77043944.0000
Epoch [380], val_loss: 76100088.0000
Epoch [400], val_loss: 75299736.0000
Epoch [420], val_loss: 76239352.0000
Epoch [440], val_loss: 74573920.0000
Epoch [460], val_loss: 74024144.0000
Epoch [480], val_loss: 73067704.0000
Epoch [500], val_loss: 72813784.0000
Epoch [520], val_loss: 72685112.0000
Epoch [540], val_loss: 71521240.0000
Epoch

In [None]:
epochs = 69
lr = 0.00018
history4 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 63841184.0000
Epoch [40], val_loss: 65398048.0000
Epoch [60], val_loss: 63129468.0000
Epoch [69], val_loss: 62947508.0000


In [None]:
epochs = 3825
lr = 0.00018
history5 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 46824960.0000
Epoch [40], val_loss: 48063444.0000
Epoch [60], val_loss: 47299884.0000
Epoch [80], val_loss: 46972544.0000
Epoch [100], val_loss: 46916764.0000
Epoch [120], val_loss: 47079740.0000
Epoch [140], val_loss: 46534708.0000
Epoch [160], val_loss: 47061776.0000
Epoch [180], val_loss: 46551552.0000
Epoch [200], val_loss: 47671340.0000
Epoch [220], val_loss: 46977844.0000
Epoch [240], val_loss: 46362924.0000
Epoch [260], val_loss: 46520244.0000
Epoch [280], val_loss: 47078016.0000
Epoch [300], val_loss: 46314444.0000
Epoch [320], val_loss: 46668892.0000
Epoch [340], val_loss: 46194708.0000
Epoch [360], val_loss: 48610004.0000
Epoch [380], val_loss: 46543232.0000
Epoch [400], val_loss: 47217136.0000
Epoch [420], val_loss: 46231216.0000
Epoch [440], val_loss: 46105908.0000
Epoch [460], val_loss: 46915700.0000
Epoch [480], val_loss: 46158860.0000
Epoch [500], val_loss: 46850512.0000
Epoch [520], val_loss: 46331788.0000
Epoch [540], val_loss: 45895980.0000
Epoch

**Q: What is the final validation loss of your model?**

In [None]:
val_loss = 43749332.0000
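A mean squared error in the tens of millions is hard to interpret on its own. Assuming the model was trained with `F.mse_loss`, as in the class defined earlier, taking the square root gives a root-mean-square error in the same units as the (customized) `charges` column. The variable names below are illustrative.

```python
import math

final_val_mse = 43749332.0       # the validation loss recorded above
rmse = math.sqrt(final_val_mse)  # back to the units of the charges column

print(round(rmse))
```

This works out to a little over 6,600, which gives a rough sense of how far a typical prediction is from the actual charges.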

Let's log the final validation loss to Jovian and commit the notebook.

In [None]:
jovian.log_metrics(val_loss=val_loss)

[jovian] Metrics logged.


In [None]:
jovian.commit(project=project_name, environment=None)

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Attaching records (metrics, hyperparameters, dataset etc.)
[jovian] Committed successfully! https://jovian.ai/mm20b029/02-insurance-linear-regression


'https://jovian.ai/mm20b029/02-insurance-linear-regression'

Now scroll back up, re-initialize the model, and try different values for batch size, number of epochs, learning rate etc. Commit each experiment and use the "Compare" and "View Diff" options on Jovian to compare the different results.

## Step 5: Make predictions using the trained model

**Q: Complete the following function definition to make predictions on a single input**

In [None]:
def predict_single(input, target, model):
    inputs = input.unsqueeze(0)           # add a batch dimension
    predictions = model(inputs)           # predict on this input, not on a leftover batch
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)
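The `unsqueeze(0)` call is needed because the model expects a batch of rows, not a single row. The sketch below uses an untrained `nn.Linear(6, 1)` as a stand-in for the trained model and random inputs as stand-in data, to show that predicting one row this way agrees with predicting the whole batch at once.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_model = nn.Linear(6, 1)    # stand-in for a trained InsuranceModel
demo_inputs = torch.rand(4, 6)  # four hypothetical rows with six features each

with torch.no_grad():  # gradients are not needed for inference
    one = demo_model(demo_inputs[0].unsqueeze(0))[0]  # single row reshaped from (6,) to (1, 6)
    batch = demo_model(demo_inputs)                   # all four rows at once

print(one.shape, batch.shape)
print(torch.allclose(one, batch[0]))
```

If every sample produces exactly the same prediction, a likely cause is that the model is being called on a fixed tensor instead of the input passed to the function.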

In [None]:
input, target = val_ds[0]
predict_single(input, target, model)

Input: tensor([60.0000,  1.0000, 25.9974,  0.0000,  0.0000,  2.0000])
Target: tensor([13963.9658])
Prediction: tensor([5102.5928])


In [None]:
input, target = val_ds[10]
predict_single(input, target, model)

Input: tensor([42.0000,  1.0000, 26.3307,  1.0000,  1.0000,  2.0000])
Target: tensor([43982.4336])
Prediction: tensor([5102.5928])


In [None]:
input, target = val_ds[23]
predict_single(input, target, model)

Input: tensor([60.0000,  1.0000, 33.1280,  0.0000,  1.0000,  3.0000])
Target: tensor([60479.4531])
Prediction: tensor([5102.5928])
