# HealthyLife Insurance Charge Prediction - Project 2

## Problem Statement
HealthyLife is a leading insurance company headquartered in New York City, serving customers nationwide with a range of insurance policies, including health, auto, and life insurance. Currently, the company relies on traditional methods to assess insurance charges based on customer details such as age, sex, and BMI. However, they face challenges in accurately predicting insurance charges due to limited insights into how different customer attributes impact premiums. This uncertainty leads to potential underpricing or overpricing of policies, affecting both profitability and customer satisfaction. To address these challenges, the company is looking to leverage advanced predictive modeling techniques to enhance its insurance charge estimation process and provide more accurate and personalized pricing to customers.

## Objective
As a Data Scientist hired by the insurance company, the objective is to develop an app and implement a predictive model for estimating insurance charges based on customer attributes. The primary challenges to solve include improving the accuracy of insurance charge predictions by incorporating various customer attributes, streamlining the underwriting process to enhance efficiency and customer experience, and maintaining regulatory compliance while optimizing pricing strategies. A further goal is to analyze and identify drift in the model and the data in order to understand how the model behaves over time.

By achieving these objectives, we aim to:

*   Produce more accurate and personalized insurance charge estimates
*   Reduce the risks of underpricing and overpricing
*   Improve customer satisfaction and loyalty through fair and competitive pricing
*   Ensure transparency and compliance with regulatory requirements in pricing strategies

Together, these outcomes should strengthen our competitive position in the market and enhance overall business performance.









### Import the required libraries

## Import the Data

In [None]:
# Read data

In [None]:
# Split the data into numerical and categorical columns

In [None]:
# display the statistical summary of the numerical, categorical and target data

#### Write your insights and findings from the statistical summary

* --
* --
* --

In [None]:
# Check the missing values

In [None]:
# check duplicated rows

In [None]:
# display the info of the dataset

In [None]:
# Drop the columns that are not required for modelling

## Exploratory data analysis

#### Charges amount distribution per sex

In [None]:
# Display a histogram to visualize the distribution of charges based on sex in the dataset

#### Distribution of Age

In [None]:
# Create a histogram to display the age distribution in the dataset

#### Charges amount distribution per smoker

In [None]:
# Show a histogram to visualize the distribution of charges amounts based on smoking status in the dataset

#### Average BMI per age

In [None]:
# Display a line plot showing the average BMI per age using markers to highlight the data points

As the plot shows, average BMI tends to rise with age and moves toward unhealthier ranges.

#### Age vs charges

In [None]:
# Create a scatter plot to show the relationship between age and charges in the dataset.

## Model Estimation

In [None]:
# Write your code here

## Model Evaluation

In [None]:
# Write your code here

## Serialization

In [None]:
# Display information about the scikit-learn package

In [None]:
# Generate a requirements.txt file for the project's dependencies

In [None]:
# Create a training script which we can use to train and save model

In [None]:
# Execute the training script

## Test Predictions

In [None]:
# Write your code here

# Gradio Interface

In [None]:
%%writefile app.py
# Import the libraries



# Run the training script placed in the same directory as app.py
# The training script will train and persist a linear regression
# model with the filename 'model.joblib'




# Load the freshly trained model from disk


# Prepare the logging functionality
log_file = Path("logs/") / f"data_{uuid.uuid4()}.json"
log_folder = log_file.parent

scheduler = CommitScheduler(
    repo_id="-----------",  # provide a name "insurance-charge-mlops-logs" for the repo_id
    repo_type="dataset",
    folder_path=log_folder,
    path_in_repo="data",
    every=2
)

# Define the predict function, which takes the input features, converts them to a DataFrame, and makes a prediction using the saved model
# The function runs when 'Submit' is clicked or when an API request is made


    # Once the prediction is made, log both the inputs and the output to a log file
    # While writing to the log file, hold the commit scheduler's lock to avoid parallel
    # access

    with scheduler.lock:
        with log_file.open("a") as f:
            f.write(json.dumps(
                {
                    'age': age,
                    'bmi': bmi,
                    'children': children,
                    'sex': sex,
                    'smoker': smoker,
                    'region': region,
                    'prediction': prediction[0]
                }
            ))
            f.write("\n")

    return prediction[0]



# Set up UI components for input and output



# Create the gradio interface, make title "HealthyLife Insurance Charge Prediction"


# Enable request queuing, then launch the app
demo.queue()
demo.launch(share=False)

Writing app.py
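
The logging step in `app.py` appends one JSON object per line while holding the scheduler's lock. The pattern can be tried locally without Hugging Face; the sketch below is only an illustration. It assumes a plain `threading.Lock` as a stand-in for `scheduler.lock` and a temporary folder in place of `logs/`, and the helper names `log_prediction` and `read_log` are hypothetical.

```python
import json
import tempfile
import threading
import uuid
from pathlib import Path

# Stand-in for scheduler.lock (assumption: any context-manager lock works here)
log_lock = threading.Lock()

# Temporary folder used instead of the "logs/" folder synced by CommitScheduler
log_folder = Path(tempfile.mkdtemp())
log_file = log_folder / f"data_{uuid.uuid4()}.json"


def log_prediction(record):
    """Append one record as a single JSON line while holding the lock."""
    with log_lock:
        with log_file.open("a") as f:
            f.write(json.dumps(record))
            f.write("\n")


def read_log():
    """Read the log back as a list of dicts, one per non-empty line."""
    with log_file.open() as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up inputs and predictions, for illustration only
log_prediction({"age": 30, "bmi": 27.5, "children": 1, "sex": "female",
                "smoker": "no", "region": "southeast", "prediction": 4500.0})
log_prediction({"age": 52, "bmi": 31.2, "children": 0, "sex": "male",
                "smoker": "yes", "region": "northwest", "prediction": 38250.0})

records = read_log()
print(len(records), records[0]["age"])
```

Writing one JSON object per line keeps each record independent, so a partially synced file can still be parsed line by line.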


# Online/Batch Inferencing

### Paste your gradio app link

*   app link here

Note: Make sure your Hugging Face space repository is set to public. If it's private, the evaluator won't be able to access the app you've built, which could result in losing marks.

In [None]:
# Install the gradio_client package silently

In [None]:
# import the libraries

## Test Data

In [None]:
# Read the insurance dataset from a CSV file into a DataFrame

# Define the target variable

# Define the numeric features used for modeling

# Define the categorical features used for modeling

# Print a message indicating that data subsets are being created

# Create a variable X by combining numeric and categorical features

# Create the target variable y

# Split the data into training and testing sets

In [None]:
# Check the shape of the testing features dataset


In [None]:
# Display the first 3 rows of the testing features dataset


In [None]:
# Retrieve the values of a specific row (index 764) from the testing features dataset and convert them to a list


In [None]:
# Randomly sample 100 rows from the testing features dataset


In [None]:
# Convert the sampled rows from the DataFrame to a list of tuples


In [None]:
# Access the first tuple in the list of sampled rows


## Inference

In [None]:
# Create a Gradio client instance for the specified Gradio interface


## Online/Real time

In [None]:
# Submit a single data point prediction request to the Gradio interface


In [None]:
# Print the result of the prediction job


## Batch

In [None]:
# Initialize an empty list to store batch predictions
batch_predictions = []

In [None]:
# Iterate through the sampled rows and submit prediction requests to the Gradio interface
# Store the results in batch_predictions list
for row in tqdm(Xtest_sample_rows):
    try:
        # Submit a prediction request for the current row


        # Append the prediction result to batch_predictions
        batch_predictions.append(int(round(float(prediction))))

        # Sleep for 1 second before the next request


    except Exception as e:
        # Print any exceptions that occur during the prediction
        print(e)

In [None]:
# Display the first 10 predictions from the batch_predictions list
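
The batch loop above can be rehearsed offline before pointing it at the deployed app. The sketch below is a minimal illustration under stated assumptions: `fake_predict` is a hypothetical stand-in with made-up coefficients, not the real model, and in the notebook the call would instead go through the Gradio client. The `failed_rows` list is an addition that is not in the template.

```python
import time


def fake_predict(age, bmi, children, sex, smoker, region):
    """Hypothetical stand-in for the deployed app; not the real model."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return 250.0 * age + (20000.0 if smoker == "yes" else 0.0)


# Made-up rows in the same column order as the logged features
sample_rows = [
    (19, 27.9, 0, "female", "yes", "southwest"),
    (-1, 30.0, 1, "male", "no", "northeast"),  # deliberately invalid row
    (45, 25.1, 2, "male", "no", "northwest"),
]

batch_predictions = []
failed_rows = []

for row in sample_rows:
    try:
        # Request a prediction for the current row
        prediction = fake_predict(*row)
        batch_predictions.append(int(round(float(prediction))))
        time.sleep(0.01)  # short pause here; the notebook sleeps for 1 second
    except Exception as e:
        # Keep the failing row so it can be inspected or retried later
        failed_rows.append((row, str(e)))

print(batch_predictions, len(failed_rows))
```

Recording failures, rather than only printing them, makes it easier to notice when the list of predictions is shorter than the list of rows, which matters if the predictions are later compared against targets position by position.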


# Unit Testing

## Setup

In [None]:
# Install the gradio_client package silently using pip

In [None]:
# import the required libraries

In [None]:
client = Client("---paste your gradio app api---")

## Baseline Checks

Test Data

In [None]:
# Read the insurance dataset from a CSV file into a DataFrame

# Define the target variable and features

# Display a message indicating the creation of data subsets

# Create feature matrix (X) and target vector (y)

# Split the dataset into training and testing sets

# Sample 100 rows from the testing set for evaluation

# Convert the sampled test set into a list of tuples


Creating data subsets


Predictions on the test data

In [None]:
# Initialize an empty list to store baseline test predictions
baseline_test_predictions = []

# Iterate over each row in the sampled test set
for row in tqdm(Xtest_sample_rows):
    try:
        # Submit a prediction request to the client API using the row data

        # Retrieve the prediction result and append it to the predictions list


        baseline_test_predictions.append(int(round(float(prediction))))

    # Handle any exceptions that may occur during prediction
    except Exception as e:
        print(e)

Estimate accuracy on the test sample. Use RMSE and R-squared to measure the performance of the model.

In [None]:
# Take the square root explicitly; the `squared` argument is not available in recent scikit-learn releases
print(f"RMSE: {mean_squared_error(ytest_sample, baseline_test_predictions) ** 0.5}")

In [None]:
print(f"R-squared: {r2_score(ytest_sample, baseline_test_predictions)}")

If the Mean Absolute Error (MAE) or Root Mean Square Error (RMSE) is lower than the existing baseline (human or a previous model version), we move on to unit tests.
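
For reference, the two metrics used above can be written out directly. The sketch below uses only the standard library and made-up numbers; the helper names `rmse` and `r_squared` are illustrative and are not scikit-learn functions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up charges, for illustration only
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"R-squared: {r_squared(actual, predicted):.2f}")
```

RMSE is expressed in the same units as the charges, while R-squared is unitless, so the two are usually read together.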

## Unit Tests

### Perturbation tests

*Baseline*

*Test (perturbed baseline)*
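
A perturbation test compares the prediction for a baseline input with the prediction for a slightly altered copy and expects the two to stay close. The sketch below shows one possible shape for such a test. `fake_predict` is a hypothetical stand-in with made-up coefficients, and the 5% tolerance is an assumed threshold, not a value from this project.

```python
def fake_predict(age, bmi, children, sex, smoker, region):
    """Hypothetical stand-in model with made-up coefficients."""
    charge = 250.0 * age + 300.0 * bmi + 400.0 * children
    if smoker == "yes":
        charge += 20000.0
    return charge


def relative_change(baseline_row, perturbed_row, predict=fake_predict):
    """Relative difference between the perturbed and baseline predictions."""
    base = predict(*baseline_row)
    perturbed = predict(*perturbed_row)
    return abs(perturbed - base) / abs(base)


baseline = (40, 28.0, 2, "male", "no", "southeast")
perturbed = (40, 28.3, 2, "male", "no", "southeast")  # BMI nudged by about 1%

TOLERANCE = 0.05  # assumed threshold for an acceptable change

change = relative_change(baseline, perturbed)
perturbation_test_passed = change <= TOLERANCE
print(f"relative change: {change:.4f}, passed: {perturbation_test_passed}")
```

With the deployed app, `predict` would be replaced by a function that sends the row to the Gradio endpoint and returns the numeric result.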

### Known edge-cases (critical subgroups)

In this scenario, a known edge case is that when a person is a smoker, the insurance charge should be high. Let us see if the model can recognize this condition.
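
One way to express this check is to hold a profile fixed, flip only the smoking status, and require the smoker's predicted charge to be higher. The sketch below assumes a hypothetical stand-in model (`fake_predict`, with made-up coefficients) and made-up profiles. The helper `smoker_gap` is illustrative.

```python
def fake_predict(age, bmi, children, sex, smoker, region):
    """Hypothetical stand-in model with made-up coefficients."""
    charge = 250.0 * age + 300.0 * bmi + 400.0 * children
    if smoker == "yes":
        charge += 20000.0
    return charge


def smoker_gap(profile, predict):
    """Predicted charge as a smoker minus predicted charge as a non-smoker."""
    as_smoker = dict(profile, smoker="yes")
    as_non_smoker = dict(profile, smoker="no")
    return predict(**as_smoker) - predict(**as_non_smoker)


# Made-up customer profiles, for illustration only
profiles = [
    {"age": 25, "bmi": 22.0, "children": 0, "sex": "female",
     "smoker": "no", "region": "northeast"},
    {"age": 60, "bmi": 33.5, "children": 3, "sex": "male",
     "smoker": "yes", "region": "southwest"},
]

gaps = [smoker_gap(profile, fake_predict) for profile in profiles]
edge_case_passed = all(gap > 0 for gap in gaps)
print(gaps, edge_case_passed)
```

A stricter version could require the gap to exceed a minimum amount chosen with the business, rather than merely being positive.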

If the unit tests pass, the model is ready to be tagged for release to staging and production.

# Identify drift in the model and data

## Setup

In [None]:
# !pip install -q datasets

In [None]:
# import the required libraries

## Monitoring Setup

## Access Training Data

In [None]:
# Read the insurance dataset from a CSV file into a DataFrame

# Define the target variable

# Define the numeric features used for modeling

# Define the categorical features used for modeling

# Print a message indicating that data subsets are being created

# Create a variable X by combining numeric and categorical features

# Create the target variable y

# Split the data into training and testing sets

Creating data subsets


## Access Logs

We connect to the dataset of all the production logs and extract a 30% random sample to execute the monitoring workflow.

In [None]:
prediction_logs = load_dataset("--paste your log dataset api---")

In [None]:
# Convert the 'train' split of the logs (a Hugging Face Dataset) to a pandas DataFrame

In [None]:
# Sample 30% of the rows from the prediction logs DataFrame with a random state 42

In [None]:
# Print 5 sample data points

## Model Drift Checks

### Predicted Targets vs Training Targets

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting the distributions of actual target values and predicted values
plt.figure(figsize=(12, 6))

plt.subplot(211)
# Plot the histogram with a KDE (Kernel Density Estimation) curve
# write your code here
plt.title("Distribution of actual target values in training data")
plt.xlabel("Actual Target Values")
plt.ylabel("Frequency")

# Distribution of predicted target values from the deployed model
plt.subplot(212)
# Plot a histogram with a KDE (Kernel Density Estimation) curve for the predicted values from the sample prediction logs DataFrame
# write your code here
plt.title("Distribution of predicted target values from the deployed model")
plt.xlabel("Predicted Target Values")
plt.ylabel("Frequency")

plt.tight_layout()
plt.show()

In [None]:
# Calculate mean of actual values in training data (sum(target y) / len(target y))
mean_training_data =

In [None]:
# Calculate mean of predicted values in sample logs (sum(logs.prediction) / len(logs.prediction))
mean_sample_logs =

In [None]:
# Calculate variance of actual values in training data
variance = sum((y - mean_training_data)**2 for y in ytrain) / len(ytrain)

In [None]:
# Calculate absolute difference between means
diff = abs(mean_training_data - mean_sample_logs)

In [None]:
# Check for model drift
import math  # needed for math.sqrt if not already imported above

if diff > 2 * math.sqrt(variance):
    print("Model Drift Detected!")
else:
    print("No Model Drift!")

No Model Drift!
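
The rule used above flags model drift when the mean of the logged predictions is more than two standard deviations of the training target away from the training mean. The sketch below wraps that rule in a function and runs it on made-up numbers. The function names and the parameter `k` are illustrative, and the variance is the population variance (divided by `n`), matching the cell above.

```python
import math


def mean(values):
    return sum(values) / len(values)


def population_variance(values):
    """Variance with division by n, as in the notebook cell."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def model_drift_detected(training_targets, logged_predictions, k=2.0):
    """True when the two means differ by more than k training standard deviations."""
    diff = abs(mean(training_targets) - mean(logged_predictions))
    return diff > k * math.sqrt(population_variance(training_targets))


# Made-up values, for illustration only
training_targets = [2000.0, 4000.0, 6000.0, 8000.0, 10000.0]
stable_predictions = [3000.0, 5000.0, 7000.0, 9000.0]
shifted_predictions = [15000.0, 16000.0, 17000.0]

print(model_drift_detected(training_targets, stable_predictions))
print(model_drift_detected(training_targets, shifted_predictions))
```

Because the threshold scales with the spread of the training target, a widely dispersed target tolerates larger shifts in the mean before the rule fires.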


## Data Drift Checks

### Live Features vs Training Features

In [None]:
mean_age_training_data =
std_age_training_data =

mean_age_sample_logs =

In [None]:
(mean_age_training_data, mean_age_sample_logs)

In [None]:
mean_feature_training_data = 39.35
mean_feature_sample_logs = 37.04
std_feature_training_data = 14.07

mean_diff = abs(mean_feature_training_data - mean_feature_sample_logs)

if mean_diff > 2 * std_feature_training_data:
    print("Data Drift Detected!")
else:
    print("No Data Drift!")
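
With the age figures in the cell above, the difference in means is |39.35 - 37.04| = 2.31, well below the threshold of 2 × 14.07 = 28.14, so no data drift is flagged. The same check can be applied feature by feature, as sketched below. The `bmi` figures are made up to show a failing case, and the function name `mean_shift_drift` is hypothetical.

```python
def mean_shift_drift(train_mean, train_std, live_mean, k=2.0):
    """True when the live mean is more than k training std devs from the training mean."""
    return abs(train_mean - live_mean) > k * train_std


# Age figures come from the cell above; the bmi figures are made up
feature_stats = {
    "age": {"train_mean": 39.35, "train_std": 14.07, "live_mean": 37.04},
    "bmi": {"train_mean": 30.0, "train_std": 6.0, "live_mean": 45.0},
}

drift_report = {
    name: mean_shift_drift(**stats) for name, stats in feature_stats.items()
}

for name, drifted in drift_report.items():
    print(f"{name}: {'Data Drift Detected!' if drifted else 'No Data Drift!'}")
```

This rule only looks at the mean of numeric features. Categorical features such as `smoker` or `region` would need a different comparison, for example of category proportions.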

The current model stays in production unless we detect model drift or data drift.

# Convert ipynb to HTML

Instructions:
1. Go to File
2. Download the current working notebook in .ipynb format
3. Run the code below and select the downloaded notebook from your local machine
4. Wait a few seconds; the notebook will be converted to HTML automatically and saved to your local machine


In [None]:
# @title HTML Convert
# Upload ipynb
from google.colab import files
f = files.upload()

# Convert ipynb to html
import subprocess
file0 = list(f.keys())[0]
_ = subprocess.run(["pip", "install", "nbconvert"])
_ = subprocess.run(["jupyter", "nbconvert", file0, "--to", "html"])

# download the html
files.download(file0[:-5]+"html")


## Power Ahead!