# **Recommender System for Best Insurance Provider in Kenya**

## **Business Understanding**

According to Cytonn's report on Kenya's listed insurance sector for H1 2023, insurance penetration in Kenya remains historically low. As of FY 2022, penetration stood at just 2.3%, according to the Kenya National Bureau of Statistics (KNBS) 2023 Economic Survey. This is notably below the global average of 7.0%, as reported by the Swiss Re Institute. The low penetration rate is largely attributed to the perception of insurance as a luxury rather than a necessity, and it is often purchased only when required or mandated by regulation. Additionally, a pervasive mistrust of insurance providers has significantly contributed to the low uptake.

Despite the critical importance of non-liability insurance—such as health, property, and personal accident coverage—for financial protection against unforeseen events, many individuals find it challenging to identify a trustworthy insurance provider due to opaque information on claim settlement records. This project seeks to address this pressing issue by simplifying the process of selecting a suitable insurance provider based on their performance in settling non-liability claims. By enhancing transparency and fostering trust in insurance providers, the project aims to boost insurance uptake and overall customer satisfaction.

In [2]:
# Assigning the text in the previous cell to a variable named Business_understanding

Business_understanding = """# # **Recommender System for Best Insurance Provider  in Kenya**
# ## **Business Understanding**
# According to Cytonn's report on Kenya's listed insurance sector for H1 2023, insurance penetration in Kenya remains historically low. As of FY 2022, penetration stood at just 2.3%, according to the Kenya National Bureau of Statistics (KNBS) 2023 Economic Survey. This is notably below the global average of 7.0%, as reported by the Swiss Re Institute. The low penetration rate is largely attributed to the perception of insurance as a luxury rather than a necessity, and it is often purchased only when required or mandated by regulation. Additionally, a pervasive mistrust of insurance providers has significantly contributed to the low uptake.
#
# Despite the critical importance of non-liability insurance—such as health, property, and personal accident coverage—for financial protection against unforeseen events, many individuals find it challenging to identify a trustworthy insurance provider due to opaque information on claim settlement records. This project seeks to address this pressing issue by simplifying the process of selecting a suitable insurance provider based on their performance in settling non-liability claims. By enhancing transparency and fostering trust in insurance providers, the project aims to boost insurance uptake and overall customer satisfaction."""


## **Problem Statement**

Kenya faces a significant challenge with low insurance penetration, standing at just 2.3% as of FY 2022, well below the global average of 7.0%. This disparity is largely due to the perception of insurance as a luxury rather than a necessity, coupled with widespread mistrust towards insurance providers. Many Kenyans only purchase insurance when mandated or compelled by regulation, undermining the financial protection benefits offered by non-liability insurance such as health, property, and personal accident coverage.

In [3]:
# Assigning the text in the previous cell to a variable named problem_statement

problem_statement = """#
# ## **Problem Statement**
# Kenya faces a significant challenge with low insurance penetration, standing at just 2.3% as of FY 2022, well below the global average of 7.0%. This disparity is largely due to the perception of insurance as a luxury rather than a necessity, coupled with widespread mistrust towards insurance providers. Many Kenyans only purchase insurance when mandated or compelled by regulation, undermining the financial protection benefits offered by non-liability insurance such as health, property, and personal accident coverage."""


## **Objectives**

Our project aims to address the challenges outlined in the business understanding and problem statement by developing a recommender system for the Kenyan insurance market. This system will leverage historical data to assess insurers based on their non-liability claim settlement performance. By offering transparent insights into insurer reliability and fostering trust between consumers and providers, we strive to achieve the following objectives:

1. **Enhanced Transparency:** Provide clear and detailed information on insurers' performance to improve market transparency and empower customers to make informed decisions.

2. **Increased Customer Trust:** Build greater trust between customers and insurance providers by showcasing insurers' reliability in settling non-liability claims.

3. **Boosted Insurance Uptake:** Encourage increased insurance adoption by simplifying the process of finding trustworthy insurers, thereby raising overall market penetration.

4. **Improved Customer Satisfaction:** Guide customers towards insurers with strong claim settlement records to enhance their overall satisfaction with the insurance experience.

In [4]:
# Write objectives based on the Business_understanding and problem_statement variables and assign it a variable called objectives.

objectives = """### **Objectives**

Our project aims to address the challenges outlined in the business understanding and problem statement by developing a recommender system for the Kenyan insurance market. This system will leverage historical data to assess insurers based on their non-liability claim settlement performance. By offering transparent insights into insurer reliability and fostering trust between consumers and providers, we strive to achieve the following objectives:

1. **Enhanced Transparency:** Provide clear and detailed information on insurers' performance to improve market transparency and empower customers to make informed decisions.

2. **Increased Customer Trust:** Build greater trust between customers and insurance providers by showcasing insurers' reliability in settling non-liability claims.

3. **Boosted Insurance Uptake:** Encourage increased insurance adoption by simplifying the process of finding trustworthy insurers, thereby raising overall market penetration.

4. **Improved Customer Satisfaction:** Guide customers towards insurers with strong claim settlement records to enhance their overall satisfaction with the insurance experience."""
print(objectives)



## **Metrics of Success**


**Prediction Accuracy Metrics**

1. **RMSE (Root Mean Squared Error)**: For predictions of claim amounts or settlement amounts, RMSE is the square root of the mean squared difference between predicted and actual values. A lower RMSE indicates better accuracy. It is our primary metric: an RMSE of 1-2% of the target mean will be deemed acceptable, with 1% preferred.

2. **MAE (Mean Absolute Error)**: The average absolute difference between predicted and actual values. It is similar to RMSE but less sensitive to outliers. No fixed target is set, but an MAE below 0.5 will be considered good.

3. **R-squared (R²)**: The proportion of the variability in the target variable that the model explains. Higher values indicate better performance. Any value above 0.85 will be considered good.
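The three metrics and their thresholds can be checked with a few lines of NumPy. The sketch below is illustrative only: the helper name `regression_report` and the sample values are assumptions, not part of the project's data or code.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return RMSE, RMSE as a percentage of the target mean, MAE, and R-squared."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    ss_res = float(np.sum(errors ** 2))                     # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    return {
        "rmse": rmse,
        "rmse_pct_of_mean": 100.0 * rmse / float(abs(y_true.mean())),
        "mae": mae,
        "r2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical actual and predicted values, for illustration only
report = regression_report([80.0, 90.0, 100.0, 110.0], [82.0, 89.0, 99.0, 112.0])

# Compare against the thresholds stated above
meets_rmse_target = report["rmse_pct_of_mean"] <= 2.0
meets_r2_target = report["r2"] > 0.85
print(report, meets_rmse_target, meets_r2_target)
```

Expressing RMSE as a percentage of the target mean makes the 1-2% threshold directly comparable across targets measured on different scales.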

In [5]:
# Assigning the text from the previous cell to a variable called metrics

metrics = """# ## **Metrics of Success**
#
# **Prediction Accuracy Metrics**
#
# 1. **RMSE (Root Mean Squared Error)**: For prediction of claim amounts or settlement amounts, RMSE measures the difference between predicted and actual values. Lower RMSE indicates better accuracy. It will serve as our target metric and any RMSE of 1-2% of the mean will be deemed acceptable with 1% being preferred.
#
# 2. **MAE (Mean Absolute Error)**: Similar to RMSE but less sensitive to outliers. It measures the average absolute difference between predicted and actual values. Not necessarily predetermined but anything < 0.5 will be considered good.
#
# 3. **R-squared (R²)**:  Indicates how well the model explains the variability of the target variable. Higher values indicate better performance. Any value above 0.85 will be considered as good."""


## **Data Understanding**

Our approach will involve analyzing aggregated data from multiple sources, including historical claim settlement records, insurer performance metrics, and customer feedback. By normalizing and processing this data, we aim to develop a robust recommendation model that aligns with consumer needs and market dynamics.

The dataset, obtained from the Insurance Regulatory Authority (IRA) website (https://www.ira.go.ke/index.php/publications/statistical-reports/claims-settlement-*statistics*), contains the following columns:

* **Date**: End date of the quarter.

* **Insurer**: Name of the insurance company.

* **Claims_outstanding_at_the_beginning**: Claims outstanding at the beginning of the period.

* **Claims_intimated**: Claims initially notified or reported by policyholders to their insurance company.

* **Claims_revived**: Claims that were previously closed, denied, or settled but have been reopened for reconsideration or additional processing.

* **Total_Claims_Payable**: Total claims payable (the sum of claims outstanding at the beginning, claims intimated, and claims revived).

* **Claims_paid**: Claims paid by the insurer during the quarter. These may include claims outstanding at the beginning of the period as well as claims intimated or revived during the quarter.

* **Claims_declined**: Claims declined during the period.

* **Claims_closed_as_no_claims**: Notified claims for which the insurer makes provisions for liability, but the liability does not crystallize during the quarter.

* **Total_Claims_Action_during_the_Quarter**: Sum of the number of claims paid, claims declined, claims closed as no claims, and claims outstanding at the end of the quarter.

* **Claims_outstanding_at_the_end**: Claims outstanding at the end of the period, calculated as total claims payable during the quarter minus total claims action during the quarter.

* **Claims_declined_ratio_(%)**: Proportion of claims declined relative to the total number of claims actionable during the quarter.

* **Claims_closed_as_no_claims_ratio (%)**: Proportion of claims closed as no claims relative to the total number of claims actionable during the quarter.

* **Claim_payment_ratio_(%)**: Proportion of claims paid relative to the total number of claims actionable during the quarter.

* **Claim_payment_ratio_(%)_prev**: Claim payment ratio for the previous quarter.
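To make the column definitions concrete, the sketch below derives the totals and ratios from raw claim counts with pandas. The counts are hypothetical, not IRA data. Two assumptions are made here and are not confirmed by the source: the "claims actionable during the quarter" denominator is taken to be `Total_Claims_Payable`, and claims outstanding at the end are computed as total claims payable minus claims paid, declined, and closed as no claims.

```python
import pandas as pd

# Hypothetical counts for two insurers in one quarter (not IRA data)
claims = pd.DataFrame({
    "Insurer": ["Insurer A", "Insurer B"],
    "Claims_outstanding_at_the_beginning": [100, 40],
    "Claims_intimated": [300, 150],
    "Claims_revived": [0, 10],
    "Claims_paid": [240, 90],
    "Claims_declined": [20, 10],
    "Claims_closed_as_no_claims": [40, 20],
})

# Total claims payable: outstanding at the beginning + intimated + revived
claims["Total_Claims_Payable"] = (
    claims["Claims_outstanding_at_the_beginning"]
    + claims["Claims_intimated"]
    + claims["Claims_revived"]
)

# Assumption: outstanding at the end is what remains after paid, declined, and closed claims
claims["Claims_outstanding_at_the_end"] = claims["Total_Claims_Payable"] - (
    claims["Claims_paid"] + claims["Claims_declined"] + claims["Claims_closed_as_no_claims"]
)

# Assumption: ratios use total claims payable as the "claims actionable" denominator
denominator = claims["Total_Claims_Payable"]
claims["Claim_payment_ratio_(%)"] = 100 * claims["Claims_paid"] / denominator
claims["Claims_declined_ratio_(%)"] = 100 * claims["Claims_declined"] / denominator
claims["Claims_closed_as_no_claims_ratio (%)"] = (
    100 * claims["Claims_closed_as_no_claims"] / denominator
)

print(claims[["Insurer", "Total_Claims_Payable", "Claim_payment_ratio_(%)"]])
```

If the IRA reports define the denominator differently, only the `denominator` line needs to change.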

In [6]:
# Assigning the text from the previous cell to a variable called column_features.

column_features = """# ## **Data Understanding**
# Our approach will involve analyzing aggregated data from multiple sources, including historical claim settlement records, insurer performance metrics, and customer feedback. By normalizing and processing this data, we aim to develop a robust recommendation model that aligns with consumer needs and market dynamics.
#
# The dataset obtained from Insurance Regulatory Authority(IRA) website (https://www.ira.go.ke/index.php/publications/statistical-reports/claims-settlement-*statistics*) contains the following columns:
#
# * **Date**: End date of the quarter.
#
# * **Insurer**: Name of the insurance company.
#
# * **Claims_outstanding_at_the_beginning**: Claims outstanding at the beginning of the period.
#
# * **Claims_intimated**: The initial notification or reporting of a claim by a policyholder to their insurance company.
#
# * **Claims_revived**:  insurance claims that were previously closed, denied, or settled but have been reopened for reconsideration or additional processing.
#
# * **Total_Claims_Payable**: Total claims payable (summation of claims outstanding at the beginning, claims intimated, and claims revived).
#
# * **Claims_paid**: these are the claims paid by the insurers during the quarter. The claims paid may include those outstanding at the beginning of the period and those intimated and revived during the quarter.
#
# * **Claims_declined**: Claims declined during the period.
#
# * **Claims_closed_as_no_claims**: notified claims for which the insurer
# makes provisions for liability, but the liability does not crystalize during the quarter.
#
# * **Total_Claims_Action_during_the_Quarter**: summation of the number of
# claims paid, claims declined, claims closed as no claims, and claims
# outstanding at the end of the quarter.
#
# * **Claims_outstanding_at_the_end**: Claims outstanding at the end of the period. Calculated as the subtraction of total claims action during the quarter from the total claims payable during the quarter.
#
# * **Claims_declined_ratio_(%)**:  proportion of the number of claims declined in relation to the total
# number of claims actionable during the quarter
#
# * **Claims_closed_as_no_claims_ratio (%)**: proportion of claims closed as no claims in relation to the total number
# of claims actionable during the quarter.
#
# * **Claim_payment_ratio_(%)**: proportion of the number of claims paid in relation to the total number
# of claims actionable during the quarter.
#
# * **Claim_payment_ratio_(%)_prev**: Previous quarter claim payment ratio."""


In [7]:
#Identifying the possible column feature titles in the previous cell and save them as a markdown file called column_features in the content directory.

%%writefile /content/column_features.md
- Date
- Insurer
- Claims_outstanding_at_the_beginning
- Claims_intimated
- Claims_revived
- Total_Claims_Payable
- Claims_paid
- Claims_declined
- Claims_closed_as_no_claims
- Total_Claims_Action_during_the_Quarter
- Claims_outstanding_at_the_end
- Claims_declined_ratio_(%)
- Claims_closed_as_no_claims_ratio (%)
- Claim_payment_ratio_(%)
- Claim_payment_ratio_(%)_prev


Writing /content/column_features.md


In [8]:
!pip install streamlit

Collecting streamlit
  Downloading streamlit-1.37.1-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting tenacity<9,>=8.1.0 (from streamlit)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit)
  Downloading GitPython-3.1.43-py3-none-any.whl.metadata (13 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Collecting watchdog<5,>=2.1.5 (from streamlit)
  Downloading watchdog-4.0.2-py3-none-manylinux2014_x86_64.whl.metadata (38 kB)
Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.19,<4,>=3.0.7->streamlit)
  Downloading gitdb-4.0.11-py3-none-any.whl.metadata (1.2 kB)
Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->gitpython!=3.1.19,<4,>=3.0.7->streamlit)
  Downloading smmap-5.0.1-py3-none-any.whl.metadata (4.3 kB)
Downloading streamlit-1.37.1-py2.py3-none-any.whl (8.7 MB)

**Importing Libraries**

In [9]:
# Data import and export
import os
from google.colab import drive

# Ignore warning messages
import warnings
warnings.filterwarnings("ignore")

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go

# Statistical analysis
from scipy import stats
from scipy.stats import linregress
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Machine learning: preprocessing
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer

# Machine learning: model selection
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

# Machine learning: models
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
import xgboost as xgb

# Machine learning: metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Model deployment
import streamlit as st
import joblib
from datetime import date

## 6. **Deployment**

In [10]:
# Importing our data frames
df_features = pd.read_csv('/content/normalized_features.csv')
df_target = pd.read_csv('/content/target.csv')
print(df_features.head())
print(df_target.head())

   Claims_closed_as_no_claims  Claims_declined_ratio_(%)  \
0                    0.378057                  -0.055479   
1                   -0.085075                  -0.223516   
2                   -0.078313                  -0.223516   
3                   -0.084565                  -0.223516   
4                   -0.085075                  -0.223516   

   Claim_payment_ratio_(%)  Claims_closed_as_no_claims_ratio (%)  \
0                 2.223808                              0.049249   
1                -0.158895                             -0.472385   
2                 0.440747                             -0.028991   
3                -0.164946                             -0.167443   
4                 0.663652                             -0.472385   

   Insurer_Encoded  Reliability_Label_Encoded  
0        -1.372682                   1.107806  
1        -1.324615                  -1.258918  
2        -1.276547                  -1.258918  
3        -1.228480                  -1

In [11]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size=0.2, random_state=42)

In [12]:
# Impute missing feature values with the column mean.
# The imputer is fitted on the training data only, then applied to both splits.
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Handle missing values in y_train and y_test (if any) with a separate imputer,
# so the fitted feature imputer above is not overwritten
target_imputer = SimpleImputer(strategy='mean')
y_train_imputed = target_imputer.fit_transform(y_train.values.reshape(-1, 1)).ravel()
y_test_imputed = target_imputer.transform(y_test.values.reshape(-1, 1)).ravel()


In [13]:
# Initialize the ElasticNet model
elastic_net_model = ElasticNet(random_state=42)  # You can adjust hyperparameters here

# Fit the model to the training data
elastic_net_model.fit(X_train_imputed, y_train_imputed)

# Make predictions on the test data
y_pred_en = elastic_net_model.predict(X_test_imputed)

# Evaluate the ElasticNet model
mse_en = mean_squared_error(y_test_imputed, y_pred_en)
rmse_en = np.sqrt(mse_en)
r2_en = r2_score(y_test_imputed, y_pred_en)
mae_en = mean_absolute_error(y_test_imputed, y_pred_en)

print("ElasticNet - Root Mean Squared Error:", rmse_en)
print("ElasticNet - R-squared:", r2_en)
print("ElasticNet - Mean Absolute Error:", mae_en)


ElasticNet - Root Mean Squared Error: 7.343995266939545
ElasticNet - R-squared: 0.8687570346348801
ElasticNet - Mean Absolute Error: 5.605689927194578
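The ElasticNet above is fitted with default hyperparameters. One way to tune `alpha` and `l1_ratio` is a cross-validated grid search scored on RMSE, sketched below. The data is a synthetic stand-in and the grid values are arbitrary assumptions, so the selected parameters say nothing about the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with six features (not the project's dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hypothetical grid: alpha sets the penalty strength, l1_ratio the L1/L2 mix
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}

search = GridSearchCV(
    ElasticNet(random_state=42, max_iter=10_000),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, so RMSE is negated
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_params, best_rmse)
```

In the notebook, `X_train_imputed` and `y_train_imputed` would take the place of `X` and `y`, and `search.best_estimator_` would be evaluated on the held-out test split.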


In [14]:
# Record the model selected on RMSE (the name is set by hand here, not computed from a comparison)
best_model = "Elastic Net"
print(f"The best model based on RMSE is: {best_model}")

The best model based on RMSE is: Elastic Net
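When several candidate models have been evaluated, the selection can be computed rather than typed in. The sketch below assumes a dictionary of test RMSE values per model. Only the Elastic Net figure comes from the run above; the other two entries are hypothetical placeholders.

```python
# Test RMSE per candidate model. "Candidate B" and "Candidate C" are
# hypothetical placeholders, not results from this project.
rmse_by_model = {
    "Elastic Net": 7.343995266939545,
    "Candidate B": 8.1,
    "Candidate C": 9.4,
}

# The best model is the one with the lowest RMSE
best_model = min(rmse_by_model, key=rmse_by_model.get)
print(f"The best model based on RMSE is: {best_model}")
```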


In [15]:
joblib.dump(elastic_net_model, 'en_model.pkl')

['en_model.pkl']
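The saved file contains only the fitted ElasticNet, so any caller must repeat the imputation step before predicting. A possible alternative, sketched below under assumptions, is to wrap the imputer and the model in a scikit-learn `Pipeline` and persist that single object. The data is synthetic and the file name `en_pipeline.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with six features (not the project's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
X[::10, 1] = np.nan  # introduce missing values in one feature column

# Imputation and model are fitted together and saved as one artifact
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", ElasticNet(random_state=42)),
])
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "en_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline handles missing values in new input on its own
new_row = np.array([[0.5, np.nan, 0.0, 0.0, 0.0, 0.0]])
same = np.allclose(pipeline.predict(new_row), reloaded.predict(new_row))
print(reloaded.predict(new_row), same)
```

Saving the pipeline keeps preprocessing and model in step, which matters once the app receives raw user input rather than pre-imputed arrays.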

In [16]:
!ls -al /content/en_model.pkl

-rw-r--r-- 1 root root 679 Aug 22 11:21 /content/en_model.pkl


In [17]:
# Load the saved model
elastic_net_deploy = joblib.load('en_model.pkl')

# Feature columns, in the order the model was trained on (taken from df_features)
feature_columns = list(df_features.columns)

# Define a function to make predictions
def predict(feature_values):
    """Return the model's prediction for one row of normalized feature values."""
    # The model expects a 2D array of shape (1, n_features)
    input_array = np.array(feature_values, dtype=float).reshape(1, -1)
    prediction = elastic_net_deploy.predict(input_array)
    return prediction

# Streamlit app layout
st.title("ElasticNet Model Prediction App")

st.write("Enter the normalized input features to get a prediction:")

# One numeric input field per feature column the model was trained on
input_data = [st.number_input(column, value=0.0) for column in feature_columns]

# Button to trigger prediction
if st.button("Predict"):
    prediction = predict(input_data)
    st.write(f"Predicted value: {prediction[0]}")

2024-08-22 11:21:25.361 
  command:

    streamlit run /usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py [ARGUMENTS]
2024-08-22 11:21:25.366 Session state does not function when running a script without `streamlit run`


### **Summary of our deployment step**

In [18]:
# Summary of our deployment step

deployment_summary = f"""
## Deployment Summary

In this final step, we focused on deploying the chosen {best_model} for practical use. The model was saved using joblib, enabling easy loading and utilization for predictions.

This deployment strategy allows for:

- **Efficient Predictions:** The saved model can be readily loaded to generate predictions on new data, either in real-time or batch mode.
- **Scalability:** The deployment can be scaled based on the volume of prediction requests, ensuring timely responses.
- **Integration:** The deployed model can be integrated into existing systems or applications, providing seamless access to reliability score predictions.

By successfully deploying the {best_model} model, we have empowered stakeholders to leverage the model's predictive capabilities for informed decision-making and proactive reliability management.
"""
print(deployment_summary)



## Deployment Summary

In this final step, we focused on deploying the chosen Elastic Net for practical use. The model was saved using joblib, enabling easy loading and utilization for predictions.

This deployment strategy allows for:

- **Efficient Predictions:** The saved model can be readily loaded to generate predictions on new data, either in real-time or batch mode.
- **Scalability:** The deployment can be scaled based on the volume of prediction requests, ensuring timely responses.
- **Integration:** The deployed model can be integrated into existing systems or applications, providing seamless access to reliability score predictions.

By successfully deploying the Elastic Net model, we have empowered stakeholders to leverage the model's predictive capabilities for informed decision-making and proactive reliability management.



In [19]:
drive.mount('/content/drive')

# Save the deployment summary text to Google Drive
with open('/content/drive/My Drive/deployment_summary.txt', 'w') as f:
    f.write(deployment_summary)

Mounted at /content/drive
