# Predicting Insurance Premiums with Data-Driven Insights for SecureLife Insurance Co.

## About SecureLife Insurance Co.

![SecureLife](https://drive.google.com/uc?export=view&id=1uMGZfZRIQIRGx0pTD3zcENjOCMHAOhx8)

Founded in 1995, SecureLife Insurance Co. is a leading provider of personal and commercial insurance products. With a strong commitment to innovation and customer-centric policies, SecureLife serves over 10 million policyholders across Africa.

### Mission:

To deliver affordable, reliable, and tailored insurance solutions that protect what matters most to our customers.

### Vision:

To be the most trusted name in insurance by leveraging technology and data-driven insights to improve the customer experience.

### Key Facts:

- Headquarters: Lagos, Nigeria
- Revenue: $3.5 billion annually
- Employees: 8,000+
- Customer Satisfaction Rating: 4.8/5 (based on internal surveys)
- Products Offered:
  - Auto Insurance
  - Health Insurance
  - Home Insurance
  - Life Insurance
  - Specialty Products (e.g., Travel, Pet Insurance)

### Core Values:

- Integrity: Always prioritize ethical practices.
- Innovation: Leverage cutting-edge technology to serve customers better.
- Customer-Centricity: Deliver products and services tailored to individual needs.
- Sustainability: Support green initiatives and sustainable practices.

### Recent Initiatives:

- Digital Transformation: SecureLife has adopted a fully digital claims process, reducing claim settlement times by 40%.
- AI-Powered Premium Estimation: The company is developing machine learning models to provide personalized premium quotes in real time.
- Community Outreach: SecureLife donates 5% of annual profits to support disaster relief and education initiatives.



Now that you know the company you work for, here is the situation on the ground.

## Overview/Problem statement

Insurance companies like SecureLife Insurance Co. rely heavily on accurate premium prediction models to balance competitiveness with profitability. You are a data scientist recently hired by this leading insurance company (just look at their stats) to develop a predictive model for estimating insurance premiums. You will be responsible for end-to-end development, from data preprocessing and exploratory analysis to model building and evaluation.

## Objective

Develop a regression model to predict the Premium Amount based on the data provided. The key objectives are:

- Clean and preprocess the dataset.
- Explore feature importance and relationships.
- Build and evaluate a robust predictive model.
- Interpret results and provide actionable insights.

## About the Data

The dataset contains over 200,000 records and 20 columns (the target plus the 19 features listed below), designed to mimic real-world scenarios in the insurance domain. It includes a mix of numerical, categorical, and textual data with challenges such as missing values, skewed distributions, and improperly formatted fields. The target variable is "Premium Amount", the insurance premium to be predicted.

| **Feature**            | **Description**                                                                 |
|-------------------------|---------------------------------------------------------------------------------|
| **Age**                | Age of the insured individual (Numerical)                                       |
| **Gender**             | Gender of the insured individual (Categorical: Male, Female)                   |
| **Annual Income**      | Annual income of the insured individual (Numerical, skewed)                    |
| **Marital Status**     | Marital status of the insured individual (Categorical: Single, Married, Divorced) |
| **Number of Dependents** | Number of dependents (Numerical, with missing values)                          |
| **Education Level**    | Highest education level attained (Categorical: High School, Bachelor's, Master's, PhD) |
| **Occupation**         | Occupation of the insured individual (Categorical: Employed, Self-Employed, Unemployed) |
| **Health Score**       | A score representing the health status (Numerical, skewed)                     |
| **Location**           | Type of location (Categorical: Urban, Suburban, Rural)                         |
| **Policy Type**        | Type of insurance policy (Categorical: Basic, Comprehensive, Premium)          |
| **Previous Claims**    | Number of previous claims made (Numerical, with outliers)                      |
| **Vehicle Age**        | Age of the vehicle insured (Numerical)                                         |
| **Credit Score**       | Credit score of the insured individual (Numerical, with missing values)        |
| **Insurance Duration** | Duration of the insurance policy (Numerical, in years)                         |
| **Premium Amount**     | Target variable representing the insurance premium amount (Numerical, skewed)  |
| **Policy Start Date**  | Start date of the insurance policy (Text, improperly formatted)                |
| **Customer Feedback**  | Short feedback comments from customers (Text)                                  |
| **Smoking Status**     | Smoking status of the insured individual (Categorical: Yes, No)                |
| **Exercise Frequency** | Frequency of exercise (Categorical: Daily, Weekly, Monthly, Rarely)            |
| **Property Type**      | Type of property owned (Categorical: House, Apartment, Condo)                  |


**You can find the dataset here: "[Insurance Premium Prediction Dataset.csv](https://drive.google.com/file/d/1bQ8RE4HrVakjJlWlfDmmy6OiwyYa4wdB/view?usp=drive_link)"**

## Task

Below is an overview of the steps you are expected to complete to meet the project objectives:

A) Data Understanding and Preprocessing:

- Load the dataset and understand its structure.
- Identify and handle missing values.
- Correct data types and format text fields.
- Address skewed distributions for numerical features.
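
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the real dataset: the values, the `basic_clean` helper, the choice of median imputation, and the `log1p` transform are all assumptions you should revisit once you have inspected the actual data.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics a few columns from the data dictionary.
raw = pd.DataFrame({
    "Age": [25, 40, None, 58],
    "Annual Income": [12000.0, 45000.0, 300000.0, None],
    "Credit Score": [610.0, None, 720.0, 680.0],
    "Policy Start Date": ["2021-03-15", "2019-07-01", "not a date", "2022-11-30"],
})


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fix dtypes, impute numeric gaps, reduce skew."""
    out = df.copy()
    # Parse the text dates; strings that cannot be parsed become NaT.
    out["Policy Start Date"] = pd.to_datetime(out["Policy Start Date"], errors="coerce")
    # Fill missing numeric values with each column's median (one possible choice).
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # log1p compresses the long right tail of a skewed, non-negative feature.
    out["Annual Income (log)"] = np.log1p(out["Annual Income"])
    return out


cleaned = basic_clean(raw)
print(cleaned.dtypes)
print(cleaned.isna().sum())
```

On the real file you would start from `pd.read_csv(...)` and check `df.info()` and `df.isna().mean()` before deciding on an imputation strategy per column.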

B) Exploratory Data Analysis (EDA):

- Perform univariate, bivariate, and multivariate analysis.
- Identify correlations and trends that impact Premium Amount.
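
As a starting point for the EDA, the sketch below computes a univariate summary, the correlation of each numeric feature with the target, and a per-category breakdown. The data is synthetic and the relationship between `Age` and `Premium Amount` is invented purely so the example has something to find; it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the Age effect on the target is made up.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Age": rng.integers(18, 65, n),
    "Health Score": rng.uniform(0, 100, n),
    "Policy Type": rng.choice(["Basic", "Comprehensive", "Premium"], n),
})
df["Premium Amount"] = 500 + 10 * df["Age"] + rng.normal(0, 50, n)

# Univariate: summary statistics and skewness of the target.
summary = df["Premium Amount"].describe()
skewness = df["Premium Amount"].skew()

# Bivariate: correlation of every numeric feature with the target.
numeric = df.select_dtypes(include="number")
corr_with_target = (
    numeric.corr()["Premium Amount"]
    .drop("Premium Amount")
    .sort_values(ascending=False)
)

# Categorical vs. target: compare premium statistics across policy types.
by_policy = df.groupby("Policy Type")["Premium Amount"].agg(["mean", "median", "count"])

print(summary)
print("skew:", skewness)
print(corr_with_target)
print(by_policy)
```

The same tables feed naturally into Seaborn plots such as a heatmap of `numeric.corr()` or a box plot of the target per category.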

C) Feature Engineering:

- Encode categorical variables.
- Generate new features (such as the number of years since Policy Start Date).
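
A minimal sketch of both feature engineering steps, using a three-row made-up frame. The derived column name `Years Since Policy Start` and the fixed `reference_date` are assumptions chosen for reproducibility; in practice you might use the data extraction date instead.

```python
import pandas as pd

# Made-up rows; the dates are assumed to be parsed already.
df = pd.DataFrame({
    "Policy Type": ["Basic", "Premium", "Comprehensive"],
    "Smoking Status": ["Yes", "No", "No"],
    "Policy Start Date": pd.to_datetime(["2020-01-01", "2022-06-30", "2023-12-31"]),
})

# Assumed reference date; fixing it keeps the feature stable between runs.
reference_date = pd.Timestamp("2024-12-31")
df["Years Since Policy Start"] = (
    (reference_date - df["Policy Start Date"]).dt.days / 365.25
)

# One-hot encode the categorical columns into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["Policy Type", "Smoking Status"], dtype=int)

print(encoded.columns.tolist())
print(encoded["Years Since Policy Start"].round(2).tolist())
```

`pd.get_dummies` is convenient for exploration; for a model that must score unseen data, scikit-learn's `OneHotEncoder` inside a pipeline is usually the safer choice because it remembers the categories seen during training.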

D) Model Development:

- Split the dataset into training and testing sets.
- Experiment with different regression algorithms to determine the best one. Explore as many as you see fit; you're the data scientist here.
- Evaluate models using metrics like MAE, MSE, and R².
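
The split, fit, and evaluate loop might look like the sketch below. It uses `make_regression` to generate synthetic data in place of the prepared insurance features, and the two models are only examples; swap in whichever regressors you want to compare.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the prepared feature matrix.
X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=42)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each candidate and score it on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because the synthetic target here is linear by construction, the linear model will look strong; do not expect the same ranking on the real data.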

E) Model Tuning and Optimization:

- Use techniques like hyperparameter tuning (Grid Search or Random Search) to improve model performance.
- Address overfitting or underfitting as necessary.
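
One way to approach tuning is a randomized search with cross-validation, followed by a comparison of training and test error as a rough overfitting check. The search space, `n_iter`, and the choice of Random Forest below are illustrative assumptions, and the data is again synthetic.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the prepared insurance features.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative search space; widen or narrow it for the real problem.
param_distributions = {
    "n_estimators": randint(20, 80),
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                            # number of sampled configurations
    cv=3,                                # 3-fold cross-validation
    scoring="neg_mean_absolute_error",   # higher (closer to 0) is better
    random_state=0,
)
search.fit(X_train, y_train)

# A large gap between train and test error is a sign of overfitting.
train_mae = mean_absolute_error(y_train, search.predict(X_train))
test_mae = mean_absolute_error(y_test, search.predict(X_test))

print("best params:", search.best_params_)
print("cv MAE:", -search.best_score_)
print("train MAE:", train_mae, "test MAE:", test_mae)
```

`GridSearchCV` follows the same pattern with a `param_grid` of explicit value lists; it is exhaustive, so it becomes slow quickly on a dataset of this size.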

F) Interpretation and Insights:

- Analyze feature importance to understand the drivers of insurance premiums.
- Provide actionable insights for stakeholders.
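
Permutation importance is one model-agnostic way to analyze feature importance: it measures how much the test score drops when a single feature's values are shuffled. In the sketch below the data is synthetic, and the target formula is invented so that `Previous Claims` matters most and `Vehicle Age` not at all; treat it as a demonstration of the technique only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic features; the target relationship below is made up.
rng = np.random.default_rng(1)
n = 600
X = pd.DataFrame({
    "Age": rng.integers(18, 65, n),
    "Previous Claims": rng.poisson(1.0, n),
    "Vehicle Age": rng.integers(0, 20, n),
})
y = 300 + 200 * X["Previous Claims"] + 5 * X["Age"] + rng.normal(0, 20, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Shuffle each feature on the held-out data and record the drop in R².
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=1
)
importances = pd.Series(
    result.importances_mean, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Tree-based models also expose `feature_importances_`, but that measure is computed on training data and can favor high-cardinality features, so it is worth cross-checking against a permutation-based ranking.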

## Deliverables

- An EDA Jupyter notebook covering data cleaning steps, visualizations, and insights (2 weeks). Format: Jupyter notebook.
- An organized Jupyter notebook detailing the remaining project phases: feature engineering, model development, training, and evaluation (2 weeks). Format: Jupyter notebook.
- Documentation of the entire workflow, including challenges faced and solutions implemented (2 weeks). Format: Microsoft Word document or PDF.
- A PowerPoint presentation highlighting key insights and recommendations for SecureLife Insurance Co. (2 weeks). Format: PPT file.

Timeline = 8 weeks.

## Tools, Frameworks, Libraries, and Techniques for the Project

**Programming Language**

Python: For all data processing, analysis, and machine learning tasks.

**Data Handling and Preprocessing**

- Pandas: For data manipulation and cleaning.
- NumPy: For numerical computations.

**Exploratory Data Analysis (EDA)**

- Matplotlib: For creating visualizations like line charts, bar plots, and histograms.
- Seaborn: For advanced and visually appealing statistical graphics (e.g., correlation heatmaps, pair plots).
- Plotly: For interactive visualizations to explore the data more dynamically.

**Feature Engineering and Preprocessing**

- Scikit-learn:
  - For encoding categorical variables (e.g., OneHotEncoder, LabelEncoder).
  - For imputing missing values (e.g., SimpleImputer).
  - For scaling and transforming features (e.g., StandardScaler, MinMaxScaler).
- `datetime` module (Python standard library): For handling and transforming date-related features (e.g., Policy Start Date).

**Machine Learning Models**

- Scikit-learn: For building and evaluating regression models, including:
  - Linear Regression
  - Random Forest Regressor
  - Gradient Boosting Regressor (e.g., using HistGradientBoostingRegressor), etc.
- XGBoost: For advanced gradient-boosting models, especially when performance is a priority.
- LightGBM: For fast and efficient gradient-boosting on large datasets.

**Model Evaluation**

Scikit-learn Metrics:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- R-squared (R²)
- Residual plots for model diagnostics

**Hyperparameter Tuning**

- GridSearchCV: For exhaustive parameter search.
- RandomizedSearchCV: For quicker, randomized parameter search.

**Data Pipeline and Workflow Automation**

- Jupyter Notebook: For documenting and running the entire workflow interactively.
- Pipelines in Scikit-learn: For creating robust and reusable data processing pipelines.
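
A scikit-learn pipeline that bundles imputation, scaling, encoding, and the model could be sketched as follows. The column subset, the synthetic values, and the choice of `LinearRegression` as the final step are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic frame using a few column names from the data dictionary.
rng = np.random.default_rng(7)
n = 200
X = pd.DataFrame({
    "Age": rng.integers(18, 65, n).astype(float),
    "Credit Score": rng.integers(300, 850, n).astype(float),
    "Location": rng.choice(["Urban", "Suburban", "Rural"], n),
    "Policy Type": rng.choice(["Basic", "Comprehensive", "Premium"], n),
})
X.loc[X.index[::10], "Credit Score"] = np.nan   # inject some missing values
y = 400 + 8 * X["Age"] + rng.normal(0, 30, n)   # made-up target

numeric_cols = ["Age", "Credit Score"]
categorical_cols = ["Location", "Policy Type"]

# Numeric branch: median imputation, then standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
print("R2 on held-out data:", pipeline.score(X_test, y_test))
```

Fitting the preprocessing inside the pipeline means imputation statistics and scaling parameters are learned from the training split only, which avoids leaking test information during cross-validation or tuning.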

**Natural Language Processing (Optional - For Textual Features)**

NLTK or spaCy: For preprocessing textual features like Customer Feedback.

**Documentation and Reporting**

- Markdown: For documenting progress in Jupyter Notebooks.
- PowerPoint or Google Slides: For creating presentations summarizing insights.
- Excel: For sharing processed datasets or results with stakeholders.

**Deployment (Optional)**

Streamlit: For building a simple user interface to interact with the model.

**P.S. While deployment is optional, imagine the smile on your line manager's face when you go the extra mile and give them something they can use in real time. You just might get promoted in the next review season.**

**NOTE:**

- You are a data scientist, and SecureLife Insurance Co. has staked its ability to stand out in the market on your capacity to do a good job. In light of this, approach the project with that mindset.

- Part of your role as a data scientist is to stay on top of the ever-evolving trends in the data space, so be ready to conduct thorough research and pick up new tools and techniques along the way to deliver excellent work. It is your job to ask clarifying questions, search the internet to gain more domain knowledge, and build technical expertise in new tools.

- Due to the time-sensitive nature of this project, your line manager has given you only **7 weeks to turn in your report and deliverables.** Get on with it quickly, and good luck!