<h2 align="center" style="color:Green">Data Segmentation</h2>

### Tackling Age-Specific Errors in Premium Estimation with Data Segmentation and Model Customization

#### Problem Statement:
In the earlier analysis, we observed that the **premium estimation error** is significantly higher in the **18 to 25 age group**. To address this, we segment the dataset by age and build two distinct models:
1. **Model 1:** For individuals aged **25 or younger** (the 18 to 25 group).
2. **Model 2:** For individuals aged **over 25**.

Additionally, for individuals aged 25 or younger, we have an extra column, **genetical risk** (`Genetical_Risk`), which could improve accuracy for this age group. Our hypothesis is that adding this feature for the younger group will reduce its estimation error.

#### Approach:

1. **Data Segmentation:**
   We will segment the data into two subsets based on the `Age` column:
   - **Age <= 25:** All rows where the individual is 25 or younger.
   - **Age > 25:** All rows where the individual is older than 25.

   This segmentation will allow us to build separate models for each group, addressing the specific patterns and errors observed in the premium estimation.

2. **Feature Engineering for Age < 25:**
   - The **25 or younger** group has an additional feature, **genetical risk** (the `Genetical_Risk` column), which will be included only in the model for this group.
   - **Genetical risk** may add predictive power for estimating the premium if genetic factors influence the risk profile of younger individuals.

3. **Modeling Strategy:**
   - **For Age <= 25:** A regression model (e.g., **XGBoost**, **Random Forest**, or **Linear Regression**) will be trained on all relevant features, including **genetical risk**, to predict the premium for this age group. The aim is to reduce the estimation error observed in this segment.
   - **For Age > 25:** A separate regression model will be trained on the remaining data. It will use the standard features only and will not include the **genetical risk** column, which is not available for individuals older than 25.

4. **Model Evaluation:**
   After training both models, we will evaluate their performance separately, using:
   - **Mean Squared Error (MSE):** To quantify the prediction error (lower is better).
   - **R² Score:** To assess how much of the variance in the target variable (annual premium amount) the model explains.
   - **Cross-validation:** To check that the models generalize and to detect overfitting.

   We will compare the errors of the two models and check whether including the **genetical risk** feature improves performance for individuals aged 25 or younger.

5. **Expected Outcome:**
   By building and evaluating separate models for these two age groups, we expect to achieve the following:
   - **Improved accuracy for the 25-or-younger group** by using the additional **genetical risk** feature.
   - **Better model performance overall** by addressing the specific error patterns observed in each age group.
   - **Reduction in the premium estimation error** for the 18 to 25 age group, the segment where the earlier analysis found the largest errors.
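
The approach above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual modeling code: the data is randomly generated, `evaluate_segment` is a hypothetical helper, `LinearRegression` stands in for whichever regressor is finally chosen, and the split uses the `Age <= 25` boundary from the segmentation cell.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic, illustrative data only: NOT the real premiums dataset.
demo = pd.DataFrame({
    "Age": rng.integers(18, 60, size=n),
    "Income_Lakhs": rng.integers(1, 50, size=n),
})
demo["Annual_Premium_Amount"] = (
    3000 + 200 * demo["Age"] + 50 * demo["Income_Lakhs"]
    + rng.normal(0, 500, size=n)
)


def evaluate_segment(segment, feature_cols, target_col="Annual_Premium_Amount"):
    """Hypothetical helper: fit one model on one segment and report
    hold-out MSE, hold-out R² and the mean 5-fold cross-validated R²."""
    X, y = segment[feature_cols], segment[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    return {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
        "cv_r2_mean": cv_r2.mean(),
    }


features = ["Age", "Income_Lakhs"]
results = {
    "young": evaluate_segment(demo[demo.Age <= 25], features),
    "rest": evaluate_segment(demo[demo.Age > 25], features),
}
print(pd.DataFrame(results).T)
```

Reporting the same three numbers for each segment keeps the two models directly comparable. On the real data, the feature list for the younger segment would also include `Genetical_Risk`.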

#### Conclusion:
By splitting the dataset into two age-based segments and training a separate model for each, we can address the specific challenges in premium estimation. The extra **genetical risk** feature for individuals aged 25 or younger is expected to improve model performance and reduce errors in this segment. This approach shows how segmentation and tailored feature engineering can improve the accuracy of machine learning models, especially in domains like insurance pricing.

In [3]:
import pandas as pd
df = pd.read_excel("premiums.xlsx")
df.head()

| | Age | Gender | Region | Marital_status | Number Of Dependants | BMI_Category | Smoking_Status | Employment_Status | Income_Level | Income_Lakhs | Medical History | Insurance_Plan | Annual_Premium_Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26 | Male | Northwest | Unmarried | 0 | Normal | No Smoking | Salaried | <10L | 6 | Diabetes | Bronze | 9053 |
| 1 | 29 | Female | Southeast | Married | 2 | Obesity | Regular | Salaried | <10L | 6 | Diabetes | Bronze | 16339 |
| 2 | 49 | Female | Northeast | Married | 2 | Normal | No Smoking | Self-Employed | 10L - 25L | 20 | High blood pressure | Silver | 18164 |
| 3 | 30 | Female | Southeast | Married | 3 | Normal | No Smoking | Salaried | > 40L | 77 | No Disease | Gold | 20303 |
| 4 | 18 | Male | Northeast | Unmarried | 0 | Overweight | Regular | Self-Employed | > 40L | 99 | High blood pressure | Silver | 13365 |


In [4]:
df.shape

(50000, 13)

In [5]:
df.Age.describe()

count    50000.000000
mean        34.593480
std         15.000437
min         18.000000
25%         22.000000
50%         31.000000
75%         45.000000
max        356.000000
Name: Age, dtype: float64

In [6]:
# Split at the 18 to 25 boundary: 25-year-olds go to the younger segment
df_young = df[df.Age <= 25]
df_rest = df[df.Age > 25]

In [7]:
df_young.shape, df_rest.shape

((20096, 13), (29904, 13))
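
The two shapes above sum to 50,000 rows (20,096 + 29,904), matching the full dataset, so every row landed in exactly one segment. The same check can be written explicitly. The sketch below uses a small hypothetical frame (`df_demo`) rather than the real data:

```python
import pandas as pd

# Hypothetical miniature stand-in for the premiums data;
# only the Age column matters for this check.
df_demo = pd.DataFrame({"Age": [18, 22, 25, 26, 31, 49]})

young = df_demo[df_demo.Age <= 25]
rest = df_demo[df_demo.Age > 25]

# The segments must cover the data without overlapping.
assert len(young) + len(rest) == len(df_demo)
assert young.index.intersection(rest.index).empty

# Age 25 sits on the boundary and belongs to the younger segment.
boundary_in_young = bool((young.Age == 25).any())
print(len(young), len(rest), boundary_in_young)
```

Note that a row with a missing `Age` would fail both comparisons and drop out of both segments, which the row-count assertion would catch.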

In [8]:
df_young.to_excel("premiums_young.xlsx", index=False)
df_rest.to_excel("premiums_rest.xlsx", index=False)

In [10]:
import pandas as pd
# Load the younger segment, which now carries the extra Genetical_Risk column
df = pd.read_excel("premiums_young_with_gr.xlsx")
df.head(2)

| | Age | Gender | Region | Marital_status | Number Of Dependants | BMI_Category | Smoking_Status | Employment_Status | Income_Level | Income_Lakhs | Medical History | Insurance_Plan | Annual_Premium_Amount | Genetical_Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | Male | Northeast | Unmarried | 0 | Overweight | Regular | Self-Employed | > 40L | 99 | High blood pressure | Silver | 13365 | 4 |
| 1 | 22 | Female | Northwest | Unmarried | 0 | Underweight | No Smoking | Freelancer | <10L | 3 | No Disease | Silver | 11050 | 3 |
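
With `Genetical_Risk` available for the younger segment, one way to test the hypothesis is to compare cross-validated error with and without that column. The sketch below is illustrative only: the data is synthetic, the 0 to 5 range for `Genetical_Risk` and its strong effect on the premium are assumptions built into the generator, and `cv_mse` is a hypothetical helper. The improvement it shows therefore holds by construction and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 300

# Synthetic younger segment; NOT the contents of premiums_young_with_gr.xlsx.
young_demo = pd.DataFrame({
    "Age": rng.integers(18, 26, size=n),
    "Income_Lakhs": rng.integers(1, 50, size=n),
    "Genetical_Risk": rng.integers(0, 6, size=n),  # assumed range 0-5
})
# Assumption for illustration: the premium depends strongly on Genetical_Risk.
young_demo["Annual_Premium_Amount"] = (
    5000
    + 100 * young_demo["Age"]
    + 20 * young_demo["Income_Lakhs"]
    + 800 * young_demo["Genetical_Risk"]
    + rng.normal(0, 300, size=n)
)


def cv_mse(frame, feature_cols, target_col="Annual_Premium_Amount"):
    """Hypothetical helper: mean 5-fold cross-validated MSE of a linear model."""
    scores = cross_val_score(
        LinearRegression(),
        frame[feature_cols],
        frame[target_col],
        cv=5,
        scoring="neg_mean_squared_error",  # scikit-learn returns negated MSE
    )
    return -scores.mean()


base_features = ["Age", "Income_Lakhs"]
mse_without = cv_mse(young_demo, base_features)
mse_with = cv_mse(young_demo, base_features + ["Genetical_Risk"])
print(f"MSE without Genetical_Risk: {mse_without:.0f}")
print(f"MSE with Genetical_Risk:    {mse_with:.0f}")
```

Running the same comparison on the real younger segment, with the categorical columns encoded first, would show whether the extra feature actually reduces the error.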
