# Step 3: Training Data Preparation


## Insights and Findings

### 1. Data Filtering
- **Age Filtering**: We removed customers under 18 years old to focus on adult drivers, improving the relevance of the data for model training.
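
A minimal sketch of the age filter described above, using pandas. The column name `Age` and the sample rows are assumptions for illustration; the actual column names in the dataset may differ.

```python
import pandas as pd


def filter_adults(customers: pd.DataFrame, age_col: str = "Age", min_age: int = 18) -> pd.DataFrame:
    """Keep rows whose age is at least `min_age`.

    Rows with a missing age are also dropped, because NaN >= 18 evaluates to False.
    """
    mask = customers[age_col] >= min_age
    return customers.loc[mask].reset_index(drop=True)


# Illustrative data only (not taken from the real dataset).
customers = pd.DataFrame({"CUST_ID": [1, 2, 3, 4], "Age": [17, 18, 45, None]})
adults = filter_adults(customers)
```

Whether rows with a missing age should be dropped or imputed instead is a design choice; this sketch drops them.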

### 2. Key Insights

- **Claim Costs by Age**: The 43-47 age group has the highest average claim payouts, followed by the 33-37 and 23-27 groups.

- **Regional Age Variation**: Significant age variation exists across regions, with Utah showing the highest diversity.

- **Household Size by State**: States like Arkansas and Massachusetts have the largest households, which may correlate with higher vehicle usage.

- **Customer Demographics**: The average customer age is 31.8 years.

- **Data Quality**: We addressed data quality issues by filtering out unrealistic ages.
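
The claim-cost comparison by age group could be reproduced along these lines. This is a sketch under assumptions: the column names `Age` and `Claim_Amount` are hypothetical, the five-year bins starting at 18 are inferred from the group labels above (23-27, 33-37, 43-47), and the sample values are made up to show the mechanics rather than the real results.

```python
import pandas as pd


def mean_claim_by_age_group(
    df: pd.DataFrame,
    age_col: str = "Age",              # hypothetical column name
    claim_col: str = "Claim_Amount",   # hypothetical column name
    start: int = 18,
    width: int = 5,
    stop: int = 98,
) -> pd.Series:
    """Average claim amount per age band, highest first."""
    edges = list(range(start, stop + 1, width))               # 18, 23, 28, ...
    labels = [f"{lo}-{lo + width - 1}" for lo in edges[:-1]]  # "18-22", "23-27", ...
    # right=False makes each bin include its lower edge: [18, 23), [23, 28), ...
    groups = pd.cut(df[age_col], bins=edges, right=False, labels=labels)
    means = df.groupby(groups, observed=True)[claim_col].mean()
    return means.sort_values(ascending=False)


# Illustrative data only.
sample = pd.DataFrame(
    {"Age": [25, 26, 35, 45, 46], "Claim_Amount": [1000, 2000, 3000, 5000, 7000]}
)
ranking = mean_claim_by_age_group(sample)
```

`ranking.index` then lists the age bands ordered by average claim amount.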


### 2. Strategy for Handling Missing and Duplicate Data

- **Missing Data**:
  - **Imputation**: For missing numerical data, such as `Income`, we used median imputation to fill in gaps, ensuring that outliers did not skew the results. For categorical data, such as `Marital Status` and `Employment Type`, we used mode imputation to replace missing values with the most common category.
  - **Removal**: In cases where entire columns or rows were mostly empty (e.g., certain unnamed columns), we removed them to streamline the dataset. Additionally, any records with critical missing information that could not be reliably imputed were excluded from the analysis.

- **Duplicate Data**:
  - **Deduplication**: We checked for and removed duplicate rows based on unique identifiers like `CUST_ID` and `CAR_ID` to ensure each customer and car was represented only once in the dataset. This helped prevent any biases or distortions in the analysis and model training.
  - **Merge Handling**: During data merging, we carefully managed potential duplicates by aligning on primary keys (`HH_ID`, `CUST_ID`, and `CAR_ID`) to maintain data integrity across combined datasets.
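
The cleaning steps above can be sketched in pandas as follows. The helper name `impute_and_dedupe`, the 90% missing-value threshold, and the sample rows are assumptions for illustration. The sketch also removes duplicates before imputing so that repeated rows do not influence the median or mode; the original pipeline may have used a different order.

```python
import pandas as pd


def impute_and_dedupe(df, numeric_cols, categorical_cols, key_cols, max_missing_share=0.9):
    """Drop mostly-empty columns, deduplicate on keys, then impute missing values."""
    out = df.copy()

    # Remove columns whose share of missing values reaches the (assumed) threshold.
    mostly_empty = out.columns[out.isna().mean() >= max_missing_share]
    out = out.drop(columns=mostly_empty)

    # Rows without a key cannot be matched or reliably imputed, so exclude them.
    out = out.dropna(subset=key_cols)

    # Keep one row per key combination.
    out = out.drop_duplicates(subset=key_cols, keep="first")

    # Median imputation for numeric columns (robust to outliers).
    for col in numeric_cols:
        if col in out.columns:
            out[col] = out[col].fillna(out[col].median())

    # Mode imputation for categorical columns.
    for col in categorical_cols:
        if col in out.columns:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])

    return out.reset_index(drop=True)


# Illustrative data only.
raw = pd.DataFrame(
    {
        "CUST_ID": [1, 2, 2, 3, 4],
        "CAR_ID": [10, 20, 20, 30, 40],
        "Income": [40000.0, None, None, 60000.0, 1000000.0],
        "Marital Status": ["Married", "Single", "Single", None, "Single"],
        "Unnamed: 7": [None] * 5,
    }
)
cleaned = impute_and_dedupe(
    raw,
    numeric_cols=["Income"],
    categorical_cols=["Marital Status"],
    key_cols=["CUST_ID", "CAR_ID"],
)
```

For the merge step, pandas' `DataFrame.merge` accepts a `validate` argument (for example `validate="many_to_one"`) that raises an error when a key expected to be unique turns out to be duplicated, which is one way to catch accidental row multiplication.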


### 3. Additional Features to Consider for Insurance Models

From an insurance standpoint, adding the following features could significantly enhance the predictive power and relevance of the models:

- **Vehicle Safety Features**:
  - **Advanced Safety Systems**: Include whether the vehicle has advanced safety features like collision detection, lane assist, or automated braking. These can be strong predictors of lower accident rates and claim costs.
  - **Crash Test Ratings**: Incorporate the vehicle’s crash test ratings from recognized safety organizations. Higher safety ratings could correlate with lower risk.

- **Driver History**:
  - **Previous Claims**: Add a feature indicating the customer’s past claim history. A history of frequent or costly claims might suggest higher future risk.
  - **Driving Record**: Include data on traffic violations, accidents, or points on the driver’s license. A poor driving record could be a strong indicator of risk.

- **Geographic Risk Factors**:
  - **Accident Rate by Region**: Incorporate regional accident rates or crime statistics. Areas with higher accident rates might warrant higher premiums.
  - **Weather Conditions**: Add features related to the typical weather conditions in the customer’s region (e.g., frequency of snow, rain). Poor weather conditions can increase accident risk.

- **Vehicle Usage**:
  - **Annual Mileage**: A feature indicating the average annual mileage driven. Higher mileage could correlate with increased exposure to risk.
  - **Primary Use of Vehicle**: Whether the vehicle is used for personal, business, or mixed purposes. Business-use vehicles might have different risk profiles than those used strictly for personal purposes.

- **Customer Demographics**:
  - **Occupation and Income Stability**: While income is already considered, adding features that reflect income stability (e.g., job type, employment history) could improve risk assessment, especially for predicting payment consistency or the likelihood of policy lapses.
  - **Family Structure**: Include whether the customer has dependents or is part of a multi-car household. These factors could influence driving behavior and risk.

- **Insurance Policy Details**:
  - **Policy Tenure**: How long the customer has been with the insurance company. Longer tenure might indicate loyalty and lower risk of fraud or policy shopping.
  - **Deductible Amount**: Include the deductible chosen by the customer. Higher deductibles might correlate with lower claim frequency but higher claim amounts when they do occur.

Adding these features could provide a more comprehensive view of each customer’s risk profile, leading to more accurate pricing and better risk management.


### 4. Features to Consider Removing from the Final Dataset

While most features in the dataset provide valuable information, there are a few that may be redundant, irrelevant, or potentially misleading. Here are some features that could be considered for removal:

- **Unnamed Columns**:
  - **Reason**: Unnamed columns typically result from errors during data import and often contain only `NaN` values or irrelevant information. These columns do not contribute to the analysis and should be removed to clean up the dataset.

- **Phone Number**:
  - **Reason**: The phone number is not typically relevant for risk assessment or predictive modeling. It’s more of an administrative detail that doesn’t provide any predictive value and could be a privacy concern.

- **ZIP Code**:
  - **Reason**: While the region or state can be a valuable predictor, the ZIP code might be too granular and could introduce noise rather than add value, especially if region-level data (like state or city) is already included. Additionally, using ZIP code could raise privacy concerns.

- **CUST_ID, CAR_ID, HH_ID**:
  - **Reason**: These unique identifiers are essential for data processing and merging but do not hold predictive value. After all necessary merges and transformations are complete, these IDs can be removed from the dataset to reduce dimensionality.

- **Referral Source**:
  - **Reason**: Unless there's a specific reason to investigate how customers were referred to the insurance company (e.g., a marketing analysis), this feature might not be relevant for predicting risk or claim costs. If it's not correlated with the target variables, it could be safely removed.

- **Active HH (Active Household)**:
  - **Reason**: This feature is useful for filtering during the data preparation phase, but once active households have been identified and inactive ones filtered out, it may no longer be necessary. Removing it could simplify the dataset without losing valuable information.

- **Antique Vehicle**:
  - **Reason**: If the number of antique vehicles is very small or if these vehicles have very different risk profiles that warrant separate analysis, this feature might be removed or handled separately. Including it in a general model could skew results if antique vehicles behave differently than standard ones.
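
A short sketch of how the removals above might be applied once all merges are complete. The helper name `drop_non_predictive` and the column names `Phone Number` and `ZIP Code` are assumptions; `CUST_ID`, `CAR_ID`, and `HH_ID` are the identifiers named in this document. Columns prefixed with `Unnamed` follow the naming pandas typically gives to header-less columns on import.

```python
import pandas as pd


def drop_non_predictive(df: pd.DataFrame, extra_cols=()) -> pd.DataFrame:
    """Drop import-artifact columns plus any explicitly listed columns that exist."""
    unnamed = [c for c in df.columns if str(c).startswith("Unnamed")]
    extras = [c for c in extra_cols if c in df.columns and c not in unnamed]
    return df.drop(columns=unnamed + extras)


# Illustrative data only.
merged = pd.DataFrame(
    {
        "CUST_ID": [1, 2],
        "CAR_ID": [10, 20],
        "HH_ID": [100, 200],
        "Phone Number": ["555-0100", "555-0101"],  # hypothetical column name
        "ZIP Code": ["84101", "72201"],            # hypothetical column name
        "Unnamed: 0": [None, None],
        "Age": [34, 45],
        "Income": [52000.0, 61000.0],
    }
)
model_df = drop_non_predictive(
    merged, extra_cols=["CUST_ID", "CAR_ID", "HH_ID", "Phone Number", "ZIP Code"]
)
```

Features such as referral source or the antique-vehicle flag are better evaluated against the target variables before being added to `extra_cols`, since their removal is conditional in the list above.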



## Prepared by: Kevin Luzbetak
[https://luzbetak.github.io/](https://luzbetak.github.io/)