## **Problem statement:**

_The objective is to develop a predictive model that can accurately estimate health insurance charges for individuals, based on their demographic and health-related information._

**Task**

_The task is to use the data provided to achieve the above objective._

### Q 1.	Define an approach you would take to solve the problem and document it.

_This is a supervised machine learning (regression) problem. I will use the following approach to achieve the above objective:_

### Import relevant packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

### Get Data

In [3]:
df = pd.read_csv('insurance.csv')

### (a). Exploratory Data Analysis

Exploratory data analysis examines the dataset's structure, features, and distributions. It involves summary statistics, visualizations, and checks for missing values and outliers.
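The missing-value and outlier checks mentioned above could be sketched as follows. This is a minimal illustration rather than the notebook's actual method: `sample` is a small stand-in built from the rows displayed in this notebook, and the helper `iqr_outlier_counts` with its 1.5 × IQR rule is an assumption introduced here, since the notebook does not state which outlier rule it uses.

```python
import pandas as pd

# Small sample copied from the rows displayed in this notebook; in the
# notebook itself `df` would come from pd.read_csv('insurance.csv').
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 50, 18, 18, 21],
    "sex": ["female", "male", "male", "male", "male",
            "male", "female", "female", "female"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880,
            30.970, 31.920, 36.850, 25.800],
    "children": [0, 1, 3, 0, 0, 3, 0, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest",
               "northwest", "northeast", "southeast", "southwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520,
                10600.54830, 2205.98080, 1629.83350, 2007.94500],
})

NUMERIC_COLUMNS = ["age", "bmi", "children", "charges"]


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column.

    Hypothetical helper: the 1.5 * IQR rule is one common convention,
    not something the notebook prescribes.
    """
    numeric = frame[NUMERIC_COLUMNS]
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Number of missing entries per column.
missing_per_column = sample.isna().sum()

# Number of IQR-rule outliers per numeric column.
outliers_per_column = iqr_outlier_counts(sample)

print(missing_per_column)
print(outliers_per_column)
```

On the full dataset the same two calls would be applied to `df` instead of `sample`.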

In [4]:
df

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


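_There are no missing values: every column has 1338 non-null entries. The `charges` column holds a monetary amount, so its name should state the currency._

### Charges is a monetary amount, so assume it is in dollars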

In [29]:
df.rename(columns={'charges': 'charges ($)'}, inplace=True)

In [31]:
# Check that the column is now named 'charges ($)'
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          1338 non-null   int64  
 1   sex          1338 non-null   object 
 2   bmi          1338 non-null   float64
 3   children     1338 non-null   int64  
 4   smoker       1338 non-null   object 
 5   region       1338 non-null   object 
 6   charges ($)  1338 non-null   object 
dtypes: float64(1), int64(2), object(4)
memory usage: 73.3+ KB
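The second `df.info()` output lists `charges ($)` as `object`, whereas the first listed `charges` as `float64`. Renaming a column does not change its dtype, so one possible explanation (an assumption, since the cell numbers suggest the notebook was run out of order) is that a cell not shown here stored the values as strings. A minimal sketch of checking for this and restoring a numeric dtype, using a hypothetical three-row stand-in for `df`:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`: the charges are stored as
# strings here on purpose, to mimic the `object` dtype reported by df.info().
sample = pd.DataFrame(
    {"charges ($)": ["16884.92400", "1725.55230", "4449.46200"]}
)

# Convert the column back to numbers. errors="coerce" turns any entry that
# cannot be parsed into NaN instead of raising an exception.
sample["charges ($)"] = pd.to_numeric(sample["charges ($)"], errors="coerce")

# Count entries that failed to parse, so silent data loss is visible.
unparsed = int(sample["charges ($)"].isna().sum())

print(sample.dtypes)
print("unparsed entries:", unparsed)
```

If the stored strings contained currency symbols or thousands separators, those would need to be stripped before the conversion.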


_**Data Dictionary**_

1. **age**: 
   - Description: The age of the individual.
   - Data Type: Integer (int64)
   - Notes: This column contains the age of each individual in the dataset.

2. **sex**:
   - Description: The gender of the individual.
   - Data Type: Object (string)
   - Notes: This column contains categorical data indicating the gender of each individual.

3. **bmi**:
   - Description: Body Mass Index (BMI) of the individual.
   - Data Type: Float (float64)
   - Notes: BMI is a measure of body fat based on height and weight. This column contains the BMI values for each individual.

4. **children**:
   - Description: The number of children/dependents covered by the insurance.
   - Data Type: Integer (int64)
   - Notes: This column indicates the number of children or dependents covered under the insurance policy for each individual.

5. **smoker**:
   - Description: Indicates whether the individual is a smoker or not.
   - Data Type: Object (string)
   - Notes: This column contains categorical data indicating whether the individual is a smoker (yes) or not (no).

6. **region**:
   - Description: The region where the individual resides.
   - Data Type: Object (string)
   - Notes: This column contains categorical data indicating the region where each individual resides.

7. **charges**:
   - Description: The health insurance charges for the individual.
   - Data Type: Float (float64)
   - Notes: This column contains the health insurance charges for each individual, which is the target variable.

_Source: ChatGPT_
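
The categorical entries in the data dictionary (`sex`, `smoker`, `region`) can be checked against the data by listing the distinct values in each column. A minimal sketch, assuming a small sample copied from the rows displayed in this notebook; the names `sample`, `categorical_columns`, and `levels` are introduced here for illustration:

```python
import pandas as pd

# Categorical columns from the rows displayed in this notebook; on the full
# dataset the same loop would run over `df` instead of `sample`.
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male",
            "male", "female", "female", "female"],
    "smoker": ["yes", "no", "no", "no", "no", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest",
               "northwest", "northeast", "southeast", "southwest"],
})

categorical_columns = ["sex", "smoker", "region"]

# Map each categorical column to its sorted list of distinct values.
levels = {col: sorted(sample[col].unique()) for col in categorical_columns}

for col, values in levels.items():
    print(f"{col}: {values}")

# Frequency of each category, e.g. how many smokers vs non-smokers.
smoker_counts = sample["smoker"].value_counts()
print(smoker_counts)
```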

In [7]:
df.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


**Observation**

_The age of individuals in the dataset ranges from 18 to 64 years, with an average of approximately 39 years._

_BMI ranges from 15.96 to 53.13, with an average of approximately 30.66. The mean is only slightly above the median (30.40), which suggests at most a mild positive skew._

_The number of children covered by insurance ranges from 0 to 5, with an average of approximately 1.09. The median is 1 and the first quartile is 0, so at least half of the individuals have 0 or 1 child covered._

_Health insurance charges range from $1,121.87 to $63,770.43, with an average of approximately $13,270.42. The mean is well above the median ($9,382.03), which indicates a positively skewed distribution with a long right tail._
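
The skewness remarks above are read off the gap between mean and median in `df.describe()`. pandas can also report sample skewness directly through `DataFrame.skew()`. A minimal sketch, assuming a small sample copied from the rows displayed in this notebook, so the numbers it prints describe only those nine rows and not the full dataset:

```python
import pandas as pd

# Numeric columns from the rows displayed in this notebook; on the full
# dataset the same summary would be built from `df` instead of `sample`.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 50, 18, 18, 21],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880,
            30.970, 31.920, 36.850, 25.800],
    "children": [0, 1, 3, 0, 0, 3, 0, 0, 0],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520,
                10600.54830, 2205.98080, 1629.83350, 2007.94500],
})

# Side-by-side view of mean, median, and sample skewness per column.
# A mean above the median together with a positive skew value points to
# a longer right tail.
summary = pd.DataFrame({
    "mean": sample.mean(),
    "median": sample.median(),
    "skew": sample.skew(),
})

print(summary)
```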