# U.S. Medical Insurance Costs
---
**Lorenzo Tomas Diez - [LinkedIn](https://www.linkedin.com/in/lorenzotomasdiez/)**

## The Dataset
---
The dataset contains the following columns:

- `age`: Age of primary beneficiary.
- `sex`: Gender of the insurance contractor.
- `bmi`: Body Mass Index, an indicator of weight relative to height.
- `children`: Number of dependents covered by health insurance.
- `smoker`: Indicates whether the beneficiary is a smoker (yes or no).
- `region`: The beneficiary's residential area.
- `charges`: Individual medical costs.

## Preparing for Data Analysis
---
Before analyzing the dataset, we load it and store the rows in a list. This keeps the data available in memory after the file has been read and closed.

In [92]:
import csv
with open('insurance.csv', newline='') as insurance_file:
  reader = csv.DictReader(insurance_file)
  dataset = []
  for row in reader:
    dataset.append(row)

print(dataset[0])

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}


## Validating the Dataset for Missing Data
---

In [93]:
def create_data_dict(dataset):
  age = []
  sex = []
  bmi = []
  children = []
  smoker = []
  region = []
  charges = []
  for row in dataset:
    if(row.get('age', None) != None):
      age.append(int(row['age']))
    if(row.get('sex', None) != None):
      sex.append(row['sex'])
    if(row.get('bmi', None) != None):
      bmi.append(float(row['bmi']))
    if(row.get('children', None) != None):
      children.append(int(row['children']))  # store as an integer, like age
    if(row.get('smoker', None) != None):
      smoker.append(row['smoker'])
    if(row.get('region', None) != None):
      region.append(row['region'])
    if(row.get('charges', None) != None):
      charges.append(float(row['charges']))
  return {
    "age":age,
    "sex":sex,
    "bmi":bmi,
    "children":children,
    "smoker":smoker,
    "region":region,
    "charges":charges
  }

data_dict = create_data_dict(dataset)

for key in data_dict:
  print("The number of data in {key} list is {list_length}".format(key=key, list_length = len(data_dict[key])))
print("The number of rows in the dataset is {dataset}".format(dataset=len(dataset)))

The number of data in age list is 1338
The number of data in sex list is 1338
The number of data in bmi list is 1338
The number of data in children list is 1338
The number of data in smoker list is 1338
The number of data in region list is 1338
The number of data in charges list is 1338
The number of rows in the dataset is 1338


No missing data was found: every column holds 1338 values, which matches the number of rows. Now we can begin analyzing.
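
One caveat: `csv.DictReader` returns an empty string for a blank field, so a check for `None` alone would not flag it. The following is a minimal sketch of a stricter check, not part of the original analysis. The helper name `count_missing` and the two-row sample are hypothetical and only illustrate the idea.

```python
import csv
import io


def count_missing(rows, columns):
    """Count values per column that are absent, None, or blank."""
    missing = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or value.strip() == "":
                missing[column] += 1
    return missing


# Hypothetical two-row sample; the second row has an empty bmi field.
sample_csv = "age,sex,bmi\n19,female,27.9\n18,male,\n"
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))

missing_counts = count_missing(sample_rows, ["age", "sex", "bmi"])
print(missing_counts)
```

Applied to the real data, this would be called as `count_missing(dataset, dataset[0].keys())`; a result of all zeros would confirm that no field is blank.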

---

**Explore Demographic Trends**: Analyze how gender, age, smoking and region influence medical insurance costs.

---

In [94]:
total_people = len(dataset)
sex_data = {}

for sex in data_dict["sex"]:
  if sex in sex_data:
    sex_data[sex]["total"] += 1
  if sex not in sex_data:
    sex_data[sex] = {
    "total": 1,
    "pct":0
  }

for key in sex_data:
  sex_data[key]["pct"] = round(sex_data[key]["total"]/total_people * 100, 2)
  print("There are {total} {key} data, and represents the {pct}% of total".format(
    total=sex_data[key]["total"],
    key=key,
    pct=sex_data[key]["pct"]
  ))
print(total_people)

There are 662 female data, and represents the 49.48% of total
There are 676 male data, and represents the 50.52% of total
1338


In [95]:
region_data = {}
for region in data_dict["region"]:
  if region in region_data:
    region_data[region]["total"] += 1
  if region not in region_data:
    region_data[region] = {
    "total": 1,
    "pct":0
  }

for key in region_data:
  region_data[key]["pct"] = round(region_data[key]["total"]/total_people * 100, 2)
  print("There are {total} people from {key} , and represents the {pct}% of total".format(
    total=region_data[key]["total"],
    key=key,
    pct=region_data[key]["pct"]
  ))

There are 325 people from southwest , and represents the 24.29% of total
There are 364 people from southeast , and represents the 27.2% of total
There are 325 people from northwest , and represents the 24.29% of total
There are 324 people from northeast , and represents the 24.22% of total


In [96]:
age_data = {
  "min":min(data_dict["age"]),
  "avg":int(sum(data_dict["age"])/total_people),
  "max":max(data_dict["age"])
}

for key in age_data:
  print("The {key} age in the population is {value} years".format(key=key, value=age_data[key]))

The min age in the population is 18 years
The avg age in the population is 39 years
The max age in the population is 64 years


In [97]:
bmi_data = {
  "min":min(data_dict["bmi"]),
  "avg":round(sum(data_dict["bmi"])/total_people, 2),
  "max":max(data_dict["bmi"])
}

for key in bmi_data:
  print("The {key} bmi in the population is {value}".format(key=key, value=bmi_data[key]))

The min bmi in the population is 15.96
The avg bmi in the population is 30.66
The max bmi in the population is 53.13


In [98]:
smokers_data = {}
for smoker in data_dict["smoker"]:
  if smoker in smokers_data:
    smokers_data[smoker]["total"] += 1
  if smoker not in smokers_data:
    smokers_data[smoker] = {
    "total": 1,
    "pct":0
  }

for key in smokers_data:
  smokers_data[key]["pct"] = round(smokers_data[key]["total"]/total_people * 100, 2)
  print("The {pct}% is {key}".format(
    key="smoker" if key == "yes" else "non-smoker",
    pct=smokers_data[key]["pct"]
  ))

The 20.48% is smoker
The 79.52% is non-smoker
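
The sex, region, and smoker tallies above all follow the same count-then-percentage pattern. As a possible simplification, the sketch below folds that pattern into one helper built on `collections.Counter`. The name `tally` and the sample list are assumptions made for illustration, not part of the original notebook.

```python
from collections import Counter


def tally(values):
    """Return {category: {"total": count, "pct": percentage}} for a list."""
    counts = Counter(values)
    n = len(values)
    return {
        category: {"total": total, "pct": round(total / n * 100, 2)}
        for category, total in counts.items()
    }


# Hypothetical sample standing in for data_dict["smoker"].
sample_smoker = ["yes", "no", "no", "no"]
smoker_summary = tally(sample_smoker)
print(smoker_summary)
```

With the real data, `tally(data_dict["region"])` should produce the same totals and percentages as the loop-based version.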


In [99]:
charges_data = {
  "min":round(min(data_dict["charges"]), 1),
  "avg":round(sum(data_dict["charges"])/total_people, 1),
  "max":round(max(data_dict["charges"]), 1)
}

for key in charges_data:
  print("The {key} charges in the population is ${value} dollars".format(key=key, value=charges_data[key]))

The min charges in the population is $1121.9 dollars
The avg charges in the population is $13270.4 dollars
The max charges in the population is $63770.4 dollars
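
The maximum charge sits far above the average, which suggests a few very large values may be pulling the mean upward. A hedged sketch of a fuller summary using the standard-library `statistics` module follows. The helper `summarize` and the sample values are hypothetical; the median and standard deviation are additions that the cells above do not compute.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "max": max(values),
    }


# Hypothetical sample with one large value, standing in for data_dict["charges"].
sample_charges = [1000.0, 2000.0, 3000.0, 4000.0, 40000.0]
charges_summary = summarize(sample_charges)
print(charges_summary)
```

In this sample the single large value lifts the mean well above the median, which is the kind of gap worth checking for in the real charges.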


---

**BMI Impact Assessment**: Investigate the correlation between BMI and insurance charges.

---

In [100]:
bmi_deviation = [x - bmi_data["avg"] for x in data_dict["bmi"]]
charges_deviation = [x - charges_data["avg"] for x in data_dict["charges"]]

bmi_squared_deviation = [x**2 for x in bmi_deviation]
charges_squared_deviation = [x**2 for x in charges_deviation]


Next, we multiply the BMI deviation by the charges deviation for each observation. These products measure how BMI and insurance charges vary together around their means.

In [101]:
deviations_product = [bmi_deviation[i] * charges_deviation[i] for i in range(len(bmi_deviation))]

Then we sum the products of the deviations and the squared deviations.

In [102]:
sum_of_product_deviations = sum(deviations_product)
sum_of_squared_deviation_bmi = sum(bmi_squared_deviation)
sum_of_squared_deviation_charges = sum(charges_squared_deviation)

Finally, we are using the formula for the Pearson correlation coefficient:

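$$
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \; \sum_i (y_i - \bar{y})^2}}
$$

Here $x_i$ is the BMI and $y_i$ the insurance charge of observation $i$, and $\bar{x}$ and $\bar{y}$ are their means. The numerator is the sum of products of deviations; the denominator is the square root of the product of the two sums of squared deviations.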

This formula gives us a value between -1 and 1, indicating the strength and direction of the linear relationship between BMI and insurance charges. A value close to 1 indicates a positive correlation, while a value close to -1 indicates a negative correlation. A value close to 0 indicates a weak or no correlation.

In [103]:
import math

correlation = sum_of_product_deviations / math.sqrt(sum_of_squared_deviation_bmi * sum_of_squared_deviation_charges)

print("Correlation between BMI and Charges: {value}".format(
  value = correlation
))

Correlation between BMI and Charges: 0.19834093906454933



The correlation coefficient of approximately 0.198 suggests a weak positive linear relationship between BMI and insurance charges in the dataset: charges tend to rise as BMI rises, but only slightly. BMI alone therefore explains little of the variation in charges, and other variables need to be considered for a fuller picture of what drives the costs.
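
The step-by-step calculation above can also be gathered into one reusable function. This is a minimal sketch rather than the notebook's own code: the name `pearson` and the sample lists are assumptions. It computes the means without rounding, whereas the cells above reuse the rounded averages stored in `bmi_data` and `charges_data`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    sum_products = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    sum_sq_x = sum(dx ** 2 for dx in dev_x)
    sum_sq_y = sum(dy ** 2 for dy in dev_y)
    return sum_products / math.sqrt(sum_sq_x * sum_sq_y)


# Hypothetical sample in which charges are exactly linear in BMI.
sample_bmi = [20.0, 25.0, 30.0, 35.0]
sample_charges = [2000.0, 2500.0, 3000.0, 3500.0]

r_linear = pearson(sample_bmi, sample_charges)
r_inverse = pearson(sample_bmi, sample_charges[::-1])
print(r_linear, r_inverse)
```

A perfectly linear sample gives a coefficient of 1 and reversing one list gives -1, which makes a quick sanity check before calling `pearson(data_dict["bmi"], data_dict["charges"])` on the real data.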

---

**Smoking Habits and Costs**: Examine the significant impact of smoking on insurance costs, and explore implications for health and financial planning.

---

In [104]:
boolean_smoker = [1 if e == "yes" else 0 for e in data_dict["smoker"]]
mean_smoker = sum(boolean_smoker)/len(boolean_smoker)

smoker_deviation = [x - mean_smoker for x in boolean_smoker]

squared_deviation_smoker = [x**2 for x in smoker_deviation]

product_of_deviations_smoker_charges = [smoker_deviation[i] * charges_deviation[i] for i in range(len(boolean_smoker))]

sum_of_product_deviations_smoker_charges = sum(product_of_deviations_smoker_charges)
sum_of_squared_deviation_smoker = sum(squared_deviation_smoker)

correlation_smoker = sum_of_product_deviations_smoker_charges / (math.sqrt(sum_of_squared_deviation_smoker * sum_of_squared_deviation_charges))

print("Correlation between Smokers and Charges", correlation_smoker)

Correlation between Smokers and Charges 0.7872514304971381


The correlation coefficient of approximately 0.787 between smoking status and insurance charges indicates a strong positive correlation: individuals who smoke tend to have higher insurance charges than non-smokers. Smoking status is therefore an important factor for insurers to consider when setting premiums. The finding also highlights the value of promoting and incentivizing healthy lifestyle choices, which could lower insurance costs for individuals.
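
Because smoking status is a yes/no variable, the correlation can be made more concrete by comparing the average charge in each group. The sketch below shows one way to do that. The helper `mean_by_group` and the sample values are hypothetical and do not reproduce figures from the dataset.

```python
from collections import defaultdict


def mean_by_group(values, labels):
    """Average the values separately for each distinct label."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: sum(group) / len(group) for label, group in groups.items()}


# Hypothetical sample standing in for data_dict["charges"] and data_dict["smoker"].
sample_charges = [1000.0, 3000.0, 20000.0, 30000.0]
sample_smoker = ["no", "no", "yes", "yes"]

group_means = mean_by_group(sample_charges, sample_smoker)
print(group_means)
```

On the real data the call would be `mean_by_group(data_dict["charges"], data_dict["smoker"])`, and the gap between the two averages expresses the effect in dollars rather than as a unitless coefficient.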

---

**Predictive Modeling**: Develop a predictive model to estimate insurance costs from smoking status, the variable most strongly correlated with charges above.

---

In [119]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = [[e] for e in boolean_smoker]
y = data_dict['charges']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Selecting and training the model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluating the Model
y_pred = model.predict(X_test)

# Interpreting the Model
smoker_coefficient = model.coef_[0]
intercept = model.intercept_

# Printing the coefficients
print(f"Smoker Coefficient: {smoker_coefficient}")
print(f"Intercept: {intercept}")

# Using the Model
# Now you can use the model to make predictions.
# The only feature is smoker status, so age and BMI are not inputs to this model.
# For example, to predict the insurance cost for a non-smoker:
new_data = [[0]]  # 0 represents non-smoker
prediction = model.predict(new_data)
print(f"The predicted cost for a non-smoker is: ${prediction[0]:.2f}")

Smoker Coefficient: 23188.685870681944
Intercept: 8578.322547999975
The predicted cost for a non-smoker is: $8578.32
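
With a single 0/1 feature, the least-squares intercept equals the mean charge of the 0 group (non-smokers in the training split) and the coefficient equals the difference between the two group means. The cell above also computes `y_pred` without scoring it. The sketch below illustrates both points in plain Python. The helpers `fit_simple_ols` and `r_squared` and the sample data are assumptions for illustration, not the notebook's code.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_sum = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance_sum / variance_sum
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical sample: 0 = non-smoker, 1 = smoker.
sample_smoker = [0, 0, 1, 1]
sample_charges = [1000.0, 3000.0, 20000.0, 30000.0]

intercept, slope = fit_simple_ols(sample_smoker, sample_charges)
predicted = [intercept + slope * x for x in sample_smoker]
score = r_squared(sample_charges, predicted)
print(intercept, slope, score)
```

In the notebook itself the same score could be obtained from the held-out data, for example with scikit-learn's `r2_score(y_test, y_pred)`, to judge how much of the variation in charges smoking status alone explains.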
