# U.S. Medical Insurance Costs

## Hypothesis
* Some regions have a higher average insurance cost per patient because of lifestyle factors
* These regions have a higher share of smokers, a higher average number of kids and/or a higher average BMI among their patients


## Project Goal

The goal of this project is to test my hypothesis.

I will test the hypothesis by finding out:

* How many patients does each region have?
* What's the average insurance cost per patient per region? 
* Which region has the highest average insurance cost?
* Which region has the highest percentage of smokers amongst patients?
* Which region has the most kids on average per patient?
* Which region has the highest average BMI per patient?


## Data
The data used is from the medical insurance cost dataset provided by Codecademy, [originally from Kaggle](https://www.kaggle.com/mirichoi0218/insurance).

## Analysis
### How many patients does each region have?

In [12]:
import csv

# Reusable function: count the number of patients for each value of a column

def patients_per_category(csv_file, category):
    categories = {}
    with open(csv_file, newline='') as insurance_csv:
        insurance_data = csv.DictReader(insurance_csv)
        for row in insurance_data:
            key = row[category]
            # Start at 0 the first time a value is seen, then add one
            categories[key] = categories.get(key, 0) + 1
    return categories

In [13]:
patients_per_region = patients_per_category("insurance.csv", "region")

In [14]:
print(patients_per_region)

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
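
The same per-value count can also be written with `collections.Counter` from the standard library. The following is a minimal sketch, not the notebook's own code: the name `count_by_column` is hypothetical, and the rows are made-up sample data (not taken from `insurance.csv`) so that the example runs on its own.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_lines, column):
    """Count how many rows share each value of `column`."""
    reader = csv.DictReader(csv_lines)
    return Counter(row[column] for row in reader)


# Made-up sample rows for illustration only (not from insurance.csv)
sample_csv = io.StringIO(
    "age,region\n"
    "19,southwest\n"
    "18,southeast\n"
    "28,southeast\n"
    "33,northwest\n"
)

sample_counts = count_by_column(sample_csv, "region")
print(dict(sample_counts))
# {'southwest': 1, 'southeast': 2, 'northwest': 1}
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the `io.StringIO` sample.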


### What's the average insurance cost per patient per region?

In [15]:
# Reusable function: sum a numeric column (e.g. charges) for each region

def category_per_region(csv_file, category):
    regions = {}
    with open(csv_file, newline='') as insurance_csv:
        insurance_data = csv.DictReader(insurance_csv)
        for row in insurance_data:
            region = row["region"]
            cat = float(row[category])
            if region in regions:
                regions[region] += cat
            else:
                regions[region] = cat
    return regions

cost_per_region = category_per_region("insurance.csv", "charges")

print(cost_per_region)

{'southwest': 4012754.647620001, 'southeast': 5363689.763290002, 'northwest': 4035711.9965399993, 'northeast': 4343668.583308999}


In [16]:
# Reusable function: print the average value per patient for each region.
# Note: this reads the global patients_per_region dictionary computed above.

def average_cat_per_region(cat, cat_per_region):
    for key in cat_per_region:
        # Regional total divided by the number of patients in that region
        average = cat_per_region[key]/patients_per_region[key]
        print("The average {} per patient for {} is {}".format(cat, key, average))

average_cat_per_region("cost", cost_per_region)


The average cost per patient for southwest is 12346.93737729231
The average cost per patient for southeast is 14735.411437609895
The average cost per patient for northwest is 12417.575373969228
The average cost per patient for northeast is 13406.3845163858


**Southeast** has the highest average cost per patient, at 14735.41.
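
The averaging step can also return its results instead of printing them, which makes them reusable in later cells. This is a minimal sketch under that assumption: the name `averages_per_region` is hypothetical, the patient counts are passed in as an argument rather than read from a global, and the two dictionaries are copied from the output printed above.

```python
def averages_per_region(totals, counts):
    """Return {region: total / number of patients} for every region."""
    return {region: totals[region] / counts[region] for region in totals}


# Values copied from the printed output of the cells above
cost_totals = {
    "southwest": 4012754.647620001,
    "southeast": 5363689.763290002,
    "northwest": 4035711.9965399993,
    "northeast": 4343668.583308999,
}
patient_counts = {
    "southwest": 325,
    "southeast": 364,
    "northwest": 325,
    "northeast": 324,
}

average_cost = averages_per_region(cost_totals, patient_counts)
for region, value in average_cost.items():
    # Format to two decimals for display only; the stored value keeps full precision
    print(f"The average cost per patient for {region} is {value:.2f}")
```

Keeping the full-precision values in the dictionary and rounding only when printing avoids accumulating rounding errors if the averages are used in further calculations.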

### Which region has the highest percentage of smokers amongst patients?


In [17]:
# Reusable function: count patients per region for each value of a
# two-valued column such as sex (male/female) or smoker (yes/no)

def binary_per_region(csv_file, binary):
    regions = {}
    with open(csv_file, newline='') as insurance_csv:
        insurance_data = csv.DictReader(insurance_csv)
        for row in insurance_data:
            region = row["region"]
            value = row[binary]
            # Create the inner dictionary the first time a region is seen
            counts = regions.setdefault(region, {})
            counts[value] = counts.get(value, 0) + 1
    return regions

smokers_per_region = binary_per_region("insurance.csv", "smoker")

print(smokers_per_region)

{'southwest': {'yes': 58, 'no': 267}, 'southeast': {'no': 273, 'yes': 91}, 'northwest': {'no': 267, 'yes': 58}, 'northeast': {'no': 257, 'yes': 67}}


In [18]:
# Calculate and print the percentage of smokers in each region

def percentage_smokers(smokers_per_region, patients_per_region):
    for key in smokers_per_region:
        # .get avoids a KeyError if a region has no smokers at all
        decimals = smokers_per_region[key].get("yes", 0) / patients_per_region[key]
        percentage = round(decimals * 100, 2)
        print("{} has {} percent smokers".format(key, percentage))
              
    
percentage_smokers(smokers_per_region, patients_per_region)


southwest has 17.85 percent smokers
southeast has 25.0 percent smokers
northwest has 17.85 percent smokers
northeast has 20.68 percent smokers


**Southeast** has the highest percentage of smokers, at 25%.
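
The percentage can also be derived from the nested counts alone, since the yes and no counts of a region add up to its number of patients. The sketch below assumes exactly that; the name `share_per_region` is hypothetical, and the counts are copied from the output printed above.

```python
def share_per_region(counts_per_region, value):
    """Return {region: percentage of patients whose answer equals `value`}."""
    shares = {}
    for region, counts in counts_per_region.items():
        total = sum(counts.values())  # all patients in this region
        shares[region] = round(counts.get(value, 0) / total * 100, 2)
    return shares


# Counts copied from the printed output of the cell above
smoker_counts = {
    "southwest": {"yes": 58, "no": 267},
    "southeast": {"no": 273, "yes": 91},
    "northwest": {"no": 267, "yes": 58},
    "northeast": {"no": 257, "yes": 67},
}

smoker_share = share_per_region(smoker_counts, "yes")
print(smoker_share)
```

Returning a dictionary rather than printing makes it possible to pass the result to other functions, for example to look up the region with the highest share.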

### Which region has the most kids on average per patient?

In [19]:
kids_per_region = category_per_region("insurance.csv", "children")

print(kids_per_region)


{'southwest': 371.0, 'southeast': 382.0, 'northwest': 373.0, 'northeast': 339.0}


In [20]:
average_cat_per_region("children", kids_per_region)

The average children per patient for southwest is 1.1415384615384616
The average children per patient for southeast is 1.0494505494505495
The average children per patient for northwest is 1.1476923076923078
The average children per patient for northeast is 1.0462962962962963


**Northwest** has the most children per patient, with 1.15 children on average.

### Which region has the highest average BMI per patient?


In [21]:
bmi_per_region = category_per_region("insurance.csv", "bmi")

print(bmi_per_region)
        

{'southwest': 9943.899999999998, 'southeast': 12141.580000000005, 'northwest': 9489.930000000004, 'northeast': 9452.215000000002}


In [22]:
average_cat_per_region("bmi", bmi_per_region)

The average bmi per patient for southwest is 30.59661538461538
The average bmi per patient for southeast is 33.35598901098903
The average bmi per patient for northwest is 29.199784615384626
The average bmi per patient for northeast is 29.17350308641976


**Southeast** has the highest average BMI per patient, at 33.36.

## Conclusion

The southeast region had the highest average insurance cost per patient.
This region also had the highest percentage of smokers and the highest average BMI per patient.
However, it did not have the highest average number of children per patient.

Lifestyle factors such as smoking and BMI seem to contribute to the higher insurance cost in the southeast region. The number of children seems to contribute less to the insurance cost.

In a further analysis, it would be interesting to see whether the age and gender distributions also affect the average insurance cost of each region, and whether the region itself is a variable that affects a patient's insurance cost.
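
One possible starting point for such a follow-up is a helper that averages any numeric column over any grouping column. The sketch below is only an illustration: the name `mean_by_group` is hypothetical, the column names `age` and `sex` are assumed from the Kaggle dataset, and the rows are made-up sample data rather than values from `insurance.csv`.

```python
import csv
import io
from collections import defaultdict


def mean_by_group(csv_lines, group_column, value_column):
    """Return {group: mean of value_column} for each value of group_column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_lines):
        group = row[group_column]
        totals[group] += float(row[value_column])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample rows for illustration only (not from insurance.csv)
sample_csv = (
    "age,sex,region,charges\n"
    "20,female,southwest,1000.0\n"
    "40,male,southwest,3000.0\n"
    "30,male,southeast,2000.0\n"
)

mean_age = mean_by_group(io.StringIO(sample_csv), "region", "age")
mean_charges_by_sex = mean_by_group(io.StringIO(sample_csv), "sex", "charges")
print(mean_age)
print(mean_charges_by_sex)
```

With the real file, an open file object would take the place of the `io.StringIO` sample.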

### Suggestions for improvements of the code

* Round numbers to two decimal places
* Create a function that finds the region with the highest value instead of checking it manually
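
Both suggestions could be covered by a small helper along these lines. This is a sketch rather than part of the analysis above: the name `highest_region` is hypothetical, and the two dictionaries are copied from the averages printed earlier.

```python
def highest_region(values_per_region):
    """Return (region, value rounded to two decimals) for the largest value."""
    region = max(values_per_region, key=values_per_region.get)
    return region, round(values_per_region[region], 2)


# Averages copied from the printed output earlier in the notebook
average_children = {
    "southwest": 1.1415384615384616,
    "southeast": 1.0494505494505495,
    "northwest": 1.1476923076923078,
    "northeast": 1.0462962962962963,
}
average_bmi = {
    "southwest": 30.59661538461538,
    "southeast": 33.35598901098903,
    "northwest": 29.199784615384626,
    "northeast": 29.17350308641976,
}

print(highest_region(average_children))  # ('northwest', 1.15)
print(highest_region(average_bmi))       # ('southeast', 33.36)
```

If two regions tie for the highest value, `max` returns the first one it encounters, so a tie would need separate handling.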