# Analysis of U.S. Medical Insurance Costs

# Introduction

This project uses Python to investigate a medical insurance dataset from https://www.kaggle.com/datasets/mirichoi0218/insurance. The steps are to import, observe, and analyze the data. The goals of the analysis are to:

+ Find out the average age, BMI, number of children, and charges in the dataset.
+ Count the total number of men, women, smokers, non-smokers, and people in each region.
+ Repeat the analysis above, comparing the aggregates by region.


# Step 1: Import the data

We will open `"insurance.csv"` and read each row of the file into a list called `insurance_data`.

In [1]:
import csv

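# newline="" is what the csv module documentation recommends for files passed to csv.reader
with open("insurance.csv", newline="") as insurance_file:
    reader_obj = csv.reader(insurance_file)
    insurance_data = []
    for row in reader_obj:
        insurance_data.append(row)
# The file closes automatically when the with block ends; the rows stay in insurance_data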
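
As an alternative, the standard library's `csv.DictReader` maps each row to the header names directly. The sketch below is only an illustration, not the approach this project takes. It reads three sample rows (the ones printed in Step 2) from an in-memory string so that it runs without `insurance.csv`; the names `sample_csv` and `sample_rows` are hypothetical.

```python
import csv
import io

# Hypothetical in-memory stand-in for insurance.csv (header plus the first three rows)
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

# DictReader uses the first line as the keys for every following row
with io.StringIO(sample_csv) as sample_file:
    sample_rows = list(csv.DictReader(sample_file))

print(sample_rows[0]["region"])
print(len(sample_rows))
```

Every value is still a string here, so numeric columns would need converting before any arithmetic.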

# Step 2: Observe the data

Printing the first line of `insurance_data` will show us the column headers. Printing a few more lines will give us an idea of what the data looks like, and we can also check the data types.

In [2]:
# Check what data we are working with
print(insurance_data[0])
print(insurance_data[1])
print(insurance_data[2])
print(insurance_data[3])

for i in range(len(insurance_data[0])):
    print(f"The data type of the {insurance_data[0][i]} column is {type(insurance_data[1][i])}")

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924']
['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523']
['28', 'male', '33', '3', 'no', 'southeast', '4449.462']
The data type of the age column is <class 'str'>
The data type of the sex column is <class 'str'>
The data type of the bmi column is <class 'str'>
The data type of the children column is <class 'str'>
The data type of the smoker column is <class 'str'>
The data type of the region column is <class 'str'>
The data type of the charges column is <class 'str'>


## 2.1: Organize the data

OK! Now we know that every column contains strings. To do any numeric analysis, we will have to convert the numeric columns to floats. It would also be good to keep the headers and data separate. To do this, we will change the data from a list of rows to a dictionary with the headers as keys and a list of column values under each key.

In [3]:
# Begin by creating a dictionary with headers as keys and empty lists
insurance_dict = {header: [] for header in insurance_data[0]}

# Populate the empty lists with data, changing relevant data types to floats
columns_to_change = ['age', 'bmi', 'children', 'charges']
for index, key in enumerate(insurance_dict):
    for row in insurance_data[1:]:
        if key in columns_to_change:
            insurance_dict[key].append(float(row[index]))
        else:
            insurance_dict[key].append(row[index])
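
The row-to-column conversion can also be written as a transposition. The following is a minimal sketch under the assumption that the data has the same shape as `insurance_data` (a header row followed by data rows); `sample_data`, `numeric_columns`, and `sample_dict` are hypothetical names, and only two rows are used to keep it short.

```python
# Hypothetical sample in the same shape as insurance_data: header row, then data rows
sample_data = [
    ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'],
    ['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'],
    ['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'],
]
numeric_columns = {'age', 'bmi', 'children', 'charges'}

# Split off the header row, then transpose the rows into one tuple per column
headers, *rows = sample_data
sample_dict = {
    header: [float(value) if header in numeric_columns else value for value in column]
    for header, column in zip(headers, zip(*rows))
}

print(sample_dict['age'])
print(sample_dict['region'])
```

`zip(*rows)` yields the columns in header order, so pairing it with `headers` gives each column its key without tracking an index.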

# Step 3: Analyze the data

## 3.1: Find aggregates of the entire dataset

Now we can quickly find the average age, BMI, number of children, and charges. We can also count the values in the sex, smoker, and region columns.

In [4]:
def avg(values):
    # 'values' rather than 'list', so the built-in list type is not shadowed
    return sum(values) / len(values)

print(f"The average age in the dataset is: {round(avg(insurance_dict['age']))} years old.")
print(f"The average BMI in the dataset is: {round(avg(insurance_dict['bmi']), 2)}")
print(f"The average amount of children in the dataset is: {round(avg(insurance_dict['children']))}")
print(f"The average charge in the dataset is: ${round(avg(insurance_dict['charges']), 2)}")
print(f"There are {insurance_dict['smoker'].count('yes')} smokers and {insurance_dict['smoker'].count('no')} non-smokers.")
print(f"There are {insurance_dict['sex'].count('female')} women and {insurance_dict['sex'].count('male')} men. \n")
print(f"""By region, there are: 
{insurance_dict['region'].count('northeast')} people from the northeast 
{insurance_dict['region'].count('southeast')} people from the southeast
{insurance_dict['region'].count('northwest')} people from the northwest
{insurance_dict['region'].count('southwest')} people from the southwest""")

The average age in the dataset is: 39 years old.
The average BMI in the dataset is: 30.66
The average amount of children in the dataset is: 1
The average charge in the dataset is: $13270.42
There are 274 smokers and 1064 non-smokers.
There are 662 women and 676 men. 

By region, there are: 
324 people from the northeast 
364 people from the southeast
325 people from the northwest
325 people from the southwest
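
The same counts and averages can be sketched with standard-library helpers instead of repeated `.count()` calls. This is an illustration on a small hypothetical sample (`sample_columns`), not a replacement for the cell above; it assumes columns stored as lists, as in `insurance_dict`.

```python
from collections import Counter
from statistics import mean

# Hypothetical three-row sample in the same column-list layout as insurance_dict
sample_columns = {
    'bmi': [27.9, 33.77, 33.0],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast'],
}

# Counter tallies every distinct value in one pass
region_counts = Counter(sample_columns['region'])
smoker_counts = Counter(sample_columns['smoker'])
average_bmi = round(mean(sample_columns['bmi']), 2)

print(region_counts)
print(smoker_counts)
print(average_bmi)
```

A `Counter` returns 0 for a value it never saw, so asking for a region that is absent from the sample does not raise an error.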


## 3.2: Study by region

Interesting! The count of people from each region is pretty close. If we split the dataset by region, we can compute the same aggregates as for the whole set and compare them across regions.

In [12]:
# creates a regional dictionary with the same keys as the 'national_dict' but which holds values
# for an individual region
def create_regional_dict(region):
    regional_dict = {'region': region,
                     'age': [],
                     'sex': [],
                     'bmi': [],
                     'children': [],
                     'smoker': [],
                     'charges': []
                    }
    for i in range(len(insurance_dict['age'])):
        if insurance_dict['region'][i] == region:
            regional_dict['age'].append(insurance_dict['age'][i])
            regional_dict['sex'].append(insurance_dict['sex'][i])
            regional_dict['bmi'].append(insurance_dict['bmi'][i])
            regional_dict['children'].append(insurance_dict['children'][i])
            regional_dict['smoker'].append(insurance_dict['smoker'][i])
            regional_dict['charges'].append(insurance_dict['charges'][i])
    return regional_dict
        
# use the above function to create regional dictionaries
northeast_dict = create_regional_dict('northeast')
southeast_dict = create_regional_dict('southeast')
southwest_dict = create_regional_dict('southwest')
northwest_dict = create_regional_dict('northwest')

# copy the original dictionary, replacing the 'region' value with the label 'national'
# (a new dict, so insurance_dict['region'] keeps its list and this cell can be re-run)
national_dict = {**insurance_dict, "region": "national"}

# Create a function to print all of the details of a given region
def print_details(region_dict):
    print(f"""
    Region: {region_dict['region']}
    Total people: {len(region_dict['age'])}
    Males: {region_dict['sex'].count('male')}
    Females: {region_dict['sex'].count('female')}
    Average age: {round(avg(region_dict['age']))}
    Average BMI: {round(avg(region_dict['bmi']), 2)}
    Average number of children: {round(avg(region_dict['children']))}
    Smokers: {region_dict['smoker'].count('yes')}
    Nonsmokers: {region_dict['smoker'].count('no')}
    Average patient charge: ${round(avg(region_dict['charges']), 2)}""")

print_details(national_dict)
print_details(northeast_dict)
print_details(southeast_dict)
print_details(southwest_dict)
print_details(northwest_dict)


    Region: national
    Total people: 1338
    Males: 676
    Females: 662
    Average age: 39
    Average BMI: 30.66
    Average number of children: 1
    Smokers: 274
    Nonsmokers: 1064
    Average patient charge: $13270.42

    Region: northeast
    Total people: 324
    Males: 163
    Females: 161
    Average age: 39
    Average BMI: 29.17
    Average number of children: 1
    Smokers: 67
    Nonsmokers: 257
    Average patient charge: $13406.38

    Region: southeast
    Total people: 364
    Males: 189
    Females: 175
    Average age: 39
    Average BMI: 33.36
    Average number of children: 1
    Smokers: 91
    Nonsmokers: 273
    Average patient charge: $14735.41

    Region: southwest
    Total people: 325
    Males: 163
    Females: 162
    Average age: 39
    Average BMI: 30.6
    Average number of children: 1
    Smokers: 58
    Nonsmokers: 267
    Average patient charge: $12346.94

    Region: northwest
    Total people: 325
    Males: 161
    Females: 164
    Averag
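
The regional split can also be sketched as a single pass that groups every row by its region, rather than one call per region. The function `split_by_region` and the data in `sample_columns` are hypothetical, and the sketch assumes the `'region'` key still holds the per-row list of region names.

```python
from collections import defaultdict

def split_by_region(columns):
    """Return {region: {column: [values]}} for a dict of equal-length column lists."""
    other_keys = [key for key in columns if key != 'region']
    # Each new region starts with an empty list for every non-region column
    regions = defaultdict(lambda: {key: [] for key in other_keys})
    for i, region in enumerate(columns['region']):
        for key in other_keys:
            regions[region][key].append(columns[key][i])
    return dict(regions)

# Hypothetical three-row sample in the same layout as insurance_dict
sample_columns = {
    'age': [19.0, 18.0, 28.0],
    'region': ['southwest', 'southeast', 'southeast'],
    'charges': [16884.924, 1725.5523, 4449.462],
}

by_region = split_by_region(sample_columns)
print(sorted(by_region))
print(by_region['southeast']['age'])
```

Grouping this way picks up whatever region names occur in the data, so none have to be typed out by hand.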

# Conclusion
The original dataset showed that each US region is fairly evenly represented in the national dataset (though with a slightly larger showing from the southeast). This prompted an analysis by each of the four regions.

The regions have a similar male:female ratio, average age, and average number of children.

The southeast region has a notably higher average charge ($14735.41, against $13270.42 nationally), which coincides with a higher average BMI (33.36 against 30.66) and a higher ratio of smokers to non-smokers compared to the national results and other regions. Further analysis would be required to determine whether the relationship is causal.
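
To make the smoker comparison concrete, the share of smokers can be worked out from the counts printed in section 3.2. This is a small arithmetic sketch; `smoker_counts` and `smoker_share` are hypothetical names, and the northwest is left out because its smoker counts do not appear in the output above.

```python
# (smokers, non-smokers) copied from the printed output in section 3.2
smoker_counts = {
    'national': (274, 1064),
    'northeast': (67, 257),
    'southeast': (91, 273),
    'southwest': (58, 267),
}

# Percentage of people in each group who smoke, rounded to one decimal place
smoker_share = {
    region: round(100 * smokers / (smokers + non_smokers), 1)
    for region, (smokers, non_smokers) in smoker_counts.items()
}

for region, share in smoker_share.items():
    print(f"{region}: {share}% smokers")
```

In the southeast, 91 of 364 people smoke, which is exactly 25%, the highest share among the groups listed here.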