# U.S. Medical Insurance Costs

# 1. Introduction

In this project, a U.S. medical insurance dataset is investigated using what I learned in the Python Fundamentals course from Codecademy.

The dataset is a .csv file made of 7 columns. Six columns give the characteristics of a patient (age, sex, bmi, children, smoker, region) and the seventh column gives their insurance cost (charges).
The values are a mix of strings, integers and floats, for a total of 1338 patients.

# 2. Project Goals

Codecademy gives no specific guidelines for this exercise.

I propose to analyze the data through the following 3 topics:

    1. Know your patient: analyze characteristics such as average age, region, average bmi, etc.
    2. Impact of certain characteristics on the insurance cost: how much more a smoker pays, how age affects the cost...
    3. Estimate the cost for a new patient: from the information available, is it possible to estimate our own insurance cost?

First, the data needs to be imported and transformed into Python objects.

# 3. Import data

The data is imported and saved into a list.

In [9]:
import csv

with open("insurance.csv") as insurance_csv:
    insurance_data = []
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        insurance_data.append(row)

print(insurance_data[:3])
print("...")

[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}]
...


insurance_data is a list of dictionaries.
Each element of the list is one line of the .csv file. Each line is described by a dictionary whose keys are the column names and whose values are all strings.
This makes it easy to access the data needed for the analysis.

Here is an example:

In [4]:
# Example: collect all the ages in a list (the values are still strings at this point).
age_list = []
for line in insurance_data:
    age_list.append(line["age"])

print(age_list[:20])
print("...")

['19', '18', '28', '33', '32', '31', '46', '37', '37', '60', '25', '62', '23', '56', '27', '19', '52', '23', '56', '30']
...
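The same extraction can be written as a list comprehension that also converts the ages to integers. This is only a sketch: `sample_csv`, `sample_rows` and `sample_ages` are names introduced for illustration, and the in-memory file contains just the header and the first three rows printed earlier, so that the example runs without insurance.csv.

```python
import csv
import io

# Header and first three rows of insurance.csv, as printed earlier.
# An in-memory file stands in for the real one (assumption for the sketch).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33,3,no,southeast,4449.462\n"
)
sample_rows = list(csv.DictReader(sample_csv))

# One comprehension replaces the loop and converts each age to an integer.
sample_ages = [int(row["age"]) for row in sample_rows]
print(sample_ages)  # [19, 18, 28]
```

Converting at extraction time avoids comparing strings such as '19' and '100' alphabetically later on.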


It would be helpful to have a function that does the same for any column.

In [5]:
def make_a_list_from_column(name_of_column):
    list_of_column_values = []
    for line in insurance_data:
        # Numeric columns are easier to work with as floats; values that
        # cannot be converted (e.g. "female") are kept as strings.
        try:
            list_of_column_values.append(float(line[name_of_column]))
        except ValueError:
            list_of_column_values.append(line[name_of_column])
    return list_of_column_values

#list of names of column available
with open("insurance.csv") as insurance_csv:
    print("Here is the list of columns to use as parameter of our make_a_list_from_column: " + insurance_csv.readline())

Here is the list of columns to use as parameter of our make_a_list_from_column: age,sex,bmi,children,smoker,region,charges



Let's try our function:

In [20]:
bmi_list = make_a_list_from_column("bmi")
print(bmi_list[:100])
print("...")

[27.9, 33.77, 33.0, 22.705, 28.88, 25.74, 33.44, 27.74, 29.83, 25.84, 26.22, 26.29, 34.4, 39.82, 42.13, 24.6, 30.78, 23.845, 40.3, 35.3, 36.005, 32.4, 34.1, 31.92, 28.025, 27.72, 23.085, 32.775, 17.385, 36.3, 35.6, 26.315, 28.6, 28.31, 36.4, 20.425, 32.965, 20.8, 36.67, 39.9, 26.6, 36.63, 21.78, 30.8, 37.05, 37.3, 38.665, 34.77, 24.53, 35.2, 35.625, 33.63, 28.0, 34.43, 28.69, 36.955, 31.825, 31.68, 22.88, 37.335, 27.36, 33.66, 24.7, 25.935, 22.42, 28.9, 39.1, 26.315, 36.19, 23.98, 24.75, 28.5, 28.1, 32.01, 27.4, 34.01, 29.59, 35.53, 39.805, 32.965, 26.885, 38.285, 37.62, 41.23, 34.8, 22.895, 31.16, 27.2, 27.74, 26.98, 39.49, 24.795, 29.83, 34.77, 31.3, 37.62, 30.8, 38.28, 19.95, 19.3]
...


# 4. Analysis
## 4.1. Know your patient

The data is now imported and stored in variables that can be used for analysis.

The first step is to look at the age of the patients:

In [14]:
import statistics

# average age of our patients
age_list = make_a_list_from_column("age")
print("The average age of our patients is: {} years old.".format(round(statistics.mean(age_list),1)))

# count the patients between two ages (both bounds included) and their share of the total
def how_many_between(age_min, age_max):
    age_count = 0
    for age in age_list:
        if age_min <= age <= age_max:
            age_count += 1
    percent = age_count / len(age_list) * 100
    return "There are {} patients between {} and {} years old, which represents {}% of all patients.".format(age_count, age_min, age_max, round(percent,2))

print(how_many_between(18, 35))
print(how_many_between(35, 50))


The average age of our patients is: 39.2 years old.
There are 574 patients between 18 and 35 years old, which represents 42.9% of all patients.
There are 433 patients between 35 and 50 years old, which represents 32.36% of all patients.
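Both bounds in how_many_between are inclusive, so a patient aged exactly 35 is counted in both groups above. A possible variant, sketched below, excludes the upper bound so that the groups never overlap. The function `count_in_range` and the ages in `sample_ages` are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
# Hypothetical ages for illustration only (not taken from insurance.csv).
sample_ages = [18, 25, 35, 35, 42, 50, 64]

def count_in_range(ages, age_min, age_max):
    """Count ages with age_min <= age < age_max (upper bound excluded)."""
    return sum(1 for age in ages if age_min <= age < age_max)

young = count_in_range(sample_ages, 18, 35)
middle = count_in_range(sample_ages, 35, 50)
older = count_in_range(sample_ages, 50, 65)

print(young, middle, older)  # 2 3 2
# With half-open ranges every patient falls into exactly one group.
print(young + middle + older == len(sample_ages))  # True
```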


It is interesting to know where they live as well.

In [21]:
# collect the unique regions
regions_unique = []
region_list = make_a_list_from_column("region")

for region in region_list:
    if region not in regions_unique:
        regions_unique.append(region)

print("Here are the unique regions in the dataset : {}".format(regions_unique))


# store the population per region in a dictionary
region_population = {
    "northwest": 0,
    "northeast": 0,
    "southeast": 0,
    "southwest": 0,
    }

for key in region_population:
    region_population[key] = region_list.count(key)

print("Here is the number of patients per region : {}".format(region_population))


Here are the unique regions in the dataset : ['southwest', 'southeast', 'northwest', 'northeast']
Here is the number of patients per region : {'northwest': 325, 'northeast': 324, 'southeast': 364, 'southwest': 325}
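As an alternative sketch, `collections.Counter` from the standard library counts the regions in one pass without listing them by hand. The example uses only the regions of the first three patients printed earlier, then reuses the counts reported above to check that they add up to the 1338 patients and to express them as shares. The names `sample_regions`, `reported_counts` and `shares` are introduced here for illustration.

```python
from collections import Counter

# Regions of the first three patients printed earlier in the notebook.
sample_regions = ["southwest", "southeast", "southeast"]
region_counts = Counter(sample_regions)
print(region_counts.most_common())  # [('southeast', 2), ('southwest', 1)]

# Counts reported by the cell above, copied here to check the total.
reported_counts = {"northwest": 325, "northeast": 324, "southeast": 364, "southwest": 325}
total = sum(reported_counts.values())
print(total)  # 1338

# Share of each region in percent, rounded to one decimal.
shares = {region: round(count / total * 100, 1) for region, count in reported_counts.items()}
print(shares)
```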


## 4.2. Impact of patient characteristics on charges

This part of the analysis checks how certain characteristics, such as whether the patient smokes, affect the patient's medical insurance cost.

In [93]:
# average cost for smokers versus non-smokers
costs_smokers = []
costs_non_smokers = []

for line in insurance_data:
    if line["smoker"] == "yes":
        costs_smokers.append(float(line["charges"]))
    else:
        costs_non_smokers.append(float(line["charges"]))

print("The average insurance cost for smokers is : {} dollars".format(round(statistics.mean(costs_smokers),2)))
print("The average insurance cost for non-smokers is : {} dollars".format(round(statistics.mean(costs_non_smokers),2)))


The average insurance cost for smokers is : 32050.23 dollars
The average insurance cost for non-smokers is : 8434.27 dollars
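To put the two averages side by side, the short sketch below computes their difference and their ratio. The two numbers are copied from the output above; the variable names are introduced here for illustration.

```python
# Averages copied from the output above (in dollars).
avg_smokers = 32050.23
avg_non_smokers = 8434.27

difference = avg_smokers - avg_non_smokers
ratio = avg_smokers / avg_non_smokers

print(round(difference, 2))  # 23615.96
print(round(ratio, 1))       # 3.8
```

This compares group averages only; it does not separate the effect of smoking from age, bmi or the other columns.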


Other parameters also play a role in this analysis, but smokers pay on average almost four times as much as non-smokers.
What about bmi?

In [105]:
# function that calculates the average charges for patients at or above a given bmi
def average_cost_above_bmi(bmi_min):
    costs_at_bmi_min = []
    for line in insurance_data:
        if float(line["bmi"]) >= bmi_min:
            costs_at_bmi_min.append(float(line["charges"]))
    return "Average insurance cost of patients with a bmi of {} or above is {} dollars".format(bmi_min, round(statistics.mean(costs_at_bmi_min), 1))

print(average_cost_above_bmi(16))
print(average_cost_above_bmi(22))
print(average_cost_above_bmi(28))
print(average_cost_above_bmi(35))


Average insurance cost of patients with a bmi of 16 or above is 13279.1 dollars
Average insurance cost of patients with a bmi of 22 or above is 13606.7 dollars
Average insurance cost of patients with a bmi of 28 or above is 14642.5 dollars
Average insurance cost of patients with a bmi of 35 or above is 16953.8 dollars
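Each group above contains all the groups that follow it (every patient with a bmi of at least 35 also has a bmi of at least 16), so the four averages overlap. A possible refinement, sketched here, averages the charges inside separate bmi bands instead. The function `average_cost_in_band`, the band limits and the (bmi, charges) pairs are hypothetical and only serve to make the sketch runnable; they are not taken from the dataset.

```python
import statistics

# Hypothetical (bmi, charges) pairs for illustration only.
sample_records = [
    (18.0, 2000.0), (24.0, 3000.0), (26.0, 5000.0),
    (31.0, 9000.0), (36.0, 12000.0), (40.0, 15000.0),
]

def average_cost_in_band(records, bmi_min, bmi_max):
    """Average charges for bmi_min <= bmi < bmi_max, or None if the band is empty."""
    costs = [charges for bmi, charges in records if bmi_min <= bmi < bmi_max]
    return statistics.mean(costs) if costs else None

print(average_cost_in_band(sample_records, 16, 25))  # 2500.0
print(average_cost_in_band(sample_records, 25, 30))  # 5000.0
print(average_cost_in_band(sample_records, 30, 50))  # 12000.0
```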


The higher the bmi, the higher the insurance cost, but bmi seems to have less impact than smoking.

So the patient with the highest cost is probably a smoker with a high bmi. What other characteristics could he or she have?

In [23]:
# find the patient with the highest cost and look at his/her characteristics

max_cost = max(make_a_list_from_column("charges"))

for line in insurance_data:
    if float(line["charges"]) == max_cost:
        print("Here are the characteristics of the patient with the highest insurance cost: \n{}".format(line))


Here are the characteristics of the patient with the highest insurance cost: 
{'age': '54', 'sex': 'female', 'bmi': '47.41', 'children': '0', 'smoker': 'yes', 'region': 'southeast', 'charges': '63770.42801'}


These were some examples of how certain characteristics can affect the insurance cost.
It would be interesting to predict how the price changes when one of the characteristics changes.
For example, if a patient quits smoking, how much will his or her insurance cost decrease?

## 4.3. Estimate the cost for a new patient

From the dataset, it might be possible to establish a model to calculate insurance charges.
Linear regression can be used for that.

Mathematically, the insurance cost would look like the following:
$$\text{Insurance cost}(x_1, x_2, x_3, x_4, x_5, x_6) = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5 + a_6 x_6$$

Where $x_n$ are the characteristics such as age, bmi, smoker... and $a_n$ the coefficients to find: $a_0$ is the intercept and $a_1$ to $a_6$ are the slopes.
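As a first step towards this model, here is a minimal sketch of the one-variable case, $a_0 + a_1 x_1$, fitted by ordinary least squares with plain Python. It is not the full six-variable model. The function `fit_line` and the data points are hypothetical: the points are constructed to lie exactly on charges = 1000 + 200 * age so that the result is easy to check, and they are not taken from the dataset.

```python
import statistics

# Hypothetical points lying exactly on charges = 1000 + 200 * age.
ages = [20.0, 30.0, 40.0, 50.0]
charges = [5000.0, 7000.0, 9000.0, 11000.0]

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Slope = sum of co-deviations / sum of squared deviations of x.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    squared_deviation = sum((x - mean_x) ** 2 for x in xs)
    slope = co_deviation / squared_deviation
    intercept = mean_y - slope * mean_x
    return intercept, slope

a0, a1 = fit_line(ages, charges)
print(a0, a1)        # 1000.0 200.0
print(a0 + a1 * 35)  # 8000.0, the estimated charges at age 35
```

Text columns such as smoker or sex would first have to be encoded as numbers (for example 1 for "yes" and 0 for "no") before they can enter the equation, and fitting all six coefficients at once requires solving a system of equations rather than this single formula.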



*note of 05 june 2021 : apparently I need to learn more to perform this ;p - let's continue the course and come back later to this project*