# U.S. Medical Insurance Costs

Earlier Codecademy Python challenge projects on U.S. medical insurance costs treated the insurance costs as the dependent variable and the other data as independent variables. I will do the same in this project.

### Initial importing and inspection

Import modules:

In [1]:
import csv

Import data:

In [2]:
with open("insurance.csv", newline="") as insurance_csv:
    insurance_data = list(csv.DictReader(insurance_csv, delimiter=","))

Briefly describe the dataset:

In [3]:
var_list = tuple(insurance_data[0].keys())
print("The data provided in the dataset contains these {} variables: {}".format(len(var_list), var_list))
print("The total size of the sample is {}.".format(len(insurance_data)))
del var_list

The data provided in the dataset contains these 7 variables: ('age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges')
The total size of the sample is 1338.


Check how the data is formatted:

In [4]:
print(insurance_data[0])
print(insurance_data[1])

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}
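
The output above shows that `csv.DictReader` returns every field as a string, including the ones that look numeric. A minimal sketch that reproduces this on an in-memory sample (the CSV text and the names `sample_csv`, `rows` and `value_types` are made up for illustration and are not part of the project):

```python
import csv
import io

# Hypothetical two-row sample that mimics part of the insurance.csv layout
sample_csv = "age,bmi,smoker\n19,27.9,yes\n18,33.77,no\n"

# io.StringIO lets DictReader read from a string as if it were a file
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Map each column name to the type name of its value in the first row
value_types = {key: type(value).__name__ for key, value in rows[0].items()}
print(value_types)
```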


Much of the further analysis will require working with numbers.
Because of that, numerical data should be converted from string to int or float.

Define a function to convert the numerical data:

In [5]:
def convert_num_data_types(data_list):
    for row in data_list:
        row["age"] = int(row["age"])
        row["bmi"] = float(row["bmi"])
        row["children"] = int(row["children"])
        row["charges"] = float(row["charges"])

Run the function and test to see the results:

In [6]:
convert_num_data_types(insurance_data)

print(insurance_data[0])
print(insurance_data[1])

{'age': 19, 'sex': 'female', 'bmi': 27.9, 'children': 0, 'smoker': 'yes', 'region': 'southwest', 'charges': 16884.924}
{'age': 18, 'sex': 'male', 'bmi': 33.77, 'children': 1, 'smoker': 'no', 'region': 'southeast', 'charges': 1725.5523}
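
`convert_num_data_types()` changes the rows in place, so the original strings are gone after the call. If keeping the raw rows were desirable, one alternative could be to build converted copies instead. The sketch below assumes that goal; `NUMERIC_COLUMNS` and `converted_copy` are hypothetical names, not part of the project:

```python
# Hypothetical mapping from column name to the type it should be converted to
NUMERIC_COLUMNS = {"age": int, "bmi": float, "children": int, "charges": float}

def converted_copy(data_list, converters=NUMERIC_COLUMNS):
    """Return new row dicts with the listed columns converted; input rows stay unchanged."""
    result = []
    for row in data_list:
        new_row = dict(row)  # shallow copy so the original row is not modified
        for column, convert in converters.items():
            new_row[column] = convert(row[column])
        result.append(new_row)
    return result

# One row copied from the printed output above, still in string form
sample = [{"age": "19", "sex": "female", "bmi": "27.9", "children": "0",
           "smoker": "yes", "region": "southwest", "charges": "16884.924"}]
converted = converted_copy(sample)
print(converted[0])
```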


### Describing each variable separately

Make a general function that takes the insurance data and the column name (dict key) of a variable and describes the variable:

In [7]:
def describe_var(data_list, name):
    try:
        var_data_list = []
        for row in data_list:
            var_data_list.append(row[name])
        if type(data_list[0][name]) == str:
            print("'{}' is a categorical (nominal) variable.".format(name))
            # describe categorical variable, including how many entries there are for each category
            category_list = set(var_data_list)
            category_dict = {cat: 0 for cat in category_list}
            for entry in var_data_list:
                category_dict[entry] += 1
            print("It has {} categories: {}.".format(len(category_list), category_list))
            print("Out of all data rows", end="")
            for cat, cat_count in category_dict.items():
                print(", {nr} are {cat}".format(nr=cat_count, cat=cat), end="")
            print(".")
        else:
            print("'{}' is a numerical variable.".format(name))
            # describe numerical variable; integer columns get an integer (truncated) average
            minimum = min(var_data_list)
            maximum = max(var_data_list)
            average = sum(var_data_list) / len(var_data_list)
            if type(data_list[0][name]) == int:
                average = int(average)
                print("It contains values ranging from {mini} to {maxi}, with an average of {avg}."
                      .format(mini=minimum, maxi=maximum, avg=average))
            else:
                print("It contains values ranging from {mini:,} to {maxi:,}, with an average of {avg:,.2f}."
                      .format(mini=minimum, maxi=maximum, avg=average))
        print("")
    except KeyError:
        print("Please provide a valid list of dicts and a dict key as arguments.")
        return

Test the function:

In [8]:
describe_var(insurance_data, "bmi")
describe_var(insurance_data, "a typo")

'bmi' is a numerical variable.
It contains values ranging from 15.96 to 53.13, with an average of 30.66.

Please provide a valid list of dicts and a dict key as arguments.
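
The categorical branch of `describe_var()` counts entries per category with a hand-built dict. As a point of comparison, the standard library's `collections.Counter` does the same tally. This is only a sketch on made-up rows (`sample_rows` and `smoker_counts` are illustrative names):

```python
from collections import Counter

# Hypothetical rows; only the column being counted is included
sample_rows = [{"smoker": "yes"}, {"smoker": "no"}, {"smoker": "no"}, {"smoker": "no"}]

# Counter tallies how many times each category value occurs
smoker_counts = Counter(row["smoker"] for row in sample_rows)

# most_common() lists (category, count) pairs from most to least frequent
print(smoker_counts.most_common())  # [('no', 3), ('yes', 1)]
```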


Since we want to analyze all the variables, make a function that calls `describe_var()` for every variable, then call it:

In [9]:
def describe_all_vars(data_list):
    for key in data_list[0].keys():
        describe_var(data_list, key)
        
describe_all_vars(insurance_data)

'age' is a numerical variable.
It contains values ranging from 18 to 64, with an average of 39.

'sex' is a categorical (nominal) variable.
It has 2 categories: {'male', 'female'}.
Out of all data rows, 676 are male, 662 are female.

'bmi' is a numerical variable.
It contains values ranging from 15.96 to 53.13, with an average of 30.66.

'children' is a numerical variable.
It contains values ranging from 0 to 5, with an average of 1.

'smoker' is a categorical (nominal) variable.
It has 2 categories: {'yes', 'no'}.
Out of all data rows, 274 are yes, 1064 are no.

'region' is a categorical (nominal) variable.
It has 4 categories: {'northwest', 'northeast', 'southwest', 'southeast'}.
Out of all data rows, 325 are northwest, 324 are northeast, 325 are southwest, 364 are southeast.

'charges' is a numerical variable.
It contains values ranging from 1,121.8739 to 63,770.42801, with an average of 13,270.42.
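
For integer columns, `describe_var()` truncates the average with `int(average)`, so the fractional part is dropped rather than rounded. A possible extension, sketched here with the standard-library `statistics` module on made-up values, would also report the median and the sample standard deviation. `summarize_numeric` and `toy_children` are hypothetical names:

```python
import statistics

def summarize_numeric(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

# Hypothetical values, not taken from the dataset
toy_children = [0, 1, 3, 0, 0, 2, 2]
summary = summarize_numeric(toy_children)
print(summary)

# int() truncates the mean, round() keeps two decimals
print(int(summary["mean"]), round(summary["mean"], 2))
```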



### Analyzing other variables' relationships with insurance costs (`charges`)

Make a function that calculates the Pearson correlation coefficient, the coefficient of determination, and the linear regression slope for a numerical variable against `charges`:

> Pearson correlation: $r_{xy} = \frac{\sum x y \,-\, \frac{\sum x \sum y}{n}}{\sqrt{(\sum x^2 \,-\, \frac{(\sum x)^2}{n})(\sum y^2 \,-\, \frac{(\sum y)^2}{n})}}$

> Coefficient of determination is $r_{xy}^2$

> Regression equation is $y = a + bx + ε$, where $$b=\frac{\sum x y \,-\, \frac{\sum x \sum y}{n}}{\sum x^2 \,-\, \frac{(\sum x)^2}{n}}$$ <br> $$a=\frac{\sum y \,-\, b \sum x}{n}$$

In [10]:
def num_analysis(data_list, name):
    # correlation:
    sum_xy = 0
    sum_x = 0
    sum_x_sq = 0
    sum_y = 0
    sum_y_sq = 0
    sample_size = 0
    for row in data_list:
        sum_xy += row[name] * row["charges"]
        sum_x += row[name]
        sum_x_sq += row[name]**2
        sum_y += row["charges"]
        sum_y_sq += row["charges"]**2
        sample_size += 1        
    pearsons_r_num = (sum_xy - sum_x * sum_y / sample_size)
    pearsons_r_denom = ((sum_x_sq - sum_x**2 / sample_size) * (sum_y_sq - sum_y**2 / sample_size))**0.5
    pearsons_r = pearsons_r_num / pearsons_r_denom
    pearsons_r_sq = pearsons_r**2
    print("'{name}' and 'charges' have a correlation coefficient of {r:.2f} and a determination coefficient of {r_sq:.2f}."\
         .format(name=name, r=pearsons_r, r_sq=pearsons_r_sq))
    # comment on relationship (positive/inverse) and its strength
    if "{:.0%}".format(pearsons_r_sq) == "0%":
        print("It is a very weak or nonexistent", end=" ")
        # print("relationship.")
        # return pearsons_r, pearsons_r_sq, None 
        # ^ uncommenting the two lines above makes the function skip linear regression
        # when the relationship explains less than about 0.5% of the variance in charges
    elif pearsons_r_sq >= 0.6:
        print("It is a strong", end=" ")
    elif pearsons_r_sq >= 0.4:
        print("It is a moderately strong", end=" ")
    else:
        print("It is a weak", end=" ")
    if pearsons_r > 0:
        print("positive relationship.")
    else:
        print("inverse relationship.")
    # linear regression
    slope_denom = (sum_x_sq - sum_x**2 / sample_size)
    slope = pearsons_r_num / slope_denom
    print("Each unit increase in {name} is associated with an average {slope:,.2f} unit"\
         .format(name=name, slope=abs(slope)), end=" ")
    if pearsons_r > 0:
        print("increase in insurance costs ('charges').")
    else:
        print("decrease in insurance costs ('charges').")
    # share of variance explained (coefficient of determination)
    print("This linear relationship could explain {:.0%} of variance in insurance costs."\
         .format(pearsons_r_sq))
    return pearsons_r, pearsons_r_sq, slope

Test the function:

In [11]:
num_analysis(insurance_data, "age")

'age' and 'charges' have a correlation coefficient of 0.30 and a determination coefficient of 0.09.
It is a weak positive relationship.
Each unit increase in age is associated with an average 257.72 unit increase in insurance costs ('charges').
This linear relationship could explain 9% of variance in insurance costs.


(0.29900819333064554, 0.0894058996788567, 257.7226186668939)
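
`num_analysis()` returns the slope but not the intercept $a$ from the regression equation. To check the sums-based formulas by hand, the sketch below applies them to a small made-up data set and also computes the intercept. `pearson_and_slope`, `xs` and `ys` are hypothetical names, and the numbers are not from the insurance data:

```python
def pearson_and_slope(xs, ys):
    """Apply the sums-based formulas for r, slope b and intercept a."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x_sq = sum(x * x for x in xs)
    sum_y_sq = sum(y * y for y in ys)
    s_xy = sum_xy - sum_x * sum_y / n      # numerator shared by r and b
    s_xx = sum_x_sq - sum_x**2 / n
    s_yy = sum_y_sq - sum_y**2 / n
    r = s_xy / (s_xx * s_yy) ** 0.5
    slope = s_xy / s_xx
    intercept = (sum_y - slope * sum_x) / n
    return r, slope, intercept

# Toy data: sum_xy = 66, sum_x = 15, sum_y = 20, sum_x_sq = 55, sum_y_sq = 86
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, slope, intercept = pearson_and_slope(xs, ys)
print(round(r, 4), round(slope, 4), round(intercept, 4))  # 0.7746 0.6 2.2
```

With these values the shared numerator is 66 - 15 * 20 / 5 = 6, so r = 6 / sqrt(10 * 6) and b = 6 / 10.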

Make a function that calculates the average charges for each category of a categorical variable, and the percentage difference between the lowest and highest averages:

In [12]:
def cat_analysis(data_list, name):
    var_data_list = []
    for row in data_list:
        var_data_list.append(row[name])
    category_list = set(var_data_list)
    category_dict = {cat: [] for cat in category_list}
    for row in data_list:
        category_dict[row[name]].append(row["charges"])
    for (cat, charges) in category_dict.items():
        category_dict[cat] = sum(charges) / len(charges)
    print("'{name}' has {count} categories: {list}.".format(name=name, count=len(category_list), list=category_list))
    print("The average insurance costs ('charges') for each of them are:")
    maxi = ["", float("-inf")]
    mini = ["", float("inf")]
    for cat, avg_charges in category_dict.items():
        print("{nr:.2f} for {cat},".format(nr=avg_charges, cat=cat), end=" ")
        if avg_charges > maxi[1]:
            maxi = [cat, avg_charges]
        if avg_charges < mini[1]:
            mini = [cat, avg_charges]
    print("with insurance costs on average {diff:.0%} higher for '{max_cat}' compared to '{min_cat}'."\
         .format(diff = maxi[1]/mini[1] - 1, max_cat=maxi[0], min_cat=mini[0]))
    return category_dict

Test the function:

In [13]:
cat_analysis(insurance_data, "sex")

'sex' has 2 categories: {'male', 'female'}.
The average insurance costs ('charges') for each of them are:
13956.75 for male, 12569.58 for female, with insurance costs on average 11% higher for 'male' compared to 'female'.


{'male': 13956.751177721886, 'female': 12569.57884383534}
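
The per-category averaging in `cat_analysis()` can also be written as a single pass that keeps a running sum and count per category. The following is a sketch of that idea on made-up rows, not the project's implementation. `mean_charges_by` and `toy_rows` are hypothetical names:

```python
from collections import defaultdict

def mean_charges_by(data_list, name):
    """Return {category: mean charges} for the categorical column `name`."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [running sum, count]
    for row in data_list:
        totals[row[name]][0] += row["charges"]
        totals[row[name]][1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}

# Hypothetical rows, not taken from the dataset
toy_rows = [
    {"smoker": "yes", "charges": 30000.0},
    {"smoker": "yes", "charges": 34000.0},
    {"smoker": "no", "charges": 8000.0},
]
means = mean_charges_by(toy_rows, "smoker")

# Categories with the highest and lowest mean, and the relative difference
highest = max(means, key=means.get)
lowest = min(means, key=means.get)
relative_diff = means[highest] / means[lowest] - 1
print(means, "{:.0%}".format(relative_diff))
```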

Make a general function that takes the insurance data and the column name (dict key) of a variable and analyzes the variable:

In [14]:
def analyze_var(data_list, name):
    try:
        if type(data_list[0][name]) == str:
            cat_analysis(data_list, name)
        elif name == "charges":
            print("'charges' is the dependent variable.")
        else:
            num_analysis(data_list, name)
        print("")
    except KeyError:
        print("Please provide a valid list of dicts and a dict key as arguments.")
        return

Test the function:

In [15]:
analyze_var(insurance_data, "bmi")
analyze_var(insurance_data, "charges")

'bmi' and 'charges' have a correlation coefficient of 0.20 and a determination coefficient of 0.04.
It is a weak positive relationship.
Each unit increase in bmi is associated with an average 393.87 unit increase in insurance costs ('charges').
This linear relationship could explain 4% of variance in insurance costs.

'charges' is the dependent variable.



Since we want to analyze all the variables, make a function that calls `analyze_var()` for every variable, then call it:

In [16]:
def analyze_all_vars(data_list):
    for key in data_list[0].keys():
        analyze_var(data_list, key)
        
analyze_all_vars(insurance_data)

'age' and 'charges' have a correlation coefficient of 0.30 and a determination coefficient of 0.09.
It is a weak positive relationship.
Each unit increase in age is associated with an average 257.72 unit increase in insurance costs ('charges').
This linear relationship could explain 9% of variance in insurance costs.

'sex' has 2 categories: {'male', 'female'}.
The average insurance costs ('charges') for each of them are:
13956.75 for male, 12569.58 for female, with insurance costs on average 11% higher for 'male' compared to 'female'.

'bmi' and 'charges' have a correlation coefficient of 0.20 and a determination coefficient of 0.04.
It is a weak positive relationship.
Each unit increase in bmi is associated with an average 393.87 unit increase in insurance costs ('charges').
This linear relationship could explain 4% of variance in insurance costs.

'children' and 'charges' have a correlation coefficient of 0.07 and a determination coefficient of 0.00.
It is a very weak or nonexistent