# <center>U.S. Medical Insurance Costs</center>

This is a **Data Science Python** portfolio project. The data set is provided by *codecademy.com* (Codecademy from now on) and consists of a csv file called 'insurance.csv', containing data about a group of people in the U.S. and how much they have been charged in medical insurance costs (in dollars). Even though this project belongs to a skill path at Codecademy, it is presented as an opportunity to showcase the student's skills. In fact, it encourages the student to optionally go beyond what they have learnt in the skill path up to that point, and to apply other concepts and tools that may be useful for the analysis (e.g. Python libraries such as _Numpy_, _Matplotlib_ or _Pandas_).

## INDEX

* 1. Scope of the project
    * 1.1. Objectives
    * 1.2. Methodology
* 2. Data cleaning and preprocessing
    * 2.1. Opening the csv file
    * 2.2. About the data
    * 2.3. Checking for missing data. Counting data items.
    * 2.4. Cleaning the data
* 3. Analysis
    * 3.1. Functions
    * 3.2. Storing data into variables for analysis
* 4. Outcomes
* 5. Conclusions
* 6. References

## 1. Scope of the project

### 1.1. Objectives

The goals of this project are:

* To find the person in the data set who pays the least and the person who pays the most.
* To calculate the sample size of each group within every category, both as a count and as a percentage.
* To find out the total charges of each subcategory inside age, sex, bmi, number of children, smoker status and region.
* To calculate the mean and median of the charges of all the subcategories (taking children as an example, that means the mean and median charges of people with no children, 1 child, 2 children, etc.) and to compare the results.
* To calculate the interquartile range of charges for each category, to discover how spread out these charges are in each category.
* To locate skewed data and explain why it is skewed.
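
As a minimal sketch of the statistics named in these goals, the example below computes a mean, a median and an interquartile range on a small list of made-up charges. The values and the variable names are assumptions chosen for illustration and do not come from 'insurance.csv'. It uses the standard library `statistics` module rather than the Numpy calls used later in this project; to my understanding, `method='inclusive'` follows the same linear interpolation rule that `np.percentile` applies by default.

```python
import statistics

# Hypothetical charges, invented for illustration only (not taken from insurance.csv)
toy_charges = [1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 30000.0]

mean_charge = statistics.mean(toy_charges)
median_charge = statistics.median(toy_charges)

# Quartiles with linear interpolation between data points
q1, _, q3 = statistics.quantiles(toy_charges, n=4, method='inclusive')
interquartile_range = q3 - q1

# A mean well above the median hints that a few large values
# pull the distribution to the right (right skew)
right_skew_hint = mean_charge > median_charge

print(mean_charge, median_charge, interquartile_range, right_skew_hint)
```

A single very large charge moves the mean far more than the median, which is why both are compared throughout the analysis.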

### 1.2. Methodology

This is a descriptive analysis. These are the steps I will follow:

* First, I will open the csv file and put each column into a list.
* After that, I will give a basic explanation of some of the fields in the data set.
* Then, I will clean the data provided and make it suitable for analysis. This will involve the following steps:
    * Counting the elements in every list to ensure that the numbers of items match.
    * Converting string values that represent numbers to int or float as needed.
    * Where values are necessarily strings, checking which distinct values appear.
    * Once the elements of the original lists are cleaned and moved to new lists, creating a Python dictionary that maps every category to its clean list.
* Afterwards, I will define some functions that will serve as tools for the analysis.
* Later, I will use my own functions and built-in ones to create variables that store the analyzed data.
* In the end, I will print these variables to inspect the results and draw my conclusions.

## 2. Data cleaning and preprocessing

### 2.1. Opening the csv file

In [1]:
import csv
import numpy as np

# Lists to add the data as it is initially provided
raw_age = []
raw_sex = []
raw_bmi = []
raw_children = []
raw_smoker = []
raw_region = []
raw_charges = []

"""After the previous data lists are cleaned, new lists will be generated. These will 
act as values for the keys (column names) of a dictionary which will store all the data
for further analysis."""

# Opening the data set and filling the raw data lists
with open("insurance.csv", newline='') as insurance_csv:
    data_reader = csv.DictReader(insurance_csv)
    for row in data_reader:
        raw_age.append(row['age'])
        raw_sex.append(row['sex'])
        raw_bmi.append(row['bmi'])
        raw_children.append(row['children'])
        raw_smoker.append(row['smoker'])
        raw_region.append(row['region'])
        raw_charges.append(row['charges'])

### 2.2. About the data

This data set displays several rows (later on we will know how many) and a total of seven columns. This is data about a group of people, with some information about them and the medical insurance costs they've been charged. The column names are:

* age
* sex
* bmi _(which stands for Body Mass Index)_
* children
* smoker
* region
* charges

For some of these categories we can already guess what to expect. One example is the 'smoker' category. In the U.S., smokers get charged more by insurance companies. This is called 'tobacco rating', and it differs depending on the state you live in$^1$.

BMI is another one. Being overweight should not be treated as something offensive, and no one should be insulted or alienated because of it. But in a health context, it is well established that obesity and overweight can lead to several diseases, including type 2 diabetes, cardiovascular disease, liver disease, obstructive sleep apnea and cancer, among others$^2$.

For adults, there are four categories that indicate your weight status, depending on the range your BMI falls in:

* Below 18.5 – you're in the underweight range.
* Between 18.5 and 24.9 – you're in the healthy weight range.
* Between 25 and 29.9 – you're in the overweight range.
* 30 or over – you're in the obese range.
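
The ranges above can be sketched as a small classifier. This is only an illustration: the function names `body_mass_index` and `bmi_category` are hypothetical and are not used elsewhere in this project. It assumes the usual definition of BMI (weight in kilograms divided by the square of the height in metres) and treats the ranges as half-open intervals, so a value such as 24.95, which falls between the listed ranges, counts as healthy.

```python
def body_mass_index(weight_kg, height_m):
    """Usual BMI definition: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(bmi):
    """Map an adult BMI value to one of the four ranges listed above.
    Assumption: boundaries are half-open (18.5 <= healthy < 25 <= overweight < 30)."""
    if bmi < 18.5:
        return 'underweight'
    elif bmi < 25:
        return 'healthy'
    elif bmi < 30:
        return 'overweight'
    return 'obese'


# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = body_mass_index(70, 1.75)
print(round(example_bmi, 1), bmi_category(example_bmi))
```

Using strict upper bounds avoids leaving any value unclassified, which can happen when the ranges are written with one decimal place only.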

In addition to height and weight, interpreting a BMI value should take into account other factors that can be crucial to assess it accurately, such as age, sex, muscle mass and ethnic group. "For example, adults of South Asian origin may have a higher risk of some health problems, such as diabetes, with a BMI of 23, which is usually considered healthy."$^3$

* __About how medical insurance costs are calculated__

According to Jean Folger, insurance companies take the following factors into account to calculate insurance premiums: age, the type of coverage, the amount of coverage, personal information and actuarial tables. The latter are used to "assess the risk of financial loss, using mathematics and statistics to predict the likelihood of an insurance claim"$^4$.


### 2.3. Checking for missing data. Counting data items

In [7]:
# I print the length of each raw data list to ensure they store the same number of items

print(len(raw_age))
print(len(raw_sex))
print(len(raw_bmi))
print(len(raw_children))
print(len(raw_smoker))
print(len(raw_region))
print(len(raw_charges))

# The result for every list is 1338 items. They all match.

1338
1338
1338
1338
1338
1338
1338
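
Equal lengths show that no column is shorter than the others, but a row could still hold an empty field. A minimal sketch of an additional check is shown below. The helper `count_missing` and the toy columns are hypothetical, written for illustration, and the result shown is not a statement about 'insurance.csv'.

```python
def count_missing(values):
    """Count entries that are None or blank once whitespace is stripped."""
    return sum(1 for v in values if v is None or str(v).strip() == '')


# Toy columns invented for illustration; csv.DictReader yields strings like these
toy_columns = {
    'age': ['19', '18', ''],
    'smoker': ['yes', 'no', 'no'],
    'charges': ['16884.924', '   ', '4449.462'],
}

missing_per_column = {name: count_missing(col) for name, col in toy_columns.items()}
print(missing_per_column)
```

The same helper could be applied to each raw list before converting its values to int or float, since a blank string would make those conversions fail.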


### 2.4. Cleaning the data

In [35]:
population_lenght = 1338

# Age

clean_age = [int(c) for c in raw_age]
unique_values_age = list(set(clean_age))

# Sex

clean_sex = raw_sex
unique_values_sex = list(set(clean_sex))

# BMI

clean_bmi = [round(float(c), 1) for c in raw_bmi]
unique_values_bmi = list(set(clean_bmi))

# Children

clean_children = [int(c) for c in raw_children]
unique_values_children = list(set(clean_children))

# Smoker

clean_smoker = raw_smoker
unique_values_smoker = list(set(raw_smoker))

# Region

clean_region = raw_region
unique_values_region = list(set(raw_region))

# Charges

clean_charges = [round(float(c), 2) for c in raw_charges]

# Data dictionary

data_dict = {}

data_dict['age'] = clean_age
data_dict['sex'] = clean_sex
data_dict['bmi'] = clean_bmi
data_dict['children'] = clean_children
data_dict['smoker'] = clean_smoker
data_dict['region'] = clean_region
data_dict['charges'] = clean_charges

## 3. Analysis

### 3.1. Functions

In [51]:
def arranger_counter_dict_maker(data_list):
    """This function takes a list as its only argument, and returns a dictionary
    where the keys are the distinct items of the list, sorted (alphabetically or numerically),
    and the values are how many times each item appears in the list."""
    sorted_list = sorted(data_list)
    count_list = [data_list.count(i) for i in sorted_list]
    return {key: value for key, value in zip(sorted_list, count_list)}

def percentage_calculator(sample, total):
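    """This function calculates what percentage of a total a sample represents (on a 0 to 100
    scale). It needs two numbers as arguments: a sample (int or float) and a total. The total
    must not be zero and, for a result of at most 100, must not be smaller than the sample."""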
    return sample * 100 / total

def sum_second_value(dictionary, key_1, value_1, key_2):
    """This function takes four arguments: a dictionary, two of its keys and one value belonging 
    to the first key. It returns the sum of all the values of the second key that share an 
    index with an occurrence of that value in the first key."""
    total = 0
    for i in range(len(dictionary[key_1])):
        if dictionary[key_1][i] == value_1:
            total += dictionary[key_2][i]
    return total

def fftl_inclusive_dt_arranger(dt_list, bottom_value, top_value):
    """This function counts how many items are in a list in a range from bottom value to top 
    value. If there is no top value, put 'inf' as top value."""
    item_counter = 0
    for i in dt_list:
        if i >= bottom_value and top_value == 'inf':
            item_counter += 1
        elif i >= bottom_value and i <= top_value:
            item_counter += 1
    return item_counter

def mean_second_value(dictionary, key_1, value_1, key_2):
    """This function takes four arguments: a dictionary, two of its keys and one value belonging 
    to the first key. It returns the mean of all the values in the second key that share an 
    index with an occurrence of that value in the first key. It requires numpy to be imported
    as np to work properly."""
    value_2_list = [dictionary[key_2][i] for i in range(len(dictionary[key_1])) if dictionary[key_1][i] == value_1]
    value_2_array = np.array(value_2_list)
    value_2_mean = np.mean(value_2_array)
    return value_2_mean

def median_second_value(dictionary, key_1, value_1, key_2):
    """This function takes four arguments: a dictionary, two of its keys and one value belonging 
    to the first key. It returns the median of all the values in the second key that share an 
    index with an occurrence of that value in the first key. It requires numpy to be imported
    as np to work properly."""
    value_2_list = [dictionary[key_2][i] for i in range(len(dictionary[key_1])) if dictionary[key_1][i] == value_1]
    value_2_array = np.array(value_2_list)
    value_2_median = np.median(value_2_array)
    return value_2_median

def interquartile_range_finder(dictionary, key_1, value_1, key_2):
    """This function takes four arguments: a dictionary, two of its keys and one value belonging 
    to the first key. It returns the interquartile range (75th minus 25th percentile) of all the
    values in the second key that share an index with an occurrence of that value in the first
    key. It requires numpy to be imported as np to work properly."""
    value_2_list = [dictionary[key_2][i] for i in range(len(dictionary[key_1])) if dictionary[key_1][i] == value_1]
    value_2_array = np.array(value_2_list)
    first_percentile = np.percentile(value_2_array, 25)
    third_percentile = np.percentile(value_2_array, 75)
    return third_percentile - first_percentile
    
def lowest_value_index_sample(dictionary, key):
    """This function takes a dictionary and one of its keys as arguments. All dictionary 
    values are expected to be lists of the same length (this is not checked). The function 
    looks for the index of the lowest value of the given key (the last one if there are ties), 
    and returns a dictionary where the keys are the same as in the dictionary given as 
    argument, and the values for each key are those that share the aforementioned index."""
    returned_dictionary = {}
    index = 0
    lowest_value = min(dictionary[key])  # computed once, not on every iteration
    for i in range(len(dictionary[key])):
        if dictionary[key][i] == lowest_value:
            index = i
    for i in dictionary:
        returned_dictionary[i] = dictionary[i][index]
    return returned_dictionary

def highest_value_index_sample(dictionary, key):
    """This function takes a dictionary and one of its keys as arguments. All dictionary 
    values are expected to be lists of the same length (this is not checked). The function 
    looks for the index of the highest value of the given key (the last one if there are ties), 
    and returns a dictionary where the keys are the same as in the dictionary given as 
    argument, and the values for each key are those that share the aforementioned index."""
    returned_dictionary = {}
    index = 0
    highest_value = max(dictionary[key])  # computed once, not on every iteration
    for i in range(len(dictionary[key])):
        if dictionary[key][i] == highest_value:
            index = i
    for i in dictionary:
        returned_dictionary[i] = dictionary[i][index]
    return returned_dictionary

def dictionary_range_maker(dictionary, key_1, key_2, range_bottom, range_top, new_key_1, new_key_2):
    """This function takes as arguments a dictionary, two of its keys, the bottom and top 
    values of an inclusive range ('inf' as top value means no upper limit), and two strings 
    that will act as keys for a new dictionary. It takes all the values of the first key that 
    fall in the given range and appends them to a new list called new_values_1. The values 
    from key_2 that have the same index as the chosen values from key_1 are put in a new list 
    called new_values_2. The function returns a dictionary where the keys are the two names 
    given as the last two arguments, and their values are the new lists."""
    returned_dictionary = {}
    new_values_1 = []
    new_values_2 = []
    for i in range(len(dictionary[key_1])):
        if dictionary[key_1][i] >= range_bottom and range_top == 'inf':
            new_values_1.append(dictionary[key_1][i])
            new_values_2.append(dictionary[key_2][i])
        elif dictionary[key_1][i] >= range_bottom and dictionary[key_1][i] <= range_top:
            new_values_1.append(dictionary[key_1][i])
            new_values_2.append(dictionary[key_2][i])
    returned_dictionary[new_key_1] = new_values_1
    returned_dictionary[new_key_2] = new_values_2
    return returned_dictionary
    
def category_data_encapsulator(dictionary, key_1, key_2):
    """This function takes a dictionary and two of its keys. It returns a dictionary where
    the keys are the distinct values of the first key provided, and the values are dictionaries
    with six string keys (Sample, Percentage, Total, Mean, Median and Interquartile range).
    Sample and Percentage describe how often the value appears in the first key; the other
    four are calculated over the values of the second key that share an index with that value,
    each one with a function defined before. To work properly, this function needs those
    functions and the global population_lenght to be defined, and some of them require numpy
    to be imported as np."""
    category_dictionary = {}
    # dict.fromkeys keeps each distinct value once, in order of first appearance,
    # so the statistics are not recomputed for repeated values
    for i in dict.fromkeys(dictionary[key_1]):
        category_dictionary[i] = {'Sample': dictionary[key_1].count(i),
                                 'Percentage' : round(percentage_calculator(dictionary[key_1].count(i), population_lenght), 2),
                                 'Total': round(sum_second_value(dictionary, key_1, i, key_2), 2),
                                 'Mean': round(mean_second_value(dictionary, key_1, i, key_2), 2),
                                 'Median': round(median_second_value(dictionary, key_1, i, key_2), 2),
                                 'Interquartile range': round(interquartile_range_finder(dictionary, key_1, i, key_2), 2)}
    return category_dictionary
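
The functions above scan the whole data set once per category value. As an alternative sketch, not the approach used in this project, the values of the second key can be grouped in a single pass and summarised afterwards. The helper `group_values` and the toy data are hypothetical, and the standard library `statistics` module stands in for Numpy here.

```python
from collections import defaultdict
import statistics


def group_values(dictionary, key_1, key_2):
    """Group the values of key_2 by the value found at the same index in key_1."""
    groups = defaultdict(list)
    for category, value in zip(dictionary[key_1], dictionary[key_2]):
        groups[category].append(value)
    return dict(groups)


# Toy data invented for illustration (not taken from insurance.csv)
toy_data = {
    'smoker': ['yes', 'no', 'no', 'yes', 'no'],
    'charges': [30000.0, 4000.0, 6000.0, 34000.0, 5000.0],
}

grouped = group_values(toy_data, 'smoker', 'charges')
summary = {
    category: {
        'Sample': len(values),
        'Total': sum(values),
        'Mean': statistics.mean(values),
        'Median': statistics.median(values),
    }
    for category, values in grouped.items()
}
print(summary)
```

Grouping first means every later statistic works on a short list per category, which matters little for 1338 rows but keeps the cost from growing with the number of distinct values.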

### 3.2. Storing data into variables for analysis

In [66]:
# Person who pays the least and person who pays the most

lowest_insurance_costs_person = lowest_value_index_sample(data_dict, 'charges')
highest_insurance_costs_person = highest_value_index_sample(data_dict, 'charges')

# Age

clean_age_array = np.array(clean_age)
age_total_mean = round(np.mean(clean_age_array))
age_total_median = round(np.median(clean_age_array))

age_dict = category_data_encapsulator(data_dict, 'age', 'charges')

# Sex

sex_dict = category_data_encapsulator(data_dict, 'sex', 'charges')

# BMI

bmi_mean = np.mean(clean_bmi)
bmi_median = np.median(clean_bmi)

underweight_bmi = fftl_inclusive_dt_arranger(clean_bmi, 0, 18.4)
healthy_bmi = fftl_inclusive_dt_arranger(clean_bmi, 18.5, 24.9)
overweight_bmi = fftl_inclusive_dt_arranger(clean_bmi, 25, 29.9)
obese_bmi = fftl_inclusive_dt_arranger(clean_bmi, 30, 'inf')

clean_bmi_x100 = [int(round((i * 100), 1)) for i in clean_bmi]
data_dict['bmi x100'] = clean_bmi_x100

bmi_under = fftl_inclusive_dt_arranger(clean_bmi_x100, 0, 1840)
bmi_healthy = fftl_inclusive_dt_arranger(clean_bmi_x100, 1850, 2490)
bmi_over = fftl_inclusive_dt_arranger(clean_bmi_x100, 2500, 2990)
bmi_obese = fftl_inclusive_dt_arranger(clean_bmi_x100, 3000, 'inf')

bmi_charges_underweight = dictionary_range_maker(data_dict, 'bmi x100', 'charges', 0, 1840, 'underweight', 'charges')
bmi_charges_healthy = dictionary_range_maker(data_dict, 'bmi x100', 'charges', 1850, 2490, 'healthy', 'charges')
bmi_charges_overweight = dictionary_range_maker(data_dict, 'bmi x100', 'charges', 2500, 2990, 'overweight', 'charges')
bmi_charges_obese = dictionary_range_maker(data_dict, 'bmi x100', 'charges', 3000, 'inf', 'obese', 'charges')

bmi_dict = {}

bmi_dict['underweight'] = {}
bmi_dict['underweight']['Sample'] = len(bmi_charges_underweight['underweight'])
bmi_dict['underweight']['Percentage'] = round(((bmi_dict['underweight']['Sample']) * 100 / population_lenght), 2)
bmi_dict['underweight']['Total'] = round((sum(bmi_charges_underweight['charges'])), 2)
bmi_dict['underweight']['Mean'] = round((np.mean(np.array(bmi_charges_underweight['charges']))), 2)
bmi_dict['underweight']['Median'] = round((np.median(np.array(bmi_charges_underweight['charges']))), 2)
bmi_dict['underweight']['Interquartile range'] = round((np.percentile((np.array(bmi_charges_underweight['charges'])), 75) - np.percentile((np.array(bmi_charges_underweight['charges'])), 25)), 2)

bmi_dict['healthy'] = {}
bmi_dict['healthy']['Sample'] = len(bmi_charges_healthy['healthy'])
bmi_dict['healthy']['Percentage'] = round(((bmi_dict['healthy']['Sample']) * 100 / population_lenght), 2)
bmi_dict['healthy']['Total'] = round((sum(bmi_charges_healthy['charges'])), 2)
bmi_dict['healthy']['Mean'] = round((np.mean(np.array(bmi_charges_healthy['charges']))), 2)
bmi_dict['healthy']['Median'] = round((np.median(np.array(bmi_charges_healthy['charges']))), 2)
bmi_dict['healthy']['Interquartile range'] = round((np.percentile((np.array(bmi_charges_healthy['charges'])), 75) - np.percentile((np.array(bmi_charges_healthy['charges'])), 25)), 2)

bmi_dict['overweight'] = {}
bmi_dict['overweight']['Sample'] = len(bmi_charges_overweight['overweight'])
bmi_dict['overweight']['Percentage'] = round(((bmi_dict['overweight']['Sample']) * 100 / population_lenght), 2)
bmi_dict['overweight']['Total'] = round((sum(bmi_charges_overweight['charges'])), 2)
bmi_dict['overweight']['Mean'] = round((np.mean(np.array(bmi_charges_overweight['charges']))), 2)
bmi_dict['overweight']['Median'] = round((np.median(np.array(bmi_charges_overweight['charges']))), 2)
bmi_dict['overweight']['Interquartile range'] = round((np.percentile((np.array(bmi_charges_overweight['charges'])), 75) - np.percentile((np.array(bmi_charges_overweight['charges'])), 25)), 2)

bmi_dict['obese'] = {}
bmi_dict['obese']['Sample'] = len(bmi_charges_obese['obese'])
bmi_dict['obese']['Percentage'] = round(((bmi_dict['obese']['Sample']) * 100 / population_lenght), 2)
bmi_dict['obese']['Total'] = round((sum(bmi_charges_obese['charges'])), 2)
bmi_dict['obese']['Mean'] = round((np.mean(np.array(bmi_charges_obese['charges']))), 2)
bmi_dict['obese']['Median'] = round((np.median(np.array(bmi_charges_obese['charges']))), 2)
bmi_dict['obese']['Interquartile range'] = round((np.percentile((np.array(bmi_charges_obese['charges'])), 75) - np.percentile((np.array(bmi_charges_obese['charges'])), 25)), 2)

# Children

children_dict = category_data_encapsulator(data_dict, 'children', 'charges')

# Smoker

smoker_dict = category_data_encapsulator(data_dict, 'smoker', 'charges')

# Region

region_dict = category_data_encapsulator(data_dict, 'region', 'charges')

## 4. Outcomes

In [95]:
# This cell is used to print the category dictionaries for data visualisation

"""print(lowest_insurance_costs_person)
print(highest_insurance_costs_person)
print(sex_dict)
print(smoker_dict)
print(region_dict)
for i in range(6):
    print(children_dict[i]['Median'])
for i in bmi_dict:
    print(bmi_dict[i]['Mean'])
for i in bmi_dict:
    print(bmi_dict[i]['Median'])
for i in bmi_dict:
    print(bmi_dict[i]['Interquartile range'])
for i in range(19, 65):
    print(age_dict[i]['Interquartile range'])""";

## 5. Conclusions

* The individual who pays the lowest medical insurance costs is an 18-year-old male with a BMI of 23.2 (inside the healthy range), who has no children, does not smoke and lives in the southeast region.
<br>
* The individual who pays the highest medical insurance costs is a 54-year-old female with a BMI of 47.4 (in the obese range, close to the highest BMI value found in the data set, which is 53.1), who has no children, smokes and lives in the southeast region.
<br>
* The people in the data set are aged from 18 to 64. The age mean and median match: 39 years old. The most common age is 18, with a share of the sample more than double that of the rest of the ages, so this data is slightly skewed. Looking at the mean, no proportional connection between age and charges can be seen, although the ages whose mean charge is higher than 20000 are among the oldest: 60, 61 and 64 years old. The median, on the other hand, does increase with age almost proportionally. The interquartile ranges for each age are quite heterogeneous.
<br>
* In the sex category, the total charged to men is greater than to women, but there are also more men than women in the sample (676 men against 662 women). The mean and median values are very similar, but the interquartile range is significantly higher for men, so within this group the charges are more spread out.
<br>
* The first look at the BMI category was shocking. Over half of the sample are people with obesity (52.84%). Only 16.59% of the sample falls in the healthy range. This alone is worrying. On the other hand, underweight people are only 1.49% of the data set. Overweight people make up almost 30%. There is a connection between BMI category and charges: underweight people pay the least, followed by healthy, overweight and obese people. Their mean and median values confirm that. However, the interquartile range suggests that there is more variation in what obese people are charged than in the other groups, whose interquartile ranges are very similar.
<br>
* The sample is very unevenly distributed across the numbers of children. There are people in the data set with zero to five children. More than half of the data (67.12%) are people with zero or one child, and people with four and five children make up a tiny percentage of the sample (1.87% and 1.35% respectively).
<br>
The highest mean of charges is for people with three children, and the lowest for those who have five. Medians and interquartile ranges show significant differences between the groups inside children. The widest spread in what they pay is found among people with three children, and the narrowest among people with five. There is no direct connection between number of children and insurance costs, probably because of the influence of the other categories in the data set.
<br>
* What about smokers? We can conclude that being a smoker really raises insurance costs. The samples of smokers and non-smokers are very different in size (274 vs 1064). Despite that, the total charges for both groups are of a similar size. Looking at the mean we find the first big difference between the two groups: smokers pay a mean of 32050.23, while non-smokers pay only 8434.27. Huge difference! The median charges of both groups are also far apart: 34456.35 for smokers and 7345.4 for non-smokers. The interquartile range for non-smokers is very close to their median, 7376.44, but in the smoker group it reaches 20192.96. This is really worth looking into!
<br>
* Northwest and southwest inhabitants get charged similar insurance costs, and the samples from those areas in the data set are equal. The northeast sample is almost the same size (only one person fewer), but its insurance costs are greater, so this region is more expensive for medical insurance. There is another example of skewed data here, as the southeast sample has about forty more people than each of the other regions.
<br>
If we look at the mean, it seems that U.S. southeast inhabitants pay the most for their insurance, followed by northeast, northwest and southwest people. This is skewed because, as mentioned before, the sample from the southeast is noticeably larger than those from the other regions. In this category, the medians give us more accurate information. The southeast has the second highest median, surpassed by the northeast and followed by the northwest and southwest. According to the interquartile range, there is more variation in what southeast inhabitants get charged. The region where the costs are most similar is the southwest.
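
As a worked check of the smoker figures quoted in the conclusions, the group totals can be approximated back from each sample size and mean. This is only a sketch: the variable names are introduced here for illustration, and because the means are already rounded to two decimals, the totals are approximations of the values stored in `smoker_dict`.

```python
# Figures quoted in the conclusions above
smokers_sample, smokers_mean = 274, 32050.23
non_smokers_sample, non_smokers_mean = 1064, 8434.27

# total = sample size * mean (approximate, since the means are rounded)
smokers_total = round(smokers_sample * smokers_mean, 2)
non_smokers_total = round(non_smokers_sample * non_smokers_mean, 2)

# How many times larger the mean smoker charge is
mean_ratio = round(smokers_mean / non_smokers_mean, 2)

print(smokers_total, non_smokers_total, mean_ratio)
```

Both totals come out close to 9 million dollars even though smokers are about a fifth of the sample, because their mean charge is roughly 3.8 times the non-smoker mean.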

## 6. References

1. [What You Need to Know About Smoking and Health Insurance](https://www.healthmarkets.com/resources/health-insurance/smoking-and-health-insurance/) Retrieved February 17, 2023
2. [Is Obesity a Disease?](https://health.clevelandclinic.org/obesity-is-now-considered-a-disease/) Retrieved February 15, 2023
3. [What is the body mass index (BMI)?](https://www.nhs.uk/common-health-questions/lifestyle/what-is-the-body-mass-index-bmi/) Retrieved February 15, 2023
4. [How to Calculate Insurance Premiums](https://www.investopedia.com/ask/answers/09/calculating-premium.asp) Retrieved February 17, 2023