# U.S. Medical Insurance Costs

## Project Scope

### Goal

The main goal of this project is to analyse how the general health of an individual contributes to their insurance cost and, if possible, to conclude that the healthier an individual is, the less that individual has to pay for insurance.

### Data

The data used for this project is the US medical insurance cost dataset provided by Codecademy. It contains over 1000 records of different individuals, their characteristics and their insurance cost. The data is already cleaned and is in CSV format. No records are missing and no values are oddly formatted.

For this analysis, three columns will be used: the `bmi` column, the `smoker` column and the `charges` column. A more detailed explanation of each column can be found below.

- smoker: The smoker status of the individual, either 'yes' or 'no'
- bmi: The bmi of the individual. The data is in float format. In other words, it is not a categorical column *
- charges: The cost that the individual pays for insurance **


\* Due to the limitations of the data, it is impossible to know the units of measurement for this property. It is known, though, that the bmi is the ratio of a person's weight to the square of their height. The universal way to calculate it is with the weight in `kg` and the height in `m` [source](https://en.wikipedia.org/wiki/Body_mass_index)

\*\* The units of charges are unknown. Considering the data is from the US, the unit of measurement is probably USD. Moreover, according to [Registered nursing.org](https://www.registerednursing.org/healthcare-costs-by-age/), the yearly healthcare spending of American citizens is in the thousands and grows with age. This is very similar to the data we have. Knowing this, it is reasonable to assume that the charges are yearly and in USD.
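
For reference, the bmi definition from the first note can be written as a small function. This is a minimal sketch, assuming metric inputs; the function name `body_mass_index` and its parameters are illustrative and are not part of the dataset or of the project code.

```python
def body_mass_index(weight_kg, height_m):
    """Return weight (kg) divided by the square of height (m)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
print(round(body_mass_index(70, 1.75), 2))  # 22.86
```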

### Analysis

The type of analysis that will be done is a descriptive analysis. Its goal is to give smokers one more argument to quit by showing the impact smoking has on their finances, on top of the impact it has on their health.

## Code

### Importing the libraries

The first step is to import the `csv` library, which will be used to read the data since it is already in `csv` format.

In [69]:
import csv

The next step is to load the data. For this analysis, only the `bmi`, `smoker` and `charges` columns are taken into consideration.

### Creating the Person class

To load the data, the `csv.DictReader` class will be used. But first, a class `Person` will be defined to hold the data of each person.

Here is a brief documentation on the `Person` class

- **Instance variables**
    - `bmi`: The bmi of the person
    - `smoker`: If the person is a smoker or not
    - `insurance_cost`: The yearly insurance cost of the person

- **Instance methods**
    - `get_smoker()`: Gets the smoker status of the person
    - `get_bmi()`: Gets the bmi of the person
    - `get_insurance_cost()`: Gets the insurance cost of the person

In [70]:
class Person():
    
    def __init__(self, bmi, smoker, charges):
        self.bmi = bmi
        self.smoker = smoker
        self.insurance_cost = charges
        
    def __repr__(self):
        return "Person with a bmi of {0}, a smoker status of {1} and a yearly charge of {2} USD.".format(self.bmi, self.smoker, self.insurance_cost)
        
    
    # define getters for the values allowing for easier data retrieval
    
    def get_bmi(self):
        return float(self.bmi)
    
    def get_smoker(self):
        return self.smoker
    
    def get_insurance_cost(self):
        return float(self.insurance_cost)

Then, a list of `Person` instances is created to store the data from the `csv` file

### Reading the Data

In [71]:
people = []

Now open the file and read from it.

In [72]:
with open('insurance.csv', newline='') as insurance_csv:
    insurance_data = csv.DictReader(insurance_csv)
    
    # Extract the bmi, smoker and charges data
    for row in insurance_data:
        bmi = row['bmi']
        smoker = row['smoker']
        charges = row['charges']
        new_person = Person(bmi, smoker, charges)
        people.append(new_person)    

To test our reading, we print out the first 10 records in the `people` list. Due to the `__repr__` function, these should be readable to humans.

In [73]:
for i in range(10):
    print(people[i])

Person with a bmi of 27.9, a smoker status of yes and a yearly charge of 16884.924 USD.
Person with a bmi of 33.77, a smoker status of no and a yearly charge of 1725.5523 USD.
Person with a bmi of 33, a smoker status of no and a yearly charge of 4449.462 USD.
Person with a bmi of 22.705, a smoker status of no and a yearly charge of 21984.47061 USD.
Person with a bmi of 28.88, a smoker status of no and a yearly charge of 3866.8552 USD.
Person with a bmi of 25.74, a smoker status of no and a yearly charge of 3756.6216 USD.
Person with a bmi of 33.44, a smoker status of no and a yearly charge of 8240.5896 USD.
Person with a bmi of 27.74, a smoker status of no and a yearly charge of 7281.5056 USD.
Person with a bmi of 29.83, a smoker status of no and a yearly charge of 6406.4107 USD.
Person with a bmi of 25.84, a smoker status of no and a yearly charge of 28923.13692 USD.
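
The statement that no records are missing can also be checked in code rather than by inspection. The sketch below is one possible way to do it, assuming the same column names as above; the helper `count_incomplete_rows` and the in-memory sample are hypothetical and stand in for `insurance.csv`.

```python
import csv
import io

REQUIRED_COLUMNS = ("bmi", "smoker", "charges")


def count_incomplete_rows(csv_file, required=REQUIRED_COLUMNS):
    """Count rows where any required column is absent or blank."""
    incomplete = 0
    for row in csv.DictReader(csv_file):
        if any(not (row.get(column) or "").strip() for column in required):
            incomplete += 1
    return incomplete


# Small made-up sample; the second row has an empty bmi field
sample = io.StringIO(
    "bmi,smoker,charges\n"
    "27.9,yes,16884.924\n"
    ",no,1725.5523\n"
)
print(count_incomplete_rows(sample))  # 1
```

With the real file, the same function could be called on the file object opened with `open('insurance.csv', newline='')`; a result of 0 would support the statement that no records are missing.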


### Create the Smoker Analysis class

The next step in this analysis is to create the `SmokerInsuranceAnalysis` class. This is the class that will be used to do all the calculations related to the data for the smoker category.

Here is a brief documentation on the class:

- **Instance Variables**
    - `people`: List of people to analyse

- **Instance Methods**
    - `get_average_non_smoker_insurance_cost()`: Gets the average non smoker insurance cost
    - `get_average_smoker_insurance_cost()`: Gets the average smoker insurance cost
    - `compare_insurance_cost_smoker_non_smoker()`: Compares the two costs

In [74]:
class SmokerInsuranceAnalysis():
    
    def __init__(self, people):
        self.people = people
        self.num_smokers = 0
        self.num_non_smokers = 0
        
        for person in self.people:
            if person.get_smoker() == "yes":
                self.num_smokers += 1
            elif person.get_smoker() == "no":
                self.num_non_smokers += 1
        
    # function returning the average smoker insurance cost
    def get_average_smoker_insurance_cost(self):
        total_cost_smokers = 0
        
        for person in self.people:
            if person.get_smoker() == "yes":
                total_cost_smokers += person.get_insurance_cost()
        
        return total_cost_smokers/self.num_smokers
    
    # function returning the average non-smoker insurance cost
    def get_average_non_smoker_insurance_cost(self):
        total_cost_non_smokers = 0
        
        for person in self.people:
            if person.get_smoker() == "no":
                total_cost_non_smokers += person.get_insurance_cost()
                
        return total_cost_non_smokers/self.num_non_smokers
    
    # function comparing the smoker cost vs the non-smoker cost
    def compare_insurance_cost_smoker_non_smoker(self):
        average_smoker_cost = self.get_average_smoker_insurance_cost()
        average_non_smoker_cost = self.get_average_non_smoker_insurance_cost()
        # The absolute value gives the gap regardless of which average is larger
        difference = abs(average_smoker_cost - average_non_smoker_cost)
            
        print("The average medical insurance cost for smokers is {0} USD yearly".format(round(average_smoker_cost, 2)))
        print("The average medical insurance cost for non-smokers is {0} USD yearly".format(round(average_non_smoker_cost, 2)))
        print("This represents a difference of {0} USD yearly on average".format(round(difference, 2)))
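
The same group averages can be obtained more compactly with the standard library. The following is only a sketch of an alternative, not the approach used in this project: the function `average_cost_by_smoker_status` is hypothetical, and it takes plain `(smoker, charges)` pairs instead of `Person` objects to stay self-contained.

```python
from collections import defaultdict
from statistics import mean


def average_cost_by_smoker_status(records):
    """Group charges by smoker status and return the mean of each group."""
    charges_by_status = defaultdict(list)
    for smoker, charges in records:
        charges_by_status[smoker].append(float(charges))
    # A group only exists once it holds a value, so mean() never sees an empty list
    return {status: mean(values) for status, values in charges_by_status.items()}


# Made-up values for illustration only
records = [("yes", "30000"), ("yes", "34000"), ("no", "8000"), ("no", "9000")]
averages = average_cost_by_smoker_status(records)
print(averages["yes"], averages["no"])  # 32000.0 8500.0
```

Because groups are created on demand, a dataset without any smokers would simply produce no `"yes"` key instead of a division by zero.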
                
    
                

### Creating the BMI Analysis class

The next step is to create the class to analyse the yearly charges depending on the bmi of the person. Since the `bmi` is not a categorical column, it requires a little more analysis. First, let's take a look at the distribution of the data.

In [75]:
min_bmi = 100
max_bmi = 0
for person in people:
    bmi = person.get_bmi()
    if bmi < min_bmi:
        min_bmi = bmi
    if bmi > max_bmi:
        max_bmi = bmi
        
print("Lowest bmi: ", min_bmi)
print("Highest bmi: ", max_bmi)

Lowest bmi:  15.96
Highest bmi:  53.13


Knowing that the bmi varies from 15.96 to 53.13, it is possible to make 8 categories, each 5 bmi points wide, starting at 15 and ending at 55.

The `BmiInsuranceAnalysis` class contains the following variables and methods
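
Because the categories all have the same width, the category number can also be derived arithmetically from the bmi. The sketch below assumes 8 ranges of width 5 starting at 15; the function `bmi_category` and its constants are illustrative names, not part of the project code.

```python
LOWER_BOUND = 15    # start of the first category
WIDTH = 5           # each category spans 5 bmi points
NUM_CATEGORIES = 8


def bmi_category(bmi):
    """Return the category number (1 to 8) for a bmi in the range [15, 55)."""
    if not LOWER_BOUND <= bmi < LOWER_BOUND + WIDTH * NUM_CATEGORIES:
        raise ValueError("bmi outside the covered range")
    return int((bmi - LOWER_BOUND) // WIDTH) + 1


print(bmi_category(15.96))  # 1
print(bmi_category(27.9))   # 3
print(bmi_category(53.13))  # 8
```

Unlike a chain of comparisons against upper boundaries, this version rejects values outside the covered range explicitly instead of placing them in the first category or leaving them unclassified.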

- **Class Variables**
    - `categories`: Dictionary with the 8 categories of bmi. The key is the category number and the value is the upper boundary
    - `people_classed`: Dictionary with the 8 categories and the people in each. The key is the category number and the value is a list of people in that category

- **Methods**
    - `class_people()`: Places each person in their category in the `people_classed` dictionary
    - `get_average_cost_for_class()`: Calculates the average cost of insurance for one category of people (a bmi range of 5)
    - `get_average_cost_per_class()`: Creates a dictionary whose keys are the category numbers and whose values are the average insurance cost for each category
    

In [76]:
class BmiInsuranceAnalysis():
    
    categories = {
        1: 20,
        2: 25,
        3: 30,
        4: 35,
        5: 40,
        6: 45,
        7: 50,
        8: 55,
    }
    
    people_classed = {
        1: [],
        2: [],
        3: [],
        4: [],
        5: [],
        6: [],
        7: [],
        8: [],
    }
    
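    def __init__(self, people):
        self.people = people
        
        # Give each instance its own empty lists. The class-level dictionary is
        # shared by all instances, so creating the analysis more than once would
        # otherwise keep appending to the same lists and inflate the counts.
        self.people_classed = {category: [] for category in self.categories}
        
        self.class_people()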
    
    
    def __repr__(self):
        # Return the summary as a string instead of printing inside __repr__
        return "\n".join("{0}: {1}".format(key, len(value)) for key, value in self.people_classed.items())
    
    
    def class_people(self):
        
        # Place each person in the first category whose upper boundary exceeds their bmi
        for person in self.people:
            bmi = person.get_bmi()
            
            for category, upper_boundary in self.categories.items():
                if bmi < upper_boundary:
                    self.people_classed[category].append(person)
                    break
     
    
    def get_average_cost_for_class(self, category):
        class_list = self.people_classed.get(category, None)
        
        if class_list is None:
            return "Cannot calculate average. Class number is invalid"
        
        # An empty category has no average; avoid dividing by zero
        if len(class_list) == 0:
            return None
        
        total_cost = 0
        for person in class_list:
            total_cost += person.get_insurance_cost()
        
        average_insurance_cost = round(total_cost / len(class_list), 2)
                
        return average_insurance_cost
    
    
    def get_average_cost_per_class(self):
        
        average_costs = {}
        
        for category in self.categories.keys():
            average_costs[category] = self.get_average_cost_for_class(category)
            
        return average_costs
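
One Python detail is worth keeping in mind for a class like this one: a dictionary defined in the class body, as `people_classed` is, belongs to the class and is shared by every instance unless an instance is given its own copy. In a notebook, where a cell can be run several times, lists stored that way would keep growing. The sketch below uses two hypothetical toy classes, `SharedBuckets` and `OwnBuckets`, to show the difference.

```python
class SharedBuckets:
    buckets = {1: []}  # class attribute: one dictionary for all instances

    def __init__(self, items):
        for item in items:
            self.buckets[1].append(item)


class OwnBuckets:
    def __init__(self, items):
        self.buckets = {1: []}  # instance attribute: a fresh dictionary each time
        for item in items:
            self.buckets[1].append(item)


# Simulate running the same notebook cell three times
for _ in range(3):
    shared = SharedBuckets(["a", "b"])
    own = OwnBuckets(["a", "b"])

print(len(shared.buckets[1]))  # 6, the items accumulated over three instantiations
print(len(own.buckets[1]))     # 2
```

Averages are not affected by this kind of duplication, since every value is repeated the same number of times, but the number of people reported per category is.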


## Results


Now for the results:

In [79]:

# For the smoker category:
print("For the smoker Category: ")

insurance_analysis_smoker = SmokerInsuranceAnalysis(people)
print("Number of smokers: ", insurance_analysis_smoker.num_smokers)
print("Number of non-smokers:", insurance_analysis_smoker.num_non_smokers)
insurance_analysis_smoker.compare_insurance_cost_smoker_non_smoker()

print("\n-----------------------------------------------------------\n")

#For the bmi category
print("For the BMI categories")

insurance_analysis_bmi = BmiInsuranceAnalysis(people)
bmi_average_costs = insurance_analysis_bmi.get_average_cost_per_class()

for key, value in bmi_average_costs.items():
    print("The average medical insurance cost for category {0} is {1} USD yearly with {2} people in it".format(key, value, len(insurance_analysis_bmi.people_classed[key])))

For the smoker Category: 
Number of smokers:  274
Number of non-smokers: 1064
The average medical insurance cost for smokers is 32050.23 USD yearly
The average medical insurance cost for non-smokers is 8434.27 USD yearly
This represents a difference of 23615.96 USD yearly on average

-----------------------------------------------------------

For the BMI categories
The average medical insurance cost for category 1 is 8838.56 USD yearly with 41 people in it
The average medical insurance cost for category 2 is 10572.37 USD yearly with 204 people in it
The average medical insurance cost for category 3 is 10987.51 USD yearly with 386 people in it
The average medical insurance cost for category 4 is 14419.67 USD yearly with 391 people in it
The average medical insurance cost for category 5 is 17022.26 USD yearly with 225 people in it
The average medical insurance cost for category 6 is 16569.6 USD yearly with 71 people in it
The average medical insurance cost for category 7 is 17815.04

## Discussion

Considering the results obtained in the [Results](#Results) section, it is possible to infer that the yearly cost of medical insurance grows the less healthy a person is. Considering this, if someone is looking for possible ways to lower medical insurance costs in the US, a good place to start would be to try and improve general health. A couple of ways to accomplish this goal would be to:

- Improve general eating habits
- Invest in a gym membership or in a fitness app
- Work out at least three times a week for 30 minutes
- Quit smoking (applies to smokers)

There are certain limitations to this analysis. First of all, the bmi of the majority of the people in the dataset was between 25 and 40. The other categories had very few people in them. This means that the costs were probably not very representative of the population for those categories.

Also, for the smoker category, there are approximately 4 times more non-smokers in the dataset than there are smokers (1064 versus 274). This means that there is a lot more data on non-smokers than on smokers, so the results for non-smokers are likely more representative of the non-smoker population than the results for smokers are of the smoker population. However, according to [this article](https://news.gallup.com/poll/109048/us-smoking-rate-still-coming-down.aspx), about one adult in five is a smoker, which is close to the proportion found in the dataset.
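
The proportions behind this comparison can be checked with a few lines of arithmetic. The counts below are the ones printed in the Results section; the variable names are illustrative.

```python
num_smokers = 274
num_non_smokers = 1064
total = num_smokers + num_non_smokers

non_smokers_per_smoker = num_non_smokers / num_smokers
smoker_share = num_smokers / total

print(total)                             # 1338
print(round(non_smokers_per_smoker, 2))  # 3.88
print(round(smoker_share * 100, 1))      # 20.5
```

Roughly one person in five in the dataset is a smoker, which corresponds to about four non-smokers for every smoker.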



## Conclusion

To conclude this analysis, it was found that the overall health of an individual, as indicated by the bmi and the smoker status of that person, contributes greatly to an increase or decrease of insurance cost. The healthier the person is, the less expensive the insurance is.

It would have been interesting to analyse this subject further with more details on each person's health condition, in order to have more factors influencing the insurance cost. Also, the `age` column would have been interesting to analyse, as it might impact the insurance cost: generally speaking, older people are less autonomous and might have to pay more for medical insurance. This could be the subject of a future analysis.