# U.S. Medical Insurance Costs

## Project Scope

### Goal

The main goal of this project is to analyze how each category impacts the cost of health insurance, with the aim of developing machine learning models able to predict how much impact a certain characteristic will have on an individual's medical insurance cost. This way, insurance companies would be able to maximise profit and be more competitive in the market when selling insurance to individuals.

The goal mentioned above is out of scope given the limitations and the amount of the data at hand. However, it is a good target to aim for when analyzing the data, and it will be used to orient the rest of this project. The specific scope of this project is to analyze the impact of each category in the dataset on the yearly medical insurance cost of an individual.

### Data

The data used for this project is the US medical insurance cost dataset provided by Codecademy. It has over 1000 records of different individuals, their characteristics and their insurance cost. The data is already cleaned and is in `csv` format. No records are missing and no data is weirdly formatted.

For this analysis, all the columns will be used to get a better understanding of their impact on the cost of medical insurance. A more detailed explanation of each column can be found below.

- **age**: The age of the individual in years
- **sex**: The sex of the individual. For now, there are only male and female values in the dataset
- **bmi**: The bmi of the individual. The data is in float format. In other words, it is not a categorical column *
- **children**: The number of children that the individual has
- **smoker**: An indication of the smoker status of the individual, either 'yes' or 'no'
- **region**: The region where the person lives
- **charges**: The cost that the individual pays for insurance **


\* Due to the limitations of the data, it is impossible to know the units of measurement for this property. It is known, though, that the bmi is the ratio of the person's weight to the square of their height. The universal way to calculate it is with the weight in `kg` and the height in `m` ([source](https://en.wikipedia.org/wiki/Body_mass_index)).

\*\* The units of charges are unknown. Considering the data is from the US, the unit of measurement is probably USD. Moreover, according to [RegisteredNursing.org](https://www.registerednursing.org/healthcare-costs-by-age/), the yearly spending of American citizens is in the thousands and grows with age. This is very similar to the data we have. Knowing this, it is reasonable to assume that the charges are calculated yearly and in USD.
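
The bmi formula mentioned in the first footnote can be sketched as follows. This is only an illustration: the function name `compute_bmi` and the sample weight and height are assumptions, since the dataset provides the resulting bmi only.

```python
def compute_bmi(weight_kg, height_m):
    """Return the body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical individual: 70 kg and 1.75 m tall
print(round(compute_bmi(70, 1.75), 2))
```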

### Analysis

Ideally, the analysis would aim to optimize the prices set by insurance companies to help them stay competitive in the market while maximizing the profit from their sales.

However, as mentioned in the goal section, this project will limit itself to the analysis of the data currently at hand. With that in mind, the analysis will be descriptive, because the goal is to describe the impact of each factor on the medical insurance cost of individuals.

## Code

### Importing the libraries

The first step is to import the `csv` library, as this is the library that will be used to read the data, which is already in `csv` format.

In [85]:
import csv

The next step is to load the data. Each row of the file is read as a dictionary, and all of its columns are kept so that every category can be analysed.

### Creating the Person class

To accomplish the loading, the `csv.DictReader` function will be used. But first, a class `Person` will be defined to handle all the `person` data.

Here is a brief documentation on the `Person` class

- **Instance variables**
    - `age`: The age of the person
    - `sex`: The sex of the person
    - `bmi`: The bmi of the person
    - `children`: The number of children that the person has
    - `smoker`: If the person is a smoker or not
    - `region`: The region where the person lives
    - `insurance_cost`: The yearly insurance cost of the person

- **Instance methods**
    - `get_age()`: Gets the age of the individual
    - `get_sex()`: Gets the sex of the individual
    - `get_bmi()`: Gets the bmi status of the person
    - `get_number_children()`: gets the number of children that the individual has
    - `get_smoker()`: Gets the smoker status of the person
    - `get_region()`: Gets the region where the individual lives
    - `get_insurance_cost()`: Gets the insurance cost of the person

In [86]:
class Person():
    
    def __init__(self, age, sex, bmi, children, smoker, region, charges):
        self.age = age
        self.sex = sex
        self.bmi = bmi
        self.children = children
        self.smoker = smoker
        self.region = region
        self.insurance_cost = charges
    
    
    # The __repr__ function allows for better readability when printing the instance
    def __repr__(self):
        return "Person \n \
        age: {0} \n \
        sex: {1} \n \
        bmi: {2} \n \
        number of children: {3} \n \
        smoker: {4} \n \
        region: {5} \n \
        insurance cost: {6}\n \
        ".format(self.age, self.sex, self.bmi, self.children, self.smoker, self.region, self.insurance_cost)
        
        
    # define getters for the values allowing for easier and more readable data retrieval 
    def get_age(self):
        return self.age
    
    
    def get_sex(self):
        return self.sex
    
    
    def get_bmi(self):
        return float(self.bmi)
    
    
    def get_number_children(self):
        return self.children
    
    
    def get_smoker(self):
        return self.smoker
    
    
    def get_region(self):
        return self.region
    
    
    def get_insurance_cost(self):
        return float(self.insurance_cost)
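
For comparison, the same record could be modelled with the standard library's `dataclasses` module (Python 3.7+), which generates `__init__` and `__repr__` automatically. This is only a sketch of an alternative, not the class used in the rest of this notebook. The name `PersonRecord` is an assumption, and the sample values match the first record of the dataset.

```python
from dataclasses import dataclass


@dataclass
class PersonRecord:
    # Field names mirror the columns of the dataset; `PersonRecord` is a hypothetical name
    age: int
    sex: str
    bmi: float
    children: int
    smoker: str
    region: str
    insurance_cost: float


example = PersonRecord(19, "female", 27.9, 0, "yes", "southwest", 16884.924)
print(example)      # the generated __repr__ lists every field
print(example.bmi)  # attributes are read directly, without getters
```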

A list of `Person` instances is created to store the data from the `csv` file. For now, it is an empty list. It will be populated when the `csv` file is read

### Reading the Data

In [87]:
people = []

Now open the file and read from it.

In [88]:
with open('insurance.csv', newline='') as insurance_csv:
    insurance_data = csv.DictReader(insurance_csv)
    
    # Extract the data from the csv file
    # The age and number of children are converted to int to make calculations easier
    # The bmi and the charges are converted to float for the same reasons
    for row in insurance_data:
        age = int(row['age'])
        sex = row['sex']
        bmi = float(row['bmi'])
        num_children = int(row['children'])
        smoker = row['smoker']
        region = row['region']
        charges = float(row['charges'])
        new_person = Person(age, sex, bmi, num_children, smoker, region, charges)
        people.append(new_person)    
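
The reading step can also be tried without the file, by feeding `csv.DictReader` an in-memory sample. The sketch below assumes that the header of `insurance.csv` lists the columns in the order shown. The two rows reuse values from the first records of the dataset, and only a few columns are converted to keep the example short.

```python
import csv
import io

# Two sample rows; the header order is an assumption about the layout of insurance.csv
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

rows = []
for row in csv.DictReader(sample_csv):
    # Every value arrives as a string, so numeric columns are converted explicitly
    rows.append({
        "age": int(row["age"]),
        "bmi": float(row["bmi"]),
        "smoker": row["smoker"],
        "charges": float(row["charges"]),
    })

print(rows[0])
```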

To test the reading, the first 5 records in the `people` list are printed. Thanks to the `__repr__` function, these should be readable to humans.

In [89]:
for i in range(5):
    print(people[i])

Person 
         age: 19 
         sex: female 
         bmi: 27.9 
         number of children: 0 
         smoker: yes 
         region: southwest 
         insurance cost: 16884.924
         
Person 
         age: 18 
         sex: male 
         bmi: 33.77 
         number of children: 1 
         smoker: no 
         region: southeast 
         insurance cost: 1725.5523
         
Person 
         age: 28 
         sex: male 
         bmi: 33.0 
         number of children: 3 
         smoker: no 
         region: southeast 
         insurance cost: 4449.462
         
Person 
         age: 33 
         sex: male 
         bmi: 22.705 
         number of children: 0 
         smoker: no 
         region: northwest 
         insurance cost: 21984.47061
         
Person 
         age: 32 
         sex: male 
         bmi: 28.88 
         number of children: 0 
         smoker: no 
         region: northwest 
         insurance cost: 3866.8552
         


### Categorical Columns

To start, the columns, `sex`, `smoker` and `region` will be analysed. To do so, a comparison between each value of the category will be done. For example, a smoker's cost versus a non-smoker's cost or a male's cost versus a female's cost. 
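
The comparison described above boils down to grouping the records by the value of a category and averaging the cost within each group. A generic sketch of that idea is shown below. The helper `average_by_category` and the sample records are hypothetical and are not part of the classes defined in this notebook.

```python
from collections import defaultdict


def average_by_category(records, category_key, value_key):
    """Return {category value: average of value_key}, rounded to 2 decimals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[category_key]] += record[value_key]
        counts[record[category_key]] += 1
    return {key: round(totals[key] / counts[key], 2) for key in totals}


# Hypothetical records, not taken from the dataset
records = [
    {"smoker": "yes", "charges": 30000.0},
    {"smoker": "yes", "charges": 20000.0},
    {"smoker": "no", "charges": 8000.0},
]
print(average_by_category(records, "smoker", "charges"))
```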

#### The Sex Analysis class

The first category that will be analysed is sex. The goal of this analysis is to compare a male's cost to a female's cost without taking into consideration the other factors.

To accomplish this, a class `SexInsuranceAnalysis` will be created. Here is a brief documentation of the class.

- **Instance Variables**
    - `people`: A list of Person instances to be analyzed
- **Class Methods**
    - `get_average_cost(sex)`: Returns the average cost for the sex specified
    - `get_count_per_sex()`: Counts the number of females and males in the dataset

In [90]:
class SexInsuranceAnalysis():
    
    def __init__(self, people):
        self.people = people
        
        
    def get_average_cost(self, sex):
        total_cost = 0
        total_count = 0
        sex = sex.lower() ### Ensures consistent results with a given input. MALE, Male, male will all be treated as male
        
        if sex != 'male' and sex != 'female':
            raise KeyError('Invalid Sex given. Please choose either male or female')
        
        for person in self.people:
            if person.get_sex() == sex:
                total_cost += person.get_insurance_cost()
                total_count += 1
                
        return round(total_cost / total_count, 2)
    
    
    def get_count_per_sex(self):
        total_females = 0
        total_males = 0
        
        for person in self.people:
            if person.get_sex() == "male":
                total_males += 1
            elif person.get_sex() == "female":
                total_females += 1
                
        return {
            'female': total_females,
            'male': total_males,
        }

#### The Smoker Analysis class

The next step of this analysis is to create the `SmokerInsuranceAnalysis` class. This is the class that will be used to do all the calculations related to the smoker category.

Here is a brief documentation on the class:

- **Instance Variables**
    - `people`: List of people to analyse

- **Instance Methods**
    - `get_average_cost(smoker)`: Returns the average non smoker or smoker insurance cost
    - `get_count_per_smoker_status()`: Returns the count of smokers and non-smokers in the dataset in the form of a dictionary

In [91]:
class SmokerInsuranceAnalysis():
    
    def __init__(self, people):
        self.people = people

    
    def get_average_cost(self, smoker):
        total_cost = 0
        total_count = 0
        smoker = smoker.lower()
        
        if smoker != "yes" and smoker != "no":
            raise KeyError("Invalid smoker status. Please enter yes or no")
        
        for person in self.people:
            if person.get_smoker() == smoker:
                total_cost += person.get_insurance_cost()
                total_count += 1
                
        return round(total_cost / total_count, 2)
    
    
    def get_count_per_smoker_status(self):
        num_smokers = 0
        num_non_smokers = 0
        
        for person in self.people:
            if person.get_smoker() == 'yes':
                num_smokers += 1
            elif person.get_smoker() == 'no':
                num_non_smokers += 1
                
        return {
            'smoker': num_smokers,
            'non-smoker': num_non_smokers,
        }
    
                

#### The Region Analysis class

Next, a class to analyze cost based on the region is described below

- **Instance Variables** 
    - `people`: The people in the dataset
- **Class Methods**
    - `get_average_cost(region)`: Returns the yearly average cost for a region specified as a parameter
    - `get_count_per_region()`: Returns a dictionary containing the count for each region
    

First, it is useful to know which regions are in the dataset. To do so, a for loop iterates over the `people` list and creates a list of all unique regions.

In [92]:
regions = []

for person in people:
    if person.get_region() not in regions:
        regions.append(person.get_region())
        
print(regions)

['southwest', 'southeast', 'northwest', 'northeast']
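
As a side note, the same list of unique values can be obtained without an explicit membership test. The sketch below relies on `dict.fromkeys`, which keeps the first occurrence of each key in insertion order (Python 3.7+). The `region_values` list is a hypothetical stand-in for the values returned by `person.get_region()`.

```python
# Hypothetical region values standing in for person.get_region() results
region_values = ["southwest", "southeast", "southeast", "northwest", "northeast", "southwest"]

# dict.fromkeys keeps the first occurrence of each value and preserves insertion order
unique_regions = list(dict.fromkeys(region_values))
print(unique_regions)
```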


In [93]:
class RegionInsuranceAnalysis():
    
    def __init__(self, people):
        self.people = people
        
        
    def get_average_cost(self, region):
        total_cost = 0
        total_count = 0
        region = region.lower()
        
        if region != 'southwest' and region != 'southeast' and region != 'northwest' and region != 'northeast':
            raise KeyError("Invalid region given. Enter either of the valid values: southwest, southeast, northwest, northeast")
            
        for person in self.people:
            if person.get_region() == region:
                total_cost += person.get_insurance_cost()
                total_count += 1
                
        return round(total_cost / total_count, 2)
    
    
    def get_count_per_region(self):
        region_count = {
            'northwest': 0,
            'northeast': 0,
            'southeast': 0,
            'southwest': 0,
        }
        
        for person in self.people:
            region_count[person.get_region()] += 1
            
        return region_count

### Numerical Columns

Now for the numerical columns. Each class will follow roughly the same logic. The range of each column will be divided into categories of equal length, and the average cost will be computed for each category.
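
The division into equal-length categories can be sketched with a floor division. The classes below use chains of `elif` comparisons instead, so this is only an illustration of the idea. The helper `bin_index` and its parameters are hypothetical.

```python
def bin_index(value, start, width):
    """Return the 1-based index of the interval of the given width that contains value.

    Interval 1 is [start, start + width[, interval 2 is [start + width, start + 2 * width[, etc.
    """
    if value < start:
        raise ValueError("value is below the first interval")
    return int((value - start) // width) + 1


# With intervals of width 5 starting at 15: [15, 20[ is 1, [20, 25[ is 2, and so on
print(bin_index(15.96, 15, 5))
print(bin_index(53.13, 15, 5))
print(bin_index(20, 15, 5))
```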

#### Creating the BMI Analysis class

The next step is to create the class to analyse the yearly charges depending on the bmi of the person. Since the `bmi` is not a categorical column, it requires a little more analysis. First, let's take a look at the distribution of the data.

In [94]:
min_bmi = 100
max_bmi = 0
for person in people:
    bmi = person.get_bmi()
    if bmi < min_bmi:
        min_bmi = bmi
    if bmi > max_bmi:
        max_bmi = bmi
        
print("Lowest bmi: ", min_bmi)
print("Highest bmi: ", max_bmi)

Lowest bmi:  15.96
Highest bmi:  53.13
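
The same bounds can be found with the built-in `min` and `max` functions, which avoid having to pick starting values such as 100 and 0. In the sketch below, the `bmi_values` list is a stand-in for the values returned by `person.get_bmi()` and only holds the bmi of the first five records.

```python
# Stand-in for [person.get_bmi() for person in people], limited to the first five records
bmi_values = [27.9, 33.77, 33.0, 22.705, 28.88]

lowest_bmi = min(bmi_values)
highest_bmi = max(bmi_values)
print("Lowest bmi: ", lowest_bmi)
print("Highest bmi: ", highest_bmi)
```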


Knowing that the bmi varies from 15.96 to 53.13, it is possible to make 8 categories of width 5, starting at 15 and ending at 55.

The `BmiInsuranceAnalysis` class contains the following variables and methods

- **Class Variables**
    - `categories`: Dictionary with the 8 categories of bmi. The key is the category number and the value is the upper boundary
    - `people_classed`: Dictionary with the 8 categories and the people in each. The key is the category number and the value is a list of people in that category

- **Methods**
    - `class_people()`: classes the people in each category in the `people_classed` dictionary
    - `get_average_cost_for_class()`: Calculates the average cost of insurance for a category of people (bmi range of 5)
    - `get_average_cost_per_class()`: Creates a dictionary with keys equivalent to the category number and values equivalent to the average insurance cost for the category
    

In [95]:
class BmiInsuranceAnalysis():
    
    categories = {
        1: 20,
        2: 25,
        3: 30,
        4: 35,
        5: 40,
        6: 45,
        7: 50,
        8: 55,
    }
    
    people_classed = {
        1: [],
        2: [],
        3: [],
        4: [],
        5: [],
        6: [],
        7: [],
        8: [],
    }
    
    def __init__(self, people):
        self.people = people
        
        self.class_people()
    
    
    def __repr__(self):
        # Build one line per category with the number of people it contains
        lines = ["{}: {}".format(key, len(value)) for key, value in self.people_classed.items()]
        return "\n".join(lines)
    
    
    def class_people(self):
        
        for person in self.people:
            
            if person.get_bmi() < self.categories[1]:
                self.people_classed[1].append(person)
                
            elif person.get_bmi() < self.categories[2]:
                self.people_classed[2].append(person)
                
            elif person.get_bmi() < self.categories[3]:
                self.people_classed[3].append(person)
                
            elif person.get_bmi() < self.categories[4]:
                self.people_classed[4].append(person)
                
            elif person.get_bmi() < self.categories[5]:
                self.people_classed[5].append(person)
                
            elif person.get_bmi() < self.categories[6]:
                self.people_classed[6].append(person)
                
            elif person.get_bmi() < self.categories[7]:
                self.people_classed[7].append(person)
                
            elif person.get_bmi() < self.categories[8]:
                self.people_classed[8].append(person)
     
    
    def get_average_cost_for_class(self, category):
        class_list = self.people_classed.get(category, None)
        total_cost = 0
        
        if class_list is None:
            return "Cannot calculate average. Class number is invalid"
        else:
            for person in class_list:
                total_cost += person.get_insurance_cost()
        
        average_insurance_cost = round(total_cost / len(class_list), 2)
                
        return average_insurance_cost
    
    
    def get_average_cost_per_class(self):
        
        average_costs = {}
        
        for category in self.categories.keys():
            average_costs[category] = self.get_average_cost_for_class(category)
            
        return average_costs


#### The Age Analysis Class

As with the BMI, let's start by seeing how dispersed the data is. The maximum and the minimum age of the dataset are found below.

In [96]:
max_age = 0
min_age = 100

for person in people:
    if person.get_age() > max_age:
        max_age = person.get_age()
    if person.get_age() < min_age:
        min_age = person.get_age()
        
print("Max age: {}, min age: {}".format(max_age, min_age))

Max age: 64, min age: 18


Since the age varies from 18 years old to 64 years old, it is possible to make ten groups of 5 years.

The `AgeInsuranceAnalysis` class contains the following class variables and methods

- **Class Variables**
    - `age_groups`: Ten age groups of 5 years ranging from 15 years to 65 years. The key is the group number and the value is the upper boundary
    - `people_classed_in_age_groups`: Dictionary of people in their respective age groups
    
- **Methods**
    - `class_people()`: classes the people in each category in the `people_classed_in_age_groups` dictionary
    - `get_average_cost_for_class()`: Calculates the average cost of insurance for a category of people (age range of 5)
    - `get_average_cost_per_class()`: Creates a dictionary with keys equivalent to the category number and values equivalent to the average insurance cost for the category
    

In [97]:
class AgeInsuranceAnalysis():
    
    age_groups = {
        1: 20,
        2: 25,
        3: 30,
        4: 35,
        5: 40,
        6: 45,
        7: 50,
        8: 55,
        9: 60,
        10: 65,
    }
    
    people_classed_in_age_groups = {
        1: [],
        2: [],
        3: [],
        4: [],
        5: [],
        6: [],
        7: [],
        8: [],
        9: [],
        10: [],
    }
    
    def __init__(self, people):
        self.people = people
        
        self.class_people()
        
    def class_people(self):
        
        for person in self.people:
            
            if person.get_age() < self.age_groups[1]:
                self.people_classed_in_age_groups[1].append(person)
                
            elif person.get_age() < self.age_groups[2]:
                self.people_classed_in_age_groups[2].append(person)
                
            elif person.get_age() < self.age_groups[3]:
                self.people_classed_in_age_groups[3].append(person)
                
            elif person.get_age() < self.age_groups[4]:
                self.people_classed_in_age_groups[4].append(person)
                
            elif person.get_age() < self.age_groups[5]:
                self.people_classed_in_age_groups[5].append(person)
                
            elif person.get_age() < self.age_groups[6]:
                self.people_classed_in_age_groups[6].append(person)
                
            elif person.get_age() < self.age_groups[7]:
                self.people_classed_in_age_groups[7].append(person)
                
            elif person.get_age() < self.age_groups[8]:
                self.people_classed_in_age_groups[8].append(person)
            
            elif person.get_age() < self.age_groups[9]:
                self.people_classed_in_age_groups[9].append(person)
            
            elif person.get_age() < self.age_groups[10]:
                self.people_classed_in_age_groups[10].append(person)
     
    
    def get_average_cost_for_class(self, age_group):
        class_list = self.people_classed_in_age_groups.get(age_group, None)
        total_cost = 0
        
        if class_list is None:
            return "Cannot calculate average. Class number is invalid"
        else:
            for person in class_list:
                total_cost += person.get_insurance_cost()
        
        return round(total_cost / len(class_list), 2)
                
    
    
    def get_average_cost_per_class(self):
        
        average_costs = {}
        
        for age_group in self.age_groups.keys():
            average_costs[age_group] = self.get_average_cost_for_class(age_group)
            
        return average_costs
    
        
        
    

#### The number of children analysis class

Finally, to analyze the cost by number of children, the same logic will be used, but we won't divide the values into groups. Instead, we will take each number between the minimum and the maximum number of children.

In [98]:
max_children = 0
min_children = 100

for person in people:
    if person.get_number_children() > max_children:
        max_children = person.get_number_children()
        
    if person.get_number_children() < min_children:
        min_children = person.get_number_children()
        
print("Max children: {}, min children: {}".format(max_children, min_children))

Max children: 5, min children: 0


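5 categories will be created, one for each number of children from 1 to 5. Note that, as written, the class below does not class individuals with 0 children.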

The `NumChildrenInsuranceAnalysis` class contains the following class variables and methods

- **Class Variables**
    - `people_classed_by_num_children`: Dictionary of people classed by their number of children
    
- **Methods**
    - `class_people()`: classes the people in each category in the `people_classed_by_num_children` dictionary
    - `get_average_cost_for_class()`: Calculates the average cost of insurance for a category of people
    - `get_average_cost_per_class()`: Creates a dictionary with keys equivalent to the category number and values equivalent to the average insurance cost for the category

In [99]:
class NumChildrenInsuranceAnalysis():
    
    people_classed_by_num_children = {
        1: [],
        2: [],
        3: [],
        4: [],
        5: [],
    }
    
    def __init__(self, people):
        self.people = people
        
        self.class_people()
        
    
    def class_people(self):
        
        for person in self.people:
            
            if person.get_number_children() == 1:
                self.people_classed_by_num_children[1].append(person)
                
            elif person.get_number_children() == 2:
                self.people_classed_by_num_children[2].append(person)
                
            elif person.get_number_children() == 3:
                self.people_classed_by_num_children[3].append(person)
                
            elif person.get_number_children() == 4:
                self.people_classed_by_num_children[4].append(person)
                
            elif person.get_number_children() == 5:
                self.people_classed_by_num_children[5].append(person)
     
    
    def get_average_cost_for_class(self, num_children):
        class_list = self.people_classed_by_num_children.get(num_children, None)
        total_cost = 0
        
        if class_list is None:
            return "Cannot calculate average. Class number is invalid"
        else:
            for person in class_list:
                total_cost += person.get_insurance_cost()
        
        average_insurance_cost = round(total_cost / len(class_list), 2)
                
        return average_insurance_cost
    
    
    def get_average_cost_per_class(self):
        
        average_costs = {}
        
        for num in self.people_classed_by_num_children.keys():
            average_costs[num] = self.get_average_cost_for_class(num)
            
        return average_costs
    

## Results


Now for the results. To get the results, a `Comparison` class will be used. This class will contain all the methods necessary to get the results needed.

In [100]:
class Comparison():
    
    def __init__(self, people):
        self.people = people
        self.sex_analysis = SexInsuranceAnalysis(self.people)
        self.smoker_analysis = SmokerInsuranceAnalysis(self.people)
        self.region_analysis = RegionInsuranceAnalysis(self.people)
        self.bmi_analysis = BmiInsuranceAnalysis(self.people)
        self.age_analysis = AgeInsuranceAnalysis(self.people)
        self.num_children_analysis = NumChildrenInsuranceAnalysis(self.people)
        
    def compare_sex_costs(self):
        male_costs = self.sex_analysis.get_average_cost('male')
        female_costs = self.sex_analysis.get_average_cost('female')
        
        print("Male average costs: {} USD \nFemale average costs: {} USD".format(male_costs, female_costs))
        
        print("\n{}".format(self.sex_analysis.get_count_per_sex()))
        
    
    def compare_smoker_costs(self):
        smoker_costs = self.smoker_analysis.get_average_cost('yes')
        non_smoker_costs = self.smoker_analysis.get_average_cost('no')
        
        print("Smoker average costs: {} USD \nNon-Smoker average costs: {} USD".format(smoker_costs, non_smoker_costs))
        print("\n{}".format(self.smoker_analysis.get_count_per_smoker_status()))
        
    def compare_region_costs(self):
        regions = ['southwest', 'southeast', 'northwest', 'northeast']
        
        for region in regions:
            region_cost = self.region_analysis.get_average_cost(region)
            print("{}: {} USD".format(region, region_cost))
            
        print("{}".format(self.region_analysis.get_count_per_region()))
            
    
    def compare_bmi_costs(self):
        for bmi_group in self.bmi_analysis.people_classed.keys():
            bmi_group_cost = self.bmi_analysis.get_average_cost_for_class(bmi_group)
            
            print("[{}, {}[ kg/m2 : {} USD".format(self.bmi_analysis.categories.get(bmi_group) - 5, self.bmi_analysis.categories.get(bmi_group), bmi_group_cost))
            
    
    def compare_age_costs(self):
        for age_group in self.age_analysis.people_classed_in_age_groups.keys():
            age_group_cost = self.age_analysis.get_average_cost_for_class(age_group)
            
            print("[{}, {}[ years : {} USD".format(self.age_analysis.age_groups.get(age_group) - 5, self.age_analysis.age_groups.get(age_group), age_group_cost))
            
            
    def compare_num_children_cost(self):
        
        for num in self.num_children_analysis.people_classed_by_num_children.keys():
            num_children_cost = self.num_children_analysis.get_average_cost_for_class(num)
                
            if num == 1: 
                print("{} child : {}".format(num, num_children_cost))
            else:
                print("{} children : {}".format(num, num_children_cost))
                
                
    def getAllAnalysisResults(self):
        print("Categorical Columns\n\n")
        print("Sex Analysis:")
        self.compare_sex_costs()
        print("\nSmoker Analysis")
        self.compare_smoker_costs()
        print("\nRegion Analysis")
        self.compare_region_costs()
        print("\n\nNumerical Columns\n\n")
        print("BMI Analysis")
        self.compare_bmi_costs()
        print("\nAge Analysis")
        self.compare_age_costs()
        print("\nNumber of Children Analysis")
        self.compare_num_children_cost()
        print("\n\n\n - All Costs are in USD charged yearly")
        
                

In [101]:
comparison = Comparison(people)

comparison.getAllAnalysisResults()

Categorical Columns


Sex Analysis:
Male average costs: 13956.75 USD 
Female average costs: 12569.58 USD

{'female': 662, 'male': 676}

Smoker Analysis
Smoker average costs: 32050.23 USD 
Non-Smoker average costs: 8434.27 USD

{'smoker': 274, 'non-smoker': 1064}

Region Analysis
southwest: 12346.94 USD
southeast: 14735.41 USD
northwest: 12417.58 USD
northeast: 13406.38 USD
{'northwest': 325, 'northeast': 324, 'southeast': 364, 'southwest': 325}


Numerical Columns


BMI Analysis
[15, 20[ kg/m2 : 8838.56 USD
[20, 25[ kg/m2 : 10572.37 USD
[25, 30[ kg/m2 : 10987.51 USD
[30, 35[ kg/m2 : 14419.67 USD
[35, 40[ kg/m2 : 17022.26 USD
[40, 45[ kg/m2 : 16569.6 USD
[45, 50[ kg/m2 : 17815.04 USD
[50, 55[ kg/m2 : 16034.31 USD

Age Analysis
[15, 20[ years : 8407.35 USD
[20, 25[ years : 9598.2 USD
[25, 30[ years : 9524.78 USD
[30, 35[ years : 11223.89 USD
[35, 40[ years : 12282.51 USD
[40, 45[ years : 13922.74 USD
[45, 50[ years : 14845.89 USD
[50, 55[ years : 16869.02 USD
[55, 60[ years : 16077.64 U

## Discussion

In the following sub-sections, I discuss the results obtained in the [Results](#Results) section.

### Sex Results

Looking at the results for sex, it is possible to infer that male individuals are likely to pay more for insurance yearly than female individuals. In fact, it was found that males pay, on average, around 14 000 USD yearly, while females are close behind at around 12 500 USD on average. This is an unexpected result as, usually, women have to pay more for medical insurance. According to [this article](https://www.registerednursing.org/healthcare-costs-by-age/), women pay around 1/3 more for insurance than men yearly on average. This is due, in general, to their higher life expectancy (women tend to live two years longer than men) and to childbirth expenses.

What could explain the difference between the obtained results and the ones found by the *US Department of Health & Human Services* is the fact that our sample is not very representative of the US population. In fact, our sample only has 1338 records.
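
To put the gap between the two averages in perspective, the short calculation below expresses it as an absolute and a relative difference. The two averages are the ones reported in the results; the variable names are only illustrative.

```python
male_average = 13956.75    # USD, from the results section
female_average = 12569.58  # USD, from the results section

difference = male_average - female_average
relative_difference = difference / female_average * 100

print("Difference: {:.2f} USD".format(difference))
print("Relative difference: {:.1f} %".format(relative_difference))
```

In this dataset, males therefore pay roughly 11 % more than females on average, which is the opposite direction of the difference reported in the article.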

### Smoker Results

The results for the smoker category were expected. Smokers do pay more for medical insurance because they need more care in general and they are less healthy than non-smokers. These factors contribute to the overall increase in the cost of medical insurance. This practice is called *Tobacco Rating*.

Also, for the smoker category, there are approximately 4 times more non-smokers in the dataset than there are smokers (1064 versus 274). This means that there is a lot more data on non-smokers than on smokers, so the results for non-smokers are more representative of the non-smoker population than the results for smokers are of the smoker population. However, according to [this article](https://news.gallup.com/poll/109048/us-smoking-rate-still-coming-down.aspx), about 1 adult in 5 smokes, which is close to the proportion of smokers in the dataset.
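
The proportions discussed above can be checked directly from the counts reported in the results. The variable names in this short calculation are only illustrative.

```python
smokers = 274        # count reported in the results
non_smokers = 1064   # count reported in the results

total = smokers + non_smokers
smoker_share = smokers / total * 100
ratio = non_smokers / smokers

print("Share of smokers: {:.1f} %".format(smoker_share))
print("Non-smokers per smoker: {:.2f}".format(ratio))
```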

### Region Results

For the regions, there isn't much fluctuation between them. However, there is a small difference between the east and the west regions: the west looks slightly less expensive than the east. For this category, there were approximately the same number of people per region (between 324 and 364). However, that is still not a lot of records compared to the population of the United States, which is around 300 million inhabitants.

### Age Results

It is possible to infer that, for age, the cost of medical insurance grows as age grows. There is a clear overall increase in the cost of medical insurance as the individual gets older, with small dips for the 25 to 29 and 55 to 59 age groups. As for the smoker category, this is probably caused by the fact that the older an individual is, the more care he or she will need.

### BMI Results

The BMI results are very similar to the smoker results and the age results. These three categories seem to follow a certain tendency where the less healthy the individual, the more they will need to pay for insurance. As the BMI grows, meaning a less than ideal health condition, the cost of medical insurance seems to grow as well.

### Number of Children Results

In terms of the number of children, again there is an increase in the cost of medical insurance as the number of children grows. According to this [site](https://www.bcbsm.com/index/health-insurance-help/faqs/topics/buying-insurance/family-size-impact-cost.html), the more people on an individual's health plan, the more expensive it is. It is unclear, with this data, what the individual's health plan is. It could be a family plan, which would explain the increase in cost as the number of children goes up.

## Conclusion

Finally, the goal of this analysis was to analyze the individual impact of each factor on the cost of medical insurance of people in the United States, to make it possible for someone looking to sell insurance to set competitive prices while maximising their profit.

It was discovered that the overall health of an individual, as determined by the bmi, the age and the smoker status of that person, contributes greatly to an increase or decrease of insurance cost. The healthier the person is, the less expensive insurance is. Also, as the number of children goes up, the cost of health insurance increases as well.

There was, however, an unexpected result when it came to the sex of the individual. It was expected that women would pay more for health insurance than men, but it turned out to be the opposite. This unexpected result was attributed to the fact that the dataset is not really representative of the US population.

Now that the results have all been analysed individually, it is possible to go further and use machine learning techniques such as multiple linear regression to analyse the effect of combining various factors on the medical insurance cost. This way, it will be possible to predict the cost for a certain individual with x children and living in region y.
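
To give an idea of what such a multiple linear regression could look like, here is a minimal pure-Python sketch that fits an intercept and one coefficient per feature by solving the normal equations. It is an illustration under stated assumptions, not a model of the dataset: the helper names are hypothetical, and the data is synthetic, generated from a made-up rule so that the fitted coefficients can be checked. In practice, a dedicated numerical library would probably be a better choice.

```python
def solve_linear_system(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    # Build the augmented matrix so the original inputs are not modified
    aug = [list(matrix[i]) + [vector[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: features may be linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_linear_regression(features, targets):
    """Return [intercept, coef_1, ...] minimising the squared error (normal equations)."""
    rows = [[1.0] + list(f) for f in features]  # the leading 1.0 models the intercept
    size = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(size)]
    return solve_linear_system(xtx, xty)


# Synthetic data built from a made-up rule: cost = 1000 + 200 * age + 5000 * smoker
features = [(20, 0), (30, 1), (40, 0), (50, 1), (60, 0)]
targets = [1000 + 200 * age + 5000 * smoker for age, smoker in features]

intercept, age_coef, smoker_coef = fit_linear_regression(features, targets)
print(round(intercept, 2), round(age_coef, 2), round(smoker_coef, 2))
```

Because the synthetic targets follow the rule exactly, the fit recovers the intercept and the two coefficients up to floating-point error. With the real data, categorical columns such as `smoker` would first have to be encoded as numbers, for example 1 for 'yes' and 0 for 'no'.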