# U.S. Medical Insurance

This is Nejc's portfolio project, U.S. Medical Insurance, completed as part of the Data Scientist: Analytics Specialist career path on Codecademy, an online provider of programming education and training.

The document below contains Python code that reads a CSV file called insurance.csv, collects its columns into lists, and runs calculations on them to find the average age, BMI score, and cost of medical insurance, as well as how gender, number of children, smoking, and region of residence relate to medical insurance costs.

## Project Objectives

The objectives of this project are to obtain information regarding the
+ number of individuals in the dataset and the proportion of males and females
+ average, maximum and minimum values of age, BMI score, number of children and medical insurance cost
+ number of smokers and non-smokers
+ number of residents in different regions

As a result, we want to conclude how different variables affect the cost of medical insurance.

In [1]:
# importing the csv library
import csv

In [2]:
# creating empty lists, one for each column in the csv file
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

In [3]:
# reading through the csv file, converting each value to its type and collecting it into the matching list
with open("insurance.csv", "r") as insurance_file:
    insurance_data = csv.DictReader(insurance_file)
    for row in insurance_data:
        age.append(int(row['age']))
        sex.append(row['sex'])
        bmi.append(float(row['bmi']))
        children.append(int(row['children']))
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(float(row['charges']))

In [4]:
# information regarding the number of individuals in the dataset, by counting the rows
num_of_population = len(age)
num_of_males = sex.count("male")
num_of_females = sex.count("female")
print("Number of individuals in the dataset is " +str(num_of_population) + ",")
print("of which, " + str(num_of_males) + " are males, which is " + str(round(num_of_males/num_of_population*100,1)) + "% of the total population;")
print("and of which, " + str(num_of_females) + " are females, which is " + str(round(num_of_females/num_of_population*100,1)) + "% of the total population.")

Number of individuals in the dataset is 1338,
of which, 676 are males, which is 50.5% of the total population;
and of which, 662 are females, which is 49.5% of the total population.


## Ages

In [5]:
# getting the average age information
avg_age = sum(age) / len(age)
oldest = max(age)
youngest = min(age)

# counting how many individuals share the maximum and minimum age found above
num_of_oldest = age.count(oldest)
num_of_youngest = age.count(youngest)

print("Average age among the whole list of individuals is",round(avg_age,2))
print("The oldest individuals on the list have " + str(oldest) + " years, and there are " + str(num_of_oldest) + " of them.")
print("The youngest individuals on the list thus have " + str(youngest) + " years, and there are " + str(num_of_youngest) + " of them.")

Average age among the whole list of individuals is 39.21
The oldest individuals on the list have 64 years, and there are 22 of them.
The youngest individuals on the list thus have 18 years, and there are 69 of them.


## BMI

In [6]:
avg_bmi = sum(bmi) / len(bmi)
lowest_bmi = min(bmi)
highest_bmi = max(bmi)


underweight = 0
normal = 0
overweight = 0

bmi_male = {'underweight' : 0,
    'normal' : 0,
    'overweight' : 0
           }
bmi_female = {'underweight' : 0,
    'normal' : 0,
    'overweight' : 0
             }

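# note: BMI values in the gaps between the ranges below (16.9 to 17 and 29.9 to 30)
# and values of 40 or more match no branch, so they are not counted in any category
for index in range(num_of_population):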
    if bmi[index] < 16.9:
        underweight += 1
        if sex[index] == 'male':
            bmi_male['underweight'] += 1
        else:
            bmi_female['underweight'] += 1
    elif bmi[index] >= 17 and bmi[index] < 29.9:
        normal += 1
        if sex[index] == 'male':
            bmi_male['normal'] += 1
        else:
            bmi_female['normal'] += 1
    elif bmi[index] >= 30 and bmi[index] < 40:
        overweight += 1
        if sex[index] == 'male':
            bmi_male['overweight'] += 1
        else:
            bmi_female['overweight'] += 1

print("Average BMI score among the individuals in the dataset is",round(avg_bmi,2))
print("The lowest BMI score is " + str(lowest_bmi) + ", and the highest is " + str(highest_bmi) + ".")
print("Amount of underweight individuals is " + str(underweight) + ", which is " + str(round(underweight/num_of_population * 100,1)) + "% of the total population, and of that, regarding the gender, " + str(round(bmi_male["underweight"])) + " are males, and " + str(round(bmi_female["underweight"])) + " are females.")
print("Number of individuals with normal BMI score is " + str(normal) + ", which is " + str(round(normal/num_of_population * 100,1)) + "% of the total population, and of that, " + str(round(bmi_male['normal'])) + " are males, and " + str(round(bmi_female['normal'])) + " are females.")
print("Amount of overweight individuals is " + str(overweight) + ", which is " + str(round(overweight/num_of_population * 100,1)) + "% of the total population, and of that, " + str(round(bmi_male['overweight'])) + " are males, " + str(round(bmi_female['overweight'])) + " are females.")

Average BMI score among the individuals in the dataset is 30.66
The lowest BMI score is 15.96, and the highest is 53.13.
Amount of underweight individuals is 3, which is 0.2% of the total population, and of that, regarding the gender, 2 are males, and 1 are females.
Number of individuals with normal BMI score is 616, which is 46.0% of the total population, and of that, 298 are males, and 318 are females.
Amount of overweight individuals is 616, which is 46.0% of the total population, and of that, 322 are males, 294 are females.
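
The three categories above add up to 1235 of the 1338 individuals, because some BMI values fall outside every range in the loop. As a minimal sketch of one way to avoid that, the hypothetical helper `bmi_category` below uses contiguous ranges. The cut-offs of 18.5, 25 and 30 are the commonly used adult BMI boundaries and are an assumption here; they differ from the thresholds used in the cell above, and the sample values are made up for illustration.

```python
def bmi_category(value):
    """Return a BMI category label using contiguous, non-overlapping ranges."""
    # assumed cut-offs (18.5, 25, 30); not the thresholds used in the cell above
    if value < 18.5:
        return "underweight"
    elif value < 25:
        return "normal"
    elif value < 30:
        return "overweight"
    return "obese"


# hypothetical sample values, not taken from insurance.csv
sample_bmi = [16.95, 22.0, 29.95, 30.0, 45.3]
sample_categories = [bmi_category(value) for value in sample_bmi]
print(sample_categories)
```

Because each branch only checks an upper bound, every value lands in exactly one category, so the category counts would always add up to the size of the dataset.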


## Children

In [7]:
count_of_children = sum(children)
avg_num_children = count_of_children / len(children)
max_children = max(children)
min_children = min(children)

print("Total amount of children, registered in the database is " + str(count_of_children) + ",")
print("while the average number of children on the individual is ",round(avg_num_children,2))
print("Maximum amount of children registered to an individual is " + str(max_children) + ",")
print("and the lowest amount of children registered to an individual is " + str(min_children) + ".")

Total amount of children, registered in the database is 1465,
while the average number of children on the individual is  1.09
Maximum amount of children registered to an individual is 5,
and the lowest amount of children registered to an individual is 0.


In [8]:
min_child_count = children.count(0)
one_child = children.count(1)
two_child = children.count(2)
three_child = children.count(3)
four_child = children.count(4)
max_child_count = children.count(5)

print("The number of people without children is "+ str(min_child_count) + ", or " + str(round(min_child_count/num_of_population * 100,2)) + "% of the total population,")
print("number of individuals with only one child is " + str(one_child) + ", or " + str(round(one_child/num_of_population * 100,2)) + "% of the total population,")
print("number of individuals with two children is " + str(two_child) + ", or " + str(round(two_child/num_of_population * 100,2)) + "% of the total population,")
print("number of individuals with three children is " + str(three_child) + ", or " + str(round(three_child/num_of_population * 100,2)) + "% of the total population,")
print("number of individuals with four children is " + str(four_child) + ", or " + str(round(four_child/num_of_population * 100,2)) + "% of the total population,")
print("and the number of individuals with five children, the maximum, is " + str(max_child_count) + ", or " +str(round(max_child_count/num_of_population * 100,2)) + "% of the total population.")

The number of people without children is 574, or 42.9% of the total population,
number of individuals with only one child is 324, or 24.22% of the total population,
number of individuals with two children is 240, or 17.94% of the total population,
number of individuals with three children is 157, or 11.73% of the total population,
number of individuals with four children is 25, or 1.87% of the total population,
and the number of individuals with five children, the maximum, is 18, or 1.35% of the total population.
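
The per-value counts above could also be gathered in a single pass. The following is a minimal sketch using `collections.Counter` from the standard library; `sample_children` is a made-up list for illustration, not the real column from insurance.csv.

```python
from collections import Counter

# hypothetical sample, not the real column from insurance.csv
sample_children = [0, 1, 0, 2, 3, 0, 1, 5]

# Counter maps each distinct value to the number of times it appears
child_counts = Counter(sample_children)
total = len(sample_children)

for n_children in sorted(child_counts):
    count = child_counts[n_children]
    share = round(count / total * 100, 2)
    print(f"{n_children} children: {count} individuals ({share}% of the sample)")
```

With the notebook's own data, `Counter(children)` would take the place of the six separate `children.count(...)` calls.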


## Smokers

In [9]:
# counting the number of smokers and non-smokers
number_of_smokers = smoker.count("yes")
number_of_non_smokers = smoker.count("no")

# creating a dictionary to count number of male and female smokers
sex_smoker={'male':[], 'female':[]}

for count in range(len(sex)):
    if sex[count] == "male":
        sex_smoker["male"].append(smoker[count])
    elif sex[count] == "female":
        sex_smoker["female"].append(smoker[count])

male_smokers_count = sex_smoker["male"].count("yes")
female_smokers_count = sex_smoker["female"].count("yes")

print("Number of smokers in the dataset is " + str(number_of_smokers) + ", which is " + str(round(number_of_smokers/num_of_population*100,1)) + "% of the total population.")
print("Number of non-smokers in the dataset is " + str(number_of_non_smokers) + ", which is " + str(round(number_of_non_smokers/num_of_population*100,1)) + "% of the total population.")
print("Regarding the gender, there are " + str(male_smokers_count) + " male smokers, which is " + str(round(male_smokers_count/number_of_smokers*100,1)) + "% of the total smokers, ")
print("and " + str(female_smokers_count) + " female smokers, which is " + str(round(female_smokers_count/number_of_smokers*100,1)) + "% of the total smokers.")

Number of smokers in the dataset is 274, which is 20.5% of the total population.
Number of non-smokers in the dataset is 1064, which is 79.5% of the total population.
Regarding the gender, there are 159 male smokers, which is 58.0% of the total smokers, 
and 115 female smokers, which is 42.0% of the total smokers.


## Regions

In [10]:
# counting the number of residents in selected regions
northwest = region.count("northwest")
southwest = region.count("southwest")
northeast = region.count("northeast")
southeast = region.count("southeast")

print("From NorthWestern region there are " + str(northwest) + " individuals, which is the "+ str(round(northwest/num_of_population*100,1)) + "% of the total population.")
print("From SouthWestern region there are " + str(southwest) + " individuals, which is the "+ str(round(southwest/num_of_population*100,1)) + "% of the total population.")
print("From NorthEastern region there are " + str(northeast) + " individuals, which is the "+ str(round(northeast/num_of_population*100,1)) + "% of the total population.")
print("From SouthEastern region there are " + str(southeast) + " individuals, which is the "+ str(round(southeast/num_of_population*100,1)) + "% of the total population.")

From NorthWestern region there are 325 individuals, which is the 24.3% of the total population.
From SouthWestern region there are 325 individuals, which is the 24.3% of the total population.
From NorthEastern region there are 324 individuals, which is the 24.2% of the total population.
From SouthEastern region there are 364 individuals, which is the 27.2% of the total population.


## Insurance Costs

In [11]:
# calculating total sum of charges
charges_total = sum(charges)
avg_charge = sum(charges) / len(charges)
max_charge = max(charges)
min_charge = min(charges)
sorted_charges = sorted(charges)
# note: this picks a single element just past the middle of the sorted list, which approximates the median;
# for an even number of values the exact median is the mean of the two middle values
median_charges = sorted_charges[int(round((len(charges)+1)/2,0))]

print("Total value of annual insurance costs is $" + str(round(charges_total,2)) + ".")
print("Average annual insurance cost is $" + str(round(avg_charge,2)) + ",")
print("the highest annual insurance cost someone pays is $" + str(round(max_charge,2)) + ",")
print("while the lowest annual insurance cost someone pays is $" + str(round(min_charge,2)) + ".")
print('Median insurance cost is $' + str(median_charges) + '.')

Total value of annual insurance costs is $17755824.99.
Average annual insurance cost is $13270.42,
the highest annual insurance cost someone pays is $63770.43,
while the lowest annual insurance cost someone pays is $1121.87.
Median insurance cost is $9391.346.
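
The median above is read from a single position in the sorted list. With an even number of values, the median is usually defined as the mean of the two middle values. The sketch below shows that calculation with `statistics.median` from the standard library and again by hand; `sample_charges` holds made-up values, not data from insurance.csv.

```python
import statistics

# hypothetical charges, not taken from insurance.csv
sample_charges = [1200.0, 9800.5, 4300.25, 15000.0]

# for an even number of values, statistics.median averages the two middle ones
exact_median = statistics.median(sample_charges)

# the same calculation written out by hand
ordered = sorted(sample_charges)
middle = len(ordered) // 2
if len(ordered) % 2 == 0:
    manual_median = (ordered[middle - 1] + ordered[middle]) / 2
else:
    manual_median = ordered[middle]

print(exact_median, manual_median)
```

Applied to the notebook's data, `statistics.median(charges)` may give a slightly different figure from the one printed above, since the dataset has an even number of rows.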


In [50]:
def higher_cost_analysis(threshold, age, sex, bmi, children, smoker,region, charges, num_of_population):
    # The "higher_" prefix of each of the following variables reminds us that we are analyzing people with high insurance costs
    higher_tot = 0 # total number of individuals paying an insurance cost >= threshold
    higher_charges_tot = 0 # sum of insurance charges, used later for the average charge
    higher_males = 0 # counter for male individuals
    higher_females = 0 # counter for female individuals
    higher_smoker = 0 # counter for smoker individuals
    higher_children = 0 # sum of children, used later for the average number of children
    higher_bmi = 0 # sum of bmi values, used later for the average bmi
    higher_age = 0 # sum of ages, used later for the average age
    # 4 variables detecting the region the individuals come from
    higher_ne = 0
    higher_se = 0
    higher_nw = 0
    higher_sw = 0
    for index in range(num_of_population):
        if charges[index] >= threshold:
            higher_tot += 1 # increment the number of individuals with insurance costs >= threshold
            higher_charges_tot += charges[index]
            higher_age += age[index]
            # let's find if they are male or female
            if sex[index] == 'male':
                higher_males += 1
            else:
                higher_females += 1
            # let's find if they are smokers
            if smoker[index] == 'yes':
                higher_smoker += 1
            # let's sum the number of children
            higher_children += children[index]
            
            #let's sum the bmi
            higher_bmi += bmi[index]
            
            # finally, let's define which region they come from
            if region[index] == 'northwest':
                higher_nw += 1
            elif region[index] == 'northeast':
                higher_ne += 1
            elif region[index] == 'southwest':
                higher_sw += 1
            elif region[index] == 'southeast':
                higher_se += 1

    print('Individuals that pay more for health insurance inside the population are: \n')
    # males vs females
    if higher_males > higher_females:
        print('\t - Mainly male, with a percentage of ' + str(round(higher_males/higher_tot*100,1)) + '%\n')
    elif higher_males < higher_females:
        print('\t - Mainly female, with a percentage of ' + str(round(higher_females/higher_tot*100,1)) + '%\n')
    else:
        print('\t - Equally distributed between men and women\n')
    
    # age
    print('\t - ' + str(round(higher_age/higher_tot,1)) + ' years old in average\n')
    
    # smoker vs non-smoker
    print('\t - Smoking in the ' + str(round(higher_smoker/higher_tot*100,1)) + '% of the cases\n')
            
    # bmi
    print('\t - With an average bmi of ' + str(round(higher_bmi/higher_tot,1)) + '\n')
                  
    # children
    print('\t - With an average of ' + str(round(higher_children/higher_tot,1)) + ' children\n')
                  
    # region
    print('\t - Geographically distributed in the following way:\n')
    print('\t \t * Northwest: ' + str(round(higher_nw/higher_tot*100,1)) + '%\n')
    print('\t \t * Northeast: ' + str(round(higher_ne/higher_tot*100,1)) + '%\n')
    print('\t \t * Southwest: ' + str(round(higher_sw/higher_tot*100,1)) + '%\n')
    print('\t \t * Southeast: ' + str(round(higher_se/higher_tot*100,1)) + '%\n')
            
    # mean insurance cost
    print('Those individuals pay for health insurance in average $' + str(round(higher_charges_tot/higher_tot,1)) + '.')

In [51]:
threshold = round(charges_total/num_of_population,1)
higher_cost_analysis(threshold, age, sex, bmi, children, smoker,region, charges, num_of_population)

Individuals that pay more for health insurance inside the population are: 

	 - Mainly male, with a percentage of 52.6%

	 - 42.5 years old in average

	 - Smoking in the 65.0% of the cases

	 - With an average bmi of 31.0

	 - With an average of 1.1 children

	 - Geographically distributed in the following way:

	 	 * Northwest: 22.9%

	 	 * Northeast: 26.0%

	 	 * Southwest: 20.2%

	 	 * Southeast: 31.0%

Those individuals pay for health insurance in average $27751.3.


## Conclusion

Looking at the results, we can observe that:

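+ gender is not a strong indicator of a higher cost of health insurance (52.6% of the individuals paying above-average costs are male, compared with 50.5% of the whole dataset),
+ smoking is associated with higher insurance costs (65.0% of the individuals paying above-average costs smoke, compared with 20.5% of the whole dataset),
+ the average BMI is in the "Mild Obesity" range,
+ the number of children doesn't appear to impact the cost of health insurance (1.1 on average among the individuals paying above-average costs, against 1.09 overall),
+ individuals from the Southeastern region are over-represented among those paying above-average costs (31.0%, compared with 27.2% of the whole dataset).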
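
One way to check observations like these directly is to compare the average charge of each group. The following is only a sketch under stated assumptions: `average_by_group` is a hypothetical helper that is not part of the notebook, and the sample lists are made up for illustration.

```python
def average_by_group(labels, values):
    """Return {label: mean of the values that share that label}."""
    totals = {}
    counts = {}
    for label, value in zip(labels, values):
        totals[label] = totals.get(label, 0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


# hypothetical sample, not taken from insurance.csv
sample_smoker = ["yes", "no", "no", "yes", "no"]
sample_charges = [30000.0, 5000.0, 7000.0, 20000.0, 6000.0]

sample_averages = average_by_group(sample_smoker, sample_charges)
print(sample_averages)
```

With the lists built earlier in the notebook, calls such as `average_by_group(smoker, charges)` or `average_by_group(region, charges)` would give the mean charge per group.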



In [52]:
def deepen_region_analysis(target, ages, bmi, children, smoker, region, charges, total_population):
    # Let's find out the smoking level, mean age, mean # of children and mean bmi in the target region
    sw_total = 0
    sw_smoker = 0
    sw_children = 0
    sw_bmi = 0
    sw_age = 0
    sw_age_list =[]
    for index in range(total_population):
        # select only individuals from the target region
        if region[index] == target:
            sw_total += 1
            sw_age += ages[index]
            sw_age_list.append(ages[index])
            # let's find if they are smokers
            if smoker[index] == 'yes':
                sw_smoker += 1
            # let's sum the number of children
            sw_children += children[index]
            
            #let's sum the bmi
            sw_bmi += bmi[index]
    
    # Print out the results
    print('In the ' + target + ' region the ' + str(round(sw_smoker/sw_total*100,1)) + '% of the population smokes\n')
    print('The average age is: ' + str(round(sw_age/sw_total,1)) + '\n')
    print('The average bmi is: ' + str(round(sw_bmi/sw_total,1)) + '\n')
    print('The average number of children is: ' + str(round(sw_children/sw_total,1)) + '\n')

In [53]:
deepen_region_analysis('southeast', age, bmi, children, smoker, region, charges, num_of_population)

In the southeast region the 25.0% of the population smokes

The average age is: 38.9

The average bmi is: 33.4

The average number of children is: 1.0



In [54]:
deepen_region_analysis('southwest', age, bmi, children, smoker, region, charges, num_of_population)

In the southwest region the 17.8% of the population smokes

The average age is: 39.5

The average bmi is: 30.6

The average number of children is: 1.1



In [55]:
deepen_region_analysis('northwest', age, bmi, children, smoker, region, charges, num_of_population)

In the northwest region the 17.8% of the population smokes

The average age is: 39.2

The average bmi is: 29.2

The average number of children is: 1.1



In [56]:
deepen_region_analysis('northeast', age, bmi, children, smoker, region, charges, num_of_population)

In the northeast region the 20.7% of the population smokes

The average age is: 39.3

The average bmi is: 29.2

The average number of children is: 1.0



Regarding the Southeastern region, which has higher medical insurance costs, the results above show that it has the highest percentage of smokers among all regions (25.0%), as well as the highest average BMI score (33.4), which is above the dataset average of 30.66.
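
The regional comparison above was produced by calling the function once per region. As a rough sketch of an alternative, the share of smokers for every region could be computed in one pass. The helper `smoker_share_by_region` is hypothetical and the sample lists are made up; neither comes from the notebook.

```python
from collections import defaultdict


def smoker_share_by_region(regions, smokers):
    """Return {region: percentage of individuals marked 'yes' as smokers}."""
    totals = defaultdict(int)   # individuals per region
    smoking = defaultdict(int)  # smokers per region
    for reg, smokes in zip(regions, smokers):
        totals[reg] += 1
        if smokes == "yes":
            smoking[reg] += 1
    return {reg: round(smoking[reg] / totals[reg] * 100, 1) for reg in totals}


# hypothetical sample, not taken from insurance.csv
sample_region = ["southeast", "southeast", "northwest", "southeast", "northwest", "southeast"]
sample_smoker = ["yes", "no", "no", "no", "no", "yes"]

sample_shares = smoker_share_by_region(sample_region, sample_smoker)
print(sample_shares)
```

Returning a dictionary instead of printing makes it easier to sort the regions or compare them in later cells.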

In conclusion, consumers wondering how to decrease their annual medical insurance costs could consider stopping smoking and switching to a healthier lifestyle, for example by exercising, playing a sport, or eating fewer calories to decrease their BMI score.