# U.S. Medical Insurance Costs

In this project, we investigate a medical insurance cost dataset stored in a **.csv** file. The project is part of the Data Scientist: Machine Learning Career Path on Codecademy and applies the Python skills developed in those courses.
The main goals of this project are:
- Import the dataset from the file insurance.csv.
- Analyze the dataset by building out functions or class methods.

First, we will import **insurance.csv** into Python and inspect its contents. To do this, we need to import Python's built-in csv module.

In [1]:
# import csv library
import csv

In this project we do not use the pandas library, which would be ideal for exploring a dataset like the one in the insurance.csv file. Instead, we take a quick look at the file in spreadsheet software and think about how to store, read, and analyze the data.

These are the main preliminary findings:
- The dataset contains 7 variables (each column is a variable) and 1,338 records (each row is a record from a patient).
- There is no missing data.
- There are four quantitative variables:
    - Patient Age
    - Patient Number of Children
    - Patient BMI
    - Patient Yearly Medical Insurance Cost
- There are three categorical variables:
    - Patient Sex
    - Patient Smoking Status
    - Patient Geographical Region
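The findings above came from a manual look in a spreadsheet, but they can also be checked with the csv module alone. The following is a minimal sketch, not part of the original notebook: `summarize_csv` is a hypothetical helper, and the two sample rows are made-up values used only to make the example self-contained.

```python
import csv
import io


def summarize_csv(csv_source):
    """Return (column names, record count, number of empty cells) for an open CSV file."""
    reader = csv.DictReader(csv_source)
    record_count = 0
    empty_cells = 0
    for row in reader:
        record_count += 1
        for value in row.values():
            # A cell counts as missing if it is absent (None) or blank.
            if value is None or (isinstance(value, str) and value.strip() == ''):
                empty_cells += 1
    return reader.fieldnames, record_count, empty_cells


# Illustrative two-row sample (made-up values), not the real dataset.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,1,no,southwest,5000.0\n"
    "45,male,31.5,,yes,northeast,20000.0\n"
)
columns, records, missing = summarize_csv(sample)
print(len(columns), records, missing)
```

To run it on the real file, pass an open file object instead, for example `with open('insurance.csv', newline='') as f: summarize_csv(f)`. If the findings above hold, it should report 7 columns, 1,338 records, and no empty cells.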

We will save the variables of this dataset by storing them in Python lists that we can later use for analysis. To store this information, we will create seven empty lists, one for each variable in the insurance.csv file. Then, we will define a function that will help us to save the information from the csv file into the lists in an efficient way.

In [2]:
# Create seven empty lists, one for each variable in the insurance.csv file

age = []
number_of_children = []
bmi = []
yearly_cost = []
sex = []
smoking_status = []
region = []

In [3]:
# Define a function to extract one column from the csv file and append it to a list.

def load_csv_data(list_var, csv_file, variable):
    # open the csv file; newline='' is what the csv module documentation recommends.
    with open(csv_file, newline='') as insurance_data:
        # csv.DictReader yields each row as a dictionary keyed by the column names.
        insurance_dict = csv.DictReader(insurance_data)
        # iterate through each row and append the requested column's value (a string) to the list.
        for row in insurance_dict:
            list_var.append(row[variable])
    # return the list.
    return list_var


In [4]:
# Populate the seven lists with help of the function we created above.

age = load_csv_data(age, 'insurance.csv', 'age')
number_of_children = load_csv_data(number_of_children, 'insurance.csv', 'children')
bmi = load_csv_data(bmi, 'insurance.csv', 'bmi')
yearly_cost = load_csv_data(yearly_cost, 'insurance.csv', 'charges')
sex = load_csv_data(sex, 'insurance.csv', 'sex')
smoking_status = load_csv_data(smoking_status, 'insurance.csv', 'smoker')
region = load_csv_data(region, 'insurance.csv', 'region')
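The cell above opens and reads the file once per variable. As a possible alternative, the sketch below reads the file a single time and collects every requested column in one pass. `load_csv_columns` is a hypothetical helper that is not part of the original notebook, and the sample rows are made-up values.

```python
import csv
import io


def load_csv_columns(csv_source, column_names):
    """Read an open CSV file once and return a dict mapping column name -> list of values."""
    columns = {name: [] for name in column_names}
    for row in csv.DictReader(csv_source):
        for name in column_names:
            # Values are kept as strings, as in load_csv_data above.
            columns[name].append(row[name])
    return columns


# Illustrative two-row sample (made-up values), not the real dataset.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,1,no,southwest,5000.0\n"
    "45,male,31.5,2,yes,northeast,20000.0\n"
)
loaded = load_csv_columns(sample, ['age', 'smoker', 'charges'])
print(loaded['age'])     # ['30', '45']
print(loaded['smoker'])  # ['no', 'yes']
```

With the real file, the same function could be called as `load_csv_columns(open_file, ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'])`. The trade-off is one dictionary of lists instead of seven separate variables.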


Our next step is to build out functions to perform an exploratory analysis on the insurance dataset. There are many aspects of the dataset that can be explored. In this project we will focus on the following research questions:
- What is the average age of the patients?
- What is the gender distribution (male vs. female) in the dataset?
- In which regions do the patients live?
- What is the proportion of smokers vs non-smokers?
- What is the average annual insurance cost per patient?
- How does the average annual insurance cost differ between smokers and non-smokers?
- What is the average number of children per patient? What does the distribution of this variable look like?

In [5]:
# Define a function to calculate the average

def calc_avg(list_var):
    # start the running total.
    total = 0
    # iterate through all elements in the list.
    for item in list_var:
        # convert each item to a float and add it to the running total.
        total += float(item)
    # return the average of the list
    return total / len(list_var)

# Define a function that calculates the absolute and relative frequencies of gender.

def frequency_gender(list_var):
    # start count variable for males and females.
    male_count = 0
    female_count = 0
    # iterate through each patient and add to the respective count depending on the gender.
    for item in list_var:
        if item == 'male':
            male_count += 1
        elif item == 'female':
            female_count += 1
    # calculate the percentage of each gender in the dataset.
    male_percentage = round((male_count / (male_count + female_count)),2) * 100
    female_percentage = round((female_count / (male_count + female_count)), 2) * 100
    # print the results.
    print('There are {males} males representing the {male_percentage} percentage of the dataset.'.format(males = male_count, male_percentage = male_percentage))
    print('There are {females} females representing the {female_percentage} percentage of the dataset.'.format(females = female_count, female_percentage = female_percentage))
    return male_count, female_count, male_percentage, female_percentage

# Define a function to find the unique values of the list.

def unique_values(list_var):
    # create an empty list (named so that it does not shadow the function).
    uniques = []
    # iterate through each value of the list and add it to uniques only if it is not already there.
    for item in list_var:
        if item not in uniques:
            uniques.append(item)
    return uniques
        
# Define a function that calculates the absolute and relative frequencies of smoking status.

def frequency_smoking(list_var):
    # start count variable for smokers and non_smokers.
    smoker_count = 0
    nonsmoker_count = 0
    # iterate through each patient and add to the respective count depending on the smoking status.
    for item in list_var:
        if item == 'yes':
            smoker_count += 1
        elif item == 'no':
            nonsmoker_count += 1
    # calculate the percentage of each smoking status in the dataset.
    smoker_percentage = round((smoker_count / (smoker_count + nonsmoker_count)),2) * 100
    nonsmoker_percentage = round((nonsmoker_count / (smoker_count + nonsmoker_count)), 2) * 100
    # print the results.
    print('There are {smoker} smokers representing the {smoker_percentage} percentage of the dataset.'.format(smoker = smoker_count, smoker_percentage = smoker_percentage))
    print('There are {nonsmoker} non-smokers representing the {nonsmoker_percentage} percentage of the dataset.'.format(nonsmoker = nonsmoker_count, nonsmoker_percentage = nonsmoker_percentage))
    return smoker_count, nonsmoker_count, smoker_percentage, nonsmoker_percentage

# Define a function to calculate the average insurance cost by smoking status

def calc_avg_smoking(list_charges, list_smoker):
    # start a counter variable on charges and number of patients for smokers and non-smokers.
    counter_smoker_charges = 0
    counter_nonsmoker_charges = 0
    counter_smoker_num = 0
    counter_nonsmoker_num = 0
    # Iterate through charges and smoking status together, adding each cost and patient to the matching counters.
    for charge, smoker in zip(list_charges, list_smoker):
        if smoker == 'yes':
            counter_smoker_charges += float(charge)
            counter_smoker_num += 1
        elif smoker == 'no':
            counter_nonsmoker_charges += float(charge)
            counter_nonsmoker_num += 1
    # calculate the average insurance cost for smokers and non_smokers.
    avg_smoker = round((counter_smoker_charges / counter_smoker_num),2)
    avg_nonsmoker = round((counter_nonsmoker_charges / counter_nonsmoker_num), 2)
    # print the results.
    print('The average annual insurance cost for smokers is ' + str(avg_smoker) + ' dollars.')
    print('The average annual insurance cost for non-smokers is ' + str(avg_nonsmoker) + ' dollars.')

# Define a function to analyze the distribution of the number of children variable.

def num_of_children_distribution(list_var):
    # start one counter per possible value of the variable (0 to 5 children).
    counts = [0] * 6
    # Iterate through the list and add one patient to the corresponding counter.
    for item in list_var:
        children = int(item)
        if 0 <= children <= 5:
            counts[children] += 1
    # Iterate through the absolute frequencies and print the results.
    for i, count in enumerate(counts):
        print('There are ' + str(count) + ' patients with ' + str(i) + ' children.')

The next step is to use the functions so we can analyze the data and answer the questions we proposed for this project. Our main findings are the following:

In [8]:
# Calculate average age of the patients
calc_avg(age)

39.20702541106129

The average age of the patients in this dataset is 39.2 years.

In [7]:
# Calculate gender distribution (male vs. female)
frequency_gender(sex)

There are 676 males representing the 51.0 percentage of the dataset.
There are 662 females representing the 49.0 percentage of the dataset.


(676, 662, 51.0, 49.0)

As the function's output shows, there are 676 males and 662 females in the dataset, representing about 51 and 49 percent of the total, respectively.
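The whole-number percentages arise because `frequency_gender` rounds the ratio to two decimals before multiplying by 100. A small sketch of the alternative, scaling first and rounding afterwards, is shown below. `percentage` is a hypothetical helper, not part of the original notebook; the counts are the ones reported above.

```python
def percentage(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(count / total * 100, 2)


total_patients = 676 + 662
male_share = percentage(676, total_patients)
female_share = percentage(662, total_patients)
print(male_share)    # 50.52
print(female_share)  # 49.48
```

Scaling before rounding keeps two decimals of the percentage itself, which shows the split is closer to even than the rounded 51 and 49 percent suggest.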

In [9]:
# Check in which regions the patients live
unique_values(region)

['southwest', 'southeast', 'northwest', 'northeast']

There are four unique regions in the dataset: southwest, southeast, northwest and northeast.
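Knowing the unique regions does not yet show how many patients live in each one. A minimal sketch of a general frequency table follows, assuming the standard-library `collections.Counter`. `frequency_table` is a hypothetical helper that is not part of the original notebook, and the sample list contains illustrative values only.

```python
from collections import Counter


def frequency_table(list_var):
    """Return {value: (count, percent)} for each distinct value in list_var."""
    counts = Counter(list_var)
    total = len(list_var)
    return {
        value: (count, round(count / total * 100, 2))
        for value, count in counts.items()
    }


# Illustrative values only, not taken from the dataset.
sample_regions = ['southwest', 'southeast', 'southeast', 'northwest']
region_table = frequency_table(sample_regions)
print(region_table)
# {'southwest': (1, 25.0), 'southeast': (2, 50.0), 'northwest': (1, 25.0)}
```

Calling `frequency_table(region)` on the real list would give the count and share of patients per region. The same function would also work for the sex and smoking status lists.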

In [13]:
# Smokers vs. non-smokers
frequency_smoking(smoking_status)

There are 274 smokers representing the 20.0 percentage of the dataset.
There are 1064 non-smokers representing the 80.0 percentage of the dataset.


(274, 1064, 20.0, 80.0)

The output above shows the number of smokers and non-smokers: about 20 percent of the patients are smokers and about 80 percent are non-smokers.

In [11]:
# Average annual insurance cost
calc_avg(yearly_cost)

13270.422265141257

The average annual insurance cost per patient is 13,270.42 dollars.

In [14]:
# Average annual insurance cost: smokers vs. non-smokers
calc_avg_smoking(yearly_cost, smoking_status)

The average annual insurance cost for smokers is 32050.23 dollars.
The average annual insurance cost for non-smokers is 8434.27 dollars.


The average annual insurance cost is substantially higher for smokers than for non-smokers. While a smoker pays 32,050.23 dollars per year on average, a non-smoker pays 8,434.27 dollars, so smokers pay roughly 3.8 times as much.
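The same comparison could be made for any categorical variable, not only smoking status. The sketch below generalizes the idea under that assumption. `average_by_group` is a hypothetical helper that is not part of the original notebook, and the sample lists are made-up values.

```python
def average_by_group(values, groups):
    """Return {group: mean of the values whose matching entry in groups is that group}."""
    totals = {}
    counts = {}
    for value, group in zip(values, groups):
        # Values arrive as strings from the csv file, so convert before summing.
        totals[group] = totals.get(group, 0.0) + float(value)
        counts[group] = counts.get(group, 0) + 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}


# Illustrative values only, not taken from the dataset.
sample_costs = ['100.0', '300.0', '50.0']
sample_smoker = ['yes', 'yes', 'no']
group_averages = average_by_group(sample_costs, sample_smoker)
print(group_averages)  # {'yes': 200.0, 'no': 50.0}
```

For example, `average_by_group(yearly_cost, region)` or `average_by_group(yearly_cost, sex)` would give the average cost per region or per sex, and unlike `calc_avg_smoking` the function returns its results rather than only printing them.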

In [17]:
# Average number of children and its distribution
print(calc_avg(number_of_children))
num_of_children_distribution(number_of_children)

1.0949177877429
There are 574 patients with 0 children.
There are 324 patients with 1 children.
There are 240 patients with 2 children.
There are 157 patients with 3 children.
There are 25 patients with 4 children.
There are 18 patients with 5 children.


On average, each patient has 1.09 children. The distribution shows that 574 patients have no children, 324 have 1 child, 240 have 2 children, 157 have 3 children, and only 43 have 4 or more.

In this project we applied Python fundamentals to perform an exploratory analysis of the U.S. Medical Insurance Costs dataset. There are many ways to expand on this project. Some potential extensions are:
- Create a dictionary combining the information of the seven variables.
- Explore areas where the data may include bias and how that would impact potential use cases.
- Make predictions about which variables are most likely to influence an individual's annual medical insurance cost.
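As a starting point for the first extension, the seven parallel lists could be combined into one dictionary per patient. This is a minimal sketch rather than a finished design: `build_patient_records` is a hypothetical helper, the key names mirror the csv column names used above, and the sample values are made up.

```python
def build_patient_records(age, sex, bmi, children, smoker, region, charges):
    """Combine seven parallel lists into a list of per-patient dictionaries."""
    records = []
    for a, s, b, c, sm, r, ch in zip(age, sex, bmi, children, smoker, region, charges):
        records.append({
            'age': int(a),          # quantitative values are converted from strings
            'sex': s,
            'bmi': float(b),
            'children': int(c),
            'smoker': sm,
            'region': r,
            'charges': float(ch),
        })
    return records


# Illustrative values only, not taken from the dataset.
patients = build_patient_records(
    ['30', '45'], ['female', 'male'], ['25.0', '31.5'], ['1', '2'],
    ['no', 'yes'], ['southwest', 'northeast'], ['5000.0', '20000.0'],
)
print(patients[0]['age'], patients[1]['charges'])  # 30 20000.0
```

With the lists from this project, the call would be `build_patient_records(age, sex, bmi, number_of_children, smoking_status, region, yearly_cost)`. Keeping each patient's values together makes it easier to filter on several variables at once.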