# U.S. Medical Insurance Costs

# 1. Look over your dataset

Download a zip file here with the necessary datasets and an empty Jupyter Notebook where you can write your code.

Open insurance.csv and take a look at the file. Take note of how information is organized. How will this affect how you analyze the data in Python? Is there anything of particular interest to you in the dataset that you want to investigate? Think about these things before you jump into analyzing it.


* The data in the csv file is organized as a table: each row stores the information for one person, described by 7 attributes in the columns:

    age, sex, bmi, children, smoker, region, charges
    
* Some of the columns are categorical (sex, smoker, region) while others are numerical (age: int, bmi: float, children: int, charges: float); all numbers are non-negative (children can be 0).
    
* There is no missing data
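
The "no missing data" observation can be checked in code rather than by eye. The snippet below is a minimal sketch, assuming the same column layout as insurance.csv; `sample_csv` and `count_missing` are hypothetical names, and the two sample rows are illustrative (one has a deliberately empty bmi cell) rather than taken from the real file.

```python
import csv
import io

# Illustrative sample using the insurance.csv column layout.
# The rows are made up for this sketch; the second row has an empty bmi cell.
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,,1,no,southeast,1725.55
"""

def count_missing(csv_file):
    """Return a dict mapping each column name to its number of empty cells."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name)
            # DictReader fills short rows with None; empty cells arrive as ''
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing

missing_counts = count_missing(io.StringIO(sample_csv))
print(missing_counts)
```

To run the same check on the real file, pass an open file object, for example `open("insurance.csv", newline='')`, instead of the `io.StringIO` sample.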

A library for reading .csv files, such as Python's built-in csv module, will be useful for importing the data. It makes sense to store the data in a dictionary whose keys are the column (variable) names and whose values are lists holding all observations for each variable.

It would be interesting to know how much each of the 6 features (age, sex, BMI, number of children, whether the person smokes, and region of residence) affects the charges, that is, how much the person pays for medical insurance. This is out of scope for now; we can come back to the question after we have covered visualization and prediction (linear regression).


# 2. Scoping Your Project

Now that you have looked over your dataset, plan out what you want to analyze. What is it that you want to find out about this dataset? Based on the way information is organized, certain inspections may be easier to perform than others. As you map out the process, consider the scope of your analysis as well.

Properly scoping your project will greatly benefit you; scoping creates structure while requiring you to think through your entire project before you begin. You should start by stating the goals for your project, then gathering the data, and considering the analytical steps required. A proper project scope can be a great road map for your project, but keep in mind that some down-stream tasks may become dead ends which will require adjustment to the scope. 

We want to analyze the population itself and answer the following questions:
* Are people more likely to be healthy (non-smoking, normal BMI) when they are younger or older, with or without children, male or female?

    * Children vs. smoker - Are people with children more likely to be smokers than people without?
    * Sex vs. smoker - Are females more likely to be smokers than males?
    * Age vs. smoker - Which part of the population smokes the most? At what age do they usually start?
    * BMI vs. age - What does the BMI distribution look like for different age groups? Does BMI increase with age?
    * BMI vs. sex - What does the BMI distribution look like for males and for females? Which sex is more predisposed to obesity?
    * BMI vs. smoker - Do smokers have a lower BMI than non-smokers?
    * BMI vs. children - Does BMI increase with the number of children?
    
Since BMI and age variables are numerical, we should subdivide them into a few categories to simplify the analysis (since we decided to investigate the data without plotting).
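
One way to subdivide the numerical variables is sketched below. The BMI cut-offs follow the standard WHO adult categories (18.5, 25, 30); the age brackets are an arbitrary assumption chosen for illustration, and both function names are hypothetical.

```python
def bmi_category(bmi):
    """Map a BMI value to a WHO adult weight category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

def age_group(age):
    """Map an age in years to an age bracket (brackets are an assumption)."""
    if age < 30:
        return "under 30"
    if age < 45:
        return "30-44"
    if age < 60:
        return "45-59"
    return "60+"

# Example: a 19-year-old with a BMI of 27.9
print(bmi_category(27.9), "/", age_group(19))
```

Because the values read from the csv file are strings, they would need to be converted first, for example `bmi_category(float(value))`.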

# 3. Import your dataset

Import insurance.csv into your Python file and inspect the contents.


In [100]:
%ls

insurance.csv  LICENSE  README.md  us-medical-insurance-costs.ipynb


In [101]:
import csv
dataset = {}
with open("insurance.csv", newline='') as insurance_csv:
    dict_reader = csv.DictReader(insurance_csv) # converts the lines of our CSV file to Python dictionaries
    column_names = dict_reader.fieldnames
    for column_name in column_names:
        dataset[column_name] = []
    for row in dict_reader:
        # row is a dictionary for one observation (one line of the file):
        # each key is a column heading and each value is the data in that column, read as a string
        for key, value in row.items():
            dataset[key].append(value)
# test
for key, value in dataset.items():
    print("{}: {}".format(key, len(value)))

age: 1338
sex: 1338
bmi: 1338
children: 1338
smoker: 1338
region: 1338
charges: 1338


In [102]:
#print(dataset)

# 4. Save your dataset via Python variables

Save the features of your dataset (the columns) from insurance.csv by storing them in variables that can be used for analysis. As you consider what types of variables to use and how many you plan to create, think ahead about the parameters you wish to investigate and how your organization will impact this analysis.
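
A possible way to do this is sketched below, assuming a dictionary shaped like the `dataset` built in section 3, where every value is still a string. `dataset_sample`, `column_types` and `convert_columns` are hypothetical names, and the two observations are illustrative.

```python
# Stand-in for the `dataset` dictionary from section 3: every value is a
# string, as returned by csv.DictReader. The observations are illustrative.
dataset_sample = {
    "age": ["19", "18"],
    "sex": ["female", "male"],
    "bmi": ["27.9", "33.77"],
    "children": ["0", "1"],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
    "charges": ["16884.92", "1725.55"],
}

# Converter for each numerical column; categorical columns stay strings
column_types = {"age": int, "bmi": float, "children": int, "charges": float}

def convert_columns(data, types):
    """Return a new dict with each column converted to its declared type."""
    return {
        name: [types.get(name, str)(value) for value in values]
        for name, values in data.items()
    }

typed = convert_columns(dataset_sample, column_types)

# One variable per feature, ready for analysis
ages = typed["age"]
sexes = typed["sex"]
bmis = typed["bmi"]
children = typed["children"]
smokers = typed["smoker"]
regions = typed["region"]
charges = typed["charges"]

print(ages, bmis, children)
```

Note that the analysis cells in section 5 work on the original string values (for example the key `'0'` for people without children), so they would need small adjustments if the converted columns were used instead.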

# 5. Build out analysis functions or class methods

You now have everything you need to begin your analysis. You have organized the information from insurance.csv and have spent some time thinking about what it is you would like to investigate.

Now is the time to build out how you perform these investigations. Use the Python fundamentals you have learned so far to accomplish these tasks. There are many different ways you can achieve these analyses. In our hint, we will provide some ideas for how you can use Python to analyze data.

In [103]:
def variable_count(data, variable_key): # data is dataset, variable_key is the column you want to count
    count_dict = {}
    for value in data[variable_key]:
        # unseen values start at 0, then one occurrence is added
        count_dict[value] = count_dict.get(value, 0) + 1
    return count_dict

In [104]:
def get_tuple_list(data, key1, key2):
    return list(zip(data[key1], data[key2]))

# Investigating smokers vs sex

The purpose of the investigation is to find out the percentage of female and male smokers in the population and whether females or males are more likely to be smokers.

In [105]:
# function for counting smokers grouped by the values of another column
# data is the tuple list of (smoker, other column) pairs, key2 names the other column (it is not used inside the function),
# smoker_dict is the dictionary the counts are added to
def smoker_count(data, key2, smoker_dict):
    for item in data:
        if item[0] == 'yes':
            if item[1] not in smoker_dict:
                smoker_dict[item[1]] = 1
            else:
                smoker_dict[item[1]] += 1
    return smoker_dict

In [106]:
data = get_tuple_list(dataset, 'smoker', 'sex')
#data

In [107]:
sex_smoker = {}
smoker_count(data, 'sex', sex_smoker)

{'female': 115, 'male': 159}

In [108]:
num_sex = variable_count(dataset, 'sex')
num_sex

{'female': 662, 'male': 676}

In [109]:
for key in num_sex.keys():
    percentage = sex_smoker.get(key) / num_sex.get(key) * 100
    print("{percentage}% of the {sex}s are smokers.".format(percentage = round(percentage,2), sex = key))

17.37% of the females are smokers.
23.52% of the males are smokers.


# Investigating percentage of smokers with children

The purpose of the comparison is to find out whether people with one or more children are more or less likely to be smokers.


In [110]:
data = get_tuple_list(dataset, 'smoker', 'children')
#data

In [111]:
num_children = variable_count(dataset, 'children')
num_children

{'0': 574, '1': 324, '3': 157, '2': 240, '5': 18, '4': 25}

In [112]:
num_smoker = variable_count(dataset, 'smoker')
num_smoker

{'yes': 274, 'no': 1064}

In [113]:
children_smoker = {}
children_smoker = smoker_count(data, 'children', children_smoker)
children_smoker

{'0': 115, '1': 61, '2': 55, '3': 39, '4': 3, '5': 1}

In [114]:
# percentage of all smokers that falls into each number-of-children group
for key in children_smoker.keys():
    percentage = children_smoker[key] / num_smoker['yes'] * 100
    print("{percentage}% of the smokers have {children} child(ren).".format(percentage = round(percentage,2), children = key))

41.97% of the smokers have 0 child(ren).
22.26% of the smokers have 1 child(ren).
20.07% of the smokers have 2 child(ren).
14.23% of the smokers have 3 child(ren).
1.09% of the smokers have 4 child(ren).
0.36% of the smokers have 5 child(ren).


In [115]:
total_children_smoker = num_smoker['yes'] - children_smoker['0']
percent_total = total_children_smoker / num_smoker['yes'] * 100
print('{percentage}% of smokers have at least one child.'.format(percentage = round(percent_total,2)))

58.03% of smokers have at least one child.
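
The percentages above are shares of all smokers, so they also depend on how many people are in each group. A complementary view is the smoking rate within each group. The sketch below recomputes it from the counts printed by cells In [111] and In [113]; the counts are copied in so that the snippet runs on its own, and `rate_within_groups` is a hypothetical helper name.

```python
# Counts copied from the outputs of cells In [111] and In [113]
num_children = {'0': 574, '1': 324, '2': 240, '3': 157, '4': 25, '5': 18}
children_smoker = {'0': 115, '1': 61, '2': 55, '3': 39, '4': 3, '5': 1}

def rate_within_groups(group_counts, smoker_counts):
    """Return the percentage of smokers inside each group."""
    return {
        key: round(smoker_counts.get(key, 0) / total * 100, 2)
        for key, total in group_counts.items()
    }

smoker_rate = rate_within_groups(num_children, children_smoker)
for key, rate in smoker_rate.items():
    print("{rate}% of the people with {children} child(ren) are smokers.".format(rate = rate, children = key))

# Same comparison for "no children" versus "at least one child"
people_with_children = sum(num_children.values()) - num_children['0']
smokers_with_children = sum(children_smoker.values()) - children_smoker['0']
rate_parents = round(smokers_with_children / people_with_children * 100, 2)
print("{rate}% of the people with at least one child are smokers.".format(rate = rate_parents))
```

Comparing these within-group rates answers the question "are people with children more or less likely to smoke?" more directly than the shares of smokers do.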


# Conclusions

Males were more likely to be smokers (24%) than females (17%). No statistical test was run, so we cannot say whether this difference is significant.

Of all smokers, 58% had at least one child and 42% had none. This mostly reflects the makeup of the sample, in which 764 of the 1338 people (57%) have at least one child: the share of smokers is about the same among people without children (115 of 574, 20%) and people with children (159 of 764, 21%). Parents should nevertheless be educated about the effects of passive smoking on children's health and about how important being a role model is. Smokers with 4 or more children were rare (4 of the 274 smokers), but so were people with 4 or more children in general (43 people out of 1338), so the data says little about large families.


# 6. Project Extensions

You’re welcome to expand your analysis beyond what you have already done! Some potential extra features to add to your portfolio project are the following:

* Organize your findings into dictionaries, lists, or another convenient datatype.
* Make predictions about what features are the most influential for an individual’s medical insurance charges based on your analysis.
* Explore areas where the data may include bias and how that would impact potential use cases.

Congrats on completing your portfolio project!
