# U.S. Medical Insurance Costs

## Purpose

In this project, a **CSV** file with medical insurance costs will be investigated using Python fundamentals. The goal is to analyze various attributes within **insurance.csv** to learn more about the patient information in the file and gain insight into potential use cases for the dataset.


### Part 1: Preparing Data for Analysis

To start, all necessary libraries must be imported. For this project the only library needed is the `csv` library in order to work with the **insurance.csv** data. There are other potential libraries that could help with this project; however, for this analysis, using just the `csv` library will suffice.

In [None]:
# import csv library

The next step is to look through **insurance.csv** in order to get acquainted with the data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)
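
The checks listed above can also be done programmatically. The following is a minimal sketch rather than the notebook's own code: it reads a small, made-up stand-in for **insurance.csv** (the header names `age`, `sex`, `bmi`, `children`, `smoker`, `region`, and `charges` are assumed), then reports the column names, any empty cells, and whether each column looks numerical or categorical.

```python
import csv
import io

# Made-up sample rows standing in for insurance.csv (header names are assumed)
SAMPLE_CSV = (
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,southwest,3000.50\n"
    "40,male,30.1,2,yes,southeast,20000.00\n"
    "55,female,27.3,1,no,southwest,11000.25\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
column_names = reader.fieldnames
rows = list(reader)

# Record (row number, column name) for every empty or absent cell
missing = [
    (row_number, name)
    for row_number, row in enumerate(rows, start=1)
    for name in column_names
    if (row[name] or "").strip() == ""
]


def is_numeric(value):
    """Return True if the cell text can be parsed as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


# A column counts as numerical only if every one of its cells parses as a number
column_types = {
    name: "numerical" if all(is_numeric(row[name]) for row in rows) else "categorical"
    for name in column_names
}

print(column_names)
print(missing)
print(column_types)
```

To run the same checks on the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced with a file object opened on **insurance.csv**.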

In [None]:
# Create empty lists for the various attributes in insurance.csv

**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, seven empty lists will be created to hold each individual column of data from **insurance.csv**.
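
As a sketch, the seven lists could be declared like this, one per column. The variable names are illustrative assumptions and do not come from the original notebook.

```python
# One empty list per column of insurance.csv (names are illustrative)
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_charges = []

all_columns = [ages, sexes, bmis, num_children,
               smoker_statuses, regions, insurance_charges]
print(len(all_columns))
```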


In [None]:
# helper function to load csv data

# open csv file

# read csv file

# loop through the data in each row of the csv

# add the data from each row to a list

# return the list
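
The comments above outline the helper step by step. One way it might look is sketched below. The function name `load_list_data` comes from the text, but its signature `(lst, csv_file, column_name)` and the use of `csv.DictReader` are assumptions. So that the example runs without **insurance.csv**, it writes a tiny made-up file to a temporary location first.

```python
import csv
import os
import tempfile


def load_list_data(lst, csv_file, column_name):
    """Append every value found under column_name in csv_file to lst."""
    # open csv file
    with open(csv_file, newline="") as csv_info:
        # read csv file; DictReader maps each row to {header: value}
        csv_dict = csv.DictReader(csv_info)
        # loop through the data in each row of the csv
        for row in csv_dict:
            # add the data from each row to a list
            lst.append(row[column_name])
    # return the list
    return lst


# Made-up sample rows standing in for insurance.csv (header names are assumed)
sample_text = (
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,southwest,3000.50\n"
    "40,male,30.1,2,yes,southeast,20000.00\n"
    "55,female,27.3,1,no,southwest,11000.25\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

ages = load_list_data([], sample_path, "age")
regions = load_list_data([], sample_path, "region")
os.remove(sample_path)

print(ages)     # ['25', '40', '55']
print(regions)  # ['southwest', 'southeast', 'southwest']
```

Note that the `csv` module returns every value as a string, so numeric columns such as age or charges need to be converted with `int()` or `float()` before any arithmetic.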

The helper function above was created to make loading data into the lists as efficient as possible. Without this function, one would have to open **insurance.csv** and rewrite the `for` loop seven times; however, with this function, one can simply call `load_list_data()` each time as shown below.

In [None]:
# look at the data in insurance_csv_dict

### Part 2: Analyzing Data

Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can be started. This is where one must plan out what to investigate and how to perform the analysis. There are many aspects of the data that could be looked into. The following operations will be implemented:
* find average age of the patients
* return the number of males vs. females counted in the dataset
* find geographical location of the patients
* return the average yearly medical charges of the patients
* create a dictionary that contains all patient information

To perform these inspections, a class called `PatientsInfo` has been built out which contains five methods:
* `analyze_ages()`
* `analyze_sexes()`
* `unique_regions()`
* `average_charges()`
* `create_dictionary()`

The class has been built out below. 

In [None]:
# init method that takes each list parameter

# method that calculates the average age of the patients in insurance.csv

# initialize total age at zero

# iterates through all ages in the ages list

# sum of the total age

# returns the total age divided by the length of the patient list

# method that calculates the number of males and females in insurance.csv

# initialize number of males and females to zero

# iterates through each sex in the sexes list

# if female add to female variable

# if male add to male variable

# print out the number of each

# method to find each unique region patients are from

# initialize empty list

# iterate through each region in the regions list

# if the region is not already in the unique regions list
# then add it to the unique regions list

# return unique regions list

# method that finds the average yearly medical charges for patients in insurance.csv

# initialize total charges at zero

# iterate through each charge in the patients charges list
# add each charge to total_charge

# return the average charges rounded to the hundredths place

# method to create dictionary with all patients information
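
Following the comments above, the class might be sketched as below. The class name and the five method names come from the text. The attribute names, the dictionary keys, and the return value of `analyze_sexes()` are assumptions added for illustration. The instance is built from three made-up patients instead of the real file, and the values are strings because that is how the `csv` module delivers them.

```python
class PatientsInfo:
    # init method that takes each list parameter
    def __init__(self, patients_ages, patients_sexes, patients_bmis,
                 patients_num_children, patients_smoker_statuses,
                 patients_regions, patients_charges):
        self.patients_ages = patients_ages
        self.patients_sexes = patients_sexes
        self.patients_bmis = patients_bmis
        self.patients_num_children = patients_num_children
        self.patients_smoker_statuses = patients_smoker_statuses
        self.patients_regions = patients_regions
        self.patients_charges = patients_charges

    # method that calculates the average age of the patients
    def analyze_ages(self):
        total_age = 0
        for age in self.patients_ages:
            total_age += int(age)
        return total_age / len(self.patients_ages)

    # method that counts the number of females and males
    def analyze_sexes(self):
        females = 0
        males = 0
        for sex in self.patients_sexes:
            if sex == "female":
                females += 1
            elif sex == "male":
                males += 1
        print("Count for female:", females)
        print("Count for male:", males)
        # returning the counts is an addition here, to make them reusable
        return {"female": females, "male": males}

    # method to find each unique region patients are from
    def unique_regions(self):
        unique_regions = []
        for region in self.patients_regions:
            if region not in unique_regions:
                unique_regions.append(region)
        return unique_regions

    # method that finds the average yearly medical charges
    def average_charges(self):
        total_charges = 0
        for charge in self.patients_charges:
            total_charges += float(charge)
        return round(total_charges / len(self.patients_charges), 2)

    # method to create dictionary with all patients information
    def create_dictionary(self):
        return {
            "age": self.patients_ages,
            "sex": self.patients_sexes,
            "bmi": self.patients_bmis,
            "children": self.patients_num_children,
            "smoker": self.patients_smoker_statuses,
            "region": self.patients_regions,
            "charges": self.patients_charges,
        }


# Three made-up patients, not taken from insurance.csv
patient_info = PatientsInfo(
    ["25", "40", "55"],
    ["female", "male", "female"],
    ["22.5", "30.1", "27.3"],
    ["0", "2", "1"],
    ["no", "yes", "no"],
    ["southwest", "southeast", "southwest"],
    ["3000.50", "20000.00", "11000.25"],
)

print(patient_info.analyze_ages())
print(patient_info.unique_regions())
print(patient_info.average_charges())
```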

The next step is to create an instance of the class called `patient_info`. With this instance, each method can be used to see the results of the analysis.

The average age of the patients in **insurance.csv** is about 39 years old. This is important to check in order to ensure the data in **insurance.csv** is representative of a broader population. If it is decided to use the dataset to make inferences about other populations, the data must be abundant and broad enough for such use cases.

Further analysis would be needed to make sure the [range](https://www.mathsisfun.com/data/range.html#:~:text=The%20Range%20is%20the%20difference,is%209%20%E2%88%92%203%20%3D%206.) and [standard deviation](https://www.mathsisfun.com/data/standard-deviation.html) of the patient ages in **insurance.csv** are indicative of a random sampling of individuals.
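
As a rough illustration of that follow-up, the range and standard deviation can be computed with the built-in `max()` and `min()` functions and the standard-library `statistics` module. The ages below are made up for the example and are not taken from **insurance.csv**. Using `statistics` goes beyond the `csv`-only approach of this project and is an assumption of this sketch.

```python
import statistics

# Made-up ages, already converted from strings to integers
sample_ages = [19, 25, 33, 40, 47, 52, 64]

# Range: difference between the largest and smallest value
age_range = max(sample_ages) - min(sample_ages)

# Sample standard deviation: typical distance of an age from the mean
age_std = statistics.stdev(sample_ages)

print("Mean age:", statistics.mean(sample_ages))
print("Age range:", age_range)
print("Age standard deviation:", round(age_std, 2))
```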

The next step of the analysis is to check the balance of males vs. females in **insurance.csv**. Similar to above, it is important to check that this dataset is representative of a broader population of individuals. If a person were to use this dataset to create a classification model, it would be imperative to make sure that the attributes are balanced.

Quite often in the real world, data is not balanced; this is a problem because imbalance can bias the results of an analysis. This is something that will be explored further in future portfolio projects!
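
One simple way to quantify balance is to turn the raw counts into shares of the total. The sketch below uses `collections.Counter` on a made-up list of labels; it is an illustration and is not part of the `PatientsInfo` class.

```python
from collections import Counter

# Made-up labels, not taken from insurance.csv
sample_sexes = ["female", "male", "male", "female", "male"]

# Count how often each label occurs
counts = Counter(sample_sexes)
total = sum(counts.values())

# Share of the dataset held by each label (0.5 each would be perfectly balanced)
shares = {label: count / total for label, count in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} patients ({share:.0%})")
```

The same pattern would work for any categorical column, such as smoking status or region.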

There are four unique geographical regions in this dataset, and it is important to note that all the patients come from the United States.

The average yearly medical insurance charge per individual is 13270 US dollars. Some further analysis could be done to see what patient attributes contribute most strongly to low and/or high medical insurance charges. For example, one could check if patient age correlates with the amount of money they spend yearly.
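
A first look at such a relationship could use the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship). The sketch below implements it with plain Python. The function name `pearson_correlation` and the sample values are hypothetical and were made up for this example.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up ages and yearly charges, not taken from insurance.csv
sample_ages = [19, 25, 33, 40, 47, 52, 64]
sample_charges = [2100.0, 3400.0, 4800.0, 7200.0, 8900.0, 10500.0, 14000.0]

r = pearson_correlation(sample_ages, sample_charges)
print("Correlation between age and charges:", round(r, 3))
```

A value close to 1 would suggest that charges tend to rise with age, although correlation alone does not show that age causes the higher charges.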

All patient data is now neatly organized in a dictionary. This is convenient if the attributes in **insurance.csv** are investigated further.