# U.S. Medical Insurance Costs
In this project, a CSV file with medical insurance costs is investigated using Python fundamentals. The goal is to analyze the attributes in **insurance.csv** to learn more about the patient information in the file and to gain insight into potential use cases for the dataset.

The only libraries needed to work with the **insurance.csv** data are the csv library and pandas.

In [58]:
# import the csv library and pandas
import csv
import pandas as pd

The next step is to look through **insurance.csv** in order to get acquainted with the data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file:

* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)
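
These three checks can also be done without printing the whole file. The sketch below is only an illustration: it reads a three-row in-memory sample laid out like **insurance.csv**, and the names `sample_csv`, `looks_numeric`, `missing_counts` and `column_kinds` are assumptions made for this example, not part of the project.

```python
import csv
import io

# Illustrative in-memory sample with the same layout as insurance.csv.
# To check the real file, replace io.StringIO(sample_csv) with
# open('insurance.csv', newline='').
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""


def looks_numeric(value):
    """Return True if the string can be parsed as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)
column_names = reader.fieldnames

# Count empty cells per column to spot missing data.
missing_counts = {
    name: sum(1 for row in rows if not row[name]) for name in column_names
}

# Label a column numerical only if every value parses as a number.
column_kinds = {
    name: 'numerical' if all(looks_numeric(row[name]) for row in rows) else 'categorical'
    for name in column_names
}

print(column_names)
print(missing_counts)
print(column_kinds)
```

An empty cell fails the numeric check, so a numerical column with gaps would be labelled categorical here; the missing-data counts should be read first.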

In [75]:
with open('insurance.csv') as insurance_dataset_file:
    print(insurance_dataset_file.read())

age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
31,female,25.74,0,no,southeast,3756.6216
46,female,33.44,1,no,southeast,8240.5896
37,female,27.74,3,no,northwest,7281.5056
37,male,29.83,2,no,northeast,6406.4107
60,female,25.84,0,no,northwest,28923.13692
25,male,26.22,0,no,northeast,2721.3208
62,female,26.29,0,yes,southeast,27808.7251
23,male,34.4,0,no,southwest,1826.843
56,female,39.82,0,no,southeast,11090.7178
27,male,42.13,0,yes,southeast,39611.7577
19,male,24.6,1,no,southwest,1837.237
52,female,30.78,1,no,northeast,10797.3362
23,male,23.845,0,no,northeast,2395.17155
56,male,40.3,0,no,southwest,10602.385
30,male,35.3,0,yes,southwest,36837.467
60,female,36.005,0,no,northeast,13228.84695
30,female,32.4,1,no,southwest,4149.736
18,male,34.1,0,no,southeast,1137.011
34,female,31.92,1,yes,northeast,37701

In [74]:
insurance_dataset = pd.read_csv('insurance.csv')
insurance_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


Thanks to the previous lines of code, we discover that **insurance.csv** contains the following columns:
* Patient Age
* Patient Sex
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data: every column has 1338 non-null values. To store this information, seven empty lists will be created, one for each column of **insurance.csv**.
Then, a for loop will append each row's values to the corresponding lists.

In [76]:
# Creating 7 lists, one for each column of values:
ages = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

# Saving the information in their corresponding list:
with open('insurance.csv', newline='') as insurance_dataset:
    dict_reader = csv.DictReader(insurance_dataset)
    for row in dict_reader:
        ages.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])
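
`csv.DictReader` returns every field as a string, so the lists above hold text such as `'19'` rather than numbers. If numeric lists are preferred from the start, one possible variation converts each field as it is read. This is a sketch on an in-memory sample; `sample_csv`, `converters` and `columns` are hypothetical names introduced for illustration.

```python
import csv
import io

# Illustrative two-row sample with the same layout as insurance.csv.
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
"""

# Hypothetical mapping from column name to the type its values should have.
converters = {
    'age': int, 'sex': str, 'bmi': float, 'children': int,
    'smoker': str, 'region': str, 'charges': float,
}

# One list per column, filled with converted values.
columns = {name: [] for name in converters}
for row in csv.DictReader(io.StringIO(sample_csv)):
    for name, convert in converters.items():
        columns[name].append(convert(row[name]))

print(columns['age'])
print(columns['charges'])
```

With this layout, later calculations would not need to call `float()` on each value.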
    

# Analysis

Now that all the information is organized in its corresponding lists, we can start analyzing the data. The analysis is split into the following goals:
* The average charge across all recorded insurance policies.
* The average age of all recorded patients.
* The count of male and female patients.
* The average charges for smokers and non-smokers.
* The distribution of insurance policies across regions, expressed as percentages.
* A dictionary with all patient information.

In [77]:
# Calculating the average charge of all insurances:
total_charges = 0
for i in charges:
    total_charges += float(i)
    
average_charge = total_charges/len(charges)
print('The average charge of all insurances recorded is $' + str(round((average_charge), 2)))

The average charge of all insurances recorded is $13270.42


In [78]:
# Calculating the average age of all patients recorded:
total_ages = 0
for i in ages:
    total_ages += float(i)
    
average_age = total_ages/len(ages)

print('The average age of the patients in the dataset is ' + str(int(average_age)) +'.')

The average age of the patients in the dataset is 39.


The average age of the patients in **insurance.csv** is about 39 years. This is important to check in order to judge whether the data in **insurance.csv** is representative of a broader population.

If the dataset is used to make inferences about other populations, the data must be abundant and broad enough for such use cases.
Further analysis would be needed to confirm that the range and standard deviation of patient ages in **insurance.csv** are consistent with a random sampling of individuals.
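
As a starting point for that further analysis, the range and standard deviation can be computed with the standard `statistics` module. The sketch below is illustrative only: `sample_ages` holds the ages of the first ten records shown earlier, not the full dataset, so its results say nothing about the file as a whole.

```python
import statistics

# Illustrative sample: ages of the first ten records in insurance.csv.
sample_ages = [19, 18, 28, 33, 32, 31, 46, 37, 37, 60]

age_min = min(sample_ages)
age_max = max(sample_ages)
age_mean = statistics.mean(sample_ages)
age_stdev = statistics.stdev(sample_ages)  # sample standard deviation

print(f'Range: {age_min} to {age_max}')
print(f'Mean: {age_mean:.1f}, standard deviation: {age_stdev:.1f}')
```

To apply this to the whole file, the `ages` list would first need converting to numbers, for example with `[int(a) for a in ages]`.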

In [79]:
# Counting the number of males and females in insurance.csv

# initialize number of males and females to zero
count_females = 0
count_males = 0
    
for gender in sex:
    if gender == 'female':
        count_females += 1
    elif gender == 'male':
        count_males += 1

print('Female patients quantity:', count_females)
print('Male patients quantity:', count_males)

Female patients quantity: 662
Male patients quantity: 676


The cell above checks the balance of males vs. females in **insurance.csv**. Female and male patients are almost evenly split (662 vs. 676). As with age, it is important to check that the dataset is representative of a broader population of individuals.
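
The same tally can be expressed with `collections.Counter`, which also makes it easy to report each group's share. This is a sketch on a ten-value sample taken from the first records of the file; `sample_sex` and `shares` are names assumed for this example.

```python
from collections import Counter

# Illustrative sample: sex of the first ten records in insurance.csv.
sample_sex = ['female', 'male', 'male', 'male', 'male',
              'female', 'female', 'female', 'male', 'female']

counts = Counter(sample_sex)

# Share of each label as a percentage of all records in the sample.
shares = {label: count / len(sample_sex) * 100 for label, count in counts.items()}

for label, count in counts.items():
    print(f'{label}: {count} patients ({shares[label]:.1f}%)')
```

Unlike the if/elif version, a `Counter` also records any unexpected label, which helps reveal misspelled or missing categories.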

In [80]:
# Calculating the average charges for smokers and non-smokers:
total_smoker_charges = 0
num_smokers = 0

total_no_smoker_charges = 0
num_no_smokers = 0

with open('insurance.csv', newline='') as insurance_dataset:
    dict_reader = csv.DictReader(insurance_dataset)
    for row in dict_reader:
        if row['smoker'] == 'yes':
            total_smoker_charges += float(row['charges'])
            num_smokers += 1
        else:
            # Count a patient as a non-smoker only in this branch
            total_no_smoker_charges += float(row['charges'])
            num_no_smokers += 1

average_smoker_charges = total_smoker_charges / num_smokers
print(f'Average charges for smokers: {average_smoker_charges:.2f}')

average_no_smoker_charges = total_no_smoker_charges / num_no_smokers
print(f'Average charges for non smokers: {average_no_smoker_charges:.2f}')

Average charges for smokers: 32050.23
Average charges for non smokers: 8434.27


As the output above shows, the average charge for smokers is far higher than for non-smokers. This suggests that smoking status is strongly associated with higher insurance charges.
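
One way to keep each group's total and count tied together is to key both by the smoker value. The sketch below uses the smoker status and charges of the first five records as an illustrative sample; `sample_records`, `totals`, `counts` and `average_by_status` are names assumed for this example.

```python
from collections import defaultdict

# Illustrative sample: (smoker, charges) for the first five records in insurance.csv.
sample_records = [
    ('yes', 16884.924),
    ('no', 1725.5523),
    ('no', 4449.462),
    ('no', 21984.47061),
    ('no', 3866.8552),
]

totals = defaultdict(float)  # sum of charges per smoker status
counts = defaultdict(int)    # number of records per smoker status

for smoker_status, charge in sample_records:
    totals[smoker_status] += charge
    counts[smoker_status] += 1

# Average charge for each smoker status seen in the sample.
average_by_status = {status: totals[status] / counts[status] for status in totals}

for status, average in average_by_status.items():
    print(f'smoker={status}: average charges {average:.2f}')
```

Because the total and the count are updated on adjacent lines under the same key, it is harder for the two to drift apart.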

In [81]:
# Count and percentage of records for each region:
total_count_region = len(region)

for region_name in ['southwest', 'northwest', 'southeast', 'northeast']:
    region_count = region.count(region_name)
    region_percentage = round((region_count / total_count_region) * 100, 2)
    print(f'{region_name}: {region_count} records, {region_percentage}% of the total.')


southwest: 325 records, 24.29% of the total.
northwest: 325 records, 24.29% of the total.
southeast: 364 records, 27.2% of the total.
northeast: 324 records, 24.22% of the total.


In [82]:
# Combining all lists into one dictionary. Each key is the record's 0-based row index, and each value holds that patient's information:

dictionary_dataset = {}
for i in range(len(ages)):
    dictionary_dataset[i] = {'Age':ages[i], 'Sex':sex[i], 'BMI':bmi[i], 'Children':children[i], 'Smoker':smoker[i], 'Region':region[i], 'Charges':charges[i]}
print(dictionary_dataset)

{0: {'Age': '19', 'Sex': 'female', 'BMI': '27.9', 'Children': '0', 'Smoker': 'yes', 'Region': 'southwest', 'Charges': '16884.924'}, 1: {'Age': '18', 'Sex': 'male', 'BMI': '33.77', 'Children': '1', 'Smoker': 'no', 'Region': 'southeast', 'Charges': '1725.5523'}, 2: {'Age': '28', 'Sex': 'male', 'BMI': '33', 'Children': '3', 'Smoker': 'no', 'Region': 'southeast', 'Charges': '4449.462'}, 3: {'Age': '33', 'Sex': 'male', 'BMI': '22.705', 'Children': '0', 'Smoker': 'no', 'Region': 'northwest', 'Charges': '21984.47061'}, 4: {'Age': '32', 'Sex': 'male', 'BMI': '28.88', 'Children': '0', 'Smoker': 'no', 'Region': 'northwest', 'Charges': '3866.8552'}, 5: {'Age': '31', 'Sex': 'female', 'BMI': '25.74', 'Children': '0', 'Smoker': 'no', 'Region': 'southeast', 'Charges': '3756.6216'}, 6: {'Age': '46', 'Sex': 'female', 'BMI': '33.44', 'Children': '1', 'Smoker': 'no', 'Region': 'southeast', 'Charges': '8240.5896'}, 7: {'Age': '37', 'Sex': 'female', 'BMI': '27.74', 'Children': '3', 'Smoker': 'no', 'Region'

All patient data is now organized in a single dictionary keyed by record index. Every value is still stored as a string, so numeric fields must be converted before any calculation. This structure is convenient for any further investigation of the attributes in **insurance.csv**.
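
As an example of such a follow-up, the sketch below filters a dictionary of this shape by one attribute and averages the charges of the matching records. It runs on a three-record sample; `sample_dataset` and the helper `average_charge_where` are hypothetical names introduced for illustration.

```python
# Illustrative three-record sample in the same shape as dictionary_dataset.
sample_dataset = {
    0: {'Age': '19', 'Sex': 'female', 'BMI': '27.9', 'Children': '0',
        'Smoker': 'yes', 'Region': 'southwest', 'Charges': '16884.924'},
    1: {'Age': '18', 'Sex': 'male', 'BMI': '33.77', 'Children': '1',
        'Smoker': 'no', 'Region': 'southeast', 'Charges': '1725.5523'},
    2: {'Age': '28', 'Sex': 'male', 'BMI': '33', 'Children': '3',
        'Smoker': 'no', 'Region': 'southeast', 'Charges': '4449.462'},
}


def average_charge_where(dataset, key, value):
    """Average the charges of records whose `key` field equals `value`.

    Returns None when no record matches.
    """
    matching = [
        float(record['Charges'])  # charges are stored as strings
        for record in dataset.values()
        if record[key] == value
    ]
    if not matching:
        return None
    return sum(matching) / len(matching)


southeast_average = average_charge_where(sample_dataset, 'Region', 'southeast')
print(f'Average charges in the southeast sample: {southeast_average:.2f}')
```

The same helper could be called with `dictionary_dataset` and any categorical column, such as `'Smoker'` or `'Sex'`.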