# U.S. Medical Insurance Costs

## Project Goals
<br>
- Looking at the dataset to find out:<br>
&emsp;&emsp;+ Age: Mean, Median, Mode <br>
&emsp;&emsp;+ Sex: ratio of female to male <br>
&emsp;&emsp;+ Smoker: % of smokers<br>
&emsp;&emsp;+ BMI: Mean, Median, Mode<br>
&emsp;&emsp;+ Children: Mean, Median, Mode, % of people who have children <br>
&emsp;&emsp;+ Region: Most frequently occurring region<br> 
&emsp;&emsp;+ Charges: Mean, Median, Mode<br>
<br>
<br>
- Further analysis: <br>
&emsp;&emsp;+ Average age for people with at least 1 child<br>
&emsp;&emsp;+ Average Insurance Cost for Smoker vs. Non-Smoker <br>
&emsp;&emsp;+ Average Insurance Cost for each No. of children Group <br>
       

## 1- Load the CSV file and create the dataset 

- Create a list for each column of data in the CSV file
- Double-check the lists after running the function

### 1.1- Import the CSV file and create the data lists:

In [44]:
import csv

In [45]:
#Create all data lists from the insurance data
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

In [46]:
# # Method 1: Create function to load data, 
# # at the same time converting the string value in csv file to either float or integer value. 

# def load_csv_to_data_list(csv_file,data_list,data_name):
#     with open(csv_file) as csv_data:
#         csv_read=csv.DictReader(csv_data)
#         for row in csv_read:
#             try:
#                 float(row[data_name])
#                 if row[data_name].isdigit() and int(row[data_name])==0:
#                     data_list.append(int(row[data_name]))
#                 if row[data_name].isdigit() and float(row[data_name]):
#                     data_list.append(int(row[data_name]))
#                 if row[data_name].isdigit() == False and type(float(row[data_name]))==float:
#                     data_list.append(float(row[data_name]))
#             except ValueError:
#                 data_list.append(row[data_name])
                

In [47]:
### Method 2: Much simpler ^^ 

def load_csv_to_data_list(csv_file, data_list, data_name):
    with open(csv_file) as csv_data:
        csv_reader = csv.DictReader(csv_data)
        for row in csv_reader:
            try:
                value = float(row[data_name])
                if value.is_integer():
                    data_list.append(int(value))
                else:
                    data_list.append(value)
            except ValueError:
                data_list.append(row[data_name])

In [48]:
# Start loading data

load_csv_to_data_list('insurance.csv',age,'age')
load_csv_to_data_list('insurance.csv',sex,'sex')
load_csv_to_data_list('insurance.csv',bmi,'bmi')
load_csv_to_data_list('insurance.csv',smoker,'smoker')
load_csv_to_data_list('insurance.csv',children,'children')
load_csv_to_data_list('insurance.csv',region,'region')
load_csv_to_data_list('insurance.csv',charges,'charges')
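The seven calls above each reopen and rescan `insurance.csv`. As a possible alternative, the sketch below fills every column in a single pass. The helper names `parse_value` and `load_columns` are hypothetical and not part of the original notebook. The function reads from any file-like object, so a two-row in-memory sample stands in for the real file here; its values come from the preview in 1.3, and its column order is an assumption.

```python
import csv
import io


def parse_value(text):
    """Convert a CSV cell to int or float when possible, else keep the string."""
    try:
        value = float(text)
    except ValueError:
        return text
    return int(value) if value.is_integer() else value


def load_columns(csv_file):
    """Read a CSV file object once and return {column name: list of values}."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, text in row.items():
            columns[name].append(parse_value(text))
    return columns


# Hypothetical two-row sample in the assumed layout of insurance.csv
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
columns = load_columns(sample)
print(columns["age"])
print(columns["bmi"])
```

With the real file, the same function could be called as `load_columns(open_file)` inside a `with open('insurance.csv', newline='') as open_file:` block.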


### 1.2- Create a library from the lists in case the data in the lists is modified:

In [49]:
def create_library(age,sex,bmi,smoker,children,region,charges):
    Insurance_library = dict()

    Insurance_library['age']=age
    Insurance_library['sex']=sex
    Insurance_library['bmi']=bmi
    Insurance_library['smoker']=smoker
    Insurance_library['children']=children
    Insurance_library['region']=region
    Insurance_library['charges']=charges
    
    return Insurance_library

Insurance_library = create_library(age,sex,bmi,smoker,children,region,charges)
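One caveat: the dictionary stores references to the same list objects, so changing `age` later also changes `Insurance_library['age']`. If an independent snapshot is the goal, one option is to copy each list when building the dictionary. A minimal sketch with hypothetical sample values (the names `snapshot` and `shared` are illustrative only):

```python
# Hypothetical sample lists standing in for the loaded columns
age = [19, 18, 28]
smoker = ["yes", "no", "no"]

# list(...) makes a shallow copy, so the snapshot does not share the originals
snapshot = {"age": list(age), "smoker": list(smoker)}
shared = {"age": age, "smoker": smoker}

age.append(99)  # modify the original list after both dictionaries are built

print(shared["age"])    # the shared reference sees the change
print(snapshot["age"])  # the copy does not
```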

### 1.3 - Preview some data

In [50]:
print(age[0:5])
print(sex[0:5])
print(bmi[0:5])
print(smoker[0:5])
print(children[0:5])
print(region[0:5])
print(charges[0:5])

[19, 18, 28, 33, 32]
['female', 'male', 'male', 'male', 'male']
[27.9, 33.77, 33, 22.705, 28.88]
['yes', 'no', 'no', 'no', 'no']
[0, 1, 3, 0, 0]
['southwest', 'southeast', 'southeast', 'northwest', 'northwest']
[16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]


## 2- Analyze data 

### 2.1- Function: 
Write functions to find the Mean, Median, and Mode of a data list

In [51]:
def find_mean(data_list):
    mean=round(sum(data_list)/len(data_list),2)
    return mean

def find_median(data_list):
    sorted_data=sorted(data_list)
    if len(sorted_data)%2==0:
        median1=sorted_data[len(sorted_data)//2]
        median2=sorted_data[len(sorted_data)//2-1]
        median = (median1+median2)/2
    else:
        median=sorted_data[len(sorted_data)//2]
    return median

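def find_mode(data_list):
    from statistics import multimode
    data_mode=multimode(data_list)
    # When every value occurs exactly once, multimode returns the whole list,
    # so there is no meaningful mode; return an empty list instead of failing
    if len(data_mode)==len(data_list):
        print("This data list does not have a mode")
        return []
    return data_mode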

def find_mode_percentage(data_list):
    sum_mode=0
    for a in find_mode(data_list):
        sum_mode+=data_list.count(a)
    mode_percent=round(sum_mode/len(data_list)*100,2)
    return mode_percent
        

def find_mean_median_mode(data_list, list_name):
    if isinstance(data_list[0], str):
        print(list_name +" has Mode: "+str(find_mode(data_list))+ " which account to "+ str(find_mode_percentage(data_list))+"% .")
    else:
        print(list_name + " has Mean: "+str(find_mean(data_list))+" , Median: "+ str(find_median(data_list))+" , and Mode: "+str(find_mode(data_list)) + " which account to "+ str(find_mode_percentage(data_list))+"% .")
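As a sanity check, the standard library's `statistics` module provides `mean`, `median`, and `multimode`, which should agree with the hand-written helpers. A small sketch, using the first five ages from the preview in 1.3 as sample data:

```python
from statistics import mean, median, multimode

sample_ages = [19, 18, 28, 33, 32]

print(round(mean(sample_ages), 2))
print(median(sample_ages))
# Every age here is unique, so multimode returns all five values.
# That is the "no mode" situation that find_mode checks for.
print(multimode(sample_ages))
```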

### 2.2- Analyze the dataset
Use the functions created in 2.1 to analyze the dataset.
### 2.2.1- Look at every data list and find its mean, median, and mode

In [52]:
print("Data set of "+str(len(age))+" people shows:")

find_mean_median_mode(age, 'age')
find_mean_median_mode(sex, 'sex')
find_mean_median_mode(bmi, 'bmi')
find_mean_median_mode(smoker, 'smoker')
find_mean_median_mode(children, 'children')
find_mean_median_mode(region, 'region')
find_mean_median_mode(charges, 'charges')


Data set of 1338 people shows:
age has Mean: 39.21 , Median: 39.0 , and Mode: [18] which account to 5.16% .
sex has Mode: ['male'] which account to 50.52% .
bmi has Mean: 30.66 , Median: 30.4 , and Mode: [32.3] which account to 0.97% .
smoker has Mode: ['no'] which account to 79.52% .
children has Mean: 1.09 , Median: 1.0 , and Mode: [0] which account to 42.9% .
region has Mode: ['southeast'] which account to 27.2% .
charges has Mean: 13270.42 , Median: 9382.033 , and Mode: [1639.5631] which account to 0.15% .


<font color='orange'>==>> Some notes on the data: <br>
<br>
The US medical insurance cost dataset contains records for 1338 people. Each record gives the person's age, sex, BMI, smoking status, number of children, region, and insurance charges. 
<br>
    <br>
The Mode calculation shows that female and male participants are almost evenly split (50.52% are male). Meanwhile, almost 80% of the insured people are non-smokers, and nearly 43% do not have any children. Approximately 27% (about 364 people) are from the Southeast region, the most common region. 
<br> 
    <br>
For Age, BMI, and Children, the Mean and Median are very close to each other (39.21 & 39, 30.66 & 30.4, and 1.09 & 1.0 respectively), which suggests a roughly symmetrical distribution, although a matching mean and median alone do not prove symmetry. 
<br> 
    <br>
On the other hand, the Mean insurance charge is about \$13.3k, the Median is about \$9.4k, and the Mode is only about \$1.6k (a value shared by just 0.15% of records), signalling a right-skewed distribution. 
<br> 
<br>
Together, these figures give an overview of how the data are distributed in the US medical insurance cost dataset.
<br> 
<br>
</font>
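To put a number on the skew, one rough indicator is Pearson's second skewness coefficient, 3 * (mean - median) / standard deviation. Values near 0 suggest symmetry and positive values suggest a right skew. The helper name `pearson_skew` is hypothetical, and the sample uses only the five charges from the preview in 1.3, so the result illustrates the calculation rather than the full dataset.

```python
from statistics import mean, median, pstdev


def pearson_skew(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std dev."""
    return 3 * (mean(values) - median(values)) / pstdev(values)


# First five charges from the preview (not the full dataset)
sample_charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]
print(round(pearson_skew(sample_charges), 2))
```

Running the same function on the full `charges` and `age` lists would show whether charges are noticeably more skewed than age.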



### 2.2.2- Average age of people with (at least) a given number of children: 

#### a) Average age of people with exactly a given number of children:

In [53]:
# Count the people who have exactly children_number children and find their average age
def find_average_age(children_number):
    sum_age_children=0
    count_age_children=0
    for a in range(len(age)):
        if Insurance_library['children'][a]==children_number:
            sum_age_children+=Insurance_library['age'][a]
            count_age_children+=1
    average_age_children=round(sum_age_children/count_age_children,2)
    return count_age_children, average_age_children

In [54]:
# Testing the function:
children_number=range(0,max(children)+1)
for a in children_number:
    # Call the function once per group and unpack both results
    count, average = find_average_age(a)
    print("Number of people with "+ str(a) + " children is " + str(count) + " and the average age is "+str(average))

Number of people with 0 children is 574 and the average age is 38.44
Number of people with 1 children is 324 and the average age is 39.45
Number of people with 2 children is 240 and the average age is 39.45
Number of people with 3 children is 157 and the average age is 41.57
Number of people with 4 children is 25 and the average age is 39.0
Number of people with 5 children is 18 and the average age is 35.61


#### b) Average age of people with at least 1 child:

In [55]:
# Writing the function
def find_average_age_at_least_child (children_number_at_least):
    sum_age_children=0
    count_age_children=0
    for a in range(len(age)):
        if Insurance_library['children'][a]>=children_number_at_least:
            sum_age_children+=Insurance_library['age'][a]
            count_age_children+=1
        else:
            pass
    average_age_children=round(sum_age_children/count_age_children,2)
    return count_age_children, average_age_children

In [56]:
# Testing
for a in children_number:
    # Call the function once per threshold and unpack both results
    count, average = find_average_age_at_least_child(a)
    print("Number of people with at least "+ str(a) + " children is " + str(count) + " and the average age is "+str(average))

Number of people with at least 0 children is 1338 and the average age is 39.21
Number of people with at least 1 children is 764 and the average age is 39.78
Number of people with at least 2 children is 440 and the average age is 40.02
Number of people with at least 3 children is 200 and the average age is 40.71
Number of people with at least 4 children is 43 and the average age is 37.58
Number of people with at least 5 children is 18 and the average age is 35.61
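Both age functions divide by the number of matching people, so a threshold that nobody meets would raise `ZeroDivisionError`. One way to guard against that is sketched below. The name `average_age_with_at_least` is hypothetical, and returning `None` for an empty group is an assumed convention. The sample data are the first five rows from the preview in 1.3.

```python
def average_age_with_at_least(ages, children, minimum):
    """Return (count, average age) for people with at least `minimum` children.

    The average is None when nobody matches, instead of dividing by zero.
    """
    matching = [a for a, c in zip(ages, children) if c >= minimum]
    if not matching:
        return 0, None
    return len(matching), round(sum(matching) / len(matching), 2)


sample_ages = [19, 18, 28, 33, 32]
sample_children = [0, 1, 3, 0, 0]
print(average_age_with_at_least(sample_ages, sample_children, 1))
print(average_age_with_at_least(sample_ages, sample_children, 4))
```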


### 2.2.3- Relationship between smoking and insurance cost :

We compare the `smoker` and `charges` lists by computing the average insurance cost for the smoker and non-smoker groups. 

In [57]:
# Create function to calculate the average insurance cost for one smoker group ('yes' or 'no')
def calculate_average_insurance_cost(smoker,yesorno):
    sum_ic = 0
    count=0
    for a in range(len(smoker)):
        if smoker[a]== yesorno:
            sum_ic+=charges[a]
            count+=1
    average_ic=sum_ic/count
    return average_ic

smoker_average=calculate_average_insurance_cost(smoker,'yes')
non_smoker_average=calculate_average_insurance_cost(smoker,'no')

#Print out result
print("The average Insurance cost for Smoker is "+"{:,}".format(round(smoker_average,2))+" USD.")
print("The average Insurance cost for Non-Smoker is "+"{:,}".format(round(non_smoker_average,2))+" USD.")
            

The average Insurance cost for Smoker is 32,050.23 USD.
The average Insurance cost for Non-Smoker is 8,434.27 USD.


<font color='orange'> From the calculation, we can see that the average insurance charge for smokers is about 3.8 times that for non-smokers (32,050.23 USD vs. 8,434.27 USD). </font>
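The size of the gap can be computed directly from the two averages printed above. A short worked example (the variable names are illustrative):

```python
smoker_avg = 32050.23      # average charge for smokers, from the printed output
non_smoker_avg = 8434.27   # average charge for non-smokers, from the printed output

ratio = smoker_avg / non_smoker_avg
print(f"Smokers are charged about {ratio:.1f} times as much on average.")
```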

### 2.2.4- Average Insurance cost for each Children group

In [40]:
#Create function to calculate average insurance cost for each children group
def calculate_ic_children(children,no_children):
    sum_ic=0
    count_children=0
    for i in range(len(children)):
        if children[i]==no_children:
            sum_ic+=charges[i]
            count_children+=1
        else:
            pass
    average_ic=sum_ic/count_children
    return average_ic

#calculate average insurance cost for each children group:
for a in children_number: 
    print("The average Insurance Cost for family with "+str(a)+" children is "+str("{:,}".format(round(calculate_ic_children(children,a),2)))+" USD.")
    


            

The average Insurance Cost for family with 0 children is 12,365.98 USD.
The average Insurance Cost for family with 1 children is 12,731.17 USD.
The average Insurance Cost for family with 2 children is 15,073.56 USD.
The average Insurance Cost for family with 3 children is 15,355.32 USD.
The average Insurance Cost for family with 4 children is 13,850.66 USD.
The average Insurance Cost for family with 5 children is 8,786.04 USD.


<font color='orange'> 
From the result above, average insurance charges do not rise or fall in step with the number of children a person has. The lowest average insurance cost belongs to the group with 5 children, while the highest belongs to the group with 3 children. Note that the 5-children group contains only 18 people, so its average is less reliable.
</font>