# U.S. Medical Insurance Costs

The purpose of this project is to analyze medical insurance costs in the U.S. and examine the attributes recorded in the given dataset. I am particularly interested in the factors that explain differences in insurance cost between individuals, for instance smoking status, number of children, and age. I will define my own Python functions to extract and summarize individual columns, as a way of practicing the language.

In [13]:
import pandas as pd
import numpy as np

Data analysis usually ends with data visualization. In this project, however, I only perform data exploration with the pandas and NumPy libraries. I may add visualizations later, once I have learned more about the topic; in the meantime, I am still learning seaborn and plotly.

In [14]:
data = pd.read_csv('insurance.csv')

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


The columns in the dataset are:
* Patient Age
* Patient Sex
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient Region of Residence
* Patient Insurance Cost

There are no missing values in the table. The dataset has 1338 rows (not counting the header) and 7 columns.
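
The row count and the missing-value claim can be checked directly. The following is a minimal sketch, assuming a stand-in frame named `sample` (my own name, not part of the notebook) built from the first five rows of the dataset; on the full dataset the same calls would be made on `data`.

```python
import pandas as pd

# Stand-in for `data`: the first five rows of insurance.csv (illustrative only).
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# shape counts data rows only; the CSV header line is not a row.
n_rows, n_cols = sample.shape

# isna() marks missing cells; summing twice gives the total across all columns.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(f"{n_rows} rows, {n_cols} columns, {total_missing} missing values")
```

Because `shape` never counts the header, the 1338 entries reported by `.info()` are all data rows.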

In [16]:
print(data.head())

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


Looking at the first few rows gives a sense of what the data contains and confirms the column names.

In [17]:
print(data.describe())

               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010


From the `.describe()` output, we can note a few points:
* Insured individuals range from 18 to 64 years old.
* The mean age is about 39, which matches the median (39). This suggests that ages are spread fairly evenly on either side of 39.
* Without meaning any disrespect to any group, the mean BMI is about 30.7, which is above the commonly cited healthy range of 18.5 to 24.9.
* The mean insurance cost per individual is about 13,270 US dollars. Further analysis can show which customer attributes contribute most to differences in insurance cost.
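
One way to read these summary statistics is to compare each column's mean with its median. The sketch below uses only the values printed by `.describe()`; the helper `relative_gap` is a hypothetical name introduced for illustration. A mean well above the median is a common sign of right skew, though it is not proof on its own.

```python
# Values copied from the `.describe()` output above.
age_mean, age_median = 39.207025, 39.0
charges_mean, charges_median = 13270.422265, 9382.033


def relative_gap(mean, median):
    """Return (mean - median) / median for one column."""
    return (mean - median) / median


age_gap = relative_gap(age_mean, age_median)
charges_gap = relative_gap(charges_mean, charges_median)

print(f"age:     mean exceeds median by {age_gap:.1%}")
print(f"charges: mean exceeds median by {charges_gap:.1%}")
```

The gap is tiny for `age` but large for `charges`, so the mean cost is probably pulled upward by a smaller number of expensive policies.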

# Further Analysis

## Age

Looking at the summary statistics above, I am intrigued by the minimum age in the dataset. In Indonesia, where I am from, most young people (21 years old or younger) do not have their own insurance policy. This interests me because I think it has become the norm in developed countries to hold an insurance policy from an early age. I therefore want to know how many people aged 21 or younger have insurance, which I count by defining a function.

In [18]:
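def insurance_age_count(high_ages, low_ages=0, age_data=data['age']):
    """Count ages within the inclusive range [low_ages, high_ages]."""
    count = 0
    for age in age_data:
        # Chained comparison; both bounds are inclusive.
        if low_ages <= age <= high_ages:
            count += 1
    return count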

In [19]:
insurance_age_count(21)

194

The function shows that almost 200 people aged 21 or younger hold insurance, which surprised me, even though they account for only 14.5% of the dataset. This number may indicate that many young adults already recognize how important it is to have insurance, although the dataset does not record why they are insured.
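
The same count can be obtained without an explicit loop. This is a minimal sketch, assuming a hypothetical helper name `insurance_age_count_vectorized` and a stand-in series `sample_ages` holding the ages from the first five rows; it relies on `Series.between`, which includes both bounds by default.

```python
import pandas as pd


def insurance_age_count_vectorized(age_data, high_age, low_age=0):
    """Count ages in the inclusive range [low_age, high_age]."""
    # between() returns a boolean Series; summing it counts the True values.
    return int(age_data.between(low_age, high_age).sum())


# Stand-in ages taken from the first five rows of the dataset.
sample_ages = pd.Series([19, 18, 28, 33, 32])
young_count = insurance_age_count_vectorized(sample_ages, 21)
print(young_count)

# Share of the full dataset, using the figures reported above (194 of 1338).
share = 194 / 1338
print(f"{share:.1%}")
```

On the full dataset the call would be `insurance_age_count_vectorized(data['age'], 21)`.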

## Gender

In this section, I count the records for each of the two genders to see whether one gender is more likely to hold insurance.

In [20]:
def gender_count(gender_data = data['sex']):
    female = 0
    male = 0
    for gender in gender_data:
        if gender == 'female':
            female += 1
        elif gender == 'male':
            male += 1
    print(f"Count for female: {female}")
    print(f"Count for male: {male}")

In [22]:
gender_count()

Count for female: 662
Count for male: 676
50.56095736724009


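The output shows that men form a slight majority of policyholders: 676 of 1338 records, or about 50.5%. (The extra value printed after the counts, 50.56, corresponds to 676/1337; dividing by all 1338 rows gives 50.52.) The gap between the two genders is only 14 records, so neither gender shows a strong tendency to hold insurance.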
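
Counts and percentages per category can also come from `Series.value_counts`. The sketch below rebuilds a stand-in `sex` column from the two counts reported above rather than reading the CSV, so it is an illustration of the call, not a rerun of the notebook.

```python
import pandas as pd

# Stand-in column rebuilt from the reported counts (662 female, 676 male).
sex = pd.Series(["female"] * 662 + ["male"] * 676, name="sex")

counts = sex.value_counts()                # absolute count per category
shares = sex.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print((shares * 100).round(1))
```

With `normalize=True` the denominator is always the full number of rows, which avoids dividing by the wrong total by hand.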

## Smoking Status

In this section, I split the dataset into smokers and non-smokers. First, I want to know how many people smoke and how many do not, by defining a `smokers_count()` function.

In [49]:
def smokers_count(smoke_data=data['smoker']):
    smokers = 0
    non_smokers = 0
    for smoker in smoke_data:
        if smoker=='yes':
            smokers += 1
        else:
            non_smokers += 1
    print(f"The total smokers: {smokers}")
    print(f"The total non-smokers: {non_smokers}")

In [50]:
smokers_count()

The total smokers: 274
The total non-smokers: 1064


In [30]:
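# numeric_only=True skips the text columns (sex, region), which recent pandas versions refuse to average.
data.groupby('smoker').mean(numeric_only=True)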

              age        bmi  children       charges
smoker                                              
no      39.385338  30.651795  1.090226   8434.268298
yes     38.514599  30.708449  1.113139  32050.231832


The counts show that most people in the dataset do not smoke. Next, I was curious about the average age of smokers and non-smokers, so I applied `groupby().mean()`. This returns the mean of each numeric attribute for the two groups. The mean age, BMI, and number of children differ very little between smokers and non-smokers.

However, the last attribute, the insurance cost (`charges`), differs substantially between the two groups. On average, smokers are charged about 3.8 times as much as non-smokers. Note that group means do not reveal outliers or skew within each group; for example, the dataset includes some individuals under 21 years old who are categorized as smokers.
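
The size of the gap can be worked out from the two group means in the table above. This is a small arithmetic sketch; the dictionary name `mean_charges` is my own and the values are copied from the output rather than recomputed.

```python
# Group means of `charges`, copied from the groupby output above.
mean_charges = {"no": 8434.268298, "yes": 32050.231832}

# Ratio and absolute difference between the smoker and non-smoker means.
ratio = mean_charges["yes"] / mean_charges["no"]
difference = mean_charges["yes"] - mean_charges["no"]

print(f"Smoker mean is about {ratio:.1f} times the non-smoker mean")
print(f"Absolute gap: about {difference:,.0f} US dollars")
```

On the full dataset, `data.groupby('smoker')['charges'].agg(['mean', 'median', 'count'])` would place the median next to the mean, which helps show whether a few very large charges are pulling a group mean up.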

# Conclusion

In conclusion, smokers are far fewer than non-smokers in this dataset (274 versus 1064), yet smoking status is strongly associated with the insurance cost per individual. Other attributes, such as age, gender, and BMI, may also affect the cost of insurance, but this analysis did not measure their effect.