# U.S. Medical Insurance Analysis

In this project, I'll be working with a dataset that includes demographic data and insurance costs for a variety of individuals. The goal of the project is to analyze a few factors in the data in order to guide further analysis.

### Focus of Project

We're going to look primarily at the regional data included in the dataset, to identify some differences across the various regions.

In [54]:
# First, import the CSV library
import csv

# Create some empty lists to work with
bmi_data = []
region_data = []
sex_data = []
age_data = []
cost_data = []

# Open the data set and use a list comprehension to read it into a list of dictionaries (one per row) that I can access
with open('insurance.csv') as insurance_csv:
    csv_data = csv.DictReader(insurance_csv)
    insurance_data = [row for row in csv_data]

# Define a function that can be used to create lists
def create_list(list_name, column_name):
    for row in insurance_data:
        list_name.append(row[column_name])
    return list_name

# Fill in the lists with the CSV data
create_list(bmi_data, 'bmi')
create_list(region_data, 'region')
create_list(sex_data, 'sex')
create_list(age_data, 'age')
create_list(cost_data, 'charges')

# Sum the BMI values, then divide by the number of entries to get the average BMI
total_bmi = 0
for item in bmi_data:
    total_bmi += float(item)
avg_bmi = total_bmi / len(bmi_data)
print("Average BMI of dataset: " + str(round(avg_bmi, 2)))

Average BMI of dataset: 30.66


<b>Note:</b> We aren't using all of the created lists in this analysis, but the lists could be accessed to conduct further analysis in the future.
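
One detail worth keeping in mind when reusing these lists: `csv.DictReader` returns every value as a string, so numeric columns need converting before any arithmetic. The sketch below is a minimal illustration, not part of the original analysis. It assumes a tiny in-memory sample in place of `insurance.csv` (the rows are invented) and a hypothetical helper named `column_as_floats`.

```python
import csv
import io
from statistics import mean

# Invented sample standing in for insurance.csv; the values are illustrative only
sample_csv = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,no,southwest,4000.0
45,male,31.5,2,yes,southeast,21000.0
52,female,28.5,1,no,northeast,9000.0
"""

def column_as_floats(rows, column_name):
    """Return one column from a list of row dictionaries, converted to floats."""
    return [float(row[column_name]) for row in rows]

# DictReader yields one dictionary per row, with every value stored as a string
rows = list(csv.DictReader(io.StringIO(sample_csv)))

bmi_values = column_as_floats(rows, "bmi")
avg_bmi = round(mean(bmi_values), 2)
print("Average BMI of sample: " + str(avg_bmi))
```

With the real data, `rows` would be replaced by `insurance_data`, and the same helper could convert the `age` or `charges` columns.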


### BMI by Region

Next, we'll calculate the average BMI for each region in the dataset. This can help us make inferences about regional health differences, and could inform further analysis into how accurate those conclusions are and which factors may contribute to the differences.

In [21]:
# A function that calculates the average BMI of a given region in the dataset
def regional_avg_bmi(region):
    total_bmi = 0
    count = 0
    for row in insurance_data:
        if row['region'] == region:
            total_bmi += float(row['bmi'])
            count += 1

    if count > 0:
        avg_bmi = total_bmi / count
        return round(avg_bmi, 2)
    else:
        return None

# Calculate and print the average BMIs for each region
for region in ['southwest', 'southeast', 'northwest', 'northeast']:
    print("The average BMI found in the " + region + " region was " + str(regional_avg_bmi(region)))
print()

The average BMI found in the southwest region was 30.6
The average BMI found in the southeast region was 33.36
The average BMI found in the northwest region was 29.2
The average BMI found in the northeast region was 29.17



<b>Conclusions:</b> According to the calculations, the Southeast region had the highest average BMI in our dataset. However, we should also compare other factors, such as the number of people sampled from each region. If the counts differ substantially, the regional averages aren't equally reliable, which could mislead us when we compare the regions.

In [27]:
# A function to count the data points in each region
def regional_count(region):
    count = 0
    for row in insurance_data:
        if row['region'] == region:
            count += 1
    return count

# Calculate and print the count for each region
for region in ['southwest', 'southeast', 'northwest', 'northeast']:
    print("The number of data points in the " + region + " region was " + str(regional_count(region)))
print()

The number of data points in the southwest region was 325
The number of data points in the southeast region was 364
The number of data points in the northwest region was 325
The number of data points in the northeast region was 324



As we can see above, the Southeast contributed more individuals (364) than the other regions (324 to 325 each). If those additional records included quite a few outliers, they could have skewed the region's average BMI.
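
One way to check whether outliers are pulling an average upward is to compare the mean with the median, since the median is much less sensitive to extreme values. The following is a rough sketch of that idea, assuming a handful of invented rows shaped like the entries in `insurance_data`. The function name `regional_bmi_summary` is hypothetical and not part of the original notebook.

```python
from collections import defaultdict
from statistics import mean, median

# Invented rows; the real analysis would pass insurance_data instead
sample_rows = [
    {"region": "southeast", "bmi": "30.0"},
    {"region": "southeast", "bmi": "32.0"},
    {"region": "southeast", "bmi": "52.0"},  # deliberately extreme value
    {"region": "northwest", "bmi": "28.0"},
    {"region": "northwest", "bmi": "30.0"},
]

def regional_bmi_summary(rows):
    """Group BMI values by region and return (count, mean, median) per region."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["region"]].append(float(row["bmi"]))
    return {
        region: (len(values), round(mean(values), 2), round(median(values), 2))
        for region, values in grouped.items()
    }

summary = regional_bmi_summary(sample_rows)
for region, (count, avg, mid) in summary.items():
    print(region, "count:", count, "mean:", avg, "median:", mid)
```

In this made-up sample, the single extreme value lifts the southeast mean well above its median, while the northwest mean and median agree. A large gap between the two in the real data would suggest that outliers deserve a closer look.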

### Smoking Habits by Region

In addition to the number of data points in each set, another factor that could correlate with BMI is smoking habits. Let's look at what percentage of people smoked in each region surveyed.


In [45]:
# A function that calculates the percentage of smokers in a given region
def regional_smokers(region):
    count = 0
    for row in insurance_data:
        if row['region'] == region:
            if row['smoker'] == 'yes':
                count += 1
    total = regional_count(region)
    if total == 0:
        return None  # Avoid dividing by zero for a region with no data points
    percent_smokers = (count / total) * 100
    return round(percent_smokers, 2)

# Calculate and print the percentage of smokers in each region
for region in ['southwest', 'southeast', 'northwest', 'northeast']:
    print("The percentage of smokers found in the " + region + " region was " + str(regional_smokers(region)) + "%.")
print()

The percentage of smokers found in the southwest region was 17.85%.
The percentage of smokers found in the southeast region was 25.0%.
The percentage of smokers found in the northwest region was 17.85%.
The percentage of smokers found in the northeast region was 20.68%.



<b>Conclusions:</b> As you can see above, the Southeast region included the highest percentage of smokers, with 1 in 4 people surveyed reporting that they smoke. It's possible this could correlate with the higher average BMI found in the region, but further analysis would be required to determine the statistical significance of the various factors. Interestingly, the region with the lowest average BMI (Northeast) actually included a slightly higher percentage of smokers than the Southwest and Northwest.
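
A simple first step toward that further analysis would be to compare the average BMI of smokers and non-smokers within a single region. The sketch below is illustrative only: the rows are invented, the function name `avg_bmi_by_smoking` is hypothetical, and a difference in means is not a significance test in itself.

```python
from statistics import mean

# Invented rows shaped like the entries in insurance_data
sample_rows = [
    {"region": "southeast", "smoker": "yes", "bmi": "34.0"},
    {"region": "southeast", "smoker": "no", "bmi": "31.0"},
    {"region": "southeast", "smoker": "no", "bmi": "33.0"},
    {"region": "northeast", "smoker": "yes", "bmi": "29.0"},
    {"region": "northeast", "smoker": "no", "bmi": "27.0"},
]

def avg_bmi_by_smoking(rows, region):
    """Return the mean BMI for smokers ('yes') and non-smokers ('no') in one region.

    A group with no rows maps to None instead of raising an error.
    """
    groups = {"yes": [], "no": []}
    for row in rows:
        if row["region"] == region:
            groups[row["smoker"]].append(float(row["bmi"]))
    return {
        status: (round(mean(values), 2) if values else None)
        for status, values in groups.items()
    }

southeast_bmi = avg_bmi_by_smoking(sample_rows, "southeast")
northeast_bmi = avg_bmi_by_smoking(sample_rows, "northeast")
print("Southeast:", southeast_bmi)
print("Northeast:", northeast_bmi)
```

If smokers and non-smokers in the real data turn out to have similar average BMIs within each region, smoking alone would be unlikely to explain the Southeast's higher figure.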

### Notes for Further Exploration

We've looked at a few factors in the data during this project. Some other analyses that could be explored include:
- Checking how number of children correlates to BMI
- Checking how average number of children varies across region
- Checking the number of men and women surveyed in each region, as average BMI can differ between men and women, which could skew regional comparisons
- Analyzing how age correlates to BMI across all regions, and if there are regional differences
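
Two of the ideas above, the sex breakdown and the average number of children per region, could be approached with a single pass over the rows. The following is a minimal sketch under stated assumptions: the rows are invented, the `children` column name is assumed to match the CSV header, and both function names are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean

# Invented rows; the real analysis would pass insurance_data instead
sample_rows = [
    {"region": "southwest", "sex": "female", "children": "0"},
    {"region": "southwest", "sex": "male", "children": "2"},
    {"region": "southeast", "sex": "male", "children": "1"},
    {"region": "southeast", "sex": "male", "children": "3"},
]

def sex_counts_by_region(rows):
    """Count rows for each (region, sex) pair."""
    return dict(Counter((row["region"], row["sex"]) for row in rows))

def avg_children_by_region(rows):
    """Return the mean number of children per region, rounded to two decimals."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["region"]].append(float(row["children"]))
    return {region: round(mean(values), 2) for region, values in grouped.items()}

sex_counts = sex_counts_by_region(sample_rows)
avg_children = avg_children_by_region(sample_rows)
print(sex_counts)
print(avg_children)
```

Pairs that never occur, such as southeast and female in this sample, are simply absent from the counts, so a lookup with `sex_counts.get(pair, 0)` would be a safe way to read them.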