# Coding Applications in Medicine: Data Science - Pandas Practice Question

Review the following notebooks before attempting this practice question.

- Data Science - Handling Data with Pandas

In [None]:
import pandas as pd
import numpy as np

For this practice question, we will be using the same medical insurance dataset from the previous notebook.

We want to find the average insurance charge for overweight individuals living in the northeast region. The guide below walks through one way to solve this problem; other approaches are possible.

In [None]:
# Source data for reset convenience.
insuranceDF = pd.read_csv("data/insurance.csv")

# Provided BMI category table.
bmiCategoriesDF = pd.read_csv("data/bmiCategories.csv")

# Step 0: Check the data you are working with and think about what manipulations are needed.

# Step 1: Modify the BMI category table to include the min and max BMI for each category.
# Step 1a: Obtain the min and max BMI from the BMI column.

### bmiCategoriesDF["min bmi"] = ________
### bmiCategoriesDF["max bmi"] = ________

# Step 1b: Replace the non-numerical BMI values with reasonable numbers using the DataFrame.at accessor.
# Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html

### bmiCategoriesDF.at[________, ________] = ________
### bmiCategoriesDF.at[________, ________] = ________

# Step 1c: Convert the min and max bmi columns to the float data type.

### bmiCategoriesDF[________] = bmiCategoriesDF[________].astype(float)
### bmiCategoriesDF[________] = bmiCategoriesDF[________].astype(float)

# Step 2: Combine the two data frames. In this example, we will use the merge function.
# Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

### mergedDF = ________.merge(________, how="cross")

# Step 3: Filter the data frame to retain only rows where the BMI falls within the BMI category's range.

### mergedDF = mergedDF[(________) & (________)]
### mergedDF.reset_index(drop=True, inplace=True)

# Step 4: Group mergedDF by region and BMI weight status, then aggregate the numeric columns by the mean.

### mergedMeanDF = mergedDF.________.________.reset_index()

# Step 5: Filter the mergedMeanDF by the region and BMI weight status.

### resultDF = ________

# Step 6: Select only the relevant data.

### result = (________, ________, ________)

### print("The average insurance charge for an {0} person living in the {1} region is ${2}."
###       .format(result[0], result[1], result[2]))
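
Steps 1b through 3 rely on a pattern that may be unfamiliar: a cross merge followed by a range filter. The sketch below shows that pattern on a small, made-up example. The `patients` and `age_bands` tables, their column names, and the band limits are all hypothetical and are not part of the insurance dataset.

```python
import pandas as pd

# Hypothetical toy tables (not the course data).
patients = pd.DataFrame({"patient": ["A", "B", "C"], "age": [12, 35, 70]})
age_bands = pd.DataFrame({
    "band": ["child", "adult", "senior"],
    "min age": ["0", "18", "65"],
    "max age": ["18", "65", "and above"],  # last bound is not a number
})

# Replace the single non-numerical cell by row label and column name.
age_bands.at[2, "max age"] = "120"

# Convert the bound columns from text to floats so they can be compared.
age_bands["min age"] = age_bands["min age"].astype(float)
age_bands["max age"] = age_bands["max age"].astype(float)

# A cross merge pairs every patient row with every band row (3 x 3 = 9 rows).
paired = patients.merge(age_bands, how="cross")

# Keep only the pairings where the age lies in the half-open range [min, max).
matched = paired[(paired["age"] >= paired["min age"])
                 & (paired["age"] < paired["max age"])].reset_index(drop=True)

print(len(paired))
print(matched[["patient", "band"]])
```

After the filter, each patient appears exactly once, next to the band whose range contains their age. The same idea applies to BMI values and BMI categories in the exercise.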

Below is one way to solve this problem.


In [None]:
# Source data for reset convenience.
insuranceDF = pd.read_csv("data/insurance.csv")
bmiCategoriesDF = pd.read_csv("data/bmiCategories.csv")

# Modify the BMI category table to include the min and max BMI for each category.
# Example using regex.
### bmiCategoriesDF["min bmi"] = bmiCategoriesDF["BMI"].str.split(r"\s(?:-|and)*\s*").str[0]
### bmiCategoriesDF["max bmi"] = bmiCategoriesDF["BMI"].str.split(r"\s(?:-|and)*\s*").str[1]
 
# Example without using regex.
bmiCategoriesDF["min bmi"] = bmiCategoriesDF["BMI"].str.split().str[0]
bmiCategoriesDF["max bmi"] = bmiCategoriesDF["BMI"].str.split().str[-1]

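# Replace the non-numerical BMI bounds with reasonable numbers.
# Row 0 (the lowest category) has no numeric lower bound; row 3 (the highest) has no numeric upper bound.
bmiCategoriesDF.at[0, "min bmi"] = 0
bmiCategoriesDF.at[3, "max bmi"] = 100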

# Convert the min and max bmi columns to the float data type.
bmiCategoriesDF["min bmi"] = bmiCategoriesDF["min bmi"].astype(float)
bmiCategoriesDF["max bmi"] = bmiCategoriesDF["max bmi"].astype(float)

# Combine the two data frames: a cross merge pairs every insurance row with every BMI category row.
mergedDF = insuranceDF.merge(bmiCategoriesDF, how="cross")

# Filter the data frame to retain only rows where the BMI falls within the BMI category's range.
mergedDF = mergedDF[(mergedDF["bmi"] >= mergedDF["min bmi"])
                    & (mergedDF["bmi"] < mergedDF["max bmi"])]
mergedDF.reset_index(drop=True, inplace=True)

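# Group mergedDF by region and BMI weight status and take the mean of the numeric columns.
# numeric_only=True skips the text columns, which recent pandas versions refuse to average.
mergedMeanDF = mergedDF.groupby(["region", "Weight Status"]).mean(numeric_only=True).reset_index()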

# Filter mergedMeanDF by the region and BMI weight status.
resultDF = mergedMeanDF[(mergedMeanDF["region"] == "northeast")
                        & (mergedMeanDF["Weight Status"] == "Overweight")].reset_index(drop=True)

# Select only the relevant data.
resultFromDF = (resultDF["Weight Status"][0].lower(), 
                resultDF["region"][0], 
                str(round(resultDF["charges"][0], 2)))

print("The average insurance charge for an {0} person living in the {1} region is ${2}."
      .format(resultFromDF[0], resultFromDF[1], resultFromDF[2]))
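
Since other approaches are possible, here is a minimal sketch of one alternative that bins BMI values directly with `pd.cut` instead of using a cross merge. It runs on a few made-up rows rather than `insurance.csv`, and the bin edges and category labels are assumptions chosen for illustration; they may not match the contents of `bmiCategories.csv`.

```python
import pandas as pd

# Hypothetical toy rows standing in for insurance.csv (values are made up).
toyDF = pd.DataFrame({
    "region": ["northeast", "northeast", "northeast", "southwest"],
    "bmi": [26.0, 29.5, 22.0, 27.0],
    "charges": [1000.0, 3000.0, 500.0, 7000.0],
})

# Assumed bin edges and labels; right=False makes each bin half-open: [low, high).
edges = [0, 18.5, 25, 30, 100]
labels = ["Underweight", "Healthy Weight", "Overweight", "Obese"]
toyDF["Weight Status"] = pd.cut(toyDF["bmi"], bins=edges, labels=labels, right=False)

# Select the rows of interest with a boolean mask, then average their charges.
mask = (toyDF["region"] == "northeast") & (toyDF["Weight Status"] == "Overweight")
average = toyDF.loc[mask, "charges"].mean()

print("Average charge: ${0}".format(round(average, 2)))
```

This version avoids building the larger cross-merged table, at the cost of writing the category boundaries in code rather than reading them from the provided BMI category table.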

**Source:**

Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets