# Coding Applications in Medicine: Data Science

Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets

## Introduction

Data Science is a multidisciplinary field that integrates computation, math/statistics, and domain knowledge to understand the world and solve problems. 

The data science lifecycle consists of:
1. Question/problem formulation
2. Data acquisition and cleaning
3. Exploratory data analysis and visualization
4. Prediction and inference

In this notebook, we will explore a sample dataset on medical insurance costs.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Pandas Basics 

After obtaining the data, the first step is to read it into the Python notebook. We will store the data in a pandas DataFrame, a table with labeled rows and columns, and use it for all further analysis.

In [25]:
# Read the csv file into a Pandas data frame
insuranceDF= pd.read_csv('data/insurance.csv')

# Preview the dataframe (useful for double-checking your work)
insuranceDF

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500
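
Beyond previewing the rows, it often helps to check the size, column types, and missing values right after loading. The sketch below assumes a small stand-in dataframe, `sampleDF` (a hypothetical name), built from the first five rows of the preview above, so that it runs without the csv file.

```python
import pandas as pd

# Hypothetical stand-in for insuranceDF: the first five rows of the preview above
sampleDF = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

print(sampleDF.shape)    # (number of rows, number of columns)
print(sampleDF.dtypes)   # data type stored in each column
print(sampleDF.head(3))  # first three rows only

# Total number of missing values across the whole table
n_missing = sampleDF.isna().sum().sum()
print(n_missing)
```

The same calls should work unchanged on `insuranceDF`.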


In [26]:
# Note: The index can be non-numerical and non-unique

# Set the "sex" column to be the index
insuranceDF.set_index("sex", inplace=True)
insuranceDF

sex,age,bmi,children,smoker,region,charges
female,19,27.900,0,yes,southwest,16884.92400
male,18,33.770,1,no,southeast,1725.55230
male,28,33.000,3,no,southeast,4449.46200
male,33,22.705,0,no,northwest,21984.47061
male,32,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...
male,50,30.970,3,no,northwest,10600.54830
female,18,31.920,0,no,northeast,2205.98080
female,18,36.850,0,no,southeast,1629.83350
female,21,25.800,0,no,southwest,2007.94500
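
To see what a non-unique index means in practice, the sketch below looks up a label that occurs several times. It assumes a hypothetical five-row sample (`sampleDF`) taken from the rows shown above, and it uses the non-`inplace` form of `set_index`, which returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Hypothetical sample: the first five rows shown above
sampleDF = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
})

# Without inplace=True, set_index returns a new dataframe
bySex = sampleDF.set_index("sex")

# The index now contains repeated labels
print(bySex.index.is_unique)

# Looking up a repeated label returns every matching row
maleRows = bySex.loc["male"]
print(maleRows)
```

Because `sampleDF` itself is unchanged here, no `reset_index` call is needed afterwards.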


In [35]:
# For our use case, we would like to restore the default integer index
insuranceDF.reset_index(inplace=True)
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.900,0,yes,southwest,16884.92400
1,male,18,33.770,1,no,southeast,1725.55230
2,male,28,33.000,3,no,southeast,4449.46200
3,male,33,22.705,0,no,northwest,21984.47061
4,male,32,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,male,50,30.970,3,no,northwest,10600.54830
1334,female,18,31.920,0,no,northeast,2205.98080
1335,female,18,36.850,0,no,southeast,1629.83350
1336,female,21,25.800,0,no,southwest,2007.94500


In [44]:
# Use loc to select rows/columns by label name

# View the information for person number 5 to 10 (inclusive)
insuranceDF.loc[5:10]

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
5,female,31,25.74,0,no,southeast,3756.6216
6,female,46,33.44,1,no,southeast,8240.5896
7,female,37,27.74,3,no,northwest,7281.5056
8,male,37,29.83,2,no,northeast,6406.4107
9,female,60,25.84,0,no,northwest,28923.13692
10,male,25,26.22,0,no,northeast,2721.3208


In [45]:
# View only the demographic information for person number 5 to 10 (inclusive)
insuranceDF.loc[5:10, "sex":"region"]

Unnamed: 0,sex,age,bmi,children,smoker,region
5,female,31,25.74,0,no,southeast
6,female,46,33.44,1,no,southeast
7,female,37,27.74,3,no,northwest
8,male,37,29.83,2,no,northeast
9,female,60,25.84,0,no,northwest
10,male,25,26.22,0,no,northeast


In [39]:
# View only the sex, bmi, and smoker information for everyone
insuranceDF.loc[:, ["sex", "bmi", "smoker"]]

Unnamed: 0,sex,bmi,smoker
0,female,27.900,yes
1,male,33.770,no
2,male,33.000,no
3,male,22.705,no
4,male,28.880,no
...,...,...,...
1333,male,30.970,no
1334,female,31.920,no
1335,female,36.850,no
1336,female,25.800,no


In [47]:
# Pandas can also use iloc to select rows/columns by number
# Note: Counting starts at 0 and the end position is not included

# View only the sex, bmi, and smoker information for the 5th to 10th person (inclusive)
insuranceDF.iloc[4:10, [0, 2, 4]]

Unnamed: 0,sex,bmi,smoker
4,male,28.88,no
5,female,25.74,no
6,female,33.44,no
7,female,27.74,no
8,male,29.83,no
9,female,25.84,no
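
The difference between `loc` and `iloc` is easiest to see when the row labels do not match the row positions. The sketch below assumes a hypothetical dataframe whose index starts at 10 instead of 0.

```python
import pandas as pd

# Hypothetical dataframe whose labels (10..14) differ from positions (0..4)
df = pd.DataFrame(
    {"bmi": [27.9, 33.77, 33.0, 22.705, 28.88]},
    index=[10, 11, 12, 13, 14],
)

# loc slices by label and includes the end label
byLabel = df.loc[11:13]
print(byLabel)      # rows labelled 11, 12, 13

# iloc slices by position and excludes the end position
byPosition = df.iloc[1:3]
print(byPosition)   # rows at positions 1 and 2 (labels 11 and 12)
```

With the default integer index used in this notebook, labels and positions coincide, so only the treatment of the end point differs.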


In [54]:
# Determine, for every person, whether they are female

# Note: With a single column label, [] selects that column, just as .loc[:, label] does
insuranceDF["sex"] == "female" 

# This is the same as 
# insuranceDF.loc[:, "sex"] == "female"

0        True
1       False
2       False
3       False
4       False
        ...  
1333    False
1334     True
1335     True
1336     True
1337     True
Name: sex, Length: 1338, dtype: bool
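
A boolean series like the one above can also be summarized directly, because `True` counts as 1 and `False` as 0. A minimal sketch, assuming a hypothetical five-value sample of the `sex` column:

```python
import pandas as pd

# Hypothetical sample: the "sex" values of the first five rows
sex = pd.Series(["female", "male", "male", "male", "male"], name="sex")

isFemale = sex == "female"

# Summing a boolean series counts the True values
nFemale = int(isFemale.sum())

# The mean of a boolean series is the fraction of True values
shareFemale = isFemale.mean()

print(nFemale, shareFemale)
```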

In [57]:
# Filter the dataframe to keep only smokers, then show the first 10 of them
insuranceDF.loc[insuranceDF["smoker"] == "yes"].head(10)

# This is the same as
# insuranceDF.loc[insuranceDF["smoker"] == "yes"].iloc[:10]

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.9,0,yes,southwest,16884.924
11,female,62,26.29,0,yes,southeast,27808.7251
14,male,27,42.13,0,yes,southeast,39611.7577
19,male,30,35.3,0,yes,southwest,36837.467
23,female,34,31.92,1,yes,northeast,37701.8768
29,male,31,36.3,2,yes,southwest,38711.0
30,male,22,35.6,0,yes,southwest,35585.576
34,male,28,36.4,1,yes,southwest,51194.55914
38,male,35,36.67,1,yes,northeast,39774.2763
39,male,60,39.9,0,yes,southwest,48173.361
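
Several conditions can be combined into one filter with `&` (and), `|` (or), and `~` (not). The sketch below assumes a hypothetical six-row sample (`sampleDF`) drawn from rows shown earlier. Storing each condition in its own variable, as done here, avoids the operator-precedence problems that arise when comparisons are written inline without parentheses.

```python
import pandas as pd

# Hypothetical sample built from rows shown earlier in this notebook
sampleDF = pd.DataFrame({
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88, 42.13],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
})

smokerMask = sampleDF["smoker"] == "yes"
highBmiMask = sampleDF["bmi"] > 30

both = sampleDF.loc[smokerMask & highBmiMask]    # smokers with BMI above 30
either = sampleDF.loc[smokerMask | highBmiMask]  # smokers or BMI above 30
nonSmokers = sampleDF.loc[~smokerMask]           # everyone who does not smoke

print(both)
print(either)
print(nonSmokers)
```

When writing the conditions inline, wrap each one in parentheses, for example `(df["bmi"] > 30) & (df["smoker"] == "yes")`.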


In [59]:
# Summary statistics reported by default for dataframes (numerical columns only)
insuranceDF.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [65]:
# Summary statistics reported by default for a numerical series
insuranceDF["bmi"].describe()

count    1338.000000
mean       30.663397
std         6.098187
min        15.960000
25%        26.296250
50%        30.400000
75%        34.693750
max        53.130000
Name: bmi, dtype: float64

In [66]:
# Summary statistics reported by default for a non-numerical series
insuranceDF["smoker"].describe()

count     1338
unique       2
top         no
freq      1064
Name: smoker, dtype: object
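
For a non-numerical series, `describe()` reports only the most frequent value (`top`) and its count (`freq`). To see the count of every category, `value_counts()` is a common companion. A minimal sketch, assuming a hypothetical five-value sample of the `smoker` column:

```python
import pandas as pd

# Hypothetical sample: the "smoker" values of the first five rows
smoker = pd.Series(["yes", "no", "no", "no", "no"], name="smoker")

# Number of rows in each category, most frequent first
counts = smoker.value_counts()
print(counts)

# Same breakdown expressed as fractions of the total
shares = smoker.value_counts(normalize=True)
print(shares)
```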

In [68]:
# Obtain the demographics of the 10 people with the lowest insurance charges
insuranceDF.sort_values("charges", ascending=True).head(10)

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
940,male,18,23.21,0,no,southeast,1121.8739
808,male,18,30.14,0,no,southeast,1131.5066
1244,male,18,33.33,0,no,southeast,1135.9407
663,male,18,33.66,0,no,southeast,1136.3994
22,male,18,34.1,0,no,southeast,1137.011
194,male,18,34.43,0,no,southeast,1137.4697
866,male,18,37.29,0,no,southeast,1141.4451
781,male,18,41.14,0,no,southeast,1146.7966
442,male,18,43.01,0,no,southeast,1149.3959
1317,male,18,53.13,0,no,southeast,1163.4627
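
When only the top or bottom few rows are needed, `nsmallest` and `nlargest` express the same idea as sorting followed by `head`. The sketch below assumes a hypothetical five-row sample (`sampleDF`) with the charges from the first five rows of the dataset.

```python
import pandas as pd

# Hypothetical sample: age and charges of the first five rows
sampleDF = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Two lowest charges, in ascending order
cheapest = sampleDF.nsmallest(2, "charges")

# Two highest charges, in descending order
priciest = sampleDF.nlargest(2, "charges")

# Same rows as sorting the whole table and taking the head
viaSort = sampleDF.sort_values("charges", ascending=True).head(2)

print(cheapest)
print(priciest)
```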


In [77]:
# Add a "parent" column that records whether each person has at least one child

def isParent(children):
    if children > 0:
        return "yes"
    else:
        return "no"

insuranceDF["parent"] = insuranceDF["children"].apply(isParent)
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges,parent
0,female,19,27.900,0,yes,southwest,16884.92400,no
1,male,18,33.770,1,no,southeast,1725.55230,yes
2,male,28,33.000,3,no,southeast,4449.46200,yes
3,male,33,22.705,0,no,northwest,21984.47061,no
4,male,32,28.880,0,no,northwest,3866.85520,no
...,...,...,...,...,...,...,...,...
1333,male,50,30.970,3,no,northwest,10600.54830,yes
1334,female,18,31.920,0,no,northeast,2205.98080,no
1335,female,18,36.850,0,no,southeast,1629.83350,no
1336,female,21,25.800,0,no,southwest,2007.94500,no
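
For a simple rule like this one, a vectorized expression is a common alternative to `apply`, since it evaluates the whole column at once instead of calling a Python function per row. The sketch below assumes a hypothetical five-value sample of the `children` column and uses `np.where` in place of `isParent`; it is an alternative, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the "children" values of the first five rows
children = pd.Series([0, 1, 3, 0, 0], name="children")

# Row-by-row version, equivalent to applying isParent
viaApply = children.apply(lambda n: "yes" if n > 0 else "no")

# Vectorized version: choose "yes" where the condition holds, "no" elsewhere
viaWhere = pd.Series(
    np.where(children > 0, "yes", "no"),
    index=children.index,
    name="parent",
)

print(viaWhere)
```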


In [80]:
# Drop the newly created "parent" column
insuranceDF = insuranceDF.drop("parent", axis="columns")
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.900,0,yes,southwest,16884.92400
1,male,18,33.770,1,no,southeast,1725.55230
2,male,28,33.000,3,no,southeast,4449.46200
3,male,33,22.705,0,no,northwest,21984.47061
4,male,32,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,male,50,30.970,3,no,northwest,10600.54830
1334,female,18,31.920,0,no,northeast,2205.98080
1335,female,18,36.850,0,no,southeast,1629.83350
1336,female,21,25.800,0,no,southwest,2007.94500


In [92]:
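# Group people by region and take the median of each numerical column
# (numeric_only=True skips text columns such as "sex", which newer pandas versions would otherwise reject)
insuranceDF.groupby("region").median(numeric_only=True)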

region,age,bmi,children,charges
northeast,39.5,28.88,1.0,10057.652025
northwest,39.0,28.88,1.0,8965.79575
southeast,39.0,33.33,1.0,9294.13195
southwest,39.0,30.3,1.0,8798.593
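
A grouped dataframe can also report different statistics for different columns in a single call. The sketch below uses pandas named aggregation on a hypothetical five-row sample (`sampleDF`); the output column names `people`, `median_charges`, and `mean_bmi` are illustrative choices, not names from the dataset.

```python
import pandas as pd

# Hypothetical sample: region, bmi, and charges of the first five rows
sampleDF = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Each keyword is new_column_name=(source_column, aggregation)
summary = sampleDF.groupby("region").agg(
    people=("charges", "count"),
    median_charges=("charges", "median"),
    mean_bmi=("bmi", "mean"),
)

print(summary)
```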


In [97]:
# Keep only regions where the mean BMI is greater than 30

def isHighMeanBMI(df):
    return df['bmi'].mean() > 30

insuranceDF.groupby("region").filter(isHighMeanBMI)

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.90,0,yes,southwest,16884.92400
1,male,18,33.77,1,no,southeast,1725.55230
2,male,28,33.00,3,no,southeast,4449.46200
5,female,31,25.74,0,no,southeast,3756.62160
6,female,46,33.44,1,no,southeast,8240.58960
...,...,...,...,...,...,...,...
1330,female,57,25.74,2,no,southeast,12629.16560
1331,female,23,33.40,0,no,southwest,10795.93733
1332,female,52,44.70,3,no,southwest,11411.68500
1335,female,18,36.85,0,no,southeast,1629.83350
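
The same group-level filter can be written with `transform`, which broadcasts each region's mean back onto its rows so that an ordinary boolean mask can be used. The sketch below assumes a hypothetical five-row sample (`sampleDF`); because the sample is so small, the regions that pass the threshold differ from those in the full dataset.

```python
import pandas as pd

# Hypothetical sample: region and bmi of the first five rows
sampleDF = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
})

# One value per row: the mean BMI of that row's region
regionMeanBmi = sampleDF.groupby("region")["bmi"].transform("mean")

# Keep rows whose region mean exceeds 30
kept = sampleDF.loc[regionMeanBmi > 30]

# Equivalent result using filter, as in the cell above
viaFilter = sampleDF.groupby("region").filter(lambda g: g["bmi"].mean() > 30)

print(kept)
```

The `transform` form is handy when the per-group value is also wanted as a column, for example to compare each person's BMI with their regional mean.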


We have shown some of the most commonly used pandas operations/functions, but we have barely scratched the surface! To learn more about all the other existing pandas functions, go to https://pandas.pydata.org/ for the user guide and API reference.

## Text Wrangling and Regex Basics

As you may have already noticed, data comes in many forms, including numerical, boolean (true or false), and text. Textual data in particular often needs to be manipulated or cleaned up before it is useful in later analysis.

The following section shows ways to manipulate strings (textual data). These techniques can be applied to dataframes via user-defined functions (an example of applying a user-defined function is shown in the previous section).
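
As a brief illustration of how a string-cleaning function connects to a dataframe column, the sketch below applies a hypothetical helper, `cleanLabel`, to a made-up series of messy region names. The insurance dataset itself is already clean, so these values are invented for illustration only.

```python
import pandas as pd

def cleanLabel(text):
    """Trim surrounding whitespace and convert to lower case."""
    return text.strip().lower()

# Hypothetical messy values, not taken from the insurance dataset
rawRegions = pd.Series([" Southwest", "southeast ", "NORTHWEST"], name="region")

# Apply the user-defined function to every value
viaApply = rawRegions.apply(cleanLabel)

# Equivalent result using the built-in .str accessor
viaStr = rawRegions.str.strip().str.lower()

print(viaApply)
```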