# Coding Applications in Medicine: Data Science - Handling Data with Pandas

In this notebook, we will explore a sample dataset on medical insurance to practice handling data using Pandas.

## Introduction

In [2]:
import pandas as pd
import numpy as np

## Pandas Basics 

After obtaining the data, the first step is to read it into the Python notebook. We will store the data in a Pandas dataframe and use it for further analysis.

In [4]:
# Read the CSV file into a Pandas dataframe.
insuranceDF = pd.read_csv("insurance_modified.csv")

# Preview the dataframe by calling the variable (useful for double-checking your work).
# Note: Missing data is normally shown as NaN (similar to null).
insuranceDF

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,,yes,southwest,16884.92400
1,18,male,33.770,1.0,no,southeast,1725.55230
2,28,male,33.000,3.0,no,southeast,4449.46200
3,33,male,22.705,,no,northwest,21984.47061
4,32,male,28.880,,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3.0,no,northwest,10600.54830
1334,18,female,31.920,,no,northeast,2205.98080
1335,18,female,36.850,,no,southeast,1629.83350
1336,21,female,25.800,,no,southwest,2007.94500


In [11]:
# When you hear the word 'index', think 'row labels'. That is what it is!
# Note: The index can be non-numerical and non-unique.  

# Set the "sex" column to be the index of the dataframe.
insuranceDF.set_index("sex", inplace=True)
insuranceDF

sex,age,bmi,children,smoker,region,charges
female,19,27.900,,yes,southwest,16884.92400
male,18,33.770,1.0,no,southeast,1725.55230
male,28,33.000,3.0,no,southeast,4449.46200
male,33,22.705,,no,northwest,21984.47061
male,32,28.880,,no,northwest,3866.85520
...,...,...,...,...,...,...
male,50,30.970,3.0,no,northwest,10600.54830
female,18,31.920,,no,northeast,2205.98080
female,18,36.850,,no,southeast,1629.83350
female,21,25.800,,no,southwest,2007.94500


In [12]:
# For our use case, we would like to keep the default index setting.

insuranceDF.reset_index(inplace=True)
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.900,,yes,southwest,16884.92400
1,male,18,33.770,1.0,no,southeast,1725.55230
2,male,28,33.000,3.0,no,southeast,4449.46200
3,male,33,22.705,,no,northwest,21984.47061
4,male,32,28.880,,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,male,50,30.970,3.0,no,northwest,10600.54830
1334,female,18,31.920,,no,northeast,2205.98080
1335,female,18,36.850,,no,southeast,1629.83350
1336,female,21,25.800,,no,southwest,2007.94500
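
The note above that an index can be non-numerical and non-unique is easier to see on a tiny table. The following is a minimal sketch on a made-up three-row dataframe (`sample`, `males`, and `restored` are hypothetical names, not part of the insurance data):

```python
import pandas as pd

# Hypothetical three-row sample, not the insurance dataset.
sample = pd.DataFrame({"sex": ["female", "male", "male"], "age": [19, 18, 28]})

# Use "sex" as the row labels. The labels are text and "male" appears twice.
indexed = sample.set_index("sex")

# Selecting a repeated label returns every row that carries it.
males = indexed.loc["male"]
print(len(males))  # 2

# reset_index moves "sex" back into a regular column and restores 0, 1, 2, ...
restored = indexed.reset_index()
print(list(restored.columns))  # ['sex', 'age']
```

Here `set_index` and `reset_index` are called without `inplace=True`, so each call returns a new dataframe and `sample` itself is left untouched.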


In [5]:
# Use loc to select rows/columns by label name.

# View the information for persons 5 to 10 (inclusive) in the dataframe.
insuranceDF.loc[5:10]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
5,31,female,25.74,,no,southeast,3756.6216
6,46,female,33.44,1.0,no,southeast,8240.5896
7,37,female,27.74,3.0,no,northwest,7281.5056
8,37,male,29.83,2.0,no,northeast,6406.4107
9,60,female,25.84,,no,northwest,28923.13692
10,25,male,26.22,,no,northeast,2721.3208


In [6]:
# View only the demographic information for persons 5 to 10 (inclusive).
# The syntax is df.loc[row_start:row_end, col_start:col_end]. With loc, both ends of each slice are included.
# This is a very useful way to view and work with your data!
insuranceDF.loc[5:10, "sex":"region"]

Unnamed: 0,sex,bmi,children,smoker,region
5,female,25.74,,no,southeast
6,female,33.44,1.0,no,southeast
7,female,27.74,3.0,no,northwest
8,male,29.83,2.0,no,northeast
9,female,25.84,,no,northwest
10,male,26.22,,no,northeast


In [7]:
# View only the sex, bmi, and smoker information for everyone.
# Same syntax as above, but a bare colon in the row position selects all of the rows.
# Only the listed columns are shown.
insuranceDF.loc[:, ["sex", "bmi", "smoker"]]

Unnamed: 0,sex,bmi,smoker
0,female,27.900,yes
1,male,33.770,no
2,male,33.000,no
3,male,22.705,no
4,male,28.880,no
...,...,...,...
1333,male,30.970,no
1334,female,31.920,no
1335,female,36.850,no
1336,female,25.800,no


In [8]:
# Pandas can also use iloc to select rows/columns by integer position.
# Note: Counting starts at 0 and the end of a slice is not included.

# View the columns at positions 0, 2, and 4 for the 5th to 10th person (inclusive).
# Positions depend on the current column order; in the output below they are age, bmi, and smoker.
insuranceDF.iloc[4:10, [0, 2, 4]]

Unnamed: 0,age,bmi,smoker
4,32,28.88,no
5,31,25.74,no
6,46,33.44,no
7,37,27.74,no
8,37,29.83,no
9,60,25.84,no


#### Note: loc and iloc are two different ways of selecting data from a dataframe. loc selects by label and includes the end of a slice; iloc selects by integer position and excludes the end of a slice.
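
The two selectors only look interchangeable when the row labels happen to be 0, 1, 2, and so on. As a minimal sketch, assuming a made-up dataframe whose labels start at 10 (`sample` is hypothetical), the difference becomes visible:

```python
import pandas as pd

# Hypothetical sample; the index labels deliberately do not start at 0.
sample = pd.DataFrame(
    {"age": [19, 18, 28, 33], "bmi": [27.9, 33.77, 33.0, 22.705]},
    index=[10, 11, 12, 13],
)

# loc slices by label, and both endpoints are included.
by_label = sample.loc[10:12]

# iloc slices by position, and the end position is excluded.
by_position = sample.iloc[0:2]

print(list(by_label.index))     # [10, 11, 12]
print(list(by_position.index))  # [10, 11]
```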

In [13]:
# Check everyone to see if they are a smoker.
# The comparison goes through the 'smoker' column and returns a new Series that is
# True where the value is 'yes' and False otherwise. The original column is not modified.
# This is a boolean mask and is a very helpful concept.

# Note: Selecting a column with [] is the same as selecting it with loc.
insuranceDF["smoker"] == "yes"

# This is the same as
### insuranceDF.loc[:, "smoker"] == "yes"

0        True
1       False
2       False
3       False
4       False
        ...  
1333    False
1334    False
1335    False
1336    False
1337     True
Name: smoker, Length: 1338, dtype: bool

In [14]:
# Filter our existing dataframe to include only the first 10 smokers on the list.
insuranceDF.loc[insuranceDF["smoker"] == "yes"].head(10)

# This is the same as
### insuranceDF.loc[insuranceDF["smoker"] == "yes"].iloc[:10]

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.9,,yes,southwest,16884.924
11,female,62,26.29,,yes,southeast,27808.7251
14,male,27,42.13,,yes,southeast,39611.7577
19,male,30,35.3,,yes,southwest,36837.467
23,female,34,31.92,1.0,yes,northeast,37701.8768
29,male,31,36.3,2.0,yes,southwest,38711.0
30,male,22,35.6,,yes,southwest,35585.576
34,male,28,36.4,1.0,yes,southwest,51194.55914
38,male,35,36.67,1.0,yes,northeast,39774.2763
39,male,60,39.9,,yes,southwest,48173.361
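
Boolean masks can also be combined before filtering. The sketch below uses a made-up five-row dataframe (`people` and the mask names are hypothetical) and the element-wise operators `&` (and), `|` (or), and `~` (not):

```python
import pandas as pd

# Hypothetical sample, not the insurance dataset.
people = pd.DataFrame(
    {
        "age": [19, 62, 27, 30, 45],
        "smoker": ["yes", "yes", "yes", "no", "no"],
    }
)

is_smoker = people["smoker"] == "yes"
is_over_25 = people["age"] > 25

# Smokers who are also older than 25.
older_smokers = people.loc[is_smoker & is_over_25]

# Anyone who is a smoker or older than 25 (or both).
smoker_or_older = people.loc[is_smoker | is_over_25]

# Everyone who is not a smoker.
non_smokers = people.loc[~is_smoker]

print(list(older_smokers.index))  # [1, 2]
print(list(non_smokers.index))    # [3, 4]
```

When the comparisons are written inline, each one needs its own parentheses, for example `(people["age"] > 25) & (people["smoker"] == "yes")`, because `&` binds more tightly than `>` and `==`.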


In [15]:
# By default, NaN values are skipped in these summary statistics (note the lower count for "children").

# Statistics reported by default for dataframes.
insuranceDF.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,764.0,1338.0
mean,39.207025,30.663397,1.917539,13270.422265
std,14.04996,6.098187,0.983351,12110.011237
min,18.0,15.96,1.0,1121.8739
25%,27.0,26.29625,1.0,4740.28715
50%,39.0,30.4,2.0,9382.033
75%,51.0,34.69375,3.0,16639.912515
max,64.0,53.13,5.0,63770.42801
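
The lower count for "children" in the table above comes from missing values being left out of the calculation. A minimal sketch on a made-up series (`children` here is a hypothetical five-value sample) shows how skipping NaN differs from treating it as 0:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing values.
children = pd.Series([np.nan, 1.0, 3.0, np.nan, 2.0])

# count() only counts the non-missing values.
print(children.count())  # 3

# By default, mean() skips NaN: (1 + 3 + 2) / 3.
print(children.mean())  # 2.0

# With skipna=False, a single NaN makes the result NaN.
print(children.mean(skipna=False))  # nan

# Treating NaN as 0 changes the answer: (0 + 1 + 3 + 0 + 2) / 5.
print(children.fillna(0).mean())  # 1.2
```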


In [16]:
# Statistics reported by default for series (numerical).
insuranceDF["bmi"].describe()

count    1338.000000
mean       30.663397
std         6.098187
min        15.960000
25%        26.296250
50%        30.400000
75%        34.693750
max        53.130000
Name: bmi, dtype: float64

In [17]:
# Statistics reported by default for series (non-numerical).
insuranceDF["smoker"].describe()

count     1338
unique       2
top         no
freq      1064
Name: smoker, dtype: object

In [None]:
# Obtain the demographics of the 10 people with the lowest insurance charges.
insuranceDF.sort_values("charges", ascending=True).head(10)
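
As a small illustration of sorting, the sketch below sorts a made-up four-row dataframe (`sample` is hypothetical) and compares the result with `nsmallest`, a pandas shortcut for "the n rows with the smallest values in a column":

```python
import pandas as pd

# Hypothetical sample, not the insurance dataset.
sample = pd.DataFrame(
    {"age": [19, 18, 28, 33], "charges": [16884.0, 1725.0, 4449.0, 21984.0]}
)

# Sort by charges (lowest first) and keep the first two rows.
lowest_two = sample.sort_values("charges", ascending=True).head(2)

# nsmallest picks the same two rows without sorting the whole table by hand.
same_rows = sample.nsmallest(2, "charges")

print(list(lowest_two.index))  # [1, 2]
print(list(same_rows.index))   # [1, 2]
```

Note that the original row labels travel with the rows, so the result is labeled 1 and 2 rather than 0 and 1.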

In [None]:
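# Group all the people by region, aggregate by median value.
# numeric_only=True skips text columns such as "sex" and "smoker", for which a median is not defined.
insuranceDF.groupby("region").median(numeric_only=True)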
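
A median is only defined for numeric columns, and recent pandas versions may raise an error when asked for the median of text columns. The following is a minimal sketch, assuming a made-up dataframe (`sample` is hypothetical), that restricts the aggregation to numeric columns:

```python
import pandas as pd

# Hypothetical sample with one text column ("sex") besides the grouping column.
sample = pd.DataFrame(
    {
        "region": ["southwest", "southeast", "southeast", "southwest", "southeast"],
        "sex": ["female", "male", "male", "male", "female"],
        "bmi": [28.0, 33.77, 33.0, 30.0, 25.74],
        "charges": [16000.0, 1700.0, 4400.0, 3000.0, 3700.0],
    }
)

# numeric_only=True leaves out "sex" and computes one median per region
# for each remaining numeric column.
medians = sample.groupby("region").median(numeric_only=True)

print(list(medians.columns))            # ['bmi', 'charges']
print(medians.loc["southeast", "bmi"])  # 33.0
print(medians.loc["southwest", "bmi"])  # 29.0
```

The grouping column becomes the index of the result, which is why values are looked up with `medians.loc["southeast", "bmi"]`.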

In [None]:
# Keep only regions where the mean BMI is greater than 30.

def isHighMeanBMI(df):
    return df["bmi"].mean() > 30

insuranceDF.groupby("region").filter(isHighMeanBMI)
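
Unlike an aggregation, a group filter returns the original rows, keeping or dropping each group as a whole. A minimal sketch on a made-up dataframe (`sample` and the two region names are hypothetical) shows this, together with an equivalent boolean mask built with `transform`:

```python
import pandas as pd

# Hypothetical sample: mean BMI is 27 in "north" and 33 in "south".
sample = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "bmi": [25.0, 29.0, 31.0, 35.0],
    }
)

# filter() calls the function once per group and keeps groups where it returns True.
kept = sample.groupby("region").filter(lambda group: group["bmi"].mean() > 30)

# transform("mean") gives every row its group's mean, which yields a row-level mask.
mask = sample.groupby("region")["bmi"].transform("mean") > 30
kept_via_mask = sample.loc[mask]

print(list(kept.index))           # [2, 3]
print(list(kept_via_mask.index))  # [2, 3]
```

The row with BMI 29 is dropped and the row with BMI 31 is kept because the decision depends on the group mean, not on the individual value.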

In [None]:
# Check how many NaN values we have for each column.
insuranceDF.isna().sum()

In [None]:
# Check the unique values in the children column.
insuranceDF["children"].unique()

In [None]:
# One way to handle missing data is to remove rows where there are NaN values.

# Drop rows where NaN values exist (returns a new dataframe; insuranceDF is unchanged).
insuranceDF.dropna(axis=0)

In [None]:
# Another way to handle missing data is to remove columns where there are NaN values.

# Drop columns where NaN values exist (returns a new dataframe; insuranceDF is unchanged).
insuranceDF.dropna(axis=1)
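
The two `dropna` calls can remove very different amounts of data. The sketch below uses a made-up three-row dataframe (`sample` is hypothetical) and also shows the `subset` parameter, which limits the check to chosen columns:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one NaN in "children" and one in "bmi".
sample = pd.DataFrame(
    {
        "age": [19, 18, 28],
        "children": [np.nan, 1.0, 3.0],
        "bmi": [27.9, np.nan, 33.0],
    }
)

# axis=0 drops every row that contains at least one NaN.
complete_rows = sample.dropna(axis=0)

# axis=1 drops every column that contains at least one NaN.
complete_columns = sample.dropna(axis=1)

# subset only looks at the listed columns when deciding which rows to drop.
known_children = sample.dropna(subset=["children"])

print(list(complete_rows.index))       # [2]
print(list(complete_columns.columns))  # ['age']
print(list(known_children.index))      # [1, 2]
```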

In [None]:
# Another way to handle missing data is to replace the NaN values with a reasonable value.

# Replace the NaN values in the "children" column with 0 (returns a new Series; the dataframe is unchanged).
insuranceDF["children"].fillna(0)

# This is the same as 
### insuranceDF["children"].replace(np.nan, 0)
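
Methods such as `fillna` return a new object instead of changing the dataframe, so the result has to be assigned back if the change should persist. A minimal sketch, assuming a made-up one-column dataframe (`sample` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing values.
sample = pd.DataFrame({"children": [np.nan, 1.0, 3.0, np.nan]})

# fillna returns a new Series; "sample" still contains the NaN values.
filled = sample["children"].fillna(0)
missing_before = int(sample["children"].isna().sum())
print(missing_before)  # 2

# Assign the result back to the column to keep the change.
sample["children"] = filled
missing_after = int(sample["children"].isna().sum())
print(missing_after)  # 0
```

Whether 0 is a reasonable replacement depends on what a missing value means in the dataset at hand.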

In [None]:
# Add a column named "parent" that determines whether a given person is a parent
#  given the number of children they have.

def isParent(children):
    if children > 0:
        return "yes"
    else:
        return "no"

insuranceDF["parent"] = insuranceDF["children"].apply(isParent)
insuranceDF
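
`apply` calls the function once per value. For a simple yes/no rule like this one, a vectorized alternative is NumPy's `np.where`, sketched below on a made-up dataframe (`sample` is hypothetical). In both approaches a missing value ends up as "no", because the comparison `NaN > 0` is False.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, including one missing value and one explicit 0.
sample = pd.DataFrame({"children": [np.nan, 1.0, 3.0, 0.0]})

# np.where(condition, value_if_true, value_if_false) works on the whole column at once.
sample["parent"] = np.where(sample["children"] > 0, "yes", "no")

print(list(sample["parent"]))  # ['no', 'yes', 'yes', 'no']
```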

In [None]:
# Drop the newly created "parent" column.
insuranceDF = insuranceDF.drop("parent", axis="columns")
insuranceDF

## Text Wrangling and Regex Basics

As you may have already noticed, there are many different forms of data, including numerical, boolean (true or false), text, etc. In general, textual data may need to be manipulated and/or cleaned up in order to be useful in later analysis.

The following section will show ways to manipulate strings (textual data) on Pandas series.

In [None]:
# Change the text for the "region" column to be upper case and append the word "region" at the end.
insuranceDF["region"] = insuranceDF["region"].str.upper() + " region"
insuranceDF

In [None]:
# Replace the space character with the hyphen character for the "region" column.
insuranceDF["region"] = insuranceDF["region"].str.replace(' ', '-')
insuranceDF

In [None]:
# Split the "region" column by the hyphen character and take only the first element of the result.
insuranceDF["region"] = insuranceDF["region"].str.split('-').str[0]
insuranceDF

In [None]:
# Take the first character of the "sex" column and convert it to upper case.
insuranceDF["sex"] = insuranceDF["sex"].str[0:1].str.upper()
insuranceDF

In [None]:
# Check if the region is in the north.
insuranceDF["In North Region"] = insuranceDF["region"].str.contains("NORTH")
insuranceDF

In [None]:
# Check the length of the string of the "smoker" column.
insuranceDF["smoker.len"] = insuranceDF["smoker"].str.len()
insuranceDF
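
To make the intermediate results of the "region" steps above visible, the sketch below replays them on a made-up two-value series (`region` and the `step` names are hypothetical). Every `.str` method works on each element and returns a new series.

```python
import pandas as pd

# Hypothetical sample of two region names.
region = pd.Series(["southwest", "northeast"])

# Upper-case each value and append a suffix.
step1 = region.str.upper() + " region"
print(list(step1))  # ['SOUTHWEST region', 'NORTHEAST region']

# Replace the space with a hyphen (regex=False treats ' ' as plain text).
step2 = step1.str.replace(" ", "-", regex=False)
print(list(step2))  # ['SOUTHWEST-region', 'NORTHEAST-region']

# Split on the hyphen and keep the first piece.
step3 = step2.str.split("-").str[0]
print(list(step3))  # ['SOUTHWEST', 'NORTHEAST']

# Check each value for the substring "NORTH".
in_north = step3.str.contains("NORTH")
print(list(in_north))  # [False, True]
```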

Regex (regular expression) describes a sequence of characters that specifies a search pattern. Regex is a powerful way to search for specific patterns within text when done correctly, but it can be quite complex and confusing.


For more information: https://docs.python.org/3/howto/regex.html

Website to check/test your regex expression: https://regex101.com

In [None]:
# Pattern:
# - First character is 'S'.
# - Followed by exactly two characters of any kind (except a newline).
# - Followed by one or more characters that are not lowercase 'a' to 'z'.
# - Followed by zero or more word characters (letters, digits, or underscore).
# - Followed by 'T'.
pattern = r"S.{2}[^a-z]+\w*T"

# Find all matches to the above pattern within the 'region' column.
insuranceDF["region"].str.findall(pattern)
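
To see what the pattern does and does not match, here is a minimal sketch on a made-up series (`regions` is hypothetical). It assumes upper-case region names such as the ones produced by the earlier cells, plus one lower-case value for contrast.

```python
import re

import pandas as pd

pattern = r"S.{2}[^a-z]+\w*T"

# Hypothetical sample values.
regions = pd.Series(["SOUTHWEST", "SOUTHEAST", "NORTHWEST", "southwest"])

# str.findall returns, for each element, a list of all non-overlapping matches.
matches = regions.str.findall(pattern)
print(list(matches))  # [['SOUTHWEST'], ['SOUTHEAST'], [], []]

# The same pattern with Python's built-in re module on a single string.
print(re.findall(pattern, "SOUTHWEST"))  # ['SOUTHWEST']
```

"NORTHWEST" produces no match because its only 'S' is followed by a single character, and the pattern needs at least four more. "southwest" produces no match because regex matching is case-sensitive by default.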

We have shown some of the most commonly used pandas operations/functions, but we have barely scratched the surface! To learn more about the other pandas functions, check the following:
- User Guide (Pandas): https://pandas.pydata.org/docs/user_guide/index.html#
- API Reference (Pandas): https://pandas.pydata.org/docs/reference/index.html

Other helpful guides and references to get you started:
- User Guide (Python): https://docs.python.org/3/tutorial/
- Library Reference (Python): https://docs.python.org/3/library/index.html
- Language Reference (Python): https://docs.python.org/3/reference/index.html
- User Guide (Numpy): https://numpy.org/doc/stable/user/index.html#
- API Reference (Numpy): https://numpy.org/doc/stable/reference/index.html

The following websites can help you visualize your code, one line at a time (with certain limitations).
- Python Tutor (basic Python libraries only): https://pythontutor.com/visualize.html
- Pandas Tutor (you will need to write out the sample data you are working with): https://pandastutor.com/vis.html 

**Source:**


Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets