# Coding Applications in Medicine: Data Science - Part 1

Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets

## Introduction

Data Science is a multidisciplinary field that integrates computation, math/statistics, and domain knowledge to understand the world and solve problems. 

The data science lifecycle consists of:
1. Question/problem formulation
2. Data acquisition and cleaning
3. Exploratory data analysis and visualization
4. Prediction and inference

In this notebook, we will explore a sample dataset on medical insurance and practice handling data using Pandas.

In [None]:
import pandas as pd

## Pandas Basics 

After obtaining the data, the first step is to read it into the Python notebook. We will store the data in a Pandas dataframe and use it for further analysis.

In [None]:
# Read the CSV file into a Pandas dataframe
insuranceDF = pd.read_csv('data/insurance.csv')

# Preview the dataframe (useful for double-checking your work)
insuranceDF
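
Beyond displaying the whole dataframe, a few quick checks can help confirm that a file loaded as expected. The sketch below assumes a tiny hand-made dataframe, `sampleDF`, whose rows are made up and only mimic the column names of the insurance dataset, so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical rows that mimic the insurance dataset's columns (not real data)
sampleDF = pd.DataFrame({
    "age": [25, 40, 31, 52],
    "sex": ["female", "male", "female", "male"],
    "bmi": [22.5, 31.0, 27.4, 35.1],
    "children": [0, 2, 1, 3],
    "smoker": ["no", "yes", "no", "no"],
    "region": ["northeast", "southwest", "southeast", "northwest"],
    "charges": [2100.0, 18500.0, 4300.0, 11200.0],
})

# Number of rows and columns, as a (rows, columns) tuple
print(sampleDF.shape)

# Data type that pandas inferred for each column
print(sampleDF.dtypes)

# First two rows only, which is handy when the dataframe is large
firstTwo = sampleDF.head(2)
print(firstTwo)
```
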

In [None]:
# Note: The index can be non-numerical and non-unique

# Set the "sex" column to be the index
insuranceDF.set_index("sex", inplace=True)

In [None]:
# For our use case, we would like to keep the default index setting
insuranceDF.reset_index(inplace=True)
insuranceDF
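
To see what a non-unique index means in practice, here is a minimal sketch on a made-up three-row dataframe (`sampleDF` and its values are hypothetical). It also shows a side effect worth knowing: after `set_index` followed by `reset_index`, the former index column comes back as the first column.

```python
import pandas as pd

# Hypothetical rows (not real data); "sex" starts out as the second column
sampleDF = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["female", "male", "female"],
})

# With "sex" as the index, labels repeat, so one label can match several rows
bySex = sampleDF.set_index("sex")
femaleRows = bySex.loc["female"]
print(femaleRows)

# reset_index() restores the default 0, 1, 2, ... index and
# moves "sex" back into a regular column, now in first position
restored = bySex.reset_index()
print(restored)
```
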

In [None]:
# Use loc to select rows/columns by label name

# View the information for person number 5 to 10 (inclusive)
insuranceDF.loc[5:10]

In [None]:
# View only the demographic information for person number 5 to 10 (inclusive)
insuranceDF.loc[5:10, "sex":"region"]

In [None]:
# View only the sex, bmi, and smoker information for everyone
insuranceDF.loc[:, ["sex", "bmi", "smoker"]]

In [None]:
# Pandas can also use iloc to select rows/columns by integer position
# Note: Counting starts at 0 and the end of a slice is not inclusive

# View only the sex, bmi, and smoker information for the 5th to 10th person (inclusive)
insuranceDF.iloc[4:10, [0, 2, 4]]
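
The difference in how `loc` and `iloc` treat the end of a slice is easy to mix up. The following sketch compares the two on a small made-up column (`sampleDF` and its values are hypothetical) that uses the default integer index.

```python
import pandas as pd

# Hypothetical values (not real data), with the default index 0..5
sampleDF = pd.DataFrame({"bmi": [21.0, 24.5, 30.1, 27.3, 35.2, 19.8]})

# loc slices by label and includes both endpoints: labels 1, 2, and 3
byLabel = sampleDF.loc[1:3]

# iloc slices by position and excludes the end: positions 1 and 2
byPosition = sampleDF.iloc[1:3]

print(len(byLabel), len(byPosition))
```
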

In [None]:
# Determine if the person is female for everyone

# Note: Selecting a single column with [] is the same as using loc with all rows
insuranceDF["sex"] == "female" 

# This is the same as 
# insuranceDF.loc[:, "sex"] == "female"

In [None]:
# Filter the dataframe to include only the first 10 smokers on the list
insuranceDF.loc[insuranceDF["smoker"] == "yes"].head(10)

# This is the same as
# insuranceDF.loc[insuranceDF["smoker"] == "yes"].iloc[:10]
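
Filters can also combine several conditions. As a minimal sketch on made-up rows (`sampleDF` and its values are hypothetical), each comparison produces a boolean series, and the series are combined with `&` (and), `|` (or), or `~` (not).

```python
import pandas as pd

# Hypothetical rows (not real data)
sampleDF = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "bmi": [31.2, 28.4, 24.9, 33.0],
    "smoker": ["yes", "yes", "no", "no"],
})

# Each comparison yields one True/False value per row
isSmoker = sampleDF["smoker"] == "yes"
isHighBMI = sampleDF["bmi"] > 30

# Rows where both conditions hold
smokersWithHighBMI = sampleDF.loc[isSmoker & isHighBMI]

# Rows where at least one condition holds
smokerOrHighBMI = sampleDF.loc[isSmoker | isHighBMI]

print(smokersWithHighBMI)
print(smokerOrHighBMI)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, for example `(df["bmi"] > 30) & (df["smoker"] == "yes")`, because `&` binds more tightly than `>` and `==`.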

In [None]:
# Summary statistics reported by default for dataframes (numerical columns only)
insuranceDF.describe()

In [None]:
# Summary statistics reported by default for a numerical series
insuranceDF["bmi"].describe()

In [None]:
# Summary statistics reported by default for a non-numerical series
insuranceDF["smoker"].describe()

In [None]:
# Obtain the demographics of the 10 people with the lowest insurance charges
insuranceDF.sort_values("charges", ascending=True).head(10)

In [None]:
# Add a column named "parent" that determines whether a given person is a parent given the number of children they have.

def isParent(children):
    if children > 0:
        return "yes"
    else:
        return "no"

insuranceDF["parent"] = insuranceDF["children"].apply(isParent)
insuranceDF
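
`apply` calls the function once per value, which is flexible but can be slow on large datasets. For a simple yes/no rule like this one, a vectorized alternative is `numpy.where`. The sketch below is one possible approach, shown on a made-up column (`sampleDF` and its values are hypothetical), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical values (not real data)
sampleDF = pd.DataFrame({"children": [0, 2, 1, 0]})

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once
sampleDF["parent"] = np.where(sampleDF["children"] > 0, "yes", "no")

print(sampleDF)
```
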

In [None]:
# Drop the newly created "parent" column
insuranceDF = insuranceDF.drop("parent", axis="columns")
insuranceDF

In [None]:
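# Group all the people by region and take the median of each numerical column
# Note: numeric_only=True skips text columns such as "sex" and "smoker", which have no median
insuranceDF.groupby("region").median(numeric_only=True)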
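
When different columns call for different summaries, pandas named aggregation is one option. The sketch below assumes a made-up four-row dataframe (`sampleDF`, its values, and the output column names `medianBMI` and `meanCharges` are all hypothetical).

```python
import pandas as pd

# Hypothetical rows (not real data)
sampleDF = pd.DataFrame({
    "region": ["northeast", "northeast", "southwest", "southwest"],
    "bmi": [22.0, 26.0, 31.0, 35.0],
    "charges": [1000.0, 3000.0, 2000.0, 6000.0],
})

# Each keyword names an output column; the tuple is (input column, aggregation)
summary = sampleDF.groupby("region").agg(
    medianBMI=("bmi", "median"),
    meanCharges=("charges", "mean"),
)

print(summary)
```
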

In [None]:
# Keep only regions where the mean BMI is greater than 30

def isHighMeanBMI(df):
    return df['bmi'].mean() > 30

insuranceDF.groupby("region").filter(isHighMeanBMI)
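
Note that `filter` on a grouped dataframe returns the original rows of every group that passes the test, not one summary row per group. A minimal sketch on made-up rows (`sampleDF` and its values are hypothetical) makes this visible:

```python
import pandas as pd

# Hypothetical rows (not real data): northeast mean BMI is 24, southwest mean BMI is 33
sampleDF = pd.DataFrame({
    "region": ["northeast", "northeast", "southwest", "southwest"],
    "bmi": [22.0, 26.0, 31.0, 35.0],
})

def isHighMeanBMI(df):
    # Called once per region with that region's rows
    return df["bmi"].mean() > 30

# Both southwest rows are kept, including the original row labels
kept = sampleDF.groupby("region").filter(isHighMeanBMI)
print(kept)
```
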

We have shown some of the most commonly used pandas operations/functions, but we have barely scratched the surface! To learn more about all the other existing pandas functions, go to https://pandas.pydata.org/ for the user guide and API reference.

## Text Wrangling and Regex Basics

As you may have already noticed, there are many different forms of data, including numerical, boolean (true or false), text, etc. In general, textual data may need to be manipulated and/or cleaned up in order to be useful in later analysis.

The following section shows ways to manipulate strings (textual data) in Pandas series through the `.str` accessor.

In [None]:
# Change the text for the "region" column to be upper case and append the word "region" at the end
insuranceDF["region"] = insuranceDF["region"].str.upper() + " region"

In [None]:
# Replace the space character with the hyphen character for the "region" column
insuranceDF["region"] = insuranceDF["region"].str.replace(' ', '-')
insuranceDF

In [None]:
# Split the "region" column by the hyphen character and take only the first element of the result
insuranceDF["region"] = insuranceDF["region"].str.split('-').str[0]
insuranceDF

In [None]:
# Take the first character of the "sex" column and convert it to upper case
insuranceDF["sex"] = insuranceDF["sex"].str[0:1].str.upper()
insuranceDF

In [None]:
# Check if the region is in the north
insuranceDF["In North Region"] = insuranceDF["region"].str.contains("NORTH")
insuranceDF

In [None]:
# Check the length of the string of the "smoker" column
insuranceDF["smoker.len"] = insuranceDF["smoker"].str.len()
insuranceDF
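
The string methods above can be chained, which is useful when text needs cleaning before it can be compared. The sketch below assumes a made-up series of inconsistently typed region names (`regions` and its values are hypothetical and do not come from the dataset).

```python
import pandas as pd

# Hypothetical messy entries (not from the dataset)
regions = pd.Series(["  SouthWest", "northeast ", "NORTHWEST", "southeast"])

# Strip surrounding whitespace, then lower-case so equal regions compare equal
cleaned = regions.str.strip().str.lower()

# startswith returns a boolean series, much like contains
isNorth = cleaned.str.startswith("north")

print(cleaned)
print(isNorth)
```
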

Regex (regular expression) describes a sequence of characters that specifies a search pattern. Regex is a powerful way to search for specific patterns within text when done correctly, but it can be quite complex and confusing.


For more information: https://docs.python.org/3/howto/regex.html

Website to check/test your regular expression: https://regex101.com

In [None]:
# Pattern: 
# - First character is 'S'
# - Followed by any character (except a newline) exactly two times
# - Followed by any character that is not 'a' to 'z' at least once
# - Followed by any word character zero times or more
# - Followed by 'T'
pattern = r"S.{2}[^a-z]+\w*T"

# Find all matches to the above pattern within the 'region' column
insuranceDF["region"].str.findall(pattern)
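
To trace how the pattern behaves, it can help to try it on individual strings with Python's built-in `re` module. The sketch below assumes the region values look like `SOUTHWEST`, `NORTHEAST`, and so on after the string operations above; the mixed-case `Southwest` entry is added purely for illustration.

```python
import re

pattern = r"S.{2}[^a-z]+\w*T"

# Assumed region values after the earlier string operations, plus one mixed-case example
samples = ["SOUTHWEST", "SOUTHEAST", "NORTHWEST", "NORTHEAST", "Southwest"]

# re.findall returns every non-overlapping match found in the string
matches = {text: re.findall(pattern, text) for text in samples}

for text, found in matches.items():
    print(text, found)
```

The two `SOUTH...` values match in full. In the `NORTH...` values the only `S` is followed by a single character, so `.{2}` cannot be satisfied. In `Southwest`, the third character after `S` is a lower-case letter, which `[^a-z]` rejects.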