# Coding Applications in Medicine: Data Science - Part 1

Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets

## Introduction

Data Science is a multidisciplinary field that integrates computation, math/statistics, and domain knowledge to understand the world and solve problems. 

The data science lifecycle consists of:
1. Question/problem formulation
2. Data acquisition and cleaning
3. Exploratory data analysis and visualization
4. Prediction and inference

In this notebook, we will explore a sample dataset on medical insurance and practice handling data using Pandas.

In [None]:
import pandas as pd
import numpy as np
import sqlite3, csv

## Pandas Basics 

After obtaining the data, the first step is to read the data into the Python notebook. We will use the Pandas data structure to store our data and use it for further analysis.

In [None]:
# Read the csv file into a Pandas data frame.
insuranceDF = pd.read_csv("data/insurance_modified.csv")

# Preview the dataframe by calling the variable (useful for double-checking your work).
# Note: Missing data is normally shown as NaN (similar to null).
insuranceDF

In [None]:
# Note: The index can be non-numerical and non-unique.

# Set the "sex" column to be the index of the dataframe.
insuranceDF.set_index("sex", inplace=True)
insuranceDF
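To see what a non-numerical, non-unique index means in practice, here is a minimal sketch on a small made-up table. The values and the names `toyDF` and `indexedDF` are illustrative assumptions and are not taken from the insurance dataset.

```python
import pandas as pd

# Hypothetical toy data, not from the insurance dataset.
toyDF = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "age": [19, 18, 28],
})

# Use the "sex" column as the index; without inplace=True, set_index returns a new dataframe.
indexedDF = toyDF.set_index("sex")

# The index now contains repeated labels, so it is not unique.
print(indexedDF.index.is_unique)

# Selecting a repeated label with loc returns every matching row.
print(indexedDF.loc["female"])
```

Because one label can match several rows, a non-unique index is convenient for lookups by group but cannot identify a single person.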

In [None]:
# For our use case, we would like to keep the default index setting.

insuranceDF.reset_index(inplace=True)
insuranceDF

In [None]:
# Use loc to select rows/columns by label name.

# View the information for person number 5 to 10 (inclusive).
insuranceDF.loc[5:10]

In [None]:
# View only the demographic information for person number 5 to 10 (inclusive).
insuranceDF.loc[5:10, "sex":"region"]

In [None]:
# View only the sex, bmi, and smoker information for everyone.
insuranceDF.loc[:, ["sex", "bmi", "smoker"]]

In [None]:
# Pandas can also use iloc to select rows/columns by number.
# Note: Counting starts at 0 and the end position is not included.

# View only the sex, bmi, and smoker information for the 5th to 10th person (inclusive).
insuranceDF.iloc[4:10, [0, 2, 4]]
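The difference between the two slicing rules is easy to miss, so here is a minimal sketch that applies the same slice with `loc` and `iloc`. The table `toyDF` is a made-up example with the default integer index, not the insurance data.

```python
import pandas as pd

# Hypothetical toy data with the default 0..n-1 integer index.
toyDF = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 25.7],
    "smoker": ["yes", "no", "no", "no", "no", "no"],
})

# loc slices by label and includes both endpoints: row labels 1, 2, and 3.
byLabel = toyDF.loc[1:3, "sex":"bmi"]

# iloc slices by position and excludes the end: row positions 1 and 2 only.
byPosition = toyDF.iloc[1:3, 0:2]

print(byLabel.shape)     # (3, 2)
print(byPosition.shape)  # (2, 2)
```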

In [None]:
# Check everyone to see if they are a smoker.

# Note: With a column name, the [] operator selects the same column as loc[:, name].
insuranceDF["smoker"] == "yes" 

# This is the same as 
### insuranceDF.loc[:, "smoker"] == "yes"

In [None]:
# Filter the dataframe to keep only the smokers, then show the first 10 of them.
insuranceDF.loc[insuranceDF["smoker"] == "yes"].head(10)

# This is the same as
### insuranceDF.loc[insuranceDF["smoker"] == "yes"].iloc[:10]

In [None]:
# Note: Most Pandas statistics skip NaN values by default (so they contribute nothing to a sum).

# Summary statistics reported by default for dataframes.
insuranceDF.describe()
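How NaN values enter a calculation can be checked on a tiny series. The sketch below uses a made-up series named `bmi` (an assumption for illustration) and the `skipna` parameter of the Pandas reduction methods.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
bmi = pd.Series([20.0, np.nan, 30.0])

# By default (skipna=True), reductions ignore NaN.
print(bmi.mean())   # 25.0, the mean of the two non-missing values
print(bmi.sum())    # 50.0, so the NaN adds nothing to the sum
print(bmi.count())  # 2, the number of non-missing values

# With skipna=False, a single NaN makes the result NaN.
print(bmi.mean(skipna=False))
```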

In [None]:
# Summary statistics reported by default for series (numerical).
insuranceDF["bmi"].describe()

In [None]:
# Summary statistics reported by default for series (non-numerical).
insuranceDF["smoker"].describe()

In [None]:
# Obtain the demographics of the 10 people with the lowest insurance charges.
insuranceDF.sort_values("charges", ascending=True).head(10)

In [None]:
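# Group all the people by region, aggregate by median value.
# numeric_only=True skips the text columns (e.g. "sex", "smoker"), which have no median.
insuranceDF.groupby("region").median(numeric_only=True)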

In [None]:
# Keep only regions where the mean BMI is greater than 30.

def isHighMeanBMI(df):
    return df["bmi"].mean() > 30

insuranceDF.groupby("region").filter(isHighMeanBMI)
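Note that `filter` keeps or drops whole groups rather than individual rows. The following sketch illustrates this on a made-up table (`toyDF` and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data, not from the insurance dataset.
toyDF = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "bmi": [28.0, 29.0, 25.0, 40.0],
})

# The function receives one sub-dataframe per group and returns a single boolean.
def isHighMeanBMI(df):
    return df["bmi"].mean() > 30

# "north" has a mean of 28.5 and is dropped.
# "south" has a mean of 32.5 and is kept, including its row with bmi 25.0.
kept = toyDF.groupby("region").filter(isHighMeanBMI)
print(kept)
```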

In [None]:
# Check how many NaN values we have for each column.
insuranceDF.isna().sum()

In [None]:
# Check the unique values in the children column.
insuranceDF["children"].unique()

In [None]:
# One way to handle missing data is to remove rows where there are NaN values.

# Drops rows where NaN values exist (returns a new dataframe; insuranceDF is unchanged).
insuranceDF.dropna(axis=0)

In [None]:
# Another way to handle missing data is to remove columns where there are NaN values.

# Drops columns where NaN values exist (returns a new dataframe; insuranceDF is unchanged).
insuranceDF.dropna(axis=1)

In [None]:
# Another way to handle missing data is to replace the NaN values with a reasonable value.

# Replace the NaN values in "children" column with 0.
insuranceDF["children"].fillna(0)

# This is the same as 
### insuranceDF["children"].replace(np.nan, 0)
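Like `dropna`, `fillna` returns a new object and leaves the original untouched unless the result is assigned back. A minimal sketch on a made-up column (the name `toyDF` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toyDF = pd.DataFrame({"children": [0.0, np.nan, 2.0]})

# fillna returns a new series; toyDF itself still contains the NaN.
filled = toyDF["children"].fillna(0)
print(toyDF["children"].isna().sum())  # 1

# Assign the result back to the column to keep the change.
toyDF["children"] = filled
print(toyDF["children"].isna().sum())  # 0
```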

In [None]:
# Add a column named "parent" that determines whether a given person is a parent
#  given the number of children they have.

def isParent(children):
    if children > 0:
        return "yes"
    else:
        return "no"

insuranceDF["parent"] = insuranceDF["children"].apply(isParent)
insuranceDF

In [None]:
# Drop the newly created "parent" column.
insuranceDF = insuranceDF.drop("parent", axis="columns")
insuranceDF

## Text Wrangling and Regex Basics

As you may have already noticed, data comes in many forms, including numerical, boolean (true or false), and text. Textual data often needs to be manipulated and/or cleaned up before it is useful in later analysis.

The following section shows ways to manipulate strings (textual data) in Pandas series.

In [None]:
# Change the text for the "region" column to be upper case and append the word "region" at the end.
insuranceDF["region"] = insuranceDF["region"].str.upper() + " region"
insuranceDF

In [None]:
# Replace the space character with the hyphen character for the "region" column.
insuranceDF["region"] = insuranceDF["region"].str.replace(' ', '-')
insuranceDF

In [None]:
# Split the "region" column by the hyphen character and take only the first element of the result.
insuranceDF["region"] = insuranceDF["region"].str.split('-').str[0]
insuranceDF

In [None]:
# Take the first character of each value in the "sex" column and convert it to upper case.
insuranceDF["sex"] = insuranceDF["sex"].str[0:1].str.upper()
insuranceDF

In [None]:
# Check if the region is in the north.
insuranceDF["In North Region"] = insuranceDF["region"].str.contains("NORTH")
insuranceDF

In [None]:
# Compute the string length of each value in the "smoker" column.
insuranceDF["smoker.len"] = insuranceDF["smoker"].str.len()
insuranceDF

Regex (regular expression) describes a sequence of characters that specifies a search pattern. Regex is a powerful way to search for specific patterns within text when done correctly, but it can be quite complex and confusing.


For more information: https://docs.python.org/3/howto/regex.html

Website to check/test your regex expression: https://regex101.com

In [None]:
# Pattern: 
# - First character is 'S'.
# - Followed by any character exactly two times.
# - Followed by any character that is not 'a' to 'z' at least once.
# - Followed by any word character zero times or more.
# - Followed by 'T'.
pattern = r"S.{2}[^a-z]+\w*T"

# Find all matches to the above pattern within the 'region' column.
insuranceDF["region"].str.findall(pattern)
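To trace how the pattern behaves, the sketch below applies it with Python's built-in `re` module to four upper-case region names. The list `regions` is an assumption about what the "region" column contains after the string edits above.

```python
import re

pattern = r"S.{2}[^a-z]+\w*T"

# Assumed upper-case region names (an illustration, not read from the data file).
regions = ["SOUTHWEST", "SOUTHEAST", "NORTHWEST", "NORTHEAST"]

# re.findall returns every non-overlapping match found in each string.
matches = {region: re.findall(pattern, region) for region in regions}
for region, found in matches.items():
    print(region, found)
```

The two "SOUTH" names match in full. The two "NORTH" names do not match, because their only 'S' is the second-to-last character, which leaves too few characters for the rest of the pattern.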

We have shown some of the most commonly used Pandas operations and functions, but we have barely scratched the surface! To learn more about the other Pandas functions, check the following:
- User Guide (Pandas): https://pandas.pydata.org/docs/user_guide/index.html#
- API Reference (Pandas): https://pandas.pydata.org/docs/reference/index.html

Other helpful guides and references to get you started:
- User Guide (Python): https://docs.python.org/3/tutorial/
- Library Reference (Python): https://docs.python.org/3/library/index.html
- Language Reference (Python): https://docs.python.org/3/reference/index.html
- User Guide (Numpy): https://numpy.org/doc/stable/user/index.html#
- API Reference (Numpy): https://numpy.org/doc/stable/reference/index.html

## Practice: Putting it all together

Find the average insurance cost for overweight individuals living in the northeast region. We have provided a step-by-step guide to help you solve this problem. Note that there are multiple ways to approach it.

In [None]:
# Source data for reset convenience.
insuranceDF = pd.read_csv("data/insurance.csv")

# Provided BMI category table.
bmiCategoriesDF = pd.read_csv("data/bmiCategories.csv")

# Step 0: Check the data you are working with and think about what manipulations are needed.

# Step 1: Modify the BMI category table to include the min and max BMI for each category.
# Step 1a: Obtain the min and max BMI from the BMI column.

### bmiCategoriesDF["min bmi"] = ________
### bmiCategoriesDF["max bmi"] = ________

# Step 1b: Reassign the non-numerical BMI values to reasonable numbers. We will use the at function.
# Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html

### bmiCategoriesDF.at[________, ________] = ________
### bmiCategoriesDF.at[________, ________] = ________

# Step 1c: Convert the min and max bmi column to be float data type.

### bmiCategoriesDF[________] = bmiCategoriesDF[________].astype(float)
### bmiCategoriesDF[________] = bmiCategoriesDF[________].astype(float)

# Step 2: Combine the two data frames. In this example, we will use the merge function.
# Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

### mergedDF = ________.merge(________, how="cross")

# Step 3: Filter out the data frame to retain only rows where the BMI matches the BMI category.

### mergedDF = mergedDF[(________) & (________)]
### mergedDF.reset_index(drop=True, inplace=True)

# Step 4: Group mergedDF by region and the BMI weight status and aggregate the result by the mean.

### mergedMeanDF = mergedDF.________.________.reset_index()

# Step 5: Filter the mergedMeanDF by the region and BMI weight status.

### resultDF = ________

# Step 6: Select only the relevant data.

### result = (________, ________, ________)

### print("The average insurance charge for an {0} person living in the {1} region is ${2}."
###       .format(result[0], result[1], result[2]))

Below is one way to solve this problem.

In [None]:
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#


In [None]:
# Source data for reset convenience.
insuranceDF = pd.read_csv("data/insurance.csv")
bmiCategoriesDF = pd.read_csv("data/bmiCategories.csv")

# Modify the BMI category table to include the min and max BMI for each category.
# Example using regex.
### bmiCategoriesDF["min bmi"] = bmiCategoriesDF["BMI"].str.split(r"\s(?:-|and)*\s*").str[0]
### bmiCategoriesDF["max bmi"] = bmiCategoriesDF["BMI"].str.split(r"\s(?:-|and)*\s*").str[1]
 
# Example without using regex.
bmiCategoriesDF["min bmi"] = bmiCategoriesDF["BMI"].str.split().str[0]
bmiCategoriesDF["max bmi"] = bmiCategoriesDF["BMI"].str.split().str[-1]

# Reassign the non-numerical BMI values to reasonable numbers.
bmiCategoriesDF.at[0, "min bmi"] = 0
bmiCategoriesDF.at[3, "max bmi"] = 100

# Convert the min and max bmi column to be float data type.
bmiCategoriesDF["min bmi"] = bmiCategoriesDF["min bmi"].astype(float)
bmiCategoriesDF["max bmi"] = bmiCategoriesDF["max bmi"].astype(float)

# Combine the two data frames.
#insuranceDF = insuranceDF.merge(bmiCategoriesDF, how="cross")
mergedDF = insuranceDF.merge(bmiCategoriesDF, how="cross")

# Filter the data frame to retain only rows where the BMI matches the BMI category.
mergedDF = mergedDF[(mergedDF["bmi"] >= mergedDF["min bmi"]) 
                          & (mergedDF["bmi"] < mergedDF["max bmi"])]
mergedDF.reset_index(drop=True, inplace=True)

# Group mergedDF by region and the BMI weight status and aggregate the result by the mean.
# numeric_only=True skips the text columns (e.g. "sex", "smoker"), which have no mean.
mergedMeanDF = mergedDF.groupby(["region", "Weight Status"]).mean(numeric_only=True).reset_index()

# Filter mergedMeanDF by the region and BMI weight status.
resultDF = mergedMeanDF[(mergedMeanDF["region"] == "northeast") 
                          & (mergedMeanDF["Weight Status"] == "Overweight")].reset_index(drop=True)

# Select only the relevant data.
resultFromDF = (resultDF["Weight Status"][0].lower(), 
                resultDF["region"][0], 
                str(round(resultDF["charges"][0], 2)))

print("The average insurance charge for an {0} person living in the {1} region is ${2}."
      .format(resultFromDF[0], resultFromDF[1], resultFromDF[2]))
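As one of the other possible approaches, the categories can be assigned with `pd.cut` instead of a cross merge. This is only a sketch on made-up data: the table `toyDF`, the bin edges, and the label names are assumptions, since the contents of `bmiCategories.csv` are not shown here.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself reads data/insurance.csv.
toyDF = pd.DataFrame({
    "region": ["northeast", "northeast", "northeast", "southwest"],
    "bmi": [26.0, 29.5, 31.0, 27.0],
    "charges": [1000.0, 3000.0, 9000.0, 5000.0],
})

# Assumed category boundaries and names (not read from bmiCategories.csv).
binEdges = [0, 18.5, 25, 30, 100]
labels = ["Underweight", "Healthy Weight", "Overweight", "Obesity"]

# right=False makes each bin include its lower edge and exclude its upper edge,
# matching the ">= min bmi" and "< max bmi" comparison used in the solution above.
toyDF["Weight Status"] = pd.cut(toyDF["bmi"], bins=binEdges, labels=labels, right=False)

# Boolean mask for overweight people in the northeast, then average their charges.
mask = (toyDF["region"] == "northeast") & (toyDF["Weight Status"] == "Overweight")
meanCharge = toyDF.loc[mask, "charges"].mean()
print(round(meanCharge, 2))  # 2000.0
```

This avoids building the intermediate cross-merged table, at the cost of hard-coding the boundaries rather than reading them from the category file.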

### Other Considerations 1: Data cleaning/handling outside of Pandas library

In this notebook, we modify the data directly after reading the data file into a Pandas dataframe. There may be instances where you are not using the Pandas library for the data analysis. In those situations, the data cleaning and processing steps will need to be done via scripts.

The general approach would be to:
1. Read the file and store the corresponding data in data structures
2. Modify the data with user-defined functions
3. Write the data into a new file

References:
- Python3 I/O tutorial: https://docs.python.org/3/tutorial/inputoutput.html
- Python3 CSV I/O tutorial: https://docs.python.org/3/library/csv.html

The following code block reproduces, without Pandas, the steps we used above to modify the bmiCategories table.

In [None]:
# User-defined function to extract a numerical bmi bound from a category string.
# category: String with the bmi categorization data.
# index: 0 for the min bmi and -1 for the max bmi.
def modifyBmi(category, index):
    # Split the bmi category text on whitespace and take the element at the index position.
    bmi = category.split()[index]
    # Check whether the extracted text is a non-negative number
    # (remove at most one '.' and check that only digits remain).
    if bmi.replace('.', '', 1).isdigit():
        return float(bmi)
    # Replacement for non-numerical min bmi.
    if index == 0:
        return 0.0
    # Replacement for non-numerical max bmi.
    return 100.0

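# Open the file with bmi categories data in read-only mode.
# The csv module documentation recommends newline='' when opening csv files.
with open('data/bmiCategories.csv', 'r', newline='') as f:
    # Create a reader that yields each row as a dictionary keyed by the header names.
    dr = csv.DictReader(f)
    # Create a list of tuples that contains the data obtained from each row dictionary.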
    # Here we are also introducing the max and min bmi using our previously defined function.
    rowList = [(row["BMI"], row["Weight Status"], modifyBmi(row["BMI"], 0), 
                modifyBmi(row["BMI"], -1)) for row in dr]
    # List of field names used for the header line of the csv we will write later.
    fieldnameList = dr.fieldnames + ["Min BMI", "Max BMI"]

# Open a new file for the modified bmi categories data in write mode.
# This creates the file, or empties it if it already exists.
with open('data/modifiedBmiCategories.csv', 'w', newline='') as f:
    # Writes in csv format (comma as delimiter).
    writer = csv.writer(f, delimiter=',')
    # Writes field names as the first row.
    writer.writerow(fieldnameList)
    # Writes the list of data in row format (new line for each row).
    for row in rowList:
        writer.writerow(row)
        

### Other Considerations 2: Working with databases

In this notebook, we are obtaining the data directly from a file. There may be instances where the data is stored in a database, and you need to export part of the data from the database. One of the most popular languages to handle databases is SQL. In general, if you also need to manipulate the queried data, it is better to do so with scripts after querying and exporting rather than through SQL/within database queries.

References:
- SQLite Documentation: https://www.sqlite.org/docs.html
- Python3 SQLite Tutorial: https://docs.python.org/3/library/sqlite3.html

The following code blocks reproduce the analysis above with SQL, using the modified bmiCategories table.

In [None]:
# Establishes a connection to the local database file.
# Normally, you would establish a connection to a remote database.
con = sqlite3.connect("data/insurance.db")
cur = con.cursor()

# Checks to see if "insurance" table already exists in the database. 
# If it already exists, then we will clear the data from the table (mainly used for reset convenience).
if cur.execute("SELECT COUNT(*) FROM sqlite_master " +
               "WHERE type = 'table' AND name = 'insurance'").fetchone()[0] > 0:
    cur.execute("DELETE FROM insurance")
    con.commit()

    # This will remove the table instead.
    ###cur.execute("DROP TABLE insurance")
    ###con.commit()

# Creates the "insurance" table (if it does not exist) based on the provided schema.
cur.execute("CREATE TABLE IF NOT EXISTS " + 
                "insurance (age NUMBER(3), sex VARCHAR2(10), bmi NUMBER(6, 3), children NUMBER(2), " + 
                           "smoker VARCHAR2(3), region VARCHAR2(9), charges NUMBER(11, 5));")

# Open the file with the insurance data in read-only mode.
with open("data/insurance.csv", 'r', newline='') as f:
    # Create a reader that yields each row as a dictionary keyed by the header names.
    dr = csv.DictReader(f)
    # Create a list of tuples, one per row, with the values in table-column order.
    rowList = [(row["age"], row["sex"], row["bmi"], row["children"],
                row["smoker"], row["region"], row["charges"]) for row in dr]

# Bulk insert operation into the "insurance" table.
cur.executemany("INSERT INTO insurance (age, sex, bmi, children, smoker, region, charges) " +
                    "VALUES (?, ?, ?, ?, ?, ?, ?);", rowList)
con.commit()

# Checks to see if "bmicategories" table already exists in the database. 
# If it already exists, then we will clear the data from the table (mainly used for reset convenience).
if cur.execute("SELECT COUNT(*) FROM sqlite_master " + 
               "WHERE type = 'table' AND name = 'bmicategories'").fetchone()[0] > 0:
    cur.execute("DELETE FROM bmicategories")
    con.commit()

    # This will remove the table instead.
    #cur.execute("DROP TABLE bmicategories")
    #con.commit()

# Creates the "bmicategories" table (if it does not exist) based on the provided schema.
cur.execute("CREATE TABLE IF NOT EXISTS " +
                "bmicategories (BMI VARCHAR2(15), 'Weight Status' VARCHAR2(15), " +
                               "'Min BMI' NUMBER(6, 3), 'Max BMI' NUMBER(4, 1));")

# Open the file with the modified bmi categories data in read-only mode.
with open("data/modifiedBmiCategories.csv", 'r', newline='') as f:
    # Create a reader that yields each row as a dictionary keyed by the header names.
    dr = csv.DictReader(f)
    # Create a list of tuples, one per row, with the values in table-column order.
    rowList = [(row["BMI"], row["Weight Status"], row["Min BMI"], row["Max BMI"]) for row in dr]

# Bulk insert operation into the "bmicategories" table.
cur.executemany("INSERT INTO bmicategories (BMI, 'Weight Status', 'Min BMI', 'Max BMI') " +
                    "VALUES (?, ?, ?, ?);", rowList)
con.commit()

In [None]:
# Query statement to obtain the data. SQL query statements are generally readable.
# CROSS JOIN pairs every insurance row with every bmi category, like merge(how="cross") above.
query = ("SELECT bmicategories.'Weight Status', insurance.region, AVG(insurance.charges) " + 
         "FROM insurance CROSS JOIN bmicategories " +
         "WHERE insurance.bmi >= bmicategories.'Min BMI' " +
         "AND insurance.bmi < bmicategories.'Max BMI' " +
         "GROUP BY insurance.region, bmicategories.'Weight Status' " +
         "HAVING insurance.region = 'northeast' " +
         "AND bmicategories.'Weight Status' = 'Overweight';")

# Execute the query statement and fetch the next row (in this case, the first and only row).
res = cur.execute(query).fetchone()

# Select only the relevant data.
resultFromSQL = (res[0].lower(), 
                 res[1], 
                 str(round(res[2], 2)))

print("The average insurance charge for an {0} person living in the {1} region is ${2}."
      .format(resultFromSQL[0], resultFromSQL[1], resultFromSQL[2]))

In [None]:
# Close the database connection.
con.close()
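When the queried data needs further manipulation, one option is to load the query result straight into a Pandas dataframe with `pd.read_sql_query`. The sketch below is self-contained and uses an in-memory database with a made-up three-row table, so the table contents and the name `queriedDF` are assumptions for illustration.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical three-row table, used only for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE insurance (region TEXT, bmi REAL, charges REAL)")
con.executemany(
    "INSERT INTO insurance (region, bmi, charges) VALUES (?, ?, ?)",
    [("northeast", 26.0, 1000.0), ("northeast", 29.5, 3000.0), ("southwest", 27.0, 5000.0)],
)
con.commit()

# Keep the SQL simple (select the rows of interest) and pass the filter value as a parameter.
queriedDF = pd.read_sql_query(
    "SELECT region, bmi, charges FROM insurance WHERE region = ?",
    con,
    params=("northeast",),
)
con.close()

# Do the remaining manipulation in Pandas.
print(queriedDF["charges"].mean())  # 2000.0
```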