# Example 2: Hospital Charges for Inpatients in the U.S.

In this example we will explore a dataset hosted on Kaggle about hospital charges for inpatients across a variety of providers in the U.S. The data and some accompanying information can be found on Kaggle here:

https://www.kaggle.com/speedoheck/inpatient-hospital-charges


## Data Combining

In Example 1 we saw an introduction to data science in Python, with a walkthrough of some of the most commonly used functionality. As an avid R user as well, I included a port of the "5 verbs of dplyr" to Python using pandas:

* arrange: sort or order a data frame
* select: subset your dataset down to particular columns/variables of interest
* filter: subset your dataset down to particular rows/observations of interest
* mutate: create and append new variables/columns to your dataset
* summarize: compute aggregations, statistics, or other summaries of the data

with the obligatory inclusion of being able to "group by." Besides the parallel with dplyr functionality in R, these operations also constitute the core of most SQL queries. So, to say they are foundational is still probably an understatement!
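As a quick refresher, the verbs above can be sketched in pandas roughly as follows. The toy data frame and its column names (`state`, `year`, `charge`) are hypothetical and only serve to illustrate the calls; they are not taken from either dataset used in this example.

```python
import pandas as pd

# Hypothetical toy data, not taken from either dataset in this example
df = pd.DataFrame({
    "state": ["AL", "AK", "AZ", "AL", "AK", "AZ"],
    "year": [2013, 2013, 2013, 2014, 2014, 2014],
    "charge": [100.0, 250.0, 175.0, 120.0, 240.0, 180.0],
})

# arrange: sort rows by a column
arranged = df.sort_values("charge", ascending=False)

# select: keep only some columns
selected = df[["state", "charge"]]

# filter: keep only rows matching a condition
filtered = df[df["year"] == 2014]

# mutate: add a new column derived from existing ones
mutated = df.assign(charge_k=df["charge"] / 1000)

# summarize: reduce a column to a single statistic
overall_mean = df["charge"].mean()

# group by + summarize: one statistic per group
mean_by_state = df.groupby("state")["charge"].mean()
```

Each line returns a new object and leaves `df` untouched, which makes it easy to chain or inspect the steps one at a time.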

In this example we're going to combine the dataset from Example 1 with the hospital charges dataset linked above. Then, we're going to look at some basic visualizations using the matplotlib and altair libraries.

### Reading the Data In

In [None]:
import pandas as pd
import numpy as np

energy = pd.read_csv("Energy Census and Economic Data US 2010-2014.csv")
hospital = pd.read_csv("inpatientCharges.csv")

In [None]:
energy.head()

In [None]:
hospital.head()

### Combining These Two Datasets

Notice that these two datasets only have the state code in common, so we'll want to merge on that. The pandas library offers syntax and functionality very similar to SQL joins: left, right, inner, and outer. The entry point for all of these is the merge() function, but there are more specific ways of performing joins if you take a deeper look at the pandas documentation.
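To see how the join types differ before applying one to the real data, here is a minimal sketch on two tiny frames. The key column names match the ones used in this example (`StateCodes` and `Provider State`); the other columns, the values, and the `join` helper are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames; only the key column names come from the real data
left = pd.DataFrame({"StateCodes": ["AL", "AK", "AZ"],
                     "population": [1, 2, 3]})
right = pd.DataFrame({"Provider State": ["AL", "AL", "TX"],
                      "charge": [10.0, 20.0, 30.0]})

def join(how):
    """Merge the two toy frames on the state code with the given join type."""
    return pd.merge(left, right,
                    left_on="StateCodes", right_on="Provider State",
                    how=how)

inner = join("inner")       # only states present in both frames
left_join = join("left")    # every row of `left`, matched where possible
right_join = join("right")  # every row of `right`, matched where possible
outer = join("outer")       # every state from either frame

for name, frame in [("inner", inner), ("left", left_join),
                    ("right", right_join), ("outer", outer)]:
    print(name, len(frame))
```

Note that "AL" appears twice in `right`, so it produces two rows in every join. The same thing happens in the real merge: each state-level row of the energy data is repeated once per matching hospital record.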

In [None]:
alldata = pd.merge(energy, hospital, left_on = "StateCodes", right_on = "Provider State", how = "inner")

In [None]:
alldata.head()

### Cleaning the Hospital Charges

Notice that the hospital charge variables are actually stored as strings because they include dollar signs. Let's remove those (along with any commas) and convert the variables to numeric so we can compute with them.

In [None]:
# alldata["Average Covered Charges"].head()
# regex=True is needed in pandas >= 2.0, where str.replace treats the pattern literally by default
alldata["Average Covered Charges"] = pd.to_numeric(alldata["Average Covered Charges"].str.replace(r"\$|,", "", regex=True))
# alldata["Average Covered Charges"].head()

alldata["Average Total Payments"] = pd.to_numeric(alldata["Average Total Payments"].str.replace(r"\$|,", "", regex=True))
alldata["Average Medicare Payments"] = pd.to_numeric(alldata["Average Medicare Payments"].str.replace(r"\$|,", "", regex=True))

# alldata.head()
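To check the cleaning step in isolation, here is a small sketch on a hand-made Series. The sample values are invented; the pattern `[$,]` is just an equivalent way of writing "a dollar sign or a comma". The second part uses `errors="coerce"`, which is one option (an assumption about what you might want, not something the original notebook does) for turning unparseable entries into NaN instead of raising.

```python
import pandas as pd

# Hypothetical sample values in the same "$1,234.50" style as the charge columns
raw = pd.Series(["$32963.07", "$1,234.50", "$5,787.57"])

# Strip dollar signs and commas, then convert the strings to floats
cleaned = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True))
print(cleaned.dtype)

# With errors="coerce", entries that cannot be parsed become NaN
messy = pd.Series(["$10.00", "N/A"])
coerced = pd.to_numeric(messy.str.replace(r"[$,]", "", regex=True),
                        errors="coerce")
print(coerced.isna().sum())
```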

## Some Basic Visualizations Using matplotlib

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig1, ax1 = plt.subplots()

ax1.hist(alldata["Average Total Payments"])

In [None]:
fig2, ax2 = plt.subplots()
ax2.scatter(x = alldata["Average Medicare Payments"], y = alldata["Average Total Payments"], marker = 'o', c = alldata["Coast"])
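The plots above use matplotlib's defaults. As a rough sketch of how axis labels, titles, and a bin count can be added, the following uses randomly generated stand-in data (not the hospital data), so the variable names and distributions are assumptions made purely for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data; the real columns would come from `alldata`
rng = np.random.default_rng(0)
medicare = rng.gamma(shape=2.0, scale=3000.0, size=500)
total = medicare + rng.normal(loc=1000.0, scale=300.0, size=500)

# Two panels side by side: a histogram and a scatter plot
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

ax_hist.hist(total, bins=30)
ax_hist.set_xlabel("Average Total Payments ($)")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Distribution of total payments")

ax_scatter.scatter(medicare, total, marker="o", alpha=0.5)
ax_scatter.set_xlabel("Average Medicare Payments ($)")
ax_scatter.set_ylabel("Average Total Payments ($)")
ax_scatter.set_title("Medicare vs. total payments")

fig.tight_layout()
```

Swapping the synthetic arrays for the corresponding `alldata` columns should give labeled versions of the two figures above.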

## Some Basic Visualizations Using altair

In [None]:
import altair as alt

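# Subset for plotting (by default altair raises MaxRowsError above 5,000 rows)
subdata = alldata[7000:10000].copy()

# Boolean version of Coast, used for coloring in the next chart
subdata["Coast_c"] = subdata["Coast"] == 1

# Keep the chart as the last expression in the cell so the notebook renders it
alt.Chart(subdata).mark_point().encode(
    x = "Average Medicare Payments",
    y = "Average Total Payments",
    color = "Coast")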

In [None]:
alt.Chart(subdata).mark_point().encode(
    x = "Average Medicare Payments",
    y = "Average Total Payments",
    color = "Coast_c")

In [None]:
alt.Chart(subdata).mark_bar().encode(
    alt.X("Average Medicare Payments", bin = True),
    y = 'count()')

## Conclusion

We saw here how to perform simple, SQL-like merges/joins and how well the pandas library supports them.

We also saw some basic visualizations using the matplotlib and altair libraries. 

This was all still just the tip of the iceberg!

### Resources

There are tons of galleries of examples online for visualizations using both of these libraries, but here are the home pages for them to start at:

#### matplotlib

https://matplotlib.org/2.0.2/index.html

#### altair

https://altair-viz.github.io/index.html

### scikit-learn library for statistical analysis/modeling

https://scikit-learn.org/stable/