# Class 09: Today's Data Wrangling Example 
![Heart](data/valentines-day-2023-6753651837109573.3-law.gif)

The data comes from Kaggle: [https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset)
You can find out more about the original dataset [here](https://archive.ics.uci.edu/dataset/45/heart+disease).

The features are:

1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl (1=yes, 0=no)
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina (1=yes, 0=no)
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: A blood disorder called thalassemia 0 = normal; 1 = fixed defect; 2 = reversible defect

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

*target* (0 = no heart disease and 1 = heart disease)

<img src="https://www.wikidoc.org/images/5/53/SinusRhythmLabels.png" alt="EKG Image" width="500" height="auto" class="blog-image">

In [None]:
from datascience import *
import numpy as np
# import for plotting
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
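# Fix for datascience plots: older versions look for collections.Iterable,
# which Python 3.10 removed in favor of collections.abc.Iterable
import collections
import collections.abc
collections.Iterable = collections.abc.Iterable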

In [None]:
path = 'data/'
data = path + 'heart.csv'
heart = Table.read_table(data)
heart

# group(): Grouping is a way to summarize rows around one or more quantities

 Table.group(column_or_label, collect=None)
 
 collect: a function applied to values in other columns for each group
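
To make the role of `collect` concrete, here is a minimal plain-Python sketch of the idea behind grouping. It is not how the `datascience` library implements `group()`; the helper `group_sketch` and the sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import median

# Made-up (age, chol) rows for illustration only; not taken from heart.csv
rows = [(52, 212), (53, 203), (52, 230), (61, 248), (53, 197), (52, 199)]

def group_sketch(rows, collect=None):
    """Hypothetical helper: bucket values by key, then summarize each bucket."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    if collect is None:
        # Default behavior: count the rows for each distinct key
        return {key: len(values) for key, values in sorted(groups.items())}
    # Otherwise apply the collect function to each group's values
    return {key: collect(values) for key, values in sorted(groups.items())}

counts = group_sketch(rows)            # one count per distinct age
medians = group_sketch(rows, median)   # median chol per distinct age
print(counts)
print(medians)
```

With no `collect`, each distinct key gets a count; with `collect`, the same buckets are summarized by that function instead.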
 
 ### Grouping by age
We would expect the incidence of heart disease to increase with age.

In [None]:
# The default operation when grouping is count()
# The column you group by becomes the rows, one for each distinct value.
by_age = heart.group("age")
by_age

In [None]:
# Returns only one count column because the count is the same for all features.
by_age.sort('age') 

In [None]:
# The default operation when grouping is count(), but here we ask for median.
import numpy as np

# We get back multiple columns because each feature will have different values
heart.group("age", np.median)

## Does the median value of cholesterol increase with age for these patients?

In [None]:
heart.group("age", np.median).select('age', 'chol median')

In [None]:
heart.group("age", np.median).select('age', 'chol median').sort('age')

In [None]:
heart.group("age", np.median).select('age', 'chol median').sort('age').scatter('age', 'chol median')

In [None]:
heart.group("age", np.median).select('age', 'chol median').sort('age').scatter('age', 'chol median', fit_line=True)

**The same operations without chaining**

In [None]:
# Step 1: 
heart_group_by_age = heart.group("age", np.median)
heart_group_by_age

In [None]:
# Step 2
heart_group_by_age = heart_group_by_age.select('age', 'chol median')
heart_group_by_age

In [None]:
# Step 3
heart_group_by_age = heart_group_by_age.sort('age')
heart_group_by_age

In [None]:
# Step 4
heart_group_by_age.scatter('age', 'chol median')

In [None]:
# Step 5
heart_group_by_age.scatter('age', 'chol median', fit_line=True)

### Age distribution of study population

In [None]:
heart.hist("age")

In [None]:
bins = np.arange(20, 90, 5)
heart.hist("age", bins=bins)

# Note: "percent per unit" in this case means percent per year of age
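
As a rough worked example of "percent per year", the height of a histogram bar is the percent of patients in the bin divided by the bin width. The ages below are made up for illustration and are not from heart.csv.

```python
# Made-up ages for illustration only; not taken from heart.csv
ages = [29, 34, 41, 43, 44, 52, 53, 57, 58, 61]

# Look at a single 5-year bin covering ages 40 up to (but not including) 45
bin_start, bin_width = 40, 5

in_bin = sum(1 for a in ages if bin_start <= a < bin_start + bin_width)
percent_in_bin = 100 * in_bin / len(ages)       # share of all patients in this bin
percent_per_year = percent_in_bin / bin_width   # bar height on the histogram

print(in_bin, percent_in_bin, percent_per_year)
```

Multiplying the bar height by the bin width recovers the percent of patients in that bin, which is why the areas of all bars add up to 100.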

In [None]:
heart.hist("chol", group='target', bins=20)

A surprising result! Can you think of any explanations?

## Grouping on more than one column

In [None]:
# Grouping on multiple columns 
# Creates one row for each unique combination of the grouped column values
heart.group(["sex", "target"])
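
The idea of "one row for each unique combination" can be sketched in plain Python by counting tuples. This is only an illustration with made-up `(sex, target)` pairs, not the library's implementation.

```python
from collections import Counter

# Made-up (sex, target) pairs for illustration only; not taken from heart.csv
pairs = [(1, 0), (1, 1), (0, 1), (1, 0), (0, 1), (1, 1), (1, 0)]

# Each distinct (sex, target) combination becomes one entry with its count
combo_counts = Counter(pairs)

for (sex, target), count in sorted(combo_counts.items()):
    print(sex, target, count)
```

Only combinations that actually occur in the data get an entry; in this made-up sample there is no `(0, 0)` entry at all.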

### Another example of using a different collection function than the default count()

In [None]:
heart.group("target", np.min)

### You can also use your own function.
Let's say we want the range of values for each feature, grouped by whether or not the patient has heart disease.

In [None]:
def value_range(x):
    # Named value_range so that we don't shadow Python's built-in range()
    return max(x) - min(x)

In [None]:
# Test our function before we apply it!
from datascience import *

x = make_array(2, 4, 7, 9, 1)
value_range(x)

In [None]:
# Use our function with group()
heart.group("target", value_range)

# pivot(): Use the values of one column as the columns, and the values of another as the rows, of a new table.

According to Wikipedia, "A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way. Pivot tables are a technique in data processing. They enable a person to arrange and rearrange (or "pivot") statistics in order to draw attention to useful information."

We "pivot" the table on one of the columns.

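The distinct values of one column become the column labels of the new table, and the distinct values of a second column become its rows.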

 Table.pivot(columns, rows, values=None, collect=None, zero=None)

    Generate a table with a column for each unique value in columns, with rows for each unique value in rows. Each row counts/aggregates the values that match both row and column based on collect.
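
A minimal plain-Python sketch of the same idea follows. The helper `pivot_sketch` and the records are hypothetical, and filling empty cells with 0 is an assumption made here for illustration rather than a statement about the library's internals.

```python
from collections import defaultdict
from statistics import median

# Made-up (target, age, chol) records for illustration only; not from heart.csv
records = [(0, 52, 212), (1, 52, 230), (1, 52, 250), (0, 61, 248), (1, 58, 211)]

def pivot_sketch(records, collect, zero=0):
    """Hypothetical helper: one output row per row key, one cell per column key."""
    cells = defaultdict(list)
    for col_key, row_key, value in records:
        cells[(row_key, col_key)].append(value)
    col_keys = sorted({c for c, _, _ in records})
    row_keys = sorted({r for _, r, _ in records})
    table = {}
    for r in row_keys:
        # Aggregate the values matching both row and column; use zero if none match
        table[r] = [collect(cells[(r, c)]) if (r, c) in cells else zero
                    for c in col_keys]
    return col_keys, table

col_keys, table = pivot_sketch(records, median)
print(col_keys)
for age, row in table.items():
    print(age, row)
```

Row/column combinations with no matching records show up as the `zero` value, so a 0 in a pivot of medians can mean "no patients" rather than a true median of 0.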

In [None]:
# Look at the first three rows of starting data set
heart.show(3)

In [None]:
# Just as with group() the default collection operation is count()
#------------columns---rows----
heart.pivot('slope', 'target')

In [None]:
#-----------columns---rows---values-----apply-to-values---
heart.pivot('age', 'target', 'chol', collect=np.median)

In [None]:
#--------------------------column----row----values---apply-to-values---
chol_by_age = heart.pivot('target', 'age', 'chol', collect=np.median)
chol_by_age

In [None]:
chol_by_age.scatter('age')

In [None]:
# Eliminate the zeros (ages where one of the two groups has no patients)
chol_by_age = chol_by_age.where('0', are.above(0)).where('1', are.above(0))
chol_by_age.scatter('age')

In [None]:
chol_by_age.scatter('age', fit_line=True, overlay=False)

# Summary
Both group() and pivot() are ways to analyze and summarize large data tables. They are powerful techniques for exploratory data analysis.

# Postscript - A Mystery
According to the metadata:

13. thal: A blood disorder called thalassemia 0 = normal; 1 = fixed defect; 2 = reversible defect

In [None]:
heart.group('thal')

So why are there four values?