In [None]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.set_theme()

In [None]:
penguins_df = sns.load_dataset("penguins")

print(f"Data for {len(penguins_df)} specific penguins")
display(penguins_df.head())

In [None]:
tips_df = sns.load_dataset("tips")

print(f"Data for {len(tips_df)} specific tips")
display(tips_df.head())

In [None]:
diamonds_df = sns.load_dataset("diamonds")

print(f"Data for {len(diamonds_df)} specific diamonds")
display(diamonds_df.head())

In [None]:
titanic_df = sns.load_dataset("titanic")

print(f"Data for {len(titanic_df)} specific passengers on the Titanic")
display(titanic_df.head())

In [None]:
fmri_df = sns.load_dataset("fmri")

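# One activation curve per (subject, event, region) combination; grouping by timepoint would count time steps instead
print(f"Data for {len(fmri_df.groupby(['subject', 'event', 'region']))} specific fmri activation curves")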
display(fmri_df.head())

# The fMRI dataset is really interesting.

For a given subject, there is a time series of signals from multiple brain regions:

In [None]:
one_subject = fmri_df[fmri_df['subject']=='s5']
one_subject.head()

In [None]:
sns.relplot(data=one_subject, x='timepoint', y='signal', kind='line', col='region', row='event')

## So the manifold is discrete across subjects, regions, and events
and the fiber is one-dimensional, over time.

The *dots* dataset is similar in structure but contains neuronal firing time series.  I don't know the details.

In [None]:
dots_df = sns.load_dataset("dots")
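# One firing curve per (align, choice, coherence) combination; grouping by time would count time steps instead
print(f"Data for {len(dots_df.groupby(['align', 'choice', 'coherence']))} specific neuronal firing curves")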
display(dots_df.head())

# Note that some of these data columns are Categorical
Categorical is an actual Pandas type.  It is *not* the same as just being strings.  If you know R, you can think of Categorical columns as being Factors in the R sense.

In [None]:
tips_df.dtypes

In [None]:
tips_df['smoker']
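
To make the difference from plain strings concrete, here is a minimal sketch on a made-up column (the values are an assumption for illustration, not taken from `tips_df`).  A Categorical stores each distinct level once, plus a small integer code per row that points into the list of levels.

```python
import pandas as pd

# Hypothetical toy column, not one of the seaborn datasets.
as_strings = pd.Series(["Yes", "No", "No", "Yes", "No"])
as_category = as_strings.astype("category")

# The distinct levels are stored once (inferred levels come out sorted)...
levels = as_category.cat.categories.tolist()
# ...and each row holds an integer code that indexes into those levels.
codes = as_category.cat.codes.tolist()

print(as_strings.dtype)    # a string/object dtype
print(as_category.dtype)   # category
print(levels)
print(codes)
```

This is the same idea as the integer codes behind an R factor.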

The following examples are mostly from the [Pandas Categorical Data documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).  If you are doing statistics in Pandas, you want to understand this.

In [None]:
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

df["B"] = df["A"].astype("category")

df

In [None]:
df.dtypes

In [None]:
df['B']

## Pandas provides methods to bin numerical data into categories

In [None]:
df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})

labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

df.head(10)

In [None]:
df['group']
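
The `right=False` argument decides which edge of each bin is closed.  A small sketch with hand-picked boundary values (chosen here purely for illustration) shows the effect, using the same edges and labels as above:

```python
import pandas as pd

values = pd.Series([0, 9, 10, 99])
edges = list(range(0, 105, 10))                      # 0, 10, ..., 100
labels = [f"{i} - {i + 9}" for i in range(0, 100, 10)]

# right=False: bins are [0, 10), [10, 20), ... so 10 falls in "10 - 19".
left_closed = pd.cut(values, edges, right=False, labels=labels)

# right=True (the default): bins are (0, 10], (10, 20], ... so 10 falls in
# "0 - 9" and 0 falls outside every bin, which gives NaN.
right_closed = pd.cut(values, edges, right=True, labels=labels)

print(pd.DataFrame({"value": values,
                    "right=False": left_closed,
                    "right=True": right_closed}))
```

With integer data and labels like "0 - 9", `right=False` is the choice that makes the labels tell the truth.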

## The levels in a category can be ordered
but sometimes they are not.

In [None]:
df["group"].cat.ordered

In [None]:
df["group"].head()

In [None]:
df["group"].head().cat.as_unordered()

In [None]:
s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))

s.sort_values(inplace=True) # we will get lexical ordering

s

In [None]:
try:
    print(s.min(), s.max())
except TypeError:
    print("see? It doesn't work.")

In [None]:
s = s.cat.set_categories(['b', 'c', 'a'], ordered=True)
s

In [None]:
s.min(), s.max()

In [None]:
s.sort_values(inplace=True)
s
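
Once the levels are ordered, comparisons work too.  A minimal sketch with made-up size labels (hypothetical data, chosen so that the meaningful order is clearly not the alphabetical one):

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],   # the order we declare
    ordered=True,
))

# Comparisons use the declared order, not alphabetical order.
at_least_medium = sizes >= "medium"
sorted_sizes = sizes.sort_values()

print(at_least_medium.tolist())
print(sorted_sizes.tolist())
print(sizes.min(), sizes.max())
```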

# Visualizing Statistical Relationships
This mostly explores `relplot()`, meaning "relationship plot".

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips_df).set(title="What's wrong with this plot?")

## Serious issue: points plotted later hide those plotted earlier

The labels are taken in category order, but the samples get plotted in the order in which they appear.  We can see this by shuffling the rows of the dataframe with the Pandas `DataFrame.sample()` method:

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips_df.sample(frac=1))
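
In case `sample(frac=1)` looks mysterious: it draws all of the rows without replacement, so the result holds the same rows in a random order.  A minimal sketch on a tiny made-up frame (`random_state` is set only to make the sketch reproducible):

```python
import pandas as pd

# Hypothetical toy frame, just to show what the shuffle does.
df = pd.DataFrame({"x": [1, 2, 3, 4], "group": ["a", "a", "b", "b"]})

# frac=1 means "sample 100% of the rows"; the original index labels come along.
shuffled = df.sample(frac=1, random_state=0)

print(shuffled)
# Sorting by the index recovers the original frame exactly.
print(shuffled.sort_index().equals(df))
```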

## This is a serious problem, and very common.
Transparency can help; larger plotting points make the problem worse.

One solution is to switch to a density plot like the one below (to be explained in the next section).


In [None]:
sns.displot(data=tips_df, x="total_bill", y="tip", hue="smoker", kind="kde")

## How long does the bootstrapping really take?

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri_df, col="region", hue="event");

## Bootstrapped Confidence Intervals vs. Standard Deviation

Consider this in the context of [Statistical estimation and error bars](https://seaborn.pydata.org/tutorial/error_bars.html)
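
As a rough guide to what the four `errorbar` options compute, here is a NumPy sketch on one synthetic sample.  The sample, the 95% level, the one-SD/one-SE widths, and `n_boot = 1000` are assumptions chosen to mirror what I understand to be seaborn's defaults; this is not seaborn's own code.  It also times the bootstrap, which is the only expensive part.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=50)   # synthetic data

mean = sample.mean()
sd = sample.std(ddof=1)                 # spread of the data
se = sd / np.sqrt(sample.size)          # uncertainty of the mean

sd_interval = (mean - sd, mean + sd)    # like errorbar="sd"
se_interval = (mean - se, mean + se)    # like errorbar="se"

# Like errorbar="pi": percentiles of the data themselves.
pi_interval = tuple(np.percentile(sample, [2.5, 97.5]))

# Like errorbar="ci": percentiles of the bootstrapped means.
n_boot = 1000
start = time.perf_counter()
resamples = rng.choice(sample, size=(n_boot, sample.size), replace=True)
boot_means = resamples.mean(axis=1)
ci_interval = tuple(np.percentile(boot_means, [2.5, 97.5]))
elapsed = time.perf_counter() - start

print("sd:", sd_interval)
print("se:", se_interval)
print("pi:", pi_interval)
print("ci:", ci_interval)
print(f"bootstrap took {elapsed:.4f} s")
```

The SD and percentile intervals describe the spread of the data; the SE and bootstrap intervals describe the uncertainty of the estimated mean, so they shrink as the sample grows.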

In [None]:
sns.__version__


### Turn off the error bars entirely

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri_df, col="region", hue="event", errorbar=None);

### *percentile interval* (non-parametric)

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri_df, col="region", hue="event", errorbar="pi");

### *confidence interval* (non-parametric)

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri_df, col="region", hue="event", errorbar="ci");

### *standard deviation* (parametric)

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri_df, col="region", hue="event", errorbar="sd");

### *standard error* (parametric)

In [None]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri_df, col="region", hue="event", errorbar="se");

# Visualizing Distributions of Data

This explores `displot()`, meaning "distribution plots".

We'll just look through this one, with a few comments:
* The question of how many bins to use is really important.  It comes up all the time.
* Do you understand how KDE works?
* Do you understand what marginal distributions are?
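
Two of these questions can be poked at with plain NumPy.  The sketch below uses synthetic data (an assumption for illustration): it shows how different bin-count rules disagree on the same sample, and that a marginal distribution is just the joint histogram summed over the other variable.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)

# Different rules of thumb give different numbers of bins for the same data.
bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(x, bins=rule)
    bin_counts[rule] = len(edges) - 1
print(bin_counts)

# Joint (2D) histogram of x and y on shared bin edges.
x_edges = np.histogram_bin_edges(x, bins="auto")
y_edges = np.histogram_bin_edges(y, bins="auto")
joint, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])

# Summing the joint counts over y gives the marginal histogram of x.
x_marginal = joint.sum(axis=1)
x_counts, _ = np.histogram(x, bins=x_edges)
print(np.array_equal(x_marginal, x_counts))
```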


# Plotting Categorical Data
This explores `catplot()`, meaning "category plot".  This is the figure-level interface to a variety of lower-level routines.  

**NOTE** that the Categorical nature of some data columns now becomes important!

Start with a subtle difference: the different categories are not connected in space, so let's change the grid background to make it less suggestive of a spatial relationship.

In [None]:
sns.catplot(x="day", y="total_bill", data=tips_df)

In [None]:
sns.set_theme(style="ticks", color_codes=True)
sns.catplot(x="day", y="total_bill", data=tips_df)

## Note the steps taken to control over-plotting.
The `jitter` parameter of `kind="strip"` and the point offsetting used by `kind="swarm"` both avoid the over-plotting problem we saw earlier.  Why won't they work for `relplot()`?
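
As a hint, here is the idea behind jitter in a few lines of NumPy.  This is a sketch of the concept, not seaborn's implementation; the day labels and the `jitter_width` value are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

levels = ["Thur", "Fri", "Sat", "Sun"]
days = ["Thur", "Fri", "Sat", "Sun", "Sat", "Sat"]   # hypothetical observations

# Each category gets an integer slot on the axis: 0, 1, 2, 3.
positions = np.array([levels.index(d) for d in days], dtype=float)

# Jitter: nudge each point sideways by a small random amount within its slot.
jitter_width = 0.1   # hypothetical value
jittered = positions + rng.uniform(-jitter_width, jitter_width, size=positions.size)

print(jittered)
# Rounding recovers the category, so no information was lost.
print(np.round(jittered) == positions)
```

The nudge is harmless here because the exact position inside a category slot carries no meaning.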

In [None]:
tips_df.dtypes

## Understanding boxplots and violinplots
Boxplots and related idioms like violinplots are really ubiquitous in real-world scientific data analysis.

In [None]:
samps = np.random.normal(size=1000)
fig, axes = plt.subplots(1,2)
sns.boxplot(data=samps, ax=axes[0])
sns.violinplot(data=samps, ax=axes[1])

Note that the y-axis range of the two figures is different!

### Parts of a boxplot:
* The box itself spans from Q1 to Q3, with the central line at the median.  Note that non-parametric statistics are used, not mean and standard deviation.
* The whiskers extend to the most extreme data points that still lie within $Q1 - 1.5 \times IQR$ and $Q3 + 1.5 \times IQR$, where $IQR = Q3 - Q1$.
* The points beyond that are *fliers*, presumed to be outliers.  Of course in this case they aren't really outliers, since the whole sample came from a standard normal distribution.  They just happen to be extreme values.
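
The parts listed above can be computed by hand.  A minimal NumPy sketch on a fresh standard normal sample, assuming the usual 1.5 × IQR convention (the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
samps = rng.normal(size=1000)

# The box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(samps, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = samps[(samps >= lower_fence) & (samps <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as a flier.
fliers = samps[(samps < lower_fence) | (samps > upper_fence)]

print(f"box: {q1:.2f} to {q3:.2f}, median {median:.2f}")
print(f"whiskers: {whisker_low:.2f} to {whisker_high:.2f}")
print(f"{fliers.size} fliers out of {samps.size} samples")
```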

### Why doesn't the violin plot look like a normal distribution?
Because KDE is a very crude process: the shape you get depends on the smoothing bandwidth and on the particular sample drawn.  Beware.
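
To see how much a KDE depends on the smoothing bandwidth, here is a hand-rolled Gaussian KDE in NumPy.  The function name and the two bandwidth values are assumptions for illustration; seaborn picks its bandwidth by a rule of thumb rather than by either of these.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one per sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samps = rng.normal(size=1000)
grid = np.linspace(-5, 5, 501)
dx = grid[1] - grid[0]

narrow = gaussian_kde_sketch(samps, grid, bandwidth=0.05)  # under-smoothed
wide = gaussian_kde_sketch(samps, grid, bandwidth=1.0)     # over-smoothed
true_pdf = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)

# Both estimates are (approximately) valid densities...
print(narrow.sum() * dx, wide.sum() * dx)
# ...but the over-smoothed peak is too low and the narrow one is wiggly.
print(true_pdf.max(), wide.max(), narrow.max())
```

Neither curve is wrong as a KDE; they just answer the question with different amounts of smoothing.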

## In this example, matplotlib semantics meet seaborn

In [None]:
sns.catplot(x="class", y="survived", hue="sex",
            palette={"male": "g", "female": "m"},
            markers=["^", "o"], linestyles=["-", "--"],
            kind="point", data=titanic_df)

# Visualizing Regression Models
This explores `regplot()` and `lmplot()`.  `lmplot()` is the figure-level interface for plotting regression models; `regplot()` is the axes-level function it is built on.

We'll just discuss this on the fly, since I don't know how much you know about general linear models.
