
# **Visualizing Distributions Examples in Vega-Altair**

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import altair as alt

Load the diamonds dataset

In [2]:
url = "https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/diamonds/diamonds.csv"
diamonds = pd.read_csv(url)

By default, Altair raises a `MaxRowsError` for datasets with more than 5000 rows. For this example, we get around this by disabling the max-rows check, but if you plan to use Altair in the future, you should read the documentation page on [Large Datasets](https://altair-viz.github.io/user_guide/large_datasets.html).


In [3]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

Set the level order of the ordinal attributes (`cut`, `color`, and `clarity`), which pandas reads in as plain strings.

In [4]:
diamonds['cut'] = pd.Categorical(diamonds.cut,
  ordered = True,
  categories =  ["Fair", "Good", "Very Good", "Premium", "Ideal" ])

diamonds['color'] = pd.Categorical(diamonds.color,
  ordered = True,
  categories =  ["D", "E", "F", "G", "H", "I", "J"])

diamonds['clarity'] = pd.Categorical(diamonds.clarity,
  ordered = True,
  categories =  ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"])
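
To see what the ordered categorical buys us, here is a minimal sketch on a made-up stand-in for the `cut` column (the toy values and the variable names are assumptions for illustration, not part of the notebook). It contrasts alphabetical sorting of plain strings with level-based sorting, and shows that ordered categoricals support comparisons against a level.

```python
import pandas as pd

# Toy stand-in for the `cut` column (values are made up for illustration).
cut = pd.Series(["Ideal", "Fair", "Premium", "Good", "Very Good", "Fair"])

levels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_ordered = pd.Series(pd.Categorical(cut, categories=levels, ordered=True))

# Plain strings sort alphabetically; the ordered categorical sorts by level.
alphabetical = sorted(cut.unique())
by_level = cut_ordered.sort_values().unique().tolist()

# Ordered categoricals also support comparisons against a level.
at_least_very_good = int((cut_ordered >= "Very Good").sum())

print(alphabetical)
print(by_level)
print(at_least_very_good)
```

Without the explicit level order, "Ideal" would sort between "Good" and "Premium", which is not the quality order.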

In [5]:
diamonds.head()

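|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |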


## **Histogram**

To help with the large data problem, we can make sure that we're only passing in necessary columns to Altair (as described in [Preaggregate and Filter in pandas](https://altair-viz.github.io/user_guide/large_datasets.html#preaggregate-and-filter-in-pandas)).

To do this, instead of just passing in `diamonds`, we pass in only the `price` column with `diamonds[['price']]`.

In [6]:
alt.Chart(diamonds[['price']]).properties(width = 500, height = 300).mark_bar().encode(
    x= alt.X('price:Q', bin=alt.Bin(step=500), title='price ($)'),
    y= alt.Y('count()')
)
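
Passing a single column still sends every row to the browser. The pre-aggregation idea from the Large Datasets page can be taken one step further by counting the bins in pandas first. The following is a minimal sketch under stated assumptions: the prices are synthetic (a seeded log-normal sample standing in for `diamonds['price']`), the names `binned`, `bin_start`, and `bin_end` are hypothetical, and the bins are assumed to start at multiples of the step.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" standing in for diamonds['price'].
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=10_000), name="price")

step = 500  # same bin width as alt.Bin(step=500)

# Left edge of the bin each price falls into: [0, 500), [500, 1000), ...
bin_start = (np.floor(prices / step) * step).astype(int)

# One row per non-empty bin instead of one row per diamond.
binned = (
    bin_start.value_counts()
    .sort_index()
    .rename_axis("bin_start")
    .reset_index(name="count")
)
binned["bin_end"] = binned["bin_start"] + step

print(binned.head())
```

A small table like `binned` could then be drawn with something like `alt.Chart(binned).mark_bar().encode(x='bin_start:Q', x2='bin_end:Q', y='count:Q')`, which keeps the chart specification far below the 5000-row limit.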

Change the bin width to $1000.

In [7]:
alt.Chart(diamonds[['price']]).properties(width = 500, height = 300).mark_bar().encode(
    x= alt.X('price:Q', bin=alt.Bin(step=1000), title='price ($)'),
    y= alt.Y('count()')
)

Change the bin width to $100.

In [8]:
alt.Chart(diamonds[['price']]).properties(width = 500, height = 300).mark_bar().encode(
    x= alt.X('price:Q', bin=alt.Bin(step=100), title='price ($)'),
    y= alt.Y('count()')
)

## **Boxplot**

We're again looking at the price of diamonds. This time, we'll use boxplots (`mark_boxplot()`).

In [9]:
alt.Chart(diamonds[['price']]).mark_boxplot().encode(
    y = alt.Y("price:Q")
)

Now, let's look at these by cut and have a separate box plot for each cut type. For this, we need to pass in both the `price` and `cut` columns from the data.

In [10]:
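alt.Chart(diamonds[['price', 'cut']]).mark_boxplot().encode(
    # Ordinal type with an explicit sort keeps the quality order set earlier;
    # with "cut:N" the categories would be sorted alphabetically.
    x = alt.X("cut:O", sort=["Fair", "Good", "Very Good", "Premium", "Ideal"]),
    y = alt.Y("price:Q")
).properties(width = 400)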
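
To make the parts of each box concrete, here is a minimal sketch of the statistics a Tukey-style boxplot summarizes, computed in pandas on a tiny made-up sample (not the diamonds data). It assumes whiskers that extend to the most extreme points within 1.5 × IQR of the quartiles, which is the documented default `extent` for `mark_boxplot()`. The helper `box_stats` is hypothetical, and the exact quantile interpolation may differ slightly from what Vega-Lite computes.

```python
import pandas as pd

# Tiny made-up sample (not the diamonds data).
df = pd.DataFrame({
    "cut": ["Fair"] * 6 + ["Ideal"] * 6,
    "price": [1, 2, 3, 4, 5, 100, 10, 20, 30, 40, 50, 60],
})

def box_stats(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Quartiles, whisker ends, and outlier count for one group."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme points that are still inside the fences.
    inside = s[(s >= lo_fence) & (s <= hi_fence)]
    return pd.Series({
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "n_outliers": len(s) - len(inside),
    })

# One row of statistics per cut.
summary = pd.DataFrame(
    {cut: box_stats(prices) for cut, prices in df.groupby("cut")["price"]}
).T

print(summary)
```

In this toy sample the value 100 lies beyond the upper fence for "Fair", so it would be drawn as an individual outlier point rather than extending the whisker.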

## **Empirical CDF**

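For the empirical CDF, we use `transform_density()` with `cumulative=True` and draw a line. Note that this plots a kernel-smoothed estimate of the cumulative distribution rather than the exact step-function ECDF.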

In [11]:
alt.Chart(diamonds[['price']]).transform_density(density = 'price',
    cumulative=True  # Return the cumulative distribution estimate instead of the density
).mark_line().encode(
    x = alt.X("value:Q", title = "price ($)"),
    y = alt.Y("density:Q")
).properties(width = 500, height = 300)
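
Because `transform_density()` is based on kernel density estimation, the curve above is smoothed. For comparison, here is a minimal sketch of the exact empirical CDF computed in NumPy and pandas, using the five prices shown in the `diamonds.head()` output as a tiny sample. The helper `ecdf` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

def ecdf(values) -> pd.DataFrame:
    """Exact empirical CDF: the fraction of observations <= each distinct value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    out = pd.DataFrame({"value": x, "ecdf": y})
    # For tied values, keep the last row so the step reaches its full height.
    return out.drop_duplicates("value", keep="last").reset_index(drop=True)

sample = [326, 326, 327, 334, 335]  # prices from the diamonds.head() output
ecdf_df = ecdf(sample)

print(ecdf_df)
```

A table like `ecdf_df` could then be drawn as a step function with something like `alt.Chart(ecdf_df).mark_line(interpolate='step-after').encode(x='value:Q', y='ecdf:Q')`.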