# Groups and visualization

## Anscombe’s quartet

by Koenraad De Smedt at UiB

---

Many formats can be read into a dataframe and there are many possible operations on dataframes. This notebook shows how to:

1.   Read a JSON file into a dataframe
2.   Make groups by values in a column
3.   Describe the groups
4.   Plot the data in each group

These operations are illustrated with [Anscombe’s quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), a well-known example in statistics illustrating the importance of visualization. The dataset for this example, `anscombe.json`, is stored as JSON, a structured text format. Here it holds a list of records, each of which resembles a Python *dict*.
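To make the comparison with a *dict* concrete, the following minimal sketch parses a two-record excerpt with Python's built-in `json` module. The excerpt is typed by hand and assumes that the file is a list of records with the keys `Series`, `X` and `Y`; the variable names are illustrative only.

```python
import json

# Hand-typed excerpt; the record layout (Series, X, Y) is an assumption
text = '''[
  {"Series": "I", "X": 10.0, "Y": 8.04},
  {"Series": "I", "X": 8.0, "Y": 6.95}
]'''

records = json.loads(text)  # parse the JSON text into Python objects
first = records[0]          # each JSON object becomes a dict

print(type(records).__name__, type(first).__name__)
print(first["Series"], first["X"], first["Y"])
```

The JSON array becomes a Python list and each JSON object becomes a dict, which is why a file in this layout maps naturally onto the rows of a dataframe.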

---

Use *pandas* to read the data from JSON into a dataframe. If you use Google Colab, the file is in the sample data (see Files in the left margin). If you are not using Colab, read the file from another location, such as the URL in the commented-out line of the code cell.

Inspect the data. There are four groups, identified by the values in the `Series` column (not to be confused with the Series datatype). Each group has 11 pairs of `X` and `Y` values.


In [None]:
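import pandas as pd

# On Google Colab, the file is part of the sample data
df = pd.read_json('/content/sample_data/anscombe.json')
# Outside Colab, read the dataset from another location instead, for example:
# df = pd.read_json('https://raw.githubusercontent.com/vega/vega/main/docs/data/anscombe.json')
df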
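
If neither location is available, the same call can be tried on JSON text held in memory. This sketch is not part of the original notebook: the four records are a hand-typed excerpt, `excerpt` and `small` are illustrative names, and the text is wrapped in `io.StringIO` because recent pandas versions expect a path or file-like object rather than a literal JSON string.

```python
from io import StringIO

import pandas as pd

# Hand-typed excerpt with two records from each of two groups
excerpt = '''[
  {"Series": "I", "X": 10.0, "Y": 8.04},
  {"Series": "I", "X": 8.0, "Y": 6.95},
  {"Series": "II", "X": 10.0, "Y": 9.14},
  {"Series": "II", "X": 8.0, "Y": 8.14}
]'''

small = pd.read_json(StringIO(excerpt))  # one row per record
print(small)

# Count the rows that belong to each group
print(small['Series'].value_counts())
```

Applied to the full dataframe, `df['Series'].value_counts()` is a quick way to confirm how many rows each group contains.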

Group by the values in the `Series` column and describe the groups. The groups have the same number of data points and nearly identical summary statistics, such as the means and standard deviations of X and Y.

In [None]:
quartet = df.groupby('Series')
quartet.describe()

The groups also have very similar correlations between X and Y (about 0.82 in each group).

In [None]:
quartet.corr()
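
The statistics discussed so far can also be gathered in one compact table per group. The sketch below is one possible way to do this, not the notebook's own method: it retypes Anscombe's published values so that it runs without the file, and the names `values`, `demo` and `summary` as well as the output column names are illustrative.

```python
import pandas as pd

# Anscombe's published values, typed in so the sketch runs without the file
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
values = {
    'I': (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
# Build one long dataframe with the same columns as the notebook's df
demo = pd.concat(
    [pd.DataFrame({'Series': name, 'X': xs, 'Y': ys})
     for name, (xs, ys) in values.items()],
    ignore_index=True,
)

groups = demo.groupby('Series')

# Named aggregation: one output column per (input column, function) pair
summary = groups.agg(
    n=('X', 'size'),
    mean_x=('X', 'mean'),
    mean_y=('Y', 'mean'),
    std_x=('X', 'std'),
    std_y=('Y', 'std'),
)

# Pearson correlation between X and Y, computed separately for each group
summary['corr_xy'] = groups[['X', 'Y']].apply(lambda g: g['X'].corr(g['Y']))

print(summary.round(2))
```

Each row of the table describes one group. All four groups have 11 points, a mean of 9 for X and roughly 7.5 for Y, and a correlation close to 0.82, so the numbers alone give no hint of how different the groups are.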

The surprising part comes when we plot the data points in each group. This illustrates the importance of data visualization, as pointed out by Anscombe.

In [None]:
# Draw one scatter plot per group
quartet.plot.scatter(x='X', y='Y')

Here is an alternative way to plot the data using Seaborn.

In [None]:
import seaborn as sns
sns.relplot(data=df, x='X', y='Y', col='Series', col_wrap=2)
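
The same comparison can be sketched directly in Matplotlib, with all four panels sharing their axes. As an addition that is not in the notebook, each panel also shows a least-squares line fitted with `numpy.polyfit`, which makes it visible that the groups share almost the same fitted line despite their different shapes. The data are retyped so that the block runs on its own, and `values`, `demo` and `fits` are illustrative names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Anscombe's published values, typed in so the sketch runs without the file
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
values = {
    'I': (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
demo = pd.concat(
    [pd.DataFrame({'Series': name, 'X': xs, 'Y': ys})
     for name, (xs, ys) in values.items()],
    ignore_index=True,
)

# Endpoints used to draw every fitted line over the same X range
line_x = np.array([demo['X'].min(), demo['X'].max()])

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
fits = {}
for ax, (name, group) in zip(axes.flat, demo.groupby('Series')):
    # A degree-1 polyfit returns the slope first, then the intercept
    slope, intercept = np.polyfit(group['X'], group['Y'], deg=1)
    fits[name] = (slope, intercept)
    ax.scatter(group['X'], group['Y'])
    ax.plot(line_x, intercept + slope * line_x, color='gray')
    ax.set_title(f'Series {name}: y = {intercept:.2f} + {slope:.2f}x')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
fig.tight_layout()
```

Sharing the axes keeps the four panels directly comparable, and the panel titles report the fitted intercept and slope for each group.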

### Exercises

1. (optional) Read [the article about the Datasaurus dozen](https://www.autodesk.com/research/publications/same-stats-different-graphs), which presents a more elaborate dataset in the same spirit. You can download the [data from Kaggle](https://www.kaggle.com/datasets/tombutton/datasaurusdozen) and write code to describe and visualize it, or use the [notebook on Kaggle](https://www.kaggle.com/code/tombutton/datasaurus-dozen).