# Anscombe's quartet

Anscombe's quartet (["Graphs in Statistical Analysis", Anscombe, F. J. (1973)](https://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf)) is a set of four distinct datasets, each consisting of 11 `(x, y)` pairs, that share nearly identical summary statistics:

| Property | Value | Accuracy |
| :- |:-------------: | -: |
| Mean of x : $\overline{x}$ | 9 | exact |
| Sample variance of x : $s_x^{2}$ | 11 | exact |
| Mean of y : $\overline{y}$ | 7.50 | to 2 decimal places |
| Sample variance of y : $s_y^{2}$ | 4.125 | ±0.003 |
| Correlation between x and y | 0.816 | to 3 decimal places |
| Linear regression line | $y = 3.00 + 0.500x$ | to 2 and 3 decimal places, respectively |
| Coefficient of determination of the linear regression : $R^{2}$ | 0.67 | to 2 decimal places |


In this notebook we will compute these statistics for Anscombe's quartet and visualize the data.
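
To make the definitions behind the table concrete, here is a minimal sketch that computes the mean, sample variance, and Pearson correlation by hand for dataset I only. The values are hard-coded from the table in Anscombe (1973) so the snippet runs without any download; the variable names are illustrative and not part of the notebook's later code.

```python
from math import sqrt

# Dataset I of the quartet, hard-coded from Anscombe (1973)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
n = len(x)

# Means
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample variances use the n - 1 denominator
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)

# Sample covariance and Pearson correlation
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
r = cov_xy / sqrt(var_x * var_y)

print(f"mean x = {mean_x:.2f}, sample variance x = {var_x:.2f}")
print(f"mean y = {mean_y:.2f}, sample variance y = {var_y:.3f}")
print(f"correlation = {r:.3f}")
```

The printed values should agree with the table within the stated accuracy. The pandas methods used later in the notebook (`mean()`, `var()`, `corr()`) apply the same definitions, with `var()` defaulting to the `n - 1` denominator.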

## Reading the data
We first load the Anscombe dataset with seaborn's `load_dataset(...)`, which fetches the CSV file and parses it with the pandas [`read_csv(...)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function. The dataset consists of 44 rows and 3 columns: the `x`, `y` pairs and a `dataset` field indicating which member of the quartet (`I/II/III/IV`) each pair belongs to. Let's take a look at it:

In [None]:
import pandas as pd          # import pandas
import seaborn as sns        # import seaborn, we will be using anscombe dataset and sns.FacetGrid

pd.set_option("display.max_rows", 8)      # only display up to 8 rows when printing dataframes (reduce visual clutter)
anscome_df = sns.load_dataset("anscombe") # load anscombe dataset from seaborn
anscome_df['x'] = anscome_df['x'].astype(float)
anscome_df['y'] = anscome_df['y'].astype(float)
anscome_df                                # check data has been loaded

## Visualizing data sets

To inspect the Anscombe dataset visually, let's plot each point cloud using scatter plots from [matplotlib](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.scatter.html) and multi-plot grids from [seaborn](https://seaborn.pydata.org/tutorial/axis_grids.html). Although the four datasets share the aforementioned statistics, notice how different the plots are from each other.

In [None]:
import matplotlib.pyplot as plt  # import pyplot to create scatter plots
sns.set(style="ticks")

# Define a grid of plots: each member of the quartet gets its own column and color
g = sns.FacetGrid(anscome_df, col="dataset", hue="dataset")
g.map(plt.scatter, "x", "y", alpha=.7)  # render the scatter plots

## Computing statistics
Now, let's verify that the statistics are shared across all four datasets.

In [None]:
anscome_dataset_labels = anscome_df['dataset'].unique() # labels for each dataset

# First, compute the means (select the numeric columns: recent pandas raises a TypeError on the string `dataset` column)
anscome_mean_of_xy = {q: anscome_df[anscome_df['dataset'] == q][['x', 'y']].mean() for q in anscome_dataset_labels}

# Second, compute the sample variances
anscome_var_of_xy  = {q: anscome_df[anscome_df['dataset'] == q][['x', 'y']].var() for q in anscome_dataset_labels}

# Third, compute the correlation between x and y
anscome_corr_of_xy = {q: anscome_df[anscome_df['dataset'] == q]['x'].corr(anscome_df[anscome_df['dataset'] == q]['y']) for q in anscome_dataset_labels}

In [None]:
# Show statistics (Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it)
pd.concat([pd.DataFrame.from_dict(anscome_mean_of_xy).rename({"x": "Mean of x", "y": "Mean of y"}),
           pd.DataFrame.from_dict(anscome_var_of_xy).rename({"x": "Variance of x", "y": "Variance of y"}),
           pd.DataFrame({k: [v] for k, v in anscome_corr_of_xy.items()}).rename({0: "Correlation between x and y"})])\
          .style.format(precision=3).set_caption("Anscombe's quartet statistics")
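
The per-dataset dictionaries above can also be expressed as a single `groupby`. The sketch below is one possible alternative, not the notebook's own method. To stay self-contained it rebuilds only datasets I and II from the values in Anscombe (1973) instead of calling `sns.load_dataset`, and the names `df`, `summary`, and `corr` are illustrative.

```python
import pandas as pd

# Datasets I and II share the same x values (hard-coded from Anscombe, 1973)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_I = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_II = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Long format, mirroring the layout of the seaborn dataset
df = pd.DataFrame({
    "dataset": ["I"] * len(x) + ["II"] * len(x),
    "x": [float(v) for v in x + x],
    "y": y_I + y_II,
})

# Group once and restrict to the numeric columns
grouped = df.groupby("dataset")[["x", "y"]]

# Mean and sample variance of x and y for every dataset
summary = grouped.agg(["mean", "var"])

# Pearson correlation between x and y within each dataset
corr = grouped.apply(lambda g: g["x"].corr(g["y"]))

print(summary.round(3))
print(corr.round(3))
```

Grouping avoids repeating the boolean mask `anscome_df['dataset'] == q` for every statistic, and the same three lines would work unchanged on a frame with more groups.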

## Linear regression
Finally, we fit a linear regression to each point set. We will use the [`LinearRegression().fit(X,y)`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) method from [scikit-learn](https://scikit-learn.org/stable/index.html).

In [None]:
from sklearn.linear_model import LinearRegression

# Fit one model per dataset. scikit-learn expects X as a 2-D array, hence the reshape(-1, 1)
reg_of_xy = {q: LinearRegression().fit(anscome_df[anscome_df['dataset'] == q]['x'].values.reshape(-1, 1),
                                       anscome_df[anscome_df['dataset'] == q]['y'].values.reshape(-1, 1)) 
                                       for q in anscome_dataset_labels}

# Compute the coefficient of determination R^2 of the prediction
coef_of_xy = {q: reg_of_xy[q].score(anscome_df[anscome_df['dataset'] == q]['x'].values.reshape(-1, 1),
                                    anscome_df[anscome_df['dataset'] == q]['y'].values.reshape(-1, 1)) 
                                    for q in anscome_dataset_labels}

In [None]:
# Get and show the coefficients of the linear fits
linear = {q: reg_of_xy[q].coef_      for q in anscome_dataset_labels}
const  = {q: reg_of_xy[q].intercept_ for q in anscome_dataset_labels}
print("Linear fits for each dataset")
for q in anscome_dataset_labels:
    print('y={:.3}x+{:.3} / R2={:.3} : {}'.format(linear[q].flatten()[0],const[q].flatten()[0],coef_of_xy[q],q))
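
For a single predictor, the least-squares line has a closed form, which explains why identical means, variances, and correlations force identical fits. The sketch below is an illustration under that assumption rather than a replacement for the scikit-learn code. It uses NumPy on dataset I only, with values hard-coded from Anscombe (1973), and cross-checks the result with `np.polyfit`.

```python
import numpy as np

# Dataset I of the quartet, hard-coded from Anscombe (1973)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Closed-form simple linear regression:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# With one predictor and an intercept, R^2 equals the squared Pearson correlation
r2 = np.corrcoef(x, y)[0, 1] ** 2

# Cross-check against NumPy's degree-1 least-squares polynomial fit
slope_fit, intercept_fit = np.polyfit(x, y, deg=1)

print(f"y = {intercept:.2f} + {slope:.3f}x, R^2 = {r2:.2f}")
```

Because the slope, intercept, and R² depend only on the means, variances, and covariance, any dataset matching those summary statistics yields the same line regardless of its shape.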

In [None]:
# Show the results of a linear regression within each dataset
sns.set(style="ticks")
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=anscome_df,
           col_wrap=4, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": .7})

## Bonus: DataSaurus
As a final bonus, let's look at the DataSaurus dataset from the paper ["Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing", Justin Matejka and George Fitzmaurice (2017)](https://dl.acm.org/doi/pdf/10.1145/3025453.3025912). All of its datasets share the same summary statistics to two decimal places.

In [None]:
# Read data from csv and preview
datasaurus_df = pd.read_csv("https://raw.githubusercontent.com/emmanueliarussi/DataScienceCapstone/master/datasets/DatasaurusDozen.csv")
datasaurus_df

In [None]:
# Plot each dataset
# Define a grid of plots: each DataSaurus dataset gets its own panel and color, four panels per row
g = sns.FacetGrid(datasaurus_df, col_wrap=4, col="dataset", hue="dataset")
g.map(plt.scatter, "x", "y", alpha=.7)  # render the scatter plots

In [None]:
dataset_datasaurus_labels = datasaurus_df['dataset'].unique() # labels for each dataset

# First, compute the means (select the numeric columns: recent pandas raises a TypeError on the string `dataset` column)
datasaurus_mean_of_xy = {q: datasaurus_df[datasaurus_df['dataset'] == q][['x', 'y']].mean() for q in dataset_datasaurus_labels}

# Second, compute the sample variances
datasaurus_var_of_xy  = {q: datasaurus_df[datasaurus_df['dataset'] == q][['x', 'y']].var() for q in dataset_datasaurus_labels}

# Third, compute the correlation between x and y
datasaurus_corr_of_xy = {q: datasaurus_df[datasaurus_df['dataset'] == q]['x'].corr(datasaurus_df[datasaurus_df['dataset'] == q]['y']) for q in dataset_datasaurus_labels}

In [None]:
# Show statistics (Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it)
pd.concat([pd.DataFrame.from_dict(datasaurus_mean_of_xy).rename({"x": "Mean of x", "y": "Mean of y"}),
           pd.DataFrame.from_dict(datasaurus_var_of_xy).rename({"x": "Variance of x", "y": "Variance of y"}),
           pd.DataFrame({k: [v] for k, v in datasaurus_corr_of_xy.items()}).rename({0: "Correlation between x and y"})])\
          .style.format(precision=3).set_caption("DataSaurus statistics")