# Practical Assignment - Fundamentals of Data Analysis

## Background and overview of the Anscombe's quartet dataset

"Graphs are essential to good statistical analysis"

Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough." (https://en.wikipedia.org/wiki/Anscombe%27s_quartet)


Anscombe's quartet is a set of four small datasets that produce almost identical summary statistics (mean, variance, standard deviation and correlation), which could lead some people to infer that the datasets are very similar [1].
However, while the summary statistics are nearly identical, plotting the data reveals that the datasets are notably different.
Anscombe published his paper to demonstrate the importance of plotting data before analyzing it, and to show the effect outliers can have on statistical properties [2]. In the paper he observes that too little attention is paid to plots, and that most people are taught to accept assumptions such as:
 - Numerical calculations are exact, while graphs are "rough".
 - For any kind of statistical data there exists only one set of calculations that constitutes a correct analysis.
 - Performing calculations is something of a virtue, while looking at the plotted data is "cheating" [2].

Anscombe asserts that computers should produce both calculations and plots, and that both must be studied because each contributes to understanding. By looking at the features of Anscombe's dataset and working through his analyses, we can better appreciate his position.
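
The claim that the four datasets share almost the same summary statistics can be checked directly. The sketch below is illustrative only: it uses the standard published values of the quartet typed in by hand (an assumption, since it does not read `anscombes.csv`), relies only on the Python standard library, and the names `QUARTET`, `pearson_r` and `summarize` are hypothetical helpers rather than anything from the original notebook.

```python
from statistics import mean, variance

# Standard published values of Anscombe's quartet (Anscombe, 1973),
# typed in here so the sketch does not depend on "anscombes.csv".
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def summarize(xs, ys):
    """Sample mean and variance of x and y, plus their correlation."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance (n - 1 denominator)
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }


summary = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Run as written, the four printed rows should agree with one another to roughly two decimal places, which is the point Anscombe makes before turning to the plots.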


In [3]:
# Open the dataset with pandas

import pandas as pd # Import pandas under the conventional alias pd
dataframe = pd.read_csv("anscombes.csv") # Read the CSV file into a new DataFrame named "dataframe"
dataframe # Display the DataFrame (a bare name on the last line of a cell is rendered as output)

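|   | Unnamed: 0 | dataset | x | y |
|---|---|---|---|---|
| 0 | 0 | I | 10.0 | 8.04 |
| 1 | 1 | I | 8.0 | 6.95 |
| 2 | 2 | I | 13.0 | 7.58 |
| 3 | 3 | I | 9.0 | 8.81 |
| 4 | 4 | I | 11.0 | 8.33 |
| 5 | 5 | I | 14.0 | 9.96 |
| 6 | 6 | I | 6.0 | 7.24 |
| 7 | 7 | I | 4.0 | 4.26 |
| 8 | 8 | I | 12.0 | 10.84 |
| 9 | 9 | I | 7.0 | 4.82 |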


## Plotting the dataset

In [4]:
import seaborn as sns
sns.set(style="ticks")

# Load seaborn's built-in copy of Anscombe's quartet.
# The dataset is named "anscombe"; requesting "anscombes" fails with an HTTP 404 error.
df = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
sns.lmplot(
    x="x", y="y", col="dataset", hue="dataset", data=df,
    col_wrap=2, ci=None, palette="muted", height=4,
    scatter_kws={"s": 50, "alpha": 1},
)
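
The regression lines that the `lmplot` call is meant to draw in each panel can also be checked numerically. The following is a minimal sketch of an ordinary least-squares fit, assuming the standard published quartet values (typed in by hand rather than loaded through seaborn or from `anscombes.csv`); `fit_line` is a hypothetical helper, not a seaborn function.

```python
# Standard published values of Anscombe's quartet (Anscombe, 1973).
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


fits = {name: fit_line(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (slope, intercept) in fits.items():
    print(f"{name}: y = {intercept:.2f} + {slope:.3f} * x")
```

All four fits should come out close to the same line, with a slope near 0.5 and an intercept near 3, even though the scatter plots look nothing alike.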

## Descriptive statistics of the variables in the dataset

In [5]:

dataframe = pd.read_csv("anscombes.csv")
dataframe.shape

(44, 4)

In [6]:

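import numpy as np # Import numpy under the conventional alias np
dataframe = pd.read_csv("anscombes.csv")
# describe() pools all 44 rows, so these statistics cover the four datasets combined
np.round(dataframe.describe(), decimals=2) # Round the summary statistics to 2 decimals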

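|   | Unnamed: 0 | x | y |
|---|---|---|---|
| count | 44.0 | 44.0 | 44.0 |
| mean | 21.5 | 9.0 | 7.5 |
| std | 12.85 | 3.2 | 1.96 |
| min | 0.0 | 4.0 | 3.1 |
| 25% | 10.75 | 7.0 | 6.12 |
| 50% | 21.5 | 8.0 | 7.52 |
| 75% | 32.25 | 11.0 | 8.75 |
| max | 43.0 | 19.0 | 12.74 |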
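
The maximum y value of 12.74 in the summary above belongs, in the standard published data, to the single outlier in dataset III. As a rough illustration of the effect an outlier can have on statistical properties, the sketch below refits dataset III with and without that point. The values are typed in by hand (an assumption, not read from `anscombes.csv`), and `fit_and_r` is a hypothetical helper.

```python
# Dataset III of Anscombe's quartet (standard published values).
X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_and_r(xs, ys):
    """Return (slope, intercept, Pearson r) for a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Treat the point with the largest y value (x=13, y=12.74) as the outlier.
outlier = max(range(len(Y3)), key=lambda i: Y3[i])
x_trim = [x for i, x in enumerate(X3) if i != outlier]
y_trim = [y for i, y in enumerate(Y3) if i != outlier]

slope_all, intercept_all, r_all = fit_and_r(X3, Y3)
slope_trim, intercept_trim, r_trim = fit_and_r(x_trim, y_trim)

print(f"all 11 points:   slope={slope_all:.3f}, intercept={intercept_all:.2f}, r={r_all:.3f}")
print(f"outlier removed: slope={slope_trim:.3f}, intercept={intercept_trim:.2f}, r={r_trim:.3f}")
```

With the outlier removed, the remaining ten points lie almost exactly on a straight line, so one observation accounts for nearly all of the difference between the two fits.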


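## References

[1] Wikipedia, "Anscombe's quartet". https://en.wikipedia.org/wiki/Anscombe%27s_quartet

[2] Anscombe, F. J. (1973). "Graphs in Statistical Analysis". The American Statistician, 27(1), 17-21. http://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf

[3] seaborn example gallery, "Anscombe's quartet". https://seaborn.pydata.org/examples/anscombes_quartet.html

[4] Complementary Training, "Stats Playbook: What is Anscombe's quartet and why is it important?". http://complementarytraining.net/stats-playbook-what-is-anscombes-quartet-and-why-is-it-important/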


