
# Introduction

Anscombe's quartet was constructed in 1973 by statistician [Francis Anscombe](https://en.wikipedia.org/wiki/Frank_Anscombe) to illustrate the importance of plotting data before analyzing it and building models, and to show how outliers and other influential observations affect statistical properties.

# Dataset

Anscombe's quartet comprises four datasets, each consisting of eleven (x, y) points.

In [16]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import seaborn as sns

# Basic Statistics
## Dataset 1

In [13]:
q1 = pd.read_csv("https://raw.githubusercontent.com/pradeep-isawasan/AnscombeQuartet/main/Dataset1.csv")

q1

Unnamed: 0,x,y
0,10,8.04
1,8,6.95
2,13,7.58
3,9,8.81
4,11,8.33
5,14,9.96
6,6,7.24
7,4,4.26
8,12,10.84
9,7,4.82


In [3]:
q1.describe()

Unnamed: 0,x,y
count,11.0,11.0
mean,9.0,7.500909
std,3.316625,2.031568
min,4.0,4.26
25%,6.5,6.315
50%,9.0,7.58
75%,11.5,8.57
max,14.0,10.84


## Dataset 2

In [4]:
q2 = pd.read_csv("https://raw.githubusercontent.com/pradeep-isawasan/AnscombeQuartet/main/Dataset2.csv")

q2

Unnamed: 0,x,y
0,10,9.14
1,8,8.14
2,13,8.74
3,9,8.77
4,11,9.26
5,14,8.1
6,6,6.13
7,4,3.1
8,12,9.13
9,7,7.26


In [5]:
q2.describe()

Unnamed: 0,x,y
count,11.0,11.0
mean,9.0,7.500909
std,3.316625,2.031657
min,4.0,3.1
25%,6.5,6.695
50%,9.0,8.14
75%,11.5,8.95
max,14.0,9.26


## Datasets 1, 2, 3, 4

In [14]:
q3 = pd.read_csv("https://raw.githubusercontent.com/pradeep-isawasan/AnscombeQuartet/main/Dataset3.csv")
q4 = pd.read_csv("https://raw.githubusercontent.com/pradeep-isawasan/AnscombeQuartet/main/Dataset4.csv")


# mean for x in q1, q2, q3, q4
meanx1 = q1['x'].mean()
meanx2 = q2['x'].mean()
meanx3 = q3['x'].mean()
meanx4 = q4['x'].mean()

# mean for y in q1, q2, q3, q4
meany1 = q1['y'].mean()
meany2 = q2['y'].mean()
meany3 = q3['y'].mean()
meany4 = q4['y'].mean()

# standard deviation, sd for x in q1, q2, q3, q4
sdx1 = q1['x'].std()
sdx2 = q2['x'].std()
sdx3 = q3['x'].std()
sdx4 = q4['x'].std()

# standard deviation, sd for y in q1, q2, q3, q4
sdy1 = q1['y'].std()
sdy2 = q2['y'].std()
sdy3 = q3['y'].std()
sdy4 = q4['y'].std()


In [15]:
# create table for comparison

d = {
    'Parameter': ['Mean(x)', 'Mean(y)', 'SD(x)', 'SD(y)'],
    'Dataset 1': [meanx1, meany1, sdx1, sdy1],
    'Dataset 2': [meanx2, meany2, sdx2, sdy2],
    'Dataset 3': [meanx3, meany3, sdx3, sdy3],
    'Dataset 4': [meanx4, meany4, sdx4, sdy4]
}

df = pd.DataFrame(d)
df

Unnamed: 0,Parameter,Dataset 1,Dataset 2,Dataset 3,Dataset 4
0,Mean(x),9.0,9.0,9.0,9.0
1,Mean(y),7.500909,7.500909,7.5,7.500909
2,SD(x),3.316625,3.316625,3.316625,3.316625
3,SD(y),2.031568,2.031657,2.030424,2.030579
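
The SD values in the table come from pandas' `.std()`, which by default computes the sample standard deviation (dividing by n - 1 rather than n). The sketch below checks this against the x values of Dataset 1 using only the standard library. The list `x1` is typed in by hand rather than read from the CSV, and its eleventh value (5) is assumed from the published quartet, since the displayed table stops at row 9.

```python
import math
import statistics

# x values of Dataset 1; the last value (5) is assumed from the published quartet
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

sample_sd = statistics.stdev(x1)       # divides by n - 1, like pandas .std()
population_sd = statistics.pstdev(x1)  # divides by n

# The same sample SD written out from its definition
mean_x = statistics.mean(x1)
manual_sd = math.sqrt(sum((v - mean_x) ** 2 for v in x1) / (len(x1) - 1))

print(round(sample_sd, 6))      # matches SD(x) in the table: 3.316625
print(round(population_sd, 6))  # smaller, because the denominator is n
```

All four datasets share this SD(x) even though their x values are not all the same, which is part of why the summary table alone cannot tell them apart.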


# Linear Relationship
## Dataset 1

The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is a measure of linear correlation between two sets of data. For example, to find out whether height and weight are related, we can calculate the correlation coefficient between them. The coefficient's value ranges between -1.0 and 1.0. A coefficient of -1.0 indicates a perfect negative linear correlation and 1.0 a perfect positive linear correlation. A coefficient of 0.0 means that there is no linear relationship between the two variables, although a nonlinear relationship may still exist.
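
As a rough illustration of the definition, the coefficient can be computed by hand as the covariance of x and y divided by the product of their spreads. The sketch below is a plain-Python version, not the notebook's own method; `pearson_r` is a hypothetical helper name, and `scipy.stats.pearsonr(x, y)` should return the same coefficient as its first result. The Dataset 1 values are typed in by hand, with the eleventh point (5, 5.68) assumed from the published quartet.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sum of co-deviations over the product of deviation norms."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Dataset 1; the last point (5, 5.68) is assumed from the published quartet
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
r1 = pearson_r(x1, y1)
print(round(r1, 3))  # approximately 0.816

# A perfect but nonlinear relationship (y = x^2 on symmetric x) gives r = 0
xs = [-2, -1, 0, 1, 2]
ys = [v ** 2 for v in xs]
r_parabola = pearson_r(xs, ys)
print(r_parabola)
```

The second case shows why the coefficient should be read alongside a plot: y is fully determined by x there, yet the linear correlation vanishes.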