# Fundamentals of Data Analysis - Assignment
***

## Table of Contents


***
## Introduction to Anscombe's quartet dataset
1. Explain the background to the dataset – who created it, when it was created, and
any speculation you can find regarding how it might have been created.

***
## Exploratory data analysis
2. Plot the interesting aspects of the dataset.

In [1]:
# although we can read csv files with numpy, pandas is much better for data analytics
import pandas as pd
import numpy as np

# to make interactive plots with plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.figure_factory as ff
from plotly import tools

# for making a linear regression line
from sklearn import linear_model, metrics

# read in our dataset
filename = 'anscombe.csv'

df = pd.read_csv(filename)

df

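| | Unnamed: 0 | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 1 | 2 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 2 | 3 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 3 | 4 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 4 | 5 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 5 | 6 | 14 | 14 | 14 | 8 | 9.96 | 8.1 | 8.84 | 7.04 |
| 6 | 7 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 7 | 8 | 4 | 4 | 4 | 19 | 4.26 | 3.1 | 5.39 | 12.5 |
| 8 | 9 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 9 | 10 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |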


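When we call the read_csv() function with just the filename, pandas falls back on its defaults: a comma as the separator and a header row inferred from the first line of the file, as described in the documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). The unnamed first column of the file is loaded as an ordinary column called "Unnamed: 0", next to a fresh 0-based index.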
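
To make those defaults visible, here is a minimal sketch that spells them out by hand. It reads from an in-memory string rather than `anscombe.csv` so that it runs on its own; the three-row sample and the names `sample_csv`, `df_default` and `df_explicit` are assumptions made for illustration. The leading empty header field mimics a file that was saved together with its index.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for anscombe.csv
sample_csv = """,x1,y1
1,10,8.04
2,8,6.95
3,13,7.58
"""

# Relying on the defaults, as in the cell above
df_default = pd.read_csv(io.StringIO(sample_csv))

# Spelling the same defaults out explicitly
df_explicit = pd.read_csv(io.StringIO(sample_csv), sep=",", header=0)

print(df_default.columns.tolist())     # the empty header field becomes 'Unnamed: 0'
print(df_default.equals(df_explicit))  # both calls give the same frame
```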

Anyway, while this looks good, we can present it in a better form.

First, we should use the first column within the dataset itself as the index, and then reorder the columns such that x1 is next to y1 and so on.

In [4]:
df = pd.read_csv(filename, index_col=0)
df = df[['x1','y1', 'x2','y2', 'x3', 'y3', 'x4', 'y4']]
df

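| Unnamed: 0 | x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 |
|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 2 | 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 3 | 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 4 | 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 5 | 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 6 | 14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 |
| 7 | 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 8 | 4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 19 | 12.5 |
| 9 | 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
| 10 | 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |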


However, the table above looks rather plain and can be made much more interesting. It is now obvious that the four pairs that make up the quartet are x1/y1, x2/y2, and so on. Let's make that visible by coloring each pair of columns.


In [5]:
# reference: https://plot.ly/python/table/

trace = go.Table(
    header=dict(
        values=list(df.columns),
        fill=dict(color='blue'),
        align=['center'],
        font=dict(color='white'),
    ),
    cells=dict(
        values=[df.x1, df.y1,
                df.x2, df.y2,
                df.x3, df.y3,
                df.x4, df.y4],
        # one fill and one font color per column, in column order
        fill=dict(color=['red', 'red', 'yellow', 'yellow',
                         'lightgreen', 'lightgreen', 'black', 'black']),
        align=['center'],
        font=dict(color=['white', 'white', 'black', 'black',
                         'black', 'black', 'white', 'white']),
    ),
)

data = [trace]
iplot(data)
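
The two color lists in the cell above repeat each entry once per column of a pair. One way to avoid typing them out, sketched here with hypothetical names (`pair_styles`, `fill_colors`, `font_colors`), is to define one fill and one font color per pair and expand them:

```python
# One (fill, font) color per x/y pair, in column order
pair_styles = [
    ("red", "white"),
    ("yellow", "black"),
    ("lightgreen", "black"),
    ("black", "white"),
]

# Each pair spans two columns (x and y), so every entry is repeated twice
fill_colors = [fill for fill, _font in pair_styles for _ in range(2)]
font_colors = [font for _fill, font in pair_styles for _ in range(2)]

print(fill_colors)
print(font_colors)
```

The resulting lists can be passed to the `cells` argument as `fill=dict(color=fill_colors)` and `font=dict(color=font_colors)`.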

We can also sort each pair of columns by its x value, so that the pattern in y becomes visible from the table itself.

In [26]:
# reorder the table above

# split df into one frame per pair, because plotting, regression etc. are done separately for each
# sort each frame by x, then reset the index so the frames line up row by row when re-merged

df_Q1 = df[['x1', 'y1']].sort_values('x1').reset_index(drop=True)
df_Q2 = df[['x2', 'y2']].sort_values('x2').reset_index(drop=True)
df_Q3 = df[['x3', 'y3']].sort_values('x3').reset_index(drop=True)
df_Q4 = df[['x4', 'y4']].sort_values('x4').reset_index(drop=True)

# the plotly table header takes a single list of column names, so join the frames back together side by side

df = pd.concat([df_Q1, df_Q2, df_Q3, df_Q4], axis=1)
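
The index is reset because `pd.concat(..., axis=1)` lines rows up by index label, not by position. The sketch below illustrates this with rows 7 and 8 of the table above; the frame names (`q1`, `q4`, `by_label`, `by_position`) are hypothetical and only serve the example.

```python
import pandas as pd

# Rows 7 and 8 of the table above, keeping their original index labels
q1 = pd.DataFrame({"x1": [6, 4], "y1": [7.24, 4.26]}, index=[7, 8]).sort_values("x1")
q4 = pd.DataFrame({"x4": [8, 19], "y4": [5.25, 12.5]}, index=[7, 8]).sort_values("x4")

# Without resetting, rows are matched on their labels,
# so each row still holds the values of one original row
by_label = pd.concat([q1, q4], axis=1)

# After resetting, rows are matched on their sorted position,
# so every pair is in ascending order of its own x
by_position = pd.concat(
    [q1.reset_index(drop=True), q4.reset_index(drop=True)], axis=1
)

print(by_label)
print(by_position)
```

In `by_label`, the row labelled 8 still pairs x1 = 4 with x4 = 19, exactly as in the unsorted table. In `by_position`, the first row pairs the smallest x1 with the smallest x4, which is what the sorted table needs.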

In [27]:
# same table as before, now built from the sorted frame
trace = go.Table(
    header=dict(
        values=list(df.columns),
        fill=dict(color='blue'),
        align=['center'],
        font=dict(color='white'),
    ),
    cells=dict(
        values=[df.x1, df.y1,
                df.x2, df.y2,
                df.x3, df.y3,
                df.x4, df.y4],
        fill=dict(color=['red', 'red', 'yellow', 'yellow',
                         'lightgreen', 'lightgreen', 'black', 'black']),
        align=['center'],
        font=dict(color=['white', 'white', 'black', 'black',
                         'black', 'black', 'white', 'white']),
    ),
)

data = [trace]
iplot(data)

The sorted table above makes a few things easy to see:

1. Quartet IV is problematic: all but one of the rows share the same x value (8) with different y values, and the remaining point (x = 19, y = 12.5) looks like an outlier
1. Quartet I - y increases almost linearly with x, but dips at a minimum of 2 points (roughly)
1. Quartet II - y increases with x up to a peak and then decreases again
1. Quartet III - y increases with x, apart from one outlying y value (12.74)
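
These readings can also be checked numerically. The sketch below re-enters the ten rows exactly as they are displayed in the tables above, so it does not depend on `anscombe.csv`; the variable names are illustrative assumptions.

```python
import pandas as pd

# The ten rows exactly as displayed in the tables above
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7]
data = pd.DataFrame({
    "x1": x123, "y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82],
    "x2": x123, "y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26],
    "x3": x123, "y3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42],
    "x4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8],
    "y4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91],
})

# Quartet IV (x4/y4): how often does each x value occur?
x4_counts = data["x4"].value_counts().to_dict()

# Quartet I (x1/y1): count the places where y falls while x rises
q1 = data.sort_values("x1")
dips_q1 = int((q1["y1"].diff() < 0).sum())

# Quartet II (x2/y2): at which x does y reach its peak?
x_at_peak_y2 = data.loc[data["y2"].idxmax(), "x2"]

# Quartet III (x3/y3): which point has the largest y?
outlier_q3 = data.loc[data["y3"].idxmax(), ["x3", "y3"]].tolist()

print(x4_counts, dips_q1, x_at_peak_y2, outlier_q3)
```

On the ten rows shown, x4 takes the value 8 nine times and 19 once, y1 dips three times as x1 rises, y2 peaks at x2 = 11 rather than at the largest x, and the largest y3 (12.74) sits at x3 = 13.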

Of course, because humans are better at judging patterns visually, we'll also plot these.

***
## Descriptive statistics
3. Calculate the descriptive statistics of the variables in the dataset.


## Discussion
4. Explain why the dataset is interesting, referring to the plots and statistics above.


## Conclusion

links:
https://en.wikipedia.org/wiki/Anscombe%27s_quartet