# Module 1 Python Practice: Anscombe's Quartet

In this practice, we will recreate the Anscombe's Quartet visualization in Python, following the same steps we used in R.


<img src="../images/AnscombeStats.png">

<img src="../images/AnscombeGraph.png">

**We will use the `plotnine` library, which is a good implementation of `ggplot2` in Python.**

In [None]:
from plotnine import *
import pandas as pd
import seaborn as sns

# read the anscombe dataset
anscombe = pd.read_csv("/dsa/data/all_datasets/anscombe.csv")

# the same data set also ships with the seaborn library, so we can load it directly
anscombe2 = sns.load_dataset('anscombe')

In [None]:
# let's look at the data itself
anscombe

In [None]:
# the seaborn version is laid out differently: it has a `dataset` column plus `x` and `y`
anscombe2
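The two versions hold the same numbers in different layouts. The sketch below shows one way the per-set columns could be reshaped into the `dataset`/`x`/`y` layout. It assumes the CSV has columns `x1`..`x4` and `y1`..`y4` (only `x1` and `y1` are confirmed by this notebook), and it uses a small stand-in frame named `wide`, holding the first two rows of Anscombe's published values, rather than the real file.

```python
import pandas as pd

# Stand-in for the CSV layout (assumed columns x1..x4, y1..y4); two rows only.
wide = pd.DataFrame({
    "x1": [10, 8], "x2": [10, 8], "x3": [10, 8], "x4": [8, 8],
    "y1": [8.04, 6.95], "y2": [9.14, 8.14],
    "y3": [7.46, 6.77], "y4": [6.58, 5.76],
})

# Pull out each (x_i, y_i) pair, rename it to (x, y), and label it
# with the Roman numeral used in the seaborn version.
frames = []
for i, name in enumerate(["I", "II", "III", "IV"], start=1):
    pair = wide[[f"x{i}", f"y{i}"]].rename(columns={f"x{i}": "x", f"y{i}": "y"})
    pair["dataset"] = name
    frames.append(pair)

# Stack the four labelled pairs into one long frame.
long_df = pd.concat(frames, ignore_index=True)[["dataset", "x", "y"]]
print(long_df)
```

With the long layout, one `groupby('dataset')` or one `facet_wrap` call covers all four sets, which is why the later cells use `anscombe2`.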

In [None]:
# Now let's look at the statistics: we can group by the `dataset` column in the seaborn version
grouped = anscombe2.groupby('dataset')
grouped.describe()

In [None]:
# correlation
# select x and y explicitly so the grouping column is not passed to the lambda
grouped[['x', 'y']].apply(lambda df: df['x'].corr(df['y']))

In [None]:
# variance
grouped['y'].var()
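To see what the variance and correlation cells compute, here is a minimal sketch that works the two statistics out by hand for dataset I using only the standard library. The values of `x` and `y` are typed in from Anscombe's published table rather than read from the course file, so treat them as an assumption about what the CSV contains.

```python
import math
import statistics

# Dataset I of Anscombe's quartet (published values, typed in by hand).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Sample variance (divides by n - 1), the same convention pandas .var() uses by default.
var_y = statistics.variance(y)

# Pearson correlation: co-deviation of x and y, scaled by both spreads.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

print(f"var(y) = {var_y:.3f}, corr(x, y) = {r:.3f}")
# prints: var(y) = 4.127, corr(x, y) = 0.816
```

Repeating this for the other three sets should give nearly the same two numbers, which is the point of the quartet.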

In [None]:
# linear regression (first pair)
import numpy as np
from scipy import stats

group1 = grouped.get_group('I')
x = group1['x'].values
y = group1['y'].values
slope, intercept, _, _, _ = stats.linregress(x, y)
print("dataset I slope: {:.4f}, intercept: {:.4f}".format(slope, intercept))

In [None]:
# linear regression (second pair)
group2 = grouped.get_group('II')
x = group2['x'].values
y = group2['y'].values
slope, intercept, _, _, _ = stats.linregress(x, y)
print("dataset II slope: {:.4f}, intercept: {:.4f}".format(slope, intercept))
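For reference, the slope and intercept of a simple linear regression follow from the ordinary least-squares formulas: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below applies them to dataset I in plain Python; the data values are typed in from Anscombe's published table (an assumption about the file's contents), and this is an illustration of the formulas, not a description of how `stats.linregress` is implemented internally.

```python
# Dataset I of Anscombe's quartet (published values, typed in by hand).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sxy: summed co-deviation of x and y; Sxx: summed squared deviation of x.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

print("dataset I slope: {:.4f}, intercept: {:.4f}".format(slope, intercept))
# prints: dataset I slope: 0.5001, intercept: 3.0001
```

The result should agree with the `stats.linregress` output for dataset I to the printed precision.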

### Now, it's your turn: Find the linear regression coefficients for the next two data sets:

In [None]:
# linear regression (next two pairs)
# <your code here>

**Let's make the same plots as in the R practice notebook.**

In [None]:
# scatter plot of the first pair with a fitted regression line
p1 = ggplot(anscombe, aes(x="x1", y="y1"))
p1 = p1 + geom_point()
p1 = p1 + stat_smooth(method='lm', se=False) + expand_limits(x=4, y=4)
p1

In [None]:
# Here is a version styled like the original plot from the lab notebook:
pp1 = ggplot(anscombe) + geom_point(aes(x="x1", y="y1"), color="darkorange")
# range() excludes its stop value, so stop one past the last tick we want (18 and 12)
pp1 = pp1 + theme_bw() + scale_x_continuous(breaks=range(0, 19, 2))
pp1 = pp1 + scale_y_continuous(breaks=range(0, 13, 2))
# reference line using the rounded regression coefficients
pp1 = pp1 + geom_abline(intercept=3, slope=0.5, color="cornflowerblue")
pp1 = pp1 + expand_limits(x=[4, 18], y=[4, 12])
pp1


Since we do not have the gridding library from the R notebook in Python, we can instead use the `facet_wrap()` function on the `dataset` column of the seaborn version of the data set, which draws one panel per data set.

In [None]:
pp12 = (ggplot(anscombe2, aes('x', 'y'))
 + geom_point()
 + stat_smooth(method='lm', se=False, color='blue',size=0.5)
 + facet_wrap('~dataset'))

pp12

And the same plot styled like the original:

In [None]:
(ggplot(anscombe2, aes('x', 'y'))
 + geom_point(color='darkorange')
 + theme_bw()
 + scale_y_continuous(breaks=list(range(0, 13, 2)))
 + scale_x_continuous(breaks=list(range(0, 19, 2)))
 + stat_smooth(method='lm', se=False, color='blue', size=0.5)
 + facet_wrap('~dataset'))