# Experimental Design

In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import warnings
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')

Experimental design is the process of carrying out research in an objective and controlled fashion, so that we can draw specific conclusions about a hypothesis.

https://www.sciencedirect.com/topics/earth-and-planetary-sciences/experimental-design

Because we use objective tools, we need to use quantified language. Instead of vague words like 'probably', 'likely', and 'small' when stating our conclusions, we should use precise, quantified language. This often takes the form of stating the risk of a Type I error, as a percentage, in the conclusion.

Type I errors: we incorrectly reject the null hypothesis when it is actually true. 
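
To make the idea of a quantified Type I error risk concrete, here is a minimal simulation sketch. It assumes a two-sample t-test (`scipy.stats.ttest_ind`) as the test, and the sample sizes, seed, and number of simulated experiments are arbitrary choices for illustration. Both groups are drawn from the same distribution, so every rejection of the null hypothesis is a Type I error, and the observed rejection rate should land close to `alpha`.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
alpha = 0.05          # significance level: the Type I error risk we accept
n_experiments = 2000  # number of simulated experiments (arbitrary choice)

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same distribution, so the null hypothesis is true
    control = rng.normal(loc=0, scale=1, size=50)
    treatment = rng.normal(loc=0, scale=1, size=50)
    _, p_value = ttest_ind(control, treatment)
    if p_value < alpha:
        # Rejecting a true null hypothesis is a Type I error
        false_positives += 1

type_1_rate = false_positives / n_experiments
print(f"Observed Type I error rate: {type_1_rate:.3f}")
```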

## Terminology

Subjects
: What we are experimenting on

Treatment
: Some change given to one group

Control group
: The group not given any treatment

## How to assign subjects to groups

- Non-random (`iloc` with a range, for instance)

In [None]:
df = pd.read_feather('../data/dem_votes_potus_12_16.feather')

group1 = df.iloc[0:100]
group2 = df.iloc[100:200]


In [None]:
group1.describe()

In [None]:
group2.describe()

- Random (the `sample` method, for instance)

In [None]:
random_group1 = df.sample(frac=0.5)
random_group1.describe()

In [None]:
random_group2 = df.drop(random_group1.index)

In [None]:
compare_df_rand = pd.concat([random_group1['dem_percent_12'].describe(), random_group2['dem_percent_12'].describe()], axis=1)
compare_df_rand.columns = ['group1', 'group2']

print(compare_df_rand)

# Experimental data setup

Randomization is often the best technique for setting up experimental data, but it isn't always.

## Scenarios where randomization could cause undesirable outcomes

### Uneven group sizes

Simple randomization can leave the groups with different numbers of subjects. This can be solved with block randomization.

### Covariates

Covariates are variables that potentially affect experiment results but aren't the primary focus. If covariates are highly variable or not equally distributed among groups, randomization might not produce balanced groups. This imbalance can lead to biased results. Overall, covariates make it harder to isolate the effect of a treatment, because they may be what drives an observed change.

In [None]:
group1 = df.sample(frac=0.5, replace=False)
group1['Block']=1

group2 = df.drop(group1.index)
group2['Block']=2

print(len(group1), len(group2))

But does this technique eliminate the covariate issue?

A nice way of checking for potential covariate issues is with visualizations.

In [None]:
sns.displot(data=df, x='dem_percent_12', fill=True, kind='kde'
            # , hue=''
           )

Not with this dataset, but it could happen that, based on a second feature, there is quite a difference in the group distributions. When an effect could be caused by another variable rather than by the treatment, this is often called **confounding**. The covariate issue can be solved with stratified randomization.

### Stratified randomization

Stratified randomization involves first splitting the subjects based on a potentially confounding variable, and then randomizing within each of the resulting groups (strata).
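
A minimal sketch of stratified randomization with pandas might look like the following. The `subjects` DataFrame and its `region` column are hypothetical stand-ins for real data and for a potentially confounding variable; the idea is simply to sample half of each stratum for the treatment group so both groups end up with nearly the same mix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical subjects; 'region' stands in for a potentially confounding variable
subjects = pd.DataFrame({
    "subject_id": range(200),
    "region": rng.choice(["north", "south"], size=200, p=[0.7, 0.3]),
})

# Within each stratum, randomly pick half of the subjects for the treatment group
treatment = subjects.groupby("region").sample(frac=0.5, random_state=0)
# Everyone not picked forms the control group
control = subjects.drop(treatment.index)

# Both groups should show (almost) the same mix of regions
print(treatment["region"].value_counts(normalize=True))
print(control["region"].value_counts(normalize=True))
```

Sampling inside each stratum, rather than across the whole DataFrame, is what keeps the covariate balanced between the two groups.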


# Normal Data

Normal data is drawn from a normal distribution

The normal distribution is related to z-scores

$$z = \frac{ x - \mu}{ \sigma }$$

The most common normal distribution is the standard one, having $\mu=0$ and $\sigma=1$.
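
As a small illustration of the z-score formula, the sketch below standardizes a synthetic sample. The values `loc=100` and `scale=15` are arbitrary assumptions chosen for the example. After applying $z = (x - \mu) / \sigma$, the transformed values have a mean of 0 and a standard deviation of 1.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical sample from a normal distribution with mu=100 and sigma=15
x = rng.normal(loc=100, scale=15, size=10_000)

mu = x.mean()     # sample mean
sigma = x.std()   # sample standard deviation

# Apply z = (x - mu) / sigma to every observation
z = (x - mu) / sigma

print(f"mean of z: {z.mean():.3f}, std of z: {z.std():.3f}")
```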

The normal distribution is behind many of the statistical **parametric** tests. There are also **non-parametric** tests that don't assume normal data.

To visually check whether a dataset follows a normal distribution, we can plot its KDE.

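Another visual tool is the Q-Q plot, which compares the quantiles of the data against the quantiles of a theoretical distribution (here, the normal distribution).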

In [None]:
df = pd.read_csv('../data/chick_weight.csv')

df.info()

In [None]:
n_rows=2
n_cols=2
# Create the subplots
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols)

for i, column in enumerate(df.columns):
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[column], kde=True, stat='density', ax=axes[i//n_cols, i%n_cols])


In [None]:
# qqplot must be imported before its first use
from statsmodels.graphics.gofplots import qqplot

qqplot(data=df['weight'], line='s')
plt.show()

In [None]:
from statsmodels.graphics.gofplots import qqplot
from scipy.stats.distributions import norm

qqplot(df['weight'], 
       line='s', 
       dist=norm
      ) 
plt.show()

In [None]:
# Subset the data
subset_data = df[df['Time'] == 2]

# Repeat the plot on the subset (not on the full DataFrame)
sns.displot(data=subset_data, x='weight', kind="kde")
plt.show()

- Ideally, the dots in the Q-Q plot should follow the line.
- A bad sign: the dots bow away from the line at the ends.

Other tests for normality:
- Shapiro-Wilk: good for smaller datasets
- D'Agostino $K^2$: uses kurtosis and skewness
- Anderson-Darling: returns a list of critical values

For all these, the $H_0$ is "data is drawn from a Normal distribution"
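
As a rough sketch of the D'Agostino $K^2$ test, `scipy.stats.normaltest` can be used. The two samples below are synthetic and only illustrate how the p-value is compared against `alpha`; they are not taken from the chick weight data. A p-value below `alpha` leads us to reject the hypothesis that the data is drawn from a normal distribution.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(7)
normal_sample = rng.normal(loc=0, scale=1, size=500)  # drawn from a normal distribution
skewed_sample = rng.exponential(scale=1, size=500)    # clearly not normal

alpha = 0.05
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    # normaltest combines skewness and kurtosis into a single K^2 statistic
    stat, p = normaltest(sample)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: K^2 = {stat:.3f}, p = {p:.4f} -> {decision}")
```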

In [None]:
# A Shapiro Wilk test
from scipy.stats import shapiro

alpha = 0.05
stat, p = shapiro(df.weight)
print(f'p:{round(p,4)} test stat: {round(stat, 4)}')


In [None]:
# An Anderson-Darling test
from scipy.stats import anderson

alpha = 0.05
result = anderson(x = df.weight, dist="norm")
print(round(result.statistic,4))
print(result.significance_level)
print(result.critical_values)
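
Because the Anderson-Darling result reports critical values rather than a single p-value, one way to read it is to compare the test statistic against each critical value. The sketch below assumes the `significance_level` and `critical_values` attributes returned by `scipy.stats.anderson`, and uses a synthetic, clearly non-normal sample instead of the chick weights. If the statistic exceeds the critical value for a given significance level, we reject normality at that level.

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(3)
sample = rng.exponential(scale=1, size=300)  # hypothetical, clearly non-normal data

result = anderson(sample, dist="norm")

# Pair each significance level (given in percent) with its critical value
for level, critical in zip(result.significance_level, result.critical_values):
    decision = "reject H0" if result.statistic > critical else "fail to reject H0"
    print(f"alpha = {level / 100:.3f}: critical value = {critical:.3f} -> {decision}")
```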