<a href="https://colab.research.google.com/github/peterbmob/DHMVADoE/blob/main/Excercises/fractional_design.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Fractional 2$^k$  Factorial Designs
## Motivation
The prior section showed an example of what an experimental design might look like for 6 variables. However, this resulted in a 2$^6$=64 experiment design campaign. This is potentially a major issue: if each experiment takes 6 hours, and experiments have to be staggered over working hours on weekdays, we're looking at almost 90 days of turnaround time, assuming each experiment is carried out flawlessly. This is simply not a realistic view of experimentation.

In addition, we saw that a three-coefficient model captured nearly as much detail as a 64-coefficient model. By reducing the number of input variables we looked at, we turned certain experiments into replicates (because the only things that changed between them were insignificant variables or variable combinations).

But we can halve or quarter our effort, and substantially improve our effectiveness in the lab, by carefully selecting experiments at each stage of the experiment to reveal a maximum amount of information, and avoiding as much as possible these kinds of duplicate experiments, through a fractional factorial design.

In [1]:
import pandas as pd
import itertools
import numpy as np
import seaborn as sns
import pylab

import scipy.stats as stats
import statsmodels.api as sm

After re-casting the problem in a general form, we begin with the experimental design matrix. If we were to construct the full factorial for our  2$^6$ factorial example, we would again have 64 rows in our experimental design matrix dataframe, corresponding to 64 experiments to run.

In [2]:
column_labs = ['x%d'%(i+1) for i in range(6)]
encoded_inputs = list( itertools.product([-1,1], repeat=6) )

# Create the experiment design table (same as the book).
# Reversing the column order makes x1 the fastest-varying factor:
doe=pd.DataFrame(encoded_inputs)
doe=doe[doe.columns[::-1]]
doe.columns=column_labs

print(len(doe))

64


## Design Matrix
Let's talk a bit more about the design matrix. Each column of the design matrix holds the coded values (−1 or +1) of one input variable. But each experiment also has a corresponding coded value for each two-variable interaction  x$_i$,x$_j$, and for each three-variable interaction  x$_k$,x$_m$,x$_n$, and so on.

These interactions are simply the product of each coded variable value. For example, if

\begin{equation}
x_1=−1 \\
\end{equation}

\begin{equation}
x_2=+1 \\
\end{equation}

\begin{equation}
x_3=+1
\end{equation}

then two-variable interaction effects can be computed as:

\begin{equation}
x_{12}=−1\times+1 = -1 \\
\end{equation}

\begin{equation}
x_{13}=−1\times+1 = −1 \\
\end{equation}

\begin{equation}
x_{23}=+1\times+1 = +1 \\
\end{equation}

and three-variable interaction effects are:

\begin{equation}
x_{123}=-1\times+1\times+1=-1
\end{equation}
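
As a quick check of the arithmetic above, the products can be computed directly. This is a minimal sketch; the helper name `interaction` is made up for illustration and is not part of any library.

```python
from math import prod

# Coded levels for one experiment (the values used in the text above).
x = {1: -1, 2: +1, 3: +1}

def interaction(levels, *factors):
    """Coded value of an interaction: the product of the chosen factor levels."""
    return prod(levels[f] for f in factors)

x12 = interaction(x, 1, 2)      # (-1) * (+1)
x13 = interaction(x, 1, 3)      # (-1) * (+1)
x23 = interaction(x, 2, 3)      # (+1) * (+1)
x123 = interaction(x, 1, 2, 3)  # (-1) * (+1) * (+1)

print(x12, x13, x23, x123)      # prints: -1 -1 1 -1
```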

Now we can add new columns to our experimental design matrix dataframe, representing coded values for higher-order interaction effects:

In [3]:
doe['x1-x2-x3-x4'] = doe.apply( lambda z : z['x1']*z['x2']*z['x3']*z['x4'] , axis=1)
doe['x4-x5-x6']    = doe.apply( lambda z : z['x4']*z['x5']*z['x6'] , axis=1)
doe['x2-x4-x5']    = doe.apply( lambda z : z['x2']*z['x4']*z['x5'] , axis=1)

doe[0:10]

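|   | x1 | x2 | x3 | x4 | x5 | x6 | x1-x2-x3-x4 | x4-x5-x6 | x2-x4-x5 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | -1 | -1 |
| 1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 |
| 3 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | -1 | 1 |
| 4 | -1 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 5 | 1 | -1 | 1 | -1 | -1 | -1 | 1 | -1 | -1 |
| 6 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | -1 | 1 |
| 7 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | 1 |
| 8 | -1 | -1 | -1 | 1 | -1 | -1 | -1 | 1 | 1 |
| 9 | 1 | -1 | -1 | 1 | -1 | -1 | 1 | 1 | 1 |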


The multi-variable columns can be used to fractionate our design.
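
The interaction columns can also be computed without a row-wise `apply`, by taking the product across the relevant columns. The sketch below rebuilds the 2$^6$ design so that it runs on its own; the names `design` and `add_interaction` are illustrative only and do not appear in the cells above.

```python
import itertools
import pandas as pd

# Rebuild the 2^6 design; reversing each tuple makes x1 the fastest-varying factor.
factors = ['x%d' % (i + 1) for i in range(6)]
design = pd.DataFrame(
    [run[::-1] for run in itertools.product([-1, 1], repeat=6)],
    columns=factors,
)

def add_interaction(df, names):
    """Add a column holding the row-wise product of the named factor columns."""
    label = '-'.join(names)
    df[label] = df[names].prod(axis=1)
    return label

for combo in (['x1', 'x2', 'x3', 'x4'], ['x4', 'x5', 'x6'], ['x2', 'x4', 'x5']):
    add_interaction(design, combo)

print(design.head(10))
```

For a table of this size the speed difference is irrelevant; the column-wise product is mainly shorter to write and makes it easy to add any other interaction by name.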

## Half Factorial
Suppose we pick a high-order interaction effect at random - e.g., x$_1\times$x$_2\times$x$_3\times$x$_4$ - and assume it will be unimportant. Our assumption allows us to cut out any experiments that are intended to give us information about the effect of  x$_1$x$_2$x$_3$x$_4$.

For any two groups of experiments, if one group has
\begin{equation}
x_1x_2x_3x_4=+1
\end{equation}

and the other group has
\begin{equation}
x_1x_2x_3x_4=−1
\end{equation}

then, based on our assumption that this interaction effect will be unimportant, one of those two groups can be thrown out.

Fortuitously, the first time a variable is eliminated, no matter which variable it is, the number of experiments is cut in half. Further eliminations of variables continue to cut the number of experiments in half. So a six-factor experimental design could be whittled down as follows:

Six-factor, two-level experiment design:

- n=2,  k=6,  2$^6$ experimental design
- Full factorial:  2$^6$=64 experiments
- Half factorial:  2$^{6−1}$=32 experiments
- $\frac{1}{4}$ Fractional factorial:  2$^{6−2}$=16 experiments
- $\frac{1}{8}$ Fractional factorial:  2$^{6−3}$=8  experiments
- $\frac{1}{16}$ Fractional factorial:  2$^{6−4}$=4  experiments

In general, for a **2$^k$ experiment design** (k factors, 2 levels each), a **$\frac{1}{2^p}$ fractional factorial** can be defined as:
- $\frac{1}{2^p}$ Fractional factorial:  2$^{k-p}$  experiments
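
The run counts in the list above follow from a single expression: with k two-level factors and p independent identities, the design has 2 to the power (k minus p) runs. A small sketch, where the function name `n_runs` is just a label chosen here:

```python
def n_runs(k, p=0, levels=2):
    """Runs in a design with k factors and p independent identities."""
    if not 0 <= p < k:
        raise ValueError("p must satisfy 0 <= p < k")
    return levels ** (k - p)

# Six factors, from the full factorial (p=0) down to the 1/16 fraction (p=4).
for p in range(5):
    print(f"p={p}: fraction 1/{2 ** p}, runs = {n_runs(6, p)}")
```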

Note that as the fractional factorial gets narrower, and the experiments get fewer, the number of aliased interaction effects gets larger, until not even interaction effects can be distinguished, but only main effects. (Screening designs, such as Plackett-Burman designs, are based on this idea of highly-fractionated experiment design; we'll get into that later.)

For now, let's look at the half factorial: 32 experiments, with the reduction in variables coming from aliasing the interaction effect  x$_1$x$_2$x$_3$x$_4$:

In [4]:
print(len( doe[doe['x1-x2-x3-x4']==1] ))

32
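
One property worth checking is that the half fraction stays balanced: every factor still appears at its low and high level equally often, and the main-effect columns remain mutually orthogonal. The following sketch verifies this with NumPy on a freshly built design (it does not reuse the `doe` dataframe, so it can be run on its own).

```python
import itertools
import numpy as np

# Full 2^6 design; columns are ordered x1..x6.
full = np.array(list(itertools.product([-1, 1], repeat=6)))[:, ::-1]

# Keep the runs where x1*x2*x3*x4 = +1.
keep = full[:, 0] * full[:, 1] * full[:, 2] * full[:, 3] == 1
half = full[keep]

column_sums = half.sum(axis=0)   # zero means 16 runs at -1 and 16 at +1
gram = half.T @ half             # off-diagonal zeros mean orthogonal columns

print(half.shape)
print(column_sums)
print(gram)
```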


## Costs and Benefits
The benefits are obvious - we've halved the number of experiments our experiment design requires. But at what cost?

The first 32 experiments, where  x$_1$x$_2$x$_3$x$_4$=+1, give us information at a positive level of that input variable combination. To get information at a negative level of that input variable combination (i.e.,  x$_1$x$_2$x$_3$x$_4$=-1), we need 32 additional experiments.

Our assumption is that changing x$_1$x$_2$x$_3$x$_4$ from high to low will have no effect on the observable y.

This also modifies the information we get about higher-order interaction effects. For example, we've assumed:
\begin{equation}
x_1x_2x_3x_4=+1
\end{equation}

We can use this identity to figure out what information we're missing when we cut out the 32 experiments. Our assumption about the fourth-order interaction also changes fifth- and sixth-order interactions:

\begin{equation}
(x_1x_2x_3x_4)=(+1) \\
\end{equation}

\begin{equation}
(x_1x_2x_3x_4)x_5=(+1)x_5 \\
\end{equation}

\begin{equation}
x_1x_2x_3x_4x_5=x_5
\end{equation}

meaning the fifth-order interaction effect x$_1$x$_2$x$_3$x$_4$x$_5$ has been aliased with the first-order main effect x$_5$: within the half fraction the two columns are identical, so their effects cannot be separated. Attributing the combined effect to x$_5$ is usually a safe assumption, since it is extremely unlikely that a fifth-order interaction effect is comparable in size to a first-order main effect. We can derive other relations, using the fact that any coded factor squared is equal to (+1), so that:

\begin{equation}
(x_1x_2x_3x_4)=+1 \\
\end{equation}

\begin{equation}
(x_1x_2x_3x_4)x_1=(+1)x_1 \\
\end{equation}

\begin{equation}
(x^2_1x_2x_3x_4)=(+1)x_1 \\
\end{equation}

\begin{equation}
x_2x_3x_4=x_1
\end{equation}
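
These alias relations can be confirmed numerically: inside the half fraction, the aliased columns are element-by-element identical, while in the full design they are not. A minimal sketch, built independently of the dataframes above:

```python
import itertools
import numpy as np

# Full 2^6 design; x[i] is the coded column of factor x_i.
runs = np.array(list(itertools.product([-1, 1], repeat=6)))[:, ::-1]
x = {i + 1: runs[:, i] for i in range(6)}

# Runs belonging to the half fraction x1*x2*x3*x4 = +1.
in_half = x[1] * x[2] * x[3] * x[4] == 1

# x1*x2*x3*x4*x5 is indistinguishable from x5 in the half fraction.
fifth_order_equals_x5 = np.array_equal(
    (x[1] * x[2] * x[3] * x[4] * x[5])[in_half], x[5][in_half])

# x2*x3*x4 is indistinguishable from x1 in the half fraction.
third_order_equals_x1 = np.array_equal(
    (x[2] * x[3] * x[4])[in_half], x[1][in_half])

# In the full design the same two columns differ.
distinct_in_full = not np.array_equal(x[2] * x[3] * x[4], x[1])

print(fifth_order_equals_x5, third_order_equals_x1, distinct_in_full)
```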

The sequence of variables selected as the interaction effect to be used as the experimental design basis is called the generator, and is denoted $I$:

\begin{equation}
I=x_1x_2x_3x_4
\end{equation}

and we set $I$=+1 or $I$=−1.
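
Instead of filtering the full 64-run table, a half fraction can also be built directly from the identity. Multiplying both sides of x$_1$x$_2$x$_3$x$_4$=+1 by x$_4$ gives x$_4$=x$_1$x$_2$x$_3$, so we can enumerate a full factorial in the other five factors and compute x$_4$ for each run. The sketch below assumes $I$=+1; the variable names are illustrative.

```python
import itertools

# Factors that are varied freely; x4 is derived from the identity.
base_factors = ['x1', 'x2', 'x3', 'x5', 'x6']

half_fraction = []
for levels in itertools.product([-1, 1], repeat=len(base_factors)):
    run = dict(zip(base_factors, levels))
    run['x4'] = run['x1'] * run['x2'] * run['x3']   # from x1*x2*x3*x4 = +1
    half_fraction.append(run)

print(len(half_fraction))
```

The result contains the same 32 runs as the filtered table, possibly in a different order. For $I$=-1, the sign of the derived column would be flipped.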

In [5]:
# Defining multiple DOE matrices:

# DOE 1 based on identity I = x1 x2 x3 x4 = +1
doe1 = doe[doe['x1-x2-x3-x4']==1]

# DOE 2 based on identity I = x4 x5 x6 = -1
doe2 = doe[doe['x4-x5-x6']==-1]

# DOE 3 based on identity I = x2 x4 x5 = -1
doe3 = doe[doe['x2-x4-x5']==-1]

In [6]:
doe1[column_labs].T

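|   | 0 | 3 | 5 | 6 | 9 | 10 | 12 | 15 | 16 | 19 | ... | 44 | 47 | 48 | 51 | 53 | 54 | 57 | 58 | 60 | 63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x1 | -1 | 1 | 1 | -1 | 1 | -1 | -1 | 1 | -1 | 1 | ... | -1 | 1 | -1 | 1 | 1 | -1 | 1 | -1 | -1 | 1 |
| x2 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | ... | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 |
| x3 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | ... | 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 |
| x4 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | -1 | -1 | ... | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 |
| x5 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | ... | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| x6 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |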


In [7]:
doe2[column_labs].T

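|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 24 | 25 | ... | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | ... | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 |
| x2 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | ... | 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 |
| x3 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | -1 | -1 | ... | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 |
| x4 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | ... | 1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| x5 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | ... | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| x6 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |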


In [8]:
doe3[column_labs].T

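|   | 0 | 1 | 4 | 5 | 10 | 11 | 14 | 15 | 18 | 19 | ... | 46 | 47 | 50 | 51 | 54 | 55 | 56 | 57 | 60 | 61 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | ... | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 |
| x2 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 |
| x3 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | ... | 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 |
| x4 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | -1 | -1 | ... | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 |
| x5 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | ... | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| x6 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |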


Each of the dataframes above represents a different fractional factorial design.

## $\frac{1}{4}$ Fractional Designs
To further reduce the number of experiments, two identities can be used. The number of experiments is cut in half for each identity. We already have one identity,
\begin{equation}
I=x_1x_2x_3x_4=+1
\end{equation}

now let's define another one:
\begin{equation}
I_2=x_4x_5x_6=1
\end{equation}

Our resulting factorial matrix can be reduced the same way. In Python, we use NumPy's logical_and function to ensure that both conditions are satisfied.

In [9]:
quarter_fractional_doe = doe[ np.logical_and( doe['x1-x2-x3-x4']==1, doe['x4-x5-x6']==1 ) ]
print("Number of experiments: %d"%(len(quarter_fractional_doe[column_labs])))
quarter_fractional_doe[column_labs].T

Number of experiments: 16


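|   | 9 | 10 | 12 | 15 | 16 | 19 | 21 | 22 | 32 | 35 | 37 | 38 | 57 | 58 | 60 | 63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x1 | 1 | -1 | -1 | 1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 | -1 | 1 | -1 | -1 | 1 |
| x2 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 | 1 |
| x3 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 |
| x4 | 1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 |
| x5 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 |
| x6 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |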


With the quarter-fractional factorial design, what information do we lose? We know already which interaction effects are aliased with main effects:

\begin{equation}
x_4x_5x_6=(+1) \\
\end{equation}

\begin{equation}
(x_4x_5x_6)x_1=(+1)x_1 \\
\end{equation}

\begin{equation}
x_1x_4x_5x_6=x_1 \\
\end{equation}

\begin{equation}
x_2x_4x_5x_6=x_2 \\
\end{equation}

\begin{equation}
x_3x_4x_5x_6=x_3
\end{equation}

We can use this information to design our experiments to cover particular interaction effects we know to be important, or ignore others we don't expect to be significant.
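
To see everything that is aliased in the quarter fraction, it helps to note that the two identities imply a third one: their product, x$_1$x$_2$x$_3$x$_5$x$_6$=+1 (the x$_4^2$ term cancels). The sketch below represents each effect as a set of factor indices, so that multiplying two effects becomes a symmetric difference, and lists the aliases of each main effect. It assumes both identities are set to +1, as in the design above; the helper names are made up for this example.

```python
from itertools import combinations

def multiply(a, b):
    """Product of two effects; squared factors cancel (x_i * x_i = +1)."""
    return frozenset(a) ^ frozenset(b)

def label(effect):
    """Readable name such as 'x1x2x3'; the empty product is the identity I."""
    return ''.join('x%d' % i for i in sorted(effect)) or 'I'

generators = [frozenset({1, 2, 3, 4}), frozenset({4, 5, 6})]

# Defining relation: the product of every non-empty subset of the generators.
defining_words = set()
for r in range(1, len(generators) + 1):
    for subset in combinations(generators, r):
        word = frozenset()
        for g in subset:
            word = multiply(word, g)
        defining_words.add(word)

def aliases(effect):
    """Effects that share a column with `effect` in the fractional design."""
    names = [label(multiply(effect, w)) for w in defining_words]
    return sorted(names, key=lambda s: (len(s), s))

for i in range(1, 7):
    print(label({i}), '=', ' = '.join(aliases({i})))
```

Note that x$_4$, x$_5$ and x$_6$ each turn out to be aliased with a two-variable interaction (for example x$_4$ with x$_5$x$_6$), which is the price of using a three-variable identity.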

# Other designs

[DOEPY](https://doepy.readthedocs.io/en/latest/) and [pydoe](https://pypi.org/project/pyDOE/) have many other experimental designs.  

Install doepy using pip

In [10]:
!pip install doepy


## Load the modules we need

In [11]:
from doepy import read_write, build

### Full factorial design
Let’s say you have a design problem with the following parameter ranges. Imagine this as a generic example of a chemical process in a manufacturing plant. You have 3 levels of Pressure, 3 levels of Temperature, 2 levels of Flow rate, and 2 levels of Time.

In [12]:
build.full_fact(
{'Pressure':[40,55,70],
'Temperature':[290, 320, 350],
'Flow rate':[0.2,0.4],
'Time':[5,8]}
)

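|   | Pressure | Temperature | Flow rate | Time |
|---|---|---|---|---|
| 0 | 40.0 | 290.0 | 0.0 | 5.0 |
| 1 | 55.0 | 290.0 | 0.0 | 5.0 |
| 2 | 70.0 | 290.0 | 0.0 | 5.0 |
| 3 | 40.0 | 320.0 | 0.0 | 5.0 |
| 4 | 55.0 | 320.0 | 0.0 | 5.0 |
| 5 | 70.0 | 320.0 | 0.0 | 5.0 |
| 6 | 40.0 | 350.0 | 0.0 | 5.0 |
| 7 | 55.0 | 350.0 | 0.0 | 5.0 |
| 8 | 70.0 | 350.0 | 0.0 | 5.0 |
| 9 | 40.0 | 290.0 | 1.0 | 5.0 |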
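
For comparison, a mixed-level full factorial can be enumerated with the standard library alone. This is a plain-Python sketch of the idea and not how doepy implements `full_fact`; it keeps the actual level values of each factor and gives 3 × 3 × 2 × 2 = 36 runs.

```python
import itertools

levels = {
    'Pressure': [40, 55, 70],
    'Temperature': [290, 320, 350],
    'Flow rate': [0.2, 0.4],
    'Time': [5, 8],
}
names = list(levels)

# Iterate over the factors in reverse so that the first factor varies fastest,
# then flip each combination back into the original factor order.
runs = [
    dict(zip(names, combo[::-1]))
    for combo in itertools.product(*[levels[n] for n in reversed(names)])
]

print(len(runs))
for run in runs[:3]:
    print(run)
```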


### Latin Hypercube design
Sometimes, a set of randomized design points within a given range can be attractive for the experimenter to assess the impact of the process variables on the output. Monte Carlo simulations are a close example of this approach.

However, a Latin Hypercube design is a better choice for experimental design than a completely random matrix, as it subdivides the sample space into smaller cells and chooses only one element out of each subcell. This way, a more uniform spreading of the random sample points can be obtained.

The user can choose the density of sample points. For example, if we choose to generate a Latin Hypercube of 12 experiments from the same input variables, that could look like this:

In [13]:
build.space_filling_lhs(
{'Pressure':[40,55,70],
'Temperature':[290, 320, 350],
'Flow rate':[0.2,0.4],
'Time':[5,11]},
num_samples = 12
)

Pressure had more than two levels. Assigning the end point to the high level.
Temperature had more than two levels. Assigning the end point to the high level.


|   | Pressure | Temperature | Flow rate | Time |
|---|---|---|---|---|
| 0 | 64.545455 | 306.363636 | 0.2 | 5.545455 |
| 1 | 67.272727 | 311.818182 | 0.254545 | 5.0 |
| 2 | 48.181818 | 333.636364 | 0.236364 | 9.909091 |
| 3 | 45.454545 | 317.272727 | 0.363636 | 8.272727 |
| 4 | 61.818182 | 350.0 | 0.4 | 9.363636 |
| 5 | 59.090909 | 300.909091 | 0.218182 | 6.090909 |
| 6 | 53.636364 | 339.090909 | 0.290909 | 7.181818 |
| 7 | 56.363636 | 295.454545 | 0.327273 | 11.0 |
| 8 | 50.909091 | 344.545455 | 0.381818 | 7.727273 |
| 9 | 40.0 | 328.181818 | 0.345455 | 6.636364 |


Of course, there is no guarantee that you will get the same matrix if you run this function, because the points are randomly sampled, but you get the idea!
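
To make the one-point-per-cell idea concrete, here is a minimal NumPy sketch of a basic Latin hypercube sampler. It is not the algorithm behind `build.space_filling_lhs` (which additionally optimizes the spread of the points); the function `latin_hypercube` and its `seed` argument are assumptions made for this illustration. Passing a seed makes the result repeatable.

```python
import numpy as np

def latin_hypercube(bounds, num_samples, seed=None):
    """Basic Latin hypercube: one point per equal-width stratum of each factor."""
    rng = np.random.default_rng(seed)
    names = list(bounds)
    samples = np.empty((num_samples, len(names)))
    for j, name in enumerate(names):
        low, high = bounds[name]
        # One uniformly random point inside each of the num_samples strata of [0, 1).
        points = (np.arange(num_samples) + rng.random(num_samples)) / num_samples
        # Shuffle so that the strata are paired at random across factors.
        rng.shuffle(points)
        samples[:, j] = low + points * (high - low)
    return names, samples

bounds = {
    'Pressure': (40, 70),
    'Temperature': (290, 350),
    'Flow rate': (0.2, 0.4),
    'Time': (5, 11),
}
names, lhs = latin_hypercube(bounds, num_samples=12, seed=0)
print(names)
print(lhs)
```

Each column visits every one of its 12 strata exactly once, which is what distinguishes this from drawing 12 fully independent random points.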

There are other functions to try. Try any one of the following designs:

- Full factorial: build.full_fact()
- 2-level fractional factorial: build.frac_fact_res()
- Plackett-Burman: build.plackett_burman()
- Sukharev grid: build.sukharev()
- Box-Behnken: build.box_behnken()
- Box-Wilson (Central-composite) with face-centered option: build.central_composite() with face='ccf' option
- Box-Wilson (Central-composite) with inscribed option: build.central_composite() with face='cci' option
- Box-Wilson (Central-composite) with circumscribed option: build.central_composite() with face='ccc' option
- Latin hypercube (simple): build.lhs()
- Latin hypercube (space-filling): build.space_filling_lhs()
- Random k-means cluster: build.random_k_means()
- Maximin reconstruction: build.maximin()
- Halton sequence based: build.halton()
- Uniform random matrix: build.uniform_random()

After performing the experiments, we analyze the results in the same way as in the previous tutorials.