# Introduction

This is a short guide on how to generate *Plackett-Burman* designs for screening and how to compute the main effects of factors using a linear model fit. For more information, check the [documentation](https://phrb.github.io/ExperimentalDesign.jl/dev/).

## Setup

First, check that you are in the correct project environment. It should be `ExperimentalDesign`:

In [18]:
using Pkg
Pkg.status()

Project ExperimentalDesign v0.2.0
    Status `~/.julia/dev/ExperimentalDesign/Project.toml`
  [a93c6f00] DataFrames v0.20.2
  [864edb3b] DataStructures v0.17.10
  [31c24e10] Distributions v0.22.6
  [ffbed154] DocStringExtensions v0.8.1
  [e30172f5] Documenter v0.24.6
  [38e38edf] GLM v1.3.7
  [27ebfcd6] Primes v0.4.0
  [2913bbd2] StatsBase v0.32.2
  [3eaba693] StatsModels v0.6.10
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [9a3f8284] Random
  [8dfed614] Test


Then check if all packages are installed and up to date:

In [20]:
Pkg.update()

  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Updating `~/.julia/dev/ExperimentalDesign/Project.toml`
 [no changes]
  Updating `~/.julia/dev/ExperimentalDesign/Manifest.toml`
 [no changes]


In [25]:
using ExperimentalDesign, StatsModels, GLM, DataFrames

# Generating Plackett-Burman Designs

A Plackett-Burman design is an orthogonal design matrix for factors $f_1,\dots,f_N$. Factors are encoded by high and low values, which can be mapped to the interval $[-1, 1]$. For designs in this package, the design matrix is a `DataFrame` from the [DataFrames package](https://dataframes.juliadata.org/stable/). For example, let's create a Plackett-Burman design for 6 factors:

In [4]:
design = PlackettBurman(6)
design.matrix

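| Row | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | -1 | 1 | -1 | 1 | 1 | -1 | -1 |
| 3 | 1 | -1 | 1 | 1 | -1 | -1 | -1 |
| 4 | -1 | 1 | 1 | -1 | -1 | -1 | 1 |
| 5 | 1 | 1 | -1 | -1 | -1 | 1 | -1 |
| 6 | 1 | -1 | -1 | -1 | 1 | -1 | 1 |
| 7 | -1 | -1 | -1 | 1 | -1 | 1 | 1 |
| 8 | -1 | -1 | 1 | -1 | 1 | 1 | -1 |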


Note that it is not possible to construct exact Plackett-Burman designs for all numbers of factors. In the example above, we needed a seventh extra "dummy" column to construct the design for six factors.
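The orthogonality of the design above can be checked by hand. The sketch below copies the eight rows of the printed matrix into a plain array (with a real design, `Matrix(design.matrix)` would presumably give the same values) and verifies that every pair of distinct columns has a zero dot product, so that $X^\top X = 8I$. The names `X`, `gram` and `is_orthogonal` are illustrative only.

```julia
using LinearAlgebra

# Rows copied from the printed 8-run design (factor1..factor6, dummy1).
X = [ 1  1  1  1  1  1  1;
     -1  1 -1  1  1 -1 -1;
      1 -1  1  1 -1 -1 -1;
     -1  1  1 -1 -1 -1  1;
      1  1 -1 -1 -1  1 -1;
      1 -1 -1 -1  1 -1  1;
     -1 -1 -1  1 -1  1  1;
     -1 -1  1 -1  1  1 -1]

# Gram matrix: entry (i, j) is the dot product of columns i and j.
gram = X' * X

# Orthogonal ±1 columns give 8 on the diagonal and 0 everywhere else.
is_orthogonal = gram == 8 * Matrix{Int}(I, 7, 7)
println(is_orthogonal)  # prints true
```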

The `PlackettBurman` constructor allows quick construction of minimal screening designs for scenarios where we ignore interactions. We can access the underlying formula, which is a `FormulaTerm` object from the [StatsModels package](https://juliastats.org/StatsModels.jl/stable/):

In [5]:
println(design.formula)

response ~ -1 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6 + dummy1


Notice we ignore interactions and include the dummy factor in the model. Strong main effects attributed to dummy factors may indicate important interactions.
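To see why a dummy column can pick up an interaction, consider the 8-run matrix printed earlier. In that particular matrix, the elementwise product of the `factor1` and `factor5` columns is identical to the `dummy1` column, so an interaction between those two factors would appear as an effect of the dummy. The sketch below checks this with plain arrays. The variable names are illustrative, and the observation applies to this specific matrix, not to Plackett-Burman designs in general.

```julia
# Rows copied from the printed 8-run design (factor1..factor6, dummy1).
X = [ 1  1  1  1  1  1  1;
     -1  1 -1  1  1 -1 -1;
      1 -1  1  1 -1 -1 -1;
     -1  1  1 -1 -1 -1  1;
      1  1 -1 -1 -1  1 -1;
      1 -1 -1 -1  1 -1  1;
     -1 -1 -1  1 -1  1  1;
     -1 -1  1 -1  1  1 -1]

# Column that a factor1 x factor5 interaction term would add to the model.
interaction_15 = X[:, 1] .* X[:, 5]

# Dot product of the interaction column with each design column, per run.
# A value of 1 means the interaction is fully aliased with that column.
alias = X' * interaction_15 ./ 8
println(alias)  # prints [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```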

We can obtain a tuple with the names of dummy factors:

In [6]:
design.dummy_factors

(:dummy1,)

We can also get the main factors tuple:

In [7]:
design.factors

(:factor1, :factor2, :factor3, :factor4, :factor5, :factor6)

You can check other constructors on [the docs](https://phrb.github.io/ExperimentalDesign.jl/dev/lib/public/#ExperimentalDesign.PlackettBurman-Tuple{Int64}).

# Computing Main Effects

Suppose that the response variable on the experiments specified in our screening design is computed by:

$$
y = 1.2 + (2.3f_1) + (-3.4f_2) + (7.12f_3) + (-0.03f_4) + (1.1f_5) + (-0.5f_6) + \varepsilon
$$

The coefficients we want to estimate are:

| Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 |
|---|---|---|---|---|---|---|
| 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 |

The corresponding Julia function is:

In [8]:
function y(x)
    return (1.2) +
           (2.3 * x[1]) +
           (-3.4 * x[2]) +
           (7.12 * x[3]) +
           (-0.03 * x[4]) +
           (1.1 * x[5]) +
           (-0.5 * x[6]) +
           (1.1 * randn())
end

y (generic function with 1 method)
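Because `y` draws fresh noise from `randn()` on every call, repeated runs of the cells below give different responses. One possible way to make a run repeatable is to pass a seeded random number generator explicitly. This is only a sketch: `y_seeded` is a hypothetical variant introduced here, not part of the package, and the seed `42` is arbitrary.

```julia
using Random

# Hypothetical variant of `y` that takes the random number generator explicitly.
function y_seeded(x, rng::AbstractRNG)
    return 1.2 + 2.3 * x[1] - 3.4 * x[2] + 7.12 * x[3] -
           0.03 * x[4] + 1.1 * x[5] - 0.5 * x[6] + 1.1 * randn(rng)
end

x = [1, 1, 1, 1, 1, 1]

# Two generators created with the same seed produce the same noise term.
first_draw = y_seeded(x, MersenneTwister(42))
second_draw = y_seeded(x, MersenneTwister(42))
println(first_draw == second_draw)  # prints true
```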

We can compute the response column for our design using the cell below. Recall that the default is to call the response column `:response`.

In [26]:
design.matrix[!, :response] = y.(eachrow(design.matrix[:, collect(design.factors)]))
design.matrix

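| Row | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 | response |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 7.65914 |
| 2 | -1 | 1 | -1 | 1 | 1 | -1 | -1 | -10.3243 |
| 3 | 1 | -1 | 1 | 1 | -1 | -1 | -1 | 13.3226 |
| 4 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 2.33105 |
| 5 | 1 | 1 | -1 | -1 | -1 | 1 | -1 | -7.77123 |
| 6 | 1 | -1 | -1 | -1 | 1 | -1 | 1 | -0.120945 |
| 7 | -1 | -1 | -1 | 1 | -1 | 1 | 1 | -6.20142 |
| 8 | -1 | -1 | 1 | -1 | 1 | 1 | -1 | 9.9957 |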


Now, we use the `lm` function from the [GLM package](https://juliastats.org/GLM.jl/stable/) to fit a linear model using the design's matrix and formula:

In [10]:
lm(design.formula, design.matrix)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

response ~ 0 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6 + dummy1

Coefficients:
─────────────────────────────────────────────────────────────────────────
          Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
factor1   2.07425      1.60867   1.28942     0.4199   -18.3659    22.5144
factor2  -3.76248      1.60867  -2.33887     0.2572   -24.2026    16.6776
factor3   6.92894      1.60867   4.30724     0.1452   -13.5112    27.3691
factor4   0.165416     1.60867   0.102828    0.9348   -20.2747    20.6055
factor5   1.26416      1.60867   0.78584     0.5760   -19.176     21.7043
factor6  -0.26782      1.60867  -0.166485    0.8950   -20.7079    20.1723
dummy1   -0.409582     1.60867  -0.254609    0.8413   -20.8497    2

The table below shows the coefficients estimated by the linear model fit using the Plackett-Burman Design. The purpose of a screening design is not to estimate the actual coefficients, but instead to compute factor main effects. Note that standard errors are the same for every factor estimate. This happens because the design is orthogonal.

| | Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|---|
| Original | 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 | $-$ |
| Plackett-Burman Main Effects | $-$ | 2.07425 | -3.76248 | 6.92894 | 0.165416 | 1.26416 | -0.26782 | -0.409582 |

We can use the coefficient magnitudes to infer that factor 3 probably has a strong main effect, and that factor 6 probably does not. Our dummy column had a relatively small coefficient estimate, so we could attempt to ignore interactions in subsequent experiments.
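The equal standard errors follow from the least-squares covariance formula. As a sketch, write $X$ for the $8 \times 7$ matrix of $\pm 1$ entries, $\sigma^2$ for the noise variance, and $\hat{\beta}$ for the vector of coefficient estimates (symbols introduced here for illustration). Orthogonal columns give $X^\top X = 8I$, so:

$$
\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1} = \frac{\sigma^2}{8} I
$$

Every coefficient therefore has the same variance. The width of the intervals comes from the degrees of freedom: 8 runs minus 7 coefficients leaves 1, and the 97.5% quantile of a $t$ distribution with 1 degree of freedom is about 12.706, which gives a half-width of:

$$
12.706 \times 1.60867 \approx 20.44
$$

This agrees with the printed interval for `factor1`, which is $2.07425 \pm 20.44$.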

# Fitting a Linear Model

We can also try to fit a linear model on our design data in order to estimate coefficients. We would need to drop the dummy column and add the intercept term:

In [44]:
lm(term(:response) ~ sum(term.(design.factors)), design.matrix)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

response ~ 1 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6

Coefficients:
────────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error      t value  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept)   1.11133      0.194373    5.71751      0.1102  -1.35841     3.58107
factor1       2.16107      0.194373   11.1182       0.0571  -0.308673    4.63081
factor2      -3.13766      0.194373  -16.1425       0.0394  -5.6074     -0.66792
factor3       7.2158       0.194373   37.1235       0.0171   4.74606     9.68554
factor4       0.0026864    0.194373    0.0138208    0.9912  -2.46706     2.47243
factor5       0.691073     0.194373    3.5554       0.1745  -1.77867     3.16081
factor6      

Our table so far looks like this:

| | Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|---|
| Original | 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 | $-$ |
| Plackett-Burman Main Effects | $-$ | 2.07425 | -3.76248 | 6.92894 | 0.165416 | 1.26416 | -0.26782 | -0.409582 |
| Plackett-Burman Estimate | 1.11133 | 2.16107 | -3.13766 | 7.2158 | 0.0026864 | 0.691073 | -0.190781 | $-$ |

Since the standard errors are the same for all factors, factors with stronger main effects are estimated more precisely in relative terms. Despite the "good" coefficient estimates, the confidence intervals are very wide: with 8 runs and 7 coefficients, only one residual degree of freedom is left to estimate the error.

This is a biased comparison where the screening design "works" for coefficient estimation as well, but we would rather use fractional factorial or optimal designs to estimate the coefficients of factors with strong effects. Screening should be used to compute main effects and identify which factors to test next.
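Because the design columns are orthogonal and each sums to zero, the least-squares estimates for the intercept model reduce to simple averages: the intercept is the mean response, and the coefficient of factor $j$ is $\frac{1}{8}\sum_i x_{ij} y_i$. The sketch below checks this using the factor columns and the `response` values copied from the tables in this guide. The names `X`, `responses`, `coefficients` and `least_squares` are illustrative.

```julia
# Factor columns copied from the printed 8-run design (factor1..factor6).
X = [ 1  1  1  1  1  1;
     -1  1 -1  1  1 -1;
      1 -1  1  1 -1 -1;
     -1  1  1 -1 -1 -1;
      1  1 -1 -1 -1  1;
      1 -1 -1 -1  1 -1;
     -1 -1 -1  1 -1  1;
     -1 -1  1 -1  1  1]

# Response column copied from the printed design table.
responses = [7.65914, -10.3243, 13.3226, 2.33105,
             -7.77123, -0.120945, -6.20142, 9.9957]

# Closed-form estimates for an orthogonal ±1 design with 8 runs.
intercept = sum(responses) / 8
coefficients = X' * responses ./ 8

# General least-squares solve with an explicit intercept column, for comparison.
least_squares = hcat(ones(8), X) \ responses

println(intercept)
println(coefficients)
```

Up to rounding of the copied responses, the values agree with the `Estimate` column of the fit with an intercept, for example about 1.1113 for the intercept and about 2.1611 for `factor1`.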

# Generating Random Designs

We can also compare the coefficients produced by the same linear model fit, but using a random design. For more information, check [the docs](https://phrb.github.io/ExperimentalDesign.jl/dev/lib/public/#ExperimentalDesign.RandomDesign-Tuple{NamedTuple}).

In [59]:
using Distributions  # provides `Uniform`
random_design_generator = RandomDesign(tuple(fill(Uniform(-1, 1), 6)...))
random_design = rand(random_design_generator, 8)

random_design[!, :response] = y.(eachrow(random_design[:, :]))
random_design

| Row | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | response |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | -0.427031 | -0.257169 | -0.0897106 | -0.294582 | -0.558218 | -0.393435 | 0.476253 |
| 2 | 0.603076 | -0.254854 | -0.665052 | -0.730241 | 0.646543 | -0.501497 | -2.19561 |
| 3 | -0.271607 | -0.280673 | -0.403478 | 0.141009 | 0.279953 | -0.107181 | -3.4743 |
| 4 | -0.188718 | -0.865998 | -0.545536 | -0.823412 | 0.418322 | -0.912559 | -1.00517 |
| 5 | 0.545568 | 0.912846 | 0.70278 | 0.210541 | -0.865226 | 0.554044 | 2.33168 |
| 6 | 0.894642 | 0.553883 | 0.638578 | 0.304812 | 0.553461 | 0.283002 | 6.69112 |
| 7 | 0.139338 | 0.108341 | -0.352455 | -0.27172 | 0.891287 | -0.226327 | 0.125907 |
| 8 | -0.275804 | 0.317014 | -0.992121 | -0.893386 | -0.182471 | 0.5529 | -6.32122 |


In [55]:
lm(random_design_generator.formula, random_design)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

response ~ 1 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6

Coefficients:
─────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)   2.00407     0.769246   2.60524     0.2333   -7.77013   11.7783 
factor1       0.970606    0.974792   0.995706    0.5014  -11.4153    13.3565 
factor2      -1.99341     1.04673   -1.90441     0.3078  -15.2934    11.3066 
factor3       8.72217     2.01669    4.32499     0.1447  -16.9023    34.3467 
factor4      -2.35305     1.40052   -1.68012     0.3418  -20.1484    15.4423 
factor5       0.85805     1.80866    0.474413    0.7180  -22.1231    23.8392 
factor6      -3.83447     0.995831  -3.8

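Now, our table looks like this. The random design row was recorded from a different random draw than the fit printed above, so its values do not match that output: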

| | Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|---|
| Original | 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 | $-$ |
| Plackett-Burman Main Effects | $-$ | 2.07425 | -3.76248 | 6.92894 | 0.165416 | 1.26416 | -0.26782 | -0.409582 |
| Plackett-Burman Estimate | 1.11133 | 2.16107 | -3.13766 | 7.2158 | 0.0026864 | 0.691073 | -0.190781 | $-$ |
| Single Random Design Estimate | -0.330846 | -11.0202 | 7.32197 | 12.225 | 11.8927 | 19.5051 | 7.6392 | $-$ |

The estimates produced using random designs vary more from one design to the next, and therefore have wider confidence intervals. The Plackett-Burman design is fixed, but can be randomised. The variability of main effects estimates using screening designs will depend on measurement or model error.
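One way to see where the extra variability comes from: the variance of each least-squares estimate is the noise variance times the matching diagonal entry of $(X^\top X)^{-1}$. For an orthogonal $\pm 1$ design with 8 runs every entry is $1/8$, and no 8-run design with values in $[-1, 1]$ can go below that. The sketch below computes these entries for the random design printed above, with an intercept column added. The names `R`, `M` and `variance_factors` are illustrative.

```julia
using LinearAlgebra

# Factor columns copied from the printed random design (factor1..factor6).
R = [-0.427031 -0.257169 -0.0897106 -0.294582 -0.558218 -0.393435;
      0.603076 -0.254854 -0.665052  -0.730241  0.646543 -0.501497;
     -0.271607 -0.280673 -0.403478   0.141009  0.279953 -0.107181;
     -0.188718 -0.865998 -0.545536  -0.823412  0.418322 -0.912559;
      0.545568  0.912846  0.70278    0.210541 -0.865226  0.554044;
      0.894642  0.553883  0.638578   0.304812  0.553461  0.283002;
      0.139338  0.108341 -0.352455  -0.27172   0.891287 -0.226327;
     -0.275804  0.317014 -0.992121  -0.893386 -0.182471  0.5529]

# Model matrix: intercept column followed by the six factor columns.
M = hcat(ones(8), R)

# Multipliers of the noise variance for each coefficient estimate.
variance_factors = diag(inv(M' * M))

println(variance_factors)
println(all(variance_factors .>= 1 / 8))
```

Entries well above $1/8$ mean the same noise level yields less precise estimates than the orthogonal design would give.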

# Generating Full Factorial Designs

In this toy example, it is possible to generate all the possible combinations of six binary factors and compute the response. Although it costs 64 experiments, the linear model fit for the full factorial design should produce the best coefficient estimates.

The `explicit` parameter below makes the full factorial a complete `DataFrame` in memory. Since factorial designs can be prohibitively large, you can omit the `explicit` parameter, or set it to `false`, to create an iterator that will generate design lines on demand. For more, check [the docs](https://phrb.github.io/ExperimentalDesign.jl/dev/lib/public/#ExperimentalDesign.FullFactorial-Tuple{NamedTuple,StatsModels.FormulaTerm}).
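As a rough illustration of the difference between an explicit design and an on-demand iterator, the sketch below uses only Base Julia. It is not the package's implementation, and the names `levels`, `lazy_rows` and `explicit_rows` are illustrative.

```julia
# Two levels for each of six factors (assumed encoding: -1 low, 1 high).
levels = fill((-1, 1), 6)

# Lazy: rows are produced one at a time as the iterator is consumed.
lazy_rows = Iterators.product(levels...)

# Explicit: all 2^6 rows are materialised in memory at once.
explicit_rows = vec(collect(lazy_rows))

println(length(explicit_rows))  # prints 64
println(first(lazy_rows))       # prints (-1, -1, -1, -1, -1, -1)
```

With six factors the explicit form is small, but the number of rows doubles with each added two-level factor, which is when the lazy form becomes useful.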

In [58]:
factorial_design = FullFactorial(tuple(fill((-1, 1), 6)...), explicit = true)
factorial_design.matrix

factorial_design.matrix[!, :response] = y.(eachrow(factorial_design.matrix[:, :]))
factorial_design.matrix

lm(factorial_design.formula, factorial_design.matrix)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

response ~ 1 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6

Coefficients:
───────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error     t value  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────────
(Intercept)   1.16675      0.142544    8.18519     <1e-10   0.881308   1.45219 
factor1       2.19939      0.142544   15.4296      <1e-21   1.91396    2.48483 
factor2      -3.35801      0.142544  -23.5578      <1e-30  -3.64345   -3.07257 
factor3       7.23984      0.142544   50.7903      <1e-48   6.9544     7.52528 
factor4      -0.0439413    0.142544   -0.308265    0.7590  -0.32938    0.241498
factor5       0.947299     0.142544    6.64568     <1e-7    0.66186    1.23274 
factor6      -0.43131 

The confidence intervals for this fit are much smaller. Since we have all information on all factors and this is a balanced design, the standard error is the same for all estimates. Here's the complete table:

| | Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|---|
| Original | 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 | $-$ |
| Plackett-Burman Main Effects | $-$ | 2.07425 | -3.76248 | 6.92894 | 0.165416 | 1.26416 | -0.26782 | -0.409582 |
| Plackett-Burman Estimate | 1.11133 | 2.16107 | -3.13766 | 7.2158 | 0.0026864 | 0.691073 | -0.190781 | $-$ |
| Single Random Design Estimate | -0.330846 | -11.0202 | 7.32197 | 12.225 | 11.8927 | 19.5051 | 7.6392 | $-$ |
| Full Factorial Estimate | 1.16313 | 2.23662 | -3.55696 | 7.03516 | -0.102358 | 0.987681 | -0.44809 | $-$ |
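A back-of-the-envelope check on the standard errors, assuming the noise standard deviation $\sigma = 1.1$ used in `y` and an orthogonal $\pm 1$ design with $N$ runs, for which each coefficient has standard error $\sigma / \sqrt{N}$:

$$
\frac{1.1}{\sqrt{64}} = 0.1375, \qquad \frac{1.1}{\sqrt{8}} \approx 0.389
$$

The full factorial fit reports 0.142544, close to the first value, since its 57 residual degrees of freedom estimate $\sigma$ well. The 8-run fits report 1.60867 and 0.194373, both far from the second value, because a single residual degree of freedom leaves $\sigma$ poorly estimated.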

Full factorial designs may be too expensive in actual applications. Fractional factorial designs or optimal designs can be used to decrease costs while still providing good estimates. Screening designs are extremely cheap, and can help determine which factors can potentially be dropped on more expensive and precise designs.

Check the examples directory for more tutorials!