# Introduction

This is a short guide on generating *Plackett-Burman* designs for screening and computing the main effects of factors with a linear model fit. For more information, check the [documentation](https://phrb.github.io/ExperimentalDesign.jl/dev/).

## Setup

First, check that the active project environment is the correct one. It should be `ExperimentalDesign`:

In [1]:
using Pkg
Pkg.status()

[36m[1mProject [22m[39mExperimentalDesign v0.3.0
[32m[1m    Status[22m[39m `~/.julia/dev/ExperimentalDesign/Project.toml`
 [90m [a93c6f00][39m[37m DataFrames v0.21.3[39m
 [90m [864edb3b][39m[37m DataStructures v0.17.18[39m
 [90m [31c24e10][39m[37m Distributions v0.23.4[39m
 [90m [ffbed154][39m[37m DocStringExtensions v0.8.2[39m
 [90m [e30172f5][39m[37m Documenter v0.24.11[39m
 [90m [38e38edf][39m[37m GLM v1.3.9[39m
 [90m [27ebfcd6][39m[37m Primes v0.5.0[39m
 [90m [2913bbd2][39m[37m StatsBase v0.33.0[39m
 [90m [3eaba693][39m[37m StatsModels v0.6.11[39m
 [90m [37e2e46d][39m[37m LinearAlgebra [39m
 [90m [56ddb016][39m[37m Logging [39m
 [90m [9a3f8284][39m[37m Random [39m
 [90m [8dfed614][39m[37m Test [39m


Then check if all packages are installed and up to date:

In [2]:
Pkg.update()

[32m[1m  Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m  Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[32m[1m  Updating[22m[39m `~/.julia/dev/ExperimentalDesign/Project.toml`
[90m [no changes][39m
[32m[1m  Updating[22m[39m `~/.julia/dev/ExperimentalDesign/Manifest.toml`
[90m [no changes][39m


In [2]:
using ExperimentalDesign, StatsModels, GLM, DataFrames, Distributions, Random

┌ Info: Precompiling ExperimentalDesign [4babbea4-9e7d-11e9-116f-e1ada04bd296]
└ @ Base loading.jl:1273
┌ Info: Precompiling GLM [38e38edf-8417-5370-95a0-9cbb8c7f171a]
└ @ Base loading.jl:1273


# Generating Plackett-Burman Designs

A Plackett-Burman design is an orthogonal design matrix for factors $f_1,\dots,f_N$. Each factor is encoded by a high and a low value, which can be mapped to $1$ and $-1$, the endpoints of the interval $[-1, 1]$. For designs in this package, the design matrix is a `DataFrame` from the [DataFrames package](https://dataframes.juliadata.org/stable/). For example, let's create a Plackett-Burman design for 6 factors:

In [3]:
design = PlackettBurman(6)
design.matrix

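| Row | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | -1 | 1 | -1 | 1 | 1 | -1 | -1 |
| 3 | 1 | -1 | 1 | 1 | -1 | -1 | -1 |
| 4 | -1 | 1 | 1 | -1 | -1 | -1 | 1 |
| 5 | 1 | 1 | -1 | -1 | -1 | 1 | -1 |
| 6 | 1 | -1 | -1 | -1 | 1 | -1 | 1 |
| 7 | -1 | -1 | -1 | 1 | -1 | 1 | 1 |
| 8 | -1 | -1 | 1 | -1 | 1 | 1 | -1 |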


Note that an exact Plackett-Burman design does not exist for every number of factors. In the example above, a seventh "dummy" column was needed to construct the design for six factors.
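As a quick sanity check, the orthogonality of the design above can be verified with plain linear algebra. The sketch below does not call the package: it copies the eight printed rows by hand into a hypothetical matrix `X` and checks that all pairs of distinct columns have zero inner product, which is equivalent to $X^\top X = 8I$.

```julia
using LinearAlgebra

# Rows copied by hand from the design printed above;
# columns are factor1..factor6 followed by dummy1.
X = [ 1  1  1  1  1  1  1
     -1  1 -1  1  1 -1 -1
      1 -1  1  1 -1 -1 -1
     -1  1  1 -1 -1 -1  1
      1  1 -1 -1 -1  1 -1
      1 -1 -1 -1  1 -1  1
     -1 -1 -1  1 -1  1  1
     -1 -1  1 -1  1  1 -1]

gram = X' * X                                 # 7x7 matrix of column inner products
is_orthogonal = gram == 8 * Matrix(I, 7, 7)   # true when columns are mutually orthogonal
is_balanced = all(iszero, sum(X, dims = 1))   # each column has four +1 and four -1 entries
```

Both checks hold for this design, which is what later makes all standard errors in the main-effects fit identical.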

Using the `PlackettBurman` constructor enables quick construction of minimal screening designs for scenarios where we ignore interactions. We can access the underlying formula, which is a `FormulaTerm` object from the [StatsModels package](https://juliastats.org/StatsModels.jl/stable/):

In [4]:
println(design.formula)

0 ~ -1 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6 + dummy1


Notice we ignore interactions and include the dummy factor in the model. Strong main effects attributed to dummy factors may indicate important interactions.
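To see why a dummy column can pick up an interaction, consider the eight-run design printed earlier. In the sketch below, the matrix `X` is copied by hand from that output and all variable names are illustrative. The elementwise product of the `factor1` and `factor5` columns reproduces the `dummy1` column exactly, so a `factor1:factor5` interaction, if present, would be attributed to `dummy1` by the main-effects model.

```julia
# Rows copied by hand from the 8-run design printed earlier;
# columns are factor1..factor6 followed by dummy1.
X = [ 1  1  1  1  1  1  1
     -1  1 -1  1  1 -1 -1
      1 -1  1  1 -1 -1 -1
     -1  1  1 -1 -1 -1  1
      1  1 -1 -1 -1  1 -1
      1 -1 -1 -1  1 -1  1
     -1 -1 -1  1 -1  1  1
     -1 -1  1 -1  1  1 -1]

# Column that a factor1:factor5 interaction term would add to the model matrix.
interaction_15 = X[:, 1] .* X[:, 5]
aliased_with_dummy = interaction_15 == X[:, 7]

# Aliasing also occurs between real factors: factor1 .* factor2 equals factor6.
aliased_with_factor6 = X[:, 1] .* X[:, 2] == X[:, 6]
```

Both comparisons are true for this particular design, which is why a large dummy coefficient is a hint, rather than proof, that an interaction is present.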

We can obtain a tuple with the names of dummy factors:

In [5]:
design.dummy_factors

(:dummy1,)

We can also get the main factors tuple:

In [6]:
design.factors

(:factor1, :factor2, :factor3, :factor4, :factor5, :factor6)

You can check other constructors on [the docs](https://phrb.github.io/ExperimentalDesign.jl/dev/lib/public/#ExperimentalDesign.PlackettBurman-Tuple{Int64}).

# Computing Main Effects

Suppose that the response variable on the experiments specified in our screening design is computed by:

$$
y = 1.2 + (2.3f_1) + (-3.4f_2) + (7.12f_3) + (-0.03f_4) + (1.1f_5) + (-0.5f_6) + \varepsilon
$$

The coefficients we want to estimate are:

| Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 |
|---|---|---|---|---|---|---|
| 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 |

The corresponding Julia function, with $\varepsilon$ drawn from a normal distribution with mean 0 and standard deviation 1.1, is:

In [7]:
function y(x)
    return (1.2) +
           (2.3 * x[1]) +
           (-3.4 * x[2]) +
           (7.12 * x[3]) +
           (-0.03 * x[4]) +
           (1.1 * x[5]) +
           (-0.5 * x[6]) +
           (1.1 * randn())
end

y (generic function with 1 method)

We can compute the response column for our design using the cell below. Recall that the response column is called `:response` by default. We set the random seed before each batch of calls to `y(x)`, so that we analyse the same results on every run. Try different seeds to observe the variability of the estimates.

In [8]:
Random.seed!(192938)

design.matrix[!, :response] = y.(eachrow(design.matrix[:, collect(design.factors)]))
design.matrix

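| Row | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 | response |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8.53078 |
| 2 | -1 | 1 | -1 | 1 | 1 | -1 | -1 | -11.1296 |
| 3 | 1 | -1 | 1 | 1 | -1 | -1 | -1 | 14.0101 |
| 4 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 2.30315 |
| 5 | 1 | 1 | -1 | -1 | -1 | 1 | -1 | -7.15356 |
| 6 | 1 | -1 | -1 | -1 | 1 | -1 | 1 | -0.972234 |
| 7 | -1 | -1 | -1 | 1 | -1 | 1 | 1 | -7.84967 |
| 8 | -1 | -1 | 1 | -1 | 1 | 1 | -1 | 9.78249 |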


Now, we use the `lm` function from the [GLM package](https://juliastats.org/GLM.jl/stable/) to fit a linear model using the design's matrix and formula:

In [15]:
lm(term(:response) ~ design.formula.rhs, design.matrix)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

response ~ 0 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6 + dummy1

Coefficients:
───────────────────────────────────────────────────────────────────────────
           Estimate  Std. Error     t value  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
factor1   2.66359      0.940183   2.83305      0.2160   -9.28257   14.6097 
factor2  -2.80249      0.940183  -2.98079      0.2061  -14.7486     9.14367
factor3   7.71644      0.940183   8.20739      0.0772   -4.22971   19.6626 
factor4  -0.0497774    0.940183  -0.0529443    0.9663  -11.9959    11.8964 
factor5   0.612681     0.940183   0.651661     0.6323  -11.3335    12.5588 
factor6  -0.112675     0.940183  -0.119844     0.9241  -12.0588    11.8335 
dummy1   -0.437176     0.940183  -0.46499      0.

The table below shows the coefficients estimated by the linear model fit using the Plackett-Burman Design. The purpose of a screening design is not to estimate the actual coefficients, but instead to compute factor main effects. Note that standard errors are the same for every factor estimate. This happens because the design is orthogonal.

| | Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|---|
| Original | 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 | $-$ |
| Plackett-Burman Main Effects | $-$ | 2.66359 | -2.80249 | 7.71644 | -0.0497774 | 0.612681 | -0.112675 | -0.437176 |

We can use the coefficient magnitudes to infer that factor 3 probably has a strong main effect, and that factor 6 does not. Our dummy column had a relatively small coefficient estimate, so we could attempt to ignore interactions in subsequent experiments.
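Because the columns of the design are orthogonal, with $X^\top X = 8I$, the least-squares estimates reduce to $\hat{\beta} = X^\top y / 8$ and every estimate has the same variance $\sigma^2 / 8$. The sketch below reproduces the main effects with Base Julia only. Here `X` and `y` are hypothetical names for values copied by hand from the rounded output above, so agreement with `lm` is only up to rounding.

```julia
# Design rows copied by hand from the output above;
# columns are factor1..factor6 followed by dummy1.
X = [ 1  1  1  1  1  1  1
     -1  1 -1  1  1 -1 -1
      1 -1  1  1 -1 -1 -1
     -1  1  1 -1 -1 -1  1
      1  1 -1 -1 -1  1 -1
      1 -1 -1 -1  1 -1  1
     -1 -1 -1  1 -1  1  1
     -1 -1  1 -1  1  1 -1]

# Response column as printed above (rounded to six significant digits).
y = [8.53078, -11.1296, 14.0101, 2.30315, -7.15356, -0.972234, -7.84967, 9.78249]

# With X'X = 8I, least squares reduces to a signed average of the responses per column.
effects = X' * y ./ 8

# General least-squares solve, for comparison.
effects_ls = Float64.(X) \ y
```

The third entry of `effects` comes out near 7.716 and the last near -0.437, in line with the `factor3` and `dummy1` rows of the fit.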

# Fitting a Linear Model

We can also try to fit a linear model on our design data in order to estimate coefficients. We would need to drop the dummy column and add the intercept term:

In [16]:
lm(term(:response) ~ sum(term.(design.factors)), design.matrix)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

response ~ 1 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6

Coefficients:
──────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   0.940183     0.437176   2.15058     0.2771   -4.61466    6.49503
factor1       2.66359      0.437176   6.09271     0.1036   -2.89126    8.21843
factor2      -2.80249      0.437176  -6.41043     0.0985   -8.35733    2.75236
factor3       7.71644      0.437176  17.6507      0.0360    2.1616    13.2713 
factor4      -0.0497774    0.437176  -0.113861    0.9278   -5.60462    5.50507
factor5       0.612681     0.437176   1.40145     0.3946   -4.94217    6.16753
factor6      -0.112675     0.43

Our table so far looks like this:

| | Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|---|
| Original | 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 | $-$ |
| Plackett-Burman Main Effects | $-$ | 2.66359 | -2.80249 | 7.71644 | -0.0497774 | 0.612681 | -0.112675 | -0.437176 |
| Plackett-Burman Estimate | 0.940183 | 2.66359 | -2.80249 | 7.71644 | -0.0497774 | 0.612681 | -0.112675 | $-$ |

Since the standard errors are the same for all factors, factors with stronger main effects are estimated with smaller relative error. Despite the "good" coefficient estimates, the confidence intervals are very wide.
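The width of these intervals follows from the degrees of freedom: eight runs minus seven estimated parameters (the intercept plus six factors) leaves a single residual degree of freedom. A Student's $t$ distribution with one degree of freedom is the standard Cauchy distribution, so its 97.5% quantile has the closed form $\tan(0.475\pi) \approx 12.71$. The sketch below uses illustrative variable names and values copied from the output above to reproduce the interval reported for `factor3`.

```julia
n_runs, n_params = 8, 7
residual_df = n_runs - n_params        # a single residual degree of freedom

# A t distribution with 1 degree of freedom is Cauchy,
# whose quantile function is tan(pi * (p - 1/2)).
t_crit = tan(pi * (0.975 - 0.5))       # about 12.71

std_error = 0.437176                   # shared standard error, copied from the fit above
estimate = 7.71644                     # factor3 estimate, copied from the fit above

half_width = t_crit * std_error
interval = (estimate - half_width, estimate + half_width)
```

For comparison, a design with 64 runs and the same seven parameters would leave 57 residual degrees of freedom, where the critical value is close to 2.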

This is a biased comparison in which the screening design "works" for coefficient estimation as well, but we would rather use fractional factorial or optimal designs to estimate the coefficients of factors with strong effects. Screening should be used to compute main effects and to identify which factors to test next.

# Generating Random Designs

We can also compare the coefficients produced by the same linear model fit, but using a random design. For more information, check [the docs](https://phrb.github.io/ExperimentalDesign.jl/dev/lib/public/#ExperimentalDesign.RandomDesign-Tuple{NamedTuple}).

In [20]:
Random.seed!(8418172)

design_distribution = DesignDistribution(DiscreteNonParametric([-1, 1], [0.5, 0.5]), 6)
random_design = rand(design_distribution, 8)

random_design.matrix[!, :response] = y.(eachrow(random_design.matrix[:, :]))
random_design.matrix

| Row | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | response |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 1.0 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 14.8616 |
| 2 | 1.0 | -1.0 | 1.0 | 1.0 | -1.0 | -1.0 | 11.8434 |
| 3 | -1.0 | 1.0 | 1.0 | -1.0 | -1.0 | -1.0 | 2.64702 |
| 4 | 1.0 | -1.0 | 1.0 | -1.0 | 1.0 | -1.0 | 15.5183 |
| 5 | 1.0 | 1.0 | 1.0 | -1.0 | 1.0 | 1.0 | 7.76413 |
| 6 | 1.0 | -1.0 | 1.0 | 1.0 | 1.0 | -1.0 | 15.6221 |
| 7 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 | -1.0 | -3.63797 |
| 8 | 1.0 | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -9.45331 |


In [13]:
lm(term(:response) ~ random_design.formula.rhs, random_design.matrix)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

response ~ 1 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6

Coefficients:
─────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)   0.600392    0.738868   0.812583    0.5656   -8.78781    9.9886 
factor1       2.2371      0.639878   3.49613     0.1774   -5.89333   10.3675 
factor2      -2.56857     1.04492   -2.45816     0.2460  -15.8455    10.7084 
factor3       8.05743     0.738868  10.9051      0.0582   -1.33077   17.4456 
factor4       0.140622    1.04492    0.134577    0.9148  -13.1363    13.4175 
factor5       0.907918    0.97743    0.928882    0.5235  -11.5115    13.3273 
factor6      -0.600354    0.738868  -0.8

Now, our table looks like this:

| | Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|---|
| Original | 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 | $-$ |
| Plackett-Burman Main Effects | $-$ | 2.66359 | -2.80249 | 7.71644 | -0.0497774 | 0.612681 | -0.112675 | -0.437176 |
| Plackett-Burman Estimate | 0.940183 | 2.66359 | -2.80249 | 7.71644 | -0.0497774 | 0.612681 | -0.112675 | $-$ |
| Single Random Design Estimate | 0.600392 | 2.2371 | -2.56857 | 8.05743 | 0.140622 | 0.907918 | -0.600354 | $-$ |

Estimates produced using random designs tend to have wider confidence intervals, and therefore increased variability, because randomly sampled columns are generally not orthogonal. The Plackett-Burman design is fixed, but can be randomised. The variability of main-effect estimates obtained with screening designs depends on measurement or model error.
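One way to see where the extra variability comes from is to inspect the Gram matrix $X^\top X$ of the sampled design. The sketch below copies the eight rows of the random design printed above into a hypothetical matrix `D`, prepends an intercept column, and computes the factors $\sqrt{[(X^\top X)^{-1}]_{jj}}$ that multiply the residual standard deviation to give each standard error. It is a check on the printed matrix only and does not call the package.

```julia
using LinearAlgebra

# Rows copied by hand from the random design printed above; columns are factor1..factor6.
D = [ 1 -1  1  1  1  1
      1 -1  1  1 -1 -1
     -1  1  1 -1 -1 -1
      1 -1  1 -1  1 -1
      1  1  1 -1  1  1
      1 -1  1  1  1 -1
     -1 -1 -1 -1  1 -1
      1  1 -1 -1 -1 -1]

X = hcat(ones(Int, 8), D)              # model matrix with an intercept column
gram = X' * X

is_orthogonal = isdiag(gram)           # false here: the sampled columns are correlated
se_factors = sqrt.(diag(inv(gram)))    # per-coefficient standard-error multipliers
orthogonal_reference = 1 / sqrt(8)     # value of every multiplier if X'X were 8I
```

For this sample, the multipliers differ from one coefficient to the next and all exceed `orthogonal_reference`, so for the same noise level each coefficient is estimated less precisely than with an orthogonal eight-run design.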

# Generating Full Factorial Designs

In this toy example, it is possible to generate all the possible combinations of six binary factors and compute the response. Although it costs 64 experiments, the linear model fit for the full factorial design should produce the best coefficient estimates.
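The 64 runs can also be enumerated with Base Julia alone, which makes the balance of the design easy to verify. The following sketch does not use the package constructor, and `levels`, `runs` and `X` are illustrative names.

```julia
using LinearAlgebra

levels = fill([-1, 1], 6)                          # two levels for each of six factors

# Cartesian product of the level sets: one tuple per experiment.
runs = vec(collect(Iterators.product(levels...)))
n_runs = length(runs)                              # 2^6 = 64

# Stack the runs into a 64x6 matrix and prepend an intercept column.
X = hcat(ones(Int, n_runs), permutedims(reduce(hcat, collect.(runs))))

# All seven columns are mutually orthogonal, so X'X = 64I.
is_orthogonal = X' * X == n_runs * Matrix(I, 7, 7)
```

With $X^\top X = 64I$, every coefficient has the same variance $\sigma^2 / 64$, compared with $\sigma^2 / 8$ for the eight-run screening design.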

The simplest full factorial design constructor receives an array of possible factor levels. For more, check [the docs](https://phrb.github.io/ExperimentalDesign.jl/dev/lib/public/#ExperimentalDesign.FullFactorial-Tuple{NamedTuple,StatsModels.FormulaTerm}).

In [24]:
Random.seed!(2989476)

factorial_design = FullFactorial(fill([-1, 1], 6))
factorial_design.matrix[!, :response] = y.(eachrow(factorial_design.matrix[:, :]))

lm(term(:response) ~ factorial_design.formula.rhs, factorial_design.matrix)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

response ~ 1 + factor1 + factor2 + factor3 + factor4 + factor5 + factor6

Coefficients:
──────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error    t value  Pr(>|t|)  Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   1.13095     0.123021    9.1932     <1e-12   0.88461    1.3773   
factor1       2.23668     0.123021   18.1813     <1e-24   1.99034    2.48303  
factor2      -3.4775      0.123021  -28.2675     <1e-34  -3.72384   -3.23115  
factor3       6.95531     0.123021   56.5377     <1e-51   6.70897    7.20166  
factor4      -0.160546    0.123021   -1.30503    0.1971  -0.406891   0.0857987
factor5       0.975471    0.123021    7.92932    <1e-10   0.729127   1.22182  
factor6      -0.357748    0.123

The confidence intervals for this fit are much smaller. Since we have all information on all factors and this is a balanced design, the standard error is the same for all estimates. Here's the complete table:

| | Intercept | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | dummy1 |
|---|---|---|---|---|---|---|---|---|
| Original | 1.2 | 2.3 | -3.4 | 7.12 | -0.03 | 1.1 | -0.5 | $-$ |
| Plackett-Burman Main Effects | $-$ | 2.66359 | -2.80249 | 7.71644 | -0.0497774 | 0.612681 | -0.112675 | -0.437176 |
| Plackett-Burman Estimate | 0.940183 | 2.66359 | -2.80249 | 7.71644 | -0.0497774 | 0.612681 | -0.112675 | $-$ |
| Single Random Design Estimate | 0.600392 | 2.2371 | -2.56857 | 8.05743 | 0.140622 | 0.907918 | -0.600354 | $-$ |
| Full Factorial Estimate | 1.13095 | 2.23668 | -3.4775 | 6.95531 | -0.160546 | 0.975471 | -0.357748 | $-$ |

Full factorial designs may be too expensive in actual applications. Fractional factorial designs or optimal designs can be used to decrease costs while still providing good estimates. Screening designs are extremely cheap, and can help determine which factors can potentially be dropped on more expensive and precise designs.

Check the examples directory for more tutorials!