# **Chapter 4. Design of Experiments**

## **4.1. Fundamentals of Design of Experiments**

**Introduction:**

Design of Experiments (DOE) is a systematic method used to plan, conduct, analyze, and interpret controlled tests to evaluate the factors that may influence a particular outcome or response variable. The primary goal of DOE is to determine cause-and-effect relationships by manipulating input variables and observing the resulting changes in output responses. It is a crucial aspect of scientific research, engineering, and industrial processes, enabling researchers and practitioners to make informed decisions based on empirical evidence.

**Key Concepts:**

1. **Factors and Levels:**

   - **Factors** are independent variables that are systematically manipulated during the experiment to observe their effect on the response variable. Factors can be categorical (e.g., type of material) or continuous (e.g., temperature).
   - **Levels** are the specific values or settings of each factor used in the experiment. For example, a temperature factor might have levels of 50°C, 75°C, and 100°C.

2. **Response Variable:**

   - The **response variable** (dependent variable) is the outcome or characteristic being measured in the experiment, which is expected to change due to variations in the factors. It is the primary focus of the experimental study.

3. **Experimental Units:**

   - **Experimental units** are the smallest division of the experimental material such that any two units may receive different treatments. They are the entities to which treatments are applied independently.

4. **Treatments:**

   - A **treatment** is a specific combination of factor levels whose effect is to be compared with other treatments. Each treatment represents a unique experimental condition.

5. **Randomization:**

   - **Randomization** is the practice of assigning treatments to experimental units by chance to reduce bias. It ensures that the experiment does not systematically favor one treatment over another and helps in balancing out the effects of lurking variables.

6. **Replication:**

   - **Replication** involves repeating the experimental conditions multiple times. Replicates provide an estimate of the experimental error and increase the precision of the experiment by reducing the impact of random variability.

7. **Blocking:**

   - **Blocking** is a technique used to account for variability among experimental units by grouping similar units together. Within each block, treatments are randomized, which helps to control for known sources of variation.

8. **Experimental Error:**

   - **Experimental error** refers to the unexplained variation in the response variable that cannot be attributed to the factors being studied. It encompasses all unknown and uncontrolled influences on the response.

9. **Interactions:**

   - **Interaction effects** occur when the effect of one factor on the response variable depends on the level of another factor. Identifying interactions is essential for understanding complex relationships between factors.
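As an illustration of how randomization and blocking fit together, the sketch below runs every treatment once within each block, in an order decided by chance. It uses only the standard library. The treatment labels, block names, and the helper `randomize_within_blocks` are hypothetical and chosen only for this example.

```python
import random

def randomize_within_blocks(treatments, blocks, seed=None):
    """Return {block: treatments in random run order}, one replicate per block."""
    rng = random.Random(seed)
    plan = {}
    for block in blocks:
        order = list(treatments)  # copy so the input list is not modified
        rng.shuffle(order)        # chance decides the run order inside the block
        plan[block] = order
    return plan

# Hypothetical example: three temperature levels, two batches of raw material
treatments = ['50C', '75C', '100C']
blocks = ['Batch 1', 'Batch 2']
plan = randomize_within_blocks(treatments, blocks, seed=42)

for block, order in plan.items():
    print(block, order)
```

Because every treatment appears in every block, differences between batches affect all treatments equally and do not bias the comparison among them.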

**Principles of Experimental Design:**

1. **Control:**

   - Control unwanted sources of variation to isolate the effect of the factors of interest. This includes maintaining consistent environmental conditions and using control groups when appropriate.

2. **Randomization:**

   - Randomly assign treatments to experimental units to avoid systematic biases and ensure that the treatment groups are comparable.

3. **Replication:**

   - Use sufficient replicates to obtain reliable estimates of the treatment effects and to increase the statistical power of the experiment.

4. **Blocking:**

   - Group similar experimental units together to control for sources of variability that are not of primary interest.

5. **Factorial Design:**

   - Study all possible combinations of factor levels to fully understand the effects and interactions among factors.

**Importance of DOE:**

- **Efficiency:** DOE helps in obtaining maximum information with minimal resources by carefully planning experiments.
- **Validity:** By controlling extraneous variables and using randomization, DOE strengthens the validity and reliability of the results.
- **Optimization:** Facilitates the identification of optimal conditions for a process or system by systematically exploring the effects of factors.
- **Prediction:** Provides a basis for modeling and predicting the behavior of a system under various conditions.
- **Decision Making:** Empowers researchers and practitioners to make data-driven decisions based on statistically significant findings.

**Process of Conducting an Experiment:**

1. **Define Objectives:**

   - Clearly state the purpose of the experiment, research questions, and hypotheses.

2. **Select Factors and Levels:**

   - Determine which factors to study and the levels at which they will be tested.

3. **Choose Experimental Design:**

   - Select an appropriate design structure (e.g., completely randomized design, factorial design) based on the objectives and constraints.

4. **Randomize and Assign Treatments:**

   - Randomly assign treatments to experimental units while considering replication and blocking if necessary.

5. **Collect Data:**

   - Conduct the experiment according to the design plan and collect response data meticulously.

6. **Analyze Data:**

   - Use statistical methods (e.g., ANOVA, regression analysis) to analyze the data and test hypotheses.

7. **Interpret Results:**

   - Draw conclusions from the analysis, assess the validity of the findings, and identify practical implications.

8. **Report Findings:**

   - Document the methodology, results, and conclusions in a clear and transparent manner.

**Challenges in DOE:**

- **Resource Constraints:** Limited time, budget, or materials may restrict the scope of the experiment.
- **Complex Interactions:** Identifying and interpreting higher-order interactions can be challenging.
- **Assumption Violations:** The validity of statistical tests depends on certain assumptions (e.g., normality, homogeneity of variance) which may not always hold.
- **External Validity:** Ensuring that the findings are generalizable beyond the experimental conditions.

**Summary:**

The Fundamentals of Design of Experiments provide a framework for conducting structured and efficient experiments that yield meaningful insights into the relationships between factors and responses. By adhering to the principles of control, randomization, replication, and blocking, researchers can minimize bias, reduce variability, and enhance the reliability of their conclusions. Understanding these foundational concepts is essential before delving into specific experimental designs and methodologies.

## **4.2. Factorial Design**

### **4.2.1. Full Factorial Design**

**Introduction:**

A **Full Factorial Design** is an experimental design technique used to study the effect of multiple factors on a response variable simultaneously. In a full factorial experiment, all possible combinations of factor levels are tested. This allows for the investigation of not only the individual effect of each factor but also the interaction effects between factors.

**Key Concepts:**

1. **Factors and Levels:**

   - **Factors:** Independent variables that are manipulated during the experiment.
   - **Levels:** The specific values or settings of each factor.

2. **Treatment Combinations:**

   - In a full factorial design with $k$ factors, each at $n$ levels, there are $n^k$ treatment combinations.

3. **Main Effects and Interaction Effects:**

   - **Main Effects:** The effect of an individual factor on the response variable.
   - **Interaction Effects:** The effect on the response variable due to the interaction between two or more factors.
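The number of treatment combinations can be checked by enumerating them. The sketch below uses only the standard library, and the factor names and levels are hypothetical. When factors have different numbers of levels, the count is the product of the level counts, which reduces to $n^k$ when all $k$ factors have $n$ levels.

```python
from itertools import product
from math import prod

# Hypothetical factors with 2, 3, and 2 levels
factors = {
    'Temperature': ['Low', 'High'],
    'Pressure': ['Low', 'Medium', 'High'],
    'Catalyst': ['Type 1', 'Type 2'],
}

# Every combination of one level per factor
combinations = list(product(*factors.values()))

# Product of the level counts: 2 * 3 * 2
expected = prod(len(levels) for levels in factors.values())

print(len(combinations), expected)
for combo in combinations:
    print(dict(zip(factors, combo)))
```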

**Advantages of Full Factorial Design:**

- **Comprehensive Analysis:** All possible combinations are tested, providing a complete picture of the effects.
- **Interaction Detection:** Enables the study of interactions between factors.
- **Efficiency with Limited Factors and Levels:** Particularly useful when the number of factors and levels is manageable.

**Disadvantages:**

- **Resource Intensive:** The number of experiments grows exponentially with the number of factors and levels, leading to increased time and cost.
- **Complexity:** Analyzing data from many runs can be complex.

**Visualization of Full Factorial Design:**

If you haven't installed `pyDOE3`, you can install it using:

In [None]:
!pip install pyDOE3

To view the visualization, you will also need `plotly`, which you can install using:

In [None]:
!pip install plotly

In [None]:
import numpy as np
import pandas as pd
from pyDOE3 import fullfact
import plotly.express as px

# Define the number of levels for each factor
levels = [2, 2, 2]  # Three factors, each at two levels

# Generate the full factorial design
design = fullfact(levels)

# Adjust levels from 0 and 1 to actual levels (-1 and 1)
design = design * 2 - 1  # Maps 0 to -1 and 1 to 1

# Create a DataFrame
df = pd.DataFrame(design, columns=['Factor 1', 'Factor 2', 'Factor 3'])

# Create the 3D interactive scatter plot
fig = px.scatter_3d(
    df,
    x='Factor 1',
    y='Factor 2',
    z='Factor 3',
    title='3D Scatter Plot of Full Factorial Design Points',
    labels={'Factor 1': 'Factor 1', 'Factor 2': 'Factor 2', 'Factor 3': 'Factor 3'},
    width=700,
    height=500
)

# Customize the marker appearance
fig.update_traces(marker=dict(size=5))

# Show the plot
fig.show()

**Example Scenario:**

Suppose a researcher wants to study the effect of temperature and pressure on the yield of a chemical reaction. Temperature has two levels (Low, High), and pressure has two levels (Low, High). A full factorial design would require testing all four possible combinations:

1. Low Temperature, Low Pressure
2. Low Temperature, High Pressure
3. High Temperature, Low Pressure
4. High Temperature, High Pressure
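For a two-level design such as this one, the main effects and the interaction can be computed by hand from the four responses. The yields below are made-up numbers used only for illustration, and the convention applied (average response at the high level minus average response at the low level) is one common definition of an effect.

```python
# Hypothetical yields (%) keyed by coded levels (temperature, pressure),
# with -1 = Low and +1 = High
yields = {(-1, -1): 60.0, (-1, +1): 65.0, (+1, -1): 70.0, (+1, +1): 85.0}

def effect(contrast):
    """Average response where contrast is +1 minus average where it is -1."""
    high = [y for run, y in yields.items() if contrast(run) == +1]
    low = [y for run, y in yields.items() if contrast(run) == -1]
    return sum(high) / len(high) - sum(low) / len(low)

temperature_effect = effect(lambda run: run[0])
pressure_effect = effect(lambda run: run[1])
# The interaction contrast is the product of the two coded columns
interaction_effect = effect(lambda run: run[0] * run[1])

print(temperature_effect, pressure_effect, interaction_effect)
```

With these numbers, raising the temperature adds 10 points of yield at low pressure but 20 points at high pressure. The interaction effect is half of that difference.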

**Python Implementation using pyDOE3:**

We will use the `pyDOE3` library to generate a full factorial design. The `pyDOE3` library is a Design of Experiments package for Python, which allows us to create various experimental designs, including full factorial designs.

**Import Required Libraries:**

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
from pyDOE3 import fullfact
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

**Creating a Full Factorial Design:**

Suppose we have the following factors:

- **Factor A:** Temperature with 2 levels (Low, High)
- **Factor B:** Pressure with 2 levels (Low, High)
- **Factor C:** Catalyst type with 2 levels (Type 1, Type 2)

**Step 1: Define the Levels for Each Factor**

First, we define the number of levels for each factor.

In [None]:
# Number of levels for each factor
levels = [2, 2, 2]  # [Levels of Factor A, Levels of Factor B, Levels of Factor C]

**Step 2: Generate the Design Matrix**

Use the `fullfact` function to generate the factorial design.

In [None]:
# Generate the factorial design
design = fullfact(levels)
print("Design Matrix:")
print(design)

**Step 3: Map Numeric Levels to Actual Factor Levels**

By default, `fullfact` generates levels starting from 0. We'll map these to actual factor levels.

In [None]:
# Actual levels for each factor
temperature_levels = ['Low', 'High']
pressure_levels = ['Low', 'High']
catalyst_levels = ['Type 1', 'Type 2']

# Create a DataFrame to store the design with actual factor levels
df_design = pd.DataFrame(design, columns=['Temperature', 'Pressure', 'Catalyst'])

# Map numeric levels to actual levels
df_design['Temperature'] = df_design['Temperature'].apply(lambda x: temperature_levels[int(x)])
df_design['Pressure'] = df_design['Pressure'].apply(lambda x: pressure_levels[int(x)])
df_design['Catalyst'] = df_design['Catalyst'].apply(lambda x: catalyst_levels[int(x)])

print("Full Factorial Design:")
print(df_design)

**Step 4: Add Response Data**

Assume we conduct the experiments and collect the response variable, e.g., the yield percentage of the chemical reaction.

In [None]:
# For demonstration, we'll simulate some response data
np.random.seed(42)  # For reproducibility
df_design['Yield'] = np.random.uniform(0, 100, size=len(df_design))

print("Design with Response:")
print(df_design)

**Step 5: Analyze the Results Using Linear Regression**

We can perform linear regression to analyze the effects of the factors on the response variable.

In [None]:
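# Build the formula with main effects and two-factor interactions.
# The three-factor interaction is left out: with 8 unreplicated runs, the full
# model would use all 8 degrees of freedom and leave none to estimate error.
formula = 'Yield ~ (C(Temperature) + C(Pressure) + C(Catalyst))**2'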

# Fit the model
model = ols(formula, data=df_design).fit()

# Display the model summary
model.summary()

**Step 6: Analyze the Results Using ANOVA**

We can perform an ANOVA to analyze the effects of the factors on the response variable.

In [None]:
# ANOVA
anova_table = sm.stats.anova_lm(model, typ=1)
print("ANOVA Results:")
print(anova_table)

**Interpreting the Results:**

- **sum_sq:** Sum of squares due to each factor.
- **df:** Degrees of freedom.
- **F:** F-statistic value.
- **PR(>F):** P-value corresponding to the F-statistic.

A significant p-value (typically less than 0.05) indicates that the factor or interaction has a statistically significant effect on the response variable.
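To see where the F-statistic comes from, the sketch below works through a one-factor case by hand using only the standard library. The replicated yields are made-up numbers. The p-value is not computed here because it requires the F distribution, which the standard library does not provide.

```python
from statistics import mean

# Hypothetical replicated yields at two temperature levels
groups = {'Low': [61.0, 63.0, 62.0], 'High': [70.0, 72.0, 74.0]}

all_values = [y for ys in groups.values() for y in ys]
grand_mean = mean(all_values)

# Variation between level means (factor) and within levels (residual)
ss_factor = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
ss_resid = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)

df_factor = len(groups) - 1
df_resid = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_stat = (ss_factor / df_factor) / (ss_resid / df_resid)
print(ss_factor, ss_resid, df_factor, df_resid, f_stat)
```

A large F means the variation between factor levels is large relative to the variation among replicates at the same level.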

**Visualization:**

We can visualize the main effects and interaction effects.

**Main Effects Plot:**

In [None]:
from statsmodels.graphics.factorplots import interaction_plot

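# Main effects plot for Temperature: mean yield at each level
temperature_means = df_design.groupby('Temperature')['Yield'].mean().reindex(['Low', 'High'])

plt.figure(figsize=(8, 6))
plt.scatter(df_design['Temperature'], df_design['Yield'], alpha=0.5, label='Individual runs')
plt.plot(temperature_means.index, temperature_means.values, marker='o', color='red', label='Level mean')
plt.title('Main Effects Plot for Temperature')
plt.xlabel('Temperature')
plt.ylabel('Yield')
plt.legend()
plt.show()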

**Interaction Plot:**

In [None]:
# Interaction plot between Temperature and Pressure
# Pass the axes explicitly; otherwise interaction_plot opens its own figure
fig, ax = plt.subplots(figsize=(8, 6))
interaction_plot(df_design['Pressure'], df_design['Temperature'], df_design['Yield'],
                 colors=['red', 'blue'], markers=['D', '^'], ms=10, ax=ax)
ax.set_title('Interaction Plot between Temperature and Pressure')
ax.set_xlabel('Pressure')
ax.set_ylabel('Yield')
plt.show()

**Example Summary:**

In this example, we:

1. **Defined the factors and their levels.**
2. **Generated a full factorial design using pyDOE3.**
3. **Mapped numeric levels to actual factor levels for clarity.**
4. **Simulated response data (yield).**
5. **Fitted a linear regression model and performed ANOVA to analyze the effects of factors and their interactions on the yield.**
6. **Visualized the main effects and interaction effects.**

**Conclusion:**

The full factorial design allows us to comprehensively study the effects of multiple factors and their interactions on a response variable. By conducting experiments at all possible combinations of factor levels, we gain valuable insights into how factors influence the outcome.

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 1</b></p>

**Exercise:** Full Factorial Design Analysis

1. **Scenario:**

   A manufacturer wants to optimize the tensile strength of a new polymer. They are considering three factors:

   - **Factor A:** Polymer Type (Type 1, Type 2)
   - **Factor B:** Curing Time (30 minutes, 60 minutes, 90 minutes)
   - **Factor C:** Temperature (150°C, 175°C)

2. **Tasks:**

   a. **Design the Experiment:**

      - Use the full factorial design to create the experimental runs.
      - List all the treatment combinations.

   b. **Simulate Response Data:**

      - Assume you conducted the experiments and collected the tensile strength measurements (in MPa).
      - For simplicity, simulate the data using a function or random numbers, considering that higher temperatures and longer curing times might improve tensile strength.

   c. **Data Preparation:**

      - Organize the data into a pandas DataFrame with columns: `Polymer_Type`, `Curing_Time`, `Temperature`, `Tensile_Strength`.

   d. **Perform ANOVA:**

      - Use Python to perform an ANOVA to analyze the effects of the factors and their interactions on tensile strength.
      - Use an appropriate model formula considering main effects and interactions.

   e. **Interpret the Results:**

      - Identify which factors significantly affect the tensile strength.
      - Discuss any significant interaction effects.

   f. **Visualization (Bonus):**

      - Create main effects plots and interaction plots to visualize the results.

3. **Questions:**

   - Based on the analysis, what combination of factors would you recommend to maximize tensile strength?
   - How do interaction effects influence your recommendations?

### **4.2.2. Fractional Factorial Design**

**Introduction:**

A **Fractional Factorial Design** is an experimental design strategy used when it is impractical or too costly to conduct experiments with all possible combinations of factors (as in a full factorial design). Fractional factorial designs are a subset (fraction) of a full factorial design and are used to reduce the number of experimental runs while still providing valuable information about the most significant factors and interactions affecting the response.

**Why Use Fractional Factorial Designs?**

- **Resource Efficiency:** When the number of factors increases, the total number of runs in a full factorial design grows exponentially ($2^k$ for $k$ factors with two levels each). Fractional factorial designs reduce this number to a manageable size.
- **Focus on Main Effects and Low-Order Interactions:** Researchers often assume that higher-order interactions (interactions involving many factors simultaneously) are negligible, allowing them to omit certain combinations without losing significant information.
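The saving in runs is simple arithmetic. The sketch below tabulates it for two-level designs, where a $2^{k-p}$ fraction uses $2^{k-p}$ runs. The function names are hypothetical.

```python
def runs_full(k):
    """Runs in a full two-level factorial with k factors."""
    return 2 ** k

def runs_fractional(k, p):
    """Runs in a 2^(k-p) fractional factorial."""
    return 2 ** (k - p)

print('k  full  half  quarter')
for k in (4, 5, 7, 11):
    print(k, runs_full(k), runs_fractional(k, 1), runs_fractional(k, 2))
```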

**Key Concepts:**

1. **Resolution of Designs:**

   - **Resolution III Design:** Main effects are not confounded with each other, but they are confounded with two-factor interactions.
   - **Resolution IV Design:** Main effects are unconfounded with two-factor interactions, but two-factor interactions may be confounded with each other.
   - **Resolution V Design:** Main effects and two-factor interactions are unconfounded with each other, but two-factor interactions may be confounded with three-factor interactions.

2. **Aliasing and Confounding:**

   - **Alias Structure:** In fractional factorial designs, certain effects are confounded (aliased) with each other, meaning they cannot be independently estimated from the experimental data.
   - **Defining Relation:** The set of effect words that equal the identity column $I$ in the design (for example, $I = ABCD$). It determines the full aliasing pattern, indicating which effects are confounded.

3. **Design Generators:**

   - **Generators:** Equations that determine which combinations of factor levels are included in the fractional design.
   - **Defining Contrasts:** Used to construct the fraction of the full factorial design.
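The alias of an effect can be found by multiplying it by the defining word, where any letter that appears twice cancels (for example, $A \times A = I$). The sketch below assumes a four-factor half fraction with generator $D = ABC$, so the defining relation is $I = ABCD$. The helper `multiply` is hypothetical.

```python
from itertools import combinations

def multiply(word_a, word_b):
    """Multiply two effect words; letters present in both cancel (A*A = I)."""
    letters = set(word_a) ^ set(word_b)  # symmetric difference
    return ''.join(sorted(letters)) or 'I'

defining_word = 'ABCD'  # assumed: generator D = ABC, i.e. I = ABCD
factors = 'ABCD'

# Aliases of all main effects and two-factor interactions
aliases = {}
for order in (1, 2):
    for combo in combinations(factors, order):
        effect = ''.join(combo)
        aliases[effect] = multiply(effect, defining_word)

# With a single defining word, the resolution is the length of that word
resolution = len(defining_word)

for effect, alias in aliases.items():
    print(f'{effect} = {alias}')
print('Resolution:', resolution)
```

Here every main effect is aliased with a three-factor interaction and the two-factor interactions are aliased in pairs, which is what Resolution IV means.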

**Additional Notes:**

- **Resolution of the Design:**

  - Ensure that the chosen fractional factorial design has appropriate resolution for the study objectives.

- **Assumption Checking:**

  - Check the normality of residuals and homogeneity of variance to validate the ANOVA model.

- **Practical Implications:**

  - Relate the results to decisions in the process under study, such as the pharmaceutical development scenario in the exercise at the end of this section.

**Example Scenario:**

Suppose we have 4 factors (A, B, C, D), each at 2 levels (coded as +1 and -1). A full factorial design would require $2^4 = 16$ runs. To reduce the number of runs, we can use a $2^{4-1}$ fractional factorial design, which only requires 8 runs.

**Python Implementation using pyDOE3:**

We will use the `pyDOE3` library functions to create a fractional factorial design.

**Import Required Libraries:**

In [None]:
import numpy as np
import pandas as pd
from pyDOE3 import fracfact
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

**Step-by-Step Guide:**

**Step 1: Define the Fractional Factorial Design Generators**

We will define the generators for the fractional factorial design.

- For a $2^{4-1}$ fractional factorial design, we can set:

  $D = A \times B \times C$

This means that the levels of factor D are determined by the product of the levels of factors A, B, and C.
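The same construction can be written out by hand, which makes the generator concrete. This is a standard-library sketch, not the library's implementation, and its row order may differ from what `fracfact` returns.

```python
from itertools import product

# Full 2^3 design in A, B, C; D is then fixed by the generator D = A*B*C
runs = [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

for run in runs:
    print(run)

# Each factor appears equally often at -1 and +1
balanced = all(sum(run[j] for run in runs) == 0 for j in range(4))

# The product A*B*C*D equals +1 in every run (defining relation I = ABCD)
identity_holds = all(a * b * c * d == 1 for a, b, c, d in runs)

# The AB and CD contrast columns are identical, so these effects are aliased
ab_equals_cd = all(a * b == c * d for a, b, c, d in runs)

print(balanced, identity_holds, ab_equals_cd)
```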

**Step 2: Generate the Design Matrix**

We can use the `fracfact` function to generate the design based on the generating relation.

In [None]:
# Define the generator string
generator = 'A B C ABC'  # Here, 'ABC' represents D = A * B * C

# Generate the design
design = fracfact(generator)

print("Design Matrix:")
print(design)

**Step 3: Create a DataFrame with Actual Factor Levels**

We'll create a DataFrame for clarity and assign the levels to each factor.

In [None]:
# Convert design matrix to pandas DataFrame
df_design = pd.DataFrame(design, columns=['A', 'B', 'C', 'D'])

# Display the design
print("Fractional Factorial Design:")
print(df_design)

**Step 4: Map Levels to Actual Factor Values**

Assuming that:

- **Factor A:** Temperature (Low: -1, High: +1)
- **Factor B:** Pressure (Low: -1, High: +1)
- **Factor C:** Catalyst Concentration (Low: -1, High: +1)
- **Factor D:** Stirring Speed (Low: -1, High: +1)

We can map the coded levels to actual values.

In [None]:
# Map coded levels to actual values
level_mapping = {-1: 'Low', 1: 'High'}

df_design['Temperature'] = df_design['A'].map(level_mapping)
df_design['Pressure'] = df_design['B'].map(level_mapping)
df_design['Catalyst'] = df_design['C'].map(level_mapping)
df_design['Stirring'] = df_design['D'].map(level_mapping)

# Reorder columns
df_design = df_design[['Temperature', 'Pressure', 'Catalyst', 'Stirring']]

print("Design with Actual Factor Levels:")
print(df_design)

**Step 5: Add Response Data**

Assume we conduct the experiments and collect the response variable, e.g., the yield percentage of a chemical process.

In [None]:
# Simulated response data
np.random.seed(42)  # For reproducibility
df_design['Yield'] = np.random.uniform(0, 100, size=len(df_design))

print("Design with Response:")
print(df_design)

**Step 6: Analyze the Results Using Linear Regression**

We can perform linear regression to analyze the effects of the factors on the response variable.

In [None]:
# Convert factor levels to categorical variables
df_design['Temperature'] = df_design['Temperature'].astype('category')
df_design['Pressure'] = df_design['Pressure'].astype('category')
df_design['Catalyst'] = df_design['Catalyst'].astype('category')
df_design['Stirring'] = df_design['Stirring'].astype('category')

# Build the formula for the model including main effects only
formula = 'Yield ~ C(Temperature) + C(Pressure) + C(Catalyst) + C(Stirring)'

# Fit the model
model = ols(formula, data=df_design).fit()

# Display the model summary
model.summary()

**Step 7: Analyze the Results Using ANOVA**

We will perform an ANOVA to analyze the main effects.

In [None]:
# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

print("ANOVA Results:")
print(anova_table)

**Note:**

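In fractional factorial designs, confounding means the results must be interpreted with caution. In this $2^{4-1}$ design with $D = A \times B \times C$, each main effect is aliased with a three-factor interaction, and the two-factor interactions are aliased in pairs (for example, $AB$ with $CD$).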

**Interpreting the Results:**

- **sum_sq:** Sum of squares due to each factor.
- **df:** Degrees of freedom.
- **F:** F-statistic value.
- **PR(>F):** P-value corresponding to the F-statistic.

A significant p-value (typically less than 0.05) indicates that the factor has a statistically significant effect on the response variable.

**Visualization:**

We can create Pareto charts to visualize the effects of the factors.

**Pareto Chart of Standardized Effects:**

In [None]:
# Standardized effects: t-statistics of the model terms
t_values = model.tvalues.iloc[1:]  # Exclude intercept
abs_t_values = t_values.abs()
effects_df = pd.DataFrame({'Effect': t_values.index, 'Abs_t': abs_t_values.values})

# Sort effects in descending order
effects_df.sort_values(by=['Abs_t'], ascending=False, inplace=True)

# Plot Pareto chart
plt.figure(figsize=(8,6))
plt.bar(effects_df['Effect'], effects_df['Abs_t'])
plt.title('Pareto Chart of Standardized Effects')
plt.xlabel('Effect')
plt.ylabel('Absolute t-value')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**Example Summary:**

In this example, we:

1. **Defined the generators for the fractional factorial design.**
2. **Generated the design matrix using `fracfact`.**
3. **Mapped coded levels to actual factor levels.**
4. **Simulated response data (yield).**
5. **Performed ANOVA to analyze the main effects.**
6. **Visualized the effects using a Pareto chart.**

**Conclusion:**

Fractional factorial designs offer an efficient way to study multiple factors with fewer runs than a full factorial design. However, due to confounding, careful planning and interpretation are necessary to ensure valid conclusions.

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 14</b></p>

**Exercise:** Fractional Factorial Design Analysis

1. **Scenario:**

   A pharmaceutical company is testing the effect of four factors on the effectiveness of a new drug:

   - **Factor A:** Dosage Level (Low: -1, High: +1)
   - **Factor B:** Release Rate (Slow: -1, Fast: +1)
   - **Factor C:** Compound Purity (Low: -1, High: +1)
   - **Factor D:** Additive Type (Type X: -1, Type Y: +1)

   Conducting a full $2^4 = 16$ runs is too costly, so they decide to perform a $2^{4-1}$ fractional factorial design with 8 runs.

2. **Tasks:**

   a. **Design the Experiment:**

      - Define suitable generators for the fractional factorial design.
      - Generate the fractional factorial design matrix.
      - Map the coded levels to actual factor levels.

   b. **Simulate Response Data:**

      - Assume you conducted the experiments and collected effectiveness scores (on a scale of 0 to 100).
      - For simplicity, simulate the data considering that higher dosage and higher purity might improve effectiveness.

   c. **Data Preparation:**

      - Organize the data into a pandas DataFrame with columns: `Dosage`, `Release_Rate`, `Purity`, `Additive`, `Effectiveness`.

   d. **Perform ANOVA:**

      - Use Python to perform an ANOVA analyzing the main effects.
      - Use an appropriate model formula considering main effects only.

   e. **Interpret the Results:**

      - Identify which factors significantly affect drug effectiveness.
      - Discuss any limitations due to confounding of effects.

   f. **Visualization (Bonus):**

      - Create a Pareto chart of the standardized effects to visualize the importance of each factor.

3. **Questions:**

   - Based on the analysis, which factors would you prioritize for improving drug effectiveness?
   - How does confounding in fractional factorial designs impact the interpretation of the results?

**Guidelines:**

- **Defining Generators:**

  - Choose a generator such that $D = A \times B \times C$, or another suitable generator to create the fractional design.

- **Simulating Data:**

  - When simulating response data, introduce reasonable patterns reflecting expectations (e.g., higher dosage increases effectiveness).

- **Confounding Awareness:**

  - Be aware that in a $2^{4-1}$ design with $D = A \times B \times C$, main effects are confounded with three-factor interactions, and two-factor interactions are confounded with each other in pairs.

- **Performing ANOVA:**

  - Use ANOVA to analyze the main effects while acknowledging confounding.

- **Interpretation:**

  - Take into account that some effects are aliased and cannot be independently estimated.

### **4.2.3. Plackett-Burman Design**

**Introduction:**

The **Plackett-Burman Design** is a type of experimental design used for screening a large number of factors to identify the most influential ones on a response variable with a minimal number of experiments. It is particularly useful in the early stages of experimentation when many factors are under consideration, and the goal is to determine which factors have significant effects.

**Key Characteristics:**

- **Orthogonal Arrays:** Plackett-Burman designs are based on orthogonal arrays, ensuring that the estimates of the effects are uncorrelated.
- **Two-Level Designs:** Each factor is tested at two levels, typically coded as -1 and +1.
- **Efficient Screening:** Allows for the examination of up to $N - 1$ factors in $N$ runs, where $N$ is a multiple of 4 (e.g., 12, 20, 24).
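The orthogonality property can be checked on the 12-run design. The sketch below builds it from the textbook first row by cyclic shifts and then adds a final row of all -1. This is the classical construction written with the standard library only. It is an assumption that a library routine such as `pbdesign` produces an equivalent design, and its row and column order may differ.

```python
# Textbook first row for the 12-run Plackett-Burman design
first_row = [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1]

# Eleven cyclic shifts of the first row, plus a final row of all -1
design = [first_row[-s:] + first_row[:-s] for s in range(11)]
design.append([-1] * 11)

def column(j):
    """Return column j of the design as a list."""
    return [row[j] for row in design]

def dot(u, v):
    """Inner product of two columns."""
    return sum(a * b for a, b in zip(u, v))

# Balanced: each factor is at +1 in six runs and at -1 in six runs
balanced = all(sum(column(j)) == 0 for j in range(11))

# Orthogonal: every pair of distinct columns has zero inner product
orthogonal = all(dot(column(i), column(j)) == 0
                 for i in range(11) for j in range(i + 1, 11))

print(len(design), balanced, orthogonal)
```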

**Advantages:**

- **Resource Efficiency:** Enables the study of many factors with relatively few experimental runs.
- **Identification of Key Factors:** Helps in identifying the most significant factors affecting the response variable for further detailed study.

**Disadvantages:**

- **No Interaction Effects:** Plackett-Burman designs do not consider interaction effects between factors.
- **Resolution III Designs:** Main effects may be confounded with two-factor interactions, making it important to assume that interaction effects are negligible.

**When to Use:**

- When the goal is to screen a large number of factors to identify the most significant ones.
- In the preliminary phase of experimentation before conducting detailed analysis with more advanced designs.

**Additional Notes:**

- **Assumption of No Interactions:**

  - Since Plackett-Burman designs do not account for interactions, assume that interaction effects are negligible when interpreting the results.

- **Follow-Up Experiments:**

  - Significant factors identified in the screening can be studied further using full factorial or fractional factorial designs to understand interactions and optimize the formulation.

- **Resource Management:**

  - The Plackett-Burman design makes efficient use of resources in the initial screening phase, because many factors are examined in few runs.

---

**Example Scenario:**

Suppose a biotechnology company wants to screen 11 factors that could affect the yield of a fermentation process. Conducting a full factorial design would require $2^{11} = 2048$ runs, which is impractical. Instead, they can use a Plackett-Burman design with 12 runs to efficiently identify the most important factors.

---

**Python Implementation using pyDOE3:**

We will use the `pbdesign` function from the `pyDOE3` library to generate a Plackett-Burman design.

**Import Required Libraries:**

In [None]:
import numpy as np
import pandas as pd
from pyDOE3 import pbdesign
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

**Step-by-Step Guide:**

**Step 1: Generate the Plackett-Burman Design**

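The scenario above involves 11 factors. To keep the example small, and to leave residual degrees of freedom for the regression in Step 4, the code below generates a design for 5 factors, which gives 8 runs. A 12-run design with all 11 factors would use every degree of freedom for the intercept and main effects, leaving none to estimate error.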

In [None]:
# Number of factors
num_factors = 5

# Generate the Plackett-Burman design
design = pbdesign(num_factors)

print("Design Matrix Shape:", design.shape)
print("Design Matrix:")
print(design)

**Note:**

- The `pbdesign` function generates a design with $N$ runs, where $N$ is the next multiple of 4 greater than or equal to $num\_factors + 1$. For 11 factors, $N = 12$.
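The run-count rule in the note can be written as a one-line function. The sketch below only reproduces the stated rule and does not call the library. The name `pb_runs` is hypothetical, and the rule says nothing about whether a design is available for every multiple of 4.

```python
def pb_runs(num_factors):
    """Smallest multiple of 4 that is at least num_factors + 1."""
    return 4 * (num_factors // 4 + 1)

for n in (3, 5, 7, 8, 11):
    print(n, 'factors ->', pb_runs(n), 'runs')
```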

**Step 2: Create a DataFrame and Assign Factor Names**

We will assign names to the factors for clarity.

In [None]:
# Create a list of factor names
factor_names = [f'Factor_{i+1}' for i in range(num_factors)]

# Create a DataFrame with the design
df_design = pd.DataFrame(design, columns=factor_names)

print("Plackett-Burman Design:")
print(df_design)

**Step 3: Add Response Data**

Assume we conduct the experiments and collect the response variable, e.g., the yield percentage of the process.

In [None]:
# Simulated response data
np.random.seed(42)  # For reproducibility
df_design['Yield'] = np.random.uniform(0, 100, size=len(df_design))

print("Design with Response:")
print(df_design)

**Step 4: Analyze the Results Using Linear Regression**

We can perform linear regression to analyze the effects of the factors on the response variable.

In [None]:
# Build the formula for the model including main effects only
# The formula includes all factors
formula = 'Yield ~ ' + ' + '.join(factor_names)

# Fit the model
model = ols(formula, data=df_design).fit()

# Summary of the model
print("Regression Results Summary:")
print(model.summary())

**Step 5: Interpret the Results**

Identify the significant factors based on the p-values.

- **Coefficients:** Estimates of the effects of each factor.
- **P>|t| (P-values):** Probability of observing a t-statistic at least as extreme as the one obtained if the true coefficient were zero. A low p-value (commonly below 0.05) is evidence of a significant effect.
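
For two-level factors coded as -1 and +1, the regression coefficient of a factor is half of its classical main effect (the mean response at the high level minus the mean response at the low level). The sketch below illustrates this with a small balanced design; the response values are made up for illustration.

```python
# Hypothetical 4-run, two-level column for one factor (coded -1/+1)
x = [-1, 1, -1, 1]
y = [60.0, 72.0, 64.0, 76.0]   # made-up responses

n = len(x)
high = [yi for xi, yi in zip(x, y) if xi == 1]
low = [yi for xi, yi in zip(x, y) if xi == -1]

# Classical main effect: mean at the high level minus mean at the low level
main_effect = sum(high) / len(high) - sum(low) / len(low)

# Least-squares slope for a balanced -1/+1 column: sum(x * y) / n
coefficient = sum(xi * yi for xi, yi in zip(x, y)) / n

print("Main effect:", main_effect)
print("Regression coefficient:", coefficient)
```

Keep this factor of two in mind when comparing coefficients from `model.summary()` with effects reported by other DOE tools.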

**Note:**

- Since Plackett-Burman designs are Resolution III designs, main effects are aliased with two-factor interactions. We interpret significant factors with caution, assuming that interaction effects are negligible.
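
To see what this aliasing means in practice, the sketch below builds an 8-run two-level orthogonal array from a Sylvester-type Hadamard matrix. This is one standard construction for a design of this size, and it is not necessarily the exact matrix that `pbdesign` returns. The code checks that the factor columns are mutually orthogonal and that the product of two factor columns (their interaction) coincides with a third factor column. The helper names are hypothetical.

```python
def sylvester_hadamard(doublings):
    """Build a Hadamard matrix of order 2**doublings as a list of rows."""
    H = [[1]]
    for _ in range(doublings):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

H = sylvester_hadamard(3)  # 8 x 8 matrix
# Drop the all-ones first column; the remaining 7 columns are factor columns
columns = [list(col) for col in zip(*H)][1:]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Orthogonality: every pair of distinct factor columns has zero dot product
orthogonal = all(dot(columns[i], columns[j]) == 0
                 for i in range(7) for j in range(i + 1, 7))

# Interaction column of the first two factors
interaction_12 = [a * b for a, b in zip(columns[0], columns[1])]
aliased_with = [idx + 1 for idx, col in enumerate(columns) if col == interaction_12]

print("Main-effect columns orthogonal:", orthogonal)
print("Factor_1 x Factor_2 interaction equals factor column(s):", aliased_with)
```

In a 12-run Plackett-Burman design the aliasing is partial rather than complete, but the practical conclusion is the same: a large apparent main effect may partly reflect an interaction.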

**Visualization:**

**Pareto Chart of Standardized Effects:**

In [None]:
# Get the absolute values of the t-statistics
t_values = model.tvalues.iloc[1:]  # Exclude intercept
abs_t_values = t_values.abs()
effects_df = pd.DataFrame({'Factor': factor_names, 't_value': abs_t_values.values})

# Sort ascending so the largest effect is drawn at the top of the horizontal bar chart
effects_df.sort_values(by='t_value', ascending=True, inplace=True)

# Plot Pareto chart
plt.figure(figsize=(8, 6))
plt.barh(effects_df['Factor'], effects_df['t_value'])
plt.title('Pareto Chart of Standardized Effects')
plt.xlabel('Absolute t-value')
plt.ylabel('Factors')
plt.tight_layout()
plt.show()

**Interpreting the Pareto Chart:**

- Factors with larger absolute t-values have a greater effect on the response variable.
- Identify and prioritize factors with the highest t-values for further investigation.

**Example Summary:**

In this example, we:

1. **Generated a Plackett-Burman design for 5 factors using `pyDOE3`.**
2. **Created a DataFrame and assigned factor names.**
3. **Simulated response data (yield).**
4. **Performed linear regression to analyze the main effects of the factors.**
5. **Visualized the effects using a Pareto chart.**

**Conclusion:**

The Plackett-Burman design allows us to efficiently screen a large number of factors with a relatively small number of experiments. By identifying the most significant factors, researchers can focus resources on studying these factors in more detail using higher-resolution designs.

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 3</b></p>

**Exercise:** Plackett-Burman Design Analysis

1. **Scenario:**

   A food scientist wants to identify which ingredients significantly affect the taste score of a new snack product. There are 6 ingredients (factors) under consideration:

   - **Factor 1:** Sugar Level (Low: -1, High: +1)
   - **Factor 2:** Salt Level (Low: -1, High: +1)
   - **Factor 3:** Cooking Time (Short: -1, Long: +1)
   - **Factor 4:** Temperature (Low: -1, High: +1)
   - **Factor 5:** Additive A (Absent: -1, Present: +1)
   - **Factor 6:** Additive B (Absent: -1, Present: +1)

   Conducting a full factorial design is impractical, so they decide to use a Plackett-Burman design.

2. **Tasks:**

   a. **Design the Experiment:**

      - Generate the Plackett-Burman design for the 6 factors.
      - Assign the actual factor names to the design.

   b. **Simulate Response Data:**

      - Assume you conducted the experiments and collected taste scores (on a scale of 1 to 10).
      - Simulate the data considering that sugar level and Additive B might significantly affect taste.

   c. **Data Preparation:**

      - Organize the data into a pandas DataFrame with columns for each factor and the taste score.

   d. **Perform Analysis:**

      - Use linear regression to analyze the main effects of the factors.
      - Identify significant factors based on p-values.

   e. **Visualization:**

      - Create a Pareto chart of the standardized effects to visualize the importance of each factor.

3. **Questions:**

   - Which ingredients significantly affect the taste score?
   - Based on the results, which factors would you recommend for further detailed study?

**Guidelines:**

- **Generating the Design:**

  - Use the `pbdesign` function to create the experimental design for 6 factors.
  - Note that the number of runs will be the smallest multiple of 4 greater than or equal to 7 (i.e., 8 runs).

- **Simulating Data:**

  - When simulating taste scores, introduce patterns where higher sugar levels and the presence of Additive B might improve taste.

- **Analyzing Results:**

  - Interpret the regression coefficients and p-values to determine the significance of factors.
  - Remember that main effects may be confounded with two-factor interactions.

- **Visualizing Effects:**

  - Use the absolute values of t-statistics or coefficients to create the Pareto chart.
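
As a starting point for the simulation task, the following sketch generates a taste score from two coded factor levels. The baseline score, effect sizes, and noise level are arbitrary assumptions chosen for illustration, and `simulate_taste` is a hypothetical helper.

```python
import random

def simulate_taste(sugar, additive_b, rng, noise_sd=0.5):
    """Return a simulated taste score on a 1-10 scale.

    sugar and additive_b are coded levels (-1 or +1).
    The baseline (5.5) and effect sizes (1.5 and 1.0) are assumed values.
    """
    score = 5.5 + 1.5 * sugar + 1.0 * additive_b + rng.gauss(0, noise_sd)
    return min(10.0, max(1.0, score))  # keep the score on the 1-10 scale

rng = random.Random(42)  # seeded for reproducibility
scores = {(s, b): simulate_taste(s, b, rng) for s in (-1, 1) for b in (-1, 1)}
for levels, score in scores.items():
    print(levels, round(score, 2))
```

In the full exercise, the same function would be applied to the sugar and Additive B columns of each run in the design, with the other four factors contributing no effect.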

## **4.3. Response Surface Methodology**

**Introduction:** 

**Response Surface Methodology (RSM)** is a collection of statistical and mathematical techniques used for modeling and analyzing problems in which a response of interest is influenced by several variables. The primary goal of RSM is to optimize this response, which involves finding the best conditions or settings for the factors involved. RSM is especially useful when the relationship between the response variable and the factors is not well understood but can be approximated by fitting an empirical mathematical model.

**Key Concepts:**

1. **Response Variable:**
   - The **response variable** is the outcome of interest that is being influenced by the predictor variables (factors).

2. **Factors:**
   - **Factors** are independent variables that are controlled in the experiment. They can be continuous or categorical.

3. **Sequential Experimentation:**
   - RSM often involves a series of experiments where responses are measured at different factor levels, allowing for the development of a surface model to predict responses under various conditions.

4. **Polynomial Models:**
   - RSM typically utilizes polynomial regression models (usually second-order) to fit the response surface. These models can capture both linear and quadratic effects as well as interactions.

5. **Optimal Conditions:**
   - The methodology aims to identify the optimal levels of factors that maximize or minimize the response variable.

6. **Central Composite Design (CCD):**
   - A popular experimental design in RSM that allows for the estimation of curvature in the response surface by adding axial points (star points) to a factorial or fractional factorial design.
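
For reference, the second-order model mentioned under Polynomial Models is usually written as follows for $k$ coded factors $x_1, \dots, x_k$. The symbols $\beta$ and $\varepsilon$ are standard notation introduced here for illustration:

$$
y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \sum_{i<j} \beta_{ij} x_i x_j + \varepsilon
$$

The model has $(k+1)(k+2)/2$ coefficients, i.e., 6 for $k = 2$ and 10 for $k = 3$, so the design must contain at least that many distinct runs.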

**Advantages of RSM:**

- **Efficiency:** RSM allows for the systematic evaluation of multiple factors and their interactions with fewer experiments compared to full factorial designs.
- **Optimization:** It provides a framework for optimizing complex processes.
- **Graphical Visualization:** The use of response surfaces and contour plots facilitates interpretation of the results.

**Disadvantages:**

- **Model Specifications:** Incorrect model assumptions can lead to misleading conclusions.
- **Limited to Quadratic Surfaces:** RSM typically assumes a quadratic relationship, which may not always be valid.

### **4.3.1. Central Composite Design (CCD)**

**Introduction:**

A **Central Composite Design (CCD)** is a specific type of experimental design used in Response Surface Methodology that allows estimation of second-order polynomial models. It consists of a full or fractional factorial design augmented with center points and axial points (star points). CCD is useful when modeling complex relationships between factors and responses.

**Key Features of CCD:**

1. **Components:**
   - Two-level factorial or fractional factorial design.
   - Center points to allow estimation of experimental error and detect curvature.
   - Axial (star) points that are placed at a distance from the center point, facilitating a quadratic approximation of the response surface.

2. **Design Efficiency:**
   - CCD is efficient in exploring the response surface with a moderate number of runs, allowing a deeper understanding of the system.

3. **Flexibility:**
   - Adequate for various response surfaces, even if the underlying relationship is not purely quadratic.
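
The size of a CCD follows directly from its three components. The sketch below counts the runs for a CCD built on a full two-level factorial and computes the axial distance that makes such a design rotatable. Both helpers are hypothetical names used for illustration.

```python
def ccd_runs(k, n_center):
    """Runs in a CCD built on a full 2**k factorial: corners + axial + center."""
    return 2 ** k + 2 * k + n_center

def rotatable_alpha(k):
    """Axial distance that makes a full-factorial CCD rotatable: (2**k) ** 0.25."""
    return (2 ** k) ** 0.25

for k in (2, 3, 4):
    print(f"k={k}: runs with 3 center points = {ccd_runs(k, 3)}, "
          f"rotatable alpha = {rotatable_alpha(k):.3f}")
```

The `ccdesign` function used in this chapter controls the axial distance through its `alpha` argument, which is assumed here to default to `'orthogonal'`. Its axial values can therefore differ from the rotatable value computed above.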

**Visualization of Central Composite Design:**

You can visualize the CCD points using `plotly`:

In [None]:
import numpy as np
import pandas as pd
from pyDOE3 import ccdesign
import plotly.express as px

# Define the number of factors (three here, matching the three plotted axes)
num_factors = 3

# Generate a Central Composite Design
ccd_design = ccdesign(num_factors, center=(1, 1), face='circumscribed')

# Create a DataFrame for the design
df_ccd = pd.DataFrame(ccd_design, columns=['Factor 1', 'Factor 2', 'Factor 3'])

# ccdesign already returns coded levels (-1, 0, +1 and the axial values),
# so no rescaling is needed

# Create the 3D interactive scatter plot
fig_ccd = px.scatter_3d(
    df_ccd,
    x='Factor 1',
    y='Factor 2',
    z='Factor 3',
    title='3D Scatter Plot of Central Composite Design Points',
    labels={'Factor 1': 'Factor 1', 'Factor 2': 'Factor 2', 'Factor 3': 'Factor 3'},
    width=700,
    height=500
)

# Customize the marker appearance
fig_ccd.update_traces(marker=dict(size=5))

# Show the plot
fig_ccd.show()

---

**Example Scenario:**

Suppose a chemical engineer wants to optimize the yield of a reaction based on temperature and pressure. They will use a Central Composite Design to explore the effect of these factors.
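
The design is generated in coded units, so each coded level has to be mapped to a physical setting before the experiment is run. A minimal sketch of that conversion follows; the temperature and pressure ranges are assumed values for illustration, and `coded_to_actual` is a hypothetical helper.

```python
def coded_to_actual(coded, low, high):
    """Map a coded level to physical units, with -1 -> low and +1 -> high."""
    center = (high + low) / 2
    half_range = (high - low) / 2
    return center + coded * half_range

# Assumed ranges for illustration only
temp_low, temp_high = 150.0, 200.0    # °C
press_low, press_high = 20.0, 40.0    # psi

for coded in (-1.414, -1, 0, 1, 1.414):
    t = coded_to_actual(coded, temp_low, temp_high)
    p = coded_to_actual(coded, press_low, press_high)
    print(f"coded {coded:+.3f} -> {t:.1f} °C, {p:.1f} psi")
```

Axial points with a coded value beyond ±1 fall outside the low and high settings, so it is worth checking that those settings are physically feasible before committing to a circumscribed design.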

**Using pyDOE3 for CCD:**

To implement a CCD in Python, we can use the `pyDOE3` library. The following steps demonstrate how to create and analyze a CCD using Python.

**Import Required Libraries:**

In [None]:
import numpy as np
import pandas as pd
from pyDOE3 import ccdesign
import plotly.graph_objects as go

**Step-by-Step Guide:**

**Step 1: Generate the Central Composite Design**

We will generate a CCD with 2 factors: Temperature and Pressure.

In [None]:
# Define the number of factors (Temperature and Pressure)
num_factors = 2

# Generate a Central Composite Design
design = ccdesign(num_factors, center=(3, 0), face='circumscribed')

**Step 2: Create a DataFrame and Assign Factor Names**

We will assign names to the factors for clarity.

In [None]:
# Create a DataFrame for the design
df = pd.DataFrame(design, columns=['Temperature', 'Pressure'])

# Display the design matrix
print("Central Composite Design Matrix:")
print(df)

**Step 3: Add Response Data**

Assume we conduct the experiments and collect the response variable, e.g., the yield percentage of the process.

In [None]:
# Add simulated response data for demonstration
np.random.seed(42)
df['Yield'] = np.random.randint(0, 100, len(df))
print(df)

With the responses in place, we can plot the observed yield at each design point:

In [None]:
import plotly.express as px

# Scatter plot of the observed yield at each design point (factors in coded units)
fig = px.scatter_3d(df, x='Temperature', y='Pressure', z='Yield',
                     title='Observed Yield at Central Composite Design Points',
                     labels={'Temperature': 'Temperature (coded)', 'Pressure': 'Pressure (coded)', 'Yield': 'Yield (%)'})

# Show plot
fig.show()

**Step 4: Analyze the Results Using Linear Regression**

To analyze the results obtained from the Central Composite Design, we can perform linear regression. The procedure includes fitting a quadratic polynomial regression model to the data:

In [None]:
# Import regression libraries
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit a polynomial model
formula = 'Yield ~ Temperature + Pressure + I(Temperature**2) + I(Pressure**2) + Temperature:Pressure'
model = ols(formula, data=df).fit()

# Display the model summary
print("Linear Regression Results Summary:")
print(model.summary())

**Step 5: Interpret the Results**

After fitting the regression model, it's crucial to interpret the results effectively. Key components to consider include:

- **Coefficients:**
  - The coefficients indicate the direction and size of each term's effect on the response variable, in coded units. A significant coefficient (based on its p-value) indicates that the effect is distinguishable from noise, not necessarily that it is large.

- **P-values:**
  - A common threshold for significance is p < 0.05. Coefficients with p-values below this threshold indicate that the respective factors significantly influence the yield.

- **Goodness-of-Fit:**
  - The R-squared value in the summary provides an indication of how well the model fits the data. Generally, a higher R-squared value indicates a better fit.

For example, when reading the results summary:
- If the coefficient for `Temperature` is positive and significant, it suggests that increasing temperature enhances yield.
- If the interaction term `Temperature:Pressure` is significant, it indicates that the effect of temperature on yield may depend on the level of pressure.
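
Once a quadratic model has been fitted, the factor settings that optimize the response can be located by setting the partial derivatives of the fitted surface to zero. The sketch below does this for two factors. The coefficient values are hypothetical; in practice they would be read from `model.params`. `stationary_point` is an illustrative helper, not a library function.

```python
def stationary_point(b1, b2, b11, b22, b12):
    """Stationary point of y = b0 + b1*x1 + b2*x2 + b11*x1**2 + b22*x2**2 + b12*x1*x2."""
    # Setting both partial derivatives to zero gives a 2x2 linear system:
    #   2*b11*x1 +   b12*x2 = -b1
    #     b12*x1 + 2*b22*x2 = -b2
    det = 4 * b11 * b22 - b12 ** 2
    if det == 0:
        raise ValueError("no unique stationary point")
    x1 = (b12 * b2 - 2 * b22 * b1) / det
    x2 = (b12 * b1 - 2 * b11 * b2) / det
    if det < 0:
        kind = "saddle point"
    elif b11 < 0:
        kind = "maximum"
    else:
        kind = "minimum"
    return x1, x2, kind

# Hypothetical fitted coefficients (coded units), for illustration only
x1, x2, kind = stationary_point(b1=2.0, b2=3.0, b11=-4.0, b22=-5.0, b12=1.0)
print(f"Temperature = {x1:.3f}, Pressure = {x2:.3f} ({kind})")
```

A stationary point is only a useful optimum if it is a maximum or minimum (not a saddle point) and lies inside the region covered by the design.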

**Visualization:**

To visualize the response surface generated by the fitted model, we can create a 3D plot showing how the predicted yield varies with changes in both temperature and pressure.

In [None]:
# Create a grid for the surface plot
temp_range = np.linspace(df['Temperature'].min(), df['Temperature'].max(), 100)
press_range = np.linspace(df['Pressure'].min(), df['Pressure'].max(), 100)
temp_grid, press_grid = np.meshgrid(temp_range, press_range)

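# Predict yield over the grid (a DataFrame keeps the column names the formula expects)
grid_df = pd.DataFrame({'Temperature': temp_grid.ravel(), 'Pressure': press_grid.ravel()})
predicted_yield = np.asarray(model.predict(grid_df)).reshape(temp_grid.shape)

# Surface plot of the predicted yield (factors in coded units)
fig_surface = go.Figure(go.Surface(x=temp_grid, y=press_grid, z=predicted_yield))
fig_surface.update_layout(scene=dict(xaxis_title='Temperature', yaxis_title='Pressure', zaxis_title='Predicted Yield'))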

# Show surface plot
fig_surface.show()

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 4</b></p>

**Scenario:**

A food scientist is experimenting to determine the optimal levels of sugar and fat in a new snack. The factors they will evaluate are:

- **Factor A:** Sugar Level (Low: -1, High: +1)
- **Factor B:** Fat Level (Low: -1, High: +1)

They want to conduct a CCD to optimize the taste response.

**Tasks:**

a. **Design the Experiment:**
   - Use a Central Composite Design to create the experimental runs.
   - List all the treatment combinations.

b. **Simulate Response Data:**
   - Assume you run the experiments and collect taste test scores (1-10 scale).
   - Simulate the data, taking higher levels of sugar and fat to potentially enhance taste.

c. **Data Preparation:**
   - Organize the data into a pandas DataFrame with columns: `Sugar_Level`, `Fat_Level`, `Taste_Score`.

d. **Perform Analysis:**
   - Use a polynomial model to analyze the effects of sugar and fat on the taste score.

e. **Visualization (Bonus):**
   - Create a 3D response surface plot to visualize the effects of sugar and fat levels on taste.

### **4.3.2. Box-Behnken Design**

**Introduction:**

The **Box-Behnken Design** is a response surface design that is particularly useful for fitting second-order (quadratic) models without requiring a three-level full factorial design. It uses fewer experimental runs than the full factorial, making it a good choice when the number of factors is moderate and the aim is to identify optimal responses. Each factor is set at three levels: low, medium, and high.

**Key Features of Box-Behnken Design:**

1. **Three-Level Factors:**
   - Each factor is studied at three levels (high, medium, low). The design does not include a full factorial of combinations but rather a balanced set of combinations to determine curvature.

2. **Efficient Design:**
   - For $k = 3$ to $5$ factors, the Box-Behnken design requires $2k(k-1) + C_0$ experimental runs, where $C_0$ is the number of center points. For example, $k = 3$ gives 12 edge-midpoint runs plus the center runs, compared with 27 runs for a three-level full factorial.

3. **No Extreme Combinations:**
   - Unlike some designs, Box-Behnken does not require testing the extreme combinations (i.e., all factors at high or low simultaneously), reducing the experimental workload while still capturing essential response behavior.
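
These features can be verified by constructing the non-center runs directly. The sketch below uses the textbook pairwise construction for a small number of factors: a two-level factorial on each pair of factors, with all remaining factors held at the center level. It is not claimed to reproduce the row order of `bbdesign`, and `box_behnken_points` is a hypothetical helper.

```python
from itertools import combinations, product

def box_behnken_points(k):
    """Edge-midpoint runs: a 2x2 factorial on each factor pair, others at 0."""
    points = []
    for i, j in combinations(range(k), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0] * k
            run[i], run[j] = a, b
            points.append(tuple(run))
    return points

points = box_behnken_points(3)
print("Non-center runs:", len(points))
print("Any corner point (no factor at 0)?", any(0 not in p for p in points))
```

Every run keeps at least one factor at its center level, which is why no run sits at a corner of the experimental region.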
  

**Visualization of Box-Behnken Design:**

You can visualize the Box-Behnken design points using `plotly`:

In [None]:
import numpy as np
import pandas as pd
from pyDOE3 import bbdesign
import plotly.express as px

# Define the number of factors (e.g., Sugar Content, Salt Level, Baking Time)
num_factors = 3  # Adjust as needed

# Generate a Box-Behnken Design
bbd_design = bbdesign(num_factors)

# Create a DataFrame for the design
df_bbd = pd.DataFrame(bbd_design, columns=['Factor 1', 'Factor 2', 'Factor 3'])

# bbdesign already returns coded levels (-1, 0, +1), so no rescaling is needed

# Create the 3D interactive scatter plot
fig_bbd = px.scatter_3d(
    df_bbd,
    x='Factor 1',
    y='Factor 2',
    z='Factor 3',
    title='3D Scatter Plot of Box-Behnken Design Points',
    labels={'Factor 1': 'Factor 1', 'Factor 2': 'Factor 2', 'Factor 3': 'Factor 3'},
    width=700,
    height=500
)

# Customize the marker appearance
fig_bbd.update_traces(marker=dict(size=5))

# Show the plot
fig_bbd.show()

---

**Example Scenario:**

Assume a researcher is studying the effect of three factors on the yield of a fermentation process. The factors are:

- **Factor A:** Temperature (°C)
- **Factor B:** pH level
- **Factor C:** Agitation Speed (rpm)

The researcher intends to use a Box-Behnken design to optimize yield.

**Using pyDOE3 for Box-Behnken Design:**

To implement a Box-Behnken design in Python using the `pyDOE3` library, we follow these steps:

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from pyDOE3 import bbdesign
import statsmodels.api as sm
from statsmodels.formula.api import ols
import plotly.graph_objects as go

**Step-by-Step Guide:**

**Step 1: Generate the Box-Behnken Design**

We will generate a BBD with 3 factors: Temperature, pH, and Agitation Speed.

In [None]:
# Define the number of factors (e.g., Temperature, pH, Agitation Speed)
num_factors = 3

# Generate a Box-Behnken Design
design = bbdesign(num_factors)

**Step 2: Create a DataFrame and Assign Factor Names**

We will assign names to the factors for clarity.

In [None]:
# Create a DataFrame for the design
df = pd.DataFrame(design, columns=['Temperature', 'pH', 'AgitationSpeed'])

# Display the design matrix
print("Box-Behnken Design Matrix:")
print(df)

**Step 3: Add Response Data**

Assume we conduct the experiments and collect the response variable, e.g., the yield percentage of the process.

In [None]:
np.random.seed(42)
df['Yield'] = np.random.randint(0, 100, len(df))
print(df)

**Step 4: Analyze the Results Using Linear Regression**

To analyze the results obtained from the Box-Behnken design, we perform a linear regression, fitting a quadratic polynomial model to the data:

In [None]:
# Fit a polynomial model
formula = 'Yield ~ Temperature + pH + AgitationSpeed + I(Temperature**2) + I(pH**2) + I(AgitationSpeed**2) + Temperature:pH + Temperature:AgitationSpeed + pH:AgitationSpeed'
model = ols(formula, data=df).fit()

# Display the model summary
print("Linear Regression Results Summary:")
print(model.summary())

**Step 5: Interpret the Results**

After fitting the regression model, it is essential to interpret the results. Consider the following key elements:

- **Coefficients:**
  - Each coefficient indicates the effect of the corresponding factor on the response variable. A positive coefficient for `Temperature` suggests that increasing the temperature would improve yield.

- **P-values:**
  - Check the significance of each coefficient using p-values. A p-value less than 0.05 usually indicates that the factor has a significant influence on the yield.

- **Interactions:**
  - Interaction terms (e.g., `Temperature:pH`) help in understanding if the effect of one factor depends on the level of another factor.

- **R-squared Value:**
  - The R-squared value indicates how well the model explains the variability in the response variable. Adjusted R-squared values should also be considered to account for the number of predictors in the model.
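
The adjustment to R-squared penalizes models that use many terms relative to the number of runs, which matters for the 9-term quadratic model fitted here. A minimal sketch of the calculation follows; the R-squared value and the run count of 15 are hypothetical inputs, and `adjusted_r2` is an illustrative helper.

```python
def adjusted_r2(r2, n_runs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    dof = n_runs - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("not enough runs to estimate residual variance")
    return 1 - (1 - r2) * (n_runs - 1) / dof

# Hypothetical example: R-squared of 0.90 from 15 runs and 9 model terms
print(round(adjusted_r2(0.90, 15, 9), 3))
```

With so few residual degrees of freedom, a high R-squared can shrink considerably after adjustment, so both values should be reported.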

For interpretation:

- If `Temperature` and its squared term both have significant p-values, the researcher can infer that yield responds to temperature in a curved (non-linear) way, so an optimum temperature may lie inside the studied range.