diff --git a/docs/docs/guide/index.mdx b/docs/docs/guide/index.mdx index 940e2f1d92c..01e168292bc 100644 --- a/docs/docs/guide/index.mdx +++ b/docs/docs/guide/index.mdx @@ -6,11 +6,9 @@ slug: /guide # GrowthBook Detailed Guides. -The following sections contain detailed walkthroughs on how to set up GrowthBook with various technologies. +The following sections contain detailed walkthroughs on how to set up GrowthBook with various technologies. -## A/B testing guide - -- [The Open Guide to Successful A/B Testing (pdf)](/open-guide-to-ab-testing.v1.0.pdf) +This is not a complete list of the ways GrowthBook can be integrated. ## Implementation guides @@ -18,3 +16,9 @@ The following sections contain detailed walkthroughs on how to set up GrowthBook - [GrowthBook with Create React App](/guide/create-react-app-and-growthbook) - [GrowthBook with Next.js and Rudderstack](/guide/rudderstack-and-nextjs-with-growthbook) - [GrowthBook with Google Tag Manager](/guide/google-tag-manager-and-growthbook) +- [GrowthBook with Webflow](/integrations/webflow) +- [GrowthBook with Shopify](/integrations/shopify) + +## A/B testing guide + +- [The Open Guide to Successful A/B Testing (pdf)](/open-guide-to-ab-testing.v1.0.pdf)

diff --git a/docs/docs/using/experimentation-best-practices.mdx b/docs/docs/using/experimentation-best-practices.mdx new file mode 100644 index 00000000000..1dd5f684e7f --- /dev/null +++ b/docs/docs/using/experimentation-best-practices.mdx @@ -0,0 +1,75 @@

# Experimentation Best Practices

## Running experiments

### Running Your First Experiment

When you've finished integrating your experimentation platform (which for GrowthBook means adding the SDK to your code), it's time to start running an experiment. We suggest that you first run an A/A test to validate that your experimentation implementation is correctly splitting traffic and producing statistically valid results.

### Sample Sizes

Understanding experiment power and the MDE (minimum detectable effect) is important for predicting how many samples are required. There are numerous online calculators that can help you predict the sample size. A typical rule of thumb for the lowest number of samples required is that you want at least 100 conversion events per variation. For example, if you have a registration page with a 10% conversion rate and a 2-way (A and B) experiment that is looking to improve member registrations, you will want to expose the experiment to at least 2,000 people (1,000 per variation).

### Test Duration

Due to the natural variability in traffic day to day and hour to hour, experimentation teams will often set a minimum test duration within which a test cannot be called. This helps you avoid optimizing a product for just the users who happen to visit when the test is started. For example, if the weekend traffic of your product is different from the traffic during the week and you started a test on Friday and ended it on Monday, you may not get a complete picture of the impact your changes have on your weekday traffic. Typical test durations are 1 to 2 weeks, and usually care needs to be taken over holidays.

You may also find that a test would need to run for a month or more to get the power required for the experiment. Very long running tests can be hard to justify, as you have to keep the variations of the experiment unchanged for the duration, and this may limit your team's velocity towards potentially higher-impact changes.
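To make the rule of thumb above more concrete, here is a minimal sketch of a standard two-proportion sample size calculation. It is illustrative only (not GrowthBook's internal power analysis), and the 10% baseline rate, 12% target rate, and daily traffic figure are assumptions for the example:

```python
from math import ceil
from scipy.stats import norm

def users_per_variation(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate users needed per variation for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # significance threshold (two-sided)
    z_power = norm.ppf(power)          # desired statistical power
    variance = baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    effect = abs(expected_rate - baseline_rate)
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

needed = users_per_variation(0.10, 0.12)  # detect a lift from 10% to 12%
print(needed)                             # roughly 3,800-3,900 users per variation

daily_traffic = 500  # assumed users entering the experiment per day
print(f"~{2 * needed / daily_traffic:.0f} days for a 2-way test at {daily_traffic} users/day")
```

Dividing the total sample size by your actual daily traffic, as in the last line, is a quick way to sanity check whether the test duration is realistic before you start.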
### Interaction Effects and Mutual Exclusion

When you start having the ability to run a lot of A/B tests, it can be tempting to avoid running tests in parallel in case they have interaction effects (see [Interaction effects](/using/fundamentals#interaction-effects)). For example, you may want to test a change in the CTA button on your purchase page, and also test changing the price. It can be difficult to figure out if any two tests will meaningfully interact, and many teams will run the tests in serial out of an abundance of caution.

However, meaningful interactions are actually quite rare, and keeping a higher rate of experimentation is usually more beneficial. You can run an analysis after the experiments to see if there were any interaction effects which would change your conclusions (GrowthBook is working on an integrated solution for this). If you need to run mutually exclusive tests, you can use GrowthBook's [INSERT LINK]namespace feature.

### Experimentation Frequency

Having a high frequency of A/B testing is important for running a successful experimentation program. The main reasons why experimentation frequency is important are:

**Maximizing chances**: Since success rates are typically low for any given experiment, and large improvements are even rarer, by having a high frequency of A/B testing you are maximizing your chance of having impactful experiments.

**Continuous improvement**: A high frequency of A/B testing allows you to continuously improve your website or application. By testing small changes frequently, you can quickly identify and implement changes that improve user experience, engagement, and conversion rates.

**Adaptability**: A high frequency of A/B testing allows you to quickly adapt to changes in user behavior, market trends, or other external factors that may impact your website or application. By testing frequently, you can identify and respond to these changes more quickly, ensuring that your site or app remains relevant and effective.

**Avoiding stagnation**: A high frequency of A/B testing can help you avoid stagnation and complacency. By continually testing and experimenting, you can avoid falling into a rut or becoming overly attached to a specific design or strategy, and instead remain open to new ideas and approaches.

:::tip Quote

**_"If you want to have good ideas you must have many ideas. Most of them will be wrong, and what you have to learn is which ones to throw away."_**
- Linus Pauling
:::

diff --git a/docs/docs/using/experimentation-problems.mdx b/docs/docs/using/experimentation-problems.mdx new file mode 100644 index 00000000000..188a7675897 --- /dev/null +++ b/docs/docs/using/experimentation-problems.mdx @@ -0,0 +1,286 @@

# Where Experimentation Goes Wrong

The following contains a list of common pitfalls and mistakes that can happen when running A/B tests. It is important to be aware of these issues and to take steps to avoid them in order to ensure that your A/B tests are valid and reliable. It is by no means an exhaustive list.

### Multiple Testing Problem

The multiple testing problem refers to the issue that arises when statistical hypothesis testing is performed on multiple variables simultaneously, leading to an increased likelihood of incorrectly rejecting a true null hypothesis ([Type I error](/using/fundamentals#false-positives-type-i-errors-and-false-negatives-type-ii-errors)).

For example, if you test the same hypothesis at a 5% level of significance for 20 different metrics, the probability of finding at least one statistically significant result by chance alone is around 64%. This probability increases as the number of tests performed increases. This math assumes that the metrics are independent from one another, though in most cases for a digital application there will be some interaction between metrics (e.g., page views are most likely related to sales funnel starts, or member registrations to purchase events).

To address this problem, various multiple comparison correction methods can be used, such as the Bonferroni correction or False Discovery Rate (FDR) control methods like the Benjamini-Hochberg procedure. These methods adjust the significance level or the p-value threshold to account for the increased risk of false positives when multiple comparisons are made.

It's essential to be aware of this issue and select an appropriate correction method when conducting multiple statistical tests to avoid false discoveries and improve the accuracy and reliability of research findings. If you are using a high number of metrics, draw conclusions from the test thoughtfully, and consider running a follow-up test just to validate a single surprising result or metric.

### Texas Sharpshooter Fallacy

The Texas sharpshooter problem is a cognitive bias that involves cherry-picking data clusters to suit a particular argument, hypothesis, or bias. The name comes from the idea of a Texan marksman shooting at a barn and then painting a target around the cluster of bullet holes to create the appearance of accuracy. As the story goes, he then showed his neighbors and convinced them he was a great shot. It is closely related to the Multiple Testing Problem/Multiple Comparison Problem.

In the context of data analysis and statistics, the Texas sharpshooter problem refers to the danger of finding apparent patterns or correlations in data purely by chance and then using those patterns as if they were meaningful. This can lead to false conclusions and misguided decision-making. The Texas sharpshooter problem is relevant in the sense that if you analyze the results of a test without a clear hypothesis or before setting up the experiment, you may be susceptible to finding patterns that are purely due to random variation. If you analyze the data in multiple ways or look at various subgroups without adjusting for multiple comparisons, you might identify spurious patterns that do not actually represent a true effect.
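As a rough sketch of the numbers quoted above, and of how a correction changes the conclusions, here is a small Python illustration. The p-values are made up for the example, and this generic correction is not a description of GrowthBook's own statistics engine:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Chance of at least one false positive across 20 independent metrics tested at alpha = 0.05
print(1 - 0.95 ** 20)  # ~0.64, the ~64% figure mentioned above

# Applying a Benjamini-Hochberg (FDR) correction to a set of hypothetical raw p-values
raw_p_values = np.array([0.003, 0.012, 0.04, 0.06, 0.21, 0.48, 0.74])
reject, adjusted_p, _, _ = multipletests(raw_p_values, alpha=0.05, method="fdr_bh")
print(reject)      # which metrics remain significant after the correction
print(adjusted_p)  # the FDR-adjusted p-values
```

Notice that several raw p-values below 0.05 no longer look significant once the correction is applied, which is exactly the protection these methods provide.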
+ +### P-Hacking + +P-hacking, or data dredging, is a statistical fallacy that involves manipulating or analyzing data in various ways +until a statistically significant result is achieved. It occurs when researchers or analysts repeatedly test +their data using different methodologies or subsets of the data until they find a statistically significant result, +even if the observed effect is due to chance. + +In the context of A/B testing, p-hacking can be a significant concern. A/B testing involves comparing at least two versions +(A and B) to determine which performs better. The danger of p-hacking arises when analysts, either consciously or +unconsciously, explore different metrics, time periods, or subgroups until they find a statistically significant +difference between the A and B groups. + +### Peeking + +The peeking problem refers to the issue of experimenters making decisions about the results of +an experiment based on early data. The more often the experiment is looked at, or ‘peeked’, the +higher the false positive rates will be, meaning that the results are more likely to be significant by +chance alone. Peeking typically applies to Frequentist statistics, which are statistically valid at their +predetermined sample size. However, Bayesian statistics can also suffer from peeking depending on +how decisions are made on the basis of Bayesian results. + +The peeking problem in A/B testing occurs when the experimenter looks at the data during the +experiment and decides to stop the test early based on the observed results, rather than waiting until +the predetermined sample size or duration has been reached. This can lead to inflated false positive +rates, as the results are more likely to be significant by chance alone if the experimenter stops the test +early based on what they see in the data. The more often the experiment is looked at, or ‘peeked’, the +higher the false positive rates will be. + +To avoid the peeking problem in A/B testing, it's important to use a predetermined sample size or +duration for the experiment and stick to the plan without making any changes based on the observed +results. This helps to ensure that the statistical test is valid and that the results are not influenced by +experimenter bias. + +Another way to avoid the peeking problem in A/B testing is to use a statistical engine that is less +susceptible to peeking, like a Bayesian with custom priors, or to use a method that accounts for +peeking like Sequential testing. + +### Problems with client side A/B testing + +Client-side A/B testing is a technique where variations of a web page or application are served to users +via JavaScript on the user's device, without requiring any server-side code changes. This technique +can offer a fast and flexible way to test different variations of a website or application, but it can also +present some potential problems, one of which is known as "flickering." + +Flickering is a problem that can occur when the A/B test is implemented in a way that causes the +user interface to render the original version, then flash or flicker as the variations are loaded. This can +happen when the A/B test code is slow to load or when the A/B testing library is lacking performance. +As a result, the user may see the original version of the page briefly before it is replaced with one of +the variations being tested, leading to a jarring and confusing user experience. This flickering can lead +to inaccurate or unreliable test results. 
Rather counterintuitively, flickering may sometimes have a positive effect on the results: the flashing may draw a user's attention to that variation and cause an inflation in the effect.

To avoid flickering in client-side A/B testing, it is important to implement the test code in a way that minimizes the delay between the original page and the variations being tested. This may involve preloading the test code or optimizing the code for faster loading times. GrowthBook's SDKs are built for very high performance, and allow you to use client side A/B testing code inline, so there are no blocking 3rd party calls.

You can also use an alternative technique such as server-side testing or redirect-based testing to avoid flickering issues. If loading the SDK in the head does not sufficiently prevent flickering, you can also use an anti-flickering script. These scripts hide the page while the content is loading, and reveal the page after the experiment has loaded. The problem with this is that while it technically prevents flickering, it slows how quickly your site appears to load.

### Redirect tests (Split testing)

Redirect-based A/B testing is a technique where users are redirected to different URLs or pages based on the A/B test variation they are assigned to. While this technique can be effective in certain scenarios, it can also present several potential problems that should be considered before implementation.

**SEO**: Redirects can negatively impact SEO, as search engines may not be able to crawl the redirected pages or may see them as duplicate content. This can result in lower search engine rankings and decreased traffic to the site.

**Load times/User experience**: Redirects can increase page load times, as the browser has to make an additional HTTP request to load the redirect page. This can result in slower load times, which can impact user experience, conversion rates, and A/B test outcomes.

**Data accuracy**: Redirects can also impact the accuracy of the test results, as users may drop off or exit the site before completing the desired action due to a slower load time or confusing user experience. It can also be harder technically to fire the tracking event, causing a loss in data.

To mitigate these problems, it's important to carefully consider whether redirect-based A/B testing is the most appropriate technique for your specific use case. If you do choose to use redirects, it's important to implement them correctly and thoroughly test them to ensure that they do not negatively impact user experience or test results. Additionally, it may be helpful to use other techniques, such as server-side or client-side testing, testing on the edge, or using middleware to serve different pages, to supplement redirect-based testing and ensure the accuracy and reliability of the test results.

### Semmelweis Effect

The Semmelweis effect refers to the tendency of people to reject new evidence or information that challenges their established beliefs or practices. It is named after Ignaz Semmelweis, a Hungarian physician who, in the 19th century, discovered that hand washing could prevent the spread of infectious diseases in hospitals. Despite his findings, he was ridiculed and ignored by his colleagues, and it took many years for his ideas to be accepted and implemented.

In the context of A/B testing, the Semmelweis effect can manifest in several ways.
For example, a company may have a long-standing belief that a certain design or feature is effective and produces good results, and may not want to experiment with it because everyone "knows" it is correct. Even if an experiment is run against this entrenched belief, and the results of an A/B test challenge established norms, there may be resistance to accepting the new evidence and changing the established practice.

To avoid the Semmelweis effect in A/B testing, it is important to approach experimentation with an open mind and a willingness to challenge established beliefs and practices. It is crucial to let the data guide decision-making and to be open to trying new things, even if they go against conventional wisdom or past practices. It is also important to regularly review and evaluate the results of A/B tests to ensure that the company's beliefs and practices are aligned with the latest evidence and insights, which may change over time.

### Confirmation Bias

Confirmation bias refers to the tendency to favor information that confirms our preexisting beliefs and to ignore or discount information that contradicts our beliefs. In the context of A/B testing, confirmation bias can lead to flawed decision-making and missed opportunities for optimization.

For example, if a company believes that a certain website design or feature is effective, they may only run A/B tests that confirm their beliefs and ignore tests that challenge their beliefs. This can lead to a bias towards interpreting data in a way that supports preexisting beliefs, rather than objectively evaluating the results of the tests. Or a PM may believe a new version of their product will be superior, and only acknowledge evidence that confirms this belief.

Confirmation bias can also manifest in the way tests are designed and implemented. If a company designs an A/B test in a way that biases the results towards a particular outcome, such as by using a biased sample or by selecting a suboptimal metric to measure success, it can lead to misleading results that confirm preexisting beliefs.

To avoid confirmation bias in A/B testing, it is important to approach experimentation with an open and objective mindset. This involves being willing to challenge preexisting beliefs (Semmelweis) and being open to the possibility that the data may contradict those beliefs. It also involves designing tests in a way that is unbiased and that measures the most relevant and meaningful metrics to evaluate success. Having multiple stakeholders review and evaluate the results of A/B tests can help ensure that decisions are based on objective data, rather than personal biases or beliefs.

### HiPPOs

HiPPO is an acronym for the "highest paid person's opinion." In less data-driven companies, decisions about what product to build or which products to ship are made by HiPPOs. The problem with HiPPOs is that it turns out their opinions are no more likely to be right than anyone else's opinions, and are therefore often wrong. Due to their status, they may resist experimentation to preserve that status or their ego. The HiPPO effect is a common problem in many organizations, and it can lead to poor decision-making and missed opportunities for your product.

### Trustworthiness

When experiment results challenge existing norms or an individual's beliefs, it can be easy to blame the data. For this reason, having a trustworthy A/B testing platform is extremely important.
There must be ways to audit the results, and to look into whether there was any systemic or specific problem affecting the results of the experiment. Running A/A tests can help build trust that the platform is working correctly. Trust in an experimentation platform is built over time, and care must be taken to not just dismiss results that are counterintuitive.

### Twyman's Law

Twyman's law is a principle in statistics which holds that the more interesting or unusual a figure looks, the more likely it is to be the result of an error in the data. It is named after the British media researcher Tony Twyman.

In the context of A/B testing, Twyman's law suggests that there will always be some level of variability or uncertainty in the results of an A/B test due to factors such as random chance, sample size, or measurement error. It is often phrased as:

> Any data or figure that looks interesting or different is usually wrong

If you notice a particularly large or unusual change in the results of an experiment, it is more likely to be the result of a problem with the data or an implementation than an actual result. Before you share the results, make sure that the effects are not the result of an error.

### Goodhart's Law

Goodhart's law is a concept in economics that states that when a measure becomes a target, it ceases to be a good measure. In other words, once a metric becomes the sole focus of attention and effort, it loses its value as an indicator of the desired outcome.

When it comes to A/B testing, Goodhart's law can apply in several ways. For example, if a specific metric such as click-through rate or conversion rate becomes the sole focus of an A/B test, it can lead to unintended consequences such as artificially inflating the metric while neglecting other important aspects of the user experience. This can happen because individuals or teams may optimize for the metric being measured rather than focusing on the broader goals of the A/B test, such as improving user engagement or increasing revenue.

To avoid the negative effects of Goodhart's law in A/B testing, it is important to choose the right metrics to track and analyze, and to use a variety of metrics to evaluate the effectiveness of the test. It is also important to keep in mind the broader goals of the test and to avoid tunnel vision on any one metric. Goodhart's law is more likely to happen when you are using proxy metrics instead of the real KPIs you're trying to improve - an example of this might be items added to a cart being used as a proxy for purchases. Also, if the proxy metric is not strongly causally linked to the target metric, pressing hard on the proxy may have no effect on the goal metric, or might actually cause the correlation to break.

### Simpson's Paradox

Simpson's paradox is a statistical phenomenon where a trend or pattern appears in different groups of data but disappears or reverses when the groups are combined. In other words, the overall result may be opposite to what the individual subgroups suggest.

This paradox can arise when a confounding variable (a variable that affects both the independent and dependent variables) is not taken into account while analyzing the data.

Simpson's paradox was famously observed at the University of California, Berkeley in 1973, where it had implications for gender discrimination in graduate school admissions.
At the time, it was observed that although the overall admission rate for graduate school was higher for men than for women (44% vs. 35%), when the admission rates were broken down by department, the reverse was true for many of the departments, with women having a higher admission rate than men in those departments. In the Department of Education, for example, women had a 77% admission rate compared to men's 62% admission rate.

The paradox was resolved by examining the application data more closely and considering the impact of an important confounding variable, which was the choice of department. It was discovered that women were more likely to apply to departments that were more competitive and had lower admission rates, while men were more likely to apply to less competitive departments with higher admission rates.

When the data was reanalyzed, taking into account the departmental differences in admission rates, it was found that women actually had a slightly higher overall admission rate than men, suggesting that there was no discrimination against women in the admissions process. This case study illustrates how Simpson's paradox can occur due to the influence of confounding variables, and how it can lead to misleading conclusions if not properly accounted for in the analysis. To avoid Simpson's paradox in experimentation, it is essential to analyze the data by considering all relevant variables and subgroups. It is crucial to ensure that the experimental groups are similar in terms of demographics and behavior, and to use statistical techniques that account for confounding variables.

### Ethical considerations

Experimentation judges the outcome of changes by looking at the impact they have on some set of metrics. But the seeming objectivity of the results can hide problems. The simplest way this can go wrong is if your metrics are tracking the wrong things, in which case you'll have garbage in and garbage out. But it is also possible for the metrics to not capture harm that is being done to some subsets of your population.

Experimentation results work on averages, and this can hide a lot of systemic biases that may exist. There can be a tendency for algorithmic systems to "learn" or otherwise encode real-world biases in their operation, and then further amplify/reinforce those biases.

Product design has the potential to differentially benefit some groups of users more than others; it is possible to measure this effect and ensure that results account for these groups. Sparse or poor data quality can lead to objective-setting errors and system designs that produce suboptimal outcomes for many groups of end users. One team that does this very well is LinkedIn; you can read about their approach here.

diff --git a/docs/docs/using/experimenting.mdx b/docs/docs/using/experimenting.mdx new file mode 100644 index 00000000000..cf3609837cb --- /dev/null +++ b/docs/docs/using/experimenting.mdx @@ -0,0 +1,317 @@

# Experimenting in GrowthBook

The Experiments section in GrowthBook is all about analyzing raw experiment results in a data source. Before analyzing results, you need to actually run the experiment.
This can be done in several ways:

- Feature Flags (most common)
- Running an inline experiment directly with our SDKs
- Our Visual Editor (beta)
- Your own custom variation assignment / bucketing system

When you go to add an experiment in GrowthBook, it will first look in your data source for any new experiment ids and prompt you to import them. If none are found, you can enter the experiment settings yourself.

## Experiment Splits

When you run an experiment, you need to choose who will get the experiment, and what percentage of those users should get each variation. In GrowthBook, we allow you to pick the overall exposure percentage, as well as customize the split per variation. You can also target an experiment at just some attribute values.

GrowthBook uses deterministic hashing to do the assignment. That means that each user's hashing attribute (usually user id) and the experiment name are hashed together to get a number from 0 to 1. This number will always be the same for the same set of inputs.

There is quite often a need to de-risk a new A/B test by running the control at a higher percentage of users than the new variation, for example, 80% of users get the control, and 20% get the new variation. To solve this case, we recommend keeping the experiment splits equal and adjusting the overall exposure (e.g., 20% exposure, 50/50 on each variation, so each variation gets 10%). This way the overall exposure can be ramped up (or down) without having any users potentially switch variations.

## Metric selection

GrowthBook lets you choose goal metrics and guardrail metrics. Goal metrics are the metrics you're trying to improve, or on which you want to measure the impact of your experiment's change. Guardrail metrics are metrics you're not necessarily trying to improve, but you don't want to hurt. For goal metrics we show the full statistical results. For guardrail metrics we only show the chance of that metric being worse: if there is a significant chance that the guardrail metric is worse, it will be shown in red; otherwise it will be green.
+GrowthBook Metric Selector +
It is best to pick metrics for your experiment that are as close to your treatment as possible, ideally the event itself. For example, if you're trying to improve a signup rate, you can add metrics that are close to that event, like "signup modal open rate" and "signup conversion rate". You can add as many metrics as you like to your experiment, but we suggest each experiment have only a few primary metrics that are used for making the shipping decision. Adding all your metrics is not recommended, as this can lead to false positives caused by random variations (see [Multiple testing problem](/using/experimentation-problems#multiple-testing-problem)).

Before you begin a test, you should have selected a primary metric or set of metrics that you are trying to improve. These metrics are often called the OEC, or Overall Evaluation Criterion. It is important to have this decided ahead of time so that when you look at your experiment results you're not just shopping for metrics that confirm your bias (see [confirmation bias](/using/experimentation-problems#confirmation-bias)).

With GrowthBook, goal and guardrail metrics can be added retroactively to experiments, as long as the data exists in your data warehouse. This allows you to reprocess old experiments if you add new metrics or redefine a metric.

## Activation metrics

Assigning your audience to the experiment should happen as close to the test as possible to reduce noise and increase power. However, there are times when running an experiment requires that users be bucketed into variations before knowing that they are actually being exposed to the variation. One common example of this is with website modals, where the modal code needs to be loaded on the page with the experiment variations, but you're not sure if each user will actually see the modal. With activation metrics you can specify a metric that needs to be present to filter the list of exposed users to just those with that event.

## Sample sizes

When running an experiment you select your goal metrics. Getting enough samples depends on the size of the effect you're trying to detect: the larger the effect you expect the experiment to have, the smaller the total number of events you need to collect. GrowthBook allows users to set a minimum sample size for each metric, and we will hide results before that threshold is reached to avoid premature peeking.

## Test Duration

We recommend running an experiment for at least 1 or 2 weeks to capture variations in your traffic. Before a test is significant, GrowthBook will give you an estimated time remaining before it reaches the minimum thresholds. Traffic to your product is likely not uniform, and there may be differences in behavior across days of the week that you want your experiment to capture.

## Attribution models / Conversion Windows

A lot can happen between when a user is exposed to an experiment and when a metric event is triggered. How you want to attribute that conversion event to the experiment is adjustable within GrowthBook. Once a user enters an experiment for the first time, the attribution model determines which subsequent conversion events are included in the analysis. GrowthBook lets you choose between two attribution models: First Exposure and Experiment Duration.

**First Exposure** - Only include events that fall within the configured metric conversion window. For example, if the metric's conversion window is set to 72 hours, any conversion that happens after that is ignored.
+ +**Experiment Duration** - Include all events that happen while the experiment is running, no matter what +the metric's conversion window is set to. + +![Attribution Window Diagram](/images/using/attribution-window-diagram2.png) + +## Understanding results + +### Bayesian Results + +In GrowthBook the experiment results will look like this. + +![GrowthBook Results](/images/using/experiment-results-bayesian.png) + +Each row of this table is a different metric. This is a simplified overview of the data. If you want to +see the full data, including 'risk', mouse over any of the results. + +![GrowthBook Results](/images/using/experiment-results-bayesian-details2.png) + +Risk tells you how much you are predicted to lose if you choose the selected variation as the winner +and you are wrong. Anything highlighted in green indicates that the risk is very low and it may be safe +to call the experiment. You can use the dropdown to see the risk of choosing a different winner if you +have multiple variations. Risk thresholds are adjustable per metric. + +Value is the conversion rate or average value per user. In small print you can see the raw numbers +used to calculate this. + +**Chance to Beat Control** tells you the probability that the variation is better. If you are familiar with +Frequentist statistics, you can consider this value 1 - the p value. Anything above the threshold (which +by default is set to 95%) is highlighted green indicating a very clear winner. Anything below the +threshold (5% by default) is highlighted red, indicating a very clear loser. Anything in between is gray +indicating it's inconclusive. If that's the case, there's either no measurable difference or you haven't +gathered enough data yet. + +**Percent Change** shows how much better/worse the variation is compared to the control. It is a +probability density graph and the thicker the area, the more likely the true percent change will be +there. As you collect more data, the tails of the graphs will shorten, indicating more certainty around +the estimates. + +### Frequentist Results + +You can also choose to analyze results using a Frequentist engine that conducts simple t-tests for +differences in means and displays the commensurate p-values and confidence intervals. +If you selected the "Frequentist" engine, when you navigate to the results tab to view and update the +results, you will see the following results: + +![GrowthBook Results - Frequentist](/images/using/experiment-results-frequentist.png) + +Everything is the same as above except for three key changes: + +- There is no longer a risk value, as the concept is not easily replicated in frequentist statistics. +- The Chance to Beat Control column has been replaced with the P-value column. The p-value + is the probability that the percent change for a variant would have been observed if the true + percent change were zero. When the p-value is less than the threshold (default to 0.05) and the + percent change is in the preferred direction, we highlight the cell green, indicating it is a clear + winner. When the p-value is less than the threshold and the percent change is opposite the + preferred direction, we highlight the cell red, indicating the variant is a clear loser on this metric. +- We now present a 95% confidence interval rather than a posterior probability density plot. + +## Data quality checks + +GrowthBook performs automatic data quality checks to ensure the statistical inferences are valid and +ready for interpretation. 
You can see all checks and monitor the health of your experiments on the experiment **health page**.

### Health Page

GrowthBook automatically does data quality checks on all experiments and shows the results on the _Health Page_.
+ Experiment Health Page +
+ +This page shows experiment exposure over time, and also all the other health checks we do. + +### Sample Ratio Mismatch (SRM) + +Every experiment automatically checks for a Sample Ratio Mismatch and will warn you if found. +This happens when you expect a certain traffic split (e.g. 50/50) but you see something significantly +different (e.g. 46/54). We only show this warning if the p-value is less than 0.001, which means it's +extremely unlikely to occur by chance. We will show this warning on the results page, and also on our +experiment health page. + +
+ Sample Ratio Mismatch +
Like the warning says, you shouldn't trust the results since they are likely misleading. Instead, find and fix the source of the bug and restart the experiment.

### Multiple Exposures

We also automatically check each experiment to make sure that too many users have not been exposed to multiple variations of a single experiment. This can happen if the hashing attribute is different from the assignment id used in the report, or because of implementation problems.

### Minimum Data Thresholds

You can set thresholds per metric to make sure people viewing the results aren't drawing conclusions too early (e.g., when it's 5 vs. 2 conversions).

### Variation Id Mismatch

GrowthBook can detect missing or improperly-tagged rows in your data warehouse. The most common way this can happen is if you assign with one parameter but send a different ID to your warehouse from the trackingCallback call. It may indicate that your variation assignment tracking is not working properly.

### Suspicious Uplift Detection

You can set thresholds per metric for a maximum percent change. When a metric's result is above this, GrowthBook will show an alert. Large uplifts may indicate a bug - see [Twyman's Law](/using/experimentation-problems#twymans-law).

### Guardrails

Guardrail metrics are ones that you want to keep an eye on, but aren't trying to specifically improve with your experiment. For example, if you are trying to improve page load times, you may add revenue as a guardrail since you don't want to inadvertently harm it.

Guardrail results show up beneath the main table of goal metrics. The full statistics are shown like goal metrics, and similarly they are colored based on "Chance to Beat Control". If guardrail metrics become significant, you may want to consider ending the experiment.

![Guardrail Results](/images/using/guardrail-metrics.png)

If you select the frequentist engine, we instead use yellow to represent a metric moving in the wrong direction at all (regardless of statistical significance), red to represent a metric moving in the wrong direction with a two-sided t-test p-value below 0.05, and green to represent a metric moving in the right direction with a p-value below 0.05. Otherwise, the cell is unshaded if the metric is moving in the right direction but not statistically significant at the 0.05 level.

## Digging deeper

GrowthBook lets you dig into the results to get a better understanding of the likely effect of your change.

### Segmentation

Segments are applied to experiment results to only show users that match a particular attribute. For example, you might have "country" as a dimension, and create a segment for just "US visitors". You can configure the experiment to look at just one particular segment of users. Segments can be created with SQL from the "Data and Metrics -> Segments" page.

![Segments](/images/using/segments-page.png)

There are two ways you can use segments in your experiment results. The first is to edit the experiment's 'analytics settings' and add one of the segments. The other way is to create a custom ad-hoc report, and then click on 'customize' and select a segment to apply to the results.

### Dimensions

GrowthBook lets you break down results by any dimension you know about your users. We automatically let you break down by date, and any additional dimensions can be added either with the exposure query, or with custom SQL from the dimension menu.
Some examples of common dimensions are "Browser" or "Location". You can read more about [dimensions here](/app/dimensions).

![Dimension Selector](/images/using/dimension-selector.png)

It can be very helpful to look into how specific dimensions of your users are affected by the experiment. For example, you may discover that a specific browser is underperforming compared to the rest, and this may indicate a bug, or something to investigate further. The more metrics and dimensions you look at, the more likely you are to see a false positive. If you find something that looks surprising, it's often worth a dedicated follow-up experiment to verify that it's real.

### Ad-hoc reports

Experiment reports have a lot of configuration, and sometimes it can be useful to adjust these configurations without changing the original report. GrowthBook supports ad-hoc reports, which are essentially copies of the original report, where you can adjust any of the configuration parameters, such as segments, dates, metrics, and even custom SQL to remove outliers.

![Ad-hoc Reports Menu](/images/using/ad-hoc-menu.png)

All ad-hoc reports you create can be shared publicly and live at the bottom of the original report, to make sure you capture any derived results.

![Ad-hoc Reports and Publishing](/images/using/ad-hoc-report-saving.png)

## Deciding A/B test results

Hopefully you are analyzing your experiment results with your OEC already documented. Even so, when to stop and how to interpret results may not be straightforward.

### When to stop an experiment

When using the Bayesian statistics engine, there are a few criteria you can use when deciding to stop a test:

- significance is reached on your primary metrics
- metric risk drops below your risk thresholds
- guardrail metrics are not affected
- the test duration is reached

It all depends on what you're trying to do with the experiment. For example, if you'd like to know what impact your change has, you should use the first method. If you're doing a design change and want to make sure you haven't broken anything on your product, you can use the risk or guardrail approach. You should also make sure that the experiment has run for your minimum test duration (typically 1 or 2 weeks), so that you're not looking at highly skewed sampling.

For Frequentist statistics, you should determine the running time of the experiment and stop the test at that fixed horizon to ensure accurate results (see [Peeking](/using/experimentation-problems#peeking)), or use Sequential analysis.

### Interpreting results

It is quite common for experiments to have mixed results. Deciding on the outcome of an experiment in these cases may require some interpretation. As a general rule, you should have one goal metric that is the primary metric you're trying to improve, and if this metric is up significantly it is generally straightforward to declare a result. If you have a mix of up and down metrics, the decisions are less clear.

Once you have reached a decision with your experiment, you can click the "mark as finished" link towards the top of the results. This will open a modal where you can document the outcome, including the decision and your observations.

This creates a card on the top of the experiment results with your conclusion. Please note that currently marking a test as finished does not stop the test from running.
If you are using feature flags to run the experiment, you should also go to the feature and turn off the experiment.

### Inconclusive results

Sometimes you may have an experiment that is inconclusive. Generally it is a good idea to have a policy for what to do in these cases. We suggest that your policy should be to revert to the control variant, unless the new version unlocks some new features.

diff --git a/docs/docs/using/fundamentals.mdx b/docs/docs/using/fundamentals.mdx new file mode 100644 index 00000000000..81481215f60 --- /dev/null +++ b/docs/docs/using/fundamentals.mdx @@ -0,0 +1,285 @@

# A/B Testing Fundamentals

If you are new to A/B testing, you may find a lot of new terminology. The goal of this section is to help you understand the basics of A/B testing.

## Glossary - Common Experimentation Terms

### Control (or Baseline)

The existing version of the product that you are trying to improve upon.

### Variation (or Treatment)

A new version of the product that you are testing against the Control.

### Hypothesis

A formal way to describe what you are changing and what you think it will do.

### Statistical Significance

An indicator that the difference in performance between the control and treatment groups is unlikely to have occurred by chance.

### Confidence level

The level of certainty we require before we consider the result of a test statistically significant. A common confidence level used in A/B testing is 95%.

### Sample size

The number of visitors or users who are included in the A/B test.

### Test duration

The length of time that the A/B test is run. This can vary depending on the sample size and the desired confidence level.

### Variance

The degree to which the results of an A/B test vary over time or across different segments of the user base.

## Anatomy of an A/B test

![Anatomy of an A/B test](/images/using/ab-test-diagram.png)
| Step | Description |
| --- | --- |
| **Hypothesis** | Come up with an idea you want to test |
| **Assignment** | Randomly split your audience into persistent groups |
| **Variations** | Create and show different experiences to each group |
| **Tracking** | Record events and behaviors of the two groups |
| **Results** | Use statistics to determine if the differences in behavior are significant |
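To make the Assignment step above concrete, here is a minimal sketch of deterministic, hash-based bucketing. The experiment key and user id are hypothetical, and this is an illustration of the general technique rather than the exact hashing scheme GrowthBook's SDKs implement:

```python
import hashlib

def assign_variation(user_id: str, experiment_key: str, weights=(0.5, 0.5)) -> int:
    """Deterministically bucket a user: the same inputs always return the same variation."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to a number in [0, 1]
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if bucket < cumulative:
            return index
    return len(weights) - 1  # guard against floating point edge cases

print(assign_variation("user_123", "new-signup-modal"))  # stable across calls, devices, and sessions
```

Because assignment is a pure function of the user id and experiment key, no assignment state needs to be stored, and a returning user always sees the same variation.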
### Hypothesis

Good A/B tests, and really any project, start with a hypothesis about what you're trying to do. A good hypothesis should be as simple, specific, and falsifiable as possible.

A good A/B test hypothesis should be:

- **Specific**: The hypothesis should clearly state what you want to test and what outcome you expect to see.
- **Measurable**: The hypothesis should include a metric or metrics that can be used to evaluate the outcome of the test.
- **Relevant**: The hypothesis should be relevant to your business goals and objectives.
- **Clear**: The hypothesis should be easy to understand and communicate to others.
- **Simple**: The fewer variables that are involved in the experiment, the more causality can be implied in the results.
- **Falsifiable**: The hypothesis should be something that can be tested using an A/B test to determine its validity.

Overall, a good A/B test hypothesis should be a clear statement that identifies a specific change you want to make and the expected impact on a measurable outcome, while being grounded in data and relevant to your business goals.

### Audience and Assignments

Choose the audience for your experiment. To increase the detectable effect of your experiment, the audience you choose should be as close to the experiment as possible. For example, if you're focusing on a new user registration form, you should select just unregistered users as your audience. If you were to include all users, you would have users who could not see the experiment, which would increase the noise and reduce the ability to detect an effect. Once you have selected your audience, you will randomize users to one variation or another.

### Variations

An A/B test can include as many variations as you like. Typically the A variation is the control variation. The variations can have as many changes as you like, but the more you change, the less certain you can be about what caused the change.

### Tracking

Tracking is the process of recording the events and behaviors of your users. In the context of A/B testing, you want to track events that happen after exposure to the experiment, as these events will be used to determine if there is a change in performance due to being exposed to the experiment. A/B testing systems are either "warehouse native" (like GrowthBook), meaning they use your existing event trackers (like GA, Segment, Rudderstack, etc.), or they require you to send event data to them.

### Results

With A/B testing we use statistics to determine if the effect we measure on a metric of interest is significantly different across variations. The results of an A/B test on a particular metric can have three possible outcomes: win, loss, or inconclusive. With GrowthBook we offer two different statistical approaches: Frequentist and Bayesian. By default, GrowthBook uses Bayesian statistics. Each method has its pros and cons, but both will provide you with evidence as to how each variation affected your metrics.

## Experimentation Basics

### Typical success rates

A/B testing can be incredibly humbling: one quickly learns how often our intuition about what will be successful with our users is incorrect. Industry-wide average success rates are only about 33%. ⅓ of the time our experiments are successful in improving the metrics we intended to improve, ⅓ of the time we have no effect, and ⅓ of the time we hurt those metrics.
Furthermore, the more optimized your product is, the lower your success rates tend to be.

But A/B testing is not only humbling; it can dramatically improve decision making. Rather than thinking we only win 33% of the time, the above statistics really show that A/B tests help us make a clearly right decision about 66% of the time. Of course, shipping a product that won (33% of the time) is a win, but so is not shipping a product that lost (another 33% of the time). Failing fast through experimentation is success in terms of loss avoidance, as you are not shipping products that hurt your metrics of interest.

### Experiment power

With A/B testing, power analysis refers to whether a test can reliably detect an effect. Specifically, it is often written as the percent of the time a test would detect an effect of a given size with a given number of users. You can also think of the power of a test with respect to the sample size. For example: "How many times do I need to toss a coin to conclude it is rigged by a certain amount?"

### Minimal Detectable Effect (MDE)

The Minimal Detectable Effect is the minimum difference in performance between the control and treatment groups that can be detected by the A/B test, given a certain statistical significance threshold and power. The MDE is an important consideration when designing an A/B test because if the expected effect size is smaller than the MDE, then the test may not be able to detect a significant difference between the groups, even if one exists. Therefore, it is useful to calculate the MDE based on the desired level of statistical significance, power, and sample size, and to ensure that the expected effect size is larger than the MDE, so that the A/B test is able to accurately detect the difference between the control and treatment groups.

### False Positives (Type I Errors) and False Negatives (Type II Errors)

When making decisions about an experiment, we can say that we made the right decision when choosing to ship a winning variation or shut down a losing variation. However, because there is always uncertainty in the world and we rely on statistics, sometimes we make mistakes. Generally, there are two kinds of errors we can make: Type I and Type II errors.

**Type I Errors**: also known as False Positives, these are errors we make when we think the experiment provides us with a clear winner or a clear loser, but in reality the data are not clear enough to make this decision. For example, your metrics all appear to be winners, but in reality the experiment has no effect.

**Type II Errors**: also known as False Negatives, these are errors we make when the data appear inconclusive, but in reality there is a winner or a loser. For example, you run an experiment for as long as you planned to, and the data aren't showing a clear winner or loser when actually a variation is much better or worse. Type II errors often require you to collect more data or choose blindly rather than providing you with the correct, clear answer.

|  | Actual: Inconclusive | Actual: Lost | Actual: Won |
| --- | --- | --- | --- |
| **Decision: Inconclusive** | Correct inference | Type II error (false negative) | Type II error (false negative) |
| **Decision: Shut down** | Type I error (false positive) | Correct inference | Type I error (false positive) |
| **Decision: Ship** | Type I error (false positive) | Type I error (false positive) | Correct inference |
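To see what a 5% false positive rate means in practice, here is a small simulation of repeated A/A tests. It uses a simple two-proportion z-test on synthetic data and is purely illustrative; it is not a description of GrowthBook's statistics engine:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# Simulate 10,000 A/A tests: both "variations" have the same true 10% conversion rate
false_positives = 0
for _ in range(10_000):
    conv_a = rng.binomial(5_000, 0.10)
    conv_b = rng.binomial(5_000, 0.10)
    if two_proportion_p_value(conv_a, 5_000, conv_b, 5_000) < 0.05:
        false_positives += 1

print(false_positives / 10_000)  # ~0.05: about 5% of null experiments look "significant" by chance
```

Roughly 1 in 20 of these null experiments will appear to be a winner or loser, which is why corrections for peeking and multiple comparisons matter.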
### P-Value

In frequentist statistics, a p-value is a measure of the evidence against a null hypothesis. The null hypothesis is the hypothesis that there is no significant difference between two groups, or no relationship between two variables. In the context of A/B testing, the p-value is a statistical measure that indicates whether there is a significant difference between two groups, A and B.

The p-value is the probability of observing a difference as extreme as or more extreme than your actual difference, given there is actually no difference between groups. If the p-value is less than a predetermined level of significance (often 0.05), the result is deemed to be statistically significant, as the difference is not likely due to chance.

For example, let's say that you conduct an A/B test in which you randomly assign users to either group A (the control group) or group B (the experimental group). You measure a specific metric such as conversion rate for each group, and you calculate the p-value to test the hypothesis that there is no difference between the two groups. If the p-value is less than 0.05, the observed difference in conversion rate between the two groups would be unlikely if there weren't truly a difference between the groups; we say the effect is statistically significant and likely not due to chance.

It's important to note that the p-value alone cannot determine the importance or practical significance of the findings. Additionally, it's essential to consider other factors such as effect size, sample size, and study design when interpreting the results.

### A/A Tests

A/A testing is a form of A/B testing in which, instead of serving two different variations, two identical versions of a product or design are tested against each other. In A/A testing, the purpose is not to compare the performance of the two versions, but rather to check the consistency of the testing platform and methodology.

The idea behind A/A testing is that if the two identical versions of the product or design produce significantly different results, then there may be an issue with the testing platform or methodology that is causing the inconsistency. By running an A/A test, you can identify and address any potential issues before running an A/B test, which can help ensure that the results of the A/B test are reliable and meaningful.

A/A testing is a useful tool for ensuring the accuracy and reliability of A/B tests, and can help improve the trust in the platform, and faith in the quality of the insights and decisions that are based on the results of these tests.

### Interaction effects

When you run more than one test at a time, there is a chance that the tests may interfere with each other. For example, you could have two tests that change the price on two different parts of your product. Some combinations of the two experiments could cause users to see two different prices, become confused, and lose trust in your product. This is an extreme example. A more common example is someone who sees an experiment on the account registration page, and then another test on the checkout page. If the tests are run in parallel, you will have users who see all combinations of variations: **AA, AB, BA**, and **BB**. A _meaningful_ interaction effect would be if one combination, for example AA, outperforms the other combinations by more than each test alone would suggest. There may be interaction effects between tests run in parallel, but they are unlikely to be meaningful.
Most often they will just increase the variance of the tests without changing the results.

### Novelty and Primacy Effects

Novelty and primacy effects are psychological phenomena that can influence the results of A/B testing. The novelty effect refers to the tendency of people to react positively to something new and different. In the context of A/B testing, a new design or feature may initially perform better than an existing design simply because it is new and novel. However, over time, the novelty effect may wear off and the performance of the new design may decrease.

The primacy effect refers to the tendency of people to remember and give more weight to information that they encounter first. With A/B testing, this can manifest as an initial reduction in the improvement for metrics as users prefer the original treatment of the product.

One way to mitigate the effects of novelty is to run tests over a longer period of time to allow for the novelty effect to wear off. Another approach is to stagger the rollout of a new design or feature to gradually introduce it to users and avoid a sudden and overwhelming change.

To account for the primacy effect, you can target or segment an experiment to just new users to ensure that they won't be influenced by how things used to work. This can help ensure that the results of the test are truly reflective of user behavior and preferences, rather than the order in which designs were presented.

diff --git a/docs/docs/using/growthbook-best-practices.mdx b/docs/docs/using/growthbook-best-practices.mdx new file mode 100644 index 00000000000..335f38276fe --- /dev/null +++ b/docs/docs/using/growthbook-best-practices.mdx @@ -0,0 +1,136 @@

# GrowthBook Best Practices

## Organization

As you scale up your usage of GrowthBook and start running many experiments, keeping everything organized and easy to find is important. GrowthBook includes a number of organizational structures to help you scale.

### Organizations

The organization is the highest-level structure within GrowthBook. An organization contains everything within your GrowthBook instance: users, data sources, metrics, features, etc. For both cloud and self-hosted users, it is possible to belong to multiple organizations, but each organization is otherwise entirely independent of the others. For some, complete isolation of the teams or subdivisions within the company may be desired. For example, if your company has two or more largely independent products (e.g., Google has Search and Google Docs), you can set up a separate organization for each product.

For self-hosted enterprise users, we support multi-organization mode, which also comes with a super-admin account type that can manage users across organizations.

### Environments

In GrowthBook, you can create as many Environments as you need for your feature flags and override rules. Environments are meant to separate how your feature flags and override rules are deployed. Each environment can have one or more SDK API endpoints, specified when you create the SDK connection, allowing you to differentiate the override rules. For example, you might have environments for "Staging", "QA", and "Production". While testing a feature, you can set specific rules on the "Staging" or "QA" environment, and when you're ready, you can move the applicable rules to the "Production" environment.
+
+You can add an arbitrary number of environments from the SDK Connections → Environments page.
+
+![Environments Page](/images/using/environments-page.png)
+
+### Projects
+
+Within an organization, you can create projects. Projects can help isolate the view of GrowthBook to
+just the sections that apply to that GrowthBook user. Projects are a great way to organizationally
+separate features, metrics, experiments, and even data sources by team or product feature. For example,
+you could have a "front-end" project and a "back-end" project, or organize by team, like "Growth" and
+"API". Unlike separate organizations, projects can share data between them. Projects are managed from
+the Settings → Projects page.
+
+![Projects Page](/images/using/projects-page.png)
+
+A use case for using projects is if you have divisions within your product but a centralized data source.
+We typically see projects used per team or per product within your organization. For example, if you have
+a mobile app and a website that share users, but the code bases are different, you will want to create
+two projects: a _mobile_ project and a _web_ project.
+
+Each of the items within GrowthBook can be assigned to multiple projects. You can have a data source that
+is part of the 'mobile' and 'web' projects but not the 'marketing' project. That data source will not be
+available for users in the 'marketing' project.
+
+To help keep feature payloads smaller, the SDK endpoint where the feature definitions are returned
+can be scoped to each project. If you organize projects by feature or area of your product, you can
+use this to only return the features which pertain to that area. For example, with the _mobile_ and
+_web_ projects above, you can scope each SDK endpoint to its project, since the two apps are likely to
+use different code and you don't want to expose features unnecessarily.
+
+One advantage of using projects is that you can adjust permissions and even some statistical settings per
+project. Users can have no access to a project or, inversely, have no general permissions but be granted
+a project-level permission so they can work within their project. If a team prefers to use a frequentist
+statistical model, this can be adjusted per project.
+
+### Tags
+
+Another way to organize GrowthBook is with _tags_. With tags, you can quickly filter lists and select
+metrics. For example, if you tagged all experiments to do with your checkout flow with the tag "checkout",
+you can quickly see them in the list by clicking on 'filter by tags' on the experiment list. Tags can be
+color-coded and managed from the Settings → Tags page. You can add multiple tags per item you are tagging.
+
+![Tags Page](/images/using/tags-page.png)
+
+Metrics with tags can be used to quickly add all of those metrics to an experiment. When creating an
+experiment or editing its metrics, there is a section titled "Select metric by tag" which will let you add
+all the metrics with a given tag to both goal and guardrail metrics. This is useful if you want to use a
+standard set of goal metrics or guardrail metrics for your experiments.
+
+Tags are often used to mark sub-features of your product; for example, if you have an e-commerce
+website, you might want to tag features or experiments with the area they affect, like '_pricing_,'
+'_product page_,' or '_checkout_.'
+
+![Experiments filtered by tag](/images/using/experiments-filtered-by-tag.png)
+
+### Naming
+
+Another organizing principle you can use is the naming of your experiments and features. Because
+GrowthBook makes it easy to quickly search the list of features and flags, using naming conventions can
+be an effective way to organize your project.
+
+We've seen several strategies be successful here, but as a general rule, you'll want to be as specific as
+possible when naming features and experiments. For example, you can use <project scope>\_<project name>,
+or the year, quarter, or section plus the name of the experiment, e.g., "23-Q4 New user registration modal"
+or "23-Team3 Simplified checkout flow". This lets you quickly see when the experiment was run or which
+team worked on it.
+
+### Hygiene & Archiving
+
+As the number of features and experiments grows, you will want to remove past items that are no longer
+relevant. Within GrowthBook you can archive and delete items. **Deleting** something will permanently
+remove it from GrowthBook. **Archived** items won't be deleted, but they are removed from the main part
+of the UI and (in the case of archived metrics) are no longer available for adding to new experiments.
+Archived items can also be restored at any time. These methods help you keep your UI clean and relevant.
+
+### Source of Truth
+
+If you run an experimentation program for a long enough time, you'll find yourself with an experiment
+idea that seems really familiar, and people will wonder, "Didn't we already test this?" If you don't
+have a central repository for all your experiment results, it can be difficult to find out whether you
+did test it previously, and even if you did, whether what you tested was similar enough to the new idea
+that you don't have to test it again.
+
+GrowthBook is designed to help with this by creating a central source for the features you've launched
+and the experiments you've run. To help facilitate this, GrowthBook has created a number of features to
+help you capture meta information.
+
+### Meta information
+
+Features and experiments can all have metadata attached to them. The purpose of this is to capture
+all the meta-information around a feature or experiment that might help contextualize it for posterity
+and to capture the institutional knowledge that your program generates. This is also very helpful when
+new members join your team, so they don't just suggest ideas you've run many times already.
+
+For experiments, you should capture the original idea, any screenshots of similar products, and, most
+importantly, images/screenshots of the control and variants for the experiment. Quite often,
+someone will suggest an idea you've run previously. In these cases, it is important to be able to find
+out what exactly you tested previously: it's possible that the new idea is slightly different, or you
+may decide that it is the same and try testing another idea, or you could decide that your product is
+substantially different and the same idea may be worth testing again. To make this decision, it is
+important to capture not just the experiment results but the broader context of what your product
+looked like at the time and the test variants.
+
+Getting your team to document is always a challenge. To help with this, GrowthBook takes two approaches.
+The first is to make it super easy to add documentation directly in the platform you're already using
+for the experiment. Secondly, we added launch checklists, which can require that certain fields be filled
+in before your team is able to start an experiment.
diff --git a/docs/docs/using/index.mdx b/docs/docs/using/index.mdx
new file mode 100644
index 00000000000..c9f429da1c9
--- /dev/null
+++ b/docs/docs/using/index.mdx
@@ -0,0 +1,53 @@
+---
+title: Using GrowthBook
+sidebar_label: Using
+slug: /using
+---
+
+# Guide on using GrowthBook
+
+## Introduction
+
+In today's data-driven world, businesses of all sizes rely on A/B testing to make data-driven decisions.
+A/B testing, also known as split testing, has come a long way from being a simple tool to optimize
+websites, and is often used as a powerful tool to determine the impact of any changes to your application.
+By measuring how your user behavior and engagement change in a controlled manner, you can
+determine causally whether your hypothesis is correct, and make informed data-driven decisions that
+improve user experience, increase conversions, and drive growth.
+
+This document is intended to be an open source and continuously updated guide to A/B testing.
+Whether you're a seasoned expert at running experiments, or just starting out, this guide will provide
+you with the knowledge and skills you need to run a successful A/B testing program, with a specific
+focus on GrowthBook, an open source feature flagging and A/B testing platform.
+
+In the following chapters, we'll start with an overview of what A/B testing is, and familiarize you with
+the terms that are commonly used. We'll cover the basics of statistical significance, sample size, and
+other key concepts that are essential for understanding A/B testing.
+Next, we'll cover the best practices for running an A/B test, followed by some of the common mistakes
+and pitfalls that can affect experimentation programs. Finally, we'll go beyond individual A/B tests and
+talk about how to run an experimentation program, and then the specifics of how to do this well with
+GrowthBook.
+
+We hope that after reading this guide, you'll understand that A/B testing is a critical tool for
+determining the causal impact of the changes you make, as well as for optimizing flows. By making
+informed data-driven decisions, you can improve user experience, increase conversions, and drive growth.
+With the open source A/B testing tool GrowthBook, you have a powerful and flexible platform that can help
+you run experiments quickly and easily. We hope that this guide will give you the knowledge and skills
+you need to run a successful A/B testing program and make data-driven decisions. Whether you're a
+developer, product manager, data scientist, marketer, or business owner, A/B testing can help you achieve
+your goals and drive growth.
+
+## Contents
+
+- [Fundamentals of AB Testing](/using/fundamentals)
+- [Experimentation Best Practices](/using/experimentation-best-practices)
+- [Experimentation Common Problems](/using/experimentation-problems)
+- [Experimentation As Part of Your Development Process](/using/product-development)
+- [Experimenting in GrowthBook](/using/experimenting)
+- [GrowthBook Organization Best Practices](/using/growthbook-best-practices)
+- [Securing GrowthBook](/using/security)
+- [Experimentation Programs](/using/programs)
+
+## Other resources
+
+At GrowthBook, we highly recommend the book "Trustworthy Online Controlled Experiments: A
+Practical Guide to A/B Testing" by Ron Kohavi, Diane Tang, and Ya Xu. It is available on Amazon
+or on Ronny's site at [https://www.exp-platform.com/Documents/GuideControlledExperiments.pdf](https://www.exp-platform.com/Documents/GuideControlledExperiments.pdf).
diff --git a/docs/docs/using/product-development.mdx b/docs/docs/using/product-development.mdx
new file mode 100644
index 00000000000..d6a3c47dd9a
--- /dev/null
+++ b/docs/docs/using/product-development.mdx
@@ -0,0 +1,82 @@
+# Experimentation-driven product development
+
+Experimentation-driven product development is a shift in product development from focusing on **shipping**
+new products to focusing on shipping features that have an **impact** on your business. The best way
+of determining impact is through A/B testing.
+
+_The ideal goal with product-driven experimentation is to test every new feature that is developed._
+This level of experimentation often requires adjusting your existing product process.
+
+If you want to read more about how to make the case for experimentation-driven product development,
+or the benefits to your culture, you can read our section on [experimentation programs](/using/programs).
+
+## Platform integration
+
+Experimentation-driven product development requires a tight integration between your product and the
+experimentation platform. This is important to keep the incremental cost of running an experiment low,
+while increasing the ability to run a high volume of experiments. The cost can be in terms of
+effort or in terms of actual money. You want to keep both as low as possible, and make sure
+your platform encourages your team to run experiments.
+
+Given this, many companies choose to build their own experimentation platform. This is a big undertaking,
+and much harder than it might seem at first. There are also ongoing maintenance costs, as well as the
+risk of making product decisions on a platform that might have undiscovered bugs.
+
+### Reducing costs
+
+Costs of running experiments can be broken down into a few categories: the cost of data storage, the cost
+of engineering time, and the cost of the platform itself. GrowthBook is designed to directly address these
+costs and allow you to run a lot of experiments. GrowthBook is warehouse native and uses any data
+you already have. It was also designed to make it extremely easy to add an experiment (two lines of
+code) and to have a high-quality developer experience that reduces engineering costs. GrowthBook itself
+is open-core, and extremely economical to run.
+
+## Product prioritization changes
+
+As you become aware of the HiPPO effect on your decision-making process, and start to move away from it,
+you need another way to prioritize projects. There are many prioritization frameworks to help with this
+(we have a [whole section on experimentation prioritization frameworks](/using/programs#prioritization)), but the goal of whichever
+system you choose should be to encourage building smaller, testable features, a high frequency of
+experiments, and a good mix of idea types and project sizes.
+
+This is a big change from the traditional product development process, where we would spend a lot of time
+trying to predict which features will work, and then spend a lot of time building them. With
+experimentation-driven product development, we spend less time predicting and more time testing. This
+means that we can spend less time building features that don't work, and more time building features that
+do work.
+
+### HAMM
+
+HAMM is a framework to help you think about experimentation-focused product development. It stands for
+Hypothesis, Actions, Measure, and MVP. It is one product framework to help increase your learning rate
+and build a culture of experimentation.
+
+- **Hypothesis**: What is the hypothesis you are testing? This should be a clear, falsifiable statement of what you are trying to learn. See our section on how to make a good [hypothesis](/using/fundamentals#hypothesis-1).
+- **Actions**: What are the actions you expect your user to take if this hypothesis is true?
+- **Measure**: What are the metrics that you could measure that would indicate that the user is taking the actions you expect? What might be a counterfactual metric that would indicate that the user is not taking the actions you expect? What are the guardrail metrics that you want to make sure you don't negatively impact?
+- **MVP**: Given the above, what is the smallest thing you could build to test this hypothesis?
+
+Thinking about the HAMM process at the beginning of a project lays the groundwork for a high quality
+experiment. You'll have the hypothesis, the metrics, and the success criteria (OEC). With a smaller MVP
+or MTP (Minimum Testable Product), you can also increase your experimentation rate and therefore your
+learning rate.
+
+![John Hamm](/images/using/jon_hamm-cropped.png)
+
+
+## Why you're not seeing experiment impacts
+
+Quite often companies run A/B tests which show positive results, and yet when
+overall metrics are examined, the impact of these tests is invisible. You might expect to
+see inflection points in your metrics around the time the experiment was implemented, but often you
+won't. Here are some reasons why:
+
+- **Confusing statistics** - Interpreting results can be confusing without a solid grasp of what the statistics are telling you.
+- **Bad practices** - You may be running experiments that are not valid, or are not measuring the right thing. See the Peeking problem.
+- **Lost in the noise** - The impact of the experiment may be too small to be visible. This is especially true if you are running a lot of experiments - but this is not bad, just hard to see on a macro level.
+- **Optimizing the wrong product** - You might be experimenting on a section of your product that doesn't represent a large fraction of your overall usage. Even if you're successful in these areas of the product, the overall impact will be limited by the small fraction of users that you've affected.
+- **Optimizing the wrong metric** - You might be optimizing for a metric that doesn't matter. For example, you might be optimizing for a metric that is not correlated with revenue, your main KPI. This is especially true if you are optimizing for a proxy metric, such as clicks, instead of the actual metric, such as revenue.
+
+You can read more about this topic in our blog post [Why the impact of A/B testing can seem invisible](https://medium.com/growth-book/why-the-impact-of-a-b-testing-can-seem-invisible-5b2d69efa48).
diff --git a/docs/docs/using/programs.mdx b/docs/docs/using/programs.mdx
new file mode 100644
index 00000000000..5dbbb20add8
--- /dev/null
+++ b/docs/docs/using/programs.mdx
@@ -0,0 +1,390 @@
+# Experimentation Programs
+
+"Experimentation", or being more "data driven", can mean a lot of different things for different
+companies. It can be anything from running one test a quarter to running tens of thousands of experiments
+simultaneously. This difference in experimentation sophistication can be thought of with the crawl,
+walk, run, fly framework (from "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing" by Ron Kohavi, Diane Tang, and Ya Xu).
+
+**CRAWL - Basic Analytics**
+Companies at this stage have added some basic event tracking and are starting to get some visibility
+into their users' behavior. They are not running experiments, but the data is used to come up with
+insights and potential project ideas.
+
+**WALK - Optimizations**
+After implementing comprehensive event tracking, focus turns to optimizing the user
+experience on some parts of the product. At this stage, A/B tests may be run manually, limiting the
+number of possible experiments that can be run. Typically at this stage, depending on the amount of
+traffic you have, you may be running 1 to 4 tests per month.
+
+**RUN - Common Experimentation**
+As a company realizes that experimentation is really the only way to causally determine the impact of
+the work they are doing, they will start to ramp up their ability to run A/B tests. This means adopting or
+building an experimentation platform. With this, all larger changes made to the product are tested, and
+they may also have a Growth Team that is focused on optimizing parts of the product. At this stage, a
+company will be running 5 to 100 A/B Tests/Month. This may also include hiring a data team to help with
+setting up and interpreting the results.
+
+**FLY - Ubiquitous Experimentation**
+For companies that make it to the flying stage, A/B testing becomes the default for every feature.
+Product teams develop success metrics for all new features, which are then run as A/B tests
+to determine if they were successful. At this point, A/B tests are able to be run by anyone in the
+product and engineering organization. Companies at this stage of ubiquitous experimentation can run
+anywhere from 100 to 10,000+ A/B Tests/Month.
+
+## Making the case for experimentation
+
+If your organization doesn't yet experiment often, you may need to make the case for why you should.
+The best way, when you are working on a project, is to ask your team "what does success look like for
+this project?" and "How would we measure that success?" One of two things will happen: either
+they'll give an answer that is not statistically rigorous, like looking at the metrics before and
+after, or they will say some variation of "We don't know". Once your team realizes that A/B testing
+is a controlled way to determine causal impact, they'll wonder how they ever built products without it.
+
+The next pushback you may get is that A/B testing is too hard, or that it will slow down development.
+This is where you can make the case for GrowthBook. GrowthBook is designed to make A/B testing easy,
+and to make it so that you can run experiments without slowing down development. We are warehouse
+native, so we use whatever data you are already tracking, and our SDKs are extremely lightweight
+and developer friendly. The goal at GrowthBook is to make it so easy and cost-efficient to run
+experiments that you'll test far more often.
+
+You can watch a video on making the case for A/B testing here:
+
+_(Video: Making the case for A/B testing)_
+
+
+### Why A/B test?
+
+- **Quantify Impact** You can determine the impact of any product change you make. There is a big
+  difference between "we launched feature X on time" and "we launched feature X on time and it raised
+  revenue by Y".
+- **De-risking** You can de-risk any product change you make with A/B testing. You can test any
+  change you make to your product, and if it doesn't work, you can roll it back. Typically, if new
+  projects are going to fail, they fail in one of three ways: the project has errors, the project has
+  bugs that unexpectedly affect your metrics/business, or the project has no bugs or errors but still
+  negatively affects your business. A/B testing will catch all of these issues, and allows you to roll
+  out to a small set of users to limit the impact of a bad feature.
+- **Limiting investment in bad ideas** As we discussed in the HAMM section, when you focus on
+  building the smallest testable MVP (or MTP) of a product, you can save a lot of time and effort put
+  into a bad idea. You build the MVP and get real users testing it, and if it turns out that you cannot
+  validate the hypothesis behind the idea, then you can move on to other projects and limit the time
+  spent on ideas that don't work or that will have a negative impact on your business.
+- **Learning** If you have a well-designed experiment, you can determine causality. If you limit
+  the number of variables that your test has, you can know exactly which change drove the change in
+  behavior, and apply these learnings to future projects.
+
+### Why A/B testing programs fail
+
+- **Lack of buy-in** If you don't have buy-in from the top, it can be hard to get the resources you
+  need to run a successful experimentation program. You'll need to make the case for why you should
+  experiment, and why you need the resources to do so.
+- **High cost** Many experimentation systems, especially legacy ones, can be expensive to run or maintain. When the costs are high, you can end up running fewer experiments, and with fewer experiments, the impact is lower. Eventually, a program in this state can atrophy and die.
+- **Cognitive Dissonance** As you're often getting counter-intuitive results with A/B testing, team members can start to question the platform itself, and may prefer to listen to their gut over the data. This is why building trust in your platform is so important.
+- **No visibility into the program's impact** Without some measure of the impact of your experimentation program, it can be hard to justify the expense of running it. You'll want to make sure you have a way to measure the impact of your experimentation program.
+
+## Measuring Experiment Program Success
+
+Once you have established an experimentation program, teams often look for a way to measure the success
+of that program. There are a few ways to measure the success of your experimentation
+program, such as universal holdouts, win rate, experimentation frequency, and learning rate. Each of
+these has its own advantages and disadvantages.
+
+### Universal Holdouts
+
+A universal holdout is a method for keeping a certain percentage of your total traffic from seeing any
+new features or experiments. Users in a universal holdout will continue to get the control version of
+every test for an extended period of time, even after an experiment has been called, and those users are
+then compared to users who are getting all the new features and changes.
+This effectively gives you a cohort of users that are getting your product as it was, say, six months
+ago, which you can compare against users who are getting all the work you've done since. This is the
+gold standard for determining the cumulative impact of every change and experiment; however, it has a
+number of issues.
+
+To make universal holdouts work, you need to keep the code that delivers the old versions running and
+working on your app. This is often very hard to do. Some changes can have a non-zero maintenance
+cost, block larger migrations, or limit other features until the holdout ends. Also, any bugs that arise
+that only affect one side of the holdout (either the control or the variations) can bias the results.
+Finally, due to the typically smaller size of the universal holdout group, it can take longer for these
+holdout experiments to reach statistical significance, unless you have a lot of traffic.
+
+Given the complexity of running universal holdouts, many companies and teams look for other proxy
+metrics or KPIs to use for measuring experimentation program success.
+
+### Win Rate
+
+It can be very tempting to measure the experimentation win rate, the number of A/B tests
+that win over the total number of tests, and optimize your program for the highest win rate possible.
+However, using this as the KPI for your experimentation program will encourage your team to avoid
+high-risk experiments, creating a perverse incentive against potentially more impactful work (see
+Goodhart's Law). Win rate can also hide the benefits of not launching a "losing" test, which is also a
+"win".
+
+### Experimentation Frequency
+
+A more useful measure than win rate is optimizing for the number of experiments that are run. This
+encourages your team to run a lot of tests, which increases the chances of any one test producing
+meaningful results. It may, however, encourage you to run smaller experiments over larger ones, which
+may not be optimal for producing the best outcomes.
+
+### Learning Rate
+
+Some teams try to optimize for a "learning rate", which is the rate at which you learn something about
+your product or users through A/B testing. This does not have the frequency or win rate biases, but it
+is also nebulously defined. How do you define learning? Are there different qualities of what you learn?
+
+### KPI Effect
+
+If you can pick a few KPIs for your experimentation program, you should be able to see the effects
+of the experiments you run against them. You may not be able to see causality precisely, due to the
+natural variability in the data and the typically small improvements from any single A/B test, but by
+aligning the graph of this metric with the experiments that were run, you may start to see cumulative
+effects. This is what GrowthBook shows with our North Star metric feature.
+
+## Prioritization
+
+Given the typical success rates of experiments, all prioritization frameworks should be taken with a
+grain of salt. Our preference at GrowthBook is to add as little process as possible and to maximize for a
+good mix of iterative and innovative ideas.
+
+### Iteration vs Innovation
+
+It is useful to think of experiment ideas on a graph with one axis being the effort required and the
+other the potential impact. If you divide each axis into high and low, you'll end up with the following
+quadrant.
+
+|                 | Low impact | High impact |
+| --------------- | ---------- | ----------- |
+| **High effort** | Danger     | Prioritize  |
+| **Low effort**  | Prioritize | Run now     |
+
+The low effort, high impact ideas you should run immediately; similarly, the high effort, low
+impact ideas you may not want to run at all. But this leaves the other two: low effort but low impact
+ideas (smaller tests), and high effort, high impact ideas (big bets). If you over-index on smaller test
+ideas, you can increase your experimentation frequency, but risk not getting larger gains. If you
+over-index on bigger bets, you decrease your experimentation frequency in the hope of larger returns, at
+the risk of not achieving the smaller wins, which can stack up. You can also consider the smaller tests
+as being "iterative" and the bigger bets as "innovative".
+
+Finding a good mix of small, iterative tests and bigger bets/innovative tests is the best strategy. What
+constitutes "good" is up to the team. Some companies will bucket their ideas into these two groups,
+and then ensure that they are pulling some percentage of ideas from both lists. A healthy mix of large
+and small ideas is important to a successful experimentation program.
+
+### Prioritization frameworks
+
+In the world of A/B testing, figuring out what to test can be particularly challenging. Prioritization
+often requires a degree of gut instinct, which is often incorrect (see success rates). To solve this,
+some recommend prioritization frameworks, such as ICE and PIE.
+
+:::note
+Please keep in mind that while these frameworks may be helpful, they can work to give the
+appearance of objectivity to subjective opinions.
+:::
+
+#### ICE
+
+The ICE prioritization framework is a simple and popular method for prioritizing A/B testing ideas based
+on their potential impact, confidence, and ease of implementation. Each idea is evaluated on each of
+these factors and scored on a scale of 1 to 10, and the scores are then averaged to determine the overall
+score for that idea. Here's a brief explanation of the factors:
+
+- **Impact**: This measures the potential impact of the testing idea on the key metrics or goals of the
+  business. The impact score should reflect the expected magnitude of the effect, as well as the
+  relevance of the metric to the business objectives.
+- **Confidence**: This measures the level of confidence that the testing idea will have the expected impact.
+  The confidence score should reflect the quality and quantity of the available evidence, as well as any
+  potential risks or uncertainties.
+- **Ease**: This measures the ease or difficulty of implementing the testing idea. The ease score should
+  reflect the expected effort, time, and resources required to implement the idea.
+
+To calculate the ICE score for each testing idea, simply add up the scores for Impact, Confidence, and
+Ease, and divide by 3:
+
+> ICE score = (Impact + Confidence + Ease) / 3
+
+Once all testing ideas have been scored using the ICE framework, they can be ranked in descending
+order based on their ICE score. The highest-ranked ideas are typically considered the most promising
+and prioritized for implementation.
+
+#### PIE
+
+Like the ICE framework, the PIE framework is a method for prioritizing A/B testing ideas based on their
+potential impact, importance to the business, and ease of implementation. Each factor is scored on a
+10-point scale.
+
+- **Potential**: This measures the potential impact of the testing idea on the key metrics or goals of the
+  business. The potential score should reflect the expected magnitude of the effect, as well as the
+  relevance of the metric to the business objectives.
+- **Importance**: This measures the importance of the testing idea to the business. The importance score
+  should reflect the degree to which the testing idea aligns with the business goals and objectives, and
+  how critical the metric is to achieving those goals.
+- **Ease**: This measures the ease or difficulty of implementing the testing idea. The ease score should
+  reflect the expected effort, time, and resources required to implement the idea.
+
+To calculate the PIE score for each testing idea, simply multiply the scores for Potential, Importance,
+and Ease together:
+
+> PIE score = Potential x Importance x Ease
+
+Once all testing ideas have been scored using the PIE framework, they can be ranked in descending
+order based on their PIE score. The highest-ranked ideas are typically considered the most promising
+and prioritized for implementation.
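+As a small illustration of the arithmetic (a sketch only; GrowthBook does not ship these scoring helpers,
+and the idea names below are placeholders), here is how you could compute and rank both scores for a
+backlog of candidate ideas:
+
+```ts
+// Scores are on the 1-10 scale described above.
+type IceInputs = { impact: number; confidence: number; ease: number };
+type PieInputs = { potential: number; importance: number; ease: number };
+
+// ICE: average of the three factors
+const iceScore = ({ impact, confidence, ease }: IceInputs): number =>
+  (impact + confidence + ease) / 3;
+
+// PIE: product of the three factors
+const pieScore = ({ potential, importance, ease }: PieInputs): number =>
+  potential * importance * ease;
+
+// Example: rank a backlog of ideas by ICE score, highest first
+const backlog = [
+  { name: "Simplified checkout flow", impact: 8, confidence: 5, ease: 4 },
+  { name: "New user registration modal", impact: 6, confidence: 7, ease: 9 },
+];
+
+backlog
+  .map((idea) => ({ ...idea, score: iceScore(idea) }))
+  .sort((a, b) => b.score - a.score)
+  .forEach((idea) => console.log(`${idea.score.toFixed(1)}  ${idea.name}`));
+```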
+
+### Bias in prioritization
+
+Regardless of what prioritization method you choose, it's quite common to develop a bias for a
+particular type of idea within a team. Make sure you're open to ideas
+that may not fit your preconceived notions of what will work (see the [Semmelweis Effect](/using/experimentation-problems#semmelweis-effect)).
+Be mindful of whether you're saying "no" to an idea based on data or based on opinion. The goal, in the
+end, is to improve your business by producing the best product.
+
+## Experimentation Culture
+
+Adopting experimentation as a key part of being a more data-driven organization has numerous
+benefits to culture, specifically in the areas of alignment, speed, humility, and collaboration.
+
+### Alignment
+
+Adopting a north star metric or KPI that drives your business success removes a lot of ambiguity
+about projects, because you have clear success metrics. By making sure you have defined success
+metrics at the start of your planning cycle, you achieve alignment around your goals. This helps
+prevent the inevitable scope creep and pet features from inserting themselves, or at least gives you
+a framework to say "yes, but not now." Knowing what success means also allows developers to start
+integrating the tracking needed to know if the project is successful from the beginning, something
+that is often forgotten or done only as an afterthought.
+
+### Speed
+
+When adopting an experimentation mindset, the default answer to a difference of opinion becomes
+"let's test it" instead of long, drawn-out, ego-bruising meetings. This helps reduce personal opinions
+or bias affecting decisions. Quite often, decisions in companies without this mindset are made by
+whoever is the loudest, or by the HiPPO (Highest Paid Person's Opinion). By focusing on which metrics
+define success, and defaulting to running an experiment, you can remove the ego from the decision
+process and move quickly.
+
+Experimentation can also help increase your product velocity by minimizing the time it takes to
+determine if your new product or feature has product-market fit. Most big ideas can be broken down
+into a small set of assumptions that, if true, would mean your larger idea may be successful. If you can
+prove or disprove these assumptions, you can move more quickly and not waste time on failing ideas (loss
+avoidance).
+
+### Intellectual humility
+
+A/B testing shows us that, in most cases, people are bad at predicting user behavior. When you
+realize that your opinions may not be correct, you can start to channel your inner Semmelweis and be
+open to new ideas that challenge deeply entrenched norms or beliefs.
+Having an open mind and intellectual humility about new ideas can make your workplace a more
+collaborative environment and produce better products.
+
+### Team collaboration
+
+When you are open to new ideas, you can remove the silos that prevent teams from collaborating well.
+The goal is to produce the best product as measured by a specific set of metrics. With this alignment,
+and the openness to new ideas, you can dramatically increase collaboration, as good ideas can come from
+anywhere.
+
+## Driving Experimentation Culture
+
+Developing a culture of experimentation can be hard, especially in a company where it has never existed.
+It requires a lot of buy-in from the top down, and/or a lot of evangelism from the bottom up.
+
+### Top down
+
+This is often the easiest way to drive experimentation culture. If the CEO, CTO, or CPO says that they
+want more experimentation, they can make it happen. In these situations, picking the right platform and
+educating your team becomes the hardest part. You'll want to pick a platform that the developers like
+to use, that doesn't add unnecessary effort per experiment, and that brings the incremental cost per
+experiment close to zero. These are some of the reasons we built GrowthBook. If you do decide on
+GrowthBook, we can also help with educating your team.
+
+### Bottom up
+
+If you don't have buy-in from the top, you can still drive experimentation culture from the bottom up.
+Typically this starts with one team that wants to start experimenting. They may start with a simple test.
+Experimentation like this can be contagious, and other teams may start to see the benefits of running
+experiments. It's important with this approach to make sure that you are sharing your results, both
+good and bad, and that you are evangelizing the benefits of experimentation.
+
+### Sharing
+
+One great way to get fresh ideas and to help build experimentation culture is to share your experiment
+ideas and results. Our preferred way to present results is with an experiment review meeting. The
+premise behind these meetings is to talk about the experiment without revealing the results, and to have
+people guess the outcome. Specifically, you talk about the hypothesis and the observations behind what
+and why you are testing, then talk about the metrics you're testing against, and then show screenshots
+of the variations (if applicable). You can have people vote simply by raising their hand. Once you've had
+people guess, you reveal the actual results. This is a great way to help build intellectual humility and
+also collect new ideas.
+
+GrowthBook has built experiment review meetings directly into the platform. You can create a presentation
+from the Management section of the left navigation. You can then share the presentation with your team,
+and they can vote on the results.
+
+## Organizational Structures
+
+As you start to scale your experimentation program, you'll want to think about how you want to organize
+your teams to ensure high frequency and high quality. There are a number of different ways to organize
+your teams, and we'll go through some of the most common ones we've seen.
+
+### Isolated Teams
+
+When companies first start experimenting with experimenting, they often start with isolated teams. This
+can even be one individual on a team.
+One of the problems with this approach is that, as an individual, it is hard to continually have good
+ideas to test, and you may suffer from idea bias, where your experiences and expertise limit the number
+and type of ideas you test. Another issue is that successes and failures are not shared. As is typical
+of experiment programs, if you present ideas that are failing at a 60%+ rate, people may think that
+the team is doing something wrong.
+
+These isolated teams can be critical in helping grow awareness of experimentation-driven development.
+However, the isolated team does not scale well, and running experiments at the frequency needed to see
+large impacts will be hard. If the team and leadership like the results, you'll want to expand to one of
+the other structures.
+
+### Decentralized Teams
+
+As awareness grows of how easy experimentation is and of the insights it provides, more teams may start
+experimenting. This is great and increases the frequency of experiments that you can run. Each team
+is empowered to design and start their own experiments; this is sometimes referred to as experimentation
+democratization.
+
+There can be some downsides with this approach. It can end up like the Wild West, where best practices,
+data, metrics, and tooling may not be shared from team to team. This can make it hard for teams to
+ensure consistent quality and trustworthiness of the results.
+
+### Center of Excellence
+
+To compensate for the problems of decentralized experimentation programs, many companies will switch
+to a center-of-excellence approach. With this structure, a central experimentation team ensures that
+experiments follow best practices, have a testable hypothesis, and have selected the right metrics
+before launching. This team can also ensure that the data looks right as it comes in and that the
+results are interpreted correctly.
+
+One of the issues with the center-of-excellence approach is that it can easily become a bottleneck of
+excellence and limit the number of experiments that are run.
+
+### Hybrid
+
+Combining the best of the decentralized-teams and center-of-excellence approaches is one of the best ways
+we've seen to run experimentation programs. The hybrid approach involves an experimentation team that
+oversees the experiments that are run but doesn't directly gatekeep the launching of experiments.
+In this role, the experimentation team serves as advisors to the teams that are running experiments,
+helps them improve the quality of experiments, and can also help look into any issues that appear.
+They can also ensure that the platform, metrics, and data are following their standards. This approach
+aims to have the experimentation team help educate the product teams on best practices and common
+pitfalls with running experiments.
diff --git a/docs/docs/using/security.mdx b/docs/docs/using/security.mdx
new file mode 100644
index 00000000000..9899619497b
--- /dev/null
+++ b/docs/docs/using/security.mdx
@@ -0,0 +1,68 @@
+# Security
+
+GrowthBook is built with security in mind, and we have made architectural decisions to ensure that your
+GrowthBook instance can be as secure as possible.
+
+## Data Security
+
+GrowthBook only stores account information (email and name) for your users who have GrowthBook accounts.
+For feature flagging, evaluations typically happen within the SDK, so none of your users' information
+is sent to GrowthBook. For experiment reporting, we connect to your data warehouse to pull the
+assignment/exposure and metric information.
+This data remains in your data warehouse; GrowthBook only stores the aggregate results, such as the total
+number of users exposed to each variation, and other statistical information. No PII is stored in or
+transferred to GrowthBook with the experimentation reports.
+
+There are some ways in which personal information may be stored or exposed with GrowthBook. If you are
+using GrowthBook on the client side of your application, the rules about how each feature will be exposed
+to your users are publicly accessible by inspecting network requests. Normally these rules contain no
+personal information, but if you are targeting a specific user or set of users, then this information may
+be visible to malicious users. If you are using GrowthBook on the server side, this information is not
+exposed to the client. For these reasons, targeting based on PII when using GrowthBook client side is not
+recommended.
+
+If you have to target based on PII on the client side, GrowthBook has some ways to make this more secure.
+You can use hashed attributes, where the values are hashed before sending, or you can use encrypted
+attributes, where the payload is encrypted before being sent to the client. Keep in mind that encrypted
+SDK endpoints will have the payload decrypted before use, which means that if you are using them client
+side, it is still possible for a malicious actor to see the decrypted payload. You can enable encryption
+when setting up the SDK, and you can mark an attribute as 'hashed' when creating the attribute.
+
+Finally, to avoid these issues, you can also use 'remote evaluation' to evaluate
+flags based on sensitive information without exposing it, even on the client side. With remote evaluation,
+the SDK will send the user attributes to your server (or ours), those attributes are matched against the
+rules, and the state of the feature is returned, without revealing the targeting rules to the client. The
+downside of this approach is that each user requires a network request to evaluate the features, which
+can slow down your application.
+
+## Data Access
+
+GrowthBook requires read-only access to your data warehouse. This connection information is kept
+encrypted. To help protect access to your data warehouse through the GrowthBook UI, you can use
+permissions. You can control which users can edit the connection info, and which can edit the SQL for the
+assignment or metric queries.
+
+Data sources can also be scoped to projects. This means that only users who have access to the project
+and have the right permission levels can edit those queries. If you require more separation of your data,
+you can create separate cloud organizations, or run GrowthBook in multi-tenant mode when you self-host
+(available as part of GrowthBook Enterprise). With this, you can have separate GrowthBook organizations
+running from one central GrowthBook instance. Each organization will have its own data sources, metrics,
+and users. There is a super-admin account type that can manage users across organizations.
+
+## Infrastructure Security
+
+GrowthBook Cloud is hosted on AWS, in a multi-tenant environment. We use industry-standard security
+practices to ensure that your data is secure. If you require additional security, we also offer
+self-hosted options. When you self-host, no data leaves your infrastructure. We do have anonymous usage
+tracking enabled by default, but this can be disabled (see [self-hosted](/self-host)).
+
+## Self-hosted deployments
+
+GrowthBook can be self-hosted on your own infrastructure.
If self-hosting, we recommend that you keep GrowthBook behind a +firewall, and accessible via a VPN. See [self-hosting](/self-host) for more information. GrowthBook should +also be regularly updated to ensure that you are running the latest version with the latest security patches. +GrowthBook updates are backwards compatible, and can be easily applied with a single command. See +[updating GrowthBook](/self-host#updating-to-latest) + +Before deploying GrowthBook in production, we recommend that you make sure you've configured GrowthBook correctly: + +- Change the `JWT_SECRET` environment variable. This is used to sign the JWT tokens used for authentication, and needs to be changed from the default. +- Change the `ENCRYPTION_KEY` environment variable. This is used to encrypt sensitive data, and should be set to a long random string. +- Set the `NODE_ENV` environment variable to `production`. This will enable add additional optimizations and disable some debugging features. diff --git a/docs/sidebars.js b/docs/sidebars.js index d1335b5cab6..acecb8eb3d9 100644 --- a/docs/sidebars.js +++ b/docs/sidebars.js @@ -352,12 +352,24 @@ const sidebars = { }, items: ["self-host/environment-variables", "self-host/config-yml"], }, + /* + { + type: "category", + label: "Advanced", + collapsed: true, + items: [ + { type: "doc", id: "api-overview", label: "API" }, + { type: "doc", id: "self-host/proxy", label: "Proxy" }, + { type: "doc", id: "webhooks", label: "Webhooks" }, + ], + }, + */ { type: "doc", id: "self-host/proxy", label: "Proxy" }, { type: "doc", id: "api-overview", label: "API" }, { type: "doc", id: "webhooks", label: "Webhooks" }, { type: "category", - label: "How to Guides", + label: "Installation Tutorials", collapsed: true, link: { type: "doc", @@ -382,13 +394,20 @@ const sidebars = { }, { type: "doc", - id: "guide/rudderstack-and-nextjs-with-growthbook", - label: "Rudderstack + Next.js", + id: "integrations/shopify", + label: "Shopify + GrowthBook", + className: "pill-new", }, { - type: "link", - href: "https://docs.growthbook.io/open-guide-to-ab-testing.v1.0.pdf", - label: "Guide to A/B Testing", + type: "doc", + id: "integrations/webflow", + label: "Webflow + GrowthBook", + className: "pill-new", + }, + { + type: "doc", + id: "guide/rudderstack-and-nextjs-with-growthbook", + label: "Rudderstack + Next.js", }, { type: "doc", @@ -444,18 +463,6 @@ const sidebars = { label: "SCIM", className: "pill-new", }, - { - type: "doc", - id: "integrations/shopify", - label: "Shopify + GrowthBook", - className: "pill-new", - }, - { - type: "doc", - id: "integrations/webflow", - label: "Webflow + GrowthBook", - className: "pill-new", - }, ], }, { type: "doc", id: "faq", label: "FAQ" }, @@ -471,6 +478,58 @@ const sidebars = { }, ], }, + { + type: "category", + label: "GrowthBook Guide", + collapsed: true, + className: "top-divider", + link: { + type: "doc", + id: "using/index", + }, + items: [ + { + type: "doc", + id: "using/fundamentals", + label: "Fundamentals of AB testing", + }, + { + type: "doc", + id: "using/experimentation-best-practices", + label: "Best Practices", + }, + { + type: "doc", + id: "using/experimentation-problems", + label: "Common Problems", + }, + { + type: "doc", + id: "using/product-development", + label: "Experimentation in Product Development", + }, + { + type: "doc", + id: "using/experimenting", + label: "Experimenting in GrowthBook", + }, + { + type: "doc", + id: "using/growthbook-best-practices", + label: "Organizing GrowthBook", + }, + { + type: "doc", + id: 
"using/security", + label: "Securing GrowthBook", + }, + { + type: "doc", + id: "using/programs", + label: "Experimentation Programs", + }, + ], + }, ], }; diff --git a/docs/src/styles/components/_menus.scss b/docs/src/styles/components/_menus.scss index f28e0066ef1..029a95b7966 100644 --- a/docs/src/styles/components/_menus.scss +++ b/docs/src/styles/components/_menus.scss @@ -83,14 +83,19 @@ html[data-theme="dark"] { } } - &.pill-new { - .menu__link { - &::after { - content: "new"; - @include pill(#13a100, #fff); - } + &.pill-new > .menu__link, + &.pill-new > div > .menu__link { + &::after { + content: "new"; + @include pill(#13a100, #fff); } } + + &.top-divider { + border-top: 1px solid #cccccc80; + margin-top: 5px; + padding-top: 5px; + } } } diff --git a/docs/static/images/using/ab-test-diagram.png b/docs/static/images/using/ab-test-diagram.png new file mode 100644 index 00000000000..b1d176d625b Binary files /dev/null and b/docs/static/images/using/ab-test-diagram.png differ diff --git a/docs/static/images/using/ad-hoc-menu.png b/docs/static/images/using/ad-hoc-menu.png new file mode 100644 index 00000000000..7ec4ba673ac Binary files /dev/null and b/docs/static/images/using/ad-hoc-menu.png differ diff --git a/docs/static/images/using/ad-hoc-report-config.png b/docs/static/images/using/ad-hoc-report-config.png new file mode 100644 index 00000000000..5383bf8974d Binary files /dev/null and b/docs/static/images/using/ad-hoc-report-config.png differ diff --git a/docs/static/images/using/ad-hoc-report-saving.png b/docs/static/images/using/ad-hoc-report-saving.png new file mode 100644 index 00000000000..ecf04d088e3 Binary files /dev/null and b/docs/static/images/using/ad-hoc-report-saving.png differ diff --git a/docs/static/images/using/attribution-window-diagram.png b/docs/static/images/using/attribution-window-diagram.png new file mode 100644 index 00000000000..2e6289c095f Binary files /dev/null and b/docs/static/images/using/attribution-window-diagram.png differ diff --git a/docs/static/images/using/attribution-window-diagram2.png b/docs/static/images/using/attribution-window-diagram2.png new file mode 100644 index 00000000000..c743dd9368a Binary files /dev/null and b/docs/static/images/using/attribution-window-diagram2.png differ diff --git a/docs/static/images/using/dimension-selector.png b/docs/static/images/using/dimension-selector.png new file mode 100644 index 00000000000..8a013b9ad50 Binary files /dev/null and b/docs/static/images/using/dimension-selector.png differ diff --git a/docs/static/images/using/environments-page.png b/docs/static/images/using/environments-page.png new file mode 100644 index 00000000000..42bf9b02771 Binary files /dev/null and b/docs/static/images/using/environments-page.png differ diff --git a/docs/static/images/using/experiment-results-bayesian-details.png b/docs/static/images/using/experiment-results-bayesian-details.png new file mode 100644 index 00000000000..24a2d0e6c12 Binary files /dev/null and b/docs/static/images/using/experiment-results-bayesian-details.png differ diff --git a/docs/static/images/using/experiment-results-bayesian-details2.png b/docs/static/images/using/experiment-results-bayesian-details2.png new file mode 100644 index 00000000000..eca2b06ed0b Binary files /dev/null and b/docs/static/images/using/experiment-results-bayesian-details2.png differ diff --git a/docs/static/images/using/experiment-results-bayesian.png b/docs/static/images/using/experiment-results-bayesian.png new file mode 100644 index 00000000000..0f2c6c3494a Binary 
files /dev/null and b/docs/static/images/using/experiment-results-bayesian.png differ diff --git a/docs/static/images/using/experiment-results-frequentist.png b/docs/static/images/using/experiment-results-frequentist.png new file mode 100644 index 00000000000..ee81df509ac Binary files /dev/null and b/docs/static/images/using/experiment-results-frequentist.png differ diff --git a/docs/static/images/using/experiments-filtered-by-tag.png b/docs/static/images/using/experiments-filtered-by-tag.png new file mode 100644 index 00000000000..0aa0a97cecb Binary files /dev/null and b/docs/static/images/using/experiments-filtered-by-tag.png differ diff --git a/docs/static/images/using/guardrail-metrics.png b/docs/static/images/using/guardrail-metrics.png new file mode 100644 index 00000000000..8712eb42bc1 Binary files /dev/null and b/docs/static/images/using/guardrail-metrics.png differ diff --git a/docs/static/images/using/health-page.png b/docs/static/images/using/health-page.png new file mode 100644 index 00000000000..2ae9ca8a50a Binary files /dev/null and b/docs/static/images/using/health-page.png differ diff --git a/docs/static/images/using/jon_hamm-cropped.png b/docs/static/images/using/jon_hamm-cropped.png new file mode 100644 index 00000000000..8d584f0b65d Binary files /dev/null and b/docs/static/images/using/jon_hamm-cropped.png differ diff --git a/docs/static/images/using/metrics-modal.png b/docs/static/images/using/metrics-modal.png new file mode 100644 index 00000000000..1cd99b83281 Binary files /dev/null and b/docs/static/images/using/metrics-modal.png differ diff --git a/docs/static/images/using/projects-page.png b/docs/static/images/using/projects-page.png new file mode 100644 index 00000000000..f01f6d8bec8 Binary files /dev/null and b/docs/static/images/using/projects-page.png differ diff --git a/docs/static/images/using/save-ad-hoc-report.png b/docs/static/images/using/save-ad-hoc-report.png new file mode 100644 index 00000000000..06c9dbf1957 Binary files /dev/null and b/docs/static/images/using/save-ad-hoc-report.png differ diff --git a/docs/static/images/using/segments-page.png b/docs/static/images/using/segments-page.png new file mode 100644 index 00000000000..9c87a6cdec1 Binary files /dev/null and b/docs/static/images/using/segments-page.png differ diff --git a/docs/static/images/using/srm-check-health-page.png b/docs/static/images/using/srm-check-health-page.png new file mode 100644 index 00000000000..bbbcdd8b721 Binary files /dev/null and b/docs/static/images/using/srm-check-health-page.png differ diff --git a/docs/static/images/using/tags-page.png b/docs/static/images/using/tags-page.png new file mode 100644 index 00000000000..334847812bf Binary files /dev/null and b/docs/static/images/using/tags-page.png differ