# Limitations of Experiments (and Average Treatment Effects)

We've now discussed at length all the magical things we get from randomized experiments. But let's take a moment to also discuss some of the limitations -- both practical and conceptual -- of experiments and the "average treatment effect" framework. 

## Who's Average?

Of all the potential problems with ATE, perhaps the biggest is that *it's just an average*. 

If everyone in our data has the same response to treatment (in the lingo, if we have "homogeneous treatment effects"), then this isn't a problem -- estimating the average treatment effect amounts to estimating every individual's treatment effect. 
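In symbols, and assuming the usual potential outcomes notation (the symbols $Y_i(1)$, $Y_i(0)$, and $\tau_i$ are my shorthand here and may differ from the notation used elsewhere in these notes), the point can be sketched as:

$$
\tau_i = Y_i(1) - Y_i(0), \qquad ATE = \frac{1}{N}\sum_{i=1}^{N} \tau_i
$$

If $\tau_i = \tau$ for every individual $i$, then $ATE = \tau$, and the average really does describe everyone. If the $\tau_i$ differ, many different collections of individual effects share the same average.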

In the real world, however, there are almost always heterogeneous (varying) treatment effects across groups and individuals. 

Consider the following example: In 2018, the FDA approved a new drug for treating chronic migraines (Aimovig) that is being hailed by some as a "game changer" in migraine treatment. As is required for drug approval in the US, the pharmaceutical companies developing Aimovig had to run clinical trials in which a random sample of people with chronic migraines was given Aimovig (treatment) and a random sample was not (control). And if you see an ad for Aimovig, you'll probably see the following result from those trials:

![migraine_average_effect](images/migraine_average_effect.png)

Cool! Set aside the fact that the companies selling Aimovig are pushing the reduction in migraine count from before the trial to after, and hiding the actual difference between control and treatment in the fine print. (*Any* medical intervention tends to reduce symptoms, so you really do have to compare outcomes between treatment and control groups.) The ATE of the drug appears to be about 2-3 fewer migraine days a month (a reduction of 6-7 in treatment minus 4 in control) for people who have 15+ headaches and >8 migraines a month. 

That's good -- chronic migraines can be a crippling disability, and any improvements are exciting -- but you'd be excused for asking why people are so excited about what seems like a relatively small reduction.

The answer is that the treatment effect of Aimovig is *extremely* heterogeneous. *Most* people who take Aimovig see little to no benefit, but *some* (depending on your criteria, something like 40%) see their migraine frequency fall by 50% or more.  

And herein lies the problem of ATE: it doesn't tell us about the *distribution* of effects. 
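To see how much an average can hide, here is a minimal simulation sketch. The numbers are made up for illustration (they are *not* the Aimovig trial data), and it relies on something we never get in real life: direct knowledge of every individual's treatment effect. It builds two hypothetical worlds with roughly the same ATE but very different distributions of effects.

```python
import random
import statistics

random.seed(0)
n = 10_000

# Hypothetical world A: everyone improves by 2.5 migraine days a month.
homogeneous = [2.5] * n

# Hypothetical world B: about 40% of people improve by 6.25 days,
# and everyone else sees no benefit at all.
heterogeneous = [6.25 if random.random() < 0.4 else 0.0 for _ in range(n)]

ate_a = statistics.mean(homogeneous)
ate_b = statistics.mean(heterogeneous)
share_no_benefit = sum(e == 0.0 for e in heterogeneous) / n

print(f"ATE, homogeneous world:   {ate_a:.2f}")
print(f"ATE, heterogeneous world: {ate_b:.2f}")
print(f"Share with no benefit (heterogeneous world): {share_no_benefit:.0%}")
```

Both worlds report an ATE of about 2.5, but in the second one most people get nothing and a minority gets a lot. The ATE alone cannot tell these worlds apart.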

To help understand heterogeneous effects, it is common in analyzing experiments to look for differences in outcomes among sub-populations. For example, we might split our sample into men and women, and see if the treatment effect among men is different from the treatment effect among women.

This can be especially important in interventions that may have disproportionate impacts. A sales tax, for example, may have a low average effect on the amount of money households have to spend on their children's education, but among low-income households, that effect may be very large. And here, again, is where values come into data science -- if you *just* present someone with an average treatment effect, they will generally interpret it as "the" treatment effect, so it's up to you to ensure that decision makers are aware of not just the *average* effect of an action, but also the distribution of consequences. 

(On a technical note: splitting your sample also reduces the sample size in each bucket, so it reduces your statistical power. That means that you can generally only do it for proportionately large groups in your data (unless you're working with massive datasets).)
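The power cost of splitting a sample can be sketched with a small simulation. Everything here is hypothetical: the 10% subgroup, the effect sizes, and the helper names `diff_in_means` and `estimate` are just for illustration. The code estimates a difference in means and its standard error for the full sample and for the small subgroup.

```python
import random
import statistics

def diff_in_means(treated, control):
    """Return (estimate, standard error) for a difference in means."""
    est = statistics.mean(treated) - statistics.mean(control)
    se = (statistics.variance(treated) / len(treated)
          + statistics.variance(control) / len(control)) ** 0.5
    return est, se

random.seed(1)

# Simulated experiment: each row is (in_small_group, treated, outcome).
rows = []
for _ in range(4000):
    small_group = random.random() < 0.10    # hypothetical 10% subgroup
    treated = random.random() < 0.5         # coin-flip assignment
    effect = 3.0 if small_group else 0.5    # assumed true effects
    outcome = random.gauss(0, 5) + (effect if treated else 0.0)
    rows.append((small_group, treated, outcome))

def estimate(subset):
    """Difference in means between treated and control rows in subset."""
    t = [y for _, d, y in subset if d]
    c = [y for _, d, y in subset if not d]
    return diff_in_means(t, c)

full_est, full_se = estimate(rows)
sub_est, sub_se = estimate([r for r in rows if r[0]])

print(f"Full sample: estimate {full_est:.2f}, SE {full_se:.2f}")
print(f"Subgroup:    estimate {sub_est:.2f}, SE {sub_se:.2f}")
```

The subgroup's standard error is several times larger than the full sample's, simply because it is computed from about a tenth of the data. That is why subgroup estimates are noisy unless the subgroup (or the dataset) is large.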

## The Fine Print of ATE

In addition to these conceptual issues, there are also a handful of technical issues to be aware of when calculating treatment effects. 

### SUTVA

Implicit in our discussion of the potential outcomes framework and definition of ATE is the idea that when we assign one unit to treatment or control, it has no impact on the outcomes of other units. This is known as the Stable Unit Treatment Value Assumption, or SUTVA. (If you're a math person: in the potential outcome derivations we've done, this is embodied by the way we made our outcomes additively separable.)

The cleanest example of where this holds is in something like a medical trial where we give people cholesterol medicine -- me getting cholesterol medicine does not affect the health of someone in the control group. There are no "spillovers" to my treatment assignment / people in the control group have a "stable" treatment assignment.  

By contrast, if we were doing a medical trial of vaccines, then my assignment to treatment (getting a vaccine) *might* have an impact on the health of people in the control group (if we live or work in the same place) because the reduced likelihood of me getting sick also makes them less likely to get sick. As a result, even a perfect randomized experiment will not allow you to estimate ATE of vaccines in this situation because you aren't *really* comparing treated individuals to control individuals, you're comparing treated individuals to kinda treated individuals. 
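Here is a minimal sketch of that vaccine problem, under entirely made-up assumptions: the risk numbers, the `SPILLOVER` parameter, and the `infection_risk` function are hypothetical, chosen only to show the mechanism. Unvaccinated people become less likely to get sick as the share of vaccinated people around them rises.

```python
import random

random.seed(2)

BASE_RISK = 0.30        # assumed infection risk when nobody is vaccinated
VACCINATED_RISK = 0.05  # assumed infection risk for a vaccinated person
SPILLOVER = 0.5         # assumed: unvaccinated risk falls by up to 50% as coverage rises

def infection_risk(vaccinated, share_vaccinated):
    """Infection probability; unvaccinated people benefit from others' vaccines."""
    if vaccinated:
        return VACCINATED_RISK
    return BASE_RISK * (1 - SPILLOVER * share_vaccinated)

n = 100_000
assignment = [random.random() < 0.5 for _ in range(n)]  # perfect randomization
n_treated = sum(assignment)
share = n_treated / n

infected = [random.random() < infection_risk(d, share) for d in assignment]

treated_rate = sum(y for y, d in zip(infected, assignment) if d) / n_treated
control_rate = sum(y for y, d in zip(infected, assignment) if not d) / (n - n_treated)

# What the experiment reports: treated vs. "kinda treated" controls.
experiment_estimate = treated_rate - control_rate
# Vaccinated risk vs. risk in a world where nobody is vaccinated.
no_spillover_contrast = VACCINATED_RISK - BASE_RISK

print(f"Difference in means from the experiment: {experiment_estimate:.3f}")
print(f"Vaccinated vs. nobody-vaccinated world:  {no_spillover_contrast:.3f}")
```

Even though assignment is perfectly random, the experiment understates how much protection the vaccine offers relative to a world with no vaccine, because the control group is already partly protected by the treated group.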

Where does this matter? 

In industry, it matters on any platform with lots of interactions between users. If you run an AB test on the matches some people see in a dating app, their change in behavior will also change the behavior of users in your control group. Similarly, changing a Facebook user's News Feed will change what they share, resulting in changes to the experience of other users. 

There are ways around this -- for Facebook experiments, you can pick treatment and control individuals who are *very* far apart from one another socially in the hope that changes in treatment behavior won't ever reach your control individuals. But even this is problematic -- if you're testing a new feature, giving it to one person may not accurately reflect what would happen if you gave it to one person *and* all their friends. In those cases, you can randomize at the group level (usually called "cluster randomization"), randomly assigning big groups to control or treatment instead of individuals, while also trying to make sure treatment and control groups are far from one another. 

ATE is best defined when you have a clear unit of analysis that is relatively isolated from other units. This doesn't always mean you need *individuals* to be independent. For example, it is common in development economics to assign *rural villages as a whole* to either treatment or control, since we think that if we assigned some individuals within a village to treatment and some to control, those people would likely interact in ways that violate SUTVA. But since rural villages in developing countries are *relatively* isolated from one another, we think that the treatment assignment of each *village* should be independent of outcomes for other villages. 
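Assigning treatment at the village level can be sketched in a few lines. The village and person identifiers below are hypothetical placeholders; the point is only that the coin is flipped once per village, and every resident inherits their village's assignment.

```python
import random

random.seed(3)

# Hypothetical data: 20 villages with 50 residents each.
villages = [f"village_{i}" for i in range(20)]
residents = [(f"person_{v}_{j}", v) for v in villages for j in range(50)]

# Randomize at the village level: shuffle, then treat the first half.
shuffled = villages[:]
random.shuffle(shuffled)
treated_villages = set(shuffled[: len(shuffled) // 2])

# Every resident inherits their village's assignment.
assignment = {person: (village in treated_villages)
              for person, village in residents}

print(f"Treated villages:  {len(treated_villages)} of {len(villages)}")
print(f"Treated residents: {sum(assignment.values())} of {len(assignment)}")
```

One caveat with this design: because outcomes within a village tend to move together, the effective sample size is closer to the number of villages than to the number of residents, so the analysis needs to account for that grouping.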

### Endogenous stopping

There is often a temptation when running experiments to watch the data roll in as the experiment runs. In AB testing, you may watch because it's easy; in medical studies, you may watch because the trial is expensive and you'd like to stop as soon as you can, or because you want to know if lots of patients start experiencing negative side effects. 

But it turns out that it is critically important to the legitimacy of experiments that you not stop an experiment because the data looks good (or bad).

Ending an experiment because of the intermediate results is what's called "stopping endogenously", and it will render your experiment statistically invalid. The math on this gets very complicated, but the basic idea is that the apparent results of your experiment will fluctuate over time, and the law of large numbers only guarantees that in the long run, your $\widehat{ATE}$ will probably be close to the true $ATE$. Over short periods, the results may show your treatment to be more amazing than it really is, or more terrible than it really is; probability only ensures those moments will be relatively rare. But if you choose to stop an experiment *because* you've hit one of those moments (that should be fleeting), you'll end up with erroneous results. 
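A small simulation can make this concrete. This is a sketch under simplifying assumptions (normally distributed outcomes with known variance 1, a plain z-test, and a hypothetical helper called `run_aa_test`), not a reproduction of any particular study. Both arms are drawn from the same distribution, so any "significant" result is a false positive. We compare checking once at the end with peeking every 50 observations and declaring victory the first time the test crosses the 5% threshold.

```python
import random

random.seed(4)

def run_aa_test(n_per_arm=1000, peek_every=50, z_crit=1.96):
    """Simulate one A/A test.

    Returns (significant_at_any_peek, significant_at_final_look).
    """
    sum_a = sum_b = 0.0
    significant_at_any_peek = False
    z = 0.0
    for i in range(1, n_per_arm + 1):
        sum_a += random.gauss(0, 1)
        sum_b += random.gauss(0, 1)
        if i % peek_every == 0:
            # z-statistic for a difference in means with known variance 1
            z = (sum_a / i - sum_b / i) / (2 / i) ** 0.5
            if abs(z) > z_crit:
                significant_at_any_peek = True
    # n_per_arm is a multiple of peek_every, so z is the final-look statistic
    return significant_at_any_peek, abs(z) > z_crit

results = [run_aa_test() for _ in range(500)]
peeking_rate = sum(any_peek for any_peek, _ in results) / len(results)
fixed_rate = sum(final for _, final in results) / len(results)

print(f"False positive rate, one look at the end:     {fixed_rate:.1%}")
print(f"False positive rate, stop at first 'success': {peeking_rate:.1%}")
```

With a single look at the end, the false positive rate should land near the nominal 5%. If you stop the first time a peek looks significant, it comes out several times higher, even though nothing about the data or the test itself has changed.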

To illustrate this point, [Ramesh Johari, Leo Pekelis, and David Walsh](https://arxiv.org/pdf/1512.04922.pdf) ran an "A/B test" on a large website where the two treatment conditions A and B were exactly the same. They ran this over several days, then plotted -- for each moment in time -- what an analysis of the experiment would say about whether A is better than B if they stopped the experiment at that point. As the figure shows, the analysis generally gets it right that there's no significant difference between A and B; but there are moments where random fluctuations make the difference *look* significant, and so if you chose to stop the experiment because those spikes occurred, you'd be in deep trouble!

![ab_significance_over_time](images/ab_significance_over_time.png)

To be clear, that doesn't mean there *aren't* ways you can stop experiments based on results -- see Johari, Pekelis, and Walsh's paper for ways to do so in a statistically sound way -- but don't do it unless you really understand the statistics (even if your boss really wants to!).
