# Structuring a Causal Project

## What is it That You Even Want?

THE most important thing to define in ANY data science project, not only causal ones, is what the hell you want. This step is often dismissed as unimportant because it doesn't involve fancy math or algorithms, but it is often the most important part of a project. Fail to define it, and there is a high chance you will end up with a solution to a nonexistent problem. What you want is meant to act as a guiding star, one that tells you how far or how close you are to your goal.

Depending on what you want, you may or may not use causal inference. Causal inference is just one tool among many to solve data science problems. But remember: don't hammer your thumb because you are trying to use the wrong tool. Learn to recognise what type of problem you are dealing with so that this doesn't happen. To help you with that, I've laid out here what a typical causal inference problem looks like.

The most basic problem you can solve with causal inference is figuring out how something you can do or change will impact some metric of interest.

![img](./data/img/causal-project/lever-1.png)

You typically have a population (usually a set of customers), you can do something to that population, and you want to know what will happen to your business if you do that thing. Example: you have rich prospects (potential customers) from a rich neighbourhood. This is your population of interest. Then, you want to attract those potential customers to your business (the thing that will happen) by placing a billboard marketing your company in that neighbourhood (the thing that you can do, or lever you can pull). Of course, it is always a good idea to frame this problem in math terms because it is much more concise than writing a whole paragraph.

$$
\underset{T}{argmax} \ E[Y_T]
$$

Here, \\(T\\) is the lever you can pull, which we will refer to as the treatment from now on. In our example, it is the decision of whether or not to place a billboard. \\(Y\\) is the metric you are trying to influence. In our example, it's the number of prospects you managed to convert to customers. For this simple case, \\(T\\) can only take two values: zero (not placing a billboard) or one (placing it). Thus, solving this problem amounts to figuring out if:

$$
E[Y_1] - E[Y_0] > \text{Billboard Cost}
$$
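As a minimal sketch of this decision rule, assume the billboard was placed at random in some neighbourhoods and not in others, so that a simple difference in means estimates \\(E[Y_1] - E[Y_0]\\). The function name, the numbers, and the assumption that \\(Y\\) is measured in the same monetary unit as the cost are all hypothetical.

```python
import numpy as np

def billboard_is_worth_it(y_treated, y_control, billboard_cost):
    """Return (estimated effect, decision) for a binary treatment.

    Assumes random assignment, so the difference in means estimates
    E[Y_1] - E[Y_0], and that Y and the cost share the same unit.
    """
    effect = float(np.mean(y_treated) - np.mean(y_control))
    return effect, effect > billboard_cost

# Made-up revenue per neighbourhood, with and without the billboard.
y_with_billboard = [120.0, 135.0, 128.0, 140.0]
y_without_billboard = [100.0, 110.0, 105.0, 95.0]

effect, go = billboard_is_worth_it(
    y_with_billboard, y_without_billboard, billboard_cost=20.0
)
print(f"estimated effect: {effect:.2f}, place billboard: {go}")
```

With these made-up numbers the estimated effect is 28.25, which exceeds the cost of 20, so the rule says to place the billboard. With real data you would also want a confidence interval around the effect before committing.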

Deciding whether to place the billboard is obviously a causal problem. You can use the whole toolset from part 1 to solve it. A common extension of this problem is when \\(T\\) has more than two values or is continuous. For example (it's a bad example, but bear with me), suppose you have 3 places where you can set the billboard. Then, the problem becomes

$$
\underset{t \in \{0,1,2,3\}}{argmax} \ E[Y_{T=t}]
$$

where \\(0\\) means no billboard at all and 1 to 3 are the different possible locations. I can't give a continuous example with billboards, but think about the problem we've seen before: deciding the optimal coupon value. Here, the problem becomes a classical optimization problem, where we can write the expected value as a function of the treatment (or coupon value).

$$
E[Y_T] = F(T)
$$

Then, you have to find the maximum value of that function. You can do it by testing different regions of \\(T\\) or you can also place some parametric restriction on \\(F\\). In our coupon example, if we believe there is an optimal coupon value, we can argue that the functional form is quadratic:

$$
Y_i = F(T_i) + e_i = \beta_0 + \beta_1T_i - \beta_2T_i^2 + e_i
$$

We can then estimate those parameters with OLS and find the optimum of the function by setting its derivative to zero:

$$
F'(T_i) = \beta_1 - 2\beta_2 T_i = 0
$$
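To make the estimate-then-differentiate recipe concrete, here is a small sketch on simulated data. The data generating process (a profit curve that peaks at a coupon value of 10) is an assumption made up for illustration, and the coupon values are drawn at random, which is what allows plain OLS to recover the curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (made up): the true curve peaks at T = 4 / (2 * 0.2) = 10.
coupon = rng.uniform(0, 20, size=2000)
profit = 5 + 4 * coupon - 0.2 * coupon**2 + rng.normal(0, 1, size=2000)

# OLS on the columns [1, T, T^2].
X = np.column_stack([np.ones_like(coupon), coupon, coupon**2])
coef, *_ = np.linalg.lstsq(X, profit, rcond=None)

# The text writes the quadratic term as -beta_2 * T^2, so beta_2 is
# minus the fitted coefficient on T^2.
beta_0, beta_1, beta_2 = coef[0], coef[1], -coef[2]

# First-order condition: beta_1 - 2 * beta_2 * T = 0.
optimal_coupon = beta_1 / (2 * beta_2)
print(f"estimated optimal coupon: {optimal_coupon:.2f}")
```

The stationary point is a maximum only if the fitted \\(\beta_2\\) is positive (a concave curve), so it is worth checking the sign before trusting the result.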

In all the examples above we have an optimisation problem where there is one single optimal value for \\(T\\). This is usually good enough for most businesses and it is a good starting point for most causal solutions. However, we can go even further. 

In some situations finding the average optimal \\(T\\) is simply unprofitable or not enough. In those cases, we need to do some sort of **personalization** of the treatment. We need to segment the population in some clever way that allows us to find a region where the localized optimal treatment is still better than the average optimal treatment.

![img](./data/img/causal-project/lever-2.png)

In this second case, you can think about your population as a set of different subpopulations. There is again one optimal treatment for the entire population. However, now you are interested in something else. You care more about the best treatment in each subpopulation rather than the best treatment overall.

To give an example, pretend that you are a bank trying to figure out how much credit you should give. You tested giving multiple credit values, but the best one of those values, \\(T=BRL 500\\), was still unprofitable when given to everyone. Doing some more research, you then figured out that you can partition your population according to the customer's income. Then, you found out that you can have a profitable business if you give \\(T=BRL 1000\\) in credit to customers with income above BRL 4000 and \\(T=BRL 100\\) for customers with income below that level. This example shows how you can find multiple treatments, each one optimal on a subpopulation, and end up better than you would have been using a single best average treatment. 
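A rough sketch of the bank example, under heavy assumptions: the credit lines are assigned at random, there are only two income segments, and the expected profits in `expected` are invented so that the best single treatment loses money while the segment-level treatments do not. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60_000

credit_options = np.array([100, 500, 1000])

# Made-up expected profit for each (segment, credit line) cell.
expected = np.array([
    [2.0, -14.0, -50.0],    # segment 0: income below BRL 4000
    [-10.0, 12.0, 30.0],    # segment 1: income above BRL 4000
])

# Randomized test: every customer gets a random credit line.
arm = rng.integers(0, 3, size=n)
segment = rng.integers(0, 2, size=n)
profit = expected[segment, arm] + rng.normal(0, 10, size=n)

# Average profit per treatment, ignoring the segments.
overall = np.array([profit[arm == a].mean() for a in range(3)])

# Average profit per treatment within each segment.
by_segment = np.array([
    [profit[(segment == s) & (arm == a)].mean() for a in range(3)]
    for s in range(2)
])

best_overall = credit_options[overall.argmax()]
best_by_segment = credit_options[by_segment.argmax(axis=1)]

print("best single treatment:", best_overall, "avg profit:", overall.max())
print("best per segment:", best_by_segment,
      "avg profit:", by_segment.max(axis=1).mean())
```

In this simulation the best single credit line is unprofitable on average, while picking the best line inside each income segment yields a positive average profit, which is the pattern the example describes.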

In more technical notation, we are going from a problem of estimating the average treatment effect so that we can optimise the treatment

$$
E[Y_1 - Y_0]
$$

to one where we want to estimate the conditional treatment effect so that we can optimise the treatment *within each subpopulation*

$$
E[Y_1 - Y_0 | X]
$$

Notice that we can also define this for the continuous case

$$
E\bigg[\frac{\partial Y}{\partial T} \bigg| X \bigg]
$$

To recap, we saw that problems that can be solved with causal inference have an optimization nature. They are either of the form "how can I set T to optimise Y", where the goal is to find a single best treatment for the whole population, or of the form "how can I set T within each subpopulation defined by X so that I can optimise Y", where the goal is to find the best treatment within each subpopulation.

Learning how to identify and formulate our problem is a first and important step in solving it. But we are nowhere near done. We still need to lay out the procedure that actually solves a causal problem.

## Causal Inference Assumptions

In my experience with causal problems, I've noticed that we can break them up into two smaller, but still difficult, problems. The first one is **gathering data** that allows for causal inference. The second one is, given that data, actually **doing causal inference**. 

Doing causal inference is no easy task. For a start, you need to have data that satisfies a series of conditions in order for causal inference to even be possible. If the data doesn't satisfy those conditions you are usually in big trouble. Hence, it pays to know what those conditions are. We often refer to them as the causal inference assumptions.

### Conditional Unconfoundedness or Exchangeability

$$
(Y_0, Y_1) \perp\!\!\!\perp T | X
$$

We've hinted at this assumption in chapter two already. It basically means that the treatment and control populations are exchangeable and don't differ in any systematic way. Another way of stating it is that, after we control for the factors measured in \\(X\\), the treatment is as good as random. This assumption fails to hold when we have unmeasured confounders.

To give an example, suppose that we are estimating the effect of price on ice cream sales. But the ice cream store always raises prices during the summer, when people buy the most. If we do a simple estimation of how changes in price affect changes in sales, it will look like increases in price lead to increases in sales. In this situation, there is a confounder and the assumption doesn't hold: the season affects both treatment and outcome. However, if we can control for season with, say, a linear regression, then we have conditional unconfoundedness: conditional on season, the treatment becomes as good as random and we can successfully estimate the effect of price on sales.
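The ice cream story can be reproduced with a small simulation. The data generating process is made up: summer raises both price and sales, and the true effect of price on sales is set to \\(-3\\). The helper `ols` is just a thin wrapper around least squares written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Made-up process: summer pushes up both the price and the sales.
summer = rng.integers(0, 2, size=n)
price = 5 + 2 * summer + rng.normal(0, 1, size=n)
sales = 50 + 20 * summer - 3 * price + rng.normal(0, 2, size=n)

def ols(columns, y):
    """Least-squares coefficients, with an intercept as the first entry."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Regressing sales on price alone mixes the price effect with the season.
naive_slope = ols([price], sales)[1]

# Adding season as a control recovers the effect of price.
adjusted_slope = ols([price, summer], sales)[1]

print(f"naive: {naive_slope:.2f}, controlling for season: {adjusted_slope:.2f}")
```

The naive slope comes out positive, as if raising prices increased sales, while the slope that controls for season lands close to the true \\(-3\\).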

Notice that unconfoundedness obviously doesn't hold if we can't measure all the confounders. However, I find that in the industry, unconfoundedness usually holds for the following reason. The company usually defines a treatment policy (e.g. giving coupon values) based on a set of internally measured indicators. You can then argue that the treatment is a function of only things that you can measure: \\(T(X)\\). If that is the case, unconfoundedness will hold if you can control for the same set of variables that defined the treatment policy in the first place.

### Positivity

$$
0 < P(T=t | X=x) < 1
$$

We've briefly discussed positivity when we were discussing propensity scores in chapter 11. Simply put, it says that **all the treatments that are being considered must have a strictly positive, strictly less than 1, probability of being assigned**. Another way of saying this is that the treatment cannot be deterministic. To see this, consider the counterexample. Say that, for the same type of customers \\(X=x\\), the treatment is always \\(T=0\\). Then, there is no way of knowing what would happen to those customers if we set their treatment to \\(T=1\\). We need some randomness in the treatment assignment so that we can get some idea about what each treatment does to each particular type of customer.

This assumption is often a source of great controversy in the industry and I think it is mostly because people don't understand it very well. For example, consider again the case of estimating the causal effect of price on sales. Some people think that, in order to have positivity, no price can have zero probability of being selected. Hence, you should test prices like 1, 2, 3, 4.5 BRL up to 1000 BRL. This is obviously nonsense. The key is noticing that positivity needs to hold only for **the treatments that are being considered**. If you deem a treatment to be so absurd that it is not even worth exploring, by all means, assign zero probability to it. Be careful though: you can't test prices like 1, 2, 3, 4.5 BRL and expect to know what would happen if the price was 5. In other words, the positivity assumption says that you can't make inferences about treatments that were not considered in your sample (well, you can with extrapolation, but that's super dangerous. Be warned!)
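One crude way to look for positivity problems in logged data is to count how often each considered treatment shows up within each type of customer. The sketch below does only that; the function name and the toy log are hypothetical, and an empty cell in a sample is a warning sign rather than proof that the true assignment probability is zero.

```python
from collections import Counter
from itertools import product

def positivity_violations(strata, treatments, considered_treatments):
    """List (stratum, treatment) cells that never appear in the data.

    Only treatments in `considered_treatments` are checked, since
    positivity needs to hold only for the treatments being considered.
    """
    seen = Counter(zip(strata, treatments))
    cells = product(sorted(set(strata)), considered_treatments)
    return [(x, t) for x, t in cells if seen[(x, t)] == 0]

# Toy log: loyal customers were only ever shown the price of 1.
strata = ["new", "new", "new", "loyal", "loyal", "loyal"]
prices = [1, 2, 3, 1, 1, 1]

print(positivity_violations(strata, prices, considered_treatments=[1, 2, 3]))
```

Here the check flags the prices 2 and 3 for loyal customers: the data says nothing about how that group would react to them. A price of 5 would be flagged for both groups if you added it to the considered list, which is the extrapolation warning from the text.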

### SUTVA

The Stable Unit Treatment Value Assumption (SUTVA) states that units do not interfere with each other. In other words, the treatment for unit \\(i\\) only affects unit \\(i\\) and no other unit. The best way to understand this is to look at a case where it does not hold. Consider vaccines, for example. They are usually a treatment for contagious diseases. This means that, if I treat your friend, even if you don't get the treatment, it is more likely that you will also be protected from the disease. This causes the treatment effect to spill over from the treated to the untreated. It is not something catastrophic, but it biases the estimated treatment effect towards zero. That's because, by making the untreated look like the treated due to spillover, the difference between them will be harder to measure.

Spillover is not something that happens only in vaccines. Another common place for it is in marketing. If you engage some customers, those customers could like your product and tell their friends about it, causing a ripple effect of your marketing campaign. 
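The bias towards zero can be seen in a toy simulation. Everything here is assumed for illustration: a true effect of 10 on treated units, and a spillover that hands 4 of those units of benefit to every untreated unit. Real interference depends on who is connected to whom, which this sketch ignores.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
true_effect = 10.0
spillover = 4.0  # made-up benefit that leaks to untreated units

treated = rng.integers(0, 2, size=n).astype(bool)
baseline = rng.normal(100, 5, size=n)

# No interference: only treated units benefit.
y_no_spill = baseline + true_effect * treated

# Interference: untreated units also gain something from treated peers.
y_spill = baseline + true_effect * treated + spillover * (~treated)

def diff_in_means(y, t):
    """Difference between the average outcome of treated and untreated."""
    return y[t].mean() - y[~t].mean()

est_no_spill = diff_in_means(y_no_spill, treated)
est_spill = diff_in_means(y_spill, treated)

print(f"without spillover: {est_no_spill:.2f}, with spillover: {est_spill:.2f}")
```

Without spillover the difference in means recovers roughly 10. With spillover, the untreated group looks better than it would have in a world with no treatment at all, so the estimate shrinks by the size of the spillover.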

## Causal Inference System

Because causal inference requires your data to abide by these strict assumptions, it is rarely the case that you can simply get some pre-recorded data and solve a causal problem. More commonly, you will also have to design a system that collects data following the requirements set by the causal assumptions laid out above. One example of such a system is an experiment, where you randomly (or conditionally randomly) test each of the treatments out. And, fair enough, we will talk about those. But here I want to lay out something more general. Something that you can use as a guideline when thinking about solving causal problems in your own work or field.

First, notice that knowledge of the causal inference assumptions gives us some guidelines on how we should collect data in order to perform causal inference. First and foremost, unconfoundedness says it is **super important to know very well the process that generates the treatment**. Want to know the effect of price on sales? Make sure you store everything that is used to define price in the first place! Want to know which ad campaign is better? Make sure you log all the variables that are used to define which campaign is displayed to each customer. Those will be crucial for unconfoundedness and could mean the difference between the possibility or impossibility of estimating a causal effect.

Also, when you want to make any inference about the effect of some treatment, you need to make sure it is tested non-deterministically in your population of interest. You can't hope to know what would happen if you set your price to 100 if your prices have never exceeded 50. Similarly, you can't know the effect of a red homepage on customer retention if you never tried anything other than red.

Knowing all those things, here is what I think anyone that wants to do causal inference the way grownups do should seriously consider implementing.

![img](./data/img/causal-project/causal-system.png)

Here is how this thing works. First, you start by defining a population and its defining features \\(X\\). This is the leftmost square. A population can be something like the customers I have from some city, say, Rio de Janeiro. The defining features could be things like customer age, history of customer transactions with the business, customer tenure and so on. The important thing is that those features will be the input to your policy.

The policy is this little machine that takes customers and attaches treatments to them. In mathematical terms, it's a function that takes as input the defining features of your entities and maps each entity to a treatment.

$$
X_i \rightarrow F \rightarrow (X_i, T_i)
$$

This policy has two purposes: its primary purpose is to select the best possible \\(T\\) for each unit \\(i\\) in order to maximise some business metric \\(Y\\). However, this policy also needs to evolve and get better with time. For this reason, it also needs to explore treatments that are different from the ones it thinks to be optimal. This is the traditional exploit-explore tradeoff, well known to reinforcement learning. One very simple and effective way of designing this policy is to select the known optimal treatment with probability \\(p\\) and explore the treatment space with probability \\(1-p\\).

$$
\tau \sim \text{Bernoulli}(p)
$$

$$
t_i \sim D(t_1, t_2, ..., t_k)
$$

$$
T_i = \tau \ \underset{T}{argmax} \ E[Y_{iT}|X_i] \ + \ (1-\tau)t_i
$$

where \\(D\\) defines the sampling distribution for the treatment you want to explore. 
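A minimal sketch of such a policy is below. It assumes that some model has already produced the estimated best treatment for each unit, and it takes \\(D\\) to be uniform over the treatments to explore; both choices, and all the names, are assumptions made for illustration.

```python
import numpy as np

def assign_treatment(best_treatment, explore_options, p, rng):
    """Exploit the current best treatment with probability p, else explore.

    best_treatment: estimated argmax treatment for each unit, assumed to
                    come from some model of E[Y | X, T].
    explore_options: treatments t_1..t_k to sample from when exploring.
                     D is taken to be uniform here (an assumption).
    Returns the assigned treatments and the exploit indicator tau.
    """
    best_treatment = np.asarray(best_treatment)
    n = len(best_treatment)
    exploit = rng.random(n) < p                      # tau = 1 with prob. p
    explored = rng.choice(explore_options, size=n)   # t_i drawn from D
    treatment = np.where(exploit, best_treatment, explored)
    return treatment, exploit

rng = np.random.default_rng(4)
best = np.full(10_000, 2)  # pretend treatment 2 is the best for everyone
treatment, exploit = assign_treatment(best, [0, 1, 2, 3], p=0.9, rng=rng)

print("share exploiting:", exploit.mean())
print("treatments seen while exploring:", np.unique(treatment[~exploit]))
```

Storing the exploit indicator, together with \\(p\\) and \\(D\\), alongside each assignment is a cheap way to keep the treatment generating process fully known, which is exactly what unconfoundedness and positivity ask for.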


Next to the policy, after the \\((X, T)\\) pairs have been formed, comes the Process step, where we see how the treatment plays out in the real world. Process simply measures the value of the quantity we want to optimise for each \\((X, T)\\) pair, producing a triple \\((X, T, Y)\\).

Once we have this, all the pieces for causal inference are in place. We can then proceed to estimate the effect of \\(T\\) on \\(Y\\) on the population defined by \\(X\\). This information is then used to update the policy to make it better and better and the cycle restarts. 

## Key Ideas

My goal here was twofold. First, I tried to give you some hints to identify a problem that you can solve with causal inference tools, in the hope that you start to identify those problems around you. I firmly believe that the first step in solving a problem is understanding it. So there you have it: some tools to identify and formulate causal problems.

Second, I've laid out how causal inference can't be done with just any kind of data. Instead, there is a set of requirements that the data must satisfy so that causal inference can be done at all. I've shown you those and argued that, for such reasons, causal inference projects often require a data gathering phase before any inference is done. We then saw a good example of a causal inference system that mixes estimation, execution and data gathering. I hope all of these can serve as a map that you can use to navigate the most thorny causal problems.

## References

The things I've written here are mostly stuff from my head that I've learned through experience. This means that there isn't a direct reference I can point you to. It also means that they have **not** passed the academic scrutiny that good science often goes through. Instead, notice how I'm talking about things that work in practice, but I don't spend too much time explaining why that is the case. It's a sort of science from the streets, if you will. However, I am putting this up for public scrutiny, so, by all means, if you find something preposterous, open an issue and I'll address it to the best of my efforts. 

To check out some ideas similar to the ones laid out here, I strongly suggest you read Hal R. Varian's paper: [Causal inference in economics and marketing](https://www.pnas.org/content/113/27/7310).

