# Developing retrospective causal inference

A retrospective model uses data that includes entities with and without an effect condition. Entities with the effect condition are matched to similar entities without it. Then, differences in the frequency of the potential causal factors are investigated for correlation with the effect condition. Because we carefully construct the similar, non-effect group to match the effect group, we control for a wide range of cofactors. This is not perfect (see "Disadvantages" below), but it's a powerful way to survey a wide range of potential causal factors for an effect.

## Benefits of causal inference

Causal inferences allow us to influence outcomes. We don't just want to predict how long we will live, how well kids will do in school, or which patients will die of a heart attack: we want to increase our life spans, help kids learn, and prevent needless deaths. Hence, we need to know not only which variables _predict_ but also which variables _cause_ outcomes. 


## Example: Pitcher Injury

Baseball teams often can't run injury experiments on their players. Practically, they need their pitchers to pitch their best, not to pitch in a way that investigates the effect of pitching differently. Ethically, they can't expose their pitchers to risky treatments. Legally, the collective bargaining agreement (CBA) wouldn't allow it. But baseball teams and players still want to reduce player injuries, so they need to figure out which factors influence injury risk. This is an ideal situation for retrospective analysis.

Among the variables that a team can control is workload: how often and how much a pitcher pitches. Does workload influence injury? A retrospective analysis might provide an answer.

**Step one:** we identify pitchers who have a particular serious injury, such as a UCL tear requiring ligament replacement. 

**Step two:** we identify pitchers who did not have that injury. For each pitcher in the group with the injury (the effect group), we identify a similar pitcher in the uninjured group. Similarity can be determined with many factors, as long as we don't include the factors that we want to analyze for a causal influence; initially, we might use age, weight, height, BMI, team affiliation, velocity, and prior injury history.

**Step three:** we examine any features that we didn't use to determine similarity. Any feature that is more common in the injured group is a candidate causal factor; i.e., this is a correlation which _does_ suggest causation. Hence, we might look at workload, release point consistency, pitch selection, trade history, or nearly any other factor we can get our hands on. (Again, this factor cannot be one that was used to determine similarity.)
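Step three can be sketched on a tiny, made-up table. The column names (`pair_id`, `group`, `high_workload`) and every value below are hypothetical, chosen only to illustrate the comparison; they are not real pitcher data.

```python
import pandas as pd

# Hypothetical matched data: each pair_id links one injured pitcher
# to the similar uninjured pitcher chosen in step two.
pitchers = pd.DataFrame({
    "pair_id":       [0, 0, 1, 1, 2, 2, 3, 3],
    "group":         ["injured", "control"] * 4,
    "high_workload": [1, 0, 1, 1, 1, 0, 0, 0],  # 1 = factor present
})

# Frequency of the candidate factor within each group.
rates = pitchers.groupby("group")["high_workload"].mean()

# A clearly positive difference marks the factor as a candidate cause
# that deserves a closer look; it is not proof of causation.
difference = rates["injured"] - rates["control"]
print(rates)
print(f"difference in frequency: {difference:.2f}")
```

With real data, the same comparison would be repeated for each candidate factor, and a difference this size on so few pairs would need a significance check before anyone acted on it.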

## Advantages

Retrospective models can be used to analyze data after it has been collected. Thus, if we already have a lot of data about something, we can look for causation even if we didn't plan to ahead of time or if other methods of causal inference are impossible for ethical or practical reasons.

## Disadvantages

This is not Randomized Experimental Design (RED). Even with well-constructed controls, it is possible that some cofactor of an apparent causal factor is the real cause. (For example, caffeine consumption might look like a causal factor, but if caffeine consumption is correlated with cigarette smoking, then its influence may be only apparent. RED controls for this. In a retrospective model, the researcher must look for it.)

Additionally, retrospective models cannot be used to determine effectiveness. Because the control group necessarily does not have the effect condition, there is no way to gauge how much a causal factor influences the effect condition. Of course, once a causal hypothesis has been identified, it may be possible by other means to estimate effectiveness.

#### Further Reading

See Giere, R. _Understanding Scientific Reasoning_.

In [2]:
import numpy as np
import pandas as pd

In [None]:
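class Retro:
    '''A class for analyzing data for causal relationships with a retrospective model.

    Attributes
    ----------
    data: DataFrame
        The data in which a cause might be found.

    effect_label: str
        The label of the column containing the effect condition.

    causal_variables: list-like
        A list of columns containing variables to test for causal influence.

    match_variables: list-like
        A list of columns in data that are used to determine the similarity of two entities.

    '''

    def __init__(self, data, effect_label, match_variables, causal_variables):
        '''
        Parameters
        ----------
        data: DataFrame
            The data in which a cause might be found.

        effect_label: str
            The label of the column in which we find the effect condition of interest.

        match_variables: list-like
            Columns used to determine the similarity of two entities.

        causal_variables: list-like
            Columns to test for causal influence.
        '''
        # Store the inputs so that later methods can use them.
        self.data = data
        self.effect_label = effect_label
        self.match_variables = list(match_variables)
        self.causal_variables = list(causal_variables)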

    

    

# Matching

From the point of view of implementation, the matching process is probably the most complex portion of the problem. Each normalized datapoint is a vector. We need to match each datapoint in the effect group with a datapoint in the control group, using the distance between the vectors as a similarity metric. The matches must be unique and we want to minimize overall dissimilarity.
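As a minimal sketch of the normalization and distance steps described here, the following assumes z-score normalization and Euclidean distance, neither of which the text prescribes. The arrays `effect` and `control` and their values are hypothetical.

```python
import numpy as np

# Hypothetical match variables (columns: age, height in cm, weight in kg).
effect = np.array([[24.0, 188.0, 95.0],
                   [31.0, 193.0, 102.0]])
control = np.array([[25.0, 185.0, 93.0],
                    [30.0, 196.0, 104.0],
                    [27.0, 190.0, 98.0]])

# Normalize with statistics from the pooled data so that no single
# variable dominates the distance merely because of its units.
pooled = np.vstack([effect, control])
mean, std = pooled.mean(axis=0), pooled.std(axis=0)
effect_z = (effect - mean) / std
control_z = (control - mean) / std

# distances[i, j] is the Euclidean distance between effect member i
# and control member j, computed with broadcasting.
distances = np.linalg.norm(effect_z[:, None, :] - control_z[None, :, :], axis=2)
print(distances.round(2))
```

The resulting matrix has one row per effect-group member and one column per control-group member, which is the input the matching step needs.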

## More precisely

Let the effect group be $X = \{a, b, c\}$ and the control group be $Y = \{w, x, y, z\}$. We want to pair members of $X$ with members of $Y$ in a way that minimizes the distance between the pairs. The pairing should be unique, so that no member of $Y$ is matched to more than one member of $X$. A complexity emerges: what if $a$ and $b$ are each closer to $x$ than to any other member of $Y$? How can we choose whether to pair $a$ or $b$ with $x$? Let $m$ denote our mapping from $X$ to $Y$, and let $n$ be the number of members of $X$. We want to minimize the total dissimilarity across all pairs. Root mean squared distance provides a model for such similarity maximization; because the square root and the factor $1/n$ are monotonic, minimizing it is equivalent to minimizing the sum of squared distances:
$$
\min_{m} \sqrt{
    \frac{1}{n} \sum_{u \in X} \mathrm{dist}(u, m(u))^2
}
$$
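To make the choice between $a$ and $b$ concrete, here is a brute-force sketch that tries every unique pairing and keeps the one with the smallest sum of squared distances, which has the same minimizer as the mean squared distance. The distance values are hypothetical, and exhaustive search is an illustrative stand-in, not a proposed implementation.

```python
from itertools import permutations

import numpy as np

# Hypothetical distance matrix: rows are the effect group (a, b, c),
# columns are the control group (w, x, y, z). Both a and b are closest to x.
effect_names = ["a", "b", "c"]
control_names = ["w", "x", "y", "z"]
distances = np.array([
    [2.0, 1.0, 5.0, 6.0],   # a
    [6.0, 1.5, 7.0, 8.0],   # b
    [6.0, 5.0, 2.0, 3.0],   # c
])

def total_squared_distance(assignment):
    """Sum of squared distances when effect member i is paired with control assignment[i]."""
    return sum(distances[i, j] ** 2 for i, j in enumerate(assignment))

# Try every way of giving each effect member its own control member.
# This is only feasible for tiny groups: the number of candidates grows factorially.
candidates = permutations(range(len(control_names)), len(effect_names))
best = min(candidates, key=total_squared_distance)

matching = {effect_names[i]: control_names[j] for i, j in enumerate(best)}
print(matching)  # {'a': 'w', 'b': 'x', 'c': 'y'}
```

Here $x$ goes to $b$ even though $a$ is closer to it, because $b$ has no other good option. For realistic group sizes, an assignment solver such as `scipy.optimize.linear_sum_assignment`, applied to the matrix of squared distances, should solve the same problem in polynomial time, assuming SciPy is available.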

