# How to Pick One Card by Looking at Every Draft Ever

## Introduction

In this analysis, we provide statistical estimates of the efficacy of selecting first picks by taking the highest-ranked card in a pack according to a given metric, for several different metrics. I hope the utility of such an analysis is obvious, assuming it is accurate. If we are going to use metrics to rank cards, we should have some belief that taking a card with a higher rank than other cards will help us win more. If taking cards according to metric A definitively leads to more winning than metric B, then it should be fair to say that metric A is a more accurate ranking of card quality than metric B.

The approach we will use is new and comprehensive. For each p1p1 made in over 2.2 million drafts, we determine the highest-ranked card in the pack according to each strategy, then replace the results of the event with the results of the average event where that card was taken. Where possible, that average is restricted to events where the card was the highest-ranked card in the pack, and to players in the same skill band. To provide the broadest analysis, when there are no examples within the skill cohort under consideration, we look for examples of the given card being selected in neighboring skill cohorts, moving toward the mean.

A few cards end up being left out of the analysis. I would have liked to exclude the comparisons, but the corresponding rows are hard to remove from the base case, so instead I just ignored them. The metrics most affected were GIH WR and GP WR (which don't use ATA and therefore have some weirder picks), for which about 0.1% of the picks were not possible to simulate. Given that this event is correlated with a very low-quality p1p1, it's possible that those picks would have a win rate 2% or so lower than average, which would mean the overall result would be about 0.002% lower than what was recorded (0.1% of picks times a 2% shortfall). So, de minimis.
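The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the analysis: the band list, the choice of 54 as the mean band, the assumption that the search stops at the mean, and the names `top_card`, `bands_toward_mean`, and `simulated_win_rate` are all hypothetical. It also leaves out the extra condition that the card was the highest-ranked one in its pack.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical skill bands in percent. Only the 42 and 66 endpoints and the 54
# band appear in the post; the 4-point spacing is an assumption.
BANDS: List[int] = [42, 46, 50, 54, 58, 62, 66]
MEAN_BAND = 54  # assumed location of "the mean"


def top_card(pack: List[str], metric: Dict[str, float],
             lower_is_better: bool = False) -> Optional[str]:
    """Return the highest-ranked card in the pack, or None if no card is rated."""
    rated = [card for card in pack if card in metric]
    if not rated:
        return None
    choose = min if lower_is_better else max  # ATA ranks by lowest value
    return choose(rated, key=lambda card: metric[card])


def bands_toward_mean(band: int, bands: List[int], mean_band: int) -> List[int]:
    """List the starting band, then each neighbor stepping toward the mean band."""
    start, stop = bands.index(band), bands.index(mean_band)
    step = 1 if stop >= start else -1
    return [bands[k] for k in range(start, stop + step, step)]


def simulated_win_rate(card: str, band: int,
                       avg_results: Dict[Tuple[str, int], float],
                       bands: List[int] = BANDS,
                       mean_band: int = MEAN_BAND) -> Optional[float]:
    """Average result for `card` in the nearest band that has an example."""
    for candidate in bands_toward_mean(band, bands, mean_band):
        rate = avg_results.get((card, candidate))
        if rate is not None:
            return rate
    return None  # no example found: this pick cannot be simulated
```

Here `avg_results` stands in for a table of average event results keyed by card and band. A `None` return corresponds to the small share of picks that could not be simulated.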

Unlike analyses which simply look at the results of taking the highest-ranked card in the pack, this should completely control for opening luck, since the distribution of opening picks in the simulated strategy is identical to what would have been possible with the observed opens. Since we are only conditioning on p1p1, the remaining picks should be sufficiently independent of the strategy, and there should not be any additional luck effects. There might be a slight bias in the strength of the alternative cards passed to the left (since the person who actually chose the given card may have had less enticing alternatives), but I don't think the strength of the cards passed should, by itself, materially affect results. In any case, we are mostly interested in the differences between metrics rather than the differences from the base case, and each comparison uses the same methodology.

A remaining source of bias is the time travel induced by the simulation methodology. If we replace a given pick with a pick made under a different metagame, one more conducive to the pick in question, then the results could be spurious. There is a broad element of this that we control for, which is the general decrease in win rates over time within a given cohort. The card-by-card element is harder to control for, but I also think it is not a big source of worry. First of all, it is hard to exploit a format by choosing a first pick. Generally, the best-performing metrics skewed towards the end of the format, when people started picking the best cards earlier.

The corpus of drafts includes every draft available in a public dataset on 17Lands.com with a schema supporting this analysis. That means the last 16 draft releases going back to NEO, excluding HBG and SIR. Because I think the findings are of material interest to anyone who uses 17Lands metrics to inform picks, I will present the basic chart first, then dive into methodology and additional analysis. This is a chart of the marginal win rate resulting from picking p1p1 according to the rankings provided by different metrics, segmented by an estimation of player skill based on win rate and the number of games played:

\<chart here>

The metrics evaluated are ATA (using the lowest value, of course), GIH WR, and GP WR, according to the methodology used by 17Lands.com, and DEq. DEq is a custom metric I introduced this summer. It is intended to estimate the increase in win rate attributable to drafting a given card relative to a "null pick", such as a basic land, using only the daily data provided by 17Lands.com. For this analysis I recreated the same methodology I use in my spreadsheet, although I intend to pursue improvements to the methodology as my next project.

The win rate groups are estimated from the data using a Bayesian methodology which I wrote up in [this post](bayes-cohorts). The basic idea is that a lot of confidence is required to assign a player to a high win rate bucket, so to end up in the 66% group it's not enough to just win at a high rate, you have to do it over a very large number of games (e.g. 500 games at 68% win rate). I combined groups beyond the endpoints into the end buckets (42% and 66%).
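To illustrate why a high observed win rate over few games is not enough, here is a small sketch of one common Bayesian approach: shrinking each player's record toward a prior before bucketing. This is not the method from the linked post, which I have not reproduced here. The Beta prior, its mean of 54%, its strength of 100 pseudo-games, the band list, and the function names are all assumptions chosen for illustration.

```python
from typing import List

# Hypothetical buckets in percent; only the 42 and 66 endpoints are from the post.
BANDS: List[int] = [42, 46, 50, 54, 58, 62, 66]


def shrunk_win_rate(wins: int, games: int,
                    prior_mean: float = 0.54,
                    prior_games: float = 100.0) -> float:
    """Posterior mean win rate under an assumed Beta prior.

    The prior acts like `prior_games` extra games played at `prior_mean`,
    so small samples are pulled strongly toward the prior.
    """
    alpha = prior_mean * prior_games
    beta = (1.0 - prior_mean) * prior_games
    return (wins + alpha) / (games + alpha + beta)


def assign_bucket(wins: int, games: int, bands: List[int] = BANDS) -> int:
    """Nearest bucket to the shrunk rate; values past the ends land in the end buckets."""
    rate = shrunk_win_rate(wins, games)
    return min(bands, key=lambda band: abs(band / 100.0 - rate))


# Same 68% observed win rate, very different sample sizes.
small_sample = assign_bucket(wins=34, games=50)
large_sample = assign_bucket(wins=340, games=500)
```

Under these assumed prior settings, the 50-game player lands in a lower bucket than the 500-game player despite the identical observed win rate, which is the qualitative behavior the paragraph above describes.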

## Interpretation

Before I dive into more details about the methodology, let me give my interpretation of the results. I think it's a serious result that shows that many, many drafters could materially improve their results by taking higher-quality cards, and that DEq is the best metric currently available. Let's consider the scale of the result. For players in the 54% band, we estimate that by choosing the highest-DEq card with their first pick, without making *any* other changes to their game, their win rate would increase by about half a percent. That's just one pick. I estimate that, at most, a single pick is worth about 3% on average, and this analysis estimates that average players are leaving half a point of that on the table in each draft. You could go on to choose higher-quality cards in p1p2, p1p3, and throughout the draft, and that could presumably have additional benefits.

Now, as for the ranking of metrics, well, I win. I hope you don't think I somehow cooked the books. I had absolutely no way to predict which metric would perform best, aside from the technique of developing a metric intended to perform well when used to choose cards, which I freely admit to. I've spent the last three months developing the technology to make this analysis (and a lot more) possible, and as I embarked on the project I had no inkling of the shape the final data would take. 

To be honest, I expected GIH WR to perform worse and for DEq to win by more. As an entrenched GIH WR hater, I have spent a lot of time dwelling on the weaknesses of the metric, and I am slowly beginning to understand what makes it work for a lot of people. Some of the things that, statistically speaking, are bias, happen to correlate well with performance for early picks. In particular, the philosophy of choosing cards with "high impact," as opposed to cards that perform well as part of a linear strategy, stands up well to changes in the metagame (which I know many people are perfectly aware of, but I'm slow), and correlates with ATA to compensate for that missing information. Nevertheless, correctly evaluating cards in linear, aggressive strategies is a persistent weakness for the metric.

One of the things that had surprised me in my first forays into player behavior was the stickiness of ATA for elite players. I expected card quality to take over as the driving force of picks for elite players, but that is not the case. ATA continues to predict p1p1 behavior better than any win-rate metric. However, that stickiness did not translate to performance, suggesting that even elite players could go further in adjusting their games to card-quality information. But for elite players, it won't be enough to take one ranking for the whole format and blindly follow it, as none of the metrics gave persistent improvements at the highest levels. Part of that is due to the cohort-switching used to find 