# Data-Driven Decisions

In this piece, we'll get down to the core of what data means, how we can extract information and knowledge from it, and how to use it as a savvy businessperson, researcher, or scientist. You won't need any preparation, but there will be code available for programmers, experienced and aspiring alike, to use and modify. You will walk away from this page with a better understanding of what your data does and does not need and how to use it most effectively.

## Part 1: Information vs Data

Webster's online dictionary defines data as factual measurements or statistics used as a basis for actions such as reasoning, discussion, or calculation. We want to pay specific attention to the word "basis" here: data is only a starting point, and a starting point can quite easily be manipulated. If we don't want to be susceptible to this manipulation, then we should make sure we turn that data into information. The entry for information is "knowledge obtained from investigation, study, or instruction," and that is really what's at play when we're making smart, data-driven decisions.

### Fallacies and Manipulation

The first step in being a conscious member of the data community is to identify and refute bad data appropriately. There are many resources covering the many, varied ways to manipulate data, but we'll look at only the most common and most relevant ones here.

##### Confirmation Bias

The biggest consequence, good or bad, of the information age is that data is available at the click of a mouse or the sound of a wake-word in almost everybody's home. It is certainly good for us to have access to more data, but that means we now have to be aware of confirmation bias as we interpret it. Even if you're looking for the answer to a question that you don't have an opinion on, slight differences in phrasing will affect the content of your results. Take coffee and acne, for example. In a simple Google search, we see very different results depending on whether we ask about a cure or a cause:
 
 Cure            |  Cause
:-------------------------:|:-------------------------:
![](attachment:cure.png)  |  ![](attachment:cause.png)

We see that even when you're not trying to put your bias into the search, your results can be skewed. This can be resolved by looking at aggregates of data instead of single sources. Relying on single sources is what makes way for "pop science," where a study finds that X causes/cures Y, and both X and Y are common conditions. We see more of how these unfortunate statistical anomalies become daytime news stories here.


##### Data Dredging

Data dredging is what happens when, intentionally or not, a single dataset offers too many variables or opportunities for correlation. A famous example comes from the website FiveThirtyEight, which reported on the field of nutrition, where the available data already has issues. They noted in particular that nutrition surveys generate "huge data sets with many, many variables," which are the most susceptible to data dredging. [The article](https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/) finds a few particularly remarkable relationships in the respondent data:

![](attachment:Screen%20Shot%202020-05-23%20at%208.59.31%20PM.png)

You'll (hopefully) not be surprised to know that there is no real-life correlation between cabbage and innie bellybuttons or tomatoes and Judaism. These links are a function, as the article explains, of over 27,000 regressions being run on fewer than 60 complete responses, where a 5% false positive rate is expected. At that rate, more than 1,300 of those regressions would look significant by chance alone. Why 5%? Well, that leads into the next topic, p-hacking.
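To see how chance alone produces "links" like these, here is a minimal sketch that correlates columns of pure random noise. Everything in it is an assumption for illustration: the 50 respondents, the 40 "foods" and 25 "traits," and the Fisher z approximation used for the p-value are stand-ins, not the article's actual data or method.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Two-sided p-value for r using the Fisher z (normal) approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(n_respondents=50, n_foods=40, n_traits=25, alpha=0.05, seed=0):
    """Correlate every random 'food' with every random 'trait'.

    All columns are independent noise, so any 'significant' pair is a
    false positive. Returns (significant pairs, pairs tested).
    """
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0, 1) for _ in range(n_respondents)]

    foods = [noise() for _ in range(n_foods)]
    traits = [noise() for _ in range(n_traits)]

    hits = 0
    for food in foods:
        for trait in traits:
            r = pearson_r(food, trait)
            if approx_p_value(r, n_respondents) < alpha:
                hits += 1
    return hits, n_foods * n_traits


hits, tests = dredge()
print(f"{hits} of {tests} unrelated pairs look significant ({hits / tests:.1%})")
```

None of these columns has anything to do with any other, yet the share of "significant" pairs should land near the 5% threshold. Scale the number of tests up to the tens of thousands and a few headline-worthy pairs are all but guaranteed.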


##### P-Hacking and Cherry Picking

You might have noticed that the far-right column in the above picture has a "p-value" label. This value tells statisticians how likely a result at least this strong would be if there were no real effect, which makes it a gauge of the risk of a false positive. In general, a study with a p-value of less than 0.05, or 5%, is considered publishable. That is a hard bar to clear in some situations, but in others it is absurdly easy. Linked in the above article is an [interactive piece](https://projects.fivethirtyeight.com/p-hacking/) by FiveThirtyEight that lets you try p-hacking in action. This one comes with the bonus of letting you lean into confirmation bias as well, as it contains economic and political data from the last ~70 years. You get to cherry pick the types of political data to use and the ways to measure a "good" economy, and the right panel shows you how significant your results are. It should go without saying that if you cherry pick enough, you can prove any conclusion you want to.

This phenomenon can occur at any stage in the transfer of information, from data sources to scientists to reporters to consumers, again intentionally or unintentionally. We see this taken to a comedic level with XKCD's exploration of jelly beans:

![xkcd.png](attachment:xkcd.png)
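The arithmetic behind the comic can be sketched in a few lines. This assumes every test is independent and that there is no real effect anywhere. The function name `chance_of_false_positive` is made up for this example.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests has p < alpha."""
    # Each test avoids a false positive with probability (1 - alpha),
    # so all n_tests avoid one with probability (1 - alpha) ** n_tests.
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    chance = chance_of_false_positive(n_tests)
    print(f"{n_tests:>3} tests: {chance:.0%} chance of at least one spurious hit")
```

With 20 tests, one per jelly bean color in the comic, the chance of at least one "significant" result is about 64%, even though each individual test still has only a 5% false positive rate.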


What you should take away here is that 