# Data Driven Decisions

In this piece, we'll get down to the core of what data means, how we can extract information and then knowledge from it, and how to use it as a savvy businessperson, researcher, or scientist. You won't need any preparation, but there will be code available for programmers, experienced and aspiring alike, to use and modify. You will walk away from this page with a better understanding of what your data does and doesn't need, and how to use it most effectively.

# Part 1: Information vs Data

Webster's online dictionary defines data as factual measurements or statistics used as a basis for actions such as reasoning, discussion, or calculation. We want to pay specific attention to the word "basis" here: data is only a starting point, and it can be manipulated quite easily. If we don't want to be susceptible to this manipulation, we should make sure we turn that data into information. Information's entry is "knowledge obtained from investigation, study, or instruction," and that is really what's at play when we're making smart, data-driven decisions.

## Fallacies and Manipulation

Cite: https://www.datasciencecentral.com/profiles/blogs/data-fallacies-to-avoid-an-illustrated-collection-of-mistakes

The first step in being a conscious member of the data community is to identify and refute bad data appropriately. There are many resources on the many, varied ways to manipulate data, but we'll look at only the most common and most relevant ones here.

#### Confirmation Bias

The biggest consequence, good or bad, of the information age is that data is available at the click of a mouse or the sound of a wake word in almost everybody's home. It is certainly good for us to have access to more data, but it means that we now have to be aware of confirmation bias as we interpret it. Even if you're looking for the answer to a question that you don't have an opinion on, slight differences in phrasing will affect the content of your results. Take coffee and acne, for example: a simple Google search returns very different results depending on whether you ask about a cause or a cure:
 
 Cure            |  Cause
:-------------------------:|:-------------------------:
![](https://i.imgur.com/z1FkMNB.png)  |  ![](https://i.imgur.com/qIPZvc2.png)

We see that even when you're not trying to put your bias into the search, your results can be skewed. This can be mitigated by looking at aggregates of data instead of single sources. Relying on single sources is what makes way for "pop science," where one study finds that X causes or cures Y, and both X and Y are common conditions. The next sections show how these unfortunate statistical anomalies become daytime news stories.


#### Data Dredging

Data dredging is what happens when, intentionally or not, a single dataset offers too many variables or opportunities for correlation. A famous example comes from the website FiveThirtyEight, which reported on the field of nutrition, where the available data already has problems. They noted in particular how nutrition studies generate "huge data sets with many, many variables," which are the most susceptible to data dredging. [The article](https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/) reports a few particularly remarkable relationships found in their respondent data:

![](https://i.imgur.com/CoKDCuc.png)

You'll (hopefully) not be surprised to know that there is no real-life correlation between cabbage and innie bellybuttons or tomatoes and Judaism. These results are a product, as the article explains, of running over 27,000 regressions on fewer than 60 complete responses, where a 5% false positive rate is expected. Why 5%? Well, that leads into the next topic, p-hacking.
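To get a feel for why so many tests on so few responses produce spurious links, here is a minimal simulation sketch. It is not the article's method: the 50 respondents, the 2,000 tests, the random seed, and the function names are all assumptions chosen for illustration, and the p-value uses the Fisher z approximation rather than a full regression. Every pair of variables is pure noise, so any "significant" result is a false positive. At a 5% rate, 27,000 such tests would be expected to yield roughly 1,350 of them.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_respondents=50, n_tests=2000, alpha=0.05, seed=0):
    """Test many pairs of unrelated random variables; count 'significant' ones."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Hypothetical survey answers: two columns with no real relationship.
        xs = [rng.gauss(0, 1) for _ in range(n_respondents)]
        ys = [rng.gauss(0, 1) for _ in range(n_respondents)]
        if approx_p_value(pearson_r(xs, ys), n_respondents) < alpha:
            hits += 1
    return hits


hits = count_false_positives()
print(f"{hits} of 2000 unrelated pairs look 'significant' at p < 0.05")
```

The count should land near 5% of the tests, even though nothing in the data is related to anything else.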


#### P-Hacking and Cherry Picking

You might have noticed that the far-right column in the above picture has a "p-value" label. This is the value that statisticians use to gauge the risk of a false positive: roughly, the probability of seeing a result at least this strong if there were no real effect. In general, a study with a p-value of less than 0.05, or 5%, is considered publishable. That bar is hard to clear in some situations, but in others it is absurdly easy. Linked in the above article is an [interactive piece](https://projects.fivethirtyeight.com/p-hacking/) by FiveThirtyEight that lets you try p-hacking for yourself. This one comes with the bonus of letting you lean into confirmation bias as well, as it contains economic and political data from the last ~70 years. You get to cherry-pick the types of political data to use and the ways to measure a "good" economy, and the right panel shows how significant your results are. It should go without saying that if you cherry-pick enough, you can "prove" any conclusion you want.

This phenomenon can occur at any stage in the transfer of information, from data sources to scientists to reporters to consumers, and again it can be intentional or unintentional. We see this taken to a comedic level with XKCD's exploration of jelly beans:

![](https://i.imgur.com/k2dDf9e.png)
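The arithmetic behind the joke can be sketched in a few lines. This assumes the tests are independent and each uses the 0.05 threshold, with 20 tests standing in for the comic's jelly bean colors. The function name is made up for this illustration.

```python
def chance_of_spurious_hit(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20, 100):
    print(f"{n:>3} tests -> {chance_of_spurious_hit(n):.1%} chance of a spurious hit")
```

With 20 tests, the chance of at least one "significant" color is about 64%, so a headline-worthy jelly bean is more likely than not.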


What you should take away here is that manipulated data is not meaningful, and I argue it is not even information, as it does not represent knowledge in any real way. Let's not be too cynical, though: there _is_ meaningful information in the world, and we want to separate it from this darker part of the data sphere.





## Meaningful Information

Here we have information, not misleading or malicious, and want to determine what to do with it. So then, what makes your information worth producing and using?

#### "Useful" Information

When we say "useful", what we want is something that is SMART. SMART is an acronym generally used for goal-setting, where it stands for:

- S: Specific
- M: Measurable
- A: Attainable
- R: Relevant
- T: Timely

Cite: https://www.mindtools.com/pages/article/smart-goals.htm

These are the things that you want in goal setting for sure, but we also want them in information, and it's no coincidence (p < 0.05). We want information that supports the goals you have when using it. Otherwise, information is as good as trivia, which is not necessarily bad, but is useless by definition. Very briefly, point by point:

- Specific: this information should be scoped to answer the question(s) you have, as information that is too broad risks being less applicable to the situation
- Measurable: information that comes with a number attached tells you how strongly to act on it; this is the subject of the next part
- Attainable: closely related to specificity, as information that is too narrow might be too difficult to build a solution for, or not worth the development time
- Relevant: if your information is not about your subject, it won't be a good basis for solutions
- Timely: while data being old doesn't mean that it's wrong, it makes the connection to today weaker and harder to build for. Likewise, a projection too far into the future might not be appropriate for your current project or plans.


#### "Strong" Information

While this sounds like a dataset for powerlifting competitions, it actually reflects the next stage in data-driven decision making that we need to address: what do we do with the data?

Consider the following information:
> Men 18-25 years old in the last 6 months were more likely to purchase jeans than men aged 25-40. 

What does a company that produces jeans do with this information? They would probably increase brand marketing to men who are 18-25, but by how much? The information makes a distinction (more likely vs. less likely), not a measurement (X percent more likely). If the information were instead:
> Men 18-25 years old in the last 6 months were 10 times more likely to purchase jeans than men aged 25-40.

Ad agencies would be jumping from their seats to start hiring male actors between 18 and 25! This touches on how useful information really is, and, as a bonus, is generally a good way to tell how good the question being asked in a survey was.
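The difference between a distinction and a measurement is easy to see in code. The sketch below uses made-up survey counts (300 of 1,000 younger men and 30 of 1,000 older men buying jeans) purely for illustration, and the function name is hypothetical.

```python
def relative_likelihood(buyers_a, size_a, buyers_b, size_b):
    """How many times more likely group A is to purchase than group B."""
    rate_a = buyers_a / size_a
    rate_b = buyers_b / size_b
    return rate_a / rate_b


# Hypothetical counts, not real survey data.
ratio = relative_likelihood(buyers_a=300, size_a=1000, buyers_b=30, size_b=1000)

print("Distinction:", "more likely" if ratio > 1 else "not more likely")
print(f"Measurement: {ratio:.1f} times more likely")
```

The first line is all the weaker statement gives you. The second is what lets a marketing team decide how much budget to move.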



## Datasets

On our journey to data-lightenment, we do need to start somewhere. I've enjoyed sharing comics and other cherry-picked factoids for you all, but let's now practice what we preach.

#### Covid-19 data

The ongoing pandemic has had an enormous impact on the world, as I'm sure you have already noticed. That impact is measurable, both on the human body and on our human society, and many high-stakes decisions depend on understanding this data. The sources online are varied, from online repositories like [Kaggle](https://www.kaggle.com/datasets) and [Google Research](https://datasetsearch.research.google.com/) to official sources like [data.gov](http://www.data.gov) and the [CDC](https://www.cdc.gov/datastatistics/index.html).

There are many data visualizations and interactives, as well as experts scrutinizing and summarizing this data for us all, thankfully. If you are looking for a kit project or want to practice with some visualization tools, however, this might be a good place to start working.

One of these that is particularly well made is the [ArcGIS dashboard](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6), which aggregates data and allows you to see the impact on your local county as well as the national and international trends. It answers the questions it sets out to address without leaving room for p-hacking, since it does not allow inappropriate filtering or manipulation. You can see the trendline for every state, as well as for any country, by selecting it in the list on the left. See here:

![Covid](https://i.imgur.com/KGWmlY8.jpg)



#### Census data

The founding fathers of America were many things, but statisticians they were not. They did not have a particular fascination with regression, machine learning models, or gradient descent, and we shall shame them for that. However, they did appreciate the value of good data, err, data at least. From Article 1, Section 2 of the U.S. Constitution:

> Representatives and direct Taxes shall be apportioned among the several States which may be included within this Union, according to their respective Numbers, which shall be determined by adding to the whole Number of free Persons, including those bound to Service for a Term of Years, and excluding Indians not taxed, three fifths of all other Persons. The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years, in such Manner as they shall by Law direct.

<!-- Cite: https://www.census.gov/history/pdf/Article_1_Section_2.pdf -->

<!-- I don't want to make historical judgements on the moral quality of this data, but the statistical quality of the data is prevalent here. We see that there is 10-year periodic data-gathering, a full count of "free Persons", and a multiple on "all other Persons." This means that the ratification of the 14th amendment (that rescinded the three-fifths compromise) in 1868 should be measurable in the differences in the 1860 and 1870 censuses.

I DID NOT SEE THE CORRELATION I SUPPOSED HERE... I WILL INVESTIGATE FURTHER, BUT THIS SEEMS LIKE AN INTERESTING FOLLOW UP ARTICLE IF THE MYSTERY IS RESOLVED IN AN INTERESTING WAY-->

What all of that text means is that every 10 years, in the years ending in 0, there is a nationwide tally of the people who live in the country. There is some historical context here as well, of course, including the exclusion of "Indians not taxed" (reservations are not _really_ part of the U.S. in lots of ways) and the three-fifths compromise, which was removed by the 14th Amendment.

So what do we do with this? Well, there has been a regular tally of people living in the U.S. since its earliest years, and that data is aggregated and made available to the public. The individual entries are removed to protect people's privacy, but the country is big enough that the aggregates can often be quite granular. A nice repository of information is [available online](https://www.census.gov/population/www/censusdata/hiscendata.html) and is used every day for things like redistricting, allocating federal funds, and business decisions. The last of these is probably most applicable to us here today, as the types of business decisions we face shape the questions we want to answer with this information, turning knowledge into conclusions.
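As a small illustration of turning census aggregates into a measurement, the sketch below computes decade-over-decade growth. The three totals are the published national resident population counts for 2000, 2010, and 2020. Verify them against the repository linked above before reusing them. The function name is made up for this example.

```python
# National resident population counts from the 2000, 2010, and 2020 censuses.
census_counts = {
    2000: 281_421_906,
    2010: 308_745_538,
    2020: 331_449_281,
}


def decade_growth(counts):
    """Fractional growth between each pair of consecutive census years."""
    years = sorted(counts)
    return {
        later: (counts[later] - counts[earlier]) / counts[earlier]
        for earlier, later in zip(years, years[1:])
    }


growth = decade_growth(census_counts)
for year, rate in growth.items():
    print(f"{year - 10}-{year}: {rate:.1%} growth")
```

A business planning for the next decade gets more from "growth slowed from one decade to the next, by this much" than from "the population went up."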


