# Statistics

**Statistics** is the science of changing your mind under uncertainty. 

## Statistics

What’s a **statistic**? It’s any old way of mushing up our data. Yup. 100% technically correct definition. 

Making decisions based on facts (**parameters**) is hard enough as it is, but sometimes we don’t even have the facts we need. Instead, what we know (our **sample**) is different from what we wish we knew (our **population**). That’s what it means to have uncertainty.

A statistical approach only makes sense when there’s a mismatch between the information you want and the information you have.

You opted for statistics because 

- (1) your decision is important — otherwise you’d prefer data-mining for a faster path to inspiration — and 
- (2) the data you have doesn’t cover all the entities you’re interested in, so you’re trying to make an Icarus-like leap from your sample to your population. If you can’t even specify where you’re leaping, expect a big splat! 

## Population, Sample, Observation, Statistic, Parameter, Estimate, Estimator 

In statistics, a **population** is the collection of all items that you are interested in (for the purpose of making a decision rigorously).

By writing down a description of your population, you’re agreeing that only the population, the whole population, and nothing but the population is interesting for your decision. If coming up with your population of interest sounds daunting, remember that it’s up to you to pick what you want to be interested in. There’s no incorrect choice, as long as it’s specific and thorough.

In a real project, the population description involves plenty of fine print. Alas, decision-makers don’t always realize that thinking deeply about this is their job.

A **sample** is any collection of items from the population. The sample is the data you have and the population is the data you wish you had.

An **observation** is a measurement from one single item in a sample.

A **statistic** is any way of mushing up sample data (e.g., mean, median, sum, min, max).

A **parameter** summarizes the population for you. If we knew the parameter, we’d be home right now. It’s the fact that we’re looking for, but unfortunately facts are not always available. Since we cannot compute the parameter, we can only make a best guess about it using a **statistic**.

An **estimate** is just a fancy word for best guess about the true value of a parameter. It’s the value your guess takes, while an **estimator** is the formula you use for arriving at that number.

If you have all the information about your population, you can finish up by using analytics: just go ahead and calculate the average. Then the **statistic** is the **parameter** because your **sample** is the **population**. You are dealing with pure facts. Thanks to having perfect and complete data, no complicated calculating is required.
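To make the vocabulary concrete, here is a minimal sketch using only Python's standard library. The population of get-ready times, its size, and the sample size of 50 are all made up for illustration; the point is only to show which object plays which role.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: get-ready times (minutes) for 10,000 mornings.
population = [random.gauss(15, 3) for _ in range(10_000)]

# The parameter is a fact about the whole population.
parameter = statistics.fmean(population)

# The sample is the part of the population we actually get to see.
sample = random.sample(population, 50)

# The estimator is the recipe; the estimate is the number it produces.
estimator = statistics.fmean
estimate = estimator(sample)

print(f"parameter (usually unknown): {parameter:.2f}")
print(f"estimate from 50 items:      {estimate:.2f}")

# With complete data the sample is the population,
# so the statistic is the parameter.
print(estimator(population) == parameter)
```

In a real project you would not be able to compute `parameter` at all; it appears here only because the population is simulated.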

When we don’t have facts, all we can hope for is combining data with assumptions to make reasonable decisions.

### Advice for those who work with decision-makers

If you see a vague population description, set up a picket line until the decision-maker does their homework. The project isn’t ripe for fancy calculations yet.

When decision-makers don’t realize that thinking deeply is their job, remind them. This goes beyond population definition. There are a lot of tasks the decision-maker has to complete before your math can be useful. Spending all weekend rigorously chasing down some half-baked question a decision-maker drops on your desk is a well-known rookie mistake, but I see so many junior data scientists falling for it repeatedly.

### Advice for decision-makers

Ask your buddy from Legal to help you out — they’re probably better at thinking through your population definition than you are. Law school might not call it statistical thinking, but it teaches this bit better than a stats PhD program does.

For the DIY version, rely on your inner lawyer: next time you’re defining a population, ask yourself, “Is it airtight? Would a lawyer put their stamp of approval on this… or should I go think about it a little harder?”

## Bayesian vs. Frequentist Statistics

**Statistics** is the science of changing your mind under uncertainty. What might your mind be set to? A default action or a prior belief.

**Bayesian statistics** is the school of thought that deals with incorporating data to update your beliefs. Bayesians like to report results using `credible intervals` (two numbers which are interpreted as, “I believe the answer lives between here and here”).

**Frequentist statistics** deals with changing your mind about actions. You don’t need to have a belief to have a `default action`, it’s simply what you’re committed to doing if you don’t analyze any data. Frequentist (a.k.a. classical) statistics is the one you’re more likely to encounter in the wild and in your STAT101 class, so let’s keep it classical for the rest of this article.

## Hypothesis Testing

A hypothesis is a description about how the universe might look, but it doesn’t have to be true. We’ll be figuring out whether our sample makes our hypothesis look ridiculous to determine whether we should change our minds.

The **null hypothesis** describes all worlds where doing the `default action` is a happy choice; the **alternative hypothesis** is all other worlds. If I convince you - with data! - that you don’t live in the null hypothesis world, then you had better change your mind and take the alternative action.

For example: “We can walk to class together (default action) if you usually take under 15 minutes to get ready (null hypothesis), but if the evidence (data) suggests it’s longer (alternative hypothesis), you can walk by yourself because I’m outta here (alternative action).”

**Hypothesis testing** is all about asking: `does our evidence make the null hypothesis look ridiculous?` **Rejecting the null hypothesis** means we learned something and we should change our minds. Not rejecting the null means we learned nothing interesting, just like going for a hike in the woods and seeing no humans doesn’t prove that there are no humans on the planet. It just means we didn’t learn anything interesting about humans existing. Does it make you sad to learn nothing? It shouldn’t, because you have a lovely insurance policy: you know exactly what action to take. If you learned nothing, you have no reason to change your mind, so keep doing the default action.

So how do we know if we learned something interesting, something out of line with the world in which we want to keep doing our default action? To get the answer, we can look at a p-value or a confidence interval.

The **p-value** says, “If I’m living in a world where I should be taking that default action, how unsurprising is my evidence?” The lower the p-value, the more the data are yelling, “Whoa, that’s surprising, maybe you should change your mind!”

To perform the test, compare that p-value with a threshold called the **significance level**. This is a knob you use to control how much risk you want to tolerate. It’s your maximum probability of stupidly leaving your cozy comfy default action. If you set the significance level to 0, that means you refuse to make the mistake of leaving your default incorrectly. Pens down! Don’t analyze any data, just take your default action. (But that means you might end up stupidly NOT leaving a bad default action.)
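As one possible way to run the walk-to-class test, the sketch below uses an exact one-sided sign test. It rests on assumptions that are not in the text above: "usually under 15 minutes" is read as "on at most half of the mornings you take longer than 15 minutes", mornings are treated as independent, and the ten get-ready times are invented. The function name `sign_test_p_value` is hypothetical.

```python
from math import comb

def sign_test_p_value(times, threshold):
    """One-sided exact sign test.

    Returns P(at least this many values above threshold) in a null world
    where each value lands above or below the threshold with probability 0.5.
    """
    informative = [t for t in times if t != threshold]  # ties carry no information
    n = len(informative)
    k = sum(t > threshold for t in informative)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# Hypothetical get-ready times in minutes for ten mornings.
times = [14, 18, 17, 21, 16, 19, 13, 22, 20, 17]
significance_level = 0.05  # chosen before looking at the data

p_value = sign_test_p_value(times, threshold=15)
reject_null = p_value <= significance_level

print(f"p-value = {p_value:.4f}")
print("change your mind" if reject_null else "keep the default action")
```

Eight of the ten mornings are over 15 minutes, which feels damning, yet the p-value is 56/1024 (about 0.055), just above the 0.05 threshold. With this little data the test says: learn nothing, keep walking together.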

<img src="images/pvalue.png" alt="P-value" style="width: 800px;"/>

From: [Statistics for people in a hurry](https://towardsdatascience.com/statistics-for-people-in-a-hurry-a9613c0ed0b)

## Confidence Intervals

A **confidence interval** is simply a way to report your hypothesis test results. To use it, check whether it overlaps with your null hypothesis. If it does overlap, learn nothing. If it doesn’t, change your mind.

Only change your mind if the confidence interval doesn’t overlap with your null hypothesis.

While a confidence interval’s technical meaning is a little bit weird, it also has two useful properties which analysts find helpful in describing their data: 

- (1) the best guess is always in there and 
- (2) it’s narrower when there’s more data. 

Beware that both it and the p-value weren’t designed to be nice to talk about, so don’t expect pithy definitions. They’re just ways to summarize test results. (If you took a class and found the definitions impossible to remember, that’s why. On behalf of statistics: it’s not you, it’s me.)
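The two properties of a confidence interval mentioned above can be checked with a small sketch. It uses a normal-approximation interval for the mean (estimate plus or minus z times the standard error), which is only one of several ways to build such an interval and is rough for small samples. The data are simulated, and the helper name `mean_confidence_interval` is hypothetical.

```python
import random
import statistics
from math import sqrt

def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation interval for the mean: estimate +/- z * s / sqrt(n)."""
    n = len(data)
    estimate = statistics.fmean(data)
    standard_error = statistics.stdev(data) / sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * standard_error, estimate + z * standard_error

random.seed(7)
# Hypothetical get-ready times (minutes): a small study and a large one.
small = [random.gauss(17, 3) for _ in range(20)]
large = [random.gauss(17, 3) for _ in range(2000)]

for name, data in [("n=20", small), ("n=2000", large)]:
    low, high = mean_confidence_interval(data)
    print(f"{name}: [{low:.2f}, {high:.2f}], width {high - low:.2f}")
    # The null here is "the mean is 15 minutes or less".
    print("  overlaps the null?", low <= 15)
```

The sample mean always sits in the middle of this interval, and the interval from 2,000 observations is much narrower than the one from 20.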

What’s the point? If you do your testing the way I just described, the math guarantees that your risk of wrongly leaving your default action is capped at the significance level you chose (which is why it’s important that you, ahem, choose it... the math is there to guarantee you the risk settings you picked, which is kind of pointless if you don’t bother to pick ’em).

The math is all about building a toy model of the null hypothesis universe. That’s how you get the p-value.
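A toy model of the null universe can be taken quite literally: simulate it many times and count how often it produces evidence at least as extreme as yours. The sketch below assumes a hypothetical study of 10 mornings, 8 of them late, and a boundary null world in which each morning is late with probability 0.5. All names and numbers are illustrative.

```python
import random

random.seed(1)

n_mornings = 10        # hypothetical sample size
observed_late = 8      # hypothetical count of mornings over 15 minutes
n_worlds = 20_000      # how many toy null worlds to simulate

def late_mornings_in_null_world():
    """One simulated sample from the boundary null world (50% late mornings)."""
    return sum(random.random() < 0.5 for _ in range(n_mornings))

# Count the null worlds whose evidence is at least as extreme as ours.
as_extreme = sum(
    late_mornings_in_null_world() >= observed_late for _ in range(n_worlds)
)
simulated_p_value = as_extreme / n_worlds

print(f"simulated p-value: {simulated_p_value:.3f}")
```

For this setup the exact binomial answer is 56/1024, about 0.055, and the simulated value should land close to it.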

The p-value and confidence interval are ways to summarize all that for you so you don’t need to squint at a long-winded description of a universe. They’re the endgame: use them to see whether or not to leave your default action. Job done!

## Power Analysis

Hang on, did we do our homework to make sure that we actually collected enough evidence to give ourselves a fair shot at changing our minds? That’s what the concept of **power** measures. It’s really easy not to find any mind-changing evidence... just don’t go looking for it. The more power you have, the more opportunity you’ve given yourself to change your mind if that’s the right thing to do. **Power** is the probability of correctly leaving your default action.

When we learn nothing and keep doing what we’re doing, we can feel better about our process if it happened with lots of power. At least we did our homework. If we had barely any power at all, we pretty much knew we weren’t going to change our minds. May as well not bother analyzing data.

**Power analysis** is a way to check how much power you expect for a given amount of data. Use it to plan your studies and to check that you budgeted for enough data before you begin.
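Here is a hedged sketch of a power analysis for a simple one-sided sign test. It assumes a null world in which a morning is late with probability 0.5, and a hypothetical alternative world in which the late rate is 0.7. The function names `upper_tail`, `critical_count`, and `power` are made up for this example.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_count(n, alpha):
    """Smallest count of late mornings that rejects the null at level alpha."""
    for k in range(n + 2):  # k = n + 1 means "never reject"
        if upper_tail(n, k, 0.5) <= alpha:
            return k

def power(n, alpha, p_alternative):
    """Probability of rejecting the null when the true late rate is p_alternative."""
    return upper_tail(n, critical_count(n, alpha), p_alternative)

# How much power does each data budget buy us?
for n in (10, 30, 100):
    print(f"n={n:>3}: power = {power(n, alpha=0.05, p_alternative=0.7):.2f}")
```

With only 10 mornings the test has little chance of changing your mind even when it should; with 100 it almost always does. The assumed alternative rate of 0.7 matters a lot, so in practice it is worth trying a few plausible values.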

Uncertainty means you can come to the wrong conclusion, even if you have the best math in the world.

## Errors

What is statistics not? Magical magic that makes certainty out of uncertainty. There’s no magic that can do that; you can still make mistakes. Speaking of mistakes, here are two mistakes you can make in **Frequentist statistics**. (Bayesians don’t make mistakes. Kidding! Well, sort of.)

**Type I error** is foolishly leaving your default action. Hey, you said you were comfortable with that default action and now thanks to all your math you left it. Ouch! **Type II error** is foolishly not leaving your default action. (We statisticians are so creative at naming stuff. Guess which mistake is worse. Type I? Yup. So creative.)

- Type I error is changing your mind when you shouldn’t.
- Type II error is NOT changing your mind when you should.

`Type I error is like convicting an innocent person and Type II error is like failing to convict a guilty person`. These two error probabilities are in balance (making it easier to convict a guilty person also makes it easier to convict an innocent person), unless you get more evidence (data!), in which case both errors become less likely and everything becomes better. That’s why statisticians want you to have more, more, MOAR data! Everything becomes better when you have more data.

More data means more protection against coming to the wrong conclusion.
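The balance between the two errors, and the way more data improves both, can be illustrated with exact binomial arithmetic for a one-sided sign test. The setup is an assumption made for this sketch: the null world has a 0.5 chance of a late morning, the alternative world has 0.7, and the helper names are hypothetical.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def error_rates(n, alpha, p_alternative):
    """(Type I, Type II) error probabilities of a one-sided sign test."""
    # Smallest rejection threshold whose false-alarm rate stays within alpha.
    k = next(k for k in range(n + 2) if upper_tail(n, k, 0.5) <= alpha)
    type_1 = upper_tail(n, k, 0.5)                # leaving the default when we shouldn't
    type_2 = 1 - upper_tail(n, k, p_alternative)  # staying when we should leave
    return type_1, type_2

for n, alpha in [(20, 0.01), (20, 0.20), (80, 0.05)]:
    t1, t2 = error_rates(n, alpha, p_alternative=0.7)
    print(f"n={n}, alpha={alpha:.2f}: Type I = {t1:.3f}, Type II = {t2:.3f}")
```

At a fixed sample size, loosening the significance level trades Type I risk for Type II risk. Moving from 20 to 80 observations lets you have a lower rate of both errors than the loose 20-observation setting.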

What’s **multiple comparisons correction**? You’ve got to do your testing in a different, adjusted way if you know you plan to ask multiple questions of the same dataset. If you keep putting innocent suspects on trial over and over again (if you keep fishing in your data) eventually something is going to look guilty by random accident. The term **statistically significant** doesn’t mean something important happened in the eyes of the universe. It simply means we changed our minds. Perhaps incorrectly. Curse that uncertainty!
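To see why fishing is dangerous, and what one adjustment can look like, here is a small sketch. It uses the Bonferroni correction, which is just one of several possible corrections and is not named in the text above. The five p-values are invented and the tests are assumed to be independent.

```python
alpha = 0.05

# Chance of at least one false alarm across m independent tests of true nulls.
for m in (1, 5, 20):
    print(f"{m:>2} tests: {1 - (1 - alpha) ** m:.2f}")

def bonferroni_reject(p_values, alpha=0.05):
    """Reject a null only if its p-value clears the stricter bar alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values from five questions asked of the same dataset.
p_values = [0.003, 0.020, 0.040, 0.300, 0.700]
print(bonferroni_reject(p_values))
```

Without any correction, three of these five p-values would count as significant at 0.05. With the Bonferroni bar of 0.05 / 5 = 0.01, only the first one does.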

What’s a **Type III error**? It’s kind of a statistics joke: it refers to correctly rejecting the wrong null hypothesis. In other words, using all the right math to answer the wrong question. Don’t waste your time rigorously answering the wrong question. Apply statistics intelligently (and only where needed).

A cure for asking and answering the wrong question can be found in Decision Intelligence Engineering, the new discipline that looks at applying data science to solving business problems and making decisions well. By mastering decision intelligence, you’ll build up your immunity to Type III error and useless analytics.

## Summary

In summary, statistics is the science of changing your mind. There are two schools of thought. The more popular one - Frequentist statistics - is all about checking whether you should leave your default action. Bayesian statistics is all about having a prior opinion and updating that opinion with data. If your mind is truly blank before you begin, look at your data and just go with your gut.

It’s a lie that you always need statistics; you don’t. If you’re just trying to make a best guess to get inspired, analytics is the best option for you. Shrug off those p-values, you don’t need the unnecessary stress.

Instead, you can choose to live by these principles: More (relevant) data is better and your intuition is pretty good for making best guesses, but not for knowing how good those guesses are... so stay humble.

## TODO:

- !!! http://greenteapress.com/thinkstats/
- !!! Computational statistics with Python: https://people.duke.edu/~ccc14/sta-663/

- Statistical-Inference-for-Everyone: https://github.com/bblais/Statistical-Inference-for-Everyone
- Bootstrap Confidence Intervals and Permutation Hypothesis Testing: https://codingdisciple.com/bootstrap-hypothesis-testing.html
- The Statistical Bootstrap and Other Resampling Methods: https://www.burns-stat.com/documents/tutorials/the-statistical-bootstrap-and-other-resampling-methods-2/

- Introduction to statistics: https://github.com/rouseguy/intro2stats
- Statistical data analysis in Python: https://www.youtube.com/watch?v=DXPwSiRTxYY&feature=youtu.be

- https://github.com/HeinrichHartmann/Statistics-for-Engineers
- Visualizing distributions: http://seaborn.pydata.org/tutorial/distributions.html
- Plotly: Basic Statistics in Python: https://plot.ly/python/basic-statistics/#visualize-the-statistics

- regression analysis using the StatsModels package with Quandl: http://www.turingfinance.com/regression-analysis-using-python-statsmodels-and-quandl/

- linear regression tutorial: http://connor-johnson.com/2014/02/18/linear-regression-with-python/

- ridge and lasso regression in Python: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

- logistic regression: https://nbviewer.jupyter.org/github/tfolkman/learningwithdata/blob/master/Logistic%20Gradient%20Descent.ipynb

- 6 Easy Steps to Learn Naive Bayes Algorithm (with codes in Python and R): https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/

# Computational Statistics

- Computational Statistics in Python: 
    - http://people.duke.edu/~ccc14/cspy/index.html#
    - http://people.duke.edu/~ccc14/sta-663-2017/
- http://www.mpia.de/~calj/compstat_ss2015/main.html
- https://www.amazon.com/Computational-Statistics-Computing-James-Gentle/dp/0387981438/
- https://github.com/AllenDowney/CompStats
- https://github.com/cliburn/Computational-statistics-with-Python
- https://www.youtube.com/watch?v=VR52vSbHBAk&feature=youtu.be
- http://ptweir.github.io/pyresampling/

- see ebooks in downloads

- Probabilistic Programming & Bayesian Methods for Hackers: http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/

## Hypothesis Testing

Hypothesis tests are statistical tests that are used to determine whether there is enough evidence in a sample of data to infer that a particular condition is true for the entire population.

The two central concepts of these tests are the null hypothesis and the alternative hypothesis; the p-value is also fundamental. These ideas are hard to understand when you’re new to the field, and it takes some effort to grasp the alpha value (the significance level you compare your p-value against) and the difference between rejecting and failing to reject the null hypothesis.

- https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
- http://scipy-lectures.org/packages/statistics/index.html#hypothesis-testing-comparing-two-groups
- http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-24.html

- https://github.com/bblais/Statistical-Inference-for-Everyone
- https://www.inferentialthinking.com/chapters/01/1/intro.html

Probability distributions:
- https://bigdata-madesimple.com/how-to-implement-these-5-powerful-probability-distributions-in-python/
- https://www.datacamp.com/community/tutorials/probability-distributions-python
- http://www.dannowitz.co/blog/2015/10/26/distribution-fitting-defining-the-underlying-truth

## Statistical Modeling And Fitting in Python

Now that you've gotten the hang of hypothesis testing and distributions, you can review or go deeper into how to build statistical models and fit distributions to data.

Statistical models approximate whatever generates your data and can be used in data analysis to summarize data, to predict, and to simulate. In other words, a model is a representation of the complex phenomena that generated the data.

This, however, entails that you also need to be able to find out whether your data fits that model.

To get the best fit between the model and the data, you can use estimation. Estimation is concerned with making inferences about a population based on information obtained from a sample. Alongside hypothesis testing, it’s a way of learning something about the population from the sample.
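As a small illustration of fitting a model by estimation, the sketch below fits a normal distribution to simulated data using maximum likelihood and then uses the fitted model to predict and to simulate. The data, the choice of a normal model, and the 20-minute cutoff are all assumptions made for the example.

```python
import random
import statistics

random.seed(3)

# Hypothetical data: 5,000 get-ready times drawn from a known "truth".
data = [random.gauss(15, 3) for _ in range(5000)]

# For a normal model, the maximum-likelihood estimates are the sample mean
# and the population-style standard deviation (divide by n, not n - 1).
mu_hat = statistics.fmean(data)
sigma_hat = statistics.pstdev(data, mu=mu_hat)

fitted = statistics.NormalDist(mu_hat, sigma_hat)
print(f"fitted model: Normal(mu={mu_hat:.2f}, sigma={sigma_hat:.2f})")

# Use the fitted model to predict and to simulate.
print(f"P(time > 20 minutes) = {1 - fitted.cdf(20):.3f}")
print("three simulated mornings:", [round(x, 1) for x in fitted.samples(3, seed=0)])
```

Because the data were generated from Normal(15, 3), the estimates should land near those values. With real data you would also want to check whether a normal model is a sensible choice at all.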

- https://github.com/fonnesbeck/statistical-analysis-python-tutorial
- https://www.youtube.com/watch?v=DXPwSiRTxYY&feature=youtu.be
- https://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/parameter_estimation_techniques/maximum_likelihood_estimate.ipynb
- https://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/parameter_estimation_techniques/max_likelihood_est_distributions.ipynb

- https://machinelearningmastery.com/how-to-code-the-students-t-test-from-scratch-in-python/
- https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/

- normality test: https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/

## Bayesian Statistics

Bayesian statistics is a theory that expresses the evidence about the true state of the world in terms of degrees of belief known as Bayesian probabilities. Sometimes, you will want to take a Bayesian approach to data science problems.
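A minimal sketch of updating a degree of belief with data, using only the standard library: a Beta prior on the share of late mornings, a conjugate update, and an approximate 95% credible interval taken from posterior draws. The uniform prior and the counts (8 late mornings out of 10) are assumptions chosen for illustration.

```python
import random

random.seed(11)

# Prior belief about the share of late mornings: Beta(1, 1), i.e. uniform.
prior_a, prior_b = 1, 1

# Hypothetical data: 8 late mornings out of 10.
late, on_time = 8, 2

# Conjugate update: the posterior is Beta(prior_a + late, prior_b + on_time).
post_a, post_b = prior_a + late, prior_b + on_time
posterior_mean = post_a / (post_a + post_b)

# Approximate a 95% credible interval from sorted posterior draws.
draws = sorted(random.betavariate(post_a, post_b) for _ in range(100_000))
low = draws[int(0.025 * len(draws))]
high = draws[int(0.975 * len(draws))]

print(f"posterior mean: {posterior_mean:.2f}")
print(f"95% credible interval: [{low:.2f}, {high:.2f}]")
```

The interval reads as "given this prior and these data, I believe the late rate lives between here and here". A different prior would give a different interval, which is the whole point of the Bayesian approach.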

- https://nbviewer.jupyter.org/github/tfolkman/learningwithdata/blob/master/Bayes_Primer.ipynb
- https://github.com/jakevdp/ESAC-stats-2014/tree/master/notebooks
- Tutorial: Bayesian Statistical Analysis in Python: https://github.com/fonnesbeck/scipy2014_tutorial
    - https://www.youtube.com/watch?v=vOBB_ycQ0RA&feature=youtu.be
- https://pyvideo.org/scipy-2014/pymc-markov-chain-monte-carlo-in-python.html
- https://www.quantstart.com/articles/Bayesian-Linear-Regression-Models-with-PyMC3
- http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/
- https://www.amazon.com/gp/product/1449370780/
- Probabilistic-Programming-and-Bayesian-Methods-for-Hackers: https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

## Markov Chains

Simply stated, Markov chains are mathematical systems that hop from one "state" to another. These states can be a situation or set of values. That means that you have a list of states available and, on top of that, a Markov chain tells you the probability of hopping, or "transitioning," from one state to any other state.
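The hopping described above can be sketched in a few lines. The two weather states and their transition probabilities are invented for the example, and `simulate` is a hypothetical helper.

```python
import random

random.seed(5)

# Hypothetical two-state weather chain; each row gives P(next state | current state).
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(start, steps):
    """Hop from state to state, using only the current state to pick the next."""
    path = [start]
    for _ in range(steps):
        options = transitions[path[-1]]
        next_state = random.choices(list(options), weights=list(options.values()))[0]
        path.append(next_state)
    return path

path = simulate("sunny", 10_000)
print(path[:8])
print(f"share of sunny days: {path.count('sunny') / len(path):.2f}")
```

For these particular probabilities the long-run share of sunny days works out to 0.4 / (0.2 + 0.4) = 2/3, and a long simulated path should come close to that.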

- https://www.youtube.com/watch?v=VR52vSbHBAk&feature=youtu.be
- https://people.duke.edu/~ccc14/sta-663/MCMC.html

Interesting series of videos and blog posts by Cassie Kozyrkov (Google):

- [Stat Thinking - 001 - What is statistics?](https://www.youtube.com/watch?v=OJt-k9h9pmk&list=PLRKtJ4IpxJpBxX2S9wXJUhB1_ha3ADFpF&index=1)
- [Statistics Savvy Self-Test](https://hackernoon.com/statistics-savvy-self-test-25c2ef4cf73f)
- [Incompetence, delegation, and population](https://hackernoon.com/incompetence-delegation-and-population-95ebeb9beb93)
- [Never start with a hypothesis](https://towardsdatascience.com/hypothesis-testing-decoded-for-movers-and-shakers-bfc2bc34da41)


- [The simplest explanation of machine learning you’ll ever read](https://hackernoon.com/the-simplest-explanation-of-machine-learning-youll-ever-read-bebc0700047c)
- [Machine learning — Is the emperor wearing clothes?](https://hackernoon.com/machine-learning-is-the-emperor-wearing-clothes-59933d12a3cc)
- [Advice for finding AI use cases](https://hackernoon.com/imagine-a-drunk-island-advice-for-finding-ai-use-cases-8d47495d4c3f)
- [Why businesses fail at machine learning](https://hackernoon.com/why-businesses-fail-at-machine-learning-fbff41c4d5db)
- [Explaining supervised learning to a kid (or your boss)](https://towardsdatascience.com/explaining-supervised-learning-to-a-kid-c2236f423e0f)
- [Top 10 roles in AI and data science](https://hackernoon.com/top-10-roles-for-your-data-science-team-e7f05d90d961)


- [Data-Driven? Think again](https://hackernoon.com/data-inspired-5c78db3999b2)
